Phonetically Balanced Bangla Speech Corpus: Conference Paper
All content following this page was uploaded by Shammur Absar Chowdhury on 04 June 2015.
S. M. Murtoza Habib¹, Firoj Alam¹, Rabia Sultana¹, Shammur Absar Chowdhury¹,², Mumit Khan¹,²
{murtoza, firojalam, ummi.ulab.bu}@gmail.com, {shammur, mumit}@bracu.ac.bd
¹ Center for Research on Bangla Language Processing, BRAC University
² Department of Computer Science and Engineering, BRAC University
Conference on Human Language Technology for Development, Alexandria, Egypt, 2-5 May 2011.
...Kominek and Black (2004) database. A summary and conclusion of the study are given in Section 7.

2 Literature review

It is well established that the success of speech research depends largely on speech corpora. Because speech corpora capture the variation found in real speech utterances, they allow these phenomena to be analyzed. Speech-related research such as phonetic research, acoustic modeling and intonation modeling can be drawn from a speech corpus. Research on speech synthesis shows great improvement in the quality and naturalness of synthesized speech when a speech corpus is used (Kominek and Black, 2004) (Gibbon, 1997). Likewise, the development of a speech recognition system depends largely on a speech corpus. The inspiration for this work came from (Kominek and Black, 2004) (Fisher et al., 1986) (Yoshida et al., 2002) (Patcharika et al., 2002) (Radová and Vopálka, 1999), where a significant amount of work has been done for different languages. LDC-IL⁴ claims that a phonetically balanced Bangla speech corpus is available. Besides that, CDAC⁵ has also developed speech corpora which are publicly available through the web. There is also a publicly available speech corpus for Bangladeshi Bangla, the CRBLP⁶ speech corpus (Firoj et al., 2010), which is a read speech corpus. The CRBLP speech corpus contains nine categories of speech but it is not phonetically balanced. The categories are Magazines, Novels, Legal documents (Child), History (Dhaka, Bangladesh, Language movement, 7th March), Blogs (interview), Novel (Rupaly Dip), Editorials (Prothom-alo) and Constitution of Bangladesh. However, to the best of our knowledge, there is no published account of a phonetically balanced Bangla speech corpus. In addition, due to the differences in writing style as well as phonetic structure between Indian and Bangladeshi Bangla, we decided to compile a phonetically balanced Bangladeshi Bangla speech corpus based on phone and biphone coverage, which will be the first of its kind for Bangladeshi Bangla.

⁴ http://www.ldcil.org/
⁵ http://www.kolkatacdac.in/html/txttospeeh/corpora/corpora_main/MainB.html
⁶ CRBLP: Center for Research on Bangla Language Processing

3 Development procedure

This section presents the text selection methodology for the phonetically balanced speech corpus. We measured phonetic coverage based on biphones in the phonetically transcribed database. In addition to phonetic coverage, we also tried to maintain prosodic and syntactic variation in the corpus for future work. Only sparse work has been done on the phonotactic constraints of Bangla, which is one of the obstacles to defining an optimized biphone list. Here, optimized means that biphones ruled out by phonotactic constraints are omitted from the list. We therefore used the whole list of biphones in this study. The system diagram of the whole development process is shown in Figure 1.

Text corpus → Split into sentences → Sentence list → Sentence length: 1-15 words → Trimmed sentence list → Text normalization → Normalized text → G2P system and pronunciation dictionary → Phonetically transcribed text → Applying greedy algorithm → Optimally selected text → Hand pruning → Phonetically balanced text

Figure 1: System diagram of phonetically balanced text selection

3.1 Text collection and normalization

Text selection from various domains is one of the most frequently used techniques for developing a speech corpus. However, it is also one of the most time-consuming phases, since a lot of manual work needs to be done, such as selecting different categories of text, proofreading and manual correction after text normalization. Therefore, some constraints were considered during text selection.

The text was collected from two sources: the Prothom-alo news corpus (Yeasir et al., 2006) and the CRBLP speech corpora (Firoj et al., 2010). The Prothom-alo news corpus has 39 categories of text, such as general news, national news, international news, sports, special feature, editorial, sub-editorial and so on. Table 1 shows the frequency analysis of the initial text corpus.

Corpus                      Sentences    Total tokens   Token types
Prothom-alo news corpus     2,514,085    31,158,189     529,620
CRBLP read speech corpus       10,896       106,303      17,797
Total                       2,524,981    31,264,492     547,417

Table 1: Frequency distribution of the corpus

Starting with our initial text corpus of ~31 million words and ~2.5 million sentences, we used a Python script to split the corpus into sentences based on punctuation marks such as ?, । and !. We observed that some sentences are too long, i.e. more than 15 words and in some cases more than 25 words. Kominek and Black (2004) reported, and our own recording experience confirms, that sentences longer than 15 words are difficult to read aloud without making a mistake. We therefore applied a length constraint and filtered out sentences that are not between 1 and 20 words long. Table 2 shows the frequency analysis of the corpus after filtering.

Corpus                      Sentences    Total tokens   Token types
Prothom-alo news corpus     2,014,032    21,177,137     487,158
CRBLP read speech corpus        9,130        68,306      13,270
Total                       2,023,162    21,245,443     500,428

Table 2: Frequency distribution of the corpus after filtering

The text contains a large number of non-standard words (NSWs) (Sproat et al., 2008), such as numbers, dates and phone numbers, which need to be normalized to their full forms. The text was normalized using the text normalization tool of Firoj et al. (2009). Some NSW tokens are ambiguous, for example between year and number, or between time and floating-point number: (i) the token ১৯৯৯ (1999) could be a year and at the same time could be a number, and (ii) the token ১২.৮০ (12.80) could be a floating-point number or a time. For these ambiguous tokens the accuracy of the tool is 87% (Firoj et al., 2009). The remaining 13% of errors were resolved in the final text selection procedure. On the other hand, the accuracy for non-ambiguous tokens is 100%, for example dates (e.g. ০২-০৬-২০০৬), ranges (e.g. ১০-১২), ratios (e.g. ১/২), Roman numerals (e.g. I, II), ordinal numbers (e.g. ১ম, ২য়, ৩য়) and so on.

Example token    Normalized form
১২১              একশত একুশ
১ম               প্রথম
০২৯৫৬৭৪৪৭        শূন্য দুই নয় পাঁচ ছয় সাত চার চার সাত

3.2 Phoneme set

Defining the phoneme set is important for a phonetically balanced corpus. We considered the biphone as the unit for balancing the corpus. The phoneme inventory used in this study is the one found in Firoj et al. (2008 a) and Firoj et al. (2008 b); it consists of 30 consonants and 35 vowels including diphthongs. Since a biphone is a combination of two phones and the Bangla phone inventory has 65 phones (Firoj et al., 2008 a) (Firoj et al., 2008 b), the total number of possible biphones is 65 x 65 = 4,225. However, not all of these biphones belong to the language, because of phonotactic constraints. Since no notable work has been done on the phonotactic constraints of Bangla and the topic is beyond our scope, we have not optimized the biphone list in this work.

3.3 Phonetic transcription

Phonetically transcribed text is needed to measure phonetic coverage. The text therefore has to be phonetized so that the distribution of the phonetic units can be analyzed. To perform phonetic transcription, each sentence is tokenized on the 'white space' character. Each word is then passed through the CRBLP pronunciation lexicon (2009) and a G2P (Grapheme to Phoneme) converter (Ayesha et al., 2006). The system first looks the word up in the lexicon; if the word is not in the lexicon, it is passed to the G2P system for phonetic transcription. The CRBLP pronunciation lexicon contains 93K entries and the accuracy of the G2P system is 89%.
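The lookup order described above (lexicon first, G2P as a fallback) can be illustrated with a minimal Python sketch. The lexicon entries, the `toy_g2p` stand-in and the phone symbols are hypothetical placeholders, not the CRBLP lexicon or the G2P converter. Biphones are taken here as adjacent phone pairs inside a word, which is an assumption: the text does not say whether pairs spanning word boundaries are counted.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> list of phone symbols (NOT the CRBLP lexicon).
LEXICON = {"ami": ["a", "m", "i"], "bhat": ["bh", "a", "t"]}


def toy_g2p(word):
    """Stand-in for a real G2P converter: emits one phone per character."""
    return list(word)


def transcribe(sentence, lexicon=LEXICON, g2p=toy_g2p):
    """Tokenize on white space; look each word up in the lexicon, else use G2P."""
    return [lexicon[w] if w in lexicon else g2p(w) for w in sentence.split()]


def biphones(phones):
    """Adjacent phone pairs of one word (assumption: no cross-word pairs)."""
    return list(zip(phones, phones[1:]))


def biphone_counts(sentences):
    """Count biphone occurrences over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        for word_phones in transcribe(sentence):
            counts.update(biphones(word_phones))
    return counts
```

With a real lexicon and G2P system plugged in, the resulting counts would give the kind of unique and total biphone figures used to measure coverage.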
So there could be errors in phonetic transcription due to the low accuracy rate of the G2P system, which is unavoidable. Manual correction of every word is not practical, so we decided that this problem would be solved in the "hand pruning" stage. We used IPA⁷ for phonetic transcription, since IPA has been standardized as an internationally accepted transcription system. A phonetic measurement was conducted after the text had been phonetically transcribed. The measurement of phones, biphones and triphones is shown in Table 3.

⁷ IPA: International Phonetic Alphabet

Pattern type   Unique     Total in the corpus
Phones         65         119,068,607 (~119 million)
Biphones       3,277      47,360,819 (~47 million)
Triphones      274,625    115,048,711 (~115 million)

Table 3: Phonetic measurement of the corpus

Though the corpus contains all the phones, it does not cover all the biphones. There could be several reasons:
1. Trimming the main sentence list to 20 words per sentence.
2. The frequency of these missing biphones is too low in the spoken space of this language.

3.4 Balanced Corpus Design

The greedy selection algorithm (Santen and Buchsbaum, 1997) has been used in many corpus design studies. It is an optimization technique for constructing a subset of sentences from a large set of sentences so as to cover the largest unit space with the smallest number of sentences. Prior to the selection process, the target space, i.e. the biphone, is defined by the unit definition, mainly the feature space of a unit. The algorithm is detailed below:

Algorithm
Step 1: Generate a unique biphone list from the corpus.
Step 2: Calculate the frequency of each biphone in the list from the corpus.
Step 3: Calculate the weight of each biphone in the list, where the weight of a biphone is the inverse of its frequency.
Step 4: Calculate a score for every sentence. The sentence score is defined by Equation 1.
Step 5: Select the highest scored sentence.
Step 6: Delete the selected sentence from the corpus.
Step 7: Delete all the biphones found in the selected sentence from the biphone list.
Step 8: Repeat Steps 2 to 7 until the biphone list is empty.

$$\mathrm{SC} = \sum_{i=1}^{N_p} \frac{1}{Pf_i}$$

Equation 1: Sentence score

SC - the sentence score
N_p - the number of phonemes in each sentence
Pf_i - the frequency in the corpus of the i-th phoneme of the sentence

This algorithm selects sentences successively. The first sentence selected is the one that covers the largest biphone count.

Our text corpus contains ~2 million sentences with 47 million biphones. In our experiment, the greedy selection process took 26 hours 44 minutes of CPU time on a 2.4 GHz Core i5 PC equipped with 3 GB of memory.

The results of automatic selection are not ideal because of the accuracy rates of the text normalizer and the G2P system, so hand pruning (Kominek and Black, 2004) is required. A visual inspection was made using several criteria, such as awkward grammar, confusable homographs and hard-to-pronounce words. Next, the phonetically transcribed text was visually inspected, as our text normalization and G2P systems produced some errors. Finally, a phonetically balanced text corpus was developed.

4 Recording

The next issue is recording, which involves selecting the speaker, the recording environment and the recording instruments. Since speaker choice is perhaps one of the most vital aspects of recording, it was made carefully. A female professional speaker, aged 29, was chosen.

As far as recording conditions are concerned, we tried to maintain as high a quality as possible. The utterances were recorded using the Nuendo speech processing software in a professional voice recording studio. The equipment consisted of an integrated Tascam TM-D4000 digital mixer, a high-fidelity noise-free Audio-Technica microphone and two high-quality multimedia speaker systems. The voice talent was asked to keep a distance of 10-12 inches from the microphone. Optionally, a pop filter was used between
the speaker and the microphone to reduce the force of air puffs from bilabial plosives and other strongly released stops. The speech data was digitized at a sample rate of 44.1 kHz with a sample width of 24 bits and stored in wave format. After each recording, the moderator checked for any mispronunciation, and the affected utterances were re-recorded.

There were a few challenges in the recording. First, the speaker was asked to keep the speaking style consistent. Second, the speaker was supervised to keep the same tone throughout the recording. Since speaking styles vary between sessions, monitoring was required to maintain consistency. To keep the speaking style consistent, in addition to the recommendations of Zhu et al. [29], the following specifications were maintained:
1. Recordings were done in the same time slot in every session, i.e. 9.00 am to 1.00 pm.
2. A 5-minute break was taken after every 10 minutes of recording.
3. Consistent volume of sound.
4. Normal intonation was maintained without any emotion.
5. Accurate pronunciation.
A pre-recorded voice with an appropriate speaking style was used as a reference. In each session, the speaker was asked to adjust her speaking style according to the reference voice.

5 Annotation

The uncleaned recorded data was around 2 hours 5 minutes long and contained many repeated utterances. In annotation, the recorded wave was first cleaned manually using the proprietary software WaveLab, which reduced the recorded data to 1 hour 11 minutes. It was then labeled (annotated) by id using Praat (Firoj et al., 2010). Praat provides a TextGrid file which contains the labels (in our case the wave id) along with the start and end time of each label. A separate Praat script was written to split the whole wave into individual waves based on id, using the start and end times. We used ids instead of text in labeling. The corpus was organized hierarchically using the XML standard. The file contains metadata followed by data. The metadata contains the recording protocol, speaker profile, text, annotation and spoken content. The data contains the sentences with id, orthographic form, phonetic form and wave id.

6 Results

Our aim was to design a speech corpus that exhibits good biphone coverage. The phonetic coverage of the corpus is shown in Table 4.

Pattern type   No. of unique units   Total found in the corpus (unique)   Coverage
Phones         65                    65                                   100%
Biphones       4,225                 3,277                                77.56%
Triphones      274,625               13,911                               5.06%

Table 4: Phonetic coverage information of the corpus

The percentages for biphone and triphone coverage are based on simple combinations of the 65 phones. Thus the number of possible biphones is 4,225 and the number of possible triphones is 274,625. This corpus covers 100% of the phones and 77.56% of the biphones. Kominek and Black (2004) showed with the Arctic databases that a natural-sounding speech synthesizer can be achieved with 80% biphone coverage, so we hope that better speech applications can be achieved using this corpus.

We also performed a frequency analysis of the phonemes over the whole phonetic corpus. During the analysis, we observed an interesting phenomenon concerning the missing 22.44% of biphones: the phonemes that are infrequent in the phonetic corpus (frequency below 0.11%) are, surprisingly, the ones whose combinations appear in the missing biphone list. We also observed that these combinations come mostly from diphthongs and a few nasals. Based on this empirical study, we claim that the speech corpus is balanced and that it would cover all of our everyday spoken space. This analysis leads to a further assumption: the missing biphone list could be part of the phonotactic constraints of the Bangla language, meaning that these combinations may never occur in the spoken space of Bangla. Further work on the missing biphones is needed, using dictionary words and linguistic experts.

6.1 Comparison between festvox and our approach

We compared the technique used in the festvox "nice utterance selection" (Kominek and Black, 2004) (Black and Lenzo, 2000) with the approach of S. Kasuriya et al. (2003) and Patcharika et al. (2002) that we followed in this study.
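For reference, the approach followed in this study (the greedy procedure of Section 3.4) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation. Sentences are assumed to be already transcribed into lists of biphones, and the names `greedy_select` and `biphone_coverage` are hypothetical. Each sentence is scored by summing the inverse corpus frequency of the still-uncovered biphones it contains, which is one reading of Equation 1. As a simplification, frequencies are computed once instead of being recomputed on every iteration.

```python
from collections import Counter


def greedy_select(sentences):
    """Greedy sentence selection (illustrative sketch).

    sentences: dict mapping a sentence id to its list of biphones (tuples).
    Returns the ids of the selected sentences in selection order.
    """
    # Steps 1-3: unique biphone list, corpus frequencies, inverse-frequency weights.
    freq = Counter(b for bis in sentences.values() for b in bis)
    weight = {b: 1.0 / n for b, n in freq.items()}
    uncovered = set(freq)
    remaining = {sid: set(bis) for sid, bis in sentences.items()}
    selected = []

    def score(sid):
        # Step 4: sum the weights of the uncovered biphones in the sentence.
        return sum(weight[b] for b in remaining[sid] & uncovered)

    while uncovered and remaining:
        # Step 5: highest score wins; sorting the ids makes ties deterministic.
        best = max(sorted(remaining), key=score)
        if score(best) == 0:
            break  # guard: nothing left that adds coverage
        selected.append(best)
        # Steps 6-7: drop the sentence and the biphones it covers.
        uncovered -= remaining.pop(best)
    return selected


def biphone_coverage(found, n_phones=65):
    """Coverage in percent against all n_phones * n_phones possible biphones."""
    return 100.0 * found / (n_phones * n_phones)
```

For example, `biphone_coverage(3277)` evaluates to about 77.56, the biphone coverage reported in Table 4.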
The difference between these two approaches is that the festvox technique uses the most frequent words to select the text in the first step. The rest of the steps are the same in both approaches. We ran the festvox nice utterance selection tool (text2utts) on our data; the result, together with a comparison of the two techniques, is shown in Table 5. According to Firoj et al. (2008 a) and Firoj et al. (2008 b), Bangla has 65 phonemes including vowels and consonants, which leads to 65 x 65 = 4,225 possible biphones. A selected corpus is best when it covers the maximum number of biphones. With the festvox approach, the selected corpus covers 1,495 biphones, which is 35.38% of the whole biphone set. With our approach, the selected text covers 3,277 biphones, which gives 77.56% coverage.

In the festvox experiment we used the 50,000 most frequent words in the first step to select the text. This restriction limits the number of utterances that the festvox text2utts tool can select.

Pattern             Festvox approach          Approach used in this study
No. of sentences    677                       977
No. of biphones     26,274                    70,030
No. of phones       26,914                    71,007
Biphone coverage    35.38% (1,495 biphones)   77.56% (3,277 biphones)
Phone coverage      84.61% (55 phones)        100% (65 phones)

Table 5: Comparison between festvox and our technique

7 Conclusion and future remarks

In this paper we presented the development of a phonetically balanced Bangla speech corpus. The corpus contains 977 sentences with 77.56% biphone coverage. More sentences are needed to cover all biphones, and more text corpora may be required to find them. However, finding all biphones is practically impossible because of the linguistic diversity and phonotactic constraints of a language. In addition, a significant amount of effort is needed before this resource can be used in real speech applications. For speech synthesis, this includes recording a number of male and female voice talents in a professional recording environment. Speech recognition applications require more recorded data in different environments, which includes recording a large number (>50) of male and female voice talents.

References

A. W. Black and K. Lenzo, 2000. Building voices in the Festival speech synthesis system, http://festvox.org/bsv.

Ayesha Binte Mosaddeque, Naushad UzZaman and Mumit Khan, 2006. Rule based Automated Pronunciation Generator, Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December.

C. W. Patcharika, C. Wutiwiwatchai, P. Cotsomrong, and S. Suebvisai, 2002. Phonetically distributed continuous speech corpus for thai language, Available online at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.7778

CRBLP pronunciation lexicon, 2009. CRBLP, Available: http://crblp.bracu.ac.bd/demo/PL/

Dafydd Gibbon, Inge Mertins, Roger K. Moore, 2000. Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology and Product Evaluation (The Springer International Series in Engineering and Computer Science), Springer, August 31.

Firoj Alam, S. M. Murtoza Habib and Professor Mumit Khan, 2008 a. Acoustic Analysis of Bangla Consonants, Proc. Spoken Language Technologies for Under-resourced language (SLTU'08), Vietnam, May 5-7, page 108-113.

Firoj Alam, S. M. Murtoza Habib, and Mumit Khan, 2008 b. Research Report on Acoustic Analysis of Bangla Vowel Inventory, Center for Research on Bangla Language Processing, BRAC University.

Firoj Alam, S. M. Murtoza Habib and Mumit Khan, 2009. Text Normalization System for Bangla, Conference on Language and Technology 2009 (CLT09), NUCES, Lahore, Pakistan, January 22-24.

Firoj Alam, S. M. Murtoza Habib, Dil Afroza Sultana and Mumit Khan, 2010. Development of Annotated Bangla Speech Corpora, Spoken Language Technologies for Under-resourced language (SLTU'10), Universiti Sains Malaysia, Penang, Malaysia, May 3-5.

Fisher, William M.; Doddington, George R. and Goudie Marshall, Kathleen M. 1986. The DARPA Speech Recognition Research Database: Specifications and Status, Proceedings of DARPA Workshop on Speech Recognition. pp. 93-99.

François, H. and Boëffard, O., 2002. The Greedy Algorithm and its Application to the Construction of a Continuous Speech Database, Proc. of LREC, Las Palmas de Gran Canaria, Spain.

François, H. and Boëffard, O., 2001. Design of an Optimal Continuous Speech Database for Text-To-Speech Synthesis Considered as a Set Covering Problem, Proc. of Eurospeech, Aalborg, Denmark.

Gibbon, D., Moore, R., Winski, R. (eds.), 1997. Handbook of Standards and Resources for Spoken Language Systems, Mouton de Gruyter, Berlin New York.

J. Kominek and A. Black, 2004. The cmu arctic speech databases, 5th ISCA Speech Synthesis Workshop, pp. 223-224.

S. Kasuriya, V. Sornlertlamvanich, P. Cotsomrong, S. Kanokphara, and N. Thatphithakkul, 2003. Thai speech corpus for thai speech recognition, The Oriental COCOSDA 2003, pp. 54-61. [Online]. Available online at: http://www.tcllab.org/virach/paper/virach/colips2004_final.rtf

Sproat R., Black A., Chen S., Kumar S., Ostendorf M., and Richards C, 2008. Normalization of Non-Standard Words: WS'99 Final Report, CLSP Summer Workshop, Johns Hopkins University, 1999, Retrieved (June, 1, 2008). Available: www.clsp.jhu.edu/ws99/projects/normal

V. Radová and P. Vopálka, 1999. Methods of sentences selection for read-speech corpus design, in TSD '99: Proceedings of the Second International Workshop on Text, Speech and Dialogue. London, UK: Springer-Verlag, pp. 165-170. [Online]. Available: http://portal.acm.org/citation.cfm?id=720594

Van Santen, J P. H. and Buchsbaum, A. L., 1997. Methods for optimal text selection, Proc. of Eurospeech, p. 553-556, Rhodes, Greece.

Yeasir Arafat, Md. Zahurul Islam and Mumit Khan, 2006. Analysis and Observations From a Bangla news corpus, Proc. of 9th International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh, December.

Yoshida Yoshio, Fukuroya Takeo, Takezawa Toshiyuki, 2002. ATR Speech Database, Proceedings of the Annual Conference of JSAI, VOL.16th, 124-125, Japan.