An Experimental Analysis of Speech Features For Tone Speech Recognition
The information present in different speech features is often redundant and overlapping. Therefore, it is difficult to identify and separate which aspect of the speech signal is represented by which feature. In speech research, features are very often selected on an experimental basis, and sometimes using a mathematical approach such as Principal Component Analysis (PCA).

The Apatani language of Arunachal Pradesh in North East India belongs to the Tani group of languages. The Tani languages constitute a distinct subgroup within the Tibeto-Burman group of languages [10]. The other languages of the group are Adi, Bangni, Bokar, Bori, Damu, Galo, Hill Miri, Milang, Na, Nyishi, Tagin, Tangam and Yano. The Tani languages are spoken mainly in the contiguous area from the Kameng river to the Siang river of Arunachal Pradesh. A small number of Tani speakers are found in the adjoining areas of Tibet, and only the speakers of the Mising language are found in the Brahmaputra valley of Assam [11]. The Apatani language has six vowels and seventeen consonants [12]. Table 1 presents the Apatani vowels, and Table 2 presents the Apatani consonants with their manner and place of articulation.

Table 1: Apatani vowels.

Tongue height    Tongue position
                 Front    Central    Back
High             ɪ                   ʊ
Mid              ɛ        ə          ɔ
Low                       ɑ:

Table 2: Apatani consonants with their manner and place of articulation.

Manner of Articulation    Labial    Alveolar    Palatal    Velar    Glottal
Stop                      p, b      t, d        ʧ, ʤ       k, g
Nasal                     m         n                      ŋ
Fricative                           s                      kʰ       h
Flap                                r
Approximant                         ɭ           ȷ

II. THE SPEECH FEATURES

Speech is the output of a vocal tract system excited by an excitation source signal. The characteristics of both the vocal tract response and the excitation source signal vary with time to produce different sounds. At the time of speech production, human beings impose duration and intonation patterns on top of the vocal tract response to convey the intended message [13]. The speech signal conveys not only linguistic information but also much other information, such as information about the speaker, gender, social and regional identity, and health and emotional status. The first step of an automatic speech recognition (ASR) system is therefore to form a compact representation of the speech signal that emphasizes the phonetic information of the signal over the other information. Choosing suitable features for developing a speech-based system is one of the most crucial design decisions for such system development. Speech features can be categorized into three classes: excitation source features, vocal tract features and prosodic features.

Speech features extracted from the excitation source signal are called source features. The excitation source signal is obtained by discarding the vocal tract information from the speech signal. This is achieved by first predicting the vocal tract information using linear predictor filter coefficients extracted from the speech signal and then separating it out by inverse filtering. The resulting signal is called the linear prediction (LP) residual signal [14]. The features extracted from the LP residual signal are called excitation source features, or simply source features.

A sound unit is characterized by the sequence of shapes assumed by the vocal tract during production of the sound. The vocal tract system can be considered as a cascade of cavities of varying cross-sectional areas. During speech production, the vocal tract acts as a resonator and emphasizes certain frequency components depending on the shape of the oral cavity. The information about the sequence of vocal tract shapes that produces the sound unit is captured by the vocal tract features, also called system or spectral features. The vocal tract characteristics can be approximately modelled by spectral features like linear predictor coefficients (LPC) and cepstral coefficients (CC) [13].

Prosody plays a key role in the perception of human speech. The information contained in prosodic features is partly different from the information contained in source and spectral features. Therefore, more and more researchers in the speech recognition area are showing interest in prosodic features. Generally, prosody means "the structure that organizes sound". Pitch (tone), energy (loudness) and normalized duration (rhythm) are the main components of prosody for a speaker. Prosody can vary from speaker to speaker and relies on long-term information of speech. Very often, prosodic features are extracted with a larger frame size than acoustic features, as prosodic features exist over long speech segments such as syllables. The pitch and energy contours change slowly compared to the spectrum, which implies that their variation can be captured over a long speech segment [15].

The source, system and prosodic features are distinct from each other from the speech production, feature extraction and perception points of view. They are mostly non-overlapping in nature and represent different aspects of the speech production system. The basic objective of an ASR system is to recognize the phonetic content of the speech signal while discarding the other, irrelevant information.
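As an illustration, the LP residual described above can be obtained with a few lines of Python; the sketch below assumes the librosa and scipy libraries, and the file name, sampling rate and predictor order are illustrative choices rather than settings taken from this paper.

    import librosa
    import scipy.signal

    # Estimate the vocal tract filter with a 10th-order LP model and remove it
    # by inverse filtering; what remains approximates the excitation source.
    speech, sr = librosa.load("utterance.wav", sr=16000)
    lpc = librosa.lpc(speech, order=10)                  # [1, a1, ..., a10]
    residual = scipy.signal.lfilter(lpc, [1.0], speech)  # LP residual signal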
III. PROPOSED METHOD

The block diagram of the proposed model is given in Fig. 1. The pre-emphasized speech signal is first blocked into frames of 100 ms duration with 50% overlap. From each block, two types of features are extracted: spectral features and prosodic features. The spectral features considered in the present study are Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictor Cepstral Coefficients (LPCC). To extract the spectral features, each 100 ms speech frame is re-framed into frames of size 20 ms with 50% overlap, and the MFCC and LPCC features are extracted from each 20 ms frame separately. In the present study we have proposed a temporal k-means clustering of these frame-level spectral features, carried out as follows (a code sketch of our reading is given after the steps):

1. Divide the temporal sequence of frames uniformly into clusters of M consecutive frames each and initialize every cluster centroid from its own frames.

2. Use a data structure (centroid_values, proximity_index) for each centroid, where the proximity_index refers to the central location of the cluster on the time scale.

3. For each frame j, repeat steps 4 to 6.

4. Select the two nearby clusters m and k for the jth frame based on the proximity index. The clusters with consecutive proximity indices m and k are the nearby clusters of j if

    M·m ≤ j ≤ k·M    … (2)

5. Compute the distance of the jth frame from these two cluster centroids.

6. Assign the frame to the nearer cluster and update that cluster's centroid.
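The following Python sketch shows one reading of the steps above. It is a sketch only: the Euclidean distance and the running-mean centroid update are our assumptions where the text is silent, and the frames argument is a (T x D) array of per-frame spectral features for one 100 ms block.

    import numpy as np

    def temporal_kmeans(frames, k=3):
        T = len(frames)
        M = T // k  # frames per initial temporal segment

        # Steps 1-2: uniform temporal segments; each cluster is stored as a
        # (centroid_values, proximity_index) pair, the proximity index being
        # the central frame position of the cluster on the time scale.
        centroids = [frames[c * M:(c + 1) * M].mean(axis=0) for c in range(k)]
        proximity = [c * M + M // 2 for c in range(k)]
        counts = [M] * k

        for j in range(T):  # step 3
            # Step 4: the two clusters temporally nearest to frame j according
            # to their proximity indices (the consecutive pair of Eq. (2)).
            pair = np.argsort([abs(j - p) for p in proximity])[:2]
            # Step 5: Euclidean distance (our assumption) to both centroids.
            d = [np.linalg.norm(frames[j] - centroids[c]) for c in pair]
            # Step 6: assign to the nearer cluster; running-mean update.
            win = pair[int(np.argmin(d))]
            counts[win] += 1
            centroids[win] += (frames[j] - centroids[win]) / counts[win]

        # Clubbing the k centroids yields one k*D-dimensional spectral vector
        # (3 clusters x 13 coefficients = 39 dimensions in Section V).
        return np.concatenate(centroids)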
The algorithm has been applied separately to both the MFCC and the LPCC features, and reduced feature sets have been extracted which represent the spectral characteristics of the speech signal over the entire 100 ms duration. These features are combined with prosodic features extracted from the 100 ms frame, considering it as a single unit. The prosodic features extracted are the maximum, minimum and average values of F0 and energy computed over the entire 100 ms period. These prosodic features are combined with the MFCC and LPCC features separately, and two different feature sets are computed. Each feature set is evaluated for its relative performance in tonal speech recognition.

IV. EXPERIMENTAL SETUP

In the present study, each tonal instance of a vowel has been considered as a different tonal vowel. For example, the vowel [ɑ:] has three associated tones: rising, falling and level. Thus the vowel [ɑ:] gives rise to the tonal vowels [ɑ́:] ([ɑ:] rising), [ɑ̀:] ([ɑ:] falling) and [ɑ̄:] ([ɑ:] level). We refer to these as tonal vowels. Considering the tonal instances as separate vowels, we get sixteen tonal vowels in the Apatani language; they are given in Table 3. Since the vowel [ə] has only one tone, it is not taken into consideration while evaluating the performance of the feature vectors.

A speech database of Apatani tonal words has been prepared to carry out the experiments. The database consists of 12 isolated tonal words spoken by 20 different speakers (13 males and 7 females). The recording has been done in a controlled acoustic environment at a 16 kHz sampling frequency in 16-bit mono format. A headphone microphone has been used for recording the database. The words are selected in such a way that each tonal instance of a vowel occurs at least 5 times among the words. Thus, for each tonal vowel, we have a minimum of 100 instances recorded from the 20 speakers.
Table 3: Apatani tonal vowels.

[ɑ̄:]   Vowel ɑ: with level tone
[ɑ́:]   Vowel ɑ: with rising tone
[ɑ̀:]   Vowel ɑ: with falling tone
[ɪ̄]    Vowel ɪ with level tone
[ɪ́]    Vowel ɪ with rising tone
[ɪ̀]    Vowel ɪ with falling tone
[ɔ̄]    Vowel ɔ with level tone
[ɔ́]    Vowel ɔ with rising tone
[ɔ̀]    Vowel ɔ with falling tone
[ɛ̄]    Vowel ɛ with level tone
[ɛ́]    Vowel ɛ with rising tone
[ɛ̀]    Vowel ɛ with falling tone
[ʊ̄]    Vowel ʊ with level tone
[ʊ́]    Vowel ʊ with rising tone
[ʊ̀]    Vowel ʊ with falling tone
[ə̄]    Vowel ə with level tone

A feature would be effective in discriminating between different tonal vowels if the distributions of the different tonal vowels are concentrated at widely different locations in the parameter space, even though the vowels differ from each other only in the associated tone [16]. A good measure of effectiveness is the ratio of inter-vowel to intra-vowel (within-class) variance for the tonal vowels, referred to as the F-ratio:

    F = Variance of the tonal-vowel means across all classes / Average variance within the classes    … (3)

The overall F-ratio value of a coefficient across all classes is computed as

    F = [ (1/N) Σ_{i=1}^{N} (μ_i − μ̄)² ] / [ (1/N) Σ_{i=1}^{N} S_i ]    … (4)

where N is the number of tonal vowels, μ_i is the mean of a particular coefficient of the feature vector for the ith tonal vowel, and μ̄ is the overall mean value of that coefficient over all the tonal vowels. S_i, the within-class variance for the ith tonal vowel, is given by

    S_i = (1/M_i) Σ_{j=1}^{M_i} (x_{ij} − μ_i)²    … (5)

where x_{ij} is the value of the coefficient for the jth observation of the ith tonal vowel and M_i is the number of observations for the ith tonal vowel. A higher F-ratio value for a coefficient indicates that it can be used for good classification.

Another metric used for measuring the performance of features in discriminating among the tonal instances of a vowel is the Kullback-Leibler distance (KLD). The KLD provides a natural distance between a probability distribution and a target probability distribution. KL distances have been measured among the features extracted from the tonal vowels, and their average has been taken. The higher the distance, the better the tonal phoneme discrimination capability of the feature.
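For concreteness, Eqs. (3)-(5) and the average KL distance can be computed per coefficient as sketched below. Here features is a list of (M_i x D) observation matrices, one per tonal vowel, and the univariate Gaussian class model used for the KLD is our assumption, since the paper does not state how the distributions are estimated.

    import numpy as np

    def f_ratio(features):
        # Per-coefficient F-ratio, Eq. (4).
        mus = np.stack([f.mean(axis=0) for f in features])      # class means
        s = np.stack([f.var(axis=0) for f in features])         # S_i, Eq. (5)
        between = ((mus - mus.mean(axis=0)) ** 2).mean(axis=0)  # numerator
        return between / s.mean(axis=0)

    def avg_gaussian_kld(features):
        # Average symmetric KL distance over all class pairs, assuming one
        # univariate Gaussian per coefficient (our modelling assumption).
        mus = np.stack([f.mean(axis=0) for f in features])
        var = np.stack([f.var(axis=0) for f in features])
        ds = []
        for i in range(len(features)):
            for j in range(i + 1, len(features)):
                kl_ij = 0.5 * (var[i] / var[j] + (mus[i] - mus[j]) ** 2 / var[j]
                               - 1 + np.log(var[j] / var[i]))
                kl_ji = 0.5 * (var[j] / var[i] + (mus[j] - mus[i]) ** 2 / var[i]
                               - 1 + np.log(var[i] / var[j]))
                ds.append(0.5 * (kl_ij + kl_ji))
        return np.mean(ds)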
V. RESULTS AND DISCUSSIONS

All the experiments were carried out using the database described in Section IV. The vowels, in all their tonal instances, were segmented from the isolated words. The segmentation was done using the PRAAT software and was followed by subjective verification. The speech signal is first segmented into frames of 100 ms with 50% overlap; we will refer to these as 1st-level frames. Each 1st-level frame is then passed through two parallel systems. The first system extracts the spectral features, MFCC and LPCC, separately.
To extract the spectral features, whose characteristics are correctly visible only in short-duration frames, we re-frame each 1st-level frame into frames of size 20 ms with 50% overlap; we refer to these as 2nd-level frames. The MFCC and LPCC features are extracted from each 2nd-level frame. The MFCC features are computed using a 21-channel filter bank, resulting in 13-dimensional cepstral features consisting of the coefficients c0 to c12. The LPCC features are computed from a 10th-order linear predictor and extended to 13-dimensional cepstral coefficients. The MFCC and LPCC features are then clustered into 3 clusters using the temporal k-means algorithm. The cluster centroids are clubbed together, giving a 39-dimensional MFCC and a 39-dimensional LPCC feature vector for each 1st-level frame of the speech signal. These two sets of features are then combined with the prosodic features separately. The prosodic features, i.e. the maximum, minimum and average F0 and energy, are computed from each 1st-level frame directly. Thus, we get two sets of 45-dimensional feature vectors (39 spectral features and 6 prosodic features) for each 1st-level frame. We will refer to these features as the High-level MFCC and High-level LPCC features respectively.
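A minimal sketch of assembling one 45-dimensional High-level MFCC vector for a single 1st-level frame is given below. It reuses temporal_kmeans from the sketch in Section III and assumes librosa; the yin pitch tracker and RMS energy are our choices, as the paper names only the F0 and energy statistics.

    import numpy as np
    import librosa

    def high_level_mfcc(frame_100ms, sr=16000):
        # 2nd-level framing: 20 ms windows, 50% overlap, 13 MFCCs (c0..c12)
        # from a 21-channel mel filter bank.
        mfcc = librosa.feature.mfcc(y=frame_100ms, sr=sr, n_mfcc=13, n_mels=21,
                                    n_fft=int(0.020 * sr),
                                    hop_length=int(0.010 * sr)).T
        spectral = temporal_kmeans(mfcc, k=3)  # 39-dim, defined earlier

        # Prosodic part: max/min/mean of F0 and of frame energy over 100 ms.
        f0 = librosa.yin(frame_100ms, fmin=60, fmax=400, sr=sr)
        energy = librosa.feature.rms(y=frame_100ms).flatten()
        prosodic = np.array([f0.max(), f0.min(), f0.mean(),
                             energy.max(), energy.min(), energy.mean()])
        return np.concatenate([spectral, prosodic])  # 39 + 6 = 45 dimensions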
To perform a comparative study of the proposed feature set, we have extracted baseline MFCC and LPCC features from the speech signal with a 20 ms frame size and 50% overlap, using the same experimental setup as described above. To capture the dynamic properties of the speech signal, the first-order and second-order derivatives of the coefficients are also added. Thus we get a 39-dimensional baseline MFCC feature vector and a 39-dimensional baseline LPCC feature vector. The results of the experiment are given in Table 4.

Table 4: Average F-ratio and KL Distance for the features.

Feature vector               F-ratio    KL Distance
Baseline MFCC + ∆ + ∆∆       2.0136     0.4404
Baseline LPCC + ∆ + ∆∆       2.5569     0.6956
High-Level MFCC              5.3350     0.8727
High-Level LPCC              4.3350     0.8754

From these experiments it is observed that, as a result of adding the prosodic features to the MFCC and LPCC features, the overall tonal phoneme discrimination capability increases considerably compared to the baseline MFCC and LPCC features.

In the second set of experiments, we have computed the intra-tone phoneme discrimination capability of the proposed feature set. We have computed the F-ratio value considering all the phonemes of a particular tone (level, rising or falling) as intra-class. Similarly, the KL-distance has been measured only against the other vowels of the same tone. The results are summarized in Table 5.

Table 5: Average F-ratio and KL Distance of the features for intra-tone phoneme discrimination.

Feature vector               F-ratio    KL Distance
Baseline MFCC + ∆ + ∆∆       3.0731     0.4721
Baseline LPCC + ∆ + ∆∆       3.7763     0.3846
High-Level MFCC              4.2870     0.4258
High-Level LPCC              4.4580     0.3516

From the above results it is observed that the proposed features have better intra-tone phoneme discrimination capability. This observation supports the claim that these features can be used for both tonal and non-tonal speech recognizers.

In the third set of experiments, we have evaluated the performance of the features for their inter-tone discrimination capability. In this experiment, we have computed the F-ratio value considering all the instances of a tonal vowel as intra-class and the other tonal instances of the same base vowel as inter-class. Further, the KL-distances have been measured among the tonal instances of the same base vowel only. The results of the experiments are given in Table 6.

Table 6: Average F-ratio and KL Distance of the features for inter-tone discrimination.

Feature vector               F-ratio    KL Distance
Baseline MFCC + ∆ + ∆∆       0.7365     0.0538
Baseline LPCC + ∆ + ∆∆       0.8383     0.293
High-Level MFCC              4.7813     0.5754
High-Level LPCC              3.9852     0.2958

From the above results it is observed that the proposed features perform significantly well in inter-tone discrimination, where the base phoneme is the same and the different tonal instances are distinct from each other only due to a change in tone. In this scenario, the baseline MFCC and LPCC features fail completely to discriminate among the phonemes.

VI. CONCLUSION

This paper presents a feature set for tonal speech recognition. Spectral and prosodic features are combined using a late fusion technique to produce a feature set for the classifier. The proposed feature extraction technique has been evaluated on a tonal phoneme discrimination task. It has been observed that the proposed feature set performs significantly well in both the tonal and the tone-independent evaluation scenarios. Therefore, the proposed feature set can be used as a universal feature vector for both tonal and non-tonal speech recognition systems, which is a long-standing need for the global acceptability of automatic speech recognition systems.

ACKNOWLEDGMENT

This work is supported by UGC major project grant MRP-MAJOR-COM-2013-40580.

REFERENCES

1. M. Yip, The Tonal Phonology of Chinese, New York: Garland Publishing, 1991.
2. D. M. Beach, "The Science of Tonetics and Its Application to Bantu Languages", Bantu Studies, 2nd Series, Vol. 2, pp. 75-106, 1924.
3. N. H. Woo, Prosody and Phonology, Doctoral dissertation, MIT, 1969.

AUTHORS PROFILE