Automatic Pronunciation Assessment For Language Learners With Acoustic-Phonetic Features
$$ d(x) = \log \frac{L(x \mid \lambda_1)}{L(x \mid \lambda_2)} \qquad (1) $$
where L(x|λ1) is the likelihood of a test point x in the observation space for the model λ1 of
class 1 (and L(x|λ2) likewise for class 2). Here class 1 refers to unaspirated stops and class 2 to
aspirated stops. In the case of proper articulation, d(x) is expected to be greater than zero for
unaspirated stops and less than zero for aspirated stops.
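As an illustration, the score in Eq. (1) can be computed as the difference of log-likelihoods
under the two class models. The sketch below assumes the class models λ1 (unaspirated) and λ2
(aspirated) are Gaussian mixture models; the model type, function names and settings here are
illustrative assumptions and not a restatement of the system described above.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_model(features, n_components=4, seed=0):
    """Fit a GMM to the training feature vectors of one class (assumed model type)."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=seed).fit(features)

def likelihood_ratio(x, gmm_unaspirated, gmm_aspirated):
    """d(x) = log L(x | lambda_1) - log L(x | lambda_2), as in Eq. (1).
    d(x) > 0 is expected for a properly articulated unaspirated stop,
    d(x) < 0 for a properly articulated aspirated stop."""
    x = np.atleast_2d(x)
    return float((gmm_unaspirated.score_samples(x) - gmm_aspirated.score_samples(x))[0])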
For each test speaker, we compute the distribution of the likelihood ratios across the speaker's
set of intended unaspirated stops and, separately, across the set of intended aspirated stops. If
the stops are all properly articulated, we expect a good separation of the two distributions. Fig. 1
shows the distributions obtained for each of the 10 native and non-native speakers using the AP
feature system. We note the prominent difference in the extent of overlap between the two
likelihood-ratio distributions for native speakers compared with non-native speakers. Fig. 2 shows
the corresponding results for the MFCC feature system. While there is a difference in the overlap
observed for the non-native speakers, the distinction between native and non-native speakers is
much clearer across speakers with the AP features.
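The per-speaker grouping described above can be sketched as follows. The token format
(speaker id, intended class, feature vector) and the GMM class models are assumptions made for
illustration; the actual data handling in the system may differ.

import numpy as np
from collections import defaultdict

def per_speaker_ratio_distributions(test_tokens, gmm_unaspirated, gmm_aspirated):
    """test_tokens: iterable of (speaker_id, intended_class, feature_vector),
    with intended_class either 'unaspirated' or 'aspirated'."""
    dists = defaultdict(lambda: {"unaspirated": [], "aspirated": []})
    for speaker_id, intended_class, x in test_tokens:
        x = np.atleast_2d(x)
        # d(x) = log L(x | lambda_1) - log L(x | lambda_2), as in Eq. (1)
        d = (gmm_unaspirated.score_samples(x) - gmm_aspirated.score_samples(x))[0]
        dists[speaker_id][intended_class].append(d)
    return dists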
[Figure 1 appears here: per-speaker density plots titled "Distribution of likelihood ratio for native data using AP features" and "Distribution of likelihood ratio for non-native data using AP features"; x-axis: likelihood ratio, y-axis: density.]
FIGURE 1 Speaker-wise distribution of likelihood ratio for native and non-native data using AP cues
(solid line: intended unaspirated; dashed line: intended aspirated)
[Figure 2 appears here: per-speaker density plots titled "Distribution of likelihood ratio for native data using MFCCs" and "Distribution of likelihood ratio for non-native data using MFCCs"; x-axis: likelihood ratio, y-axis: density.]
FIGURE 2 Speaker-wise distribution of likelihood ratio for native and non-native data using MFCCs
(solid line: intended unaspirated; dashed line: intended aspirated)
Native test set                      Non-native test set
Speaker no.    AP        MFCCs       Speaker no.    AP       MFCCs
1              132.79    79.66       1              0.01     2.11
2              373.42    2.12        2              3.3      11.38
3              76.57     12.89       3              0.3      0.29
4              113.87    23.09       4              0.56     1.08
5              74.72     67.88       5              6.91     14.41
TABLE 3 Speaker-wise F-ratio of the unaspirated-aspirated likelihood ratio for the native and
non-native test sets.
The difference between the performances of the MFCC and AP features in the task of detecting non-
native pronunciation can be understood from the F-ratio values across the 10 speakers in
Table 3. The F-ratio is computed for the pair of corresponding unaspirated and aspirated likelihood-
ratio distributions for each speaker and each feature set. A larger F-ratio indicates a
better separation of the particular speaker's aspirated and unaspirated utterances in the
corresponding feature space, which may be interpreted as higher intelligibility. We see from
Table 3 that this intelligibility measure takes on distinctly different values for native and
non-native speakers in the case of the AP feature based system, and consequently an accurate
detection of non-nativeness is possible. In the case of the MFCC features, however, there is no
clear threshold separating the F-ratios of non-native speakers from those of native speakers.
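A sketch of this separation measure is given below. The exact F-ratio definition is not restated
in this excerpt, so a common two-class Fisher-style ratio (squared difference of the class means
divided by the sum of the class variances) is assumed here; the authors' precise formula may
differ.

import numpy as np

def f_ratio(ratios_unaspirated, ratios_aspirated):
    """F-ratio between a speaker's two likelihood-ratio distributions
    (assumed definition: between-class spread over within-class spread)."""
    u = np.asarray(ratios_unaspirated, dtype=float)
    a = np.asarray(ratios_aspirated, dtype=float)
    between = (u.mean() - a.mean()) ** 2
    within = u.var() + a.var()
    return between / within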
To summarise, we have proposed a methodology for evaluating pronunciation quality in the
context of a selected phonemic attribute. It was demonstrated that acoustic-phonetic features
provide better discriminability between correctly and incorrectly uttered aspirated stops of Hindi
compared with the more generic MFCC features. Future work will address other phonemic
attributes while also expanding the dataset of test speakers.