18th European Signal Processing Conference (EUSIPCO-2010)
Aalborg, Denmark, August 23-27, 2010
DATA DRIVEN METHOD FOR NON-INTRUSIVE SPEECH INTELLIGIBILITY ESTIMATION

Dushyant Sharma¹, Gaston Hilkhuysen², Nikolay D. Gaubitch¹, Patrick A. Naylor¹, Mike Brookes¹, Mark Huckvale²

¹ Centre for Law Enforcement Audio Research (CLEAR), Electrical and Electronic Engineering, Imperial College London, UK
email: {dushyant.sharma02, ndg, p.naylor, mike.brookes}@ic.ac.uk

² Speech, Hearing & Phonetic Sciences, University College London, UK
email: {g.hilkhuysen, m.huckvale}@ucl.ac.uk
ABSTRACT

We propose a data driven, non-intrusive method for speech intelligibility estimation. We begin with a large set of speech signal specific features and use a dimensionality reduction approach based on correlation and principal component analysis to find the most relevant features for intelligibility prediction. These are then used to train a Gaussian mixture model from which the intelligibility of unseen data is inferred. Experimental results show that our method gives a correlation with subjective intelligibility of 0.92 and a correlation of 0.96 with the ANSI standard Speech Intelligibility Index.

1. INTRODUCTION

Speech intelligibility is a measure of how much of what is spoken is recognized by a listener. It is an important quantity for speech communication applications in telecommunications, hearing aids and law enforcement intelligence gathering. Intelligibility scores can be classified as either subjective or objective.

Subjective speech intelligibility scores are obtained through listening experiments in which subjects listen to speech samples and are either asked to repeat the words they have heard or to select one from a predefined set of answers. The experiments must be performed on many subjects in order to obtain a statistically reliable estimate, which makes the task of obtaining subjective intelligibility scores expensive and time consuming. Objective intelligibility estimation that can be performed algorithmically is clearly advantageous, and several methods have been developed, including the ANSI standard Speech Intelligibility Index (SII) [1], a development of the Articulation Index (AI) [2]. These measures are intrusive in nature as they require knowledge of the clean speech signal, and although they are useful in controlled experiments, there are many situations where only the noisy speech signal is available; in such cases, it would be valuable to have a non-intrusive measure that operates directly on the observed signals.

We propose a data driven approach to non-intrusive intelligibility estimation inspired by the Low Complexity Speech Quality Assessment (LCQA) method developed by Grancharov et al. [3]. We begin by defining a large set of local and global speech specific features. Subsequently, we employ a dimensionality reduction scheme based on correlation and Principal Component Analysis (PCA) in order to find the features that are best suited for predicting speech intelligibility. Finally, these features are used to train a Gaussian Mixture Model (GMM) which is used to infer the intelligibility of new, unseen data from the noisy speech signal alone.

The remainder of the paper is organized as follows. In Section 2, we review the LCQA method as it was originally proposed, for non-intrusive quality estimation. We then show, in Section 3, how the LCQA framework can be developed into our non-intrusive intelligibility measure. Section 4 presents results for our measure in terms of its correlation with subjective intelligibility scores as well as with intrusive intelligibility measures. Finally, conclusions from this work are drawn in Section 5.

2. LCQA REVIEW

LCQA [3] is a data driven approach to speech quality evaluation which has been shown to correlate well with the subjective Mean Opinion Score (MOS) [4]; its correlation is higher than that of the ITU-T P.563 standard [5] which, like LCQA, is non-intrusive. In the following, we summarize the key features of LCQA and refer the reader to [3] for further details.
A frame selection scheme is developed using thresholds applied to the per frame spectral flatness, spectral dynamics and speech variance features. This allows a flexible voice activity detection to be performed, based on optimizing the feature thresholds so as to maximize the accuracy of the quality estimate. The algorithm models the statistical properties of the per frame features using their mean, variance, skewness and kurtosis. By modeling these global properties of the per frame features, the dimensionality of the feature space is significantly reduced, to 44 features for each speech utterance.
In order to optimize the performance of the classification, the minimum number of global features that maximizes the estimation criterion (quality in the original context) must be retained. This is achieved by the sequential floating backward selection algorithm [6], [7]. After minimizing the root-mean-square error (RMSE) of the LCQA estimate, the final feature vector is reduced to 14 dimensions.
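The floating variants of [6], [7] additionally allow conditional re-inclusion of previously discarded features; the core backward pass alone can be sketched as follows. This is a simplification for illustration, not the authors' implementation, and `score` is a hypothetical stand-in for the RMSE-based criterion (higher is better):

```python
# Simplified sequential backward selection: repeatedly drop the feature
# whose removal yields the best criterion score on the remaining set.
# The full floating variant of [6], [7] also conditionally re-adds
# features; this sketch keeps only the backward pass.
def backward_select(n_features, score, target_dim):
    selected = list(range(n_features))
    while len(selected) > target_dim:
        # Try removing each remaining feature; keep the best reduced set.
        candidates = [[f for f in selected if f != r] for r in selected]
        selected = max(candidates, key=lambda c: score(tuple(c)))
    return selected
```

With a criterion that penalizes noisy features, the selection keeps the informative ones.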
© EURASIP, 2010 ISSN 2076-1465
The LCQA algorithm is trained on a large number of
speech utterances (typically 2 sentences separated by a small
pause) that have been subjectively labeled (through listening experiments for example) with the mean opinion score
(MOS) [8]. Fourteen global features are extracted for each
utterance and a GMM is trained on the joint distribution of
the global features and the MOS for each utterance. The
GMM containing M mixtures is defined by a set of mean
vectors, covariance matrices and mixture weights, estimated
using the Expectation Maximization (EM) algorithm [9].
The global feature vector describes the statistical properties of certain aspects of the speech signal; no explicit auditory or cognitive modeling is performed at any point. This suggests that the algorithm framework may be able to model other subjective criteria, such as the intelligibility of the utterance.
3. NON-INTRUSIVE INTELLIGIBILITY ASSESSMENT

In this section, we describe the Low Cost Intelligibility Assessment (LCIA) algorithm, which estimates speech intelligibility by deriving per frame features from the speech waveform, applying a statistical model and then performing dimensionality reduction and GMM mapping. We also describe the database used for the evaluation of the algorithm and the training procedure.

3.1 Algorithm overview

The key algorithm blocks are illustrated in Fig. 2 and described further in this section.

3.1.1 Derived Features

The first step is a Linear Prediction Coding (LPC) analysis using 20 ms, non-overlapping windows of the speech signal. The frequency response of the LPC coefficients is used to derive a number of per frame features, including the spectral flatness, spectral centroid, excitation variance and spectral dynamics. In addition, the speech variance and the iSNR (defined in Section 3.2) are computed per frame, giving a total of 6 per frame features. The first time derivatives of these (except spectral dynamics) are also computed, resulting in 11 features per frame.

The statistical properties of the pitch period are used in the LCQA algorithm, but pitch estimation in low SNR environments is a challenging task and current algorithms may fail to perform reliably in such conditions [10]. For the purpose of intelligibility estimation in very noisy speech, pitch information obtained through the YIN algorithm [11] was found to correlate poorly with the subjective scores. Given the computational complexity of the pitch tracker and the poor robustness of pitch estimation algorithms in noisy speech, pitch has not been included as a feature.

3.1.2 Global Features

The per frame features are transformed into N per utterance features by modeling the statistical properties of the per frame features through the mean, variance, skewness and kurtosis operators. This statistical description gives a global description of the per frame features and considerably reduces the dimension of the feature set.

3.1.3 Dimensionality Reduction

In order to improve the performance of the classification, it is necessary to retain those features that model the various properties of the signal most effectively. We apply a two step dimensionality reduction scheme to the training data, consisting of a feature subset selection followed by a feature extraction step, as shown in Fig. 1. The first stage, feature subset selection, is achieved through a correlation analysis of the features. It is desirable to retain only those features that have a high correlation with intelligibility and, at the same time, are uncorrelated with the other features. The correlation coefficient based measure for feature i is obtained as:

$$\mathrm{Cor}_i = \frac{R_i}{\sum_{j=1}^{N} R_{ij}}, \qquad (1)$$

where N is the number of features in the global set before feature selection, R_i is the correlation of feature i with the intelligibility scores and R_ij is the correlation of feature i with feature j. The correlation coefficient between vectors Î and I is defined as:

$$R = \frac{\sum_n (\hat{I}_n - \mu_{\hat{I}})(I_n - \mu_I)}{\sqrt{\sum_n (\hat{I}_n - \mu_{\hat{I}})^2 \sum_n (I_n - \mu_I)^2}}, \qquad (2)$$

where μ_I and μ_Î denote the means of I and Î respectively. The correlation coefficient based measure is used to select the P features with the highest Cor_i from the set of N global features. The second step is a feature extraction, in which PCA is used to transform the P selected features into Q dimensions by a linear combination (N > P > Q). In the experiments described later in this paper we use the illustrative case of P = 8 and Q = 7.

Figure 1: The dimensionality reduction scheme involves a feature selection (correlation) followed by feature extraction (PCA).
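Under the definitions of eqs. (1) and (2), the two-step reduction might be sketched as below. This is an illustrative reading, not the authors' code; in particular, taking absolute values of R and R_ij is our assumption, since the sign handling in eq. (1) is not stated in the text:

```python
import numpy as np

def corr(a, b):
    # Pearson correlation coefficient, eq. (2).
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def reduce_features(F, y, P, Q):
    """F: (utterances x N) global features; y: intelligibility scores.
    Step 1: keep the P features maximizing Cor_i = |R_i| / sum_j |R_ij|,
    following eq. (1) (absolute values are an assumption here).
    Step 2: PCA-project the selected, mean-centred features onto Q axes."""
    N = F.shape[1]
    R = np.array([corr(F[:, i], y) for i in range(N)])   # R_i of eq. (1)
    Rij = np.abs(np.corrcoef(F, rowvar=False))           # |R_ij| matrix
    cor = np.abs(R) / Rij.sum(axis=1)
    keep = np.argsort(cor)[::-1][:P]                     # top-P features
    Fp = F[:, keep] - F[:, keep].mean(axis=0)
    # PCA: eigenvectors of the covariance matrix with the Q largest
    # eigenvalues define the linear combination (N > P > Q).
    w, V = np.linalg.eigh(np.cov(Fp, rowvar=False))
    W = V[:, np.argsort(w)[::-1][:Q]]
    return Fp @ W, keep, W
```

The returned projection matrix `W` has orthonormal columns, so the Q extracted dimensions are uncorrelated linear combinations of the P selected features.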
3.1.4 Gaussian Mixture Modeling
A joint GMM is trained on the Q extracted features and the
intelligibility score for each speech utterance in the training
data. The GMM was tested with a range of mixture counts, and the optimal number of mixtures was determined experimentally to be 7, giving the highest correlation and lowest MSE between the estimated intelligibility and the subjective scores.
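Although the mapping step is not detailed here, a standard way to infer a score from such a joint GMM is the MMSE estimate E[I | features]: a posterior-weighted average of the per-mixture Gaussian conditional means. A minimal sketch, under the assumption that the last coordinate of each mixture mean and covariance corresponds to the intelligibility score:

```python
import numpy as np

def gmm_predict(x, weights, means, covs):
    """MMSE intelligibility estimate from a joint GMM over [features, I].
    x: length-Q feature vector; each mean has length Q+1 with the
    intelligibility score in the last coordinate (an assumed layout).
    For each mixture we compute E[I | x, mixture], then average these
    conditionals weighted by the posterior mixture responsibilities."""
    post, cond = [], []
    for w, mu, S in zip(weights, means, covs):
        mu_f, mu_i = mu[:-1], mu[-1]
        S_ff, S_if = S[:-1, :-1], S[-1, :-1]
        d = x - mu_f
        Sinv = np.linalg.inv(S_ff)
        # Marginal likelihood of x under this mixture (up to a constant
        # shared by all mixtures, which cancels in the normalization).
        lik = w * np.exp(-0.5 * d @ Sinv @ d) / np.sqrt(np.linalg.det(S_ff))
        post.append(lik)
        cond.append(mu_i + S_if @ Sinv @ d)   # Gaussian conditional mean
    post = np.array(post) / np.sum(post)
    return float(post @ np.array(cond))
```

With a single mixture this reduces to ordinary linear regression of the intelligibility score on the features.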
3.2 Importance weighted signal-to-noise ratio (iSNR) feature

The signal-to-noise ratio is a popular objective measure for quantifying the amount of additive noise in a signal. We use an intelligibility specific, frequency weighted SNR measure to quantify the effects of additive noise in each time frame of the signal. This forms a per frame feature whose statistical properties over the entire utterance are evaluated. The noise power is estimated using the minimum statistics algorithm [12] for each frame of the signal. The algorithm assumes an additive noise model:

$$x(n) = s(n) + v(n), \qquad (3)$$

where x(n) is the noisy speech, s(n) is the speech signal and v(n) is the noise.

Figure 2: Illustration of the modified LCQA algorithm optimized for intelligibility estimation.

The SII [1] is an intrusive measure that quantifies the aspects of the signal that are audible and usable to the listener. The SII score is monotonically related to intelligibility and lies in the range 0 to 1. The SII defines different Frequency Importance Functions (FIFs) for different speech material. The FIFs are weighting functions applied to the signal spectrum according to the importance of each frequency band to intelligibility. The general SII formula is defined as:

$$\mathrm{SII} = \sum_{k=1}^{N_f} I(k)\,A(k), \qquad (4)$$

where N_f is the number of frequency bands, the band importance function I(k) describes the importance of a frequency band to speech intelligibility and A(k) is the band audibility function [1].

Figure 3: 1/3rd octave band importance function used in the SII calculation [1].

The iSNR for frame i is defined as:

$$\mathrm{iSNR}(i) = 10 \sum_{k=1}^{N_f} I(k) \log_{10} \frac{\max\big(0,\, P_x(i,k) - P_{\tilde{v}}(i,k)\big)}{P_{\tilde{v}}(i,k)}, \qquad (5)$$

where P_x(i,k) is the power spectrum of the input (noisy speech) signal, computed as:

$$P_x(i,k) = X(i,k)\,X^*(i,k), \qquad (6)$$

where X(i,k) is the Discrete Fourier Transform (DFT) of the i-th frame of the input signal. The estimated noise power P_ṽ(i,k) is obtained in a similar way. It is important to estimate the iSNR only for those periods in which the speech signal is active; the iSNR calculation is thus restricted to voiced frames of the signal.
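A per-frame iSNR along the lines of eqs. (5) and (6) might be sketched as below. The band edges and importance weights passed in are illustrative placeholders for the SII band importance function of [1], and the noise power spectrum is assumed to come from a minimum-statistics estimator [12]; bands whose floored SNR is zero are skipped to avoid taking the log of zero, an implementation choice the text does not specify:

```python
import numpy as np

def isnr(frame, noise_psd, band_edges, importance, fs=8000):
    """Importance-weighted per-frame SNR, a sketch of eq. (5).
    frame: time-domain samples of one analysis frame.
    noise_psd: estimated noise power per DFT bin (stand-in for [12]).
    band_edges: list of (lo_hz, hi_hz) bands; importance: I(k) weights.
    Both band definitions are illustrative, not the SII tables of [1]."""
    X = np.fft.rfft(frame)
    Px = (X * np.conj(X)).real                    # eq. (6): power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    total = 0.0
    for k, (lo, hi) in enumerate(band_edges):
        band = (freqs >= lo) & (freqs < hi)
        Pxk, Pvk = Px[band].sum(), noise_psd[band].sum()
        snr = max(0.0, Pxk - Pvk) / Pvk           # floored at 0, as in eq. (5)
        if snr > 0:                               # skip log(0) bands
            total += importance[k] * np.log10(snr)
    return 10.0 * total
```

In practice this would be evaluated only on voiced frames, as noted above.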
3.3 Database
The database consists of 200 sentences [13] from a male speaker. The sentences were corrupted with dynamic samples of car and babble noise at five SNRs chosen to correspond to SII scores of 0.1, 0.3, 0.5, 0.7 and 0.9. The speech activity level was obtained through the ITU-T P.56 algorithm [14] and was used in the SNR calculation when adding the noise. Also included in the database are the noisy utterances processed through the spectral subtraction algorithm [15, 12] available in the Voicebox toolbox [16]. The 20 conditions in the database are summarized in Table 1.

Subjective intelligibility results were obtained from 20 naïve native speakers of British English. All subjects had hearing thresholds of less than 20 dBHL at frequencies ranging from 125 Hz to 8 kHz. The task was to listen to the stimuli and give a vocal reply, which was recorded and scored. There were 5 keywords per sentence for the subject to identify. The subjective scores were averaged over the conditions
1901
to give a condition averaged word intelligibility score in the range 0 to 1. As the same speaker was used for all the utterances, speaker independence has not been investigated in the current study.

Condition | Noise  | SNR (dB) | Suppression
----------|--------|----------|------------
1         | Car    | -9       | off
2         | Car    | -12      | off
3         | Car    | -15      | off
4         | Car    | -18      | off
5         | Car    | -21      | off
6         | Babble | 0        | off
7         | Babble | -3       | off
8         | Babble | -6       | off
9         | Babble | -9       | off
10        | Babble | -12      | off
11        | Car    | -9       | on
12        | Car    | -12      | on
13        | Car    | -15      | on
14        | Car    | -18      | on
15        | Car    | -21      | on
16        | Babble | 0        | on
17        | Babble | -3       | on
18        | Babble | -6       | on
19        | Babble | -9       | on
20        | Babble | -12      | on

Table 1: Database conditions; "Suppression" refers to processing the noisy speech with the spectral subtraction algorithm.

Global Feature                      | Correlation
------------------------------------|------------
Skewness(spectral dynamics)         | 0.90
Kurtosis(spectral dynamics)         | 0.86
Skewness(d/dt(excitation variance)) | 0.80
Skewness(d/dt(iSNR))                | 0.61
Skewness(excitation variance)       | 0.59
Kurtosis(d/dt(excitation variance)) | 0.59
Skewness(d/dt(spectral centroid))   | 0.57
Kurtosis(iSNR)                      | 0.57

Table 2: Absolute correlation coefficients of the raw features with subjective intelligibility scores (computed individually).

           | Subjective | SII  | LCIA
Subjective | 1.0        | 0.91 | 0.92
SII        |            | 1.0  | 0.96
LCIA       |            |      | 1.0

Table 3: Correlations for the 50% cross validation partitions (all test conditions are present in training).
3.3.1 Training
The database was partitioned into a test set and a training set.
The speech material used in the training set was not included
in the test set. Two training schemes were employed:
• 50% cross validation – in this scheme, we partition the database into a test set and a training set of equal size. The training set contains all the conditions that are present in the test set, but the test speech material is not available in the training set. The test and training sets are then swapped, and the performance is averaged over the two cross validated sets.
• Predicting processing effects – in this scheme, the training set only contains the noisy speech conditions and has
no example of the speech processed through spectral subtraction. Here we are interested in investigating the ability of the algorithm to predict the effects of speech enhancement on intelligibility.
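The two partitioning schemes above can be sketched as follows. The (condition, utterance, enhanced) tagging of each item is an illustrative assumption about the data layout, not the paper's format:

```python
# Sketch of the two training partitions of Section 3.3.1. Each utterance
# is a (condition_id, utterance_id, enhanced) tuple (assumed layout).
def split_50_50(utterances):
    # Every condition appears on both sides, but no utterance does:
    # even-numbered utterances train, odd-numbered ones test, then the
    # two sets are swapped and results are averaged over the two folds.
    train = [u for u in utterances if u[1] % 2 == 0]
    test = [u for u in utterances if u[1] % 2 == 1]
    return (train, test), (test, train)

def split_processing(utterances):
    # Train only on unprocessed noisy speech; test on spectral
    # subtraction conditions, probing whether the effect of enhancement
    # on intelligibility can be predicted from noisy-speech training.
    train = [u for u in utterances if not u[2]]
    test = [u for u in utterances if u[2]]
    return train, test
```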
4. RESULTS
We describe two experiments based on the training
schemes presented in the previous section. For the purpose
of these experiments, it has been found that selecting 8 features from the 40 global features and 7 linear combinations
after the feature extraction give good results (N = 40, P = 8
and Q = 7). A non-linear relationship is known to exist between percentage correct intelligibility scores and the SII [1], so a performance metric that accounts for this must be used. The Spearman rank correlation coefficient [17] is a non-parametric measure that describes the monotonic relationship between two variables, unlike the Pearson correlation coefficient (2), which describes a linear relationship. The Spearman correlation coefficient ρ is calculated as:

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}, \qquad (7)$$

where d_i is the difference between the statistical ranks of the subjective and estimated intelligibility scores and n is the number of scores. The performance of the SII is compared with that of our non-intrusive intelligibility method, LCIA, which is based on the LCQA algorithm.
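With the ranks computed explicitly, eq. (7) takes only a few lines; the sketch below assumes no tied scores, the case in which the simplified formula is exact:

```python
# Spearman rank correlation via eq. (7). Assumes no tied scores, in
# which case the formula equals the Pearson correlation of the ranks.
def spearman(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1          # 1-based statistical ranks
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # sum of d_i^2
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Any monotonically increasing mapping between the two score sets yields ρ = 1, which is why this metric is insensitive to the non-linear SII-to-intelligibility relationship.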
4.1 Training on all conditions
In the 50% cross validation training scheme, examples of
all the 20 conditions are represented in the training and test
sets. The results from this experiment are presented in Table 3. The LCIA estimates have a correlation of 0.96 with the ANSI standard SII algorithm, confirming that the modeling within LCIA is well behaved. Moreover, with a correlation of 0.92 with the subjective intelligibility scores, the algorithm outperforms the SII in estimating the intelligibility for additive noise and spectral subtraction, even though LCIA is non-intrusive.
Also, the statistical properties of the spectral dynamics constitute the most important feature (with a correlation of 0.90 with intelligibility), suggesting that the rate of change of the spectrum provides important information for intelligibility estimation.
4.2 Predicting processing
In this experiment, the training set only contains examples
of the noisy speech and no examples of the speech enhanced
through spectral subtraction. The algorithms are evaluated
for their capability in predicting the effect of spectral subtraction on intelligibility. The results are shown in Table 4.
For this scenario, the SII algorithm performs best, with a correlation of 1.0 with subject scores. The LCIA algorithm also
has a high correlation of 0.96 with subjective scores.
           | Subjective | SII  | LCIA
Subjective | 1.0        | 1.0  | 0.96
SII        |            | 1.0  | 0.96
LCIA       |            |      | 1.0

Table 4: Correlations with different test/train partitions (predicting the effect of the algorithm).
5. CONCLUSIONS
A low complexity, data driven, non-intrusive speech intelligibility estimation algorithm was presented. The algorithm computes 40 global features per utterance and applies a two step dimensionality reduction based on correlation and PCA. This results in 7 features, which are used to train a GMM with 7 mixtures. The statistical modeling of the features through skewness and kurtosis was found to correlate well with intelligibility for speech corrupted by noise and for predicting the effects of spectral subtraction. The importance function weighted signal-to-noise ratio (iSNR) was also shown to be an informative feature.

The algorithm has a correlation of 0.96 with the intrusive SII method and was shown to predict the effects of spectral subtraction processing with a correlation of 0.96. Finally, our approach was shown to give a correlation of 0.92 with subjective intelligibility scores.
REFERENCES
[1] ANSI, “Methods for the Calculation of the Speech Intelligibility Index,” American National Standards Institute, ANSI Standard S3.5-1997 (R2007), 1997.
[2] ——, “Methods for the Calculation of the Articulation
Index,” American National Standards Institute, New
York, ANSI Standard ANSI S3.5-1969, 1969.
[3] V. Grancharov, D. Zhao, J. Lindblom, and W. Kleijn,
“Low-Complexity, Nonintrusive Speech Quality Assessment,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1948–1956, 2006.
[4] ITU-T, “ITU-T coded-speech database,” ITU-T Supplement P.Sup23, Feb. 1998.
——, “Single-ended method for objective speech quality assessment in narrow-band telephony applications,”
ITU-T Recommendation P.563, 2004.
[6] S. Stearns, “On selecting features for pattern classifiers,” in Proc. 3rd Int. Conf. Pattern Recognition, 1976,
pp. 71–75.
[7] P. Pudil, F. Ferri, J. Novovicova, and J. Kittler, “Floating search methods for feature selection with nonmonotonic criterion functions,” in Proc. IEEE Int. Conf. Pattern Recognition, 1994, pp. 279–283.
[8] ITU-T, “Methods for subjective determination of
transmission quality,” Online, ITU-T Recommendation
P.800, Aug. 1996. [Online]. Available: http://www.itu.
int/rec/T-REC-P.800/en
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series
B, vol. 39, no. 1, pp. 1–38, 1977.
[10] D. Sharma and P. A. Naylor, “Evaluation of pitch estimation in noisy speech for application in non-intrusive speech quality assessment,” in Proc. European Signal Processing Conf. (EUSIPCO), 2009.
[11] A. de Cheveigne and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917–1930, Apr. 2002.
[12] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. on Speech and Audio Processing, vol. 9, pp. 504–512, Jul. 2001.
[13] M. W. Smith and A. Faulkner, “Perceptual adaptation by normally hearing listeners to a simulated hole in hearing,” J. Acoust. Soc. Amer., vol. 120, pp. 4019–4030, 2006.
[14] ITU-T, “Objective Measurement of Active Speech Level,” ITU-T Recommendation P.56, Mar. 1993.
[15] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE ICASSP, vol. 4, 1979, pp. 208–211.
[16] D. M. Brookes, “VOICEBOX: A speech processing toolbox for MATLAB,” 1997. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[17] E. L. Lehmann and H. J. M. D’Abrera, Nonparametrics: Statistical Methods Based on Ranks. Englewood Cliffs, NJ: Prentice-Hall, 1998.