Design Analysis and Experimental Evaluation of Block-Based Trcomputati
Received 18 April 2011; received in revised form 14 November 2011; accepted 18 November 2011
Available online 26 November 2011
Abstract
The standard Mel frequency cepstrum coefficient (MFCC) computation technique uses the discrete cosine transform (DCT) to decorrelate the log energies of the filter bank output. The use of DCT is reasonable here because the covariance matrix of the Mel filter bank log energy (MFLE) is comparable to that of a highly correlated Markov-I process. This full-band MFCC computation technique, in which each filter bank output contributes to all coefficients, has two main disadvantages. First, the covariance matrix of the log energies does not exactly follow the Markov-I property. Second, the full-band MFCC feature is severely degraded when the speech signal is corrupted by narrow-band channel noise, even though a few filter bank outputs may remain unaffected. In this work, we have studied a class of linear transformation techniques based on block-wise transformation of MFLE, which effectively decorrelate the filter bank log energies and also capture speech information efficiently. A thorough study of the block based transformation approach has been carried out by investigating a new partitioning technique that highlights the associated advantages. This article also reports a novel feature extraction scheme that captures information complementary to the wide-band information, which otherwise remains undetected by standard MFCC and the proposed block transform (BT) techniques. The proposed features are evaluated on NIST SRE databases using a Gaussian mixture model-universal background model (GMM-UBM) based speaker recognition system. We have obtained significant performance improvement over baseline features for both matched and mismatched conditions, and for both standard and narrow-band noises. The proposed method achieves significant performance improvement in the presence of narrow-band noise when combined with a missing feature theory based score computation scheme.
Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.
Keywords: Speaker recognition; MFCC; DCT; Correlation matrix; Decorrelation technique; Linear transformation; Block transform; Narrow-band
noise; Missing feature theory
0167-6393/$ - see front matter Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2011.11.004
544 Md. Sahidullah, G. Saha / Speech Communication 54 (2012) 543–565
most popular and has become standard in speaker recognition systems. MFCC is also popular because of the efficient computation schemes available for it and its robustness in the presence of different noises.

In the MFCC computation process, the speech signal is passed through several triangular filters which are spaced linearly on a perceptual Mel scale. The Mel filter bank log energy (MFLE) of each filter is calculated. Finally, cepstral coefficients are computed using a linear transformation of the MFLE. The linear transformation is essential here, for two major reasons: (a) improving the robustness: the MFLEs are not very robust and are highly susceptible to small changes in signal characteristics due to noise and other unwanted variabilities; (b) decorrelation: the log energy coefficients are highly correlated, whereas uncorrelated features are preferred for statistical pattern recognition systems, especially for the diagonal covariance based Gaussian mixture model (GMM) employed in today's speaker recognition systems.

Amongst all linear transformations, the discrete cosine transform (DCT) is the most popular and widely used for MFCC computation. The motivations behind the use of DCT can be stated as follows. Firstly, the DCT is a sub-optimal approximation of the basis functions of the Karhunen–Loève transform (KLT) when the correlation matrix of the sample closely approximates the correlation matrix of a Markov-I process (Ahmed et al., 1974). The correlation matrix of MFLE data is fairly similar to the correlation matrix of a first order Markov process. Secondly, DCT has the best energy compaction property for arbitrary data length compared to the DFT and other sinusoidal transforms such as the discrete sine transform (DST), discrete Hadamard transform (DHT), etc. (Oppenheim and Schafer, 1979).

Though the DCT based MFLE transformation technique is very popular, some studies have been carried out recently on further processing schemes of cepstral coefficients to improve the robustness against channel and other variabilities (Garreton et al., 2010; Hung and Wang, 2001; Nasersharif and Akbari, 2007). Principal component analysis (PCA) (Takiguchi and Ariki, 2007), linear discriminant analysis (LDA) (Kajarekar et al., 2001), independent component analysis (ICA) (Kwon and Lee, 2004), etc. are some traditional techniques which are also applied for formulating decorrelated features for speech processing applications.

Our proposed work focuses on designing a linear transformation technique which can effectively preserve speech related information to improve speaker recognition performance. Motivated by the fact that block-wise filter bank outputs are more suitable for transformation using DCT, we have investigated a block based transformation approach in place of the traditional full-band DCT, which is applied to all the MFLEs at a time. Earlier, a block based cosine transform was applied for speech recognition (Jingdong et al., 2000). Recently, DCT has been applied in a distributed manner (Sahidullah and Saha, 2009) to formulate features for speaker identification. In image processing applications, DCT has also been applied in a blocked manner (Jain, 2010). Subband DCT based coding has been shown to be effective in image coding and image resizing schemes where the DCT is computed for different blocks of subbands (Jung et al., 1996; Mukherjee and Mitra, 2002). Here the signal is first divided into two parts, a high pass and a low pass, and the DCT is computed for each signal separately. On the other hand, subband based speaker recognition has also gained attention as an alternative to conventional MFCC. In (Sivakumaran et al., 2003), different experimental results are reported based on subband DCT. During the last decade, several works have been carried out on subband processing based speaker recognition (Besacier and Bonastre, 2000; Finan et al., 2001; Damper and Higgins, 2003; Vale and Alcaim, 2008). The mathematical relationship between multi-band and full-band MFCC coefficients is established in (Mak, 2002). In (Kim et al., 2008), subband DCT based MFCC is shown to perform better than full-band MFCC for different additive noises. There exist a number of other works on subband DCT or multi-band MFCC where it is shown to outperform the existing baseline MFCC, especially for partially corrupted speech signals (Besacier and Jean-François, 1997; Ming et al., 2007; Jingdong et al., 2004). Though it has played an effective role in improving the performance of speech processing applications, multi-block DCT is still not widely used in state-of-the-art speaker recognition systems. The main reason is that most of the existing works are at an experimental level and the design issues related to multi-block configuration (i.e. number of bands, size of band, etc.) are yet to be precisely addressed. This is one of the main issues behind its unpopularity in spite of its superior empirical performance for speech and speaker recognition.

In our present work, the design issues related to the block based MFCC computation scheme are addressed carefully along with a thorough experimental evaluation. The cepstral coefficient using the multi-block DCT approach is systematically formulated. The scheme is also restructured to improve the performance of speaker recognition. The block transform (BT) based approach is shown to carry several levels of information. A novel block based approach is also proposed which carries information complementary to the formerly proposed methods. The strengths of both systems are combined using weighted linear fusion to get better performance. We have evaluated the performance of the speaker recognition system with NIST SRE 2001 (for matched condition) and NIST SRE 2004 (for both matched and mismatched conditions). The experimental results show the superiority of our proposed block-based MFCC computation scheme for both databases. As a final point, the paper proposes a technique where significant performance improvement is obtained for the multi-block approach using linear transformation only. The system also performs better than standard MFCC for different types of noise. Additionally, this system is significantly better than the baseline system in the case of narrow-band noise when
missing feature theory is applied (Lippmann and Carlson, 1997) by considering the scores of reliable and non-reliable features with unequal degree.

The rest of the paper is organized as follows. First, in Section 2, a brief overview of MFCC computation is presented mathematically for completeness and better readability of the paper. In Section 3, different multi-block transformation techniques are formulated. In Section 4, several issues of block transformation techniques are dis-

MFCC computation technique is based on the DFT magnitude of a speech frame. A detailed description of this process with a block diagram can be found elsewhere (Chakroborty, 2008). In this work, we analyze the MFCC from a mathematical point of view with the help of matrix operation notation. Though the different steps of MFCC calculation are standard and well-known, we review them briefly in a new way with a matrix based approach to formulate the problem addressed here.

Let us have T speech frames, each of size N, extracted from a speech utterance. The following are the different steps of MFCC computation.

1. Windowing: In the first stage, the signal is multiplied with a tapered window (usually a Hamming or Hanning window). The windowed speech frames are given by,

$$[s_w]_{T \times N} = [s]_{T \times N} \circ [w]_{T \times N}, \qquad (1)$$

where s is a matrix containing framed speech, w is another matrix whose T rows contain the same window function w of size N, and $\circ$ denotes entry wise matrix multiplication.

2. Zero-padding: Zero-padding is required to compute the power spectrum using the fast Fourier transform (FFT). A sufficient number of zeros is padded using the following matrix operation:

$$[s_{zp}]_{T \times M} = [s_w]_{T \times N}\,[\,I \;\; O\,]_{N \times M}, \qquad (2)$$

where I is an identity matrix of size N × N and O is a null matrix of size N × (M − N). Here M is a power of two and is greater than N.

3. DFT computation: The windowed speech frames are multiplied with the twiddle factor matrix (W) to formulate the DFT coefficients (X). Half of the twiddle factor matrix is sufficient due to the conjugate symmetric property of the Fourier transform. This operation can be expressed as,

$$[X]_{T \times \frac{M}{2}} = [s_{zp}]_{T \times M}\,[W]_{M \times \frac{M}{2}}. \qquad (3)$$

4. Power spectrum computation: The power spectrum (H) is computed by entry wise multiplying the DFT coefficients with their conjugate. This can be written as,

$$[H]_{T \times \frac{M}{2}} = [X]_{T \times \frac{M}{2}} \circ [X^{*}]_{T \times \frac{M}{2}}. \qquad (4)$$

5. Filter bank log energy computation: The speech signal is passed through a triangular filter bank of frequency response (K) which contains p filters, linearly spaced in Mel scale. The log energy output (W) of the filter bank is given by,

$$[W]_{T \times p} = \log\left([H]_{T \times \frac{M}{2}}\,[K]_{\frac{M}{2} \times p}\right). \qquad (5)$$

where each column of D is a p-dimensional orthogonal basis vector of DCT. However, since the first coefficient is discarded as it is the dc-coefficient, multiplication with a p × (p − 1) matrix is adequate in DCT computation.

In this work, we have investigated a better alternative to the standard DCT (i.e. D) using block transform, which is shown to give better speaker recognition performance with a GMM-UBM based speaker recognition system.

3. Block transform approach in MFCC computation

Block based transformations are very popular in image coding (Akansu and Haddad, 1992). In this approach, the whole signal is divided into non-overlapping blocks and individual blocks are processed independently.

Let F be a signal matrix of dimension T × p. Now, in the block transformation approach, F is transformed with a linear kernel L of size p × d such that L is strictly a band matrix and it can be expressed as,

(7)

where U_1, U_2 and U_3 are orthogonal matrices. This is the fundamental idea behind block transformation (BT) which is applied here for computing cepstral vectors from MFLE. In a standard BT, the U-matrix is selected as an orthogonal transformation like DCT, DST, DHT, etc. The eigenvectors of the underlying covariance matrix of the signal vector are used to choose the orthogonal transformation. In MFCC computation, as the BT is applied on MFLE data which are highly correlated and eventually follow the Markov-I property, the DCT matrix is a better choice for U. The decomposition of the Mel filter bank output into blocks is an important issue at this point. In image processing applications, a large image is divided into smaller blocks of size 8 × 8 or 16 × 16. In a speaker recognition system, the number of outputs from the filter bank is not so large compared to the size
of images in image processing applications. In our proposed work, we consider a filter bank consisting of 20 filters, which is the size mostly used in MFCC computation. Hence, we consider two or three block based approaches. In the experimental section (Section 5), we will observe that this choice is reasonably good for the given filter bank size. However, a multi-block approach with a larger number of blocks could be used where a higher number of critical bands is considered. In the following subsections, we propose three kinds of block transformation. The first form of BT, which is based on crisp partitioning of MFLE, is referred to as Non-Overlapped Block Transform (NOBT). Overlapped Block Transform (OBT), on the other hand, is the second category of BT, which is formulated by extending the block sizes of NOBT in the direction of adjacent blocks. The last kind of BT introduced in this work is a special case of OBT where the basis functions of the transformation are shifted forms of each other. This transformation is named Shifted Basis Block Transform (SBT) and it is shown to carry localized spectral information which is lacking in NOBT and standard OBT.

3.1. Non-Overlapped Block Transformation (NOBT)

The most elementary block transformation scheme is non-overlapped block transformation. In this scheme, the transformation matrix is the direct sum of two orthogonal matrices, i.e. the transformation is carried out on two non-overlapping blocks. Therefore, the transformation matrix (L_nobt) can be expressed as,

$$L_{nobt} = U_1 \oplus U_2 \oplus \cdots \oplus U_N = \begin{bmatrix} U_1 & O & \cdots & O \\ O & U_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & U_N \end{bmatrix}, \qquad (8)$$

where U_1, U_2, U_3, ..., U_N are orthogonal transformation matrices.

This design can be made such that the blocks have equal size. In that case, the block size must be a factor of the total number of filters in the filter bank, i.e. p. In standard MFCC computation, DCT is used for its decorrelation property. We have experimentally observed that nearby MFLEs (i.e. smaller blocks) are more suitable for DCT as they closely follow the Markov-I property. This information is used for choosing U_i.

For example, consider two blocks of the same size q such that p = 2q. Hence, the transformation matrix is given by,

$$L_A = \begin{bmatrix} U_q & 0 \\ 0 & U_q \end{bmatrix}, \qquad (9)$$

where U_q is a DCT matrix of size q × (q − 1) and it is given by

$$U_q^{ij} = \sqrt{\frac{2}{q}} \cos\left(\frac{\pi i (2j+1)}{2q}\right). \qquad (10)$$

The dimension of the transformation matrix for two block NOBT is p × 2(q − 1), as the total number of output coefficients is 2(q − 1) after discarding the dc-coefficients, which have an insignificant effect in speaker recognition. As U_i is orthogonal, $L_{nobt}^{T} L_{nobt} = I$.

The basis vectors and their frequency responses also have an interesting property for NOBT when each block has an equal number of samples. For example, when the number of filters is p = 20, the basis vectors of DCT and the filter bank response of full-band DCT are shown in Figs. 1 and 2. The basis vectors and frequency responses of NOBT consisting of two blocks each of size 10 are shown in Figs. 3 and 4. Clearly, the ith and (i + 10)th basis functions (for i = 1, 2, 3, ..., 10) of the NOBT are shifted basis pairs; hence, they have similar frequency responses. The main lobe width of the NOBT is also higher than that of the full-band DCT based filters.

The cepstral coefficient (x^nobt) using two block based NOBT can be written as,

$$x_i^{nobt}\Big|_{i=1}^{q-1} = \sqrt{\frac{2}{q}} \sum_{j=1}^{q} W(i) \cos\left(\frac{\pi i (2j+1)}{2q}\right),$$
$$x_i^{nobt}\Big|_{i=q+1}^{p-q-1} = \sqrt{\frac{2}{p-q}} \sum_{j=1}^{q} W(q+i) \cos\left(\frac{\pi (i-q)(2j+1)}{2(p-q)}\right). \qquad (11)$$

The NOBT is a simple and computationally efficient scheme for calculating cepstral coefficients directly from MFLE. The choice of block size for NOBT is studied experimentally and discussed in Section 4.1. We have shown that a formant specific block selection approach is better for speech feature computation. The advantages of NOBT over a single transformation are: (i) NOBT has a localization effect. If the speech spectrum is partially distorted due to noise, then only a part of the feature vector is affected while the rest remains unaltered. (ii) As the dc-coefficient has a less significant contribution in speaker recognition, the feature dimension of NOBT is lower than that of the full-band case, where only one dc-coefficient can be discarded.

Despite the above listed advantages, it has a major drawback due to its abrupt discontinuity at the block boundary. In block based image coding schemes, this problem is addressed using the lapped orthogonal transform (Malvar and Staelin, 1989). Motivated by this, in the subsequent subsection, a solution is prescribed using overlapping of neighboring blocks.

3.2. Overlapped Block Transformation (OBT)

In this scheme of block transform, the neighboring blocks share some filter bank log energy coefficients to avoid the discontinuity at the end. In Fig. 5, the overlapped block transformation matrix (L_obt) is shown with two blocks of size q_a and q_b where the total number of elements in MFLE is p. U_A and U_B are two orthogonal
Fig. 1. Basis functions of DCT filter bank. The titles of the subplot indicate the sequence numbers of the basis functions.
Fig. 2. Superimposed frequency responses of twenty filters of DCT filter bank of Fig. 1. (Axes: magnitude versus normalized frequency, × π rad/sample.)
transforms, i.e. DCT matrices; and O_I and O_II are matrices of null elements.

The coefficients for OBT for two blocks of size q_a and q_b are defined as,

$$x_i^{obt}\Big|_{i=1}^{q_a-1} = \sqrt{\frac{2}{q_a}} \sum_{j=0}^{q_a-1} W(j) \cos\left(\frac{\pi i (2j+1)}{2q_a}\right),$$
$$x_i^{obt}\Big|_{i=q_a}^{q_a+q_b-1} = \sqrt{\frac{2}{q_b}} \sum_{j=0}^{q_b-1} W(p-q_b+j) \cos\left(\frac{\pi (i-q_a+1)(2j+1)}{2q_b}\right), \qquad (12)$$

where the amount of overlap between the two blocks is given by q_a + q_b − p. If equal extension of each block is considered, then each block is extended by (q_a + q_b − p)/2 samples.

The above scheme can also be extended to an arbitrary number of blocks of different sizes and overlaps. If we assume that the number of elements in each block is equal and is an even number, say 2r, and that the overlap with the adjacent blocks is 50%, i.e. r, then the total number of blocks can be expressed as (p/r − 1). Hence, the total number of coefficients is m = (p/r − 1)(2r − 1).

The OBT effectively captures local (in frequency domain) spectral information with localized transformation of MFLE. On the other hand, it generates less distorted coefficients for partially corrupted speech signals. However, OBT with multiple blocks and larger overlap has some
Fig. 3. Basis functions of block DCT with two non-overlapping blocks of equal size. The titles of the subplots indicate the sequence numbers of the basis functions.
Fig. 4. Superimposed frequency responses of twenty filters of DCT filter bank of Fig. 3. (Axes: magnitude versus normalized frequency, × π rad/sample.)
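The two-block kernels of Sections 3.1 and 3.2 can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it builds each block from the DCT matrix of Eq. (10) with the dc column dropped, places block A on the first q_a filters and block B on the last q_b filters, and so covers both NOBT (q_a + q_b = p) and OBT (q_a + q_b > p). The function names `dct_block`, `two_block_kernel`, `transform` and `gram` are hypothetical helpers introduced here.

```python
import math

def dct_block(q):
    """q x (q-1) DCT matrix in the form of Eq. (10); the dc column (i = 0) is dropped."""
    scale = math.sqrt(2.0 / q)
    return [[scale * math.cos(math.pi * i * (2 * j + 1) / (2 * q))
             for i in range(1, q)]
            for j in range(q)]

def two_block_kernel(p, qa, qb):
    """p x (qa + qb - 2) kernel: block A acts on the first qa filters,
    block B on the last qb filters (the blocks overlap when qa + qb > p)."""
    ua, ub = dct_block(qa), dct_block(qb)
    kernel = [[0.0] * (qa + qb - 2) for _ in range(p)]
    for j in range(qa):
        for i in range(qa - 1):
            kernel[j][i] = ua[j][i]
    for j in range(qb):
        for i in range(qb - 1):
            kernel[p - qb + j][qa - 1 + i] = ub[j][i]
    return kernel

def transform(mfle, kernel):
    """Multiply one row vector of p log energies by the kernel."""
    return [sum(mfle[j] * kernel[j][c] for j in range(len(mfle)))
            for c in range(len(kernel[0]))]

def gram(kernel):
    """Compute L^T L, used to check column orthonormality."""
    cols = len(kernel[0])
    return [[sum(row[a] * row[b] for row in kernel) for b in range(cols)]
            for a in range(cols)]

nobt = two_block_kernel(20, 10, 10)  # NOBT-10-10: 20 x 18 kernel
obt = two_block_kernel(20, 9, 13)    # OBT-9-13: 20 x 20 kernel, overlap of 2 filters
```

For the non-overlapping case, `gram(nobt)` is the identity up to rounding, which matches the orthogonality property stated for NOBT. Because the dc column is discarded in every block, a constant MFLE vector maps to an all-zero coefficient vector.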
MFLE. It is basically the difference between the log energies of the filters in the filter bank. Earlier, spectral difference along the frequency axis was proposed in (Nitta et al., 2000) for speech recognition. It has been observed that the relative subband energies contain significant information for speech and speaker recognition (Chetouani et al., 2009). We have defined a shifted basis function which captures this information effectively.

The spectral difference between subbands is nothing but the relative energy of the subbands, which can be calculated by differencing the log energies. The transformation output for the proposed method can be written as,

$$x_i^{sbt}\Big|_{i=1}^{p-2} = W(i) - W(i+2). \qquad (13)$$

As we are skipping one subband for computing the coefficients, the total number of outputs will be (p − 2), and the transformation kernel (L_sbt) can be expressed for p = 10 as follows:

$$L_{sbt} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}. \qquad (14)$$

The weighting is proposed in such a manner that the basis functions of the transformation matrix are orthogonal to each other. This transformation can be viewed as a filter bank where all the filters have equal magnitude response (as shown in Fig. 6) with a single large main lobe, unlike DCT, which has narrow side lobes.

From Eq. (14) it is also clear that this transformation is a special kind of multi-block overlapped transformation where each block is of size three and there is an overlap of two samples between consecutive blocks. The SBT computation scheme is very similar to the delta feature computation scheme in the spatial domain. Hence, it contains transitional information of different frequency bands. The BTs proposed in the previous subsections deal with large segmental information, but the SBT contains more detailed attributes which are somehow ignored by full-band and other block based transformations. Hence, SBT and block based information contain some amount of complementary information. Therefore, the advantages of both features can be used in a combined system (Chakroborty and Saha, 2010; Sahidullah et al., 2010) where both performances are fused together to get a better speaker recognition result.

Table 1
Database description for speaker recognition experiments. The database details (i.e. target model, test segment, and trial information) are shown for core-test section.

Specification              NIST SRE 2001 (Przybocki and Martin, 2002)    NIST SRE 2004 (Martin and Przybocki, 2006)
No. of speakers            174                                           310
Speech format              8 kHz, μ-law                                  8 kHz, μ-law
Speech quality             Cellular phone                                Various telephonic
Channel variability        No                                            Yes
Handset variability        Yes                                           Yes
Language variability       Yes                                           Yes
Number of target models    174 (Male: 74, female: 100)                   616 (Male: 246, female: 370)
Number of test segments    2038                                          1174
Total trials               22418                                         26224
Correct trial              2038                                          2386
Impostor trial             20380                                         23838
Fig. 6. Plot showing the frequency response of the SBT filter-bank. Each filter of the filter bank has equal frequency response (|H(ω)| = 2 sin ω). (Axes: magnitude |H(ω)| versus normalized frequency, × π rad/sample.)
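The SBT of Eq. (13) and the magnitude response quoted in Fig. 6 can be checked with a short sketch. This is an illustration under stated assumptions rather than the authors' code: the sign convention (+1 at row i, −1 at row i + 2) is taken from Eq. (13), each basis function is treated as the FIR filter with taps [1, 0, −1], and the names `sbt_kernel`, `sbt` and `magnitude_response` are hypothetical.

```python
import cmath
import math

def sbt_kernel(p):
    """p x (p-2) kernel in the form of Eq. (14): column i has +1 at row i and -1 at row i+2."""
    kernel = [[0] * (p - 2) for _ in range(p)]
    for i in range(p - 2):
        kernel[i][i] = 1
        kernel[i + 2][i] = -1
    return kernel

def sbt(mfle):
    """Eq. (13): difference between log energies two filters apart (no multiplications)."""
    return [mfle[i] - mfle[i + 2] for i in range(len(mfle) - 2)]

def magnitude_response(taps, omega):
    """|H(omega)| of an FIR filter described by its taps."""
    return abs(sum(t * cmath.exp(-1j * omega * n) for n, t in enumerate(taps)))

# Every SBT basis function is a shifted copy of [1, 0, -1],
# so all filters share the same magnitude response 2|sin(omega)|.
response_at_quarter_pi = magnitude_response([1, 0, -1], math.pi / 4)
```

Since every column is a shifted copy of the same three-tap pattern, one evaluation of `magnitude_response` describes the whole filter bank, which is consistent with the single curve shown in Fig. 6.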
Fig. 7. Figure showing (a) superimposed histogram of first three spectral peaks, (Data are taken from the male section of NIST SRE 2001.) and (b) the
Mel filter bank structure for 20 filters.
Fig. 8. Figure showing plot of logarithm of residual correlation of different transformations. The residual correlations are computed for different values of ρ where the correlation matrix of the given data follows ideal Markov-I property. (Curves: DCT, NOBT-10-10, NOBT-8-12, OBT-9-13, SBT and MFLE; axes: log(ε) versus correlation coefficient ρ.)
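The decorrelating effect of the DCT on an ideal Markov-I process can be illustrated numerically. The sketch below is only indicative: it builds the correlation matrix with entries ρ^|i−j|, applies an orthonormal full-band DCT, and measures the share of energy left off the diagonal. That measure (`off_diagonal_fraction`) is an assumption made for this example and is not necessarily the residual correlation ε plotted in Fig. 8; all function names are hypothetical.

```python
import math

def markov1_correlation(p, rho):
    """Correlation matrix of a first order Markov process: R[i][j] = rho^|i-j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]

def dct_matrix(p):
    """Orthonormal p x p DCT-II matrix; row i is the i-th basis vector (dc row included)."""
    rows = []
    for i in range(p):
        scale = math.sqrt((1.0 if i == 0 else 2.0) / p)
        rows.append([scale * math.cos(math.pi * i * (2 * j + 1) / (2 * p))
                     for j in range(p)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def off_diagonal_fraction(r):
    """Assumed measure: fraction of squared entries lying off the diagonal."""
    total = sum(v * v for row in r for v in row)
    diag = sum(r[i][i] ** 2 for i in range(len(r)))
    return (total - diag) / total

r = markov1_correlation(20, 0.9)
c = dct_matrix(20)
r_dct = matmul(matmul(c, r), transpose(c))  # correlation matrix after the DCT
before = off_diagonal_fraction(r)
after = off_diagonal_fraction(r_dct)
```

Comparing `before` and `after` shows how much of the correlation the transform removes for a chosen ρ; the same routine could be applied to a block kernel to compare transformations.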
Fig. 9. Figure showing residual correlation of different transformations for practical speech data. The plots are shown for 50 randomly chosen speakers of (a) YOHO (microphonic) and (b) POLYCOST (telephonic) databases. (Curves: DCT, NOBT-10-10, NOBT-8-12 and OBT-9-13; axes: ε versus speaker index.)
Fig. 10. Effect of additive white Gaussian noise (SNR: 15 dB) on (a) speech signal, (b) power spectrum, (c) Mel filter bank log energy. The effect is also shown for different cepstra based on (d) DCT, (e) NOBT-10-10, (f) NOBT-8-12, (g) OBT-9-13, and (h) SBT. (Blue line: clean speech, red line: noisy speech.)
that a few MFLEs remain almost unchanged even in the presence of noise at 15 dB SNR, while others are significantly changed. Therefore, if we apply a full-band transformation like DCT, a larger number of coefficients will be affected in the presence of both types of noise. The subfigures of Figs. 10 and 11 illustrate this by showing transformed coefficients based on DCT, NOBT-10-10, NOBT-8-12 and OBT-9-13.

The proposed BTs are much more efficient than the standard DCT based approach when the speech signal is corrupted by narrow-band noise. We have observed the effect of narrow-band noise on speech signals and their transformed coefficients. In our work, this type of noise is synthetically generated using two methods. The first method is to pass the white noise sample of the NOISEX-92 database through a band pass filter with sharp bandwidth. The filter is designed using a 6th order Butterworth approximation with lower and upper cut-off frequencies of 2000 Hz and 2300 Hz, respectively. The second method is to add four frequency components (i.e. sinusoidal tones) of 2000 Hz, 2100 Hz, 2200 Hz and 2300 Hz. The amplitudes of the sinusoids are chosen randomly. The two narrow-band signals generated here are called Type-I and Type-II narrow-band noise, respectively. As the Type-II narrow-band signal is generated by just adding sinusoids, we can call it more pure
Fig. 11. Effect of hfchannel noise (SNR: 15 dB) on (a) speech signal, (b) power spectrum, (c) Mel filter bank log energy. The effect is also shown for different cepstra based on (d) DCT, (e) NOBT-10-10, (f) NOBT-8-12, (g) OBT-9-13, and (h) SBT. (Blue line: clean speech, red line: noisy speech.)
narrow-band than Type-I. Hence, the effect of Type-I is more local than that of Type-II. The speech spectrogram affected by narrow-band noise is shown in Fig. 12. On the other hand, how the presence of a narrow-band signal affects the feature extraction scheme is shown in Fig. 13. It is noteworthy that the cepstral features extracted from the affected zone are severely affected. Conversely, features extracted from the other zone are almost unaltered. Therefore, we can get improved speaker recognition performance if we successfully select the unaffected features and ignore the distorted ones. This could be accomplished using missing feature theory (MFT) (Lippmann and Carlson, 1997). The experimental results using the MFT based scoring scheme are presented in Section 5.2.6.

4.4. Computation complexity

The proposed BTs have another major advantage over the existing full-band based transformation due to their low computational cost. In Fig. 14, the structure of various
Fig. 12. Spectrogram showing characteristics of (a) clean speech signal, (b) effect of white noise, (c) effect of Type-I narrow-band noise, (d) effect of Type-
II narrow-band noise. In all the cases SNR is set at 15 dB.
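The Type-II narrow-band noise described above (four tones at 2000, 2100, 2200 and 2300 Hz with random amplitudes, mixed at 15 dB SNR) can be sketched as follows. This is a hedged illustration: the amplitude range, the random seed, the 300 Hz tone used as a stand-in for speech, and the names `type2_noise`, `power` and `add_at_snr` are all assumptions made for the example. The Type-I noise is not reproduced here because it requires a Butterworth band-pass filter design.

```python
import math
import random

def type2_noise(n_samples, fs=8000, freqs=(2000, 2100, 2200, 2300), seed=0):
    """Sum of four sinusoidal tones with randomly chosen amplitudes.
    The amplitude range (0.1 to 1.0) is an assumption of this sketch."""
    rng = random.Random(seed)
    amps = [rng.uniform(0.1, 1.0) for _ in freqs]
    return [sum(a * math.sin(2 * math.pi * f * n / fs) for a, f in zip(amps, freqs))
            for n in range(n_samples)]

def power(x):
    """Mean squared value of a signal."""
    return sum(v * v for v in x) / len(x)

def add_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(speech, noise)]

# Stand-in for a speech segment (hypothetical): a 300 Hz tone, 0.2 s at 8 kHz.
speech = [math.sin(2 * math.pi * 300 * n / 8000) for n in range(1600)]
noisy = add_at_snr(speech, type2_noise(len(speech)), 15.0)
```

All four tones lie below the 4 kHz Nyquist frequency of 8 kHz speech, so the noise energy stays confined to a narrow band of the spectrum.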
Fig. 13. Effect of narrow-band noise (Type-I, SNR: 15 dB) on (a) speech signal, (b) power spectrum, (c) Mel filter bank log energy. The effect is also shown for different cepstra based on (d) DCT, (e) NOBT-10-10, (f) NOBT-8-12, (g) OBT-9-13, and (h) SBT. (Blue line: clean speech, red line: noisy speech.)
transformation matrices is shown along with the standard DCT based transformation. The BT matrices are sparse in nature; hence, the computational time is significantly less than that of the full matrix based transformation.

For example, if we use a filter bank of size 20, then the total number of multiplications required for DCT is 380, while for NOBT-10-10 it is 180 and for NOBT-8-12 it is 188. On the other hand, for OBT-9-13 and OBT-8-8-8, the required numbers of multiplications are 228 and 168, respectively. In all the previous cases, we have discarded the dc-coefficient. In the case of SBT, no multiplications are required; only subtractions are needed for computing cepstral coefficients.

5. Speaker recognition experiment

5.1. Experimental framework

5.1.1. Database for experiments

Experiments are carried out for the speaker verification (SV) task. In order to evaluate the performance of the various classes of BT based features, we have considered multiple large population speech corpora created by NIST. These are widely used in speaker recognition system evaluation. In our experiments, we have used NIST SRE 2001 and NIST SRE 2004. The database descriptions are shown in Table 1.
Fig. 14. Transformation kernel for different block transformations. The subfigures are shown for (a) DCT, (b) NOBT-10-10, (c) NOBT-8-12, (d) OBT-9-13, (e) OBT-8-8-8, (f) SBT. The dc-coefficients are also shown for first five types.
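The multiplication counts quoted in Section 4.4 follow from the sparsity of the kernels in Fig. 14. A short sketch of the arithmetic is given below, assuming that a block of size q with its dc-coefficient discarded contributes a dense q × (q − 1) sub-matrix and therefore q(q − 1) multiplications per frame. The helper name `block_multiplications` is hypothetical.

```python
def block_multiplications(block_sizes):
    """Multiplications per frame: each block of size q is a dense q x (q-1) sub-matrix."""
    return sum(q * (q - 1) for q in block_sizes)

# Block sizes for a 20-filter bank, named as in the text.
configs = {
    "DCT": [20],
    "NOBT-10-10": [10, 10],
    "NOBT-8-12": [8, 12],
    "OBT-9-13": [9, 13],
    "OBT-8-8-8": [8, 8, 8],
}
counts = {name: block_multiplications(sizes) for name, sizes in configs.items()}
```

Under this assumption the counts reproduce the figures given in the text, and the SBT needs none because its kernel contains only entries of 0, 1 and −1.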
Apart from these, the background data for gender dependent UBM training have been collected from the development section of NIST SRE 2001 (for the NIST SRE 2001 evaluation) and from NIST SRE 2003 (for the NIST SRE 2004 evaluation).

5.1.2. Preprocessing
In this work, the pre-processing stage is kept the same across the different feature extraction methods. It is performed using the following steps:

- The speech signal is first pre-emphasized with a pre-emphasis factor of 0.97.
- The pre-emphasized speech signal is segmented into frames (s) of 20 ms each, i.e. the total number of samples in each frame is N = 160 (sampling frequency Fs = 8 kHz). We keep 50% overlap with adjacent frames.
- In the last step of pre-processing, each frame is windowed using a Hamming window.

5.1.3. Feature extraction
Standard MFCC features are extracted using filters linearly spaced on the Mel scale (Kinnunen and Li, 2010). The features are further processed using RelAtive SpecTrAl (RASTA) filtering to remove the mismatch between training and testing conditions. Velocity (delta) coefficients are extracted over a window of size three and appended to the MFCC coefficients. Finally, voice activity detection (VAD) is performed to discard the non-speech frames, followed by utterance level cepstral mean and variance normalization (CMVN) as a part of channel compensation and session variability reduction.

5.1.4. Speaker verification using adapted GMM
An adapted Gaussian mixture model (GMM) based technique is used to create target speaker models. The idea of the GMM is to use a weighted summation of multivariate Gaussian functions to represent the probability density of feature vectors as a target speaker model, and it is given by

\[ p(x) = \sum_{i=1}^{C} p_i \, b_i(x), \tag{16} \]

where x is a d-dimensional feature vector, b_i(x), i = 1, 2, 3, ..., C are the component densities and p_i, i = 1, 2, 3, ..., C are the mixture weights or priors of the individual Gaussians.
A GMM is parameterized by the means, covariances and mixture weights of all component densities and is denoted by

\[ \lambda = \{p_i, \mu_i, \Sigma_i\}_{i=1}^{C}. \tag{17} \]

In a speaker recognition system, each target is represented by a GMM and is referred to by its model λ. The parameters of λ are optimized using the iterative expectation maximization (EM) algorithm (Dempster et al., 1977) in a maximum likelihood (ML) based approach. In these experiments, the GMMs are trained with a few iterations where the clusters are initialized by a binary splitting based vector quantization technique (Linde et al., 1980). State-of-the-art speaker recognition systems utilize adapted GMM training using the maximum-a-posteriori (MAP) approach. In this case, a large GMM of higher model order is trained as a background model, which is widely known as the UBM (Reynolds et al., 2000). Individual target models are created by adapting the parameters of the UBM, i.e. means, covariances, and priors, with the training data of the target speakers. It is observed that mean adaptation is sufficient to create speaker models (Kinnunen and Li, 2010).
In the verification stage, the score of the feature matrix of an unknown utterance X for a speaker i is determined by

\[ \theta(X, i) = \frac{1}{T}\left(\log p(X \mid \lambda_i) - \log p(X \mid \lambda_{ubm})\right), \tag{18} \]

where X consists of T speech frames.
The scores for target and impostor trials are used to evaluate the system performance. The details of speaker verification system implementation and score calculation techniques using adapted GMM are concisely available in (Benesty et al., 2007; Kinnunen and Li, 2010).

5.1.5. Performance evaluation
The performance of the SV systems is evaluated using the detection error trade-off (DET) plot, which is drawn with the help of the DETWARE² tool provided by NIST. We have computed two commonly used metrics from the DET curve. The first is the equal error rate (EER), the point on the DET curve having equal probability of false acceptance (FA) and false rejection (FR). The second is the minimum detection cost function (minDCF), a cost function based metric, which is computed with the same tool DETWARE by setting C_Miss = 10, C_FalseAlarm = 1 and P_Target = 0.01 according to the NIST evaluation plan (Przybocki and Martin, 2002; Martin and Przybocki, 2006).

² http://www.itl.nist.gov/iad/mig//tools/DETware_v2.1.targz.htm.

5.2. Results and discussion

Speaker verification experiments are carried out on the NIST SRE 2001 and 2004 corpora according to the guidelines in the evaluation plan, and the experiments have been conducted on the core-test section of the databases. The pre-processing stages (such as framing, windowing, etc.) and feature post-processing schemes (like RASTA, CMVN, etc.) have been fixed for the different features. We have set 20 filters in the Mel filter bank, and this is fixed throughout the experiments. The different subband regions which are covered by the twenty filters are shown in Table 2. As we take the delta features into account, the number of features for full-band MFCC is 38. On the other hand, the feature dimension for two block NOBT and SBT
Table 4
SV results on NIST SRE 2001 and NIST SRE 2004 using baseline MFCC (full-band DCT) and two block NOBT based MFCC features. The lengths of the two blocks are shown in the first column. The first value listed for each database is the residual correlation measure.

Block sizes   NIST SRE 2001                              NIST SRE 2004
              Res. corr.   EER (in %)   minDCF × 100     Res. corr.   EER (in %)   minDCF × 100
(5, 15)       0.0581       7.6546       3.4299           0.0600       14.6694      6.2178
(6, 14)       0.0552       7.8508       3.5026           0.0539       14.3696      6.1123
(7, 13)       0.0558       7.5074       3.4158           0.0549       14.1662      6.0229
(8, 12)       0.0574       7.7527       3.5063           0.0570       14.1662      5.9890
(9, 11)       0.0602       7.8999       3.5183           0.0602       14.1243      5.9883
(10, 10)      0.0605       7.7552       3.5925           0.0615       14.5436      6.0353
(11, 9)       0.0618       8.1011       3.7340           0.0626       14.5436      6.0919
(12, 8)       0.0629       7.998        3.6576           0.0632       14.9252      6.2932
(13, 7)       0.0651       8.1452       3.6385           0.0655       14.7113      6.2356
(14, 6)       0.0689       8.4396       3.7073           0.0686       15.0447      6.3397
(15, 5)       0.0710       8.5868       3.7574           0.0708       15.3004      6.3417
Full-band     0.0725       8.2434       3.5763           0.0769       14.9629      6.3231
feature will be 36. The modeling scheme has also been fixed throughout the different experiments. We have done all the SV experiments on a GMM-UBM system with a model order of 256. Gender independent UBMs are trained using two iterations of the EM algorithm. The target models have been adapted from the UBM using a relevance factor of r = 14 in all cases. The top-5 Gaussians of the UBM for each speech frame of the test utterance have been selected for final scoring.

5.2.1. Performance of various BT based features
The first SV experiment is performed using the two block NOBT feature. The results for baseline MFCC and NOBT based MFCC are shown in Table 4 for different block sizes. The NOBT based approaches perform better than baseline MFCC in most cases for both databases. The speech samples of NIST SRE 2001 are collected for the matched condition only, whereas in NIST SRE 2004 the speech samples are collected with a considerable amount of variation between the training and testing phases, with diverse channels and handsets. Table 4 also shows the residual correlation measure (the first value listed for each database). This metric is computed on the UBM and training set of the databases. The result shows that the block based approach has higher decorrelation power than the full-band based approach. We have also found that speaker recognition performance improves if the size of the first block is smaller. The reason is that a smaller first block approximately represents one formant frequency zone, F1, while the other block contains the other two formant frequencies, F2 and F3. Hence, formant frequencies or spectral peaks are processed independently. We know that F1, F2, and F3 contain prominent speaker specific attributes (Quatieri, 2006). We have also observed that the spectral peaks (including formant frequencies) of speech frames are concentrated in specific frequency zones. Independent processing of the formant regions also captures the details of peaks and valleys effectively. This is most likely one of the major reasons for the improvement obtained with block based MFCC computation. However, abrupt partitioning of the filter bank energies may cause trouble by ignoring full-band information. Our proposed OBT based feature overcomes this difficulty.
Table 5
SV results on NIST SRE 2001 and NIST SRE 2004 using the two block OBT based feature. The specifications of the two blocks (A and B) are shown in the first column.

Block specification   Feature dimension   NIST SRE 2001                NIST SRE 2004
                                          EER (in %)   minDCF × 100    EER (in %)   minDCF × 100
A:1-9, B:8-20         40                  7.2669       3.4394          13.9146      5.9420
A:1-10, B:7-20        44                  7.7036       3.4319          14.0006      5.9924
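As an illustration of how a block specification such as A:1-9, B:8-20 maps to a feature vector, the sketch below applies a DCT separately to each (possibly overlapping) block of one frame of Mel filter bank log energies and drops the dc-coefficient of every block. It assumes an orthonormal DCT-II kernel, which may differ in scaling from the kernel used in the paper, and the input frame is random stand-in data. The names `dct_matrix` and `block_transform` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II kernel of size n x n (row k is the k-th basis vector)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    kernel = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    kernel[0, :] /= np.sqrt(2.0)
    return kernel


def block_transform(mfle, blocks):
    """Block-wise DCT of one MFLE frame.

    blocks: list of (first, last) filter indices, 1-based and inclusive,
    matching the notation of the block specification column.
    """
    feats = []
    for first, last in blocks:
        segment = mfle[first - 1:last]
        # Row 0 is the dc basis vector; it is discarded for every block.
        feats.append(dct_matrix(len(segment))[1:] @ segment)
    return np.concatenate(feats)


rng = np.random.default_rng(0)
mfle = rng.normal(size=20)            # stand-in for 20 filter bank log energies
blocks = [(1, 9), (8, 20)]            # OBT-9-13, two filters shared by both blocks
static = block_transform(mfle, blocks)
print(static.shape)                   # 8 + 12 static coefficients
```

With 20 static coefficients per frame, appending the delta coefficients gives the 40-dimensional feature listed for A:1-9, B:8-20. The same rule gives (9 + 13) × 2 = 44 for A:1-10, B:7-20.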
Table 6
SV results on NIST SRE 2001 and NIST SRE 2004 using the three block based feature. The specifications of the three blocks (A, B and C) are shown in the first column.

Block specification       Feature dimension   NIST SRE 2001                NIST SRE 2004
                                              EER (in %)   minDCF × 100    EER (in %)   minDCF × 100
A:1-8, B:9-15, C:16-20    34                  8.3906       3.5466          14.5918      6.1365
A:1-8, B:7-14, C:13-20    42                  7.6055       3.3892          14.3339      6.0526
A:1-8, B:9-17, C:15-20    40                  7.7429       3.3986          13.9146      5.8950
We have chosen the block size (8, 12) from the two block NOBT and extended this system to an overlapped version. The speaker recognition performance is shown in Table 5. We observed that the performance of speaker recognition is considerably better when we keep a small overlap between the adjacent blocks. However, a larger overlap retains redundant information and the performance is degraded.
Experiments have also been carried out using a three block based system. We have considered three experiments on this tri-block feature transformation scheme. The results are shown in Table 6. The first set of results is for NOBT where the blocks are of size 8, 7, and 5. Here the performance is degraded compared to the two block based NOBT technique. The reason is that we have not considered the formant regions properly. Note that there exists a significant amount of overlap between the second and third spectral peaks, as in Fig. 7. The second result is an OBT extension of the previous result where we keep an overlap of two samples between consecutive blocks. Hence, the formant regions are mapped to each block in a better manner compared to the earlier case. We have observed that the performance is significantly better than that of the non-overlapping three block based method. However, in the three block based approach the overall performance on the two databases is degraded considerably compared to the two block OBT based approach. The most probable reason is that, due to the presence of noise in the speech signals of the SRE 2004 database, the spectra for the F2 and F3 zones are not accurately estimated. In our next experiment, we considered the third formant (F3) separately, which is centered around 2.5 kHz within the frequency zone 1690–3300 Hz, and we have precisely chosen the block sizes for this experiment. Among the three blocks, the first block covers the frequency zone related to F1, i.e. 0–883.1663 Hz, the second block covers the zone (F2) 745.9244–2281 Hz and the third block takes care of F3, i.e. 1791–4000 Hz. In this case, the result for NIST SRE 2004 is moderately better than in the other cases. When a speech signal is distorted by various noises, the critical bands in the higher frequency zone are usually severely affected. For this reason, the performance is better for both databases when the F2 and F3 zones are combined into a single block. Hence, we conclude that the two block based OBT approach is better suited to a database with much variability.
We evaluated the SBT based system on both the NIST SRE 2001 and 2004 databases. The DET plots of SBT along with all the other features are shown in Fig. 15. The performance of the system based on SBT is shown in Table 8. For the NIST SRE 2001 database, where there is no mismatch between training and testing conditions, the performance is significantly improved over baseline MFCC and NOBT based transforms. We obtained an EER of 7.4583% in this case. On the other hand, the performance is degraded for NIST SRE 2004. In that case, we obtained an EER of 15.5499% using SBT compared to 14.9629% for the MFCC-GMM baseline system. Most of the speech samples of NIST SRE 2004 are corrupted due to the variability in telephone channel, handset, etc. As a result, most of the critical band information is distorted. The SBT feature keeps local information of a frequency band by directly considering closer subband log energies, but it loses the full-band information. The full-band (or large band) information plays a significant role
[Figure: DET curves; horizontal axis: False Alarm probability (in %), vertical axis: Miss probability (in %).]
Fig. 15. DET plots for various block based techniques. (MFCC-38: full-band DCT based MFCC; NOBT-10-10: double block NOBT where both blocks are of size 10; NOBT-8-12: double block NOBT where the two blocks are of sizes 8 and 12; OBT-9-13: double block OBT where the two blocks are of sizes 9 and 13; SBT-36: SBT based feature.)
Table 7
SV results on NIST SRE 2001 and NIST SRE 2004 using the PCA based feature. In PCA-40 the first coefficient is considered. In PCA-38 the first coefficient is discarded as it is equivalent to the dc-coefficient.

Feature type   PCA data    NIST SRE 2001                NIST SRE 2004
                           EER (in %)   minDCF × 100    EER (in %)   minDCF × 100
PCA-40         UBM         9.4210       3.8209          18.3152      7.3806
               UBM+Train   9.0775       4.0105          18.5249      7.4448
PCA-38         UBM         8.6899       3.7433          17.0636      6.9800
               UBM+Train   8.55         3.7332          17.6863      7.2229
for recognition of speech signals degraded by the presence of wide band noise. We have utilized the advantages of both the large band and the localized information by combining their strengths through score level output fusion, which is discussed in Section 5.2.5.
The above study and experimental results for the different block transformation based approaches suggest that the two block OBT based system is a superior selection for single stream based SV experiments on the NIST SRE databases. In this case, the first block approximately represents the spectral area which covers F1 as well as the first spectral peak, and the other block represents the shared frequency region covered jointly by F2 and F3.

Table 8
SV results on NIST SRE 2001 and NIST SRE 2004 using the SBT based feature.

Database        EER (in %)   minDCF × 100
NIST SRE 2001   7.4583       3.4591
NIST SRE 2004   15.5499      6.9700

5.2.2. Comparison with PCA based approach
PCA is an operation similar to KLT where the data-driven projection matrix is derived from the correlation matrix of the feature vector. PCA completely decorrelates a feature matrix, i.e. the residual correlation measure becomes exactly zero. Theoretically, PCA is the optimal decorrelation process. In this work, one of our claims is that our proposed transformation decorrelates MFLE more efficiently than DCT, which helps to improve the performance. In an experiment, we have computed the PCA projection matrix from the MFLE of the UBM data and used this for computing features. The result is shown in Table 7. It shows the results for PCA-40 and PCA-38. PCA-40 feature is extracted using standard PCA based where
[Figure: four panels, a-(i), a-(ii), b-(i) and b-(ii); horizontal axis: Fusion Weight (η) from 0 to 1, vertical axes: EER and minDCF.]
Fig. 16. Effects of fusion weight (η) on EER and minDCF are shown for (a) NIST SRE 2001 and (b) NIST SRE 2004. The variations of EER with respect to η are shown in a-(i) and b-(i); the changes of minDCF are shown in a-(ii) and b-(ii).
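The two metrics plotted here can be computed from lists of target and impostor scores roughly as follows. This is a sketch rather than a replacement for the NIST DETWARE tool: it sweeps every observed score as a threshold, takes the EER at the threshold where the two error rates are closest, and evaluates the unnormalized detection cost with the cost parameters stated in the text (C_Miss = 10, C_FalseAlarm = 1, P_Target = 0.01). Function names and the toy scores are hypothetical.

```python
import numpy as np


def error_rates(target_scores, impostor_scores):
    """Miss and false-alarm rates at each threshold (accept if score >= threshold)."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    thresholds = np.append(thresholds, np.inf)   # "reject everything" operating point
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa


def eer(p_miss, p_fa):
    """Equal error rate, approximated at the closest crossing of the two curves."""
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])


def min_dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of the (unnormalized) detection cost over all thresholds."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return cost.min()


# Toy scores: a perfectly separable system and a completely uninformative one.
pm_good, pf_good = error_rates([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
pm_bad, pf_bad = error_rates([0.0, 1.0], [0.0, 1.0])
print(eer(pm_good, pf_good), min_dcf(pm_good, pf_good))
print(eer(pm_bad, pf_bad), min_dcf(pm_bad, pf_bad))
```

With these cost parameters a system that rejects every trial already achieves a cost of 0.1, so any useful minDCF lies below that value, which is consistent with the range of the minDCF axes in the figure.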
[Figure: DET curves in two panels; horizontal axis: False Alarm probability (in %), vertical axis: Miss probability (in %).]
Fig. 17. DET plots for the different fused systems. The fusion weights (η) are 0.5 (for NIST SRE 2001) and 0.8 (for NIST SRE 2004).
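The fused systems combine the log-likelihood ratios of two sub-systems by linear score fusion, LLR_fused = η LLR_x + (1 − η) LLR_y with 0 < η < 1. A minimal sketch is given below, assuming that the two score lists are aligned trial by trial and that η weights the large band system while (1 − η) weights the SBT system. The function name `fuse_scores` and the toy scores are hypothetical.

```python
import numpy as np


def fuse_scores(llr_x, llr_y, eta):
    """Linear score-level fusion of two systems' log-likelihood ratios."""
    if not 0.0 < eta < 1.0:
        raise ValueError("fusion weight eta must lie strictly between 0 and 1")
    llr_x = np.asarray(llr_x, dtype=float)
    llr_y = np.asarray(llr_y, dtype=float)
    if llr_x.shape != llr_y.shape:
        raise ValueError("both systems must score the same trials")
    return eta * llr_x + (1.0 - eta) * llr_y


# Toy scores for three trials from a large band system and an SBT system.
large_band = [1.0, -0.5, 0.2]
sbt = [3.0, 0.5, -0.2]
fused_2001 = fuse_scores(large_band, sbt, eta=0.5)   # weight reported for NIST SRE 2001
fused_2004 = fuse_scores(large_band, sbt, eta=0.8)   # weight reported for NIST SRE 2004
print(fused_2001, fused_2004)
```

Because fusion operates on one scalar per trial, it adds almost no cost on top of running the two sub-systems.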
Fig. 18. Block diagram of the proposed missing feature based testing scheme. Good and Bad denote the log likelihood ratios of the reliable and unreliable features, respectively.
feature vector computed using two overlapped blocks of size 9 and 13 are reasonably better than other multi-block cases. We have also compared the performance of this 40 dimensional proposed feature with the 40 dimensional standard MFCC feature extracted using 21 filters. We found an EER of 8.1992% and a minDCF of 0.037080 for NIST SRE 2001. On the other hand, for NIST SRE 2004, we obtained an EER of 15.0887% and a minDCF of 0.063171. The performance of our proposed feature remains consistently better for both databases.
system combination scheme, which is employed here, is computationally less expensive compared to feature level fusion for a GMM based speaker recognition system. We adopted the simplest fusion strategy: linear fusion. The scores, i.e. the log-likelihood ratios of the two sub-systems, are weighted and summed to derive the final score. If the log-likelihood scores of two systems X and Y are LLR_x and LLR_y, then the final score is LLR_fused = η LLR_x + (1 − η) LLR_y, where η is the fusion weight and 0 < η < 1. The speaker recognition performance of the fused system is shown in Fig. 16 for different values of η. Each large band information based system is fused with the SBT based system with a fixed fusion weight η. As the speech quality of NIST SRE 2001 is considerably good, we have chosen a fusion weight of 0.5, i.e. equal importance is given to both systems. On the other hand, we have empirically chosen a fusion weight of 0.8 for the large band information in the evaluation on the NIST SRE 2004 database. This is because the SBT information is less reliable for distorted speech signals. The DET plots for the combined systems are shown in Fig. 17 for both databases. The plots show that the fused system based on the combination of OBT-9-13 and SBT is better than the other systems in terms of EER. In Table 9, the speaker recognition results are shown for the fusion of SBT with various other features. The combined performance is better than that of the single stream based system in all cases. In particular, for block based feature extraction, the performance improvement is relatively significant. The improvement is observed for both EER and minDCF. We obtained an EER of 6.8204% for the NIST SRE 2001 database when the OBT based system with two sample overlap (i.e. the OBT-9-13 based system) is fused with the SBT based system. At the same time, we obtained a relative minDCF improvement of 14.81% over baseline MFCC. Though for this particular database the EER of the fused scheme based on the OBT-9-13 feature is not much lower than that of the other block based techniques, the minDCF is significantly improved over the other cases. Note that minDCF plays a significant role in computing the verification threshold for a real time speaker recognition system. For NIST SRE 2004 also, we have obtained the best performance for the case where OBT-9-13 is combined with SBT. There we achieved an EER of 13.8266% and a minDCF × 100 of 5.9546. In that case, the relative improvement over baseline MFCC is 7.59% in EER and 5.83% in minDCF.

5.2.6. Performance in presence of noise
The performance of the speaker recognition systems based on the proposed BTs is evaluated on noisy speech data. We
Table 12
SV results on NIST SRE 2001 in the presence of various narrow-band noises. The results are shown for different SNRs (20 dB, 10 dB, and 0 dB) for the fused system. The fusion weight is set at 0.6 for 20 dB SNR, 0.7 for 10 dB SNR, and 0.9 for 0 dB SNR.

SNR     Feature          Metric         Single stream         Fused with SBT
                                        Type-I    Type-II     Type-I    Type-II
20 dB   MFCC-38          EER (in %)     16.2929   15.265      15.4073   15.3091
                         minDCF × 100   6.4260    6.4967      6.2508    6.6075
        NOBT-10-10-36    EER (in %)     14.8184   14.8184     14.3768   15.3091
                         minDCF × 100   5.8941    6.3276      5.9183    6.4382
        NOBT-8-12-36     EER (in %)     16.7812   14.7203     15.7998   15.3680
                         minDCF × 100   6.4745    6.2480      6.2304    6.4500
        OBT-9-13-40      EER (in %)     15.2601   14.6222     15.1178   15.5986
                         minDCF × 100   6.0955    6.3139      6.0878    6.5292
        SBT-36           EER (in %)     17.4190   19.5780     -         -
                         minDCF × 100   7.0402    8.0275      -         -
10 dB   MFCC-38          EER (in %)     22.8656   23.1600     22.4730   24.3376
                         minDCF × 100   8.4021    8.8851      8.3266    8.9446
        NOBT-10-10-36    EER (in %)     20.4612   23.1992     20.3140   23.7488
                         minDCF × 100   7.7507    8.6676      7.7759    8.8152
        NOBT-8-12-36     EER (in %)     23.9450   23.5034     22.8165   23.9450
                         minDCF × 100   8.6625    8.9654      8.4552    8.9912
        OBT-9-13-40      EER (in %)     22.1320   24.7301     21.9431   25.0245
                         minDCF × 100   8.1286    8.8062      8.1646    8.9903
        SBT-36           EER (in %)     24.5437   30.0834     -         -
                         minDCF × 100   8.8821    9.9266      -         -
0 dB    MFCC-38          EER (in %)     26.3984   28.4985     26.1482   28.8027
                         minDCF × 100   9.3465    9.8261      9.3051    9.8210
        NOBT-10-10-36    EER (in %)     24.7301   30.8072     24.3817   31.0059
                         minDCF × 100   8.9592    9.6588      8.9353    9.6794
        NOBT-8-12-36     EER (in %)     27.1860   30.7139     26.8916   30.2380
                         minDCF × 100   9.5948    9.8476      9.4942    9.8632
        OBT-9-13-40      EER (in %)     25.9593   32.7355     25.8096   32.5736
                         minDCF × 100   9.0763    9.9077      9.0719    9.9595
        SBT-36           EER (in %)     28.8494   37.5442     -         -
                         minDCF × 100   9.5892    9.9951      -         -
have observed the effect for both standard noise and synthetic narrow-band noise. The standard noise database NOISEX-92 is used for the evaluation. The noise samples are first downsampled to 8 kHz, then adaptively scaled to the desired SNR value and added to the required speech signal. We have chosen only the NIST SRE 2001 corpus for this particular experiment. NIST SRE 2004 is not selected because it is already distorted by various channel noises and handset effects. The experiments have been conducted using four types of standard additive noise (white, hfchannel, babble and pink) at different SNR levels (20 dB, 10 dB and 0 dB). The results for the single stream based systems are shown in Table 10. We can observe that the proposed BT based systems perform better than the DCT based system in most cases. The speaker recognition performance can be further improved with the fused system. The scores of the full-band based feature as well as of the proposed large block based BT techniques are fused with the SBT feature based scores. As the performance of SBT becomes very poor as the noise increases, we have adjusted the fusion weight empirically for the different cases. We have kept the fusion weight at 0.6 for 20 dB noise, 0.7 for 10 dB noise and 0.9 for 0 dB noise. The results for the dual stream based systems are shown in Table 11. Here we again observe that the performance of the fused system is significantly better than that of the single stream based system for almost all kinds of noise at the various SNRs.
The block based features are affected to different degrees in the presence of narrow-band noise. In order to observe this effect, we conducted experiments on narrow-band noise. Speaker recognition results using two types of narrow-band signal are shown in Table 12. In the left half of the table, the results are shown for the single stream based systems, and the performance of the fused systems is shown in the right half of the table. Though the block based features perform better in some cases, they are not consistently better. This is partially due to the energy based VAD used in our experiments, which wrongly treats higher energy non-speech frames as speech frames. In order to show the effect, we have computed the block based features for a complete utterance and taken the average over all the frames. To study the effect of voice activity detection, we have done this experiment twice. In the first case, all the frames of noisy speech which are declared as speech by the energy based VAD are considered, and the result is shown in Fig. 19. We can see that features which are not extracted from the noisy zone are also affected. The reason is that originally non-speech frames, which are detected as speech in the presence of noise, create this error. We have confirmed this by repeating the experiment using the VAD labels extracted from the clean speech signal and applying them to the noisy data. The result is shown in Fig. 20, and it is clear that the narrow-band noise has only a local effect.
However, the advantages of the block based feature can be retrieved using the MFT technique. In this method, the contributions from different features are treated separately. In multidimensional statistical methods, the scores of unreliable features are either masked or given lesser importance by weighting them with a smaller value than that of the reliable features. As in the block based feature extraction scheme, different blocks are unequally affected due to narrow-band noise,
[Figure: six panels, (a) to (f), each plotting clean and noisy averages against coefficient index.]
Fig. 19. The effect of narrow-band noise (SNR: 15 dB) on (a) Mel filter bank log energy and on different features based on (b) DCT, (c) NOBT-10-10, (d) NOBT-8-12, (e) OBT-9-13, and (f) SBT, shown for a complete speech utterance (blue line: clean speech, red line: noisy speech). The averages of the parameters (e.g. MFLE, features) are computed over all the voiced frames of the noisy utterance.
[Figure: six panels, (a) to (f), each plotting clean and noisy averages against coefficient index.]
Fig. 20. The effect of narrow-band noise (SNR: 15 dB) on (a) Mel filter bank log energy and on different features based on (b) DCT, (c) NOBT-10-10, (d) NOBT-8-12, (e) OBT-9-13, and (f) SBT, shown for a complete speech utterance (blue line: clean speech, red line: noisy speech). The averages of the parameters (e.g. MFLE, features) are computed over all the voiced frames of the corresponding clean speech utterance.
we have treated them separately. We have chosen the OBT-9-13 feature and applied missing feature theory in fused mode. Let us assume that we have prior information concerning the nature of the narrow-band noise. Both kinds of MFT technique are applied. In masking, as the second block is much affected by noise, we only compute the score for the first block, i.e. the first 8 dimensions and the corresponding deltas are selected for this block. Hence, the feature dimension becomes 16. On the other hand, we have chosen first 11 and 18th SBT based feature and corresponding deltas, i.e. 22 dimensions from the 36 dimensional SBT feature. The previous fusion weights, i.e. 0.6, 0.7 and 0.9, are chosen for 20 dB, 10 dB and 0 dB. The results for the masking based MFT technique are shown in Table 13.
Incidentally, the unreliable features are not completely ineffective for speaker recognition. They make a smaller but not negligible contribution depending on the SNR of the signal. Therefore, the result can be further improved if we consider the contribution from the distorted blocks. This scheme is shown in Fig. 18. Here, the contributions from both good and bad features are considered. As the bad features become more distorted with increasing noise, we give a higher weight to the good feature scores than to the bad feature scores. The contribution of the good features is linearly increased with the increase in noise power. We have chosen η_obt^g, η_sbt^g, and η_obt to be identical for a particular SNR. We have set them at 0.6, 0.7, and 0.9 for 20 dB, 10 dB, and 0 dB respectively, i.e. the same weighting as in the fusion scheme for noisy data. It has been empirically found that these are reasonably better than any other arbitrary weighting. The results for the weighting based MFT scheme are shown in the right half of Table 13. We have obtained a significant performance improvement at low SNR for both types of narrow-band noise. This scheme could be further improved with automatic detection of narrow-band noise and effective frame based weight selection.
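One plausible reading of the masking and weighting schemes is sketched below. The exact combination rule of Fig. 18 is not spelled out in the text, so the nesting used here is an assumption: within each stream the good and bad log-likelihood ratios are mixed with the good-feature weight, and the two streams are then fused with the same weight, since all three weights are reported to be identical for a given SNR. The names `mft_masked_score`, `mft_weighted_score` and `SNR_TO_WEIGHT` are hypothetical.

```python
# Weights reported in the text for each SNR (in dB).
SNR_TO_WEIGHT = {20: 0.6, 10: 0.7, 0: 0.9}


def mft_masked_score(good_obt, good_sbt, eta):
    """Masking: unreliable features are ignored, only the good scores are fused."""
    return eta * good_obt + (1.0 - eta) * good_sbt


def mft_weighted_score(good_obt, bad_obt, good_sbt, bad_sbt, eta):
    """Weighting: bad scores still contribute, with the smaller weight (1 - eta).

    Assumed structure: mix good and bad scores inside each stream, then fuse
    the OBT and SBT streams, using the same weight eta at every stage.
    """
    llr_obt = eta * good_obt + (1.0 - eta) * bad_obt
    llr_sbt = eta * good_sbt + (1.0 - eta) * bad_sbt
    return eta * llr_obt + (1.0 - eta) * llr_sbt


# Toy log-likelihood ratios for one trial at 0 dB SNR.
eta = SNR_TO_WEIGHT[0]
masked = mft_masked_score(good_obt=1.2, good_sbt=0.8, eta=eta)
weighted = mft_weighted_score(good_obt=1.2, bad_obt=-0.4,
                              good_sbt=0.8, bad_sbt=-1.0, eta=eta)
print(masked, weighted)
```

As the weight approaches 1 at low SNR, the weighted score relies almost entirely on the reliable OBT features, while the bad scores act only as a small correction.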
Table 13
SV results in the presence of narrow-band noise on NIST SRE 2001 using missing feature theory. The results of the overlapped block transform based fused scheme are shown.

SNR     Metric         Masking               Weighting
                       Type-I    Type-II     Type-I    Type-II
20 dB   EER (in %)     15.1717   17.3700     14.6099   16.0942
        minDCF × 100   6.7106    7.2827      6.1362    6.5296
10 dB   EER (in %)     19.5780   22.1295     19.1855   22.1835
        minDCF × 100   8.1150    8.9990      7.8261    8.8185
0 dB    EER (in %)     25.1227   27.1467     20.9151   22.3258
        minDCF × 100   9.4888    9.7222      8.6014    9.0407

6. Conclusions

Block based MFCC computation schemes are efficient and robust in the speaker recognition context. In this paper, we have investigated an improved block based approach for speaker recognition. First, the feature extraction schemes using non-overlapped and overlapped block transformation are analytically formulated. The proposed block transform decorrelates the filter bank log energies fairly well as an alternative to the standard DCT. The experimental evaluation is performed on standard databases, and it shows that formant specific block transformations perform better.
We also propose a novel block based orthogonal transfor- Damper, R.I., Higgins, J.E., 2003. Improving speaker identification in
mation technique which captures transitional information noise by subband processing and decision fusion. Pattern Recognition
Lett. 24 (13), 2167–2173.
of log-energies of filter bank in frequency axis. The infor- Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood
mation covered in the later approach contains complemen- from incomplete data via the em algorithm. J. Roy. Statist. Soc. Ser. B
tary attribute to former block transforms. The (Methodol.) 39, 1–38.
performance of speaker recognition system is further Douglas, O., 2009. Speech Communications: Human and Machine,
enhanced by combining the strength of both kinds of fea- second ed. Universities Press.
Finan, R., Damper, R., Sapeluk, A., 2001. Improved data modeling for
tures using score level linear fusion. We have obtained sub- text-dependent speaker recognition using sub-band processing. Inter-
stantial performance improvement for both standard nat. J. Speech Technol. 4 (1), 45–62.
performance evaluations metric, i.e. EER and minDCF. Garreton, C., Yoma, N., Torres, M., 2010. Channel robust feature
The proposed system is very much suitable for speaker rec- transformation based on filter-bank energy filtering. IEEE Trans.
ognition in noisy condition specifically for narrow-band Audio Speech Lang. Process. 18 (5), 1082–1086.
Hung, W.-W., Wang, H.-C., 2001. On the use of weighted filter bank
noise. In our current work, we have mostly focused on analysis for the derivation of robust mfccs. IEEE Signal Process. Lett.
the effectiveness of block based linear transformation for 8 (3), 70–73.
improving the performance. The performance could be fur- Jain, A.K., 2010. Fundamentals Of Digital Image Processing, first ed. PHI
ther enhanced by effective processing of the signal in sub- Learning Pvt. Ltd.
band level, i.e. subband filtering, non-linear operations, Jingdong, C., Paliwal, K., Nakamura, S., 2000. A block cosine transform
and its application in speech recognition. In: Proc. Internat. Conf. on
etc. As a final point in current work we have used very Spoken Language Processing (INTERSPEECH 2000 – ICSLP), Vol.
basic fusion scheme which could be replaced with advanced IV, pp. 117–120.
logistic regression based fusion technique for better system Jingdong, C., Yiteng, H., Qi, L., Paliwal, K., 2004. Recognition of noisy
design. An investigation can be carried out in frame level speech using dynamic spectral subband centroids. IEEE Signal
score combination for more effective speaker recognition Processing Lett. 11 (2), 258–261.
Jung, S.-H., Mitra, S., Mukherjee, D., 1996. Subband dct: Definition,
system development. We can summarize that our method- analysis, and applications. IEEE Trans. Circ. Syst. Video Technol. 6
ical study on block transform and its evaluation in NIST (3), 273–286.
SRE databases could be a new groundwork for feature Kajarekar, S., Yegnanarayana, B., Hermansky, H., 2001. A study of two
level development of modern speaker recognition system. dimensional linear discriminants for asr. In: Proc. IEEE Internat.
Acknowledgement
The authors would like to thank the anonymous reviewers
for their useful comments, which helped in revising the paper.
Kinnunen, T., Li, H., 2010. An overview of text-independent speaker
References recognition: From features to supervectors. Speech Comm. 52 (1), 12–
40.
Ahmed, N., Natarajan, T., Rao, K., 1974. Discrete cosine transfom. IEEE Kwon, O., Lee, T., 2004. Phoneme recognition using ica-based feature
Trans. Comput. C-23 (1), 90–93. extraction and transformation. Signal Process. 84, 1005–1019.
Akansu, A.N., Haddad, R.A., 1992. Multiresolution signal decomposi- Linde, Y., Buzo, A., Gray, R., 1980. An algorithm for vector quantization
tion: Transforms, subbands, and wavelets. Academic Press. design. IEEE Trans. Comm. COM-28 (4), 84–95.
Benesty, J., Sondhi, M., Huang, Y., 2007. Springer Handbook of Speech Lippmann, R., Carlson, B., 1997. Using missing feature theory to actively
Processing, first ed. Springer-Verlag, Secaucus, NJ. select features for robust speech recognition with interruptions,
Besacier, L., Bonastre, J.-F., 2000. Subband architecture for automatic filtering and noise. EUROSPEECH, KN37–KN40.
speaker recognition. Signal Process. 80 (7), 1245–1259. Mak, B., 2002. A mathematical relationship between full-band and
Besacier, L., Jean-Frantois, B., 1997. Subband approach for automatic multiband mel-frequency cepstral coefficients. IEEE Signal Process.
speaker recognition: Optimal division of the frequency domain. Lett. 9 (8), 241–244.
Lecture Notes Comput. Sci. 1206, 193–202. Malvar, H., Staelin, D., 1989. The lot: Transform coding without blocking
Campbell, J., Shen, W., Campbell, W., Schwartz, R., Bonastre, J.-F., effects. IEEE Trans. Acous. Speech Signal Process. 37 (4), 553–559.
Matrouf, D., 2009. Forensic speaker recognition. IEEE Signal Process. Martin, A., Przybocki, M., 2006. 2004 nist speaker recognition evaluation.
Mag. 26 (2), 95–103. Linguistic Data Consortium.
Campbell, J.P., Jr, 1997. Speaker recognition: A tutorial. Proc. IEEE 85 Ming, J., Hazen, T., Glass, J., Reynolds, D., 2007. Robust speaker
(9), 1437–1462. recognition in noisy conditions. IEEE Trans. Audio Speech Lang.
Chakroborty, S., 2008. Some studies on acoustic feature extraction, Process. 15 (5), 1711–1723.
feature selection and multi-level fusion strategies for robust text- Mukherjee, J., Mitra, S., 2002. Image resizing in the compressed domain
independent speaker identification. Ph.D. thesis, Indian Institute of using subband dct. IEEE Trans. Circ. Syst. Video Technol. 12 (7), 620–
Technology Kharagpur. 627.
Chakroborty, S., Saha, G., 2010. Feature selection using singular value Nasersharif, B., Akbari, A., 2007. Snr-dependent compression of
decomposition and qr factorization with column pivoting for text- enhanced mel sub-band energies for compensation of noise effects on
independent speaker identification. Speech Comm. 52 (9), 693–709. mfcc features. Pattern Recognition Lett. 28, 1320–1326.
Chetouani, M., Faundez-Zanuy, M., Gas, B., Zarader, J., 2009. Investi- Nitta, T., Takigawa, M., Fukuda, T., 2000. A novel feature extraction
gation on lp-residual representations for speaker identification. Pattern using multiple acoustic feature planes for hmm-based speech recogni-
Recognition 42 (3), 487–494. tion. ICSLP 1, 385–388.
Md. Sahidullah, G. Saha / Speech Communication 54 (2012) 543–565 565
Oppenheim, A.V., Schafer, R.W., 1979. Digital Signal Processing. Sahidullah, M., Saha, G., 2009. On the use of distributed dct in speaker
Prentice-Hall, Inc. identification. In: India Conference (INDICON), 2009 Annual IEEE,
Przybocki, M., Martin, A., 2002. 2001 nist speaker recognition evaluation pp. 1 –4.
corpus. Linguistic Data Consortium. Sivakumaran, P., Ariyaeeinia, A.M., Loomes, M., 2003. Sub-band based
Quatieri, T., 2006. Discrete-time Speech Signal Processing. Prentice-Hall, text-dependent speaker verification. Speech Comm. 41 (2–3), 485–509.
Upper Saddle River, NJ. Takiguchi, T., Ariki, Y., 2007. Pca-based speech enhancement for
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification distorted speech recognition. J. Multimedia 2 (5), 13–18.
using adapted gaussian mixture models. Digital Signal Process. 10 (1– Vale, E., Alcaim, A., 2008. Adaptive weighting of subband-classifier
3), 19–41. responses for robust text-independent speaker recognition. Electron.
Sahidullah, M., Chakroborty, S., Saha, G., 2010. On the use of perceptual Lett. 44 (21), 1280–1282.
line spectral pairs frequencies and higher order residual moments for
speaker identification. Internat. J. Biomet. 2, 358–378.