SPEAKER ADAPTATION USING EIGENVOICES TECHNIQUE
Liselene de Abreu Borges1, Miguel Arjona Ramírez1, Rubem Dutra Ribeiro Fagundes2
1 Dep. Eng. Eletrônica – Escola Politécnica
Caixa Postal 61548, Universidade de São Paulo
CEP 05424-970 São Paulo, SP, Brazil
Tel.: +55-11-818-5606 Fax: +55-11-818-5718
e-mail: liselene@lps.usp.br, miguel@lps.usp.br
2 Dept. de Eng. Elétrica – Faculdade de Engenharia
Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, RS
Tel.: +55-51-3320-3540 Fax: +55-51-3320-3625
e-mail: rubemdrf@attglobal.net
ABSTRACT
This paper discusses speech recognition systems (SRS) that use speaker adaptation techniques. Most recent speech recognition systems are based on Hidden Markov Models (HMM). For such systems, the eigenvoices speaker adaptation technique presents the best performance among the techniques usually suggested by researchers, mainly because of the limited amount of data it needs to perform speaker adaptation. In our experiments, the adapted system improved recognition by around 10% over the corresponding speaker-independent speech recognition system, using just a very small fraction of the speakers' data.
1 INTRODUCTION
A speaker-dependent (SD) speech recognition system presents the best performance because all available data come from just one speaker, usually the system user. However, it is necessary to collect a large amount of data from this user, enough to provide good recognition performance. Whenever the vocabulary grows, the amount of training data grows with it, making it very difficult to keep up the system's performance. The usual way around this problem is to train a speaker-independent (SI) recognition system using data from several speakers. Nevertheless, its final performance is not very good compared with the speaker-dependent system.
The solution is to build a speaker-adaptive system [1], in which a speaker-independent (SI) system is created first. After that, using a speaker adaptation technique, the speech recognition system dynamically becomes a speaker-dependent (SD) system.
An adaptation system (Figure 1) is not a fully trained speaker-dependent system, but a system with most of its knowledge taken from an SI system and a specific set of information about the new user, extracted from the user's adaptation data.
Figure 1 – Speaker adaptation system retains general and specific knowledge of speech
The use of adaptive systems allows improvements in speech recognition performance at low cost. The main goal is to get better performance from a small data set and save a lot of computation time.
2 SPEAKER ADAPTATION SYSTEM
A speaker adaptation system (SAS) modifies a previously trained SI system to bring it close to an ideal SD system for a given new speaker, using just a small set of adaptation data [1].
When the adaptation data set is given a priori, the process is usually called supervised adaptation; when the adaptation data set is unknown a priori, the process is called unsupervised adaptation. Furthermore, the adaptation process can be executed directly on the input signal, usually called spectral mapping adaptation [2], or on the HMM parameters, which we call model mapping adaptation [3].
Finally, the adaptation process is called offline [2] when it runs before the new user utilizes the SRS for the first time1, or online [2] when it runs while the new speaker uses the system for the first time.
2.1 Eigenvoices
The eigenvoices technique [4] is based on an image processing technique [5], namely the eigenfaces technique, usually applied as an image compression method. The main point is to reduce the dimension of the data while keeping most of the data variation in the remaining parameters [6], because most of the original parameters are highly correlated with each other. Westwood [7] says: "The eigenvoices form a basis of a subspace of the acoustic model space, and are chosen to account for inter-speaker variability." For a given set of parameters estimated from different SD models, Principal Component Analysis (PCA) will define the linear directions along which most of the data variability lies. Such directions are called principal components or eigenvoices.
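The PCA step described above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: the random vectors are toy stand-ins for speaker-dependent model parameters, and the SVD of the centered data matrix is used to obtain the principal directions (eigenvoices).

```python
import numpy as np

# Toy data: T = 5 speaker-dependent parameter vectors of dimension
# D = 8 (in the paper these would be HMM mean supervectors).
rng = np.random.default_rng(0)
T, D = 5, 8
M = rng.standard_normal((D, T))

# PCA via SVD: center the columns, then take the left singular
# vectors. Each column of U is one principal direction (eigenvoice).
mean = M.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(M - mean, full_matrices=False)

# The directions are orthonormal and sorted by decreasing variability.
print(U.shape)                                   # (8, 5)
print(np.allclose(U.T @ U, np.eye(U.shape[1])))  # True
```

With T base speakers and dimension D >> T, at most T directions are obtained, which is why `full_matrices=False` is the natural choice here.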
2.1.1 Eigenspace estimation
The first step in using eigenvoices is to build the eigenspace [8]. In order to do so, it is necessary to train T different SD models, using T different speakers2. The parameters of each SD model are then collected. In this work, these parameters are the means of the Gaussian output distributions of the HMM, but the variances of the output distributions, the transition matrices or other parameters could also be used.
After that, the means of each SD model t are copied into a vector called a supervector, of dimension D, where D is the total number of adaptation parameters.
The next step consists in building a very large matrix M, using all T supervectors, with dimensions (D x T), as follows:

M = [p^(1) p^(2) ... p^(T)]  (1)

where p^(t) is the supervector (D x 1) with all the parameters of speaker t, and t = 1, 2, ..., T.
The eigenspace will be extracted from matrix M through Principal Component Analysis (PCA) [6], as we can see below:

E = [e_1 e_2 ... e_D] = PCA(M)  (2)

where E is the eigenspace (eigenvoices space). Each line e_k of matrix E is given by:

e_k = [e_k^(1) e_k^(2) ... e_k^(j)]^T  (3)

where j is the model state and k = 1, 2, ..., D.
2.1.2 Number of eigenvoice components
In an SRS, a number D of components is usually necessary in order to reproduce the whole system variability. However, almost the complete system variability can be concentrated in a small number K of components. In other words, most of the system's information is located in a small set of K principal components, which can replace and represent the whole set of D components. Thus, the original data set, composed of T observations of D components, will be reduced to a set of T observations of K principal components [9]. K should be smaller than T = rank(M), with K < T << D, and can be defined in many ways [9]. This work uses the percent cumulative variation, as seen below:

%VarCum_K = 100 · (Σ_{k=1}^{K} d_k) / (Σ_{k=1}^{D} d_k)  (4)

where d_k is the eigenvalue associated with eigenvector e_k. This number is the ratio of the sum of the first K eigenvalues (each associated with one eigenvoice) to the sum of all D eigenvalues of E. Usually K can be chosen so that the percent cumulative variation is around 80 to 90%. The reduced eigenspace Ẽ is then given by:

Ẽ = [e_1 e_2 ... e_K]  (5)

where the last (D - K) eigenvoices are ignored.
2.2 Eigenvoice coefficients
The adaptation parameters are given by:

p̂ = Ẽ·ν = Σ_{k=1}^{K} ν_k e_k  (6)

where ν_k are the eigenvoice coefficients to be estimated [8].

1 In this case the speaker data set has been acquired, as part of a pre-processing step, before the system is used.
2 These T speakers are the base speakers of the system.
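Equations (4) to (6) can be illustrated with a short numeric sketch. The eigenvalues, eigenvoices and coefficients below are made-up toy values, not results from the paper:

```python
import numpy as np

# Toy eigenvalues d_k, sorted in decreasing order (made up).
d = np.array([6.0, 2.0, 1.0, 0.5, 0.5])

# Equation (4): percent cumulative variation for each candidate K.
var_cum = 100.0 * np.cumsum(d) / d.sum()

# Choose the smallest K whose cumulative variation reaches 90%.
K = int(np.searchsorted(var_cum, 90.0) + 1)
print(K, var_cum[K - 1])   # here K = 3 keeps 90% of the variability

# Equation (6): the adapted supervector is a weighted sum of the
# first K eigenvoices (toy D = 4 eigenvoices and weights nu_k).
E_tilde = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]]).T   # columns e_1..e_K (D x K)
nu = np.array([0.5, -0.2, 0.1])                # coefficients nu_k
p_hat = E_tilde @ nu                           # adapted parameters (D,)
```

The truncation of Equation (5) is implicit here: only the first K columns are kept in `E_tilde`, so the last (D - K) eigenvoices never enter the sum.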
2.2.1 Maximum likelihood eigendecomposition
In order to estimate the eigenvoice coefficients, it is necessary to maximize the likelihood of the adaptation data given the HMM model λ̂ [10], as seen from:

λ̂ = argmax_{λ∈Ω} P(O | λ)  (7)

where O is the observation set that is intended to be represented by the adaptation model, and Ω is the set of HMMs. This maximization is carried out by maximum likelihood eigendecomposition (MLED) [8], using the maximum likelihood (ML) estimation algorithm [11] in order to calculate Equation (6).
The ML algorithm transforms the function P(O|λ) into an auxiliary Baum function Q(λ, λ̂), maximizing this function with respect to λ̂, as follows [10]:

Q(λ, λ̂) = Σ_{q∈Q} P(O, q | λ) log P(O, q | λ̂)  (8)

where q = (q_1, ..., q_T) is the state sequence and Q is the set of all possible state sequences. According to [12], the development of expression (8) leads to:

Σ_t γ_t^(j) e_k^(j)T C^(j)-1 o_t = Σ_t γ_t^(j) Σ_{i=1}^{K} ν_i e_i^(j)T C^(j)-1 e_k^(j)  (9)

where:
o_t is the observation vector at a given time t;
C^(j)-1 is the inverse covariance matrix of state j;
γ_t^(j) is the occupation probability of state j at time t, given the observation sequence O and the HMM λ.
3 METHODOLOGY
All tests in this work have been done using an isolated-word speech recognition system with a 20-word English vocabulary, drawn from the well-known TIMIT speech corpus [13]. Each word is modeled by a continuous-distribution HMM with 6 states, and each state has one Gaussian output distribution over 12 MFCCs3 and one frame energy coefficient. The system's vocabulary is given in Table 1:

she    suit    year   to     rag
had    greasy  don't  carry  like
your   wash    ask    an     that
dark   water   me     oily   in

Table 1 – System vocabulary

The SI system was trained with 20 speakers (7 women, 13 men), achieving an 84.33% correct recognition rate on a test set of 15 speakers (6 women, 9 men).
3.1 Results
The first test assessed the effect of the eigenvoice subspace dimension, changing the number of eigenvoice components or, in other words, changing K (Figure 2).

Figure 2 – Recognition rate obtained by varying the number of eigenvoices

Figure 3 shows the percent cumulative variation. We would like to point out that most of the percent cumulative variation is concentrated in the first three eigenvoices.

Figure 3 – Eigenvoices cumulative variation

3 Mel Frequency Cepstral Coefficients [14].
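For a fixed set of eigenvoices, Equation (9), written for k = 1, ..., K and accumulated over all states, is linear in the coefficients ν_k, so MLED reduces to solving a K x K linear system. The sketch below assumes toy values (random eigenvoices, occupation probabilities and observations) and identity covariances; it illustrates the structure of the system, not the authors' implementation:

```python
import numpy as np

# Toy dimensions: J states, K eigenvoices, F-dim features, Tn frames.
rng = np.random.default_rng(1)
J, K, F, Tn = 3, 2, 4, 10

E = rng.standard_normal((J, K, F))    # e_k^(j): eigenvoice k at state j
Cinv = np.stack([np.eye(F)] * J)      # C^(j)-1 (identity for simplicity)
gamma = rng.random((Tn, J))           # occupation probabilities gamma_t^(j)
obs = rng.standard_normal((Tn, F))    # observation vectors o_t

# Accumulate Eq. (9) over states: left side -> vector b, right -> matrix A.
A = np.zeros((K, K))
b = np.zeros(K)
for j in range(J):
    g = gamma[:, j].sum()             # sum_t gamma_t^(j)
    m = gamma[:, j] @ obs             # sum_t gamma_t^(j) o_t
    for k in range(K):
        b[k] += E[j, k] @ Cinv[j] @ m
        for i in range(K):
            A[k, i] += g * (E[j, k] @ Cinv[j] @ E[j, i])

nu = np.linalg.solve(A, b)            # eigenvoice coefficients nu_k
```

Since A grows only as K x K, the solve stays cheap even when D is very large, which is one reason the technique works with so little adaptation data.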
Some tests have been done changing the number of speakers in the SI system. Figure 4 shows that the number of base speakers has no effect in improving the marginal system performance. The marginal improvement provided by the adaptation is around an additional 10%.

Figure 4 – Speaker-independent and speaker-adapted recognition rates for different numbers of base speakers in the SI model

Table 2 shows correct recognition rate results, using one word as adaptation data.

Table 2 – Eigenvoices adaptation results, having one word as adaptation data, for the 1st, 3rd and 6th eigenvoices

The last test was carried out by changing the amount of adaptation data, as shown in Figure 5. This figure demonstrates that the amount of adaptation data has practically no effect on recognition performance. The recognition rate using one word as adaptation data was 90%, against 91% using all words as the adaptation data set. We can conclude that this method is especially indicated when just a very small amount of adaptation data is available.

Figure 5 – Speaker adaptation system recognition rate with an increasing adaptation data set size, for the first three eigenvoice components

4 CONCLUSIONS
It seems very clear that the dimension K of the eigenvoices plays a main role in the system's behavior, not only due to its influence on system performance, but also given the real advantage of using a small fraction of training data to perform speech recognition. In this sense we would like to point out that the eigenspace dimension K should be just as small as the amount of data available to perform speaker adaptation allows. In the same way, if there is a large amount of adaptation data, K must be properly dimensioned to fit the size of the data set.
It is understandable that the information about the speaker is extracted from the adaptation data, and there is a strong relation between this information (in an acoustic sense) and the K dimensions necessary to represent it in the eigenspace. Accordingly, the first K eigenvoices are chosen from the HMM variability4, and if we try to use a large K with reduced adaptation data, we will make a bad estimation of these new eigenvoices (or, in other words, these new dimensions), leading to incorrect parameter estimates.
We would like to stress that the maximum performance was achieved using just 70% of the total amount of data. In most eigenvoices adaptation systems [4] [7] [2], the data came from 100 base speakers. In this work, we have used just 20 base speakers and still reached maximum performance.
We also would like to point out that this technique is especially indicated for small vocabularies5, because a large one will demand a lot of training time6. For large-vocabulary SRS, the Maximum Likelihood Linear Regression (MLLR) technique is much more indicated. As future work we are considering the use of the eigenvoices technique with regression classes [15].
Also in the future, we plan to use eigenvoices in SRS with acoustic units, like phonemes [16], improving recognition performance.

4 And these HMMs will be trained from the available data.
5 There is no system using around 1000 words reported in the literature until now.
6 This is the case for systems with no phonetic modeling. For large-vocabulary SRS with phonetic modeling, the HMM phonetic models can be adapted by the eigenvoices technique.

5 REFERENCES
[1] FURUI, S., Speaker-Independent and Speaker-Adaptive Recognition Techniques. In: Advances in Speech Signal Processing. Ed. Furui, S., Sondhi, M., New York: Marcel Dekker, pp. 597-621, 1992.
[2] CHRISTENSEN, H., Speaker Adaptation of Hidden Markov Models using Maximum Likelihood Linear Regression. Thesis, Aalborg University, Denmark, 1996.
[3] WOODLAND, P., Speaker Adaptation: Techniques and Challenges. Proceedings IEEE Automatic Speech Recognition and Understanding Workshop, pp. 85-90, Colorado, 2000.
[4] KUHN, R., et al., Eigenvoices for speaker adaptation. Proc. of ICSLP-98, pp. 1771-1774, Sydney, Australia, 1998.
[5] KUHN, R., et al., Eigenfaces and Eigenvoices: Dimensionality reduction for specialized Pattern Recognition. IEEE Workshop on Multimedia Signal Processing, California, 1998.
[6] JOLLIFFE, I., Principal Component Analysis. Springer-Verlag, New York, 1986.
[7] WESTWOOD, R., Speaker Adaptation using Eigenvoices. Thesis, Cambridge University, Cambridge, 1999.
[8] KUHN, R., et al., Eigenvoices for speaker adaptation. Internal technical report, STL, California, 1997.
[9] JOHNSON, R., WICHERN, D., Applied Multivariate Statistical Analysis. Prentice Hall, Texas, 1988.
[10] RABINER, L., A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-284, 1989.
[11] DELLER, J.; PROAKIS, J.; HANSEN, J., Discrete-Time Processing of Speech Signals. Macmillan, New York, pp. 63-66, 1993.
[12] BORGES, L., Speaker adaptation system using eigenvoices. MSc. Dissertation. In Portuguese. University of São Paulo, São Paulo, 2001.
[13] NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY (NIST), The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, Virginia, http://www.nist.gov/, 1990.
[14] DAVIS, S.; MERMELSTEIN, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.
[15] LEGGETTER, C., Improved Acoustic Modelling for HMMs using Linear Transformations. Ph.D. thesis, Cambridge University, Cambridge, 1995.
[16] FAGUNDES, R. D. R., Phonetic-phonologic approach to continuous language speech recognition system. Ph.D. thesis. In Portuguese. University of São Paulo – POLI/USP, São Paulo, Brazil, 1998.