Speech Processing Unit 4 Notes
(V Saravanan ASP/ECE)
Classification of Speaker Recognition Methods:
The problem of speaker recognition can be divided into two major sub-problems:
speaker identification and speaker verification.
Speaker identification can be thought of as the task of determining who is talking from a set of known
speakers' voices. It is the process of determining who has provided a given utterance based on the
information contained in the speech waves. When the unknown voice must come from a fixed set of known
speakers, the task is referred to as closed-set identification.
Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of
a speaker. Since it is assumed that imposters (those who pose as valid users) are not known to
the system, this is referred to as the open-set task. Adding a "none of the above" option to the closed-set
identification task merges the two tasks; this is called open-set identification. The error
that can occur in speaker identification is the false identification of a speaker, while the errors in speaker
verification fall into two categories: (1) false rejections, where a true speaker is
rejected as an imposter, and (2) false acceptances, where a false speaker is accepted as a true one.
The usual approach to speaker recognition is based on the classification of acoustic parameters
derived from the speech signal. Generally, the parameters are obtained via short-time spectral analysis
and contain both phonetic information, related to the uttered text, and individual information, related to
the speaker. Since the task of separating the phonetic information from the individual information is not yet
solved, many speaker recognition systems operate in a text-dependent way (i.e. the user must utter a
predefined sentence).
1. Acoustic Parameters for Speaker Recognition
The acoustic speech wave generated by a human is converted into an analog
signal using a microphone. An antialiasing filter is then used to condition this signal, and
additional filtering may be used to compensate for channel impairments. The antialiasing filter
band-limits the speech signal to below the Nyquist frequency (half the sampling rate) before
sampling. The conditioned analog signal is then sampled by an analog-to-digital (A/D)
converter to obtain a digital signal. The A/D converters in use today for speech signal
applications typically have a resolution of 12 to 16 bits at 8,000 to 20,000 samples per second.
Oversampling the analog speech signal allows the use of a simple antialiasing filter and
precise control over the fidelity of the sampled speech signal.
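A minimal Python sketch of this sampling step is given below. The 48 kHz source rate, 8 kHz target rate, and filter order are illustrative assumptions, not values from the notes.

```python
# Band-limit a high-rate ("analog") signal to below the target Nyquist
# frequency with an antialiasing low-pass filter, then downsample.
import numpy as np
from scipy import signal

fs_in = 48_000   # oversampled "analog" rate (Hz) - assumed for illustration
fs_out = 8_000   # target sampling rate (Hz)

t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t)          # stand-in for a speech waveform

# Antialiasing low-pass just below the target Nyquist frequency (fs_out / 2).
sos = signal.butter(8, 0.45 * fs_out, btype="low", fs=fs_in, output="sos")
x_filtered = signal.sosfiltfilt(sos, x)

# Decimate to the target rate; 12-16 bit quantization would follow here.
x_digital = x_filtered[:: fs_in // fs_out]
print(x_digital.shape)  # (8000,) samples: one second at 8 kHz
```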
Speaker recognition systems generally consist of three major units, as shown in Figure
3. The input to the first stage, the front-end processing system, is the speech signal. Here the
speech is digitized and feature extraction subsequently takes place. There are no exclusive
features that convey the speaker's identity in the speech signal; however, it is known from the
source-filter theory of speech production that the shape of the speech spectrum encodes
information about the speaker's vocal tract shape via the formants and about the glottal source
via the pitch harmonics. Therefore, some form of spectral feature is used in most
speaker recognition systems. The final process in the front-end processing stage is some
form of channel compensation. Different input devices (e.g. different telephone handsets)
impose different spectral characteristics on the speech signal, such as band limiting and
shaping, so channel compensation is performed to remove these unwanted effects. Most
commonly, some form of linear channel compensation, such as long- or short-term cepstral
mean subtraction, is applied to the features. The basic premise of spectral subtraction is that
the power spectrum of a speech signal corrupted by additive noise is equal to the sum of the
signal power spectrum and the noise power spectrum.
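A minimal sketch of cepstral mean subtraction (CMS) over a matrix of cepstral feature vectors follows; the random values stand in for real features.

```python
# A linear channel multiplies the spectrum, which adds a constant offset to
# the cepstrum of every frame; subtracting the long-term mean therefore
# removes the channel component.
import numpy as np

rng = np.random.default_rng(0)
cepstra = rng.normal(size=(200, 13))   # 200 frames x 13 cepstral coefficients

# Long-term CMS: subtract the mean over all frames of the utterance.
cms = cepstra - cepstra.mean(axis=0, keepdims=True)

# Short-term CMS: subtract a sliding mean (here a 100-frame causal window,
# an illustrative choice).
window = 100
running_mean = np.array([cepstra[max(0, i - window):i + 1].mean(axis=0)
                         for i in range(len(cepstra))])
cms_short = cepstra - running_mean
```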
g) Harmonic Features
The harmonic decomposition of a high-resolution spectral line estimate of the speech
signal yields the harmonic features. The line spectral pairs represent the variations in the
glottis and the vocal tract of a speaker, transformed into the frequency domain. The
harmonic feature vector contains the fundamental frequency followed by the amplitudes
of several harmonic components. These features can be produced only on voiced segments of
speech, and the long vowels and nasals were found to be the most speaker-specific.
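A minimal sketch of building such a feature vector from one voiced frame follows. The synthetic 120 Hz tone, the frame length, and the number of harmonics are illustrative assumptions.

```python
# Harmonic feature vector: fundamental frequency followed by the amplitudes
# of the first few harmonics, read off the magnitude spectrum of one frame.
import numpy as np

fs = 8000
n = 1024
t = np.arange(n) / fs
f0_true = 120.0
frame = sum(np.sin(2 * np.pi * k * f0_true * t) / k for k in range(1, 6))

spectrum = np.abs(np.fft.rfft(frame * np.hanning(n)))
freqs = np.fft.rfftfreq(n, d=1 / fs)

# Estimate f0 as the strongest peak in a plausible pitch range (50-400 Hz).
band = (freqs > 50) & (freqs < 400)
f0 = freqs[band][np.argmax(spectrum[band])]

# Amplitudes at the first 5 harmonic bins complete the feature vector.
harmonics = [spectrum[np.argmin(np.abs(freqs - k * f0))] for k in range(1, 6)]
feature = np.concatenate([[f0], harmonics])
```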
3. Similarity measures
The features of the speech signal take the form of N-dimensional feature vectors. For
a signal divided into M segments, M vectors are determined, producing an
M x N feature matrix. The M x N matrix is created by extracting features from the speaker's
utterances of selected words or sentences during the training phase. After the
feature vectors have been extracted from the speech signal, template matching is carried out
for speaker recognition. This process can be either manual (visual comparison of spectrograms)
or automatic. In automatic template matching, speaker models are constructed
from the extracted features. Thereafter, a speaker is authenticated by comparing the
incoming speech signal with the stored model of the claimed user. Speaker models are of
two types: template models and stochastic models.
i. Template Models
The simplest template model has a single template x, which is the model for a speech
segment. The match score between the template x for the claimed speaker and an input feature
vector y from an unknown user is given by d(x, y). The model for the claimed speaker could
be the centroid (mean) of a set of N vectors obtained in the training phase.
The various distance measures between the vectors x and y can be written as

d(x, y) = (x − y)^T W (x − y)

where W is a weighting matrix. If W is the identity matrix, then all elements of
the vectors are treated equally and the distance is the Euclidean distance. If W is a positive-definite
matrix that allows a desired weighting of the template features, the distance is the
Mahalanobis distance.
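A minimal sketch of the two cases follows; the training vectors are random stand-ins, and W for the Mahalanobis case is taken as the inverse covariance of the training data.

```python
# Weighted distance d(x, y) = (x - y)^T W (x - y):
# W = I gives the squared Euclidean distance; W = inverse covariance of the
# training vectors gives the Mahalanobis distance.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(50, 4))       # N training vectors for the speaker
x = train.mean(axis=0)                 # centroid (mean) template
y = rng.normal(size=4)                 # input feature vector

def weighted_distance(x, y, W):
    d = x - y
    return d @ W @ d

euclidean = weighted_distance(x, y, np.eye(4))
mahalanobis = weighted_distance(x, y, np.linalg.inv(np.cov(train, rowvar=False)))
```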
a) Dynamic Time Warping (DTW)
The time alignment of different utterances is a serious problem for distance measures,
and a small shift can lead to incorrect identification. Dynamic time warping is an efficient
method for solving this time alignment problem and is the most popular way of handling speaking-rate
variability in template-based systems. The asymmetric match score β for the comparison of an
input sequence y of M frames with the template sequence x is given by

β = (1/M) Σ_{i=1..M} d(y_i, x_{j(i)})

where the template indices j(i) are given by the DTW algorithm. This algorithm performs a
piecewise-linear mapping of the time axis to align both signals. In this way the method takes
into account the variation over time of the parameters corresponding to the dynamic
configuration of the articulators and the vocal tract.
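A minimal dynamic-programming sketch of DTW follows; the random matrices stand in for real feature sequences.

```python
# DTW between a template sequence x and an input sequence y: dynamic
# programming over the frame-distance matrix, allowing the diagonal,
# vertical, and horizontal steps of a piecewise-linear time alignment.
import numpy as np

def dtw(x, y):
    nx, ny = len(x), len(y)
    # Frame-to-frame Euclidean distances.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],   # diagonal: advance both sequences
                cost[i - 1, j],       # advance template only
                cost[i, j - 1])       # advance input only
    return cost[nx, ny]              # total cost of the best alignment path

rng = np.random.default_rng(2)
score = dtw(rng.normal(size=(30, 13)), rng.normal(size=(40, 13)))
```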
b) VQ Source Modeling
This is another form of (usually text-dependent) template model that uses multiple frames
of speech. The model uses a vector-quantized codebook, which is generated for a
speaker from his or her training data. Standard clustering procedures are used to
build the codebook. These procedures average the temporal information out of the
codebook, so the need to perform time alignment is eliminated. The
pattern match score is the distance between an input vector and the minimum-distance
codeword in the codebook.
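A minimal sketch using scipy's clustering routines follows; the codebook size and feature dimensions are illustrative assumptions.

```python
# VQ source modeling: build a codebook from a speaker's training frames with
# k-means, then score an input utterance as the average distance from each
# input frame to its nearest codeword.
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(3)
train_frames = rng.normal(size=(500, 13))   # stand-in training features
test_frames = rng.normal(size=(80, 13))     # stand-in input utterance

codebook, _ = kmeans(train_frames, 64)      # 64-codeword speaker model
_, dists = vq(test_frames, codebook)        # distance to nearest codeword
match_score = dists.mean()                  # lower score = better match
```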
c) Nearest Neighbors
This method combines the strengths of the dynamic time warping and vector
quantization methods. It keeps all the data obtained in the training phase and does
not cluster the data into a codebook; therefore it can exploit any temporal information
present in the prompted phrase. The distances between the input frames and the
stored frames are used to compute the interframe distance matrix. The nearest-neighbor
distance is the minimum distance between an input frame and the stored frames. The nearest-neighbor
distances for all input frames are averaged to arrive at the match score. These match
scores are then combined to form an approximation of the likelihood ratio. The method
is very memory-intensive but is one of the most powerful methods.
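A minimal sketch of the nearest-neighbor scoring step follows; the random matrices stand in for real feature frames.

```python
# Nearest-neighbor scoring: keep all training frames, build the interframe
# distance matrix, take the minimum distance for each input frame, average.
import numpy as np

rng = np.random.default_rng(4)
stored = rng.normal(size=(500, 13))   # all retained training frames
inputs = rng.normal(size=(80, 13))    # frames of the input utterance

# Interframe distance matrix: one row per input frame, one column per
# stored frame (this full matrix is why the method is memory-intensive).
dmat = np.linalg.norm(inputs[:, None, :] - stored[None, :, :], axis=-1)

nn_distances = dmat.min(axis=1)       # nearest stored frame per input frame
match_score = nn_distances.mean()     # averaged nearest-neighbor distance
```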
ii. Stochastic Models
A Hidden Markov Model (HMM) consists of a set of transitions between a set of states.
Two sets of probabilities are defined for each transition: a transition probability and an output
probability density function. The output probability density function gives the probability of
emitting each of the output symbols from a finite vocabulary. As shown in Fig. 5, transitions
are allowed only to the next right state or back to the same state, so the model is called a
left-to-right model, and the aij are the probabilities of transition between states. The HMM
parameters are estimated from the speech during the training phase; for verification, the
likelihood of the input feature sequence is computed with respect to the speaker's HMMs.
When a finite vocabulary is used for speaker recognition, each word is modelled with a
multiple-state left-to-right HMM, so a large vocabulary requires a large number of models.
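A minimal sketch of the transition structure of such a model follows; the probability values are illustrative assumptions.

```python
# Transition matrix a_ij of a 3-state left-to-right HMM: each state allows
# only a self-loop (stay) or a step to the next right state, so the matrix
# is upper bidiagonal.
import numpy as np

a = np.array([
    [0.6, 0.4, 0.0],   # state 0: stay with 0.6, move right with 0.4
    [0.0, 0.7, 0.3],   # state 1
    [0.0, 0.0, 1.0],   # final state: absorbing
])
assert np.allclose(a.sum(axis=1), 1.0)  # each row is a distribution

# Probability of the state path 0 -> 0 -> 1 -> 2 under this model:
path = [0, 0, 1, 2]
p = np.prod([a[i, j] for i, j in zip(path[:-1], path[1:])])
```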
Speaker recognition, which can be classified into identification and verification, is the
process of automatically recognizing who is speaking based on the speech signal. This method of
person identification uses the unique information carried in a speaker's voice; it allows the
speaker's identity to be verified and can control access to services such as voice dialling,
telephone banking, telephone shopping, database access services, voice mail, and access
authorization to resources, as well as serving forensic purposes.
Speaker identification is the process of determining which registered speaker provides
a given utterance. Speaker verification is the process of accepting or rejecting the identity claim
of a speaker. Most applications in which a voice is used as the key to confirm the identity of a
speaker are classified as speaker verification.
Speaker recognition methods are divided into text-dependent and text-independent
methods. In text-dependent systems, the speaker says key words or sentences with
the same text in both the training and recognition modes.
i. Text-Dependent Speaker Recognition
Text-dependent speaker recognition characterizes a speaker recognition task, such as
verification or identification, in which the set of words (or lexicon) used during the testing
phase is a subset of the words present during the enrolment phase. The restricted lexicon enables
very short enrolment (or registration) and testing sessions while still delivering an accurate
solution, but at the same time it poses scientific and technical challenges. Because of the short
enrolment and testing sessions, text-dependent speaker recognition technology is particularly
well suited for deployment in large-scale commercial applications.
In text-dependent speaker verification, the speaker presents a fixed or prompted phrase that
is programmed into the system, which can improve system performance. If an arbitrary word
or sentence is used instead, the system is called text-independent. In a text-independent speaker
verification system, the system has no advance knowledge of the speaker's phrasing, so the
task is much more difficult and the system less robust.
b) Feature Extraction
➢ Pitch: Pitch information provides a unique way of correlating the training and
testing utterances, because the rate at which the vocal folds vibrate differs
from speaker to speaker. Different pitch patterns are also used to convey
different meanings to the listener.
➢ Duration: For a genuine client, the total duration of the reference speech may
differ from that of the testing speech [6], but there is always consistency in the
relative durations of the words, syllables or phrases spoken in the utterance.
Duration also finds application in text-to-speech systems, speech understanding
systems, etc. Pitch and duration are suprasegmental features extracted from the
speech signal.
➢ Linear predictive coding: Linear predictive coding (LPC) predicts the present
sample value from a linear combination of past values, which removes
redundancy from the signal (see the sketch after this list). These features are
widely used for speech recognition, speech analysis and synthesis, voice
compression by telephone companies, and secure wireless applications where
voice must be digitized, encrypted and sent over a narrow voice channel.
The speech signal is analyzed by estimating the formants. Applying cepstral
analysis to the LPC features through a recursive procedure yields the
linear predictive cepstral coefficients (LPCC).
➢ Perceptual linear predictive coefficients: Perceptual linear predictive
coefficients (PLP) discard information in the voice signal that is unnecessary,
in order to improve the speech recognition rate. PLP merges a variety of
engineering approximations of human auditory processes. It is similar to LPC
except that the spectral characteristics are transformed to match those of the
human hearing system: in PLP, a nonlinear mapping between sound intensity
and perceived loudness and a non-uniform filter bank are used in the
extraction of the LP features.
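A minimal sketch of the LPC idea referenced in the list above follows: the coefficients are estimated here by least squares on one frame. The synthetic signal and the order p = 8 are illustrative assumptions (LPCC would follow via the cepstral recursion).

```python
# LPC: estimate coefficients that predict each sample from a linear
# combination of the previous p samples; the residual is the prediction error.
import numpy as np

rng = np.random.default_rng(5)
fs, p, n = 8000, 8, 400
t = np.arange(n) / fs
s = np.sin(2 * np.pi * 300 * t) + 0.1 * rng.normal(size=n)  # stand-in frame

# Each row of A holds the p past samples that predict the matching entry of b.
A = np.column_stack([s[p - k - 1 : n - k - 1] for k in range(p)])
b = s[p:]
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)

residual = b - A @ coeffs   # prediction error: the non-redundant information
```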
c) Pattern Classification
➢ Vector quantization: In the vector quantization (VQ) method, non-overlapping
clusters of feature vectors form the speaker models [13]. Here the data are
quantized in contiguous blocks called vectors, rather than as single scalar
values. The output of quantization is a data block drawn from a finite set of
vectors, termed the codebook.
➢ Dynamic time warping: Dynamic time warping (DTW) is an algorithm for
finding the minimum-distance path through a distance matrix, thereby reducing
the computation time.
➢ Gaussian mixture model: A Gaussian mixture model (GMM) is a parametric
form of probability density function (pdf) for continuous features in a
biometric system; it models features such as the spectral features of the
vocal-tract system as a weighted sum of Gaussian component densities
(see the sketch after this list).
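A minimal sketch of a GMM speaker model using scikit-learn (assumed available) follows; the component count and feature sizes are illustrative.

```python
# GMM speaker model: fit a mixture of Gaussians to a speaker's training
# features, then score a test utterance by its average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
train_frames = rng.normal(size=(1000, 13))  # stand-in speaker features
test_frames = rng.normal(size=(100, 13))    # stand-in test utterance

gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(train_frames)

# Average per-frame log-likelihood: the matching score for this speaker.
score = gmm.score(test_frames)
```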
d) Decision Making and Performance Measures
After classification, a decision is taken based on a threshold value: if the score is
greater than the threshold, the claim is accepted; otherwise it is rejected. The performance
of the system is measured in terms of acceptance and rejection rates, as listed below and
illustrated in the sketch after the list:
➢ False acceptance rate: The false acceptance rate (FAR) is defined as the ratio of
accepted imposter claims to the total number of imposter speakers.
➢ False rejection rate: The false rejection rate (FRR) is given by the ratio of
rejected client patterns to the total number of genuine speakers.
➢ Equal error rate: The equal error rate (EER) is the point where the FAR and FRR
curves intersect; a lower EER indicates better system performance.
➢ Total success rate: The total success rate (TSR) is obtained by subtracting the
EER (in percent) from 100.
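A minimal sketch of computing these measures follows; the genuine and imposter score distributions are synthetic stand-ins.

```python
# Sweep a decision threshold over genuine and imposter scores, compute FAR
# and FRR at each point, and take the EER where the two curves cross.
import numpy as np

rng = np.random.default_rng(7)
genuine = rng.normal(2.0, 1.0, size=500)    # scores from true speakers
imposter = rng.normal(0.0, 1.0, size=500)   # scores from imposters

thresholds = np.linspace(-4, 6, 1000)
far = np.array([(imposter >= t).mean() for t in thresholds])  # accepted imposters
frr = np.array([(genuine < t).mean() for t in thresholds])    # rejected clients

i = np.argmin(np.abs(far - frr))            # point where FAR and FRR meet
eer = (far[i] + frr[i]) / 2
tsr = 100 - eer * 100                       # total success rate, in percent
print(f"EER = {eer:.3f}, TSR = {tsr:.1f}%")
```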