Speech Processing Unit 4 Notes
(V Saravanan ASP/ECE)
Classification of Speaker Recognition Methods:
The problem of speaker recognition can be divided into two major sub-problems:
speaker identification and speaker verification.
Speaker identification can be thought of as the task of determining who is talking from a set of known
speakers' voices. It is the process of determining who has provided a given utterance based on the
information contained in the speech waves. When the unknown voice must come from a fixed set of known
speakers, the task is referred to as closed-set identification.
Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of
a speaker. Since it is assumed that imposters (those who pose as valid users) are not known to
the system, this is referred to as the open-set task. Adding a "none of the above" option to the closed-set
identification task merges the two tasks; this is called open-set identification. The error
that can occur in speaker identification is the false identification of a speaker, while the errors in speaker
verification fall into two categories: (1) false rejections, where a true speaker is
rejected as an imposter, and (2) false acceptances, where a false speaker is accepted as a true one.
The usual approach to speaker recognition is based on the classification of acoustic parameters
derived from the speech signal. Generally, the parameters are obtained via short-time spectral analysis
and contain both phonetic information, related to the uttered text, and individual information, related to
the speaker. Since the task of separating the phonetic information from the individual information is not yet
solved, many speaker recognition systems operate in a text-dependent way (i.e. the user must utter a
predefined sentence).
1. Acoustic Parameters for Speaker Recognition
The acoustic speech wave generated by a human is converted into an analog
signal using a microphone. An antialiasing filter is then used to condition this signal, and
additional filtering may be used to compensate for channel impairments. The antialiasing filter
band-limits the speech signal to below the Nyquist frequency (half the sampling rate) before
sampling. The conditioned analog signal is then sampled by an analog-to-digital (A/D)
converter to obtain a digital signal. The A/D converters in use today for speech signal
applications typically have a resolution of 12 to 16 bits at 8,000 to 20,000 samples per second.
Oversampling the analog speech signal allows the use of a simple antialiasing filter and
precise control over the fidelity of the sampled speech signal.
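A minimal Python sketch of this sampling step is given below. The 48 kHz source rate, 8 kHz target rate, and filter order are illustrative assumptions, not values from the notes.

```python
# Band-limit a high-rate ("analog") signal to below the target Nyquist
# frequency with an antialiasing low-pass filter, then downsample.
import numpy as np
from scipy import signal

fs_in = 48_000   # oversampled "analog" rate (Hz) - assumed for illustration
fs_out = 8_000   # target sampling rate (Hz)

t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440 * t)          # stand-in for a speech waveform

# Antialiasing low-pass just below the target Nyquist frequency (fs_out / 2).
sos = signal.butter(8, 0.45 * fs_out, btype="low", fs=fs_in, output="sos")
x_filtered = signal.sosfiltfilt(sos, x)

# Decimate to the target rate; 12-16 bit quantization would follow here.
x_digital = x_filtered[:: fs_in // fs_out]
print(x_digital.shape)  # (8000,) samples: one second at 8 kHz
```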
Speaker recognition systems generally consist of three major units, as shown in Figure
3. The input to the first stage, the front-end processing system, is the speech signal. Here the
speech is digitized and feature extraction subsequently takes place. There are no exclusive
features that convey the speaker's identity in the speech signal; however, it is known from the
source-filter theory of speech production that the shape of the speech spectrum encodes
information about the speaker's vocal tract shape via the formants and about the glottal source
via the pitch harmonics. Therefore, some form of spectral feature is used in most
speaker recognition systems. The final process in the front-end processing stage is some
form of channel compensation. Different input devices (e.g. different telephone handsets)
impose different spectral characteristics on the speech signal, such as band limiting and
shaping, so channel compensation is performed to remove these unwanted effects. Most
commonly, some form of linear channel compensation, such as long- or short-term cepstral
mean subtraction, is applied to the features. The basic premise of spectral subtraction is that
the power spectrum of a speech signal corrupted by additive noise is equal to the sum of the
signal power spectrum and the noise power spectrum.
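A minimal sketch of cepstral mean subtraction (CMS) over a matrix of cepstral feature vectors follows; the random values stand in for real features.

```python
# A linear channel multiplies the spectrum, which adds a constant offset to
# the cepstrum of every frame; subtracting the long-term mean therefore
# removes the channel component.
import numpy as np

rng = np.random.default_rng(0)
cepstra = rng.normal(size=(200, 13))   # 200 frames x 13 cepstral coefficients

# Long-term CMS: subtract the mean over all frames of the utterance.
cms = cepstra - cepstra.mean(axis=0, keepdims=True)

# Short-term CMS: subtract a sliding mean (here a 100-frame causal window,
# an illustrative choice).
window = 100
running_mean = np.array([cepstra[max(0, i - window):i + 1].mean(axis=0)
                         for i in range(len(cepstra))])
cms_short = cepstra - running_mean
```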
g) Harmonic Features
The harmonic decomposition of a high-resolution spectral line estimate of the speech
signal yields the harmonic features. The line spectral pairs represent the variations in the
glottis and the vocal tract of a speaker, transformed into the frequency domain. The
harmonic feature vector contains the fundamental frequency followed by the amplitudes
of several harmonic components. These features can be produced only on voiced segments of
speech, and the long vowels and nasals were found to be the most speaker-specific.
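A minimal sketch of building such a feature vector from one voiced frame follows. The synthetic 120 Hz tone, the frame length, and the number of harmonics are illustrative assumptions.

```python
# Harmonic feature vector: fundamental frequency followed by the amplitudes
# of the first few harmonics, read off the magnitude spectrum of one frame.
import numpy as np

fs = 8000
n = 1024
t = np.arange(n) / fs
f0_true = 120.0
frame = sum(np.sin(2 * np.pi * k * f0_true * t) / k for k in range(1, 6))

spectrum = np.abs(np.fft.rfft(frame * np.hanning(n)))
freqs = np.fft.rfftfreq(n, d=1 / fs)

# Estimate f0 as the strongest peak in a plausible pitch range (50-400 Hz).
band = (freqs > 50) & (freqs < 400)
f0 = freqs[band][np.argmax(spectrum[band])]

# Amplitudes at the first 5 harmonic bins complete the feature vector.
harmonics = [spectrum[np.argmin(np.abs(freqs - k * f0))] for k in range(1, 6)]
feature = np.concatenate([[f0], harmonics])
```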
3. Similarity measures
The features of the speech signal take the form of N-dimensional feature vectors. For
a signal divided into M segments, M vectors are determined, producing an
M x N feature matrix. The M x N matrix is created by extracting features from the speaker's
utterances of selected words or sentences during the training phase. After the
feature vectors have been extracted from the speech signal, template matching is carried out
for speaker recognition. This process can be either manual (visual comparison of spectrograms)
or automatic. In automatic template matching, speaker models are constructed
from the extracted features. Thereafter, a speaker is authenticated by comparing the
incoming speech signal with the stored model of the claimed user. Speaker models are of
two types: template models and stochastic models.
i. Template Models
The simplest template model has a single template x, which is the model for a speech
segment. The match score between the template x for the claimed speaker and an input feature
vector y from an unknown user is given by d(x, y). The model for the claimed speaker could
be the centroid (mean) of a set of N vectors obtained in the training phase.
The various distance measures between the vectors x and y can be written as

d(x, y) = (x − y)^T W (x − y)

where W is a weighting matrix. If W is the identity matrix, then all elements of
the vectors are treated equally and the distance is the Euclidean distance. If W is a positive-definite
matrix that allows a desired weighting of the template features, the distance is the
Mahalanobis distance.
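A minimal sketch of the two cases follows; the training vectors are random stand-ins, and W for the Mahalanobis case is taken as the inverse covariance of the training data.

```python
# Weighted distance d(x, y) = (x - y)^T W (x - y):
# W = I gives the squared Euclidean distance; W = inverse covariance of the
# training vectors gives the Mahalanobis distance.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(50, 4))       # N training vectors for the speaker
x = train.mean(axis=0)                 # centroid (mean) template
y = rng.normal(size=4)                 # input feature vector

def weighted_distance(x, y, W):
    d = x - y
    return d @ W @ d

euclidean = weighted_distance(x, y, np.eye(4))
mahalanobis = weighted_distance(x, y, np.linalg.inv(np.cov(train, rowvar=False)))
```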
a) Dynamic Time Warping (DTW)
The time alignment of different utterances is a serious problem for distance measures,
and a small shift can lead to incorrect identification. Dynamic time warping is an efficient
method for solving this time alignment problem and is the most popular way of handling speaking-rate
variability in template-based systems. The asymmetric match score β for the comparison of an
input sequence y of M frames with the template sequence x is given by

β = (1/M) Σ_{i=1..M} d(y_i, x_{j(i)})

where the template indices j(i) are given by the DTW algorithm. This algorithm performs a
piecewise-linear mapping of the time axis to align both signals. In this way the method takes
into account the variation over time of the parameters corresponding to the dynamic
configuration of the articulators and the vocal tract.
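A minimal dynamic-programming sketch of DTW follows; the random matrices stand in for real feature sequences.

```python
# DTW between a template sequence x and an input sequence y: dynamic
# programming over the frame-distance matrix, allowing the diagonal,
# vertical, and horizontal steps of a piecewise-linear time alignment.
import numpy as np

def dtw(x, y):
    nx, ny = len(x), len(y)
    # Frame-to-frame Euclidean distances.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],   # diagonal: advance both sequences
                cost[i - 1, j],       # advance template only
                cost[i, j - 1])       # advance input only
    return cost[nx, ny]              # total cost of the best alignment path

rng = np.random.default_rng(2)
score = dtw(rng.normal(size=(30, 13)), rng.normal(size=(40, 13)))
```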
b) VQ Source Modeling
This is another form of (usually text-dependent) template model that uses multiple frames
of speech. The model uses a vector-quantized codebook, which is generated for a
speaker from his or her training data. Standard clustering procedures are used to
build the codebook. These procedures average the temporal information out of the
codebook, so the need to perform time alignment is eliminated. The
pattern match score is the distance between an input vector and the minimum-distance
codeword in the codebook.
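A minimal sketch using scipy's clustering routines follows; the codebook size and feature dimensions are illustrative assumptions.

```python
# VQ source modeling: build a codebook from a speaker's training frames with
# k-means, then score an input utterance as the average distance from each
# input frame to its nearest codeword.
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(3)
train_frames = rng.normal(size=(500, 13))   # stand-in training features
test_frames = rng.normal(size=(80, 13))     # stand-in input utterance

codebook, _ = kmeans(train_frames, 64)      # 64-codeword speaker model
_, dists = vq(test_frames, codebook)        # distance to nearest codeword
match_score = dists.mean()                  # lower score = better match
```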
c) Nearest Neighbors
This method combines the strengths of the dynamic time warping and vector
quantization methods. It keeps all the data obtained in the training phase and does
not cluster the data into a codebook; therefore it can exploit any temporal information
present in the prompted phrase. The distances between the input frames and the
stored frames are used to compute the interframe distance matrix. The nearest-neighbor
distance is the minimum distance between an input frame and the stored frames. The nearest-neighbor
distances for all input frames are averaged to arrive at the match score. These match
scores are then combined to form an approximation of the likelihood ratio. The method
is very memory-intensive but is one of the most powerful methods.
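A minimal sketch of the nearest-neighbor scoring step follows; the random matrices stand in for real feature frames.

```python
# Nearest-neighbor scoring: keep all training frames, build the interframe
# distance matrix, take the minimum distance for each input frame, average.
import numpy as np

rng = np.random.default_rng(4)
stored = rng.normal(size=(500, 13))   # all retained training frames
inputs = rng.normal(size=(80, 13))    # frames of the input utterance

# Interframe distance matrix: one row per input frame, one column per
# stored frame (this full matrix is why the method is memory-intensive).
dmat = np.linalg.norm(inputs[:, None, :] - stored[None, :, :], axis=-1)

nn_distances = dmat.min(axis=1)       # nearest stored frame per input frame
match_score = nn_distances.mean()     # averaged nearest-neighbor distance
```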
ii. Stochastic Models
A Hidden Markov Model (HMM) consists of a set of transitions between a set of states.
Two sets of probabilities are defined for each transition: a transition probability and an output
probability density function. The output probability density function gives the probability of
emitting each of the output symbols from a finite vocabulary. As shown in Fig. 5, transitions
are allowed only to the next right state or back to the same state, so the model is called a
left-to-right model, and the aij are the probabilities of transition between states. The HMM
parameters are estimated from the speech during the training phase; for verification, the
likelihood of the input feature sequence is computed with respect to the speaker's HMMs.
When a finite vocabulary is used for speaker recognition, each word is modelled with a
multiple-state left-to-right HMM, so a large vocabulary requires a large number of models.
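A minimal sketch of the transition structure of such a model follows; the probability values are illustrative assumptions.

```python
# Transition matrix a_ij of a 3-state left-to-right HMM: each state allows
# only a self-loop (stay) or a step to the next right state, so the matrix
# is upper bidiagonal.
import numpy as np

a = np.array([
    [0.6, 0.4, 0.0],   # state 0: stay with 0.6, move right with 0.4
    [0.0, 0.7, 0.3],   # state 1
    [0.0, 0.0, 1.0],   # final state: absorbing
])
assert np.allclose(a.sum(axis=1), 1.0)  # each row is a distribution

# Probability of the state path 0 -> 0 -> 1 -> 2 under this model:
path = [0, 0, 1, 2]
p = np.prod([a[i, j] for i, j in zip(path[:-1], path[1:])])
```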
Speaker recognition, which can be classified into identification and verification, is the
process of automatically recognizing who is speaking based on the speech signal. This method of
person identification uses the unique information carried in a speaker's voice; it allows the
speaker's identity to be verified and can control access to services such as voice dialling,
telephone banking, telephone shopping, database access services, voice mail, and access
authorization to resources, as well as serving forensic purposes.
Speaker identification is the process of determining which registered speaker provides
a given utterance. Speaker verification is the process of accepting or rejecting the identity claim
of a speaker. Most applications in which a voice is used as the key to confirm the identity of a
speaker are classified as speaker verification.
Speaker recognition methods are divided into text-dependent and text-independent
methods. In text-dependent systems, the speaker says key words or sentences with
the same text in both the training and recognition modes.
i. Text-Dependent Speaker Recognition
Text-dependent speaker recognition characterizes a speaker recognition task, such as
verification or identification, in which the set of words (or lexicon) used during the testing
phase is a subset of the words present during the enrolment phase. The restricted lexicon enables
very short enrolment (or registration) and testing sessions while still delivering an accurate
solution, but at the same time it poses scientific and technical challenges. Because of the short
enrolment and testing sessions, text-dependent speaker recognition technology is particularly
well suited for deployment in large-scale commercial applications.
In text-dependent speaker verification, the speaker presents a fixed or prompted phrase that
is programmed into the system, which can improve system performance. If an arbitrary word
or sentence is used instead, the system is called text-independent. In a text-independent speaker
verification system, the system has no advance knowledge of the speaker's phrasing, so the
task is much more difficult and the system less robust.
b) Feature Extraction
➢ Pitch: Pitch information provides a unique way of correlating the training and
testing utterances, because the rate at which the vocal folds vibrate differs
from speaker to speaker. Different pitch patterns are also used to convey
different meanings to the listener.
➢ Duration: For a genuine client, the total duration of the reference speech may
differ from that of the testing speech [6], but there is always consistency in the
relative durations of the words, syllables or phrases spoken in the utterance.
Duration also finds application in text-to-speech systems, speech understanding
systems, etc. Pitch and duration are suprasegmental features extracted from the
speech signal.
➢ Linear predictive coding: Linear predictive coding (LPC) predicts the present
sample value from a linear combination of past values, which removes
redundancy from the signal (see the sketch after this list). These features are
widely used for speech recognition, speech analysis and synthesis, voice
compression by telephone companies, and secure wireless applications where
voice must be digitized, encrypted and sent over a narrow voice channel.
The speech signal is analyzed by estimating the formants. Applying cepstral
analysis to the LPC features through a recursive procedure yields the
linear predictive cepstral coefficients (LPCC).
➢ Perceptual linear predictive coefficients: Perceptual linear predictive
coefficients (PLP) discard information in the voice signal that is unnecessary,
in order to improve the speech recognition rate. PLP merges a variety of
engineering approximations of human auditory processes. It is similar to LPC
except that the spectral characteristics are transformed to match those of the
human hearing system: in PLP, a nonlinear mapping between sound intensity
and perceived loudness and a non-uniform filter bank are used in the
extraction of the LP features.
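A minimal sketch of the LPC idea referenced in the list above follows: the coefficients are estimated here by least squares on one frame. The synthetic signal and the order p = 8 are illustrative assumptions (LPCC would follow via the cepstral recursion).

```python
# LPC: estimate coefficients that predict each sample from a linear
# combination of the previous p samples; the residual is the prediction error.
import numpy as np

rng = np.random.default_rng(5)
fs, p, n = 8000, 8, 400
t = np.arange(n) / fs
s = np.sin(2 * np.pi * 300 * t) + 0.1 * rng.normal(size=n)  # stand-in frame

# Each row of A holds the p past samples that predict the matching entry of b.
A = np.column_stack([s[p - k - 1 : n - k - 1] for k in range(p)])
b = s[p:]
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)

residual = b - A @ coeffs   # prediction error: the non-redundant information
```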
c) Pattern Classification
➢ Vector quantization: In the vector quantization (VQ) method, non-overlapping
clusters of feature vectors form the speaker models [13]. Here the data are
quantized in contiguous blocks called vectors, rather than as single scalar
values. The output of quantization is a data block drawn from a finite set of
vectors, termed the codebook.
➢ Dynamic time warping: Dynamic time warping (DTW) is an algorithm for
finding the minimum-distance path through a distance matrix, thereby reducing
the computation time.
➢ Gaussian mixture model: A Gaussian mixture model (GMM) is a parametric
form of probability density function (pdf) for continuous features in a
biometric system; it models features such as the spectral features of the
vocal-tract system as a weighted sum of Gaussian component densities
(see the sketch after this list).
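A minimal sketch of a GMM speaker model using scikit-learn (assumed available) follows; the component count and feature sizes are illustrative.

```python
# GMM speaker model: fit a mixture of Gaussians to a speaker's training
# features, then score a test utterance by its average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
train_frames = rng.normal(size=(1000, 13))  # stand-in speaker features
test_frames = rng.normal(size=(100, 13))    # stand-in test utterance

gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(train_frames)

# Average per-frame log-likelihood: the matching score for this speaker.
score = gmm.score(test_frames)
```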
d) Decision Making and Performance Measures
After classification, a decision is taken based on a threshold value: if the score is
greater than the threshold, the claim is accepted; otherwise it is rejected. The performance
of the system is measured in terms of acceptance and rejection rates, as listed below and
illustrated in the sketch after the list:
➢ False acceptance rate: The false acceptance rate (FAR) is defined as the ratio of
accepted imposter claims to the total number of imposter speakers.
➢ False rejection rate: The false rejection rate (FRR) is given by the ratio of
rejected client patterns to the total number of genuine speakers.
➢ Equal error rate: The equal error rate (EER) is the point where the FAR and FRR
curves intersect; a lower EER indicates better system performance.
➢ Total success rate: The total success rate (TSR) is obtained by subtracting the
EER (in percent) from 100.
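A minimal sketch of computing these measures follows; the genuine and imposter score distributions are synthetic stand-ins.

```python
# Sweep a decision threshold over genuine and imposter scores, compute FAR
# and FRR at each point, and take the EER where the two curves cross.
import numpy as np

rng = np.random.default_rng(7)
genuine = rng.normal(2.0, 1.0, size=500)    # scores from true speakers
imposter = rng.normal(0.0, 1.0, size=500)   # scores from imposters

thresholds = np.linspace(-4, 6, 1000)
far = np.array([(imposter >= t).mean() for t in thresholds])  # accepted imposters
frr = np.array([(genuine < t).mean() for t in thresholds])    # rejected clients

i = np.argmin(np.abs(far - frr))            # point where FAR and FRR meet
eer = (far[i] + frr[i]) / 2
tsr = 100 - eer * 100                       # total success rate, in percent
print(f"EER = {eer:.3f}, TSR = {tsr:.1f}%")
```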