Low Bit Rate Speech Coding
Carl Kritzinger
April 2006
Declaration
I, the undersigned, hereby declare that the work contained in this thesis
is my own original work and that I have not previously in its entirety or
in part submitted it at any university for a degree.
Signature Date
Abstract
Despite enormous advances in digital communication, the voice is still the primary tool
with which people exchange ideas. However, uncompressed digital speech tends to require
prohibitively high data rates (upward of 64kbps), making it impractical for many appli-
cations.
Speech coding is the process of reducing the data rate of digital voice to manageable
levels. Parametric speech coders, or vocoders, utilise a priori information about the mech-
anism by which speech is produced in order to achieve extremely efficient compression of
speech signals (as low as 1 kbps).
The greater part of this thesis comprises an investigation into parametric speech cod-
ing. This consisted of a review of the mathematical and heuristic tools used in parametric
speech coding, as well as the implementation of an accepted standard algorithm for para-
metric voice coding.
With a view to possible ways of improving the existing coders, we investigated the mathematical structure underlying parametric speech coding. From this arose a new algorithm for parametric speech coding which yielded promising results under both objective and subjective evaluation.
A further contribution of this thesis is a comparative subjective evaluation of the effect of parametric coding on English and Xhosa speech. We studied the effectiveness of two different encoding algorithms for the two languages.
To my father, for his quiet greatness.
Contents
Acknowledgements
1 Introduction
1.1 History of Vocoders
1.2 Objectives
1.3 Overview
5 MELP Implementation
5.1 Analysis
5.1.1 Pre-Processing
5.1.2 Pitch Estimation Pre-Processing
5.1.3 Integer Pitch
5.1.4 Fractional Pitch Estimate
5.1.5 Band-Pass Voicing Analysis
5.1.6 Linear Predictor Analysis
6.4 Conclusion
8 Evaluation of Vocoders
8.1 Speech Corpus
8.1.1 Recording Artifacts
8.1.2 Utterances
8.2 Objective Tests
8.2.1 PESQ
8.2.2 Rate-Distortion Curves
8.2.3 Language Bias in Rate-Distortion Curves
8.2.4 Discussion of Objective Test Results
8.3 Subjective Tests
8.3.1 Test Conditions
8.3.2 Subjective Test Overview
8.3.3 Results of Subjective Tests
8.3.4 Discussion of Subjective Test Results
8.4 Discussion of Disparity between Subjective and Objective Tests
8.5 Conclusion
G PESQ
G.1 Purpose
G.2 Limitations
G.3 Gain adjustment
G.4 Time Alignment of the Signal
G.5 Perceptual Model
List of Figures
5.11 Spectral distortion histogram plots for various speakers. In each sub-figure, the histogram of SD occurrence for a single speaker is shown.
8.1 Overall (Combined English and Xhosa) Rate-Distortion Curve for Regular and IS-MELP
8.2 Language Dependence of MELP Rate-Distortion Trade-off using regular sampling
E.1 2-D Example of a vector quantiser. In the above example the points indicated by c_n, c_{n+1} and c_{n+2} represent the various codebook entries. The point labelled x is a vector to be encoded. The region associated with each codebook entry is indicated. x lies in the region associated with c_{n+1} and as such will be encoded as n+1 and decoded as c_{n+1}.
E.2 Multi-Stage Vector Quantiser
E.3 Transform Vector Quantisation
D.1 Some phoneme classes which have significantly different frequency of occurrence in Xhosa and English.
Symbols
H Entropy
H(z) Transfer Function of a Linear System
R Information Rate
Φ Covariance Matrix of a Random Process
s[n] Speech Signal
ŝ[n] Approximation to the Speech Signal s[n]
e[n] Error signal
w[k] Windowing function e.g. Hamming window.
z Complex variable
E Error (usually a scalar function of a vector space)
Acknowledgements
My supervisor, Dr. Thomas Niesler, a huge thanks for his almost infinite patience
with my ramblings, for his enormous insight and positive energy and his tranquility.
Gert-Jan for his relentless enthusiasm, his thesis template and for fearlessly sticking
his neck out to get me into the DSP lab in the first place.
Trizanne for listening, for being there and for believing in me even when I didn’t.
All the DSP lab rats for being a great bunch of people to be around and for providing
an endless supply of laughter, advice and distraction.
Chapter 1
Introduction
Low rate or multi rate implementation to cater for applications where bandwidth is
limited.
Possibly the two greatest factors contributing to this lack of popularity were:
2. Vocoders tend to be computationally intensive. Flanagan et al. [30] noted that the
computational complexity of vocoders is positively correlated with the degree of compression they achieve.
In the 1980s vocoders began to achieve recognition. Technology had improved substantially in the field of low-power digital signal processors (DSPs). This meant that the computational and memory requirements for a vocoder to run in real-time in a field unit were no longer impossible to meet.
Vocoder usage was further driven by need for secure voice communications. Analog
voice is notoriously difficult to encrypt efficiently, whereas digital voice may be easily
encrypted to a very high degree of security. Particularly in military applications, the
need for security far outweighed the importance of natural sounding voice.
In 1984 this led to the adoption of the first standard for digital voice communication, FS-1015, which described a vocoder known as LPC10e.
In subsequent years, vocoder development has been driven by a number of primary
applications:
2. In long haul communications via high-frequency (HF) radio (HF is the term used to describe the radio spectrum between 3 and 30 MHz), received analog speech is typically of extremely poor quality due to severe transmission channel effects such as interference, noise and multi-path propagation. A full treatment of the subject may be found in Betts [7]. However, reliable data transmission is possible even in extremely poor conditions [71]. This means that there is significant scope for digital voice over HF radio links.
3. Voice Over Internet Protocol (VOIP) has been increasing enormously in popularity due to its potential for extremely low-cost long distance telephony [58]. However, the restrictions imposed by TCP/IP protocols mean that voice coders must tolerate lost data and long transmission delays as well as varying data throughput, since fluctuations in network load may result in substantial variation in the available bandwidth.
Additionally, vocoders have been used in niche applications such as satellite commu-
nications, voice recorders [15] and more esoterically to modulate voices in music [86].
1.2 Objectives
The main focus of this project will be an investigation of the implementation of a
low bit-rate vocoder. By low bit-rate we mean a vocoder which has a bandwidth
requirement of at most 2400 bits per second.
There are currently several published vocoder standards which may be suitable for
this application such as LPC, CELP and MELP. The first stage of the thesis will
comprise an overview and understanding of the current vocoder standards and the
selection of a standard for implementation.
Once a suitable candidate has been chosen, it will be implemented and evaluated in
a high level language such as MATLAB to be used as a reference implementation.
This stage of the project will comprise an investigation into the shortcomings of the reference implementation and into potential avenues of improvement.
Finally, the performance of the reference and improved vocoder designs will be tested
to compare their respective performance for various languages. Performance of the
vocoders will be measured with subjective listener tests.
1.3 Overview
The structure of this thesis will be as follows:
Chapter 2 In this chapter we will discuss the various approaches which have historically
been used to reduce the bandwidth of the speech waveform.
Chapter 3 In this chapter we will discuss the background knowledge essential to the
understanding of current voice coding techniques.
Chapter 4 In this chapter we will discuss some current voice coding techniques and
standards.
Chapter 6 In this chapter we will examine some of the mathematical properties of the
operation of a parametric voice coder in an attempt to improve on parametric voice
coding.
Chapter 9 In this chapter we will present our conclusions and recommendations for
further work.
Chapter 2
An Overview of Voice Coding Techniques
The aim of a speech coder is fundamentally that of any data compression system: to rep-
resent information as efficiently as possible.
Claude Shannon [76] introduced three fundamental theorems of information. One of
these is the source-coding theorem which established a fundamental limit on the rate at
which the output of an information source may be transmitted without causing a large
error probability.
In the most naive approach, and following the ideas of Shannon, we may regard the
speech signal as the random output produced by a source. The source is characterised in
two fundamental ways:
By the entropy of the source, which describes how much information it outputs per
symbol.
Most of Shannon’s work deals with a signal consisting of discrete symbols from a finite
alphabet. The speech waveform, however, is a continuous-time signal. This does not pose an insurmountable difficulty, since speech may be sampled and quantised without significant loss of information, as we will describe in section 2.2. The quantised samples may then
be regarded as the alphabet of the speech source.
We regard the speech samples as the output of a random source with entropy H. According to Shannon, if we then encode the speech so that we transmit information at a rate R, the following will hold:
1. If R > H then it is possible to encode the speech so that the probability of error is
arbitrarily small.
2. If R < H then the error probability will be non-zero, regardless of the complexity
of our coding algorithm.
Unfortunately, the above source coding theorem does not enlighten us as to the actual
encoding scheme which we need to use in order to achieve such efficient compression. Voice
coders all represent algorithms which attempt to minimise R − H while simultaneously
minimising the probability of error.
The way in which this is achieved may be divided into three broad categories:
Waveform Coders
Parametric Coders (Vocoders)
Segmental Coders
Phoneme information; This would roughly be the textual information which the speech
represents.
Speaker information; This would be the components of the waveform needed to characterise the speaker at least as well as could be expected of a good Machine Speaker Recognition system or a human listener.
Prosody information; This would be the intonation and phrasing with which the utterance is spoken.
We assume that each of these information bearing components of the speech wave-
form is modulated at a different rate. Furthermore we will assume that the three compo-
nents are statistically independent. The last assumption is perhaps somewhat unrealistic
(one would be extremely surprised if the potential prosodic content of the phrases ‘good
morning’ and ‘go away now’ were statistically similar in spoken English). However, the
assumption makes the analysis much more tractable.
Approximate values for these parameters are given in table 2.1:
The information rate of the speech waveform under these conditions can therefore be estimated by summing the rates contributed by the individual components.
¹ One would expect that most people can recognise a known person from about 1 second of speech.
² Consider the number of different ways in which the single phoneme “Ah” may be phrased.
Table 2.1: Approximate values of the speech information parameters.
T1            0.1 sec
T2            10 sec
N_alphabet    26
N_speakers    1000
N_prosody     26
From the above derivation, it would therefore seem unlikely that a vocoder could
operate effectively at a data rate substantially less than this.
1. Sampling
2. Quantisation
We may regard the quantised signal as the original signal plus an error term.
The theory of scalar quantisation of a random variable is well documented in [67]. We call the variable to be quantised x, the quantised approximation to x is x_q and the quantisation error is e. Then

x_q = x + e.   (2.3)
If we assume that x is distributed on the interval (−X_max, X_max), then uniform quantisation of x using B bits will result in 2^B quantisation intervals of size Δ such that

2 X_max = 2^B Δ.   (2.6)

The variance of the quantisation error is then

σ_e^2 = Δ^2 / 12.   (2.11, 2.12)
Thus, if we apply simple linear quantisation to the samples, we may expect to obtain an SNR of approximately:

SNR(dB) = 10 log10( σ_x^2 / σ_e^2 )   (2.13)
        = (20 log10 2) B + 10 log10 3 − 20 log10( X_max / σ_x )   (2.14)
        = (20 log10 2) B + 10 log10 3 − 20 log10( 4σ_x / σ_x )   (2.15)
        ≈ 6B − 7   (2.16)
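To make the roughly 6 dB-per-bit behaviour of equation 2.16 concrete, the following Python sketch applies a uniform B-bit quantiser to zero-mean Gaussian samples with X_max chosen as 4σ_x and compares the measured SNR with 6B − 7. The function names and the clipping of out-of-range samples are illustrative assumptions, not part of the derivation above.

import numpy as np

def uniform_quantise(x, n_bits, x_max):
    # Uniform quantisation onto 2^B levels spanning (-x_max, x_max): 2*x_max = 2^B * delta.
    delta = 2.0 * x_max / 2 ** n_bits
    x_clipped = np.clip(x, -x_max, x_max - delta)
    return delta * np.floor(x_clipped / delta) + delta / 2.0

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # sigma_x = 1, so X_max = 4*sigma_x = 4
for b in (4, 6, 8, 10):
    xq = uniform_quantise(x, b, x_max=4.0)
    e = xq - x                            # quantisation error, as in equation 2.3
    snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean(e ** 2))
    print(f"B = {b:2d}   measured SNR = {snr_db:5.1f} dB   6B - 7 = {6 * b - 7} dB")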
Over a complete signal, the SNR is measured as

SNR = 10 log10( Σ_n (s[n])^2 / Σ_n (e[n])^2 ).   (2.18)
SNR is not a robust measure of speech quality but in high bandwidth applications
it is generally regarded as sufficiently accurate to provide a means of comparing various
encoding schemes [20, 33].
In a completely rigorous sense, any band-limited or sampled (digital) voice transmis-
sion system is a type of waveform coder.
More typical examples of waveform coders are PCM and its derivatives such as Differential PCM (DPCM)
and Adaptive Differential PCM (ADPCM) as well as Delta Modulation. These are de-
scribed in detail by Proakis and Salehi [68] as well as Goldberg [33].
The two most significant advantages of waveform coders are their low computational
complexity and their ability to compress and represent a wide variety of signals, such
as music, speech and side noise. This tends to make waveform coders much more ro-
bust against noisy input signals than other vocoders. Waveform coders typically operate
effectively on speech in the region between 16 and 256 kbps.
Parametric coders are interesting from a psycho-acoustic point of view because, al-
though the coding error between the reconstructed signal and the original signal may be
almost as large as the original signal, the original and reconstructed speech signals may
be perceptually almost identical. This implies that the SNR is a poor metric to describe
the perceptual ‘distance’ between speech samples.
Model Parameters
1. Estimate the envelope. This corresponds very closely to an estimate of the vocal
tract parameters.
2. Estimate the excitation signal. This corresponds closely to an estimate of the nature
of the glottal excitation.
Homomorphic Vocoders; Homomorphic vocoders use the short term cepstrum to rep-
resent envelope information. The idea of using the cepstrum to separate the low-
time from the high-time components of the speech waveform was first proposed by
Oppenheim [62].
Formant Vocoders; Formant vocoders use the positions of the formants to encode envelope information, as first investigated by Flanagan [29, 28]. Unfortunately the formants of typical speech waveforms are extremely difficult to track efficiently and are mathematically ill-defined. Various algorithms have been proposed to track formants [54], but these vocoders never became very popular.
Linear Predictive Coders; This is the class of parametric vocoder which has found
the greatest popularity in the literature and which has also been used in by far the
majority of voice coding standards. Linear predictive coders use a digital all-pole
filter (linear predictor) to describe the shape of the spectral envelope. Typically
the residual (linear predictor error) has about 10dB less power than the original
waveform [15].
The Buzz-Hiss Model; This model is used explicitly in LPC10e and is the simplest (and also most bit-efficient) of all the models described here. The Buzz-Hiss model is the first model which was successfully used in a voice coding standard. In this case the glottal excitation is simply modelled as being a pulse train (voiced) or white noise (unvoiced).
Harmonic excitation; In this model the excitation signal is composed of the sum of a
number of sinusoids. This method has been used by Spanias [1] and McAulay [53].
Codebook Excitation; This was first proposed by Schroeder and Atal [5]. The idea is
that a large codebook of excitation signals is used and the excitation signal in the
codebook which most closely matches the glottal excitation is used.
In segmental voice coding, feature vectors are calculated for segments of the speech signal. These feature vectors are compared to the pre-calculated feature vectors for segments of speech in a database. The index of the segment in the database which is closest to the original segment is transmitted. To recreate the speech signal, the successive transmitted indices are decoded to speech segments which are then concatenated. This is illustrated diagrammatically in figure 2.2. In the most extreme cases the encoder effectively becomes a speech-to-text converter and the decoder a text-to-speech system, as described in [49].
[Figure 2.2: Block diagram of a segmental coder. Features are calculated from the input speech signal and compared against features calculated from a speech database; the index of the closest segment is transmitted, looked up in a speech segment database at the decoder, and the retrieved segments are concatenated to form the synthesised speech signal.]
Segmental coders are typically very efficient in terms of the compression which is achieved; data rates as low as 200 bps are claimed by Cernocky. However, the computational cost associated with the search through the database of speech segments means that real-time implementations of high quality segmental vocoders are not currently feasible.
Chapter 3
Fundamentals of Speech Processing for Speech Coding
3.1.1 Physiology
The primary components of the human speech production system are:
Lungs; The lungs produce the airflow and thus the energy required to generate vocal
sound.
Larynx; The main organ of voice production. The larynx provides periodic excitation
to the system for sounds that are referred to as voiced.
Pharyngeal Cavity, Oral Cavity and Nasal Cavity; These comprise the main or-
gan of modulation of the speech waveform. This will be described in more detail in
the next section.
The human speech production system also contains the following finer structures which
contribute to finer modulation of the speech. These include:
Soft Palate; The soft palate regulates the flow of air through the nasal cavity in order
to alternate between nasalised and non-nasalised sounds.
Tongue, Teeth and Lips; These organs contribute to the general shape of the vocal
tract. They are also used to form the class of phonemes referred to as plosives (see
3.2.1).
1. A source, which produces the signal energy. The excitation energy is almost always
generated in such a way that its spectrum is approximately flat. This component
corresponds to the function of the larynx or glottis in actual speech production.
2. A modulator component which ‘shapes’ the spectrum of the excitation. This corre-
sponds in the physical system to the vocal and nasal tract.
The most common model for the vocal tract is the so-called lossless multi-tube model. This means that the vocal tract is modelled as a series of concatenated open tubes.
The transfer function for a single lossless tube in the complex plane (z) can be shown to be [20]:

H(z) = 1 / cos(zl/c)

with l the tube length and c the speed of sound in air (340 m/s).
After a substantial but very standard derivation (see [20, 15]) the transfer function of a P-section lossless multi-tube system is found to be:

H(z) = ( z^(−P/2) ∏_{k=1}^{P} (1 + ρ_k) ) / ( 1 − Σ_{k=1}^{P} b_k z^(−k) )
[Figure: Source-filter model of speech production. Speech information drives an excitation source whose output passes through a shaping filter to produce the speech signal.]
Figure 3.2: Diagram of the Lossless Tube Model of the Vocal Tract (a cascade of tube sections with cross-sectional areas A1 to A7, driven by the excitation and producing speech).
3.2.1 Excitation
While the excitation makes a key contribution toward synthesising ‘natural’ sounding
speech, the spectral envelope is usually the dominant feature used by both humans and
machines in phoneme classification and speaker recognition. Thus, for the purposes of
voice coding it is generally considered sufficient to describe the excitation of a phoneme
as being in one of the following classes:
Voiced; Periodic movement of the vocal folds resulting in a stream of quasi-periodic puffs
of air.
Plosive; Release of pressure built up behind a completely closed portion of vocal tract.
Whisper; Air forced through partially open glottis to excite an otherwise normally ar-
ticulated utterance.
3.3.1 Masking
Frequency masking is the term commonly used for the phenomenon which occurs when
certain sounds are rendered inaudible by other sounds, usually closely spaced in frequency
and of greater amplitude. The generally used model is that of a triangular masking curve around each frequency component in the spectrum: in other words, any tone f1 whose amplitude A1 falls below the masking curve of a nearby, louder tone will be inaudible.
[Figure: Triangular masking area around a masking tone f0. Nearby tones f1 and f2 whose amplitudes fall within the shaded masking area are inaudible. Horizontal axis: Frequency.]
Temporal masking is similar to frequency masking except that the tones are separated
in time instead of simply in frequency. The effect of temporal masking is typically from
5ms before the onset of the masking tone until 200ms after the masking tone ends.
3.3.2 Non-Linearity
Loudness Perception
The result of this relation is that a doubling in the sone value of a sound is equivalent to
a doubling of the perceived loudness of the sound.
Pitch Discrimination
The smallest change in pitch which humans can recognise is not a constant quantity,
but is dependent on the frequency of the original pitch. In the frequency band which
is of interest to us (between 500 and 4000Hz), changes in frequency of around 0.3% are
noticeable [52].
The explanation for the pitch discrimination abilities of the ear is usually described as
being related to the critical bands. The auditory system works by decomposing sounds
into component frequencies. Thus the ear acts as if it is composed of a number of band-
pass filters. The bandwidth and centre frequencies of these filters are known as the critical
bands. The critical bands affect the resolution with which different pitch frequencies may
be discriminated.
It is generally accepted [39] that the critical bands are not regularly distributed in
frequency. Therefore it is desirable to define a frequency scale along which the critical
bands are regularly distributed. One commonly used frequency scale is the so-called mel scale. The frequency in Hz is transformed to frequency in mel using equation 3.4:

m(f) = 1125 log_e(1 + f/700)   (3.4)

(Footnote: p0 = 20 μPa, the reference sound pressure [52].)
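Equation 3.4 translates directly into code. The short Python sketch below (an illustration; the example frequencies are arbitrary) converts between Hz and mel.

import math

def hz_to_mel(f_hz):
    # Mel scale of equation 3.4: m(f) = 1125 * ln(1 + f/700).
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of equation 3.4.
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

for f in (500.0, 1000.0, 2000.0, 4000.0):
    print(f"{f:6.0f} Hz -> {hz_to_mel(f):7.1f} mel")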
Table 3.1: Analysis frame length of some common vocoders (from [33]).
It is impractical to use segments of substantially more than 25ms since this would
mean that a large number of segments would substantially overlap phoneme boundaries.
This in turn would usually mean that the parameters calculated for the analysis segment
would be a mixture of the parameters for the various phonemes in the analysis segment,
weighted by the duration for which the phoneme appears in the analysis segment. Clearly
this is undesirable since we want to transmit the parameters of the individual phonemes
as distinctly as possible and we wish to avoid possible cross-talk between phonemes which
will diminish the accuracy of the parameter estimation process.
The optimal solution to this problem would be to align the analysis segments with the
phoneme boundaries. This would present quite a challenge since not only are phoneme
boundaries typically quite difficult to characterise algorithmically, but phonemes also vary
substantially in length (between approximately 40 and 400 ms for vowels [20]).
Linear prediction has been the basis for so many practical and theoretical
results that it is difficult to conceive of modern speech technology without it.
There are a number of reasons for this. First and foremost is that linear prediction
using an all-pole filter very closely models the physical model of speech production as
shown in 3.1. A full treatment of the open-tube model of speech production can be found
in [20], demonstrating its equivalence to the all-pole model for speech production.
Given a stationary output signal, s(n), produced by an auto regressive moving average
(ARMA) process with transfer function H(z), driven by an input sequence s0 (n), we
denote the spectrum of s(n) by Θ(z) and that of s0 (n) by Θ0 (z). Further, we can write
the output as a function of the parameters of the process and the input in the time domain
as follows:
In terms of the source-filter model discussed in 3.2, s(n) is the speech waveform, s0 (n) is
the excitation waveform produced by the source and H(z) represents the transfer function
of the vocal tract during the formation of the current phoneme. Here the notion of quasi-
stationarity is once again relevant; we regard the ARMA process that is the vocal tract
as having constant parameters over the entire segment of speech under consideration.
Following Chu [15], we ignore the zeros of the transfer function for the following
reasons:
1. We can represent the magnitude spectrum of the speech sufficiently well with an
all-pole system. Thus we lose only the phase of the speech signal through this
generalisation. The human ear is effectively ‘phase-deaf’, thus the phase of the
output signal may be regarded as redundant information.
2. The poles of an all pole system can be determined from the output, s[n], by simple
linear equations. In the case of LP analysis, this output is all the information we
have available, since we have no explicit information about the excitation energy
produced by the glottis.
Then as shown in [15] and [20] we can re-write the system as an all-pass system in series with a minimum phase system, in series with a real-valued gain. Thus

S(z) = Θ_0(z) Θ_min(z) Θ_ap(z) E(z)
where
a = [a(1), a(2), a(3) . . . a(I)]T
and
s(n) = [s(n − 1), s(n − 2), s(n − 3), . . . s(n − I)]T
Linear Prediction Coefficients
These are the most obvious of the representations and are simply the elements of the
vector a as shown above. They are exactly the coefficients (taps) of the direct form 1
realisation of the predictor.
Working directly with the LP coefficients has a number of advantages.
1. The transfer function of the Linear Predictor may be easily manipulated using the
LP coefficients. As will be shown in 5.1.6, it may be advantageous to manipulate
the ‘optimal’ linear predictor to produce perceptually better results.
2. The simplest algorithm for linear prediction synthesis is the direct form 1 filter
realisation [62]. This realisation of the linear predictor requires that we use the LP
coefficients.
Reflection Coefficients
The reflection coefficients are very strongly suggested by the lossless open-tube acoustic
model of speech production (see 3.1). A thorough treatment of the derivation of the
reflection coefficients from the physical parameters of the vocal tract can be found in [20].
The reflection coefficients have an obvious advantage over the predictor coefficients in that
they are bounded between -1 and 1. This makes them substantially easier to quantise
and to deal with on fixed-point architectures. The reflection coefficients can be used to
directly compute the output of the LP system by means of a lattice filter realisation. For
any set of LPCs of an LP system of order P, we can compute the equivalent set of RCs
by means of the following recursion [39].
ki = aii , i = P ...1
aij +aii aii−j
ai−1
j = 1−ki2
, j = 1...i
Similarly, we can use the following recursion to convert the set of reflection coefficients to
an equivalent set of predictor coefficients.
a_i^(i) = k_i,                                              i = 1, ..., P
a_j^(i) = a_j^(i−1) − k_i a_(i−j)^(i−1),                    j = 1, ..., i−1     (3.8)
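As a concrete check of the two recursions above, the following Python sketch converts a set of predictor coefficients to reflection coefficients and back. It follows the sign convention of the equations as written here (other texts differ in sign), and the example predictor values are arbitrary.

import numpy as np

def lpc_to_rc(a):
    # Step-down recursion: predictor coefficients a_1..a_P -> reflection coefficients k_1..k_P.
    a = np.asarray(a, dtype=float).copy()
    P = len(a)
    k = np.zeros(P)
    for i in range(P, 0, -1):
        k[i - 1] = a[i - 1]                       # k_i = a_i^(i)
        if i > 1:
            prev = np.zeros(i - 1)
            for j in range(1, i):                 # a_j^(i-1) = (a_j^(i) + k_i a_(i-j)^(i)) / (1 - k_i^2)
                prev[j - 1] = (a[j - 1] + k[i - 1] * a[i - j - 1]) / (1.0 - k[i - 1] ** 2)
            a[: i - 1] = prev
    return k

def rc_to_lpc(k):
    # Step-up recursion (equation 3.8): reflection coefficients -> predictor coefficients.
    k = np.asarray(k, dtype=float)
    a = np.zeros(0)
    for i in range(1, len(k) + 1):
        new = np.zeros(i)
        new[i - 1] = k[i - 1]                     # a_i^(i) = k_i
        for j in range(1, i):                     # a_j^(i) = a_j^(i-1) - k_i a_(i-j)^(i-1)
            new[j - 1] = a[j - 1] - k[i - 1] * a[i - j - 1]
        a = new
    return a

a = np.array([0.35, 0.30])                        # arbitrary order-2 example
k = lpc_to_rc(a)
print(k, rc_to_lpc(k))                            # the round trip recovers a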
Log-Area Ratios
The physical equivalent of the Log-Area Ratio (and also the reason for the name) is
the natural logarithm of the ratio of the cross-sectional areas of adjacent sections of the
lossless tube model of the vocal tract. The log area ratios are not theoretically bounded
in any interval but they are usually distributed very near 0 and usually have a magnitude
of less than 2 [15].
The line spectrum frequencies represent another stable way of representing the LP system, such that small changes in the parameters produce small changes in the perceptual character of the system.
The line spectrum frequencies (LSFs) were first proposed by Itakura [40] as a representation of the linear predictor.
There are a number of advantages to using the line spectrum frequencies.
1. Line spectrum frequencies are bounded between 0 and π. This makes them highly
suitable for situations where numerical precision is limited, such as environments
using fixed point arithmetic.
2. The positions of the LSFs are closely related to the positions of the formants. This
makes them ideal for the simple calculation of perceptually motivated distance mea-
sures.
5. As long as the frequencies are ordered, the LSF representation will result in a stable
predictor. This means that by simply re-ordering the frequencies of an unstable
predictor we can create a stable one. With the reflection coefficients one may also
easily verify that the predictor is stable; however, with the LP coefficients one needs
to evaluate the transfer function in order to determine stability.
r = Φa
Where Φ is the covariance matrix of the speech segment and r is the biased autocor-
relation estimate.
The Levinson-Durbin algorithm uses only on the order of P^2 operations to solve this system. The L-D algorithm iteratively uses the optimal (P − 1)'th order predictor to determine the optimal P'th order predictor.
In Appendix C we present a derivation of the Levinson-Durbin recursion.
1. Initialisation: ∀ k ∈ {−(P−1), −(P−2), ..., (P−1), P}, set

ε_{0,k} = r_k

2. For l = 1, 2, ..., P, let

k_l = ε_{l−1,l} / ε_{l−1,0}   (3.9)
ε_{l,k} = ε_{l−1,k} − k_l ε_{l−1,l−k}   (3.10)
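For reference, the Python sketch below implements the standard autocorrelation formulation of the Levinson-Durbin recursion; it is written in the usual textbook form rather than in the notation above, and the toy autocorrelation sequence is only an illustration.

import numpy as np

def levinson_durbin(r, order):
    # Solve for an order-P predictor s_hat(n) = sum_i a_i s(n-i) from autocorrelations r[0..P].
    # Returns the predictor coefficients a, the reflection coefficients k and the residual power.
    a = np.zeros(order)
    k = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Partial correlation between the current prediction residual and the next sample.
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k[i] = acc / err
        a_new = a.copy()
        a_new[i] = k[i]
        a_new[:i] = a[:i] - k[i] * a[i - 1::-1][:i]   # update the lower-order coefficients
        a = a_new
        err *= (1.0 - k[i] ** 2)                      # prediction error shrinks at each order
    return a, k, err

s = np.cos(2 * np.pi * 0.05 * np.arange(200)) + 0.01 * np.random.default_rng(1).standard_normal(200)
r = np.array([np.dot(s[: len(s) - lag], s[lag:]) for lag in range(4)])
print(levinson_durbin(r, 3))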
Pitch
Pitch is defined as the frequency at which the glottis closes during voiced sounds. Pitch is
generally not defined for unvoiced sounds. The so called pitch marks correspond to glottal
closures and are characterised by strong impulses in the excitation signal and also in the
speech waveform. Generally, any given speaker will have a preferred pitch around which the
pitch of their speech will fluctuate. For male speakers, the pitch range is usually between
50-250Hz and for female speakers, the range is usually 120-500Hz [20].
Voicing
Voicing is a slightly more vague concept and is often not calculated explicitly, particularly
in higher-rate excitation coders such as CELP. However, we usually define voicing as the
presence or absence of periodic glottal closures during a phoneme. If few or no glottal
closures occur, the sound is referred to as unvoiced, whereas a number of regular glottal
closures would characterise a sound as voiced.
1. A pre-processing step
2. A pitch and voicing determination step
3. A post-processing step.
Pre-Processing
The pre-processing usually involves band-pass filtering to remove the high-frequency and
DC components of the signal. Most of the pre-processing algorithms are designed to
emphasise the high amplitude pulses which characterise the voiced speech. Common
techniques include:
Centre-clipping the speech. This involves applying the following non-linear operation to the speech signal:

s*(t) = s(t) if |s(t)| > T, and 0 otherwise.   (3.11)
Clearly, the choice of the clipping threshold (T ) is crucial. [20] suggest 30% of the
maximum value of the input signal as the clipping threshold. Centre clipping has
a whitening effect on the speech spectrum which may aid in pitch determination.
Additionally, because the pitch marks are generally of high amplitude, centre clip-
ping removes the effect of the vocal tract response while maintaining the excitation
pulse train [15, 20, 33, 70].
Raising the speech waveform samples to a large (odd - to preserve the sign) power.
Filtering with the inverse of the optimal linear predictor to obtain the predictor
residual. This idea was introduced by Markel [51] as part of the SIFT Algorithm
for fundamental frequency estimation. This removes the effect of the vocal tract
response from the speech waveform, thus resulting in a flat spectrum. As mentioned
below, the formants resulting from the vocal tract response may often interfere with
the pitch estimation by emphasising portions of the speech spectrum. Furthermore,
this predictor residual should closely resemble the glottal excitation.
Typically the instantaneous determination of pitch involves finding the candidate pitch
from a set of candidate pitches which maximises the candidate pitch score. Usually if the
best candidate pitch score is below a threshold the sound is classified as unvoiced.
The following short term features are commonly used to estimate the pitch and voicing
of a speech segment.
Short Time Autocorrelation; Because of the regularity of the pitch pulses, the speech
signal exhibits strong self-similarity at the pitch period, and thus the autocorrelation
of the speech signal exhibits a strong peak at the pitch period. However, since the
excitation is approximately an impulse train, the autocorrelation also exhibits peaks
at multiples of the pitch period.
Amplitude Magnitude Difference Function; The AMDF or MDF exhibits very similar properties to the auto-correlation except that it is minimised where the auto-correlation is maximised. The MDF of the speech segment s of length N is defined as (see also the sketch following this list):

MDF[τ] = Σ_{n=τ}^{N−1} |s[n] − s[n−τ]|
Short Time Cepstrum; The harmonics of the pitch period are combined by the short
term cepstrum. This typically results in a large peak at the pitch period in the high
time portion of the cepstrum.
Harmonic Product Spectrum; The harmonic product (or sum) spectrum combines the contributions of the harmonics of each candidate pitch, where

N = f_s / (2 P_max)   (3.12)

with f_s the sampling frequency at which the power spectrum is evaluated and P_max the maximum pitch candidate frequency. Of course the HPS is calculated by simply summing the log spectrum. A very complete description of the HSS/HPS may be found in [84].
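The Python sketch below illustrates two of these features: the segment is first centre-clipped as in equation 3.11, after which the short-time autocorrelation and the magnitude difference function are evaluated over a range of candidate pitch lags. The frame length, clipping fraction and test signal are arbitrary illustrative choices.

import numpy as np

def centre_clip(s, fraction=0.3):
    # Centre clipping of equation 3.11, with T chosen as a fraction of the peak amplitude.
    t = fraction * np.max(np.abs(s))
    return np.where(np.abs(s) > t, s, 0.0)

def pitch_candidates(s, fs, f_min=50.0, f_max=500.0):
    # Return the best pitch lag (in samples) according to the autocorrelation and the MDF.
    s = centre_clip(s - np.mean(s))
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    lags = np.arange(lag_min, lag_max + 1)
    autocorr = np.array([np.dot(s[:-lag], s[lag:]) for lag in lags])
    mdf = np.array([np.sum(np.abs(s[lag:] - s[:-lag])) for lag in lags])
    return lags[np.argmax(autocorr)], lags[np.argmin(mdf)]

fs = 8000
n = np.arange(240)                                    # one 30 ms analysis frame
s = np.sign(np.sin(2 * np.pi * 100.0 * n / fs))       # crude 100 Hz periodic excitation
lag_ac, lag_mdf = pitch_candidates(s, fs)
print(fs / lag_ac, fs / lag_mdf)                      # both estimates are close to 100 Hz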
Post-processing
Often in pitch processing, the use of non-linear filters such as median filtering may be
useful. This is described in detail in [70]. We also investigated the use of the non-linear
LULU filters as described in appendix F.
Pitch Harmonic Errors; Because the glottal excitation is often approximately an im-
pulse train, it has a spectrum which is close to an impulse train. Often, the higher
harmonics of the signal may be stronger than the fundamental, due to the location
of formants or simply due to noise in the estimation. This results in the estimated
pitch being an integer multiple or fraction of the actual pitch. These phenomena
are referred to as pitch doubling or pitch halving respectively.
Noise; The presence of background noise in the speech signal may result in pitch esti-
mation errors, particularly if the noise has strong periodic components.
Vocal Fry; While the pitch track is generally smooth, in some speakers it may occa-
sionally suddenly change substantially, particularly at the end of a voiced phoneme.
The smoothness constraints imposed by most pitch tracking algorithms may cause
problems in these cases [20].
Formant Interference; It is often difficult to separate the first formant (which lies between 150-800Hz [20], overlapping with the region of allowable pitch frequencies) from the pitch frequency. As mentioned above, this problem may be alleviated by
performing the pitch estimation algorithm on the linear predictor residual instead
of on the raw speech waveform.
In order to counteract this dilemma, some speech coders attempt to make the voicing
decision less binary, by allowing for a smoother transition between voiced and unvoiced
frames. MELP [55] does this by mixing voiced and unvoiced excitation and introducing
aperiodic pulses in weakly voiced segments (see 4.3.1). CELP [5] does this using a closed
loop analysis with various excitation vectors.
Broadcast Quality; This is approximately the sort of quality one would expect from
CD recordings of speech, at data rates upward of 256kbps.
Network or Toll Quality; This is approximately the quality one would expect over a
standard 3kHz B/W telephone line, or which one can expect to achieve with data
rates of around 64kbps.
Synthetic Quality; This is intelligible speech which may sound unnatural and impair speaker recognition.
A subjective measure of voice quality is provided by the Diagnostic Rhyme Test (DRT)
originally credited to Fairbanks [27]. In the DRT, listeners are asked to distinguish be-
tween phonetically similar ‘rhyming’ words, such as ‘heat’ and ‘meat’. Later, an enhanced
version of the DRT was presented by House [38]. This required that the listener listen to an utterance and decide which of six candidate (printed) words the utterance represented. This enhanced test is known as the Modified Rhyme Test (MRT). According to Goldberg [33] it is seldom used. While these tests provide very accurate assessments of
the features which contribute to making a voice coding system acceptable to the user,
such as the naturalness of the synthesised voice.
It is likely with this last idea in mind that Voiers in 1977 [85] proposed the Diagnostic
Acceptability Measure (DAM) as a measure of the quality of synthesised speech. The
DAM requires that listeners evaluate speech on 16 different scales, divided into 3 cate-
gories: signal quality, background quality and total quality. Each of these is divided up
into descriptive sub-categories such as: Fluttering, Muffled, Tinny, Rumbling, Buzzing,
Hissing, Intelligible, Pleasant. Further details may be found in [20].
The importance of certain features in determining the quality of speech may differ sub-
stantially between individual listeners. Thus it may be desirable to have a much broader
measure of quality for speech. The mean opinion score or MOS is obtained as follows: A
subject is presented with a synthesised speech segment. The subject then assigns a sub-
jective score to the synthesised speech based on the perceived quality. Listeners assign a
score (or absolute category rating) to each synthesised segment according to table 3.2 :
The average of the scores over all listeners over all segments for a particular voice coder
is known as the Mean Opinion Score (MOS). Mean opinion scores may vary substantially
from test to test and are thus not a good absolute reference of voice coding quality.
Nevertheless, they are generally used as the preferred method to evaluate the quality of a
voice coding system. Often, the MOS is used to demonstrate the superiority of one coder
over another [26] or simply to demonstrate the functionality of a coder [55].
The generally accepted relation of MOS to the quality categories mentioned above is
presented in table 3.3, the data for which is excerpted from [79].
corpus. These factors combine to create a very cumbersome test procedure, one which is
highly undesirable when developing a voice coding system, when a large number of design
decisions must be made. In the actual processing performed by a voice coder, comparison
of speech segments may also be necessary, for example when performing the closed-loop
analysis-by-synthesis encoding such as in CELP (described in section 4.2.1). Thus it is
desirable to be able to quickly compare two speech samples, preferably using an algorithm
which can be implemented on a computer. The following measures are commonly used
to compare speech segments.
The signal to noise ratio is familiar to anyone who has had any dealing with communi-
cations systems. In the case of speech signal we usually define the noise as the difference
between the original and synthesised voice. Thus the SNR is expressed as:
SNR = 10 log10( Σ_{n=0}^{M} s(n)^2 / Σ_{n=0}^{M} (s(n) − ŝ(n))^2 )   (3.13)
Segmental SNR
In other words it is the average of the short term SNR over many finite length speech
segments. The segmental SNR tends to penalise more strongly coders which have varying
quality.
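A minimal Python sketch of equation 3.13 and of a segmental SNR is given below; the frame length of 160 samples (20 ms at 8 kHz) and the averaging of the per-frame SNR in dB are assumptions of this illustration.

import numpy as np

def snr_db(s, s_hat):
    # Overall SNR of equation 3.13.
    e = s - s_hat
    return 10 * np.log10(np.sum(s ** 2) / np.sum(e ** 2))

def segmental_snr_db(s, s_hat, frame_len=160):
    # Average of the per-frame SNR (in dB) over consecutive non-overlapping frames.
    snrs = []
    for start in range(0, len(s) - frame_len + 1, frame_len):
        frame = s[start:start + frame_len]
        err = np.sum((frame - s_hat[start:start + frame_len]) ** 2)
        if err > 0 and np.sum(frame ** 2) > 0:
            snrs.append(10 * np.log10(np.sum(frame ** 2) / err))
    return float(np.mean(snrs))

rng = np.random.default_rng(0)
s = rng.standard_normal(1600)
s_hat = s + 0.1 * rng.standard_normal(1600)           # synthetic "coded" signal
print(snr_db(s, s_hat), segmental_snr_db(s, s_hat))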
Unfortunately neither of the above metrics have much bearing on the perception of the
speech and additionally they are very sensitive to features to which the ear is essentially
deaf. Consider the following examples:
1. Change the polarity of each sample, i.e. ŝ(n) = −s(n). This is a change which is imperceptible to the human ear, but it results in an extremely poor SNR.
2. Delay the speech by a single sample. At a sampling rate of 8kHz, this will imply
a delay of 0.125ms, which would be completely inaudible to the human ear. How-
ever, the substantial de-correlation between successive samples (high entropy) of the
speech signal will mean that the expected value of (s(n − 1) − s(n))2 will be large
and consequently the SNR will be poor.
Consequently, both the SNR and SEGSNR will reflect poorly in these cases, despite the
fact that the above signals are perceptually identical to the original (reference) signal.
While these may seem to be somewhat pathological examples we must consider the
fact that parametric coders tend to transmit information like phase extremely poorly.
Additionally, because of the frame-based approach to analysis taken by many parametric
coders we must assume that the correlation between the individual samples of the original
and synthesised speech will be quite poor.
We therefore wish to derive a more appropriate metric, in other words, one which posi-
tively correlates with the perceived distance of the speech samples. Additional constraints
are of course that the calculation of the metric should be computationally tractable.
A few metrics which have become popular, and which take more cognisance of per-
ceptual considerations are:
As we have mentioned before, the human auditory system is relatively insensitive to phase distortion. Therefore we may base our metric purely on the magnitude spectrum of the waveform. The Itakura-Saito pseudo-metric is formulated according to these ideas:
1. The two segments a(n) and b(n) may be compared by comparing their associated
optimal linear predictors (for some relevant predictor order). This is because the
linear predictor spectrum of a signal models the magnitude envelope of the signal
very well.
2. We may compare the linear predictor spectra by directly comparing the two vectors
of optimal LP coefficients for the respective segments (â and b̂).
Thus according to this metric the distance between the two segments is defined as the
relative accuracy with which the optimal linear predictor of one predicts the other. Also,
from the definition, since in general R̃_a ≠ R̃_b, d(a, b) ≠ d(b, a). Therefore we cannot call this distance a true metric but refer to it as a pseudo-metric. A complete description of this metric is found in the seminal paper by Itakura, Saito et al. [41].
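As an illustration of points 1 and 2 above, the Python sketch below computes one common LP-based form of this idea, d(x, y) = log( a_x^T R_y a_x / a_y^T R_y a_y ), where a_x and a_y are the prediction-error filters of the two segments and R_y is the autocorrelation matrix of segment y. The particular expression, the predictor order and the test signals are illustrative choices and not necessarily the exact formulation of [41].

import numpy as np

def autocorr(x, max_lag):
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(max_lag + 1)])

def lp_error_filter(x, order=10):
    # Prediction-error filter [1, -a_1, ..., -a_P] of segment x via the normal equations.
    r = autocorr(x, order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def lp_distance(x, y, order=10):
    # Asymmetric LP-based distance: residual energy of y under x's predictor, relative to y's own.
    ax, ay = lp_error_filter(x, order), lp_error_filter(y, order)
    ry = autocorr(y, order)
    Ry = np.array([[ry[abs(i - j)] for j in range(order + 1)] for i in range(order + 1)])
    return float(np.log((ax @ Ry @ ax) / (ay @ Ry @ ay)))

rng = np.random.default_rng(0)
n = np.arange(800)
x = np.sin(2 * np.pi * 0.05 * n) + 0.05 * rng.standard_normal(800)
y = np.sin(2 * np.pi * 0.12 * n) + 0.05 * rng.standard_normal(800)
print(lp_distance(x, x), lp_distance(x, y))           # 0 for identical segments, larger otherwise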
The FW-SEGSNR has been formulated in several ways [6, 83], most of which follow
approximately the following pattern, as described in [20]:
FW-SEGSNR = (10/M) Σ_{j=0}^{M−1} [ Σ_{k=1}^{K} w_{j,k} log10( E_{s,k}(m_j) / E_{e,k}(m_j) ) ] / [ Σ_{k=1}^{K} w_{j,k} ]   (3.16)

We have omitted a factor 10 found within the summation, because it contributes only a constant term to the FW-SEGSNR and is thus not of interest. The energies of the signal (E_{s,k}) and noise (E_{e,k}) in each band (k) are weighted according to perceptual considerations and the choice of bands is usually also perceptually motivated.
The weighted spectral slope measure (WSSM) is also known as the Klatt Measure [47].
Every 12ms, a bank of 36 filters is used to calculate a smoothed short term spectrum.
The filters have bandwidth corresponding to the ear’s critical bands. This means that the
filters implicitly impart an equal perceptual weight to each critical band. The method
uses the above short term spectrum to estimate weighted differences between the spectral
slopes in each band. This gives us a metric which is relatively insensitive to differences in
formant peak height but very sensitive to differences in formant location. The WSSM
was rated very highly in [20] but [36] reports less positive results with this measure and
describes the Itakura-Saito measure as being more effective.
PESQ
In table 3.4 ρ̂ is the average correlation coefficient between the results of each objective
quality measure discussed in this section, and the MOS, measured over a large number
of conditions. The data in the table are extracted from [20], who do not explicity specify
the experimental conditions used to obtain the data.
In the case of each of the three SNR measures, we must note that these scores are only
calculated for waveform coders. None of the SNR measures is considered suitable for the
measurement of the performance of parametric coders.
Chapter 4
Standard Voice Coding Techniques
The following sections are intended to illustrate the state of the art of voice coding tech-
niques. The sections on the various coders are arranged roughly in the chronological
order in which the coders were developed. This arrangement was chosen because, to a large extent, each design follows as a logical refinement of the preceding design.
The purpose of this filter is to improve the numeric stability of the LP analysis. As men-
tioned in 3.4.2, the speech waveform typically exhibits a high-frequency roll-off. Reducing
this roll-off decreases the dynamic range of the power spectrum of the input speech, re-
sulting in better modelling of the features in the high frequency regions of the speech
spectrum [15].
The transfer function of this filter is illustrated in figure 4.1.

[Figure 4.1: Magnitude (dB) and phase (degrees) response of the pre-emphasis filter over the band 0–4000 Hz.]
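A Python sketch of a typical first-order pre-emphasis filter and its inverse is given below; the coefficient value of 0.9375 is a commonly used choice assumed here for illustration, not a value quoted from the FS-1015 text.

import numpy as np

def pre_emphasis(s, alpha=0.9375):
    # First-order pre-emphasis H(z) = 1 - alpha*z^-1, which reduces the high-frequency roll-off.
    out = np.empty_like(s, dtype=float)
    out[0] = s[0]
    out[1:] = s[1:] - alpha * s[:-1]
    return out

def de_emphasis(s, alpha=0.9375):
    # Inverse filter 1 / (1 - alpha*z^-1), restoring the original spectral tilt after synthesis.
    out = np.empty_like(s, dtype=float)
    acc = 0.0
    for i, v in enumerate(s):
        acc = v + alpha * acc
        out[i] = acc
    return out

s = np.sin(2 * np.pi * 0.01 * np.arange(100))
print(np.max(np.abs(de_emphasis(pre_emphasis(s)) - s)))   # the round trip error is ~0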
4.1.2 LP Analysis
The LPC10e standard (FS1015) specifies that a covariance method with synthesis filter
stabilisation should be used to determine the LP spectrum of the speech. However, most
modern implementations instead use an autocorrelation approach due to its improved
numerical stability and computational efficiency and since this does not affect the inter-
operability of the vocoder at all. FS1015 favours a pitch synchronous LP analysis. This
means that the position of the LP analysis window is adjusted with respect to the phase
of the pitch pulses. This design improves the smoothness of the synthesised speech, since
the effect of the glottal excitation spectrum on the LP analysis of the speech is reduced
substantially.
2. Inverse filter the speech signal with a second-order approximation to the optimal
10th order predictor determined by the LP analysis.
3. Calculate the minimum value of the Magnitude Difference Function (MDF). The
MDF is not as accurate as the auto-correlation in pitch determination but was chosen
mainly for reasons related to computational efficiency [33]. On many architectures,
the computational cost of a multiplication was an order of magnitude larger than
that of an addition. On these architectures the MDF would be substantially more
efficient to calculate than the auto-correlation.¹
1. Low-band energy.
3. Zero-crossing rate.
Σ_{i=1}^{N} a_i p_i + c_j > 0,   where j ∈ {0, 1, ..., M−1}   (4.1)
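Equation 4.1 describes a set of linear discriminant tests on the measured features p_i. The Python sketch below shows this kind of test with entirely hypothetical feature values, weights a_i and offsets c_j; requiring every test to pass is likewise only one possible interpretation, not the rule prescribed by the standard.

import numpy as np

# Hypothetical feature vector p = [low-band energy, normalised autocorrelation, zero-crossing rate].
p = np.array([0.8, 0.65, 0.12])

# Hypothetical discriminant weights a_i and per-test offsets c_j (M = 2 tests here).
a = np.array([1.5, 2.0, -3.0])
c = np.array([-1.0, -1.5])

scores = a @ p + c                 # test j passes if sum_i a_i*p_i + c_j > 0 (equation 4.1)
voiced = bool(np.all(scores > 0))
print(scores, "voiced" if voiced else "unvoiced")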
Finally, a smoothing algorithm is applied to the voicing decision. This smoothing algo-
rithm is essentially a modified median smoother, which takes into account the voicing
strength of each frame [10]. This prevents the occurrence of single voiced segments within
unvoiced segments. The appearance of these single voiced frames may cause annoying
artifacts in the synthesised speech, typically producing isolated tones.
¹ However, most modern DSPs have highly sophisticated architectures, which can typically perform a multiplication and addition in a single clock cycle and are often optimised for FFT calculation [82]. Furthermore, there is a well-known relation that the IFFT of the Power Spectral Density is the auto-correlation [37]. If this theorem is used to calculate the auto-correlation, it is unlikely that the MDF will exhibit any substantial advantage in terms of computational complexity. For this reason most modern implementations of LPC10e use the auto-correlation estimate for pitch estimation.
In LPC10e a very simple scalar quantisation scheme is used, using a different code-
book for each LP parameter. Each codebook is optimised for the particular LP parameter
it is intended to encode. Two different quantisation schemes are used, depending on the
outcome of the voicing decision. These are detailed in table 4.1.
The choice of parameters to quantise, namely the LARs for the first two parameters
and the reflection coefficients thereafter is probably motivated by the results observed by
Gray and Markel [35], namely that the LARs are superior for encoding of the first two
parameters but thereafter present no substantial advantage over the reflection coefficients.
4.2.3 Post-filtering
The CELP post-filter was introduced to improve the perceptual quality of the synthesised speech. The post-filter is applied in the synthesiser after the LP filtering of the reconstructed excitation signal has occurred. In its simplest form the post-filter has the transfer function

h(z) = 1 / ( 1 + Σ_{i=1}^{P} a_i α^i z^{−i} )
With α chosen as a constant between 0 and 1. This post-filter reduces the perceived
noise level of the LP synthesis by emphasising the formant regions. However, as the
perceived reduction of noise increases, the synthesised speech acquires a ‘muffled’ quality,
since the post-filter generally has a low-pass spectral tilt. Thus in the choice of α, one
must compromise between reducing noise and reducing clarity of the synthesised speech.
A more sophisticated post-filter uses the transfer function:

h1(z) = ( 1 + Σ_{i=1}^{P} a_i β^i z^{−i} ) / ( 1 + Σ_{i=1}^{P} a_i α^i z^{−i} )
In other words it is the difference between the frequency responses of two bandwidth
expanded LP synthesis filters. However, the addition of zeros to the transfer function
does not completely remove spectral tilt and thus a first order IIR filter with transfer
function h2 (z) = (1 − μz −1 ) is often used in cascade with the above filter. This filter is
similar to the pre-emphasis filter and provides a high-pass spectral tilt to compensate for
the low-pass effect. A popular choice for μ is 0.5 [15].
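The Python sketch below builds the bandwidth-expanded numerator and denominator of h1(z), cascades the result with the tilt-compensation filter h2(z) = 1 − μz^-1, and applies it to a toy synthesised signal using scipy. The predictor coefficients and the values α = 0.8 and β = 0.5 are illustrative assumptions; μ = 0.5 follows the choice quoted from [15].

import numpy as np
from scipy.signal import lfilter

def postfilter(speech, a, alpha=0.8, beta=0.5, mu=0.5):
    # a holds the LP coefficients a_1..a_P of the synthesis filter 1 / (1 + sum_i a_i z^-i).
    a = np.asarray(a, dtype=float)
    powers = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], a * beta ** powers))    # 1 + sum_i a_i beta^i z^-i
    den = np.concatenate(([1.0], a * alpha ** powers))   # 1 + sum_i a_i alpha^i z^-i
    y = lfilter(num, den, speech)                        # formant-emphasising filter h1(z)
    return lfilter([1.0, -mu], [1.0], y)                 # tilt compensation h2(z) = 1 - mu*z^-1

a = np.array([-1.2, 0.7])                                # arbitrary stable 2nd-order predictor
x = np.random.default_rng(0).standard_normal(400)
speech = lfilter([1.0], np.concatenate(([1.0], a)), x)   # toy LP synthesis
print(postfilter(speech, a)[:5])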
The pitch synthesis filter has the transfer function

h_p(z) = G_p / ( 1 − a z^{−P̂} )

where G_p is a scaling factor, 0 < a < 1 and P̂ is an estimate of the number of samples in the pitch period [20]. The aim of the pitch filter is to provide the long-term correlation in the excitation signal which is perceived as the pitch of the synthesised voice.
4.3 MELP
The MELP model was originally developed by Alan McCree as a Ph.D. project and was
published by McCree and Thomas Barnwell in 1995 [55]. After some refinement, it was
submitted as a candidate for the new U.S. federal standard at 2.4kbps. MELP officially
became a U.S. federal standard in 1997, replacing LPC10e as the standard vocoder to
be used in secure and digital voice communication over low bandwidth channels. The
draft 2.4 kbps MELP standard can be found in [69].
The major shortcoming of LPC10e is the hard decision switching between voiced and
unvoiced segments. This simple excitation model fails to accurately represent certain
phoneme types, most notably fricatives. This results in a distinctly buzzy quality of the
synthesised voice.
The MELP speech production model attempts to soften this hard voicing decision by
introducing intermediate levels between phonemes which are purely voiced and
those which are purely unvoiced. These intermediate levels are achieved by dividing the
excitation into sub-bands, where each sub-band may be either voiced or unvoiced.
From [55] :
The most important feature of [the MELP model] is the mixed pulse and noise
excitation.
[Figure: The MELP mixed-excitation synthesiser. Five bandpass excitation generators (0–500 Hz, 500–1000 Hz, 1000–2000 Hz, 2000–3000 Hz and 3000–4000 Hz) are summed and passed through the linear predictor to produce the speech waveform.]
In the MELP analysis, the input waveform is filtered by a bank of FIR bandpass filters.
These filters are identical to the filters used to band-limit the excitation signals. This
produces 5 different band-limited approximations of the input speech signal. A voicing
strength is determined in each of these band-limited signals. This voicing strength is
regarded as the voicing strength for that frequency band.
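A simplified Python sketch of this per-band voicing analysis follows: the frame is split into the five MELP bands and the normalised autocorrelation at the pitch lag is taken as the voicing strength of each band. The use of Butterworth filters, the filter order and the normalisation are simplifying assumptions; the standard specifies its own FIR band-pass filters and decision logic.

import numpy as np
from scipy.signal import butter, lfilter

BANDS_HZ = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def band_filter(s, lo, hi, fs=8000, order=4):
    nyq = fs / 2.0
    if lo == 0:                                          # lowest band: a low-pass filter suffices
        b, a = butter(order, hi / nyq, btype="low")
    else:
        b, a = butter(order, [lo / nyq, min(hi, 0.99 * nyq) / nyq], btype="band")
    return lfilter(b, a, s)

def bandpass_voicing(s, pitch_lag, fs=8000):
    # Normalised autocorrelation at the pitch lag in each band, used as that band's voicing strength.
    strengths = []
    for lo, hi in BANDS_HZ:
        x = band_filter(s, lo, hi, fs)
        num = np.dot(x[:-pitch_lag], x[pitch_lag:])
        den = np.sqrt(np.dot(x[:-pitch_lag], x[:-pitch_lag]) * np.dot(x[pitch_lag:], x[pitch_lag:]))
        strengths.append(num / den if den > 0 else 0.0)
    return np.array(strengths)

fs, pitch_lag = 8000, 80                                 # a 100 Hz pitch
n = np.arange(360)
s = np.sign(np.sin(2 * np.pi * 100 * n / fs)) + 0.3 * np.random.default_rng(0).standard_normal(360)
print(np.round(bandpass_voicing(s, pitch_lag, fs), 2))   # voicing strength is higher in the lower bands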
These band limited excitation waveforms are added together to produce an excitation
signal which is partly voiced and partly unvoiced. In this way, the MELP excitation
signal is generated as a combination of bandpass filtered pulses and bandpass filtered
white noise. This substantially reduces the harshness of the voicing decision and removes
a great deal of the hissiness and buzziness of LPC10e.

Figure 4.4: Bandpass Excitation and Analysis Filters used in MELP Synthesis and
Analysis. The stop-band part of the transfer function has been omitted for clarity.
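A minimal sketch of how such a mixed excitation could be assembled is given below. The five pass-bands follow the model described above, but the filter length, the use of simple FIR filters designed with firwin and the white-noise source are assumptions made for illustration; the MIL-STD-3005 filter coefficients are not reproduced here.

    import numpy as np
    from scipy.signal import firwin, lfilter

    BANDS = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

    def mixed_excitation(pitch, n, voiced, fs=8000, ntaps=31):
        # pitch  : pitch period in samples (integer)
        # n      : number of excitation samples to generate
        # voiced : five booleans, one per band, True if that band is voiced
        pulses = np.zeros(n)
        pulses[::pitch] = 1.0                    # impulse train at the pitch period
        noise = np.random.randn(n)
        excitation = np.zeros(n)
        for (lo, hi), v in zip(BANDS, voiced):
            cutoff = [max(lo, 1.0), min(hi, fs / 2 - 1.0)]   # keep edges inside (0, fs/2)
            h = firwin(ntaps, cutoff, pass_zero=False, fs=fs)
            source = pulses if v else noise
            excitation += lfilter(h, [1.0], source)          # band-limited pulse or noise
        return excitation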
Figure 4.5 demonstrates the time signals and power spectra of some typical MELP
excitation sequences. Time domain sequences are shown on the top and power spectra of
each sequence is shown underneath. In the successive sequences we see how the excitation
varies between a completely voiced (a) and a completely unvoiced (d) mode. In the
intermediate cases, some of the bands are voiced and some are unvoiced, as can clearly be seen
in the power spectra.
In order to more accurately model the shape of individual pulses of the voiced excitation
vector, the MELP model calculates the strength of the various harmonics of the pitch
period. This is done by evaluating the peak values of the predictor error signal PSD near
the harmonics of the pitch period.
During synthesis, these magnitudes are used to generate an impulse using the inverse
Fourier transform of the measured strengths of the harmonics of the pitch (appropriately
padded to the pitch period). This results in a distorted pulse instead of a perfect impulse.
[Figure 4.5: time-domain waveforms (top) and power spectra (bottom) of typical MELP excitation sequences, ranging from completely voiced to completely unvoiced.]
This distorted impulse is repeated at the pitch period in order to create the excitation
pulse train. This distorted impulse train models the impulses caused by glottal opening
much more accurately than a simple impulse would, resulting in increased naturalness of
the synthesised voice. This is illustrated in figure 4.6.
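The sketch below illustrates this idea: the measured harmonic strengths are placed at the pitch harmonics of an otherwise flat spectrum, and an inverse FFT of length one pitch period produces the distorted pulse. The zero-phase assumption and the flat treatment of the unmeasured harmonics are simplifications made here for illustration.

    import numpy as np

    def harmonic_pulse(magnitudes, pitch):
        # magnitudes : strengths of the first K pitch harmonics
        # pitch      : pitch period in samples
        spectrum = np.ones(pitch // 2 + 1)          # unmeasured harmonics left flat
        k = np.arange(1, len(magnitudes) + 1)
        spectrum[k] = magnitudes                    # place measured harmonic strengths
        pulse = np.fft.irfft(spectrum, n=pitch)     # zero phase -> pulse-like waveform
        return np.roll(pulse, pitch // 2)           # centre the pulse in the period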
Figure 4.6: (a) Speech waveform (thin) and LP residual (thick); (b) Fourier spectrum of
the residual, with selected peaks indicated with circles.

Aperiodic Pulses

At voicing transitions the glottal excitation is often erratic, and a purely periodic pulse
train models it poorly. In the MELP model this problem is
solved by destroying the periodicity of the voiced excitation by introducing jitter. This
means that each pitch pulse is perturbed by up to 25% of the pitch period, simulating
the erratic glottal movement which is often observed at voicing transitions. However, we
cannot do this for strongly voiced frames. Introducing jitter in strongly voiced frames
introduces a croaking quality to the synthesised voice which is undesirable. It is therefore
crucial to accurately determine whether or not jitter should be introduced to a frame.
McCree [55] mentions that both aperiodic pulses and mixed excitation are needed to
remove the buzzy quality of the LPC synthesiser suggesting that the human ear is capable
of detecting separately both periodicity and peakiness. From [55]:

We believe the buzz associated with LPC vocoders comes from higher frequencies [...]

The peakiness of a signal x_n over a frame of N samples is defined as

\text{peakiness} = \frac{\sqrt{\frac{1}{N}\sum_{n=1}^{N} x_n^2}}{\frac{1}{N}\sum_{n=1}^{N} |x_n|} \qquad (4.2)
This quantity is large when there are a large number of samples which differ from
the mean by more than one standard deviation; in other words, when there are a large
number of outliers in the signal.
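Assuming this reading of equation 4.2, the measure is simply the ratio of the RMS value of the frame to its mean absolute value:

    import numpy as np

    def peakiness(x):
        # Ratio of RMS value to mean absolute value (equation 4.2); large for spiky signals.
        x = np.asarray(x, dtype=float)
        return np.sqrt(np.mean(x ** 2)) / np.mean(np.abs(x))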
McCree’s explanation for the introduction of jitter is that the buzz of LPC vocoders
is typically caused by the fact that the LP synthesis filter cannot adequately disperse the
energy of the pulse train used as excitation for the voiced frames. Adding jitter to the
mixed excitation model forcibly reduces this energy.
The last step in MELP synthesis is a very weak filter referred to as the pulse dispersion
filter. McCree and Barnwell [55] justify this filter with the following argument: The
filter increases the match between the envelopes of bandpass filtered synthetic speech and
natural speech waveforms in regions which do not contain a formant resonance. At these
frequencies, the synthetic speech can decay to very low power in periods between pitch
pulses. Particularly at lower pitch values, frequencies near the higher formants may decay
significantly between excitation pulses. This does not tend to occur in natural speech,
possibly because of incomplete glottal closure or secondary excitation pulses resulting
from glottal opening.
A sub-frame based pitch estimation algorithm is used which significantly improves perfor-
mance in comparison to the pitch tracking used in the Federal Standard. This algorithm
minimises the pitch-prediction residual energy over the frame, assuming that the opti-
mal pitch prediction coefficient will be used over every sub-frame lag. This algorithm is
substantially more accurate over regions of erratic pitch and speech transitions.
Noise Suppression
An averaged PSD is used to calculate an estimate of the noise power spectrum. The
estimate of the noise PSD is used to design a noise suppression filter.
Improved Quantisation
Fourier Coefficients
The Fourier coefficients are not transmitted at all in this vocoder, which gives a saving of
8 bits in return for a small loss in performance. However, this loss in performance seems
to be compensated for by the gains made in other parameters.
Other Parameters
1. Pitch and voicing are quantised using only six bits instead of seven.
The following obvious changes are made from the standard 2400bps coder described
in MIL-STD-3005:
Aperiodic Flag; The aperiodic flag is omitted from this version of MELP. Chamberlain
justifies this decision by stating that at this bit-rate, more significant improvements
may be obtained by better quantisation of the other speech parameters than by the
inclusion of the aperiodic flag.
Energy; The energy parameter of the MELP vocoder exhibits considerable inter-frame
redundancy. In order to exploit this redundancy, Chamberlain uses vector quan-
tisation of eight energy values (two per frame) for every super-frame. The vector
quantisation of the frames is trained using training data scaled by multiple levels,
in order to prevent input level sensitivity of the energy quantisation.
Fourier Magnitudes; Chamberlain opts to not transmit any of the Fourier magnitudes.
Instead, a single glottal excitation vector is used for each of the two excitation modes.
Chamberlain provides no specifics as to the nature of the excitation vector but sim-
ply states that it reduces the perceived harshness of the synthesised speech. He
further mentions that the loss in quality caused by the reduced rate minimises the
perceived effect of the degradation caused by not transmitting the Fourier magni-
tudes.
Spectrum; A four-stage vector quantisation is used, with the first two stages using ten
bits each and the final two stages using nine bits each. Chamberlain emphasises the
choice of training corpus. Since the quantiser relies on phoneme transitions more
than simply on phonemes, one needs a very representative set of data in order to
properly train the quantiser.
4.4 Conclusion
All three of the vocoders mentioned in this chapter use the LP speech model, again
confirming the statement made by Deller [20] about the ubiquity of this model. However,
the fundamental difference between the three models lies in the sophistication of the
excitation model. Certainly, the model employed by the CELP vocoder is the most
sophisticated of the three, but the simple MELP model proved to be at least as accurate
in modelling the glottal excitation. As [55] mentions, the MELP and CELP vocoders
achieve almost equivalent quality, despite the fact that the MELP vocoder is designed
to operate at one half the bit-rate of the CELP vocoder. This huge improvement may
be due in part to the more sophisticated algorithms used by MELP for pitch tracking,
parameter quantisation and adaptive enhancement of the synthesized speech. However,
it still presents a very convincing argument in favour of the MELP voice coding model.
We will use the MELP model as a basis for the following chapters and we will follow
the standard implementation described in [69] since it is the most well documented and
commonly used version of MELP.
Chapter 5
Implementation of a MIL-STD-3005
compliant MELP Vocoder
The MELP vocoder was chosen for implementation since it represents an extremely good
compromise between bandwidth efficiency and overall voice quality. Additionally it is a
very widely accepted vocoder and had been standardised both by the US (MIL-STD-3005)
and NATO ( STANAG 4591 ).
The MELP vocoder was implemented in MATLAB. The following chapter describes
its implementation.
The MELP vocoder may be divided into three components:
Analysis; The purpose of this component is to extract the parameters of the MELP
speech production model which would synthesise a waveform as close as possible to
the input speech waveform.
Encoding; The purpose of this component is two-fold. Firstly, the parameters are quan-
tised and secondly the parameters are encoded and packed into a bit-stream for
transmission.
Synthesis; The purpose of this component is to synthesise the speech waveform from
the model parameters.
We note that these components are symmetrical about the transmission. Implementing
the MELP vocoder as two separate layers allows us to verify the components individually,
as illustrated in figure 5.1. By synthesising with the un-quantised analysis parameters,
we may test the accuracy with which the MELP model can synthesise the target speech.
Additionally, by directly investigating the model parameters before and after a simulated
transmission we may investigate the quantisation errors directly as well as investigate the
effect of bit errors on the decoding of model parameters.
[Figure 5.1: verification of the vocoder layers — the analysis parameters are compared with the decoded parameters on either side of the encoded bit stream.]
5.1 Analysis
A block diagram of the MELP analysis system is presented in figure 5.2.
[Figure 5.2: block diagram of the MELP analysis system — LP analysis with bandwidth expansion of the predictor coefficients, a pitch pre-filter and LP inverse filter producing the LP residual, integer and fractional pitch estimation, low-pass filtering, peakiness measurement, Fourier magnitudes, gain, aperiodic flag and pitch outputs.]
5.1.1 Pre-Processing
The pre-processing of the speech consists solely of a DC removal. The input speech
is filtered with a 4th order Chebyshev high-pass filter with a cutoff frequency of 60Hz,
as shown in figure 5.3. The purpose of this filter is to remove extreme low-frequency
components of the input signal which are inaudible and would interfere with the parameter
estimation.
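A sketch of this pre-processing step is shown below. The filter order and cutoff follow the text; the choice of a type II Chebyshev design with 30 dB of stop-band attenuation is an assumption made here, since the exact design is given in the standard rather than in this description.

    from scipy.signal import cheby2, lfilter

    def remove_dc(speech, fs=8000):
        # 4th-order Chebyshev high-pass, 60 Hz cutoff (type II, 30 dB stop-band assumed).
        b, a = cheby2(4, 30, 60, btype='highpass', fs=fs)
        return lfilter(b, a, speech)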
Chapter 5 — MELP Implementation 55
−5
−10
−15
|H(f)| (dB)
−20
−25
−30
−35
−40
0 500 1000 1500 2000 2500 3000 3500 4000
Frequency (Hz)
−50
|H(f)| (dB)
−100
−150
−200
−250
0 500 1000 1500 2000 2500 3000 3500 4000
Frequency (Hz)
The normalised autocorrelation at lag τ is defined as

r(\tau) = \frac{c_\tau(0, \tau)}{\sqrt{c_\tau(0, 0)\, c_\tau(\tau, \tau)}} \qquad (5.1)

where

c_\tau(m, n) = \sum_{k=-\lfloor\tau/2\rfloor - 80}^{-\lfloor\tau/2\rfloor + 79} s_{k+m}\, s_{k+n}. \qquad (5.2)
The interpolated maximum value of the auto-correlation for this lowest band is used
as the overall voicing strength of the frame.
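A direct transcription of equations 5.1 and 5.2 is sketched below; the 160-sample summation window follows the limits in equation 5.2, while the handling of the window centre and of odd lags is an assumption of this sketch.

    import numpy as np

    def normalised_autocorrelation(s, tau, centre):
        # r(tau) of equation 5.1 using the 160-sample window of equation 5.2, placed
        # around sample index `centre`; the caller must leave room for window and lag.
        def c(m, n):
            k = np.arange(-(tau // 2) - 80, -(tau // 2) + 80) + centre
            return np.sum(s[k + m] * s[k + n])
        return c(0, tau) / np.sqrt(c(0, 0) * c(tau, tau))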
5.1.8 Peakiness
The notion of peakiness has previously been used in section 4.3.1. High peakiness of the predictor
error signal is of course a strong indication of a voiced sound. The peakiness of the residual
signal is calculated using equation 4.2. High peakiness values in the LP error signal cause
the band-pass voicing strengths to be forced to high values, since this implies that the
excitation was contained in a number of high amplitude samples and this type of excitation
is characteristic of voiced speech.
5.1.11 Gain
The gain of the segment is calculated over two sub-frames, one centred on the centre of
the analysis frame and referred to as G_1, the other centred on the end of the analysis
frame and referred to as G_2. The length of these sub-frames is set as the lowest multiple
of the pitch period greater than 120 samples.
5.2 Encoding
5.2.1 Band-Pass Voicing Quantisation
The voicing of the segment is quantised according to the following rules:
1. If the lowest band voicing strength is less than 0.6, the frame is regarded as unvoiced
and all higher band voicing strengths are set to 0.
2. If the lowest band voicing strength is greater than 0.6, the frame is regarded as
voiced and all higher band voicing strengths are quantised to 0 only if their voicing
strength is less than 0.6.
3. If only the highest band voicing and the lowest band are voiced then the highest
band is forced to be unvoiced.
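These rules translate directly into code; the sketch below assumes five voicing strengths ordered from the lowest to the highest band.

    def quantise_voicing(strengths, threshold=0.6):
        # Rules 1-3 above applied to the five band voicing strengths (lowest band first).
        if strengths[0] < threshold:                     # rule 1: frame is unvoiced
            return [0, 0, 0, 0, 0]
        voiced = [1] + [1 if s >= threshold else 0 for s in strengths[1:]]   # rule 2
        if voiced == [1, 0, 0, 0, 1]:                    # rule 3: isolated highest band
            voiced[4] = 0
        return voiced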
The LSF vector is quantised using a four-stage MSVQ. The codebook consists of a first stage of
128 entries and three subsequent stages of 64 entries each. The quantisation metric used is a
simple weighted Euclidean distance between the quantised and un-quantised LSF vectors:
d(f, \hat{f}) = \sum_{i=0}^{9} w_i \left(f_i - \hat{f}_i\right)^2 \qquad (5.5)

The weighting vector of the metric is calculated as follows: let P(f) = \frac{1}{A(2\pi f / f_s)}, i.e. let P(f) be the inverse prediction filter transfer function evaluated at f Hz. Then

w_i = \begin{cases} P(f_i)^{0.3}, & i = 0 \ldots 7 \\ 0.64\, P(f_i)^{0.3}, & i = 8 \\ 0.16\, P(f_i)^{0.3}, & i = 9 \end{cases} \qquad (5.6)
Since the codebook used for quantisation is essential to the interoperability of the
system, it is presented in the standard.
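The weighted distance of equations 5.5 and 5.6 can be sketched as follows. The direct evaluation of |1/A(f)| from the predictor coefficients and the 8 kHz sampling rate are assumptions of this sketch.

    import numpy as np

    def lsf_weights(lsf_hz, a, fs=8000.0):
        # Weights of equation 5.6: P(f_i)^0.3, with the two highest LSFs de-emphasised.
        w = np.empty(len(lsf_hz))
        for i, f in enumerate(lsf_hz):
            z = np.exp(-1j * 2 * np.pi * f / fs * np.arange(1, len(a) + 1))
            P = 1.0 / abs(1.0 + np.dot(a, z))            # inverse prediction filter |1/A(f)|
            w[i] = P ** 0.3
        w[8] *= 0.64
        w[9] *= 0.16
        return w

    def lsf_distance(f, f_hat, w):
        # Weighted Euclidean distance of equation 5.5.
        return float(np.sum(w * (np.asarray(f) - np.asarray(f_hat)) ** 2))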
The Fourier magnitudes are quantised using a weighted Euclidean distance, with weights given by

w_i = \frac{117}{25 + 75\left(1 + 1.4\left(\frac{f_i}{1000}\right)^2\right)^{0.69}}, \qquad i = 1, 2, \ldots, 10 \qquad (5.7)
[Figure: the weight w_i of equation 5.7 plotted against frequency.]
For unvoiced frames, the following parameters are not transmitted:

1. Fourier magnitudes

2. Band-pass voicing

3. Aperiodic flag
This makes available thirteen bits for error correction. The available bits are used for
error correction in the following way:
1. The four most significant bits of the LSF MSVQ first stage index are protected with
a Hamming (8,4) code.
2. The three least significant bits of the LSF MSVQ first stage index are protected
with a Hamming (7,4) code. The fourth data bit of this codeword is always set to
0.
3. The four most significant bits of the G2 codeword are protected with a Hamming
(7,4) code.
4. The least significant bit of the G2 codeword combined with the G1 codeword are
protected with a Hamming (7,4) code.
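As an illustration of the error-protection step, a systematic (7,4) Hamming encoder and single-error-correcting decoder is sketched below. The bit ordering within the codewords, and hence the mapping onto the protected MELP fields, is an assumption of the sketch rather than the ordering prescribed by the standard.

    import numpy as np

    # Generator and parity-check matrices of a systematic (7,4) Hamming code.
    G = np.array([[1, 0, 0, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])
    H = np.array([[1, 1, 0, 1, 1, 0, 0],
                  [1, 0, 1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 0, 0, 1]])

    def hamming74_encode(data4):
        return np.dot(data4, G) % 2

    def hamming74_decode(code7):
        code7 = np.array(code7) % 2
        syndrome = np.dot(H, code7) % 2
        if syndrome.any():                                    # non-zero syndrome: single-bit error
            pos = int(np.where((H.T == syndrome).all(axis=1))[0][0])
            code7[pos] ^= 1                                   # flip the erroneous bit
        return code7[:4]                                      # systematic: data bits come first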
5.3 Decoder
5.3.1 Error Correction
The (7,4) Hamming code will correct a single bit error, while the (8,4) Hamming code will
additionally detect the presence of two bit errors. Additionally, the pitch encoding allows
the detection of one or two bit errors. If uncorrectable bit errors are detected, a frame
erasure is signalled. In this case the MELP synthesiser is simply given the parameters of
the previous frame again.
5.4 Synthesis
5.4.1 Pitch Synchronous Synthesis
In the MELP synthesiser, the synthesised speech is generated pitch-synchronously. This
means that the excitation energy is generated one pitch period at a time. Furthermore,
parameters such as the LP parameters, gain and Fourier magnitudes are updated once
per pitch period.
[Figure: block diagram of the MELP synthesis system — aperiodic flag, band-pass excitation generators, excitation generation, linear prediction synthesis with the LP coefficients, and a pulse dispersion filter producing the synthesised speech.]
Flow Control
The position of the start of the current pitch period is used as the interpolation factor
for the parameters.
\alpha = t_0 / 180 \qquad (5.9)
The exception to this rule occurs at a sudden change in signal power, such as at the
onset of voice activity or the onset of voiced speech. In this case, the trajectory of the
MELP model parameter vector may change quite rapidly. Because the gain is measured
more than once per analysis frame, the gain trajectory is a much more accurate interpo-
lation factor in this case and the interpolation factor is calculated from the interpolated
gain (Gint) and the current G2 and successive G2p gain values as:
\alpha = \frac{G_{int} - G_{2p}}{G_2 - G_{2p}} \qquad (5.10)
Jitter
In weakly voiced frames, indicated by the jitter flag, we add a uniformly distributed
random integer to the pitch period. This random integer has a maximum absolute value
of 25% of the pitch period and may be positive or negative. Thus the modified length of
the synthesis segment is the pitch period plus this random offset.
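A sketch of this perturbation, assuming the jitter flag has already been decoded, is:

    import numpy as np

    def jittered_period(pitch, jitter_flag, max_fraction=0.25):
        # Add a uniformly distributed random offset of up to 25% of the pitch period.
        if not jitter_flag:
            return pitch
        max_offset = int(max_fraction * pitch)
        return pitch + np.random.randint(-max_offset, max_offset + 1)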
[Figure: MELP band-pass excitation generation — an IFFT pulse generator and a noise generator, each band-pass filtered according to the band-pass voicing strengths.]
The band-pass filtering is implemented by summing the filter coefficients of the voiced bands and those of the
unvoiced bands and using these two composite filters to filter the impulse and noise excitations respectively.
This means that we need only sum two signals instead of five.
A signal probability is estimated from the gain as

\rho = \frac{G_{int} - G_n - 12}{18} \qquad (5.12)

where G_{int} is the interpolated gain for the current pitch period and G_n is an estimated gain
for the background noise.
Additionally, the first reflection coefficient, k1 is calculated from the decoded line
spectral frequencies.
The signal probability and first reflection coefficient are used to calculate the band-
width expansion factors, α and β as well as a tilt coefficient, μ. This is done as follows:
The prediction polynomial A(z) is calculated from the LSFs.
This produces the transfer function of the ASE filter:
H_{ASE}(z) = (1 + \mu z^{-1})\, \frac{A(\alpha z^{-1})}{A(\beta z^{-1})} \qquad (5.14)
The purpose of the ASE filter is to emphasise the power of the synthesised speech near
the formants, similar to the CELP postfilter described in section 4.2.3.
The synthesised signal ŝ_n is scaled to match the transmitted gain using the scale factor

S_{gain} = \frac{10^{G_{int}/20}}{\sqrt{\frac{1}{T}\sum_{n=1}^{T} \hat{s}_n^2}} \qquad (5.15)
Naturally, one does not want to produce sudden changes in the signal amplitude and
therefore the above gain value is linearly interpolated between the previous and current
values over the first ten samples of the pitch period.
5.5 Results
In McCree and Barnwell [55], the MELP speech production model is presented and several
of the key algorithms are described. In a subsequent paper by McCree and De Martin
[56], several improvements to the model and the analysis and synthesis algorithms are
proposed. Heuristic justification is given and subjective test results are presented which
demonstrate the success of the overall system in each case. However, neither of the
above-mentioned publications present detailed investigation into the effects of the various
individual components of the vocoder.
Plante et al. [57] have made available a speech corpus (referred to as the Keele corpus
since it was produced at Keele University) designed to test the reliability of pitch tracking
algorithms. This corpus consists of ten speech samples from different speakers (five male
and five female speakers), each of which is approximately 60 seconds in length. Each
speech sample is accompanied by a laryngogram signal, which may be used to calculate
the pitch contour. Additionally, each sample is accompanied by a pitch contour sampled
at 100Hz. This pitch contour was automatically generated from the laryngogram signal
using the autocorrelation method but was further verified manually. The pitch contour
also differentiates between voiced, unvoiced and indeterminate speech segments.
In our experiment we have excluded the frames which are regarded as unvoiced as
well as those which are regarded as indeterminate. The remaining frames of each speech
segment are compared to the results obtained from the MELP pitch tracker. De Cheveigne
[19] uses the gross pitch error rate as a measure of the accuracy of a pitch tracker. A gross
pitch error is defined as a frame with measured pitch differing by more than 20% from
the true pitch. In this paper various results from the above-mentioned corpus are also
presented, which are useful for comparative purposes. De Cheveigne reports a gross pitch
error rate of 2.8% using an optimised version of the normalised ACF - the same feature
as used in MELP.
We used the Keele corpus to investigate the occurrence of pitch errors in the MELP
pitch tracker. Results were obtained over the entire Keele speech corpus, for all voiced
frames. g(p_e) is the cumulative distribution function of the relative pitch error,
where the relative pitch error p_e is defined in terms of the true pitch p_{true} and the estimated
pitch p_{est} as:

p_e = \frac{|p_{true} - p_{est}|}{p_{true}} \qquad (5.16)
Figure 5.8 indicates that the standard pitch tracking algorithm presented in MELP
achieves a gross pitch error rate of 6.5% for the Keele corpus. This suggests that the MELP
pitch tracker performs quite poorly compared to the results presented by De Cheveigné.
Since a very similar feature is used in De Cheveigné's experiments,
we conclude that the post-processing applied to the pitch track by the MELP pitch tracker
may be improved substantially.
[Figure 5.8: cumulative distribution function of the relative pitch error of the MELP pitch tracker over the Keele corpus.]
Voicing Analysis
The effect of the band-pass voicing analysis may be investigated by forcing all the higher
band voicing strengths to conform to the voicing of the lowest band. This reduces the
MELP speech production model to a simpler model, very close to the original LPC speech
production model. It is well established in the literature [15, 33, 55] that the LPC model
typically produces synthesised speech with a distinctly buzzy quality. We can reduce the
buzziness somewhat by reducing the voicing decision threshold, but this would result in
the synthesised voice taking on a hissy, whispering character.
We investigated the accuracy of the overall voicing decision using speech samples in
which the voicing has been accurately determined. For this purpose we used the same
speech corpus as above. The overall frame voicing as determined by the MELP algorithm
is compared to the voicing as marked up in the corpus.
In table 5.1, we demonstrate the frequency of occurrence of each of the four possible
outcomes of the MELP estimated voicing and the true frame voicing as determined by
the laryngiograph1 . The MELP voicing decision exhibits a substantial bias toward an
unvoiced decision. This appears to be a perceptually motivated design choice, since
incorrect classification of frames as voiced will result in very audible ‘musical’ tones in
the synthesised voice, while the incorrect classification of frames as unvoiced will result
in ‘hisses’ which are much less disturbing to listeners.
Figure 5.9 illustrates the effect of quantisation on the frequency response of the linear
prediction synthesis filter.
Figure 5.9: Effect of quantisation on the LP. Figures show the frequency response of
the MELP LP before (thin line) and after (thick line) quantisation for a few
representative frames.
1 This implies that the columns and rows of the table will not sum to 1, but the four entries of the table do sum to 1.
We can examine the objective effect of LP quantisation using some of the objective
speech metrics mentioned in 3.7. Paliwal and Atal [63], introduced the idea of transparent
quantisation which has been discussed in appendix E. Transparent quantisation is defined
in terms of the spectral distortion imposed by the quantisation. The spectral distortion is
defined in terms of the frequency response of the original linear predictor (A(f )) and the
transfer function of the quantised linear predictor (A′(f)). The spectral distortion (SD)
is calculated as:
SD = \frac{\int_{-\infty}^{\infty} \left(A(f) - A'(f)\right)^2 \, df}{\int_{-\infty}^{\infty} A(f)^2 \, df} \qquad (5.17)

In a discrete-time system, the integrals are usually approximated by sums.
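The sketch below follows equation 5.17 with the integrals replaced by sums over a uniform frequency grid. Treating A(f) as the magnitude response of the LP synthesis filter 1/A(z), and the grid size of 512 points, are interpretations and assumptions made here.

    import numpy as np
    from scipy.signal import freqz

    def spectral_distortion(a, a_q, nfft=512):
        # Equation 5.17 evaluated on a uniform grid of nfft frequencies.
        _, A = freqz([1.0], np.concatenate(([1.0], a)), worN=nfft)      # original envelope
        _, Aq = freqz([1.0], np.concatenate(([1.0], a_q)), worN=nfft)   # quantised envelope
        A, Aq = np.abs(A), np.abs(Aq)
        return np.sum((A - Aq) ** 2) / np.sum(A ** 2)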
Paliwal and Atal suggested 24 bits as being the minimum information necessary per
frame in order to achieve transparent quantisation of the linear predictor power spectrum.
Since MELP uses 25 bits per frame, we expect that transparent encoding of the LP should
be possible. In order to test this, the per-frame Spectral Distortion was measured for a
large speech corpus, taken from the TIMIT speech corpus. The estimated PDF of the
per-frame SD produced by the MELP VQ is presented in figure 5.10.
Figure 5.10: Histogram of spectral distortion measured over 75000 frames of MELP
encoded speech.
The results here are comparable to the results reported by Paliwal and Atal using their
Split VQ and MSVQ schemes (see appendix E), but the quantisation scheme in MELP
definitely appears to introduce more distortion. There are many factors which might
contribute to this, one possibility is the choice of speech corpus used for the evaluation.
Paliwal and Atal used about 160s of speech recorded from radio stations for their tests,
while we used 1600s from the TIMIT corpus.
However, investigating the SD histograms for various individual speakers presented
the surprising result that the distribution of spectral distortion varied only slightly for
the individual speakers, as illustrated in figure 5.11.
Figure 5.11: Spectral distortion histogram plots for various speakers. In each
sub-figure, the histogram of SD occurrence for a single speaker is shown.
Another explanation would be that the two stage MSVQ or the two stage split VQ used
by Paliwal and Atal represents a more efficient quantisation than the four stage MSVQ
used by MELP. The decision to use a four stage MSVQ was most probably motivated by
computational considerations. This idea has been discussed in appendix E.
5.6 Conclusion
The floating point ‘C’ reference implementation of the MELP algorithm is available at
[61]. This implementation was used as a qualitative verification of our implementation.
2. The model parameters produced by analysis of speech signals by the two implemen-
tations were identical.
In the implementation of the MELP vocoder it was found that the overwhelming ma-
jority of the development effort was spent on the implementation of the finer details of the
algorithm. While the analysis-synthesis model is conceptually simple, the MELP vocoder
relies on a number of seemingly heuristic algorithms, which are justified neither in McCree
and Barnwell’s original paper nor in the MELP standard. Examples would be the pitch
doubling check algorithm and the adaptive spectral enhancement filter. Additionally, the
frame-based nature of the MELP analysis and synthesis engines means that continuity
constraints must be carefully enforced. Thus we hope that our MELP MATLAB model
will represent a useful tool for future work.
In the following chapters we will consider some of the limitations of the MELP vocoder
and investigate potential avenues of improvement.
Chapter 6
The Temporal Decomposition Approach to Voice Coding
Prandoni and Vetterli [65] describe an approach whereby the speech segmentation is
optimised in the time domain in order to minimise a cost function based on the mod-
elisation cost. This means that the original speech signal is examined on a sample by
sample basis in order to determine the optimal segmentation on which to perform LP
analysis and parameter estimation. Their approach was successful, in that they achieved
very good rate-distortion trade-offs.
George [31] presented a block-coded variable frame rate strategy. This approach used
dynamic programming to select optimal break points or frames for transmission. Few
details of the algorithm are described in the paper and no quantitative results are docu-
mented.
We wish to derive a further mathematical theory of parametric voice representation,
in order to better understand this approach.
Thus the parameter vector describes a trajectory through its parameter space at the
same time as the speech waveform describes a trajectory in the time domain.
Figure 6.1: Time and Parameter domain representations of the speech signal
As one can see in figure 6.1, the feature vector is characterised by long periods of fairly
slow, linear behaviour, with occasional sharp, discontinuous changes. This is in agreement
with Atal’s observation that [3]:
intervals are chosen to be sufficiently small (and thus the rate at which p is sampled is
chosen sufficiently high) in order to satisfy a sampling criterion analogous to the Nyquist
criterion which is described in [77].
Figure 6.2: Feature vector trajectory sampled at a typical vocoder sampling rate. The
thin line indicates the feature trajectory. The markers indicate the point at which the
feature trajectory was sampled and the thick line indicates the estimated value of the
trajectory created by linear interpolation between sampling points.
Most low rate speech coders calculate a feature vector approximately every 30ms. This
corresponds to sampling p every 30ms. The effect of this is demonstrated in figure 6.2.
In figures 6.2, 6.3 and 6.4, the thin line indicates the feature trajectory. The markers
indicate the point at which the feature trajectory was effectively sampled and the thick
line indicates the estimated value of the trajectory created by linear interpolation between
sampling points. One can see that in this case the feature vector trajectory appears to
be over-sampled in certain regions.
[Figure 6.3: feature vector trajectory sampled at a regular 90 ms interval; much of the fast variation is lost.]
In figure 6.3, the sampling interval has been increased to 90ms, thus reducing the
sampling rate by a factor of 3. Now much of the high-frequency behaviour of the feature
vector seems to be lost. Clearly, simply increasing the sampling interval for regular
sampling is not an acceptable solution.
In figure 6.4, an average sampling interval of 90ms has been used. However, the
sampling points have been manually aligned with the most relevant points of the feature
trajectory. Now much of the high-frequency behaviour of the feature vector has been
preserved without increasing the number of samples.
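The effect illustrated in these figures is easy to reproduce numerically: the sketch below measures the maximum error made when a densely computed parameter track is reconstructed by linear interpolation from a chosen set of sampling instants. The synthetic gain track used in the example is an arbitrary stand-in, not data from the thesis.

    import numpy as np

    def max_interpolation_error(t, p, sample_times):
        # Reconstruct p(t) by linear interpolation between the chosen sampling instants
        # and return the largest absolute deviation from the dense track.
        p_hat = np.interp(t, sample_times, np.interp(sample_times, t, p))
        return float(np.max(np.abs(p - p_hat)))

    t = np.arange(0.0, 2.5, 0.0225)                        # dense track every 22.5 ms
    p = np.abs(np.sin(2 * np.pi * 1.3 * t)) * np.exp(-t)   # synthetic stand-in for a gain track
    print(max_interpolation_error(t, p, np.arange(0.0, 2.5, 0.09)))   # regular 90 ms sampling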
[Figure 6.4: feature vector trajectory sampled with an average interval of 90 ms, with the sampling points manually aligned with the most significant points of the trajectory.]
6.4 Conclusion
In this chapter we have shown that irregular sampling of the speech parameter vector may
lead to better encoding of the speech signal. In the following chapter we will describe the
implementation of a vocoder which makes use of the ideas presented here.
Chapter 7
Implementation of an Irregular
Frame Rate Vocoder
In section 6.3, we illustrated how we may possibly represent the speech signal accu-
rately with fewer sampling points using irregular sampling of the parameter trajectory.
In this chapter we will apply these ideas to the MELP speech production model de-
scribed in chapter 4 and chapter 5, in order to develop a variable frame-rate vocoder.
The development of such a vocoder requires the following:
[Figure: overview of the vocoder structure — analysis and post-processing of the original speech, encoding, transmission, decoding and synthesis of the synthesised speech.]
3. We used the L∞ norm to calculate a metric for any point chosen for transmission
based on the preceding points.
\alpha_j = \frac{j - k_{n-1}}{k_n - k_{n-1}}
Hence E[kn ] represents the maximum distance between p[t] and p̃[t] between the
points kn−1 and kn .
4. We modified the MELP synthesis engine so that the synthesis frame length is also
received and can vary.
We will call this approach to voice coding using the MELP voice production model
Irregularly Sampled MELP or IS-MELP.
7.1 Analysis
In the IS-MELP analysis step, the input speech waveform is analysed using the standard
MELP analysis engine as described in chapter 5. However, the IS-MELP analysis window
is advanced by only 2.25 ms (or 18 samples) at a time instead of the 22.5ms (180 samples)
by which the standard MELP analysis window is advanced. This results in a tenfold
oversampling of the parameter trajectory.
The primary purpose of this over-sampling is to allow more
accurate identification of the significant points in the speech parameter trajectory.
We determine the feature trajectory in our algorithm by performing MELP analysis on
overlapping frames of the speech waveform. The standard MELP analysis is performed on
analysis frame of 22.5ms, which is advanced by 22.5ms for every analysis. In our algorithm
we attain a high-resolution view of the trajectory by advancing the analysis frame by only
2.25ms. This of course leads to substantial redundancy in the feature vector trajectory,
analogous to the redundancy produced by over-sampling a band-limited signal. In order
to utilise this redundancy to obtain a more accurate estimation of the trajectory, we will
perform a filtering step on the feature trajectory.
Noise in the estimated trajectory would otherwise lead to spurious frames being selected
for transmission, increasing the number of frames which are transmitted and hence increasing the overall bit rate of the
vocoder.
[Figure 7.2: line spectrum frequency trajectories (a) before filtering and after low-pass filtering with (b) α = 0.3, (c) α = 0.1 and (d) α = 0.01.]
In figure 7.2, we illustrate the effect of the filter with the transfer function
H(z) = \frac{\alpha z^{-1}}{1 - (1 - \alpha) z^{-1}} \qquad (7.2)
on the LSFs for various values of α. In the original LSFs, one notices that the LSFs exhibit
a large amount of high-frequency behaviour. We postulated that this high frequency
behaviour is not perceptually significant, and may be removed using a low-pass filter,
with a suitably narrow bandwidth, such as the filter used in figure 7.2(c). However,
choosing α too small (and correspondingly, narrowing the bandwidth of the filter too
much), causes a loss of resolution in the LSF trajectories and correspondingly, a loss in
intelligibility of the synthesised speech. Ideally, one would determine the optimal filter
cutoff by means of perceptual tests, but this was not feasible within the scope of this
project.
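A sketch of this smoothing step, applied independently to each LSF track, is given below; the array layout (one frame per row, one LSF per column) is an assumption of the sketch.

    import numpy as np
    from scipy.signal import lfilter

    def smooth_lsf_tracks(lsf_tracks, alpha=0.1):
        # Equation 7.2 applied along the time axis of a (frames x 10) LSF matrix;
        # smaller alpha gives a narrower low-pass bandwidth and heavier smoothing.
        return lfilter([0.0, alpha], [1.0, -(1.0 - alpha)],
                       np.asarray(lsf_tracks, dtype=float), axis=0)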
Two cases of regular re-sampling of the MELP parameter trajectory were considered:

1. Sampling the trajectory at intervals of less than 22.5ms, referred to henceforth as over-sampling.

2. Sampling the trajectory at intervals of more than 22.5ms, referred to henceforth as under-sampling.
In the former case, over-sampling the trajectory by a factor 2 (in other words using
sampling intervals of 11.25ms) appears to produce a small but noticeable improvement in
speech quality. Over-sampling by a factor of 10 produces speech which appears to the
author's ear to be almost indistinguishable from the original samples. This suggests that
the MELP model accurately models the true speech production process and consequently
that the majority of the artifacts of the synthesised speech are created by the effects of
sampling the parameter trajectory.
In the latter case, under-sampling the parameter trajectory by a factor 2 produces
audible degradation of the synthesised speech. Under-sampling by a factor 3 produces
substantially more distortion and under-sampling by a factor 4 produces speech which is
usually no longer intelligible to the author.
[Figure 7.3: spectrograms of speech reconstructed from MELP parameters sampled at different rates, including 22 Hz and 15 Hz.]
Figure 7.3 illustrates the effect of the sampling rate of the MELP parameter vectors
on the speech signal. This is best done by means of the spectrograms of the signals. In
the case of the parameters sampled at a high rate (44Hz), one can see in the spectrograms
how the features of the speech are preserved. In the low sampling rate versions (such as
the 15Hz example), the main features of the speech can still be seen to be preserved (such
as the main shape of the pitch and the silent regions), but much of the resolution is lost.
From the results shown in figure 7.3 as well as those in figures 6.2 and 6.3, we can
therefore conclude that a naive under-sampling of the MELP parameter trajectory will
produce a commensurate reduction in bit-rate but will also result in unacceptable losses in
the quality of the synthesised speech. This is confirmed by the objective results obtained
by the PESQ algorithm in figure 7.4, which illustrate the diminishing returns which are
obtained as the frame rate of the regular MELP vocoder is increased. Above a frame rate
of roughly 100 frames per second, one does not see any significant improvement in the
quality of the synthesised speech as measured by the PESQ metric. In the region below
50 frames per second, the quality degrades very rapidly with decreasing frame rate. This
suggests that the frame rate of 44 frames per second used by the standard MELP vocoder
represents a sensible choice.
Figure 7.4: Variation of Bit Rate and Quality by Modification of Regular MELP
Sampling Rate
For linear interpolation of a function f between the points x_0 and x_1, the interpolation error is

e_1(x) = \frac{f''(\xi(x))}{2}\,(x - x_0)(x - x_1) \qquad (7.3)

with \xi(x) \in [x_0, x_1]. The interpolation error at any point is therefore linearly propor-
tional to the second derivative of the function at that point. Thus for piecewise linear
interpolation it makes sense to choose the interpolation points to be those points which
have the largest second derivative. This approach makes sense since these points are those
points where the function exhibits maximum curvature and is thus least suited to linear
interpolation.
Our first algorithm therefore simply examined the second partial derivatives of the
MELP parameter vector trajectory. If any of these second partial derivatives (in other
words, the second derivative of any given MELP model parameter) exceeded a threshold,
the frame was regarded as significant and transmitted.
Since taking the Euclidean distance between LSFs does not constitute a sensible metric
with which to evaluate the magnitude of the derivative of the MELP parameter vector, we
used the perceptually based distance measure employed by McCree and Barnwell [55, 69]
and mentioned in equation 5.5 in the MELP Vector Quantisation, to estimate a more
perceptually applicable form of the derivative.
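A sketch of this first frame-selection rule is shown below, using plain second differences of the (post-processed) trajectory; the perceptual weighting of the LSF dimension discussed above is omitted here for brevity.

    import numpy as np

    def select_by_curvature(traj, threshold):
        # traj: (n_frames x n_params) MELP parameter trajectory.
        # A frame is marked significant if any second difference exceeds the threshold.
        d2 = np.abs(np.diff(traj, n=2, axis=0))
        significant = np.any(d2 > threshold, axis=1)
        keep = np.zeros(len(traj), dtype=bool)
        keep[1:-1] = significant
        keep[0] = keep[-1] = True                    # always transmit the end points
        return np.where(keep)[0]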
In figure 7.5 and figure 7.6, we demonstrate the various partial derivatives of the MELP
parameter trajectory.
The algorithm was not successful. At low bit rates, synthesised speech was severely
distorted and sounded unacceptably unnatural. We believe that the estimation of the
MELP model parameters is an inherently noisy process, and calculating the derivative
tends to emphasise the noise, leading to highly inaccurate estimation of frame importance.
A second reason for the failure of the algorithm is that the regions of maximum
curvature of the feature vector trajectory tend to be clustered together. This means
that often many successive frames are chosen for transmission. In order to maintain an
acceptable bit-rate, the threshold for transmission must therefore be set to a large value.
This large threshold results in some important features of the feature vector trajectory
being ignored.
d : P \times P \to \mathbb{R} \qquad (7.4)
Figure 7.5: Second partial derivatives of MELP model parameter vector trajectory. No
post-processing applied to trajectory.
2. Given two parameter vector trajectories, devise a metric which quantifies how differ-
ent the sounds produced by the trajectories will be. The norm described in equation
7.1 was used for this.
As discussed in 3.3, the most suitable measure to discriminate between two speech sig-
nals is their perceptual distance. As we mentioned, the only true measure of the perceptual
distance is the human ear, but some proposed approximations exist. These approxima-
tions typically operate in either the time or the frequency domain.
We use knowledge of how the MELP decoder would treat the received data (linear
interpolation between successive key frames) in order to choose frames to transmit.
In order to do this we considered the trajectory that would be created by the decoder
from interpolation of the candidate set of key frames. We compare the distance between
this interpolated trajectory and the original parameter trajectory in order to determine an
appropriate set of key frames.
Figure 7.6: Second Partial Derivatives of MELP Model Parameter Vector Trajectory
after application of post-processing.
The core of the frame selection algorithm (algorithm 1) is the
SignificantInterpolationError(p[prev] . . . p[next])
function, which evaluates the expected interpolation error which will be created by lin-
ear interpolation between the previous frame and the candidate frame. We will regard
the interpolation error as significant if the original and reconstructed parameter vector
trajectories are significantly different. The trajectories are regarded as being significantly
different if they are significantly different at any point, thus reducing the problem fur-
ther to whether the model parameters at any point in the trajectories are significantly
different, as in equation 7.1.
Two sets of MELP model parameters are regarded as being significantly different when
the distance between them along any dimension exceeds a threshold, i.e. if any of the
model parameters are significantly different. This is qualitatively similar to the standard
L∞ norm which is commonly used in approximation theory. More detail about the Lp
norms may be found in [64].
It was beneficial to define a different threshold and distance function for each MELP
model parameter (for convenience we regard the Linear Predictor as a single model param-
eter and we regard the set of band-pass voicing strengths as a single model parameter).
This was done because a significant change in any of the parameters will result in a
perceptually significant difference in the speech waveform.
The metrics used for the individual MELP model parameters are detailed below:
Linear Predictor; The perceptually motivated distance measure used by Barnwell and
McCree in [55] and defined in equation 5.5 is used.
Band-Pass Voicing; A number of different weight vectors were tried. The weighting which appeared to
work best was to weight only the lowest band voicing (overall voicing) more heavily.
It was also beneficial to set the lowest band weight to a very high value, so that a
change in the overall voicing would result in frames being classified as ‘very’ different.
Thus w1 = w2 = w3 = w4 = 1 and w0 = ∞.
Pitch; We used the maximum of the pitch difference relative to each of the two pitch
values. See section 3.3.
d(p_1, p_2) = \max\left( \frac{|p_1 - p_2|}{p_1}, \frac{|p_1 - p_2|}{p_2} \right) \qquad (7.6)
Gain; We used the absolute value of the logarithm of the ratio of the respective gain
values:
d(G_1, G_2) = \left| \log \frac{G_1}{G_2} \right| \qquad (7.7)
Using these metrics, we can compare the interpolated and original parameter trajec-
tories directly in the parameter domain, without having to re-synthesise the speech.
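Putting the pieces together, a greedy sketch of the frame-selection loop and of the SignificantInterpolationError test is given below. The dictionary-based frame representation, the particular greedy strategy of committing the last acceptable frame, and the restriction to the pitch and gain metrics in the example are assumptions of this sketch.

    import numpy as np

    def pitch_dist(p1, p2):
        # Equation 7.6.
        return max(abs(p1 - p2) / p1, abs(p1 - p2) / p2)

    def gain_dist(g1, g2):
        # Equation 7.7.
        return abs(np.log(g1 / g2))

    def significant_interpolation_error(frames, prev, nxt, dist_fns, thresholds):
        # True if linearly interpolating any parameter between frames prev and nxt
        # deviates significantly from the original trajectory at an intermediate frame.
        for j in range(prev + 1, nxt):
            t = (j - prev) / (nxt - prev)
            for name, dist in dist_fns.items():
                interp = (1 - t) * frames[prev][name] + t * frames[nxt][name]
                if dist(frames[j][name], interp) > thresholds[name]:
                    return True
        return False

    def select_frames(frames, dist_fns, thresholds):
        # Greedy selection: extend the interpolation span until it becomes significant,
        # then transmit the last frame for which interpolation was still acceptable.
        keys, prev = [0], 0
        for cand in range(2, len(frames)):
            if significant_interpolation_error(frames, prev, cand, dist_fns, thresholds):
                keys.append(cand - 1)
                prev = cand - 1
        keys.append(len(frames) - 1)
        return keys

For example, dist_fns = {'pitch': pitch_dist, 'gain': gain_dist} with thresholds = {'pitch': 0.1, 'gain': 0.5} (arbitrary values) selects key frames from a list of per-frame parameter dictionaries.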
This approach appeared to be quite successful. In comparison to regularly sampled
MELP, IS-MELP achieved improved quality at similar bit-rates. Since the bit-rate pro-
duced by this algorithm is variable, statistics for the bit-rate were collected over a sub-
stantial speech corpus in order to obtain an accurate estimate of the average bit rate
produced by a given set of parameters.
Subjective test results for the achieved quality are detailed in chapter 8.
We used the PESQ algorithm described in section 3.7.3 to estimate speech quality during
development.
In figure 7.7 we can examine the points which were selected by the IS-MELP frame
selection algorithm, using non-optimised thresholds and compare these with the tracks of
the line spectrum frequencies and with the spectrogram of the results.
Figure 7.7: IS-MELP Sampling. The positions of the sampling points are indicated on
the spectrogram by the vertical lines.
We measured the average bit rate and the quality of the speech as a function of the threshold. In accordance with the results in [18]
we would expect to see that the quality of the synthesised speech will increase as a loga-
rithmic function of the bit rate. We present results obtained by variation of the thresholds
mentioned in algorithm 1. In each of the following graphs, we plot a parametric curve
which illustrates the quality (measured using the PESQ algorithm introduced in section
3.7.3) of the vocoder in comparison to the average bit-rate produced by the vocoder.
Figures 7.8, 7.9, 7.10 and 7.11 show the variation of bit-rate and quality produced by
variation of the thresholds for the voicing, gain, LP system and pitch respectively. In
each experiment, three of the parameters were kept constant and the fourth was varied.
In each of the figures, it can be seen that there are regions in which the IS-MELP
vocoder outperforms the regularly sampled MELP vocoder and regions where the IS-
MELP vocoder performs less well than the regular MELP vocoder. This implies that the
variation of a single threshold will produce a local maximum in the improvement of the
IS-MELP performance over the performance of regular MELP.
Figure 7.8: Variation of frame rate and quality by modification of the voicing
threshold. The top and middle plots indicate the quality and frame rate of the vocoder
respectively as functions of the threshold used. The lower graph plots the quality against
the frame rate for voicing, gain, LP and pitch thresholds respectively. For comparative
purposes, the quality of regularly sampled MELP at various frame rates is also shown.
Figure 7.9: Variation of Bit Rate and Quality by Modification of the Gain Threshold
Figure 7.10: Variation of Bit Rate and Quality by Modification of the LSF Threshold
Figure 7.11: Variation of Bit Rate and Quality by Modification of the Pitch Threshold
Figure 7.12: Variation of Bit Rate and Quality by Modification of the LSF Post
processing
In figure 7.12, we note that the bit-rate against quality curve exhibits a sharp change in
gradient around α = 0.2. This suggests that if the bandwidth of the LSF filter is decreased
to less than 15Hz, there is a significant loss of information in the LSF trajectories. This
is fairly consistent with the average phoneme rate we discussed in section 2.1.
Figure 7.13: Variation of Bit Rate and Quality by Modification of the Voicing Post
Processing
As the amount of pitch and voicing post-processing increases, the quality as measured
by PESQ first increases slightly and then decreases with increasing smoothness, as seen
in figure 7.13. The marginal increase in perceived quality of the speech is most probably
explained by the fact that the post processing removes errors introduced in the pitch and
voicing analysis. However, increasing the post processing causes a loss of phonetic content
in the synthesised speech and a corresponding loss of quality.
7.9 Conclusions
The bit-rate versus quality curves illustrated in the previous sections indicate that it is
possible to achieve continuous variation of the bit rate and quality of the voice coding sys-
tem by varying the allowable distortion. Furthermore, this decision may be continuously
adjusted at the transmitter without introducing the necessity of transmitting additional
information to maintain synchronisation with the receiver.
The most significant disadvantage of the IS-MELP vocoder is the difficulty of relating
the distortion thresholds to a fixed bit-rate. Since there is no simple mathematical function
which determines the bit-rate from a set of thresholds, the bit rate produced by a threshold
set must be evaluated empirically. However, in an application environment, this problem
could be circumvented.
Figures 7.8, 7.9, 7.10 and 7.11 clearly demonstrate that in certain regions, particularly
in the case of lower frame rates, IS-MELP achieves better PESQ scores than regularly
sampled MELP.
Filtering the pitch and voicing with a non-linear filtering scheme appears to signif-
icantly improve the performance of the IS-MELP vocoder. Filtering of the LP system
produces a less significant but still noticeable improvement in the performance of the
IS-MELP vocoder.
In the following chapter, we will verify the performance of the IS-MELP vocoder by
means of subjective tests. We will use the optimised thresholds generated as described in
section 7.6.
Chapter 8
Evaluation of Vocoders
As has been discussed in previous chapters (3.7), there are two standard approaches to the
evaluation of the quality of a vocoder, namely algorithmic quality measures and human
listening tests.
By comparing these approaches, we aim in this chapter to:
1. Compare the performance of the MELP vocoder for a European and an African
language, namely English and Xhosa.
2. Compare the relative performance of the reference and IS-MELP vocoders.
3. Determine whether the IS-MELP promises any improvement in voice quality for
African languages.
4. Investigate the correlation between the objective and subjective metrics for vocoder
performance.
1 From the English corpus the phoneme-rich sentence common to all speakers, “Helen’s stint as league
manager provided useful opportunities but the elementary practical tasks of going to meetings and reading
a work sheet bored her.” was used for the test.
From the Xhosa corpus the phoneme rich sentence common to all speakers, “Intetho kamongameli
In the AST speech corpus, each speaker reads two phonetically rich sentences. One
of these sentences is individual to the speaker, the other is common to all speakers. We
used the latter for our tests.
8.1.2 Utterances
Four speech samples were selected from the phonetically rich sentence spoken by each
speaker, thus bringing the total number of samples per language to 40. For every language,
the same samples were selected for every speaker. Samples were chosen to be short but
grammatically meaningful utterances.
English Utterances
Xhosa Utterances
8.2.1 PESQ
The PESQ algorithm refers to the objective metric described in [43]. This metric is recom-
mended by the Telecommunication Standardisation Sector of the International Telecom-
munication Union as an objective method for the end-to-end speech quality assessment of
narrow band speech codecs. We have provided full details of the algorithm in Appendix
G. The PESQ metric provides an estimate of the mean opinion score (MOS) which would
be assigned to a speech file. Thus a higher PESQ score corresponds to better speech
quality.
Figure 8.1: Overall (Combined English and Xhosa) Rate-Distortion Curve for Regular
and IS-MELP
[Figure: PESQ score against frame rate for English and Xhosa speech.]
For IS-MELP, the PESQ rate-distortion curves for the English and Xhosa speech were almost
exactly equal.
At low bit rates the IS-MELP vocoder produces substantially better speech quality
than the regular MELP vocoder according to the PESQ metric. At higher bit-rates the
IS-MELP vocoder is less efficient than the regular MELP vocoder.
It was also found that the IS-MELP vocoder exhibited less differentiation in the way in
which it handled English and Xhosa. We suspect that this difference is due to the fact that
the multi-rate capability of IS-MELP is better able to handle certain phoneme classes. As
shown in appendix D.1, there are different distributions of phoneme occurrences in the
two languages used in the test. See table D.1.
We hypothesise that the difference is due to a greater variance in the phonetic modu-
lation rate in Xhosa.
[Figure: a further comparison of PESQ score against frame rate for English and Xhosa speech.]
Subjective listening tests were performed in order to verify the results predicted using the objective PESQ metric.
The ITU-T has published a recommendation describing methods and procedures for
conducting subjective evaluations of the quality of transmitted speech [42]. We have
followed these recommendations as closely as possible.
Referring to the results in figure 8.1, we note that at the higher rate, we would expect
that regular MELP would perform better and at the lower rate we would expect that
IS-MELP would perform better.
The transcoding involved only the parameterisation and frame selection before re-
synthesis. It was felt that quantisation effects would interfere with the effect that was
being studied (that of regular and irregular sampling of the speech feature vector trajec-
tory).
The synthesised speech was saved in the form of a standard PCM wave (.wav) file for
use in the subjective tests.
                         English   Xhosa
IS-MELP 22 fps            1.95     2.51
Regular MELP 22 fps       2.27     3.04
IS-MELP 60 fps            2.60     3.51
Regular MELP 60 fps       3.70     4.10

Table 8.2: Language dependence of IS-MELP and regular MELP. Mean opinion scores
are shown for each condition and language.
For regular MELP, the degradation between 60 and 22 fps was a MOS difference of 1.31. Thus the amount of degradation
incurred by lowering the frame rate was halved by using the IS-MELP vocoder.
Figure 8.4: Spectrogram of sample which obtained poor MOS rating but a good PESQ
score. Sample was transcoded with IS-MELP algorithm at 22fps.
Figure 8.5: Spectrogram of original speech segment used to generate sample in figure
8.4.
Figure 8.6: Spectrogram of sample which obtained poor MOS rating and a poor PESQ
score. Sample was transcoded with IS-MELP algorithm at 22fps.
Figure 8.7: Spectrogram of original speech segment used to generate sample in figure
8.6.
8.5 Conclusion
While the IS-MELP algorithm produced results comparable to those of the regular MELP algorithm, and in some cases demonstrated superior performance, its output, particularly at low frame rates, was found to be unsatisfactory. This was most apparent from the subjective tests. We feel that substantial improvement of the IS-MELP algorithm may still be achieved. Methods of improving the performance of the IS-MELP algorithm are described in section 9.2.
Chapter 9
Summary and Conclusion
encoding the short transient phonemes (clicks) characteristic of certain African languages.
Both subjective and objective evaluations were performed in order to compare the vocoders.
Objective evaluation using the PESQ metric indicated that regular MELP performed substantially worse when encoding Xhosa than when encoding English, at all frame rates. Furthermore, the IS-MELP vocoder did not exhibit different performance for English and Xhosa.
The PESQ metric also indicated that at low bit rates, the IS-MELP vocoder exhibited
improved performance over the regular MELP vocoder in both English and Xhosa.
Subjective tests did not confirm the improved performance of the IS-MELP algorithm at low frame rates. However, as frame rates decreased, the IS-MELP algorithm resulted in more graceful degradation of the transcoded speech than the regular MELP algorithm.
The PESQ metric did not always agree with the relative performance of each vocoder in subjective listener tests. We have attempted to analyse this discrepancy, particularly with regard to the results obtained by the IS-MELP algorithm, but a full treatment of this topic is beyond the scope of this thesis. However, in the interest of developing more accurate objective metrics (the value of which has already been discussed), a thorough investigation of these results would be valuable.
Bibliography
[14] CHENG, Y.-M. and O’SHAUGHNESSY, D., “On 450-600b/s Natural Sounding
Speech Coding.” IEEE Trans. Speech Audio Processing, April 1993.
[15] CHU, W. C., Speech Coding Algorithms. Hoboken: Wiley, 2003.
[16] CHU, W. C., “Window Optimisation in Linear Prediction Analysis.” IEEE Trans.
Acoustics, Speech and Signal Processing, November 2003.
[17] COLLURA, J. S. and TREMAIN, T. E., “Vector Quantizer Design for the Coding
of LSF Parameters.” U.S. Department of Defense.
[18] DAUMER, W. R., “Subjective Evaluation of Several Efficient Speech Coders.”
IEEE Transactions on Communications, April 1982, Vol. 30.
[19] DE CHEVEIGNE, A. and KAWAHARA, H., “YIN, a fundamental frequency
estimator for speech and music.” Journal of the Acoustical Society of America,
April 2002.
[20] DELLER, J. R., PROAKIS, J. G., and HANSEN, J. H., Discrete-Time Processing
of Speech Signals. New York: Macmillan, 1993.
[21] DUDLEY, H., “The Vocoder.” Bell Labs, December 1939.
[22] DUDLEY, H. and TARNOCZY, T. H., “The Speaking Machine of Wolfgang von
Kempelen.” Journal of the Acoustical Society of America, 1950.
[23] DUNN, H., “The calculation of vowel resonances and an electrical vocal tract.” J.
Acoust. Soc. Amer., November 1950.
[24] DAVID, E. E. Jr., SCHROEDER, M. R., LOGAN, B. F., and PRESTIGIACOMO, A. J.,
“Voice-Excited Vocoders for Practical Speech Bandwidth Reduction.” IEEE Trans.
Information Theory, Sept 1962.
[25] EL-JAROUDI, A. and MAKHOUL, J., “Discrete All-Pole Modeling.” IEEE
Transactions on Signal Processing, 1991.
[26] ETEMOGLU, C. O. and CUPERMAN, V., “Matching Pursuits Sinusoidal Speech
Coding.” IEEE Trans. Speech Audio Processing, 2003.
[27] FAIRBANKS, G., “Test of Phonemic Variation: The Rhyme Test.” Journal of the
Acoustical Society of America, 1958.
[28] FLANAGAN, J. L., “Bandwidth and channel capacity necessary to transmit the
formant information of speech.” Journal of the Acoustical Society of America,
July 1956.
[29] FLANAGAN, J. L. and HOUSE, A. S., “Development and testing of a
formant-coding speech compression system.” Journal of the Acoustical Society of
America, November 1956.
[45] VASS, J., ZHAO, Y., and ZHUANG, X., “Adaptive Forward-Backward Quantizer for
Low Bit Rate, High Quality Speech Coding.” IEEE Trans. Speech Audio Processing.
[46] KAY, S. M., Modern Spectral Estimation. New Jersey: Prentice Hall, 1988.
[47] KLATT, D., “Prediction of perceived phonetic distance from critical-band spectra:
A first step.” IEEE ICASSP, May 1982.
[48] LE ROUX, J. and GUEGUEN, C., “A Fixed-Point Computation of Partial
Correlation Coefficients.” IEEE Trans. Acoustics, Speech and Signal Processing,
June 1977.
[49] LEE, K. S. and COX, R. V., “A Very Low Bit Rate Speech Coder Based on a
Recognition/Synthesis Paradigm.” IEEE Transactions on Speech Audio Processing,
2001.
[50] LOPEZ-SOLER, J. and FARVARDIN, N., “A Combined
Quantization-Interpolation Scheme For Very Low Bit Rate Coding Of Speech LSP
Parameters.” IEEE ICASSP, 1993, Vol. 2, pp. 21–24.
[51] MARKEL, J. D., “The SIFT Algorithm for Fundamental Frequency Estimation.”
IEEE Trans. on Audio and Electroacoustics, December 1972.
[52] MATTHAEI, P. E., “Automatic Speech Transcription.” Master’s thesis, University
of Stellenbosch, April 2004.
[53] MCAULAY, R. and QUATIERI, T. F., “Low-bit-rate speech coding based on an
improved sinusoidal model.” Speech Coding and Synthesis, 1995.
[54] MCCANDLESS, S. S., “An Algorithm for Automatic Formant Extraction using
Linear Prediction Spectra.” IEEE Trans. Acoustics, Speech and Signal Processing,
1974.
[55] MCCREE, A. and BARNWELL III, T. P., “A Mixed Excitation LPC Vocoder Model for Low
Bit Rate Speech Coding.” IEEE Transactions on Speech and Audio Processing,
July 1995.
[56] MCCREE, A. and DE MARTIN, J. C., “A 1.7 kb/s MELP Coder with Improved
Analysis and Quantisation.” IEEE ICASSP, 1998.
[57] MEYER, G. F., PLANTE, F., and AINSWORTH, W. A., “A pitch extraction
reference database.” Proceedings of EUROSPEECH, October 1995.
[58] MINOLI, D. and MINOLI, E., Delivering Voice over IP Networks. New York:
Wiley, 1998.
[59] NIESLER, T., LOUW, P., and ROUX, J., “Phonetic analysis of Afrikaans, English,
Xhosa and Zulu using South African speech databases.” Southern African
Linguistics and Applied Language Studies, 2005.
[60] NORMAN, J., Elementary Dynamic Programming. London: Edward Arnold, 1975.
[61] U.S. DEPARTMENT OF DEFENSE DIGITAL VOICE PROCESSOR CONSORTIUM, MELP
at 2.4 kbps. http://maya.arcon.com/ddvpc/melp.htm, April 2002.
[62] OPPENHEIM, A. V. and SCHAFER, R., Digital Signal Processing. New Jersey:
Prentice Hall, 1975.
[63] PALIWAL, K. K. and ATAL, B. S., “Efficient vector quantization of LPC
parameters at 24 bits/frame.” IEEE Trans. Speech Audio Processing, January 1993.
[64] POWELL, M. J. D., Approximation Theory and Methods. Cambridge: Cambridge
University Press, 1981.
[65] PRANDONI, P. and VETTERLI, M., “R/D Optimal Linear Prediction.” IEEE
Trans. Acoustics, Speech and Signal Processing, November 2000.
[66] PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T., and
FLANNERY, B. P., Numerical Recipes in C . Cambridge: Cambridge University
Press, 1997.
[67] PROAKIS, J. G. (Ed.), Digital Communications. Fourth edition. New York:
McGraw-Hill, 2000.
[68] PROAKIS, J. G. and SALEHI, M. (Eds), Communications Systems Engineering.
New Jersey: Prentice Hall, 2002.
[69] FEDERAL INFORMATION PROCESSING STANDARDS PUBLICATION, “Analog to Digital Conversion of Voice by 2,400
Bit/second Mixed Excitation Linear Prediction (MELP).” June 1997.
[70] RABINER, L. and SCHAFER, R., Digital Processing of Speech Signals. New
Jersey: Prentice Hall, 1978.
[71] RAPID MOBILE, Pretoria. Datasheet for RM6: HF Modem and ALE Controller
Unit, 2005.
[72] ROHWER, C. H., “Variation Reduction and LULU Smoothing.” Quaestiones
Mathematicae, 2002, Vol. 25, No. 2, pp. 163–176.
[73] ROHWER, C. H. and WILD, M., “Natural Alternatives for One Dimensional
Median Smoothing.” Quaestiones Mathematicae, 2002, Vol. 25, No. 2,
pp. 135–162.
[74] SCHWARDT, L., “Voice Conversion: An Investigation.” Master’s thesis, University
of Stellenbosch, December 1997.
[75] SCHWARTZ, M. (Ed.), Information, Transmission, Modulation and Noise.
Second edition. New York: McGraw-Hill, 1970.
[76] SHANNON, C. E., “A Mathematical Theory of Communication.” Bell System
Technical Journal, 1948, Vol. 27, pp. 379–423, 623–656.
Appendix A
Optimisation of the Linear Predictor
We therefore consider the mean squared error signal of the predictor a over a speech
segment of N samples, s[0] . . . s[N − 1].
E = \sum_{i=0}^{N-1} e^2[i]   (A.1)

  = \sum_{i=0}^{N-1} \left( s[i] - s^*[i] \right)^2   (A.2)

  = \sum_{i=0}^{N-1} \left( s[i] - \sum_{j=1}^{P} a_j s[i-j] \right)^2   (A.3)
Thus the mean squared error signal over a speech segment is a quadratic function of the
predictor coefficients and therefore has a unique global minimum, since the function
E(a) can be shown to be strictly convex.
We can determine the global minimum by finding the point in the vector space of a
where all the partial derivatives of E are equal to zero. Call this point a^*; then

\left. \frac{\partial E}{\partial a_k} \right|_{a^*} = 0 \qquad \forall k \in \{1, \ldots, P\}   (A.4)
But

\frac{\partial E}{\partial a_k} = \frac{\partial}{\partial a_k} \sum_{i=0}^{N-1} \left( s[i] - \sum_{j=1}^{P} a_j s[i-j] \right)^2

 = 2 \sum_{i=0}^{N-1} \left( s[i] - \sum_{j=1}^{P} a_j s[i-j] \right) \left( -s[i-k] \right)

 = 2 \left( \sum_{j=1}^{P} a_j \sum_{i=0}^{N-1} s[i-k]\, s[i-j] - \sum_{i=0}^{N-1} s[i-k]\, s[i] \right)
thus

\forall k \in \{1, \ldots, P\}: \qquad \sum_{i=0}^{N-1} s[i-k]\, s[i] = \sum_{j=1}^{P} a_j \sum_{i=0}^{N-1} s[i-k]\, s[i-j]
or in matrix form

\begin{bmatrix} \phi_{1,0} \\ \phi_{2,0} \\ \phi_{3,0} \\ \vdots \\ \phi_{p,0} \end{bmatrix}
= \Phi \times
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
with

\Phi = \begin{bmatrix}
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,p} \\
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,p} \\
\phi_{3,1} & \phi_{3,2} & \cdots & \phi_{3,p} \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{p,1} & \phi_{p,2} & \cdots & \phi_{p,p}
\end{bmatrix}
We are able to deduce certain properties of the predictor from analysis of the covariance
matrix [46].
If the covariance matrix is positive definite (all eigenvalues greater than 0), the
predictor will be stable - all the poles of the predictor will lie within the unit circle
in the z-plane.
If the covariance matrix is positive semidefinite, this implies that the analysis
segment consists of the sum of p perfect sinusoids and the predictor will be
marginally stable - all the poles of the predictor will lie on the unit circle in the
z-plane.
If the covariance matrix is not positive semidefinite, this implies that the predictor
will be unstable - some of the poles of the predictor will lie outside the unit circle
in the z-plane.
This proves to be problematic. An unstable predictor not only disagrees with the physical interpretation of the system, but also causes problems in the realisation of the predictor.
Additionally, we are now left with the uncomfortable situation that we require samples
outside the analysis segment in order to calculate the Φ matrix. Furthermore, the
inversion of the Φ matrix is numerically unstable and the nature of the Φ matrix does
not guarantee that we will obtain a stable filter design for our LP synthesis filter.
There is a common solution to this dilemma. By windowing the (assumed infinite)
speech signal with a window w[k] which is uniformly zero outside the analysis interval,
we obtain the following result:
\phi_{m,n} = \sum_{i=0}^{N-1} s[i-n]\, w[i-n]\, s[i-m]\, w[i-m]   (A.5)

           = \sum_{i=0}^{N-1-k} s[i]\, w[i]\, s[i+k]\, w[i+k], \qquad k = |m-n|   (A.6)

           = r_{m-n}   (A.7)
noting that
φm,n = φn,m
and that rm−n is the biased autocorrelation estimate for the (assumed stationary) time
signal using the window w[k]. We can re-write Φ as
\Phi = \begin{bmatrix}
r_0 & r_1 & \cdots & r_{p-1} \\
r_1 & r_0 & \cdots & r_{p-2} \\
r_2 & r_1 & \cdots & r_{p-3} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p-1} & r_{p-2} & \cdots & r_0
\end{bmatrix}

so that

r = \Phi a

where

r = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_p \end{bmatrix}
This convenient approximation to the full Yule-Walker equations is known as the modified or extended Yule-Walker equations [46], and has become extremely common in speech processing applications since it additionally allows for a very convenient numerical solution.
One obvious question to ask would concern the nature of the window w[k] which we so
glibly refer to in equation A.7. As previously stated, the window must have finite
support. A common choice is the Hamming Window [55] [70] but Chu [16] considers
windows more specifically adapted to the task.
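As a minimal numerical sketch of the autocorrelation method described above (illustrative only: the frame length, the predictor order and the choice of a Hamming window are assumptions, not the exact configuration used in this thesis), the windowed biased autocorrelation estimates can be computed and the Toeplitz system r = Phi a solved directly:

import numpy as np

def lp_autocorrelation(frame, order=10):
    """Estimate LP coefficients a_1..a_P by the autocorrelation method.

    The frame is windowed (Hamming, one common choice), the biased
    autocorrelation r_0..r_P is computed, and the Toeplitz system
    r = Phi a is solved with a general solver.  In practice the
    Levinson-Durbin recursion of Appendix C would be used instead.
    """
    w = np.hamming(len(frame))
    x = frame * w
    # Biased autocorrelation estimates r_0 .. r_order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz matrix Phi and right-hand side r_1 .. r_P
    Phi = np.array([[r[abs(m - n)] for n in range(order)] for m in range(order)])
    a = np.linalg.solve(Phi, r[1:order + 1])
    return a  # predictor coefficients such that s*[i] = sum_j a_j s[i-j]

# Example: fit a 10th-order predictor to a synthetic 200-sample frame
rng = np.random.default_rng(0)
print(lp_autocorrelation(rng.standard_normal(200), order=10))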
Appendix B
Derivation of the Line Spectrum Frequencies
We first derive the LSFs of the linear system with transfer polynomial given by

a(z) = \sum_{i=0}^{P} a_i z^{-i}

First define the sum and difference polynomials

p(z) = a(z) + z^{-(P+1)} a(z^{-1})

q(z) = a(z) - z^{-(P+1)} a(z^{-1})

Consider first the second-order case (P = 2), in which the synthesis filter 1/a(z) has a single complex pole pair of radius ρ0 at normalised frequency f0. Then z = −1 is a root of p(z) and z = 1 is a root of q(z). We can therefore factorise out (1 + z^{−1}) and (1 − z^{−1}) from p(z) and q(z) respectively. This results in

p(z) = (1 + z^{-1})(1 - 2\beta_1 z^{-1} + z^{-2})

q(z) = (1 - z^{-1})(1 - 2\beta_2 z^{-1} + z^{-2})
It can be shown that β1 and β2 are both on the interval (−1, 1) for any value of f0 and ρ0. Thus the roots of p(z) and q(z) are complex and given by

\beta_1 \pm j\sqrt{1 - \beta_1^2}
and

\beta_2 \pm j\sqrt{1 - \beta_2^2}
respectively. Because the roots lie on the unit circle, they can be uniquely represented
by their angles. These angles are known as the line spectral frequencies of a(z) and are
given by:
\cos(2\pi f_1) = \rho_0 \cos(2\pi f_0) + \frac{1 - \rho_0^2}{2}   (B.7)

\cos(2\pi f_2) = \rho_0 \cos(2\pi f_0) - \frac{1 - \rho_0^2}{2}   (B.8)
Now it can be shown that f1 < f0 < f2 and that the three frequencies become close as
the pole of the second order system moves close to the unit circle. The following also
hold for more general cases:
1. The roots of p(z) and q(z) lie on the unit circle.
2. The points z = ±1 are roots.
3. Once sorted by complex angle, the roots of p(z) and q(z) alternate on the unit circle.
Thus the P predictor coefficients can always be transformed into P line spectral
frequencies.
To compute the LSFs for higher order systems we make the substitution z = cos(ω) and compute the roots of p(ω) and q(ω) by any root-finding method. The bisection method for finding these roots is popular [66]. To compute the predictor coefficients from the LSFs we can factor p(z) and q(z) as products of second-order filters, and then

a(z) = \frac{p(z) + q(z)}{2}
Unfortunately, since the LSFs correspond to the roots of polynomials derived from the transfer function of the linear predictor, calculation of the LSFs involves polynomial factorisation. This may introduce problems, since polynomial factorisation is well known to be a computationally intensive and numerically unstable problem [9]. Fortunately, [66] provides a convenient solution for this problem using the m × m so-called companion matrix
A = \begin{bmatrix}
-\frac{a_{m-1}}{a_m} & -\frac{a_{m-2}}{a_m} & \cdots & -\frac{a_1}{a_m} & -\frac{a_0}{a_m} \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
The eigenvalues of this matrix correspond to the zeros of the polynomial given by:
\sum_{i=0}^{m} a_i z^i
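The companion-matrix approach can be exercised with numpy, whose roots routine uses exactly this eigenvalue method; the following is an illustrative sketch (not the implementation used in this thesis) of recovering the LSF angles from the roots of p(z) and q(z):

import numpy as np

def lsf_from_lpc(a):
    """Compute line spectral frequencies (radians) from LP coefficients.

    `a` holds the prediction-error filter coefficients [1, a_1, ..., a_P]
    in powers of z^{-1}.  p(z) and q(z) are formed as the sum and
    difference polynomials; their roots are found via the companion
    matrix (numpy.roots), and the LSFs are the angles of the roots on
    the upper half of the unit circle.
    """
    a = np.asarray(a, dtype=float)
    # a(z) + z^{-(P+1)} a(z^{-1})  and  a(z) - z^{-(P+1)} a(z^{-1})
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    roots = np.concatenate([np.roots(p), np.roots(q)])
    angles = [np.angle(r) for r in roots if 0.0 < np.angle(r) < np.pi]
    return np.sort(angles)

# Example with a stable second-order predictor (poles at 0.6 +/- 0.6j)
print(lsf_from_lpc([1.0, -1.2, 0.72]))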
Appendix C
Derivation of the Levinson-Durbin Algorithm
The Levinson-Durbin algorithm uses the following properties of the correlation matrix:
1. The correlation matrix of a given size contains as sub-blocks all the lower order
correlation matrices.
2. The correlation matrix is invariant under the interchange of its columns and rows.
The recursion is initialised with the zeroth-order prediction error, r_0 = J_0.
We are now ready to solve for the optimal predictor of order 1, which satisfies the
equation:
\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix}
\begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix}
=
\begin{bmatrix} J_1 \\ 0 \end{bmatrix}
Where the superscript of a denotes the predictor order. J1 represents the minimum
mean-squared error (MSE) attainable with a predictor of order 1. Let
\begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
- k_1
\begin{bmatrix} 0 \\ 1 \end{bmatrix}
Then, multiplying by the correlation matrix produces:
\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix}
\begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix}
=
\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
- k_1
\begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \end{bmatrix}
Which reduces to:
\begin{bmatrix} J_1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} J_0 \\ \Delta_0 \end{bmatrix}
- k_1
\begin{bmatrix} \Delta_0 \\ J_0 \end{bmatrix}
Implying that

k_1 = \frac{\Delta_0}{J_0} = \frac{r_1}{J_0}
Thus the optimal predictor of order 1 is represented by

-k_1 z^{-1}

and the MSE for the predictor is given by

J_1 = J_0 (1 - k_1^2)
To obtain the optimal predictor of order 2, we solve

\begin{bmatrix} r_0 & r_1 & r_2 \\ r_1 & r_0 & r_1 \\ r_2 & r_1 & r_0 \end{bmatrix}
\begin{bmatrix} 1 \\ a_1^{(2)} \\ a_2^{(2)} \end{bmatrix}
=
\begin{bmatrix} J_2 \\ 0 \\ 0 \end{bmatrix}
We consider a solution of the form

\begin{bmatrix} 1 \\ a_1^{(2)} \\ a_2^{(2)} \end{bmatrix}
=
\begin{bmatrix} 1 \\ a_1^{(1)} \\ 0 \end{bmatrix}
- k_2
\begin{bmatrix} 0 \\ a_1^{(1)} \\ 1 \end{bmatrix}
Once again we multiply by the correlation matrix and obtain

a_2^{(2)} = -k_2   (C.1)

a_1^{(2)} = a_1^{(1)} - k_2 a_1^{(1)}   (C.2)

where

k_2 = \frac{1}{J_1} \left( r_2 + a_1^{(1)} r_1 \right)

and

J_2 = J_1 (1 - k_2^2)
C.0.1 Summary
The L-D algorithm calculates the l-th reflection coefficient by

k_l = \frac{1}{J_{l-1}} \left( r_l + \sum_{i=1}^{l-1} a_i^{(l-1)} r_{l-i} \right)

with k_0 = 1, and additionally uses the conversion in equation 3.8 to calculate the predictor coefficients at every iteration.
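A minimal sketch of the recursion summarised above (illustrative only; the variable names follow this derivation rather than any particular reference implementation, and the coefficients use the sign convention of this appendix, i.e. the error filter 1 + a_1 z^{-1} + ... + a_P z^{-P}):

import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion.

    r     : autocorrelation values r_0 .. r_order
    order : predictor order P
    Returns the error-filter coefficients [1, a_1, ..., a_P], the
    reflection coefficients k_1 .. k_P and the final error J_P.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order + 1)
    J = r[0]                      # J_0 = r_0
    for l in range(1, order + 1):
        # k_l = (r_l + sum_{i=1}^{l-1} a_i^{(l-1)} r_{l-i}) / J_{l-1}
        k[l] = (r[l] + np.dot(a[1:l], r[l - 1:0:-1])) / J
        # a^{(l)} = [a^{(l-1)}, 0] - k_l * [0, reversed(a^{(l-1)})]
        a_prev = a[:l + 1].copy()
        a[1:l + 1] = a_prev[1:l + 1] - k[l] * a_prev[l - 1::-1]
        J = J * (1.0 - k[l] ** 2)
    return a, k[1:], J

# Example: first-order case recovers a_1 = -r_1/r_0 and J_1 = r_0(1 - k_1^2)
print(levinson_durbin(np.array([1.0, 0.5]), order=1))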
Appendix D
African and European Phonetics
Although the MELP speech production model is suited, in a very general and language-independent way, to the human speech production mechanism, substantial linguistic bias may still be introduced. This was demonstrated by the evaluation of the MELP vocoder in chapter 8, in which we found differences in the quality of the transcoded Xhosa and English speech during both subjective and objective evaluation.
As has been shown by Niesler [59], there is a substantial difference between the phonetic content of African and European languages. We investigated two primary areas in which the two language groups may differ, with particular reference to the two languages used in the evaluation we performed in chapter 8, namely English and Xhosa.
Table D.1: Some phoneme classes which have significantly different frequency of
occurrence in Xhosa and English.
The statistics in table D.1 were extracted from sentences in the AST speech database. Since phoneme discrimination is
often done primarily on the basis of the spectral envelope, it makes sense that the
different phoneme distributions would necessitate different vector quantisation
codebooks for the LP system.
[Histogram: frequency of occurrence versus phoneme length, for the Xhosa and English data.]
Figure D.1: Distribution of phoneme lengths in Xhosa and English. Phonemes were
automatically segmented. Statistics were collected over the sentences used in testing as
described in section 8.1. Approximately 4000 Xhosa phonemes and 1000 English
phonemes were used.
It was hypothesised that there may be more variance in the rate at which Xhosa speakers speak. If this were the case, we would see significant differences in the histograms of phoneme lengths for the two languages. This would mean that a vocoder
with an adaptive rate would be advantageous since it would more efficiently encode
longer phonemes and would be less likely to distort short phonemes. However, analysis
of the speech corpus using automatic segmentation and analysis of phoneme lengths
revealed that the phoneme length distributions for English and Xhosa were extremely
similar as shown in figure D.1.
Appendix E
Vector Quantisation
Q : R^n \rightarrow C

r = \frac{\log_2 N}{k}

where N is the number of codevectors in the codebook C. This gives an indication of the average number of bits per component used to represent the input vector.
Figure E.1: 2-D example of a vector quantiser. The points c_n, c_{n+1} and c_{n+2} represent the various codebook entries. The point labelled x is a vector to be encoded. The region associated with each codebook entry is indicated. x lies in the region associated with c_{n+1} and as such will be encoded as the index n + 1 and decoded as c_{n+1}.
In what follows we use the words vector quantiser as synonymous with nearest-neighbour vector quantiser. The primary advantage of this kind of quantiser is that no explicit description of the geometry of the quantiser cells is necessary.
Here c_k represents an entry from the k'th codebook. The approach typically followed is to use the (k + 1)'th codebook to encode the residual of the k'th stage.
For i ∈ {1 . . . M n } (E.3)
xni = xk ; k ∈ {1 . . . N} (E.4)
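To make the nearest-neighbour and multi-stage ideas above concrete, the following is a small illustrative sketch (the codebooks here are random placeholders rather than trained ones, and the sizes are arbitrary):

import numpy as np

def nearest_neighbour_encode(x, codebook):
    """Return the index of the codevector closest to x (Euclidean distance)."""
    distances = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(distances))

def msvq_encode(x, codebooks):
    """Multi-stage VQ: each stage quantises the residual of the previous stage."""
    indices, residual = [], np.array(x, dtype=float)
    for cb in codebooks:
        idx = nearest_neighbour_encode(residual, cb)
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def msvq_decode(indices, codebooks):
    """The decoded vector is the sum of the selected codevectors."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Example: two stages of 16 entries each for 10-dimensional vectors (8 bits total)
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 10)), 0.2 * rng.standard_normal((16, 10))]
x = rng.standard_normal(10)
x_hat = msvq_decode(msvq_encode(x, codebooks), codebooks)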
E.4.3 Transform VQ
In this technique, the vector to be encoded is transformed using a linear transformation
to a different (usually orthogonal) basis. The idea of the transformation is to compact
the information of the vector into a subset of the vector components. One of the above
vector quantisation techniques is then applied to the vector in the transformed space, as
illustrated in figure E.3.
[Figure E.3: the input vector x is transformed by T, quantised, and the decoder applies T^{-1} to produce x̃.]
In low rate vocoders, VQ is applied primarily to the spectral envelope information. In MELP, VQ is also used to encode the shape of the glottal pulses (see 4.3.1), but this is not common and was discarded in later refinements of the MELP model [13, 56]. We will therefore focus our attention in the remainder of this section on the vector quantisation of the spectral envelope.
As mentioned in 3.2.1, the spectral envelope is the dominant factor used in speaker
recognition and phoneme classification by both human listeners and automatic speech
recognition systems. Thus it is not surprising that a large proportion of the information
transmitted by most low rate vocoders is used to describe the spectral envelope. A
typical case is MELP where 25 of 54 bits are used to transmit the LSFs.
The most popular methods in recent publications have been split VQ and MSVQ. Other approaches are described in [63], [78] and [45].
There is a tradeoff between the computational complexity, distortion and bandwidth
efficiency of a VQ. In vector quantisers which achieve equivalent amounts of distortion,
one finds that the bandwidth efficiency is positively correlated with both computational
complexity of the quantisation and the required memory used to store the codebooks
[17].
Of course, all of the above methods have only examined the redundancy exhibited between the individual components of a single linear predictor. However, Vass [45] mentions that the
4.2. Additionally, measures such as the Itakura-Saito measure described in 3.7 may be useful.
E.8 Literature
Collura and Tremain [17] give indications as to the number of bits per frame required to
provide this level of quantisation using various vector quantisation schemes.
Subramaniam and Rao [81] recently proposed a promising new approach to LSF quantisation. The LSF vector is modelled as a random variable whose PDF is described by a Gaussian mixture model (GMM). The expectation-maximisation algorithm is used to generate the GMM from a large speech corpus. To quantise a single vector, we obtain a candidate from each cluster in the GMM, which is the approximation of the vector within that cluster, using scalar quantisation on the (de-correlated) individual components of the vector. The 'best' of the candidates from all the clusters is chosen as the quantisation for the vector. Using this approach, almost transparent VQ is achieved at less than 20 bits per frame.
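A simplified sketch of this GMM-based scheme follows. It is purely illustrative: scikit-learn's GaussianMixture stands in for the EM training described in [81], a fixed uniform scalar quantiser over a +/- 3 sigma range is assumed for the de-correlated components, and the index bookkeeping needed for an actual coder is omitted.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(lsf_training_vectors, n_clusters=8):
    """Fit a GMM to a corpus of LSF vectors using EM."""
    return GaussianMixture(n_components=n_clusters, covariance_type="full",
                           random_state=0).fit(lsf_training_vectors)

def quantise_lsf(x, gmm, bits_per_component=3):
    """Quantise one LSF vector: try every cluster, keep the best candidate."""
    levels = 2 ** bits_per_component
    best, best_err = None, np.inf
    for mean, cov in zip(gmm.means_, gmm.covariances_):
        # De-correlate the components using the cluster's eigenvectors
        eigvals, eigvecs = np.linalg.eigh(cov)
        y = eigvecs.T @ (x - mean)
        # Uniform scalar quantisation of each de-correlated component
        sigma = np.sqrt(np.maximum(eigvals, 1e-12))
        step = 6.0 * sigma / levels
        y_hat = np.clip(np.round(y / step), -levels // 2, levels // 2 - 1) * step
        candidate = mean + eigvecs @ y_hat
        err = np.sum((x - candidate) ** 2)
        if err < best_err:
            best, best_err = candidate, err
    return best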
In [50] a completely different approach is taken. Because the quasi-stationarity of the speech signal can persist for intervals much longer than the commonly accepted 20-30 ms, one can achieve substantial coding gain by exploiting the occurrence of longer stationary intervals in the speech waveform. In this scheme, only frames in which significant spectral change occurs are transmitted. Good results were obtained at bit rates as low as 350 bits per second.
¹ Spectral distortion was defined in section 5.5.2.
Appendix F
LULU Filters
¹ A typical example of this behaviour results from the repeated application of a median filter to the sequence {. . . , −1, +1, −1, +1, −1, +1, −1, +1, −1, . . .}. Each application of a median filter simply reverses the polarity of the sequence.
Then

U_n = \lfloor_n \lceil_n   (F.3)

L_n = \lceil_n \lfloor_n   (F.4)

where \lceil_n and \lfloor_n denote the running maximum and running minimum operators over windows of n + 1 samples.
Rohwer describes several desirable properties of the Ln and Un operators in [72] and
[73]. The basic principle which he emphasises is that of variation reduction, implying
that the Ln and Un operators reduce the spread of sample values in the sequence.
Rohwer illustrates in [72] how the Ln operator removes upward block-pulses or
consecutive outliers of less than length n and Un removes downward block-pulses of less
than length n. Thus the concatenation of Ln Un or Un Ln will remove all block-pulses of
less than length n. However, one must bear in mind that the non-linearity of the
operators implies that Ln Un x and Un Ln x are not necessarily the same. In the context of
a speech processing application we should therefore examine both the Ln Un and Un Ln
operators and determine which produces better results.
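The following is an illustrative sketch of these operators, assuming (as in the reconstruction of F.3 and F.4 above) that L_n is a maximum of running minima and U_n its dual; this is a reading of Rohwer's definitions rather than a verbatim transcription of them, and the edge handling by padding is a practical simplification:

import numpy as np

def L(x, n):
    """L_n smoother: suppresses upward block-pulses of up to n samples.

    Implemented as a maximum of running minima, with the sequence padded
    at both ends by its edge values so that every window is well defined.
    """
    x = np.asarray(x, dtype=float)
    xp = np.concatenate([np.full(n, x[0]), x, np.full(n, x[-1])])
    # inner stage: m[j] = min(xp[j], ..., xp[j+n])
    m = np.array([xp[j:j + n + 1].min() for j in range(len(xp) - n)])
    # outer stage: maximum over n+1 consecutive inner values, aligned to x
    return np.array([m[i:i + n + 1].max() for i in range(len(x))])

def U(x, n):
    """U_n smoother: suppresses downward block-pulses (dual of L_n)."""
    return -L(-np.asarray(x, dtype=float), n)

# Example: an isolated single-frame spike in an otherwise smooth pitch track
pitch = np.array([100., 101., 102., 204., 103., 104., 105.])
print(L(pitch, 1))          # the spike is removed
print(U(L(pitch, 1), 1))    # applying U1 after L1 removes both pulse types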
Additionally, Rohwer recommends that one should filter using the concatenated filters constructed as Ln Un Ln−1 Un−1 . . . L3 U3 L2 U2 L1 U1 and Un Ln Un−1 Ln−1 . . . U3 L3 U2 L2 U1 L1 in order to achieve the best results. We will refer to these as the LU and UL decomposition filters respectively, since they correspond to the output of various stages of the LULU decomposition referred to by Rohwer, and denote these concatenated operators by LUn and ULn.
In figure F.1, we illustrate the effect of LULU filtering on the pitch track as estimated
by the MELP pitch tracking algorithm. As can be seen, even filtering with a low filter
order reduces the number of pitch errors. Increasing the filter order removes the
occurrence of isolated frames which are incorrectly classified as voiced or unvoiced.
However, increasing the filter order also removes some resolution in the pitch, especially
in the sharp pitch peaks exhibited in the correct pitch track. The pitch track is
compared with the true pitch track, which was recovered from laryngogram data [57].
In figure F.2, we illustrate the statistical effect of the Ln Un and Un Ln filters of both the
simple and concatenated forms on pitch and voicing errors in the MELP analysis.
Clearly the filter which exhibits the best performance for the pitch post-processing is the
UL decomposition filter. At its optimal order, it reduces the incidence of gross pitch errors by almost 25% relative to the un-filtered pitch track.
In the case of the voicing decision, the optimal choice of filter is less obvious, since each
of the filters represents a trade-off between the number of frames which are incorrectly
classified as voiced and the number of frames which are incorrectly classified as unvoiced.
In this case the optimisation can only be done using perceptual considerations.
[Figure F.1: four panels of pitch (Hz) versus time (s), comparing the smoothed MELP pitch with pitch tracks recovered from the laryngogram; panels (c) and (d) show the track after filtering with LU10 and LU20 respectively.]
[Figure F.2: probability of misclassification of a frame as unvoiced (top) and as voiced (bottom) versus filter order, for the LU, UL, LU decomposition and UL decomposition filters.]
Appendix G
PESQ
The PESQ algorithm was used substantially in the development phase of our IS-MELP
vocoder. We did not implement the PESQ algorithm, but used the ITU reference
implementation. We present a short description of the algorithm.
G.1 Purpose
The PESQ algorithm compares an original and degraded signal and outputs a prediction
of the perceived quality that would be given to the degraded signal by subjects in a
subjective listening test.
G.2 Limitations
PESQ is not capable of correctly handling certain types of distortion (such as side-tones
and delay). Therefore PESQ is not intended to replace subjective listening tests as the
ultimate authority on the quality of a speech codec. However, PESQ provides a very fast
evaluation of the quality of a speech sample. Therefore it presents a substantial
advantage over listener tests, since it allows us to make design decisions based on
comprehensive analysis of large speech corpora.
1. The power spectrum is binned into Bark bands to reflect the fact that the human
ear has better frequency resolution at lower frequencies.
2. The power spectrum is transformed into a loudness value to account for
non-linearities in perception of sounds. This transformation is performed using
Zwicker’s law. This transformed spectrum is referred to as the loudness spectrum.
3. The difference between the loudness spectra (as defined in the previous steps) of
the original and distorted signals is referred to as the disturbance density.
4. Since introduced spectral components are much more perceptually disturbing than
removed spectral components, a second (asymmetric) disturbance density is
calculated to emphasise introduced spectral components in the degraded signal.
This error signal is intended to reflect spectral information which is present in the
synthesised speech but not in the original.
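As a rough illustration of the first of these steps only (this is not the PESQ-specified filterbank; the band edges below use the common Zwicker and Terhardt approximation of the Bark scale, and the number of bands is an arbitrary choice):

import numpy as np

def hz_to_bark(f):
    """Zwicker & Terhardt approximation of the Bark scale."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_powers(power_spectrum, sample_rate, n_bands=18):
    """Sum a one-sided FFT power spectrum into equal-width Bark bands."""
    power_spectrum = np.asarray(power_spectrum, dtype=float)
    freqs = np.linspace(0.0, sample_rate / 2.0, len(power_spectrum))
    bark = hz_to_bark(freqs)
    edges = np.linspace(0.0, bark[-1] + 1e-9, n_bands + 1)
    bands = np.zeros(n_bands)
    for b in range(n_bands):
        mask = (bark >= edges[b]) & (bark < edges[b + 1])
        bands[b] = power_spectrum[mask].sum()
    return bands

# Example: 256-sample frame at 8 kHz
fs = 8000
frame = np.random.default_rng(0).standard_normal(256)
spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
print(bark_band_powers(spectrum, fs))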