Review
Speech Compression
Jerry D. Gibson
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106, USA;
gibson@ece.ucsb.edu; Tel.: +1-805-893-6187

Academic Editor: Khalid Sayood


Received: 22 April 2016; Accepted: 30 May 2016; Published: 3 June 2016

Abstract: Speech compression is a key technology underlying digital cellular communications, VoIP,
voicemail, and voice response systems. We trace the evolution of speech coding based on the linear
prediction model, highlight the key milestones in speech coding, and outline the structures of the
most important speech coding standards. Current challenges, future research directions, fundamental
limits on performance, and the critical open problem of speech coding for emergency first responders
are all discussed.

Keywords: speech coding; voice coding; speech coding standards; speech coding performance; linear
prediction of speech

1. Introduction
Speech coding is a critical technology for digital cellular communications, voice over Internet
protocol (VoIP), voice response applications, and videoconferencing systems. In this paper, we present
an abridged history of speech compression, a development of the dominant speech compression
techniques, and a discussion of selected speech coding standards and their performance. We also
discuss the future evolution of speech compression and speech compression research. We specifically
develop the connection between rate distortion theory and speech compression, including rate
distortion bounds for speech codecs. We use the terms speech compression, speech coding, and
voice coding interchangeably in this paper. The voice signal contains not only what is said but also
the vocal and aural characteristics of the speaker. As a consequence, it is usually desired to reproduce
the voice signal, since we are interested in not only knowing what was said, but also in being able to
identify the speaker. All of today’s speech coders have this as a goal [1–3].
Compression methods can be classified as either lossless or lossy. Lossless compression methods
start with a digital representation of the source and use encoding techniques that allow the source
to be represented with fewer bits while allowing the original source digital representation to be
reconstructed exactly by the decoder. Lossy compression methods relax the constraint of an exact
reproduction and allow some distortion in the reconstructed source [4,5].
Thus, given a particular source such as voice, audio, or video, the classic tradeoff in lossy source
compression is rate versus distortion—the higher the rate, the smaller the average distortion in the
reproduced signal. Of course, since a higher bit rate implies a greater channel or network bandwidth or
a larger storage requirement, the goal is always to minimize the rate required to satisfy the distortion
constraint or to minimize the distortion for a given rate constraint. For speech coding, we are interested
in achieving a quality as close to the original speech as possible within the rate, complexity, latency, and
any other constraints that might be imposed by the application of interest. Encompassed in the term
quality are intelligibility, speaker identification, and naturalness. Note that the basic speech coding
problem follows the distortion rate paradigm; that is, given a rate constraint set by the application, the
codec is designed to minimize distortion. The resulting distortion is not necessarily small or inaudible,
just acceptable for the given application.


The distortion rate structure is contrasted with the rate distortion formulation wherein the
constraint is on allowable distortion and the rate required to achieve that distortion is minimized.
Notice that for the rate distortion approach, a specified distortion is the goal and the rate is adjusted to
obtain this level of distortion. Voice coding for digital cellular communications is an example of the
distortion rate approach, since it has a rate constraint, while coding of high quality audio typically has
the goal of transparent quality, and hence is an example of the rate distortion paradigm.
The number of bits/s required to represent a source is equal to the number of bits/sample
multiplied by the number of samples/s. The first component, bits/sample, is a function of the coding
method, while the second component, samples/s, is related to the source bandwidth. Therefore, it is
common to distinguish between speech and audio coding according to the bandwidth occupied by
the input source. Narrowband or telephone bandwidth speech occupies the band from 200 to 3400 Hz,
and is the band classically associated with telephone quality speech. The category of wideband speech
covers the band 50 Hz to 7 kHz, which is a bandwidth that originally appeared in applications in 1988
but has come into prominence in the last decade. Audio is generally taken to cover the range of 20 Hz
to 20 kHz, and this bandwidth is sometimes referred to today as fullband audio. More recently, a few
other bandwidths have attracted attention, primarily for audio over the Internet applications, and the
bandwidth of 50 Hz to 14 kHz, designated as superwideband, has gotten considerable recent attention [6].
The interest in wider bandwidths comes from the fact that wider bandwidths improve intelligibility,
naturalness, and speaker identifiability. Furthermore, the extension of the bandwidth below 200 Hz
adds to listener comfort, warmth, and naturalness. The focus in this paper is on narrowband and
wideband speech coding; however, codecs for these bands often serve as building blocks for wider
bandwidth speech and audio codecs. Audio coding is only discussed here as it relates to the most
prevalent approaches to narrowband and wideband speech coding.
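To make the rate arithmetic concrete, here is a minimal sketch (illustrative Python, not from the paper) that computes bits/s as bits/sample times samples/s; the sampling rates follow the conventional practice of sampling at roughly twice the upper band edge (e.g., 8 kHz for narrowband, 16 kHz for wideband), which is an assumption for illustration rather than a statement from this article.

```python
# Illustrative rate arithmetic: bits/s = (bits/sample) x (samples/s).
# Sampling rates below are the conventional ones per band class (assumed here).
BANDS = {
    "narrowband (200-3400 Hz)": 8000,      # samples/s
    "wideband (50 Hz-7 kHz)": 16000,
    "superwideband (50 Hz-14 kHz)": 32000,
    "fullband (20 Hz-20 kHz)": 48000,
}

def bit_rate(bits_per_sample: float, samples_per_s: int) -> float:
    """Rate in bits/s for a given coding accuracy and sampling rate."""
    return bits_per_sample * samples_per_s

for name, fs in BANDS.items():
    # e.g., 2 bits/sample as a stand-in medium-rate coding accuracy
    print(f"{name}: {bit_rate(2, fs) / 1000:.0f} kbits/s at 2 bits/sample")
```

The sketch makes the text's point directly: for a fixed coding accuracy, the bit rate scales with the source bandwidth through the sampling rate.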
As the frequency bands being considered move upward from narrowband speech through
wideband speech and superwideband speech/audio, on up to fullband audio, the basic structures for
digital signal processing and the desired reproduced quality change substantially. Interestingly, all
of these bands are incorporated in the latest speech coders, and the newest speech coding standard,
EVS, discussed later, utilizes a full complement of signal processing techniques to produce a relatively
seamless design.
The goal of speech coding is to represent speech in digital form with as few bits as possible while
maintaining the intelligibility and quality required for the particular application [1,4,5]. This one
sentence captures the fundamental idea that rate and distortion (reconstructed speech quality) are
inextricably intertwined. The rate can always be lowered if quality is ignored, and quality can always
be improved if rate is not an issue. Therefore, when we mention the several bit rates of various speech
codecs, the reader should remember that as the rate is adjusted, the reconstructed quality changes as
well, and that a lower rate implies poorer speech quality.
The basic approaches for coding narrowband speech evolved over the years from waveform
following codecs to the code excited linear prediction (CELP) based codecs that are dominant
today [1,5]. This evolution was driven by applications that required lower bandwidth utilization
and by advances in digital signal processing, which became implementable due to improvements
in processor speeds that allowed more sophisticated processing to be incorporated. Notably, the
reduction in bit rates was obtained by relaxing prior constraints on encoding delay and on complexity.
This latter relaxation of constraints, particularly on complexity, should be a lesson learned for future
speech compression research; namely, the complexity constraints of today will almost certainly be
changed in the future.
With regard to complexity, it is interesting to note that most of the complexity in speech encoding
and decoding resides at the encoder for most voice codecs; that is, speech encoding is more complex,
often dramatically so, than decoding. This fact can have implications when designing products.
For example, voice response applications, wherein a set of coded responses are stored and addressed
by many users, require only a single encoding of each stored response (the complex step) but those
responses may be accessed and decoded many times. For real time voice communications between
two users, however, each user must have both an encoder and a decoder, and both the encoder and
the decoder must operate without noticeable delay.
As we trade off rate and distortion, the determination of the rate of a speech codec is
straightforward; however, the measurement of the distortion is more subtle. There are a variety of
approaches to evaluating voice intelligibility and quality. Absolute category rating (ACR) tests are
subjective tests of speech quality and involve listeners assigning a category and rating for each speech
utterance according to the classifications, such as, Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1).
The average for each utterance over all listeners is the Mean Opinion Score (MOS) [1].
Of course, listening tests involving human subjects are difficult to organize and perform, so the
development of objective measures of speech quality is highly desirable. The perceptual evaluation
of speech quality (PESQ) method, standardized by the ITU-T as P.862, was developed to provide an
assessment of speech codec performance in conversational voice communications. The PESQ has
been and can be used to generate MOS values for both narrowband and wideband speech [5,7]. While
no substitute for actual listening tests, the PESQ and its wideband version have been widely used for
initial codec evaluations. A newer objective measure, designated as P.863 POLQA (Perceptual
Objective Listening Quality Assessment), has been developed but it has yet to receive widespread
acceptance [8]. For a tutorial development of perceptual evaluation of speech quality, see [9]. More
details on MOS and perceptual performance evaluation for voice codecs are provided in the
references [1,7-10].
The emphasis in this paper is on linear prediction based speech coding. The reason for this
emphasis is that linear prediction has been the dominant structure for narrowband and wideband
speech coding since the mid-1990s [11] and essentially all important speech coding standards since
that time are based on the linear prediction paradigm [3,11]. We do not discuss codec modifications
to account for channel or network effects, such as bit errors, lost packets, or delayed packets. While
these issues are important for overall codec designs, the emphasis here is on compression, and the
required modifications are primarily add-ons to compensate for such non-compression issues.
Further, these modifications must be matched to the specific compression method being used, so
understanding the speech compression techniques is an important first step for their design
and implementation.
We begin with the fundamentals of linear prediction.

2. The Basic Model: Linear Prediction

The linear prediction model has served as the basis for the leading speech compression methods
over the last 45 years. The linear prediction model has the form

s(n) = \sum_{i=1}^{N} a_i s(n - i) + w(n)    (1)

where we see that the current speech sample at time instant n can be represented as a weighted linear
combination of N prior speech samples plus a driving term or excitation at the current time instant.
The weights, {a_i, i = 1, 2, ..., N}, are called the linear prediction coefficients. A block diagram of
this model is depicted in Figure 1.

Figure 1. The Linear Prediction Model.
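As a concrete instance of Equation (1), the following minimal sketch (illustrative Python; the coefficient values and noise excitation are arbitrary stand-ins, not values from the paper) generates samples from the linear prediction model by running the recursion with a white noise driving term w(n).

```python
import numpy as np

def lp_synthesize(a, w):
    """Generate s(n) = sum_{i=1..N} a_i * s(n - i) + w(n), Equation (1),
    assuming zero initial conditions."""
    N = len(a)
    s = np.zeros(len(w))
    for n in range(len(w)):
        for i in range(1, N + 1):
            if n - i >= 0:
                s[n] += a[i - 1] * s[n - i]
        s[n] += w[n]                      # excitation at the current instant
    return s

# Example: a stable second order (N = 2) model driven by white noise.
a = [1.3, -0.6]                           # illustrative coefficients
w = np.random.randn(16000) * 0.1          # excitation w(n)
s = lp_synthesize(a, w)
```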


We can write the z-domain transfer function of the block diagram in Figure 1 by assuming zero
initial conditions to obtain

S(z)/W(z) = 1 / (1 - A(z)) = 1 / (1 - \sum_{i=1}^{N} a_i z^{-i})    (2)

where A(z) represents the weighted linear combination of past samples as indicated [4,5]. This model
is also known as an autoregressive (AR) process or AR model in the time series analysis literature.
It is helpful to envision the linear prediction model as a speech synthesizer, wherein speech is
reconstructed by inserting the linear prediction coefficients and applying the appropriate excitation
in order to generate the set of speech samples. This is the basic structure of the decoders in all linear
prediction based speech codecs [12]. However, the encoders carry the burden of calculating the linear
prediction coefficients and choosing the excitation to allow the decoder to synthesize acceptable
quality speech [4,5].
The earliest speech coder to use the linear prediction formulation was differential pulse code
modulation (DPCM) shown in Figure 2. Here we see that the decoder has the form of the linear
prediction model and the excitation consists of the quantized and coded prediction error at each
sampling instant. This prediction error is decoded and used as the excitation, and the linear prediction
coefficients are either computed at the encoder and transmitted or calculated at both the encoder and
decoder on a sample-by-sample basis using least mean squared (LMS) or recursive least squares (RLS)
algorithms that are adapted based on the reconstructed speech samples. The LMS approach served as
the basis for the ITU-T international standards G.721, G.726, and G.727, which have transmitted bit
rates from 16 kilobits/s up to 40 kilobits/s, with what is called "toll quality" produced at 32 kbits/s.
See the references for a further development of DPCM and other time domain waveform following
variants as well as the related ITU-T standards [1,4,5].

Figure 2. Differential Pulse Code Modulation (DPCM).
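A minimal sketch of the DPCM idea in Figure 2 follows (illustrative Python; a fixed first order predictor and a uniform quantizer stand in for the adaptive LMS/RLS predictors described above, so this is a toy, not any of the ITU-T codecs).

```python
import numpy as np

def dpcm_encode_decode(x, a=0.9, step=0.02):
    """Toy DPCM: predict each sample from the previous reconstruction,
    quantize the prediction error, and rebuild the signal from it.
    Encoder and decoder share the same predictor state, so no drift occurs."""
    recon = np.zeros_like(x)
    codes = np.zeros_like(x)
    prev = 0.0
    for n in range(len(x)):
        pred = a * prev                      # prediction from reconstruction
        err = x[n] - pred                    # prediction error
        q = step * np.round(err / step)      # uniform quantization (the "code")
        codes[n] = q
        prev = pred + q                      # reconstructed sample
        recon[n] = prev
    return codes, recon

x = np.sin(2 * np.pi * 200 * np.arange(800) / 8000)   # 200 Hz test tone
codes, recon = dpcm_encode_decode(x)
```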

Of course, for many applications these rates were too high, and to lower these rates, while
maintaining reconstructed speech quality, required a more explicit use of the linear prediction model.
It is instructive to investigate the usefulness of the linear prediction model for speech spectrum
approximation. To do this, consider the voiced speech segment shown in Figure 3a. If we take the
Fast Fourier Transform (FFT) of this segment, we obtain the spectrum shown in Figure 3b.

Figure 3. (a) A voiced speech segment; (b) FFT of the speech segment in (a); (c) Magnitude Spectrum
of the Segment in (a) from Linear Prediction N = 100; (d) An N = 10th order Linear Predictor
Approximation.

The very pronounced ripples in the spectrum in Figure 3b are the harmonics of the pitch period,
visible in Figure 3a as the periodic spikes in the time domain waveform. As one might guess, these
periodicities are due to the periodic excitation of the vocal tract by puffs of air being released by the
vocal cords. Can the linear prediction model provide a close approximation of this spectrum? Letting
the predictor order N = 100, the magnitude spectrum can be obtained from the linear prediction
model in Figure 1, and this is shown in red in Figure 3c. We see that the model is able to provide an
excellent approximation to the magnitude spectrum, reproducing all of the pitch harmonics very well.
However, for speech coding, this is not a very efficient solution since we would have to quantize and
code 100 frequency locations plus their amplitudes to be transmitted to reproduce this spectrum. This
is a relatively long speech segment, about 64 ms, so if we needed (say) 8 bits/frequency location plus
8 bits for amplitude for accurate reconstruction, the transmitted bit rate would be about 25,000 bits/s.
This rate is about the same or slightly lower than DPCM for approximately the same quality but still
more than the 8 kbits/s or 4 kbits/s that is much more desirable in wireline and cellular applications.
Further, speech sounds can be expected to change every 10 ms or 20 ms, so the transmitted bit rate
would be 3 to 6 times 25 kbits/s, which is clearly not competitive.
So, what can be done? The solution that motivated the lower rate linear predictive coding
methods was to use a lower order predictor, say N = 10, to approximate the envelope of the spectrum
as shown in red in Figure 3d, and then provide the harmonic structure using the excitation. Thus, we
only need to quantize and code 10 coefficients, and so if the rate required for the excitation is relatively
low, the bit rate should be much lower, even with 10 ms frame sizes for the linear prediction analysis.
The linear predictive coder (LPC) was pioneered by Atal and Hanauer [13], Makhoul [14],
Markel and Gray [15], and others, and took the form shown in Figure 4, which with N = 10 uses the
explicit split between the linear prediction fit of the speech envelope and the excitation to provide the
spectral fine structure. In Figure 4, the excitation consists of either a periodic impulse sequence if the
speech is determined to be Voiced (V) or white noise if the speech is determined to be Unvoiced (UV),
and G is the gain of the excitation used to match the reconstructed speech energy to that of the input.
A depiction of the two components, namely the speech envelope and the spectral fine structure, for
a particular speech spectrum is shown in Figure 5 [16].

Figure 4. Linear Predictive Coding (LPC).
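A minimal sketch of the Figure 4 decoder follows (illustrative Python; the pitch period, gain, and coefficient values are arbitrary stand-ins for the quantized parameters a real LPC-10 decoder would receive each frame).

```python
import numpy as np
from scipy.signal import lfilter

def lpc_decode_frame(a, gain, voiced, frame_len=80, pitch_period=57):
    """Synthesize one frame: periodic impulses (V) or white noise (UV),
    scaled by G, driving the all-pole filter 1 / (1 - A(z))."""
    if voiced:
        excitation = np.zeros(frame_len)
        excitation[::pitch_period] = 1.0      # impulse train at the pitch rate
    else:
        excitation = np.random.randn(frame_len)
    # Denominator [1, -a1, ..., -aN] implements 1 / (1 - sum a_i z^-i).
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), excitation)

frame = lpc_decode_frame(a=[1.3, -0.6], gain=0.5, voiced=True)
```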
[Figure 5 plot, "Formant and harmonic weighting": the original speech spectrum, the formant filter
response, and the pitch filter response; magnitude in dB versus frequency in Hz, 0 to 4000 Hz.]

Figure 5. A depiction of the decomposition of the spectrum in terms of the envelope and the spectral
fine structure.
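The envelope component in Figure 5 can be visualized directly from the short term predictor. The sketch below (illustrative Python; the 10th order coefficients are arbitrary placeholders for values a real analysis step would produce) evaluates the magnitude response of 1 / (1 - A(z)) on the unit circle.

```python
import numpy as np
from scipy.signal import freqz

def lp_envelope_db(a, fs=8000, n_points=512):
    """Magnitude of 1 / (1 - A(z)) on the unit circle, in dB.
    This is the formant (envelope) part of the Figure 5 decomposition."""
    den = np.concatenate(([1.0], -np.asarray(a)))
    w, h = freqz([1.0], den, worN=n_points)
    freqs_hz = w * fs / (2 * np.pi)
    return freqs_hz, 20 * np.log10(np.abs(h) + 1e-12)

# Illustrative 10th order coefficients (a codec computes these per frame).
a10 = [1.2, -0.8, 0.3, -0.1, 0.05, 0.02, -0.04, 0.03, -0.02, 0.01]
freqs, env_db = lp_envelope_db(a10)
```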

The linear prediction coefficients and the excitation (V/UV decision, gain, and pitch) are
calculated based on a block or frame of input speech using methods that have been well known
since the 1970s [4,5,14,15]. These parameters are quantized and coded for transmission to the
receiver/decoder, and they must be updated regularly in order to track the time varying nature of a
speech signal [3-5,11,17,18]. The resulting bit rate was usually 2.4 kbits/s, 4 kbits/s, or 4.8 kbits/s,
depending on the application and the quality needed. For LPC-10 at a rate of 2.4 kbits/s, the coding
of the linear prediction coefficients was allocated more than 1.8 kbits/s and the gain, voicing, and
pitch (if needed) received the remaining 600 bits/s [5]. This structure in Figure 4 served as the decoder
for the LPC-10 (for a 10th order predictor) Federal Standard 1015 [19], as well as the synthesizer in
the Speak 'N Spell toy [20]. The speech quality produced by the LPC codecs was intelligible and
retained many individual speaker characteristics, but the reconstructed speech can be "buzzy" and
synthetic-sounding for some utterances.
Thus, the power of the linear prediction model is in its ability to provide different resolutions of
the signal frequency domain representation and the ability to separate the calculation of the speech
spectral envelope from the model excitation, which fills in the harmonic fine structure. Today's speech
coders are a refinement of this approach.

3. The Analysis-by-Synthesis Coding Paradigm

Researchers found the linear prediction model compelling, but it was clear that the excitation
must be improved without resorting to the higher transmitted bit rates of waveform-following coders
such as DPCM. After a series of innovations, the analysis-by-synthesis (AbS) approach emerged as the
most promising method to achieve good quality coded speech at 8 kbits/s, which was a very useful
rate for wireline applications, and more importantly, for digital cellular applications. An
analysis-by-synthesis coding scheme is illustrated in Figure 6, where a preselected set of excitations,
(say) 1024 sequences of some chosen length, (say) 80 samples, and here shown as the Codebook, are
applied one at a time (each 80 sample sequence) to the linear prediction model with a longer term
predictor also included to model the periodic voiced excitation. For each excitation, the speech is
synthesized and subtracted from the current block of input speech being coded to form an error
signal; this error signal is then passed through a perceptual weighting filter, squared, and averaged
over the block to get a measure of the weighted squared error. This is repeated for every possible
excitation (1024 here), and the one excitation that produces the minimum weighted squared error is
chosen; then its 10 bit code is transmitted along with the predictor parameters to the decoder or
receiver to synthesize the speech [21].
[Figure 6 block diagrams: a codebook excitation scaled by a gain g drives a long term synthesis filter
1/(1 - P(z)) and a short term synthesis filter 1/(1 - A(z)); at the encoder, the error e(n) between the
input s(n) and the synthesized output is weighted by W(z) and minimized over the codebook.]

Figure 6. (a) An analysis-by-synthesis encoder; (b) An analysis-by-synthesis decoder.

Let us investigate how we can get the rate of 8 kbits/s using this method. At a sampling rate of
8000 samples/s, a sequence 80 samples long corresponds to 10 ms, so for 1024 sequences, we need
10 bits transmitted every 10 ms, for a rate of 1000 bits/s for the excitation. This leaves 7000 bits/s for
10 coefficients (this is a maximum since we need to transmit a couple of other parameters), which can
yield a very good approximation.
The set of 1024 codewords in the codebook and the analysis-by-synthesis approach, as promising
as it appears, entail some difficult challenges, one of which is the complexity of synthesizing 1024
possible 80 sample reconstructed speech segments for each input speech segment of length 80 samples,
every 10 ms! This is in addition to calculating the linear prediction coefficients and the pitch excitation.
In recent years, it has become common to use an adaptive codebook structure to model the long
term memory rather than a cascaded long term predictor. An encoder using the adaptive codebook
approach and a corresponding decoder are shown in Figure 7a,b, respectively. The adaptive codebook
is used to capture the long term memory, and the fixed codebook is selected to be a set of random
sequences, binary codes, or a vector quantized version of a set of desirable sequences. The
analysis-by-synthesis procedure is computationally intensive, and it is fortunate that algebraic
codebooks, which have mostly zero values and only a few nonzero pulses, have been discovered and
work well for the fixed codebook [22,23].
Figure 7. (a) Encoder for code-excited linear predictive (CELP) coding with an adaptive codebook;
(b) CELP decoder with an adaptive codebook.
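A minimal sketch of the analysis-by-synthesis search loop follows (illustrative Python; a random codebook, a fixed short term predictor, and a trivial W(z) = 1 stand in for the trained codebook and adapted filters of a real CELP coder).

```python
import numpy as np
from scipy.signal import lfilter

def abs_search(target, codebook, a, w_num, w_den):
    """For each candidate excitation, synthesize speech through 1/(1 - A(z)),
    perceptually weight the error, and keep the index with minimum energy."""
    synth_den = np.concatenate(([1.0], -np.asarray(a)))
    best_idx, best_err = -1, np.inf
    for idx, excitation in enumerate(codebook):
        synth = lfilter([1.0], synth_den, excitation)       # trial synthesis
        weighted = lfilter(w_num, w_den, target - synth)    # weighted error
        err = np.mean(weighted ** 2)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err

codebook = np.random.randn(1024, 80)       # 1024 candidate 80-sample excitations
target = np.random.randn(80)               # current input block (stand-in)
idx, err = abs_search(target, codebook, a=[1.3, -0.6],
                      w_num=[1.0], w_den=[1.0])   # W(z) = 1 for simplicity
```

The exhaustive loop makes the complexity argument in the text tangible: 1024 filterings per 10 ms block, before any of the structural shortcuts discussed below.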
The analysis-by-synthesis coding structure relies heavily on the perceptual weighting filter to
select an excitation sequence that produces highly intelligible, high quality speech. Further, the
analysis-by-synthesis approach only became widely implementable after innovations in the design of
the excitation sequence and in efficient search procedures so as to reduce complexity dramatically.
These advances and the current codecs are discussed in following sections. See also [11,17,18].
4. The Perceptual Weighting Function

As noted in the previous section, the perceptual weighting filter is critical to the success of the
analysis-by-synthesis approach. This importance was exposed early by the work of Anderson and
his students on tree coding, a form of analysis-by-synthesis coding built around a DPCM like coding
structure, wherein they used unweighted mean squared error [24]. They were able to greatly improve
the signal-to-quantization noise ratio, which is a measure of how well the speech time-domain
waveform is approximated, over DPCM at the same rate, but with a surprising degradation in
perceived quality! The degradation in speech quality was the result of the analysis-by-synthesis
search with the mean squared error distortion measure generating a spectrally whitened coding error,
which sounded noise-like and had a flattened spectrum. The later work of Atal and Schroeder
employing the coding method shown in Figure 6 with a perceptual weighting filter (as well as a
longer block size) revealed the promise of the paradigm, but with the complexity limitations at the
time from the Gaussian excitation and the analysis-by-synthesis search [21]. We return to this issue in
the next section.
The selection of a perceptual weighting filter was informed by the prior work on noise spectral
shaping in conjunction with waveform coders [25]. The shaping of the quantization error in those
codecs was accomplished by creating a weighting function using the linear prediction coefficients
and motivated by the linear prediction model itself. The general form of the noise shaping filter in
the waveform coders was

W(z) = (1 - \sum_{i=1}^{N} \beta^i a_i z^{-i}) / (1 - \sum_{i=1}^{N} \alpha^i a_i z^{-i})    (3)

where the {a_i, i = 1, ..., N} are the linear prediction coefficients and the parameters α and β are
weighting factors chosen to be between 0 and 1 to adjust the shape of the formant peaks and the
spectral valleys. Various values of these parameters have been used in the successful codecs, with
these parameters usually held fixed for coding all inputs.
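A sketch of Equation (3) in code (illustrative Python; the values α = 0.9 and β = 0.5 are common choices in the CELP literature but are assumptions here, not values taken from this paper):

```python
import numpy as np
from scipy.signal import lfilter

def perceptual_weighting_filter(a, alpha=0.9, beta=0.5):
    """Build W(z) of Equation (3): numerator 1 - sum beta^i a_i z^-i,
    denominator 1 - sum alpha^i a_i z^-i."""
    a = np.asarray(a)
    i = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -(beta ** i) * a))
    den = np.concatenate(([1.0], -(alpha ** i) * a))
    return num, den

num, den = perceptual_weighting_filter([1.3, -0.6])
error = np.random.randn(80)                  # coding error for one block
weighted_error = lfilter(num, den, error)    # spectrally shaped error
```

Because β < α, the filter de-emphasizes the error near formant peaks, which is exactly the reshaping effect described next.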
The effect of perceptual weighting in analysis-by-synthesis codecs is shown in Figure 8, where
the input speech spectral envelope is shown in blue as the original and the unweighted squared error
spectral shape is shown as a dashed red line. We can see that the dashed red line crosses over and
moves above the blue line representing the original speech spectral envelope in several frequency
bands. What this means perceptually is that in these regions the coding error or noise is more audible
than in those regions where the input speech spectrum is above the error spectrum. The goal of the
frequency weighted perceptual weighting is to reshape the coding error spectrum such that it lies
below the input speech spectrum across the desired frequency band. With a proper selection of the
parameters α and β in the weighting function, the error spectrum can be reshaped as shown by the
solid red line in Figure 8. This shaping causes the input speech to mask the coding error, which
produces a perceptually preferable output for listeners.

Figure 8. Example of the perceptual weighting function effect for analysis-by-synthesis coding.

Notice that although the solid red line does lie below the solid blue line across the frequency band,
there are a couple of frequencies where the two curves get close together and even touch. The most
desirable perceptual shaping would keep the red curve corresponding to the coding error spectral
envelope an equally spaced distance below the input speech envelope across the band, but this is not
achieved with the shaping shown. This reveals that this shaping method is not universally successful,
and in some coded frames of speech the coding error spectrum may cross over the input speech
spectrum when the parameters α and β are held fixed, as they usually are in most codecs.
However, this weighting function is widely used and has been quite successful in applications.

5. The Set of Excitation Sequences: The Codebook

In demonstrating the promising performance of analysis-by-synthesis speech coding, Atal and
Schroeder used a perceptual weighting function and a codebook of 1024 Gaussian sequences, each
40 samples long. The complexity of the analysis-by-synthesis codebook search, wherein for each
40 samples of input speech to be coded, 1024 possible reproduction sequences are generated, was
immediately recognized as prohibitive [21]. Researchers investigated a wide variety of possible
codebooks in addition to Gaussian random codebooks, including convolutional codes, vector
quantization, permutation codes, and codes based on block codes from error control coding. The
key breakthrough by Adoul and his associates was to demonstrate that relatively sparse codebooks
made up of a collection of +1 or -1 pulses all of the same amplitude could produce good quality
speech [22,23].
These codebooks have been refined to what are called the interleaved single pulse permutation
(ISPP) designs that are common in the most popular codecs today. These codebooks consist of a set of
40 sample long sparse sequences with fixed pulse locations that are used sequentially to reconstruct
possible sequences. The coupling of the sparsity, the fixed pulse locations, and the sequential searching
reduces the complexity of the analysis-by-synthesis process while still generating good quality
reconstructed speech. These codebooks are discussed in more detail in the references [11,17,18,22,23].
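A sketch of building one sparse excitation in the spirit of these designs (illustrative Python; the 4-pulses-on-interleaved-tracks layout mirrors the track structure used in ACELP-family codecs, but the specific frame length, track count, and pulse values here are assumptions for illustration):

```python
import numpy as np

def sparse_excitation(positions, signs, frame_len=40, n_tracks=4):
    """Place one +/-1 pulse per interleaved track: track t owns sample
    positions t, t + n_tracks, t + 2*n_tracks, ... within the frame."""
    exc = np.zeros(frame_len)
    for track, (slot, sign) in enumerate(zip(positions, signs)):
        exc[track + n_tracks * slot] = sign    # pulse restricted to its track
    return exc

# 4 tracks of 10 slots each: one pulse position index and one sign per track
# are all the encoder needs to transmit for this codeword.
exc = sparse_excitation(positions=[3, 7, 0, 9], signs=[+1, -1, +1, -1])
```

Because each pulse is confined to its own track, the search can proceed track by track, which is the sequential searching the text credits with reducing complexity.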

6. Codec Refinements
A host of techniques for improving coded speech quality, lowering the bit rate, and reducing
complexity have been developed over the years. Here we mention only three techniques that are
incorporated in most higher performance speech coding standards (such as G.729, AMR, and EVS,
all to be discussed in Section 8): Postfiltering, voice activity detection (VAD) and comfort noise
generation (CNG).

6.1. Postfiltering
Although a perceptual weighting filter is used inside the search loop for the best excitation
in the codebook in analysis-by-synthesis methods, there is often some distortion remaining in the
reconstructed speech that is sometimes characterized as “roughness”. This distortion is attributed
to reconstruction or coding error as a function of frequency that is too high at regions between
formants and between pitch harmonics. Codecs thus often employ a postfilter that operates on the
reconstructed speech at the decoder to de-emphasize the coding error between formants and between
pitch harmonics. Postfiltering is indicated by the “Post-Processing” block in Figure 7b.
The general frequency response of the postfilter has a form similar to that of the perceptual weighting
filter, with a pitch or long term postfilter added. There is also a spectral tilt correction, since
the formant-based postfilter results in an increased low pass filter effect, and a gain correction
term [26]. The postfilter is usually optimized for a single stage of encoding (however, not always),
so if multiple tandem connections of speech codecs occur, the postfilter can cause a degradation in
speech quality [5,17,18,26].
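A sketch of a short term postfilter of this general form (illustrative Python; the transfer function (1 - A(z/β)) / (1 - A(z/α)) with a first order tilt correction follows the shape described above, but the coefficient values are assumptions, not taken from any standard):

```python
import numpy as np
from scipy.signal import lfilter

def short_term_postfilter(speech, a, alpha=0.75, beta=0.5, mu=0.3):
    """Emphasize formant regions of the reconstructed speech, then apply a
    first order tilt correction to offset the low pass effect, and rescale."""
    a = np.asarray(a)
    i = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -(beta ** i) * a))    # 1 - A(z/beta)
    den = np.concatenate(([1.0], -(alpha ** i) * a))   # 1 - A(z/alpha)
    y = lfilter(num, den, speech)
    y = lfilter([1.0, -mu], [1.0], y)                  # spectral tilt correction
    return y * (np.std(speech) / (np.std(y) + 1e-12))  # crude gain correction

out = short_term_postfilter(np.random.randn(160), a=[1.3, -0.6])
```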

6.2. Voice Activity Detection and Comfort Noise Generation


It has been said broadly that conversational speech has about 50% silence. Thus, it seems intuitive
that the average bit rate can be reduced by removing silent periods in speech and simply coding these
long periods at a much reduced bit rate. The detection of silent periods between speech utterances,
called voice activity detection (VAD), is tricky, particularly when there is background noise. However,
ever more sophisticated methods for VAD have been devised that remove silence without clipping the
beginning or end of speech utterances [18,27].
Interestingly, it was quickly discovered that inserting pure silence into the decoded bit stream
produced unwanted perceptual artifacts for the listener, because segments of the coded speech
utterance carry in the background any signals that are present during the "silent" periods, so inserting
pure silence produced an audibly pronounced switching between silence and speech plus
background sounds.
Further, pure silence sometimes gave the listener the impression that the call had been lost. Therefore,
techniques were developed to characterize the sounds present in between speech utterances, such
as energy levels and even spectral shaping, and then code this information so that more realistic
reconstruction of the “silent” intervals could be accomplished. These techniques are called comfort
noise generation (CNG) and are essential to achieving lower average bit rates while maintaining speech
quality [18,27].
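A minimal energy-based VAD sketch (illustrative Python; standardized VADs also use spectral features and hangover logic to avoid clipping utterance boundaries, and the threshold here is an arbitrary assumption):

```python
import numpy as np

def energy_vad(x, fs=8000, frame_ms=20, threshold_db=-40.0):
    """Mark each frame as speech (True) or silence (False) by comparing
    its energy in dB against a fixed threshold."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    decisions = []
    for k in range(n_frames):
        frame = x[k * frame_len:(k + 1) * frame_len]
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        decisions.append(energy_db > threshold_db)
    return decisions

flags = energy_vad(np.random.randn(8000) * 0.01)   # low-level noise input
```

Frames flagged as silence would then be handed to the CNG parameterization (energy level and coarse spectral shape) rather than to the full speech coder.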

7. The Relationship between Speech and Audio Coding


The process of breaking the input speech into subbands via bandpass filters and coding each
band separately is called subband coding [4,5,28,29]. To keep the number of samples to be coded
at a minimum, the sampling rate for the signals in each band is reduced by decimation. Of course,
since the bandpass filters are not ideal, there is some overlap between adjacent bands and aliasing
occurs during decimation. Ignoring the distortion or noise due to compression, quadrature mirror
filter (QMF) banks allow the aliasing that occurs during filtering and subsampling at the encoder
to be cancelled at the decoder [28,29]. The codecs used in each band can be PCM, ADPCM, or even
an analysis-by-synthesis method; however, the poorer the coding of each band, the more likely aliasing
will no longer be cancelled by the choice of synthesizer filters. The advantage of subband coding is
that each band can be coded to a different accuracy and that the coding error in each band can be
controlled in relation to human perceptual characteristics [4,5].
Transform coding methods were first applied to still images but later investigated for speech.
The basic principle is that a block of speech samples is operated on by a discrete unitary transform
and the resulting transform coefficients are quantized and coded for transmission to the receiver.
Low bit rates and good performance can be obtained because more bits can be allocated to the
perceptually important coefficients, and for well-designed transforms, many coefficients need not be
coded at all, but are simply discarded, and acceptable performance is still achieved [30].
Although classical transform coding has not had a major impact on narrowband speech coding
and subband coding has fallen out of favor in recent years (with a slight recent resurgence for
Bluetooth audio [31]), filter bank and transform methods play a critical role in high quality audio
coding, and several important standards for wideband, superwideband, and fullband speech/audio
coding are based upon filter bank and transform methods [32–35]. Although it is intuitive that
subband filtering and discrete transforms are closely related, by the early 1990's, the relationships
between filter bank methods and transforms were well-understood [28,29]. Today, the distinction
between transforms and filter bank methods is somewhat blurred, and the choice between a filter
bank implementation and a transform method may simply be a design choice. Often a combination
of the two is the most efficient [32].
The basic very successful paradigm for coding full band audio in the past two decades has been
the filter bank/transform based approach with perceptual noise masking using an iterative bit
allocation [32,35]. This technique does not lend itself to real time communications directly because of
the iterative bit allocation method and because of complexity, and to a lesser degree, delay in the
filter bank/transform/noise masking computations. As a result, the primary impact of high quality
audio coding has been to audio players (decoders) such as MP3 and audio streaming applications,
although the basic structure for high quality audio coding has been expanded in recent years to
conversational applications with lower delay [34].
A high level block diagram of an audio codec is shown in Figure 9. In this diagram, two paths
are shown for the sampled input audio signal: one path is through the filter bank/transform that
performs the analysis/decomposition into spectral components to be coded, and the other path is
into the psychoacoustic analysis that computes the noise masking thresholds. The noise masking
thresholds are then used in the bit allocation that forms the basis for the quantization and coding in
the analysis/decomposition path. All side information and parameters required for decoding are then
losslessly coded for storage or transmission.

Figure 9. Generic audio coding approach.
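To make the iterative bit allocation in Figure 9 concrete, here is a minimal sketch, under the common
rule-of-thumb assumption that each additional bit lowers a band's quantization noise floor by about
6 dB: bits are handed one at a time to the band whose noise currently exceeds its masking threshold
by the largest margin. The band energies, thresholds, and bit budget are invented for illustration.

    import numpy as np

    def allocate_bits(band_energy_db, mask_threshold_db, total_bits):
        """Greedy perceptual bit allocation across spectral bands.

        Each bit given to a band is assumed to buy about 6 dB of
        quantization noise reduction (a standard rule of thumb).
        """
        bits = np.zeros(len(band_energy_db), dtype=int)
        # Noise-to-mask ratio: quantization noise (initially the full band
        # energy, i.e., zero bits) relative to the masking threshold.
        nmr = np.asarray(band_energy_db, float) - np.asarray(mask_threshold_db, float)
        for _ in range(total_bits):
            worst = int(np.argmax(nmr))
            if nmr[worst] <= 0.0:          # all quantization noise is masked
                break
            bits[worst] += 1
            nmr[worst] -= 6.0              # one more bit ~ 6 dB less noise
        return bits

    # Example: 8 bands and a 32-bit budget per frame (illustrative numbers)
    print(allocate_bits([60, 55, 48, 40, 30, 22, 15, 10],
                        [30, 32, 30, 28, 25, 20, 14, 12], 32))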

The primary differences among the different coding schemes that have been standardized and/or
found wide application are in the implementations of the time/frequency analysis/decomposition in
terms of the types of filter banks/transforms used and their resolution in the frequency domain.
Note that the frequency resolution of the psychoacoustic analysis is typically finer than that of the
analysis/decomposition path, since the perceptual noise masking is so critical for good quality.
There are substantive differences in the other blocks as well, with many refinements over the years.
The strengths of the basic audio coding approach are that it is not model based, as in speech
coding using linear prediction, and that the perceptual weighting is applied on a per-component basis,
whereas in speech coding, the perceptual weighting relies on a spectral envelope shaping. A weakness
in the current approaches to audio coding is that the noise masking theory that is the foundation of
many of the techniques is three decades old; further, the masking threshold for the entire frame is computed
by adding the masking thresholds for each component. The psychoacoustic/audio theory behind this
technique of adding masking thresholds has not been firmly established.
Other key ideas in the evolution of the full band audio coding methods have been pre- and
post-masking and window switching to capture transients and steady state sounds. Details of the
audio coding methods are left to the very comprehensive references cited [4,5,32–34].

8. Speech Coding Standards


Although the ITU-T had set standards for wireline speech coding since the 1970's, it was only with
the rise of the worldwide digital cellular industry that standards activities began to gain momentum, and by
the 1990’s, speech coding standardization activities were expanding seemingly exponentially. We leave
the historical development of speech coding standards and the details of many of the standards to the
references [1–3,5,11]. Here, however, we present some key technical developments of standards that
have greatly influenced the dominant designs of today’s leading speech coding standards.
By the early 1990’s, the analysis-by-synthesis approach to speech coding was firmly established
as the foundation of the most promising speech codecs for narrowband speech. The research and
development efforts focused on designing good excitation codebooks while maintaining manageable
search complexity and simultaneously improving reconstructed speech quality and intelligibility.
What might be considered two extremes of codebook design were the Gaussian random
codebooks, made up of Gaussian random sequences, and the multipulse excitation type of codebook,
which consisted of a limited number of impulses (say 8) placed throughout a speech frame, each
with possibly different polarity and amplitude [36]. In the former case, encoding complexity was
high since there needed to be a sufficient number of sequences to obtain a suitably rich excitation
set, while in the latter, encoding was complex due to the need to optimally place the impulses and
determine their appropriate amplitude. The breakthrough idea came through the work of Adoul and
his colleagues who showed that a relatively sparse set of positive and negative impulses, all of the
same amplitude (!), would suffice as a codebook to produce good quality speech, while at the same
time, managing complexity due to the sparseness of the impulses and the need to determine only one
amplitude [22,23]. These ideas were motivated by codes from channel coding, and while it should
be noted that others had proposed and investigated using excitations motivated by channel coding
structures [37,38], Adoul and his colleagues provided the demonstration that the needed performance
could be achieved.
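A toy version of such a search, with the track layout, subframe length, and impulse response invented
for illustration: one ±1 pulse is placed on each interleaved position track, every candidate is filtered
through the weighted synthesis filter's impulse response, and the candidate (with its optimal gain)
giving the smallest error against the target is kept. Deployed ACELP searches exploit correlation
identities and pruning rather than this brute-force enumeration.

    import numpy as np
    from itertools import product

    def algebraic_search(target, h, tracks):
        """Brute-force search of a sparse +/-1 pulse codebook.

        target -- weighted target signal for the subframe
        h      -- impulse response of the weighted synthesis filter
        tracks -- list of position lists; one pulse is placed per track
        """
        n = len(target)
        best, best_err = None, np.inf
        for positions in product(*tracks):
            for signs in product((+1.0, -1.0), repeat=len(tracks)):
                code = np.zeros(n)
                code[list(positions)] += signs         # sparse excitation
                synth = np.convolve(code, h)[:n]       # filtered excitation
                # Optimal single gain for this candidate, then residual error
                g = synth @ target / (synth @ synth + 1e-12)
                err = np.sum((target - g * synth) ** 2)
                if err < best_err:
                    best, best_err = (positions, signs, g), err
        return best

    # Toy example: 16-sample subframe, two interleaved tracks (illustrative)
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16)
    h = 0.8 ** np.arange(16)                  # decaying impulse response
    print(algebraic_search(target, h, [[0, 4, 8, 12], [2, 6, 10, 14]]))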
This sparse excitation codebook, called an algebraic codebook, served as the basis for the
G.729 analysis-by-synthesis speech coding standard set by the ITU-T for speech coding at 8 kbits/s.
The speech coding method in G.729 was designated as Algebraic Code Excited Linear Prediction
(ACELP) and served to define a new class of speech codecs. We leave further development of the G.729
standard to the references [3,23], but we turn our attention now to ACELP codecs in general and the
most influential and widely deployed speech codec in the 2000’s to date.
The Adaptive Multirate (AMR) codec uses the ACELP method but improves on the G.729 standard
in several ways, including using a split vector quantization approach to quantize and code the linear
prediction parameters on a frame/subframe basis. The AMR narrowband (AMR-NB) codec was
standardized and widely deployed and operated at the bit rates of 4.75, 5.15, 5.9, 6.7, 7.4, 7.95, 10.2,
and 12.2 kbits/s [3]. The bit rate can be changed at any frame boundary and the “adaptive” in AMR
refers to the possible switching between rates at frame boundaries in response to instructions from the
base station/mobile switching center (MSC) or eNodeB in LTE (long term evolution), which is referred
to as “network controlled” switching. The AMR-NB codec standardization was then followed by the
AMR-WB (wideband) speech codec, which operates at bit rates of 6.6, 8.85, 12.65, 14.25, 15.85, 18.25,
19.85, 23.05, and 23.85 kbits/s [27]. These codecs have been implemented in 3rd generation digital
cellular systems throughout the world and have served as the default codecs for VoLTE (voice over
long term evolution) in 4th generation digital cellular, designated as LTE, while a new codec standard
was developed. The AMR-WB codec is the basis for claims of HD (High Definition) Voice for digital
cellular in industry press releases and the popular press, where HD simply refers to wideband speech
occupying the band from 50 Hz to 7 kHz.
After the development of the AMR codecs, another speech codec, called the Variable Multirate
(VMR) codec, was standardized [39]. This codec allowed rate switching at the frame boundaries not
only as a result of network control but also due to the analysis of the input speech source. This type of
rate switching is called “source controlled” switching. Although the VMR codec was standardized, it
was not widely deployed, if at all.
The newest speech codec to be standardized is the Enhanced Voice Services (EVS) codec designed
specifically for 4th generation VoLTE but is expected to be deployed in many applications because of
its performance and its wide-ranging set of operable modes [40]. The new EVS codec uses the ACELP
codec structure and builds on components of the AMR-WB codec. The EVS codec achieves enhanced
voice quality and coding efficiency for narrowband and wideband speech, provides new coding modes
for superwideband speech, improves quality for speech, music, and mixed content, has a backward
compatible mode with AMR-WB with additional post-processing, and allows fullband coding at a bit
rate as low as 16.4 kbit/s. The EVS codec has extensive new pre-processing and post-processing
capabilities. It builds on the VMR-WB codec and the ITU-T G.718 codec by using technologies from
those codecs for classification of speech signals. Further, the EVS codec has source controlled variable
bit rate options based on the standardized EVRC-NW (enhanced variable rate codec, narrowband-wideband)
codec. There are also improvements in coding of mixed content, voice activity detection, comfort noise
generation, low delay coding, and switching between linear prediction and MDCT (modulated discrete
cosine transform) coding modes. Further details on the EVS codec can be found in the extensive set of
papers cited in [40].
The impact of rate distortion theory and information theory in general can be seen in the designs
of the excitation codebooks over the years, starting with the tree/trellis coding work of Anderson [24],
Becker and Viterbi [37], and Stewart, Gray, and Linde [38], among others, through the random
Gaussian sequences employed by Atal and Schroeder [21] and then continuing with the algebraic
codes pioneered by Adoul, et al. Additionally, this influence appears in the use of vector quantization
for some codebook designs over the years and also for quantization of other parameters, such as the
linear prediction coefficients in AMR codecs [27]. More recently, rate distortion theoretic bounds on
speech codec performance have been developed as described in the following section.

9. Fundamental Limits on Performance


Given the impressive performance of the EVS codec and the observably steady increase in speech
codec performance over the last 3 decades, as evidenced by the standardized speech codec performance
improvements since the mid-1980’s, it would be natural to ask, “what is the best performance
theoretically attainable by any current or future speech codec design?” Apparently, this question
is not asked very often. Flanagan [41] used Shannon’s expression for channel capacity to estimate the
bit rate for narrowband speech to be about 30,000 bit/s, and further, based on experiments, concluded
that the rate at which a human can process information is about 50 bits/s. Later, in his 2010 paper [42],
Flanagan reported experiments that estimated a rate of 1000 to 2000 bits/s preserved “quality and
personal characteristics”. Johnston [43] performed experiments that estimated the perceptual entropy
required for transparent coding of narrowband speech to be about 10 kbit/s on the average up to
a maximum of about 16 kbits/s. See also [44]. Given the wide range of these bit rates and since these
are all estimates of the bit rate needed for a representative bandwidth or averaged over a collection of
speech utterances, they do not provide an indication of the minimum bit rate needed to code a specific
given utterance subject to a perceptually meaningful distortion measure.
In standardization processes, the impetus for starting a new work item for a new speech codec
design comes not only from a known, needed application, but also from experimental results indicating
that improvement in operational rate distortion performance is possible across the range of desirable
rates and acceptable distortions. However, the question always remains as to what is the lowest bit rate
achievable while maintaining the desired quality and intelligibility with any, perhaps yet unexplored,
speech coding structure.
Other than the broad range of estimated rates cited earlier, there have been only a few attempts to
determine such performance limits in the past [45]. There are two challenges in determining any rate
distortion bound: specifying the source model and defining an analytically tractable, yet meaningful,
distortion measure. For real sources and human listeners, both of these components are extraordinarily
difficult, a fact that has been recognized since the 1960’s. Recently however, the author and his students
have produced some seemingly practical rate distortion performance bounds as developed in some
detail in a research monograph [45].
In order to develop such bounds, it is necessary to identify a good source model and to utilize
a distortion measure that is relevant to the perceptual performance of real speech coders. The approach
used in [45] is to devise speech models based on composite sources, that is, source models that switch
between different modes or subsources, such as voiced, unvoiced, onset, hangover, and silence speech
modes. Then, conditional rate distortion theory for the mean squared error (MSE) distortion measure
is used to obtain rate distortion curves subject to this error criterion. Finally, a mapping function is
obtained that allows the rate versus MSE curves to be mapped into rate versus PESQ-MOS bounds. Since
the PESQ-MOS performance of real speech codecs can be determined from [7], direct comparisons are
possible. These steps are performed for each speech utterance consisting of one or two short sentences
of total length of a few seconds, such as those that are used in evaluating voice codec performance
using [7]. A complete list of references and details of the approach are left to [45].
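For intuition, a minimal numeric sketch, under the simplifying assumptions that each mode of the
composite source is a memoryless Gaussian with its own variance and that the mode sequence is known
at both encoder and decoder: the conditional rate distortion function for MSE is then given by reverse
water-filling the distortion across the modes. The mode probabilities and variances below are
invented; the bounds in [45] rest on much richer composite models plus the MSE-to-PESQ-MOS mapping.

    import numpy as np

    def conditional_rd(mode_prob, mode_var, D):
        """Rate (bits/sample) at MSE distortion D for a composite memoryless
        Gaussian source whose mode is known at encoder and decoder.

        Reverse water-filling: per-mode distortion D_i = min(theta, var_i),
        with the water level theta chosen so sum p_i * D_i = D.
        """
        p, v = np.asarray(mode_prob, float), np.asarray(mode_var, float)
        if D >= p @ v:                        # budget covers the whole source
            return 0.0
        lo, hi = 0.0, v.max()
        for _ in range(100):                  # bisection on the water level
            theta = 0.5 * (lo + hi)
            if p @ np.minimum(theta, v) < D:
                lo = theta
            else:
                hi = theta
        return float(p @ np.maximum(0.0, 0.5 * np.log2(v / theta)))

    # Illustrative modes: voiced, unvoiced, silence (probabilities, variances)
    print(conditional_rd([0.5, 0.3, 0.2], [1.0, 0.25, 0.01], 0.05))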
While these bounds have not been compared to all standardized codecs, the bounds are shown to
lower bound the performance of many existing speech codecs, including the AMR-NB and AMR-WB
codecs, and additionally, these bounds indicate that speech codec performance can be improved by
as much as 0.5 bit/sample or 50%! Further, by examining how the different codecs perform for the
different source sequences, it is possible to draw conclusions as to what types of speech sources are the
most difficult for current codec designs to code [45], thus pointing toward new research directions to
improve current codecs.
Therefore, practically significant rate distortion bounds that express the best performance
theoretically attainable for the given source model and distortion measure address at least these
two questions: (1) Is there performance yet to be achieved over that of existing codecs; and (2) what
types of speech codecs might be worthy of further research? Furthermore, answers to these questions
can be provided without implementing new speech codecs; seemingly a very significant savings in
research and development effort.
It is critical to emphasize what has already been said: The rate distortion bounds obtained thus
far are based on certain specified source models, composite source models in this case, and on using
a particular method to create a distortion measure expressible in terms of MOS, mean opinion score,
which can be interpreted in terms of subjective listening tests. Therefore, it is clear that the current
bounds can be refined further by developing better source (speech) models and by identifying a more
precise, perceptually relevant distortion measure. As a result, it would appear that future research to
extend rate distortion performance bounds for speech is highly desirable.

10. Current Challenges


An on-going challenge for conversational voice communications is latency. It is well known that
a round trip delay nearing 0.5 s in a conversation causes the speakers to “step on” each other’s speech;
that is, a speaker will inherently begin to speak again if a response is not heard in around 0.5 s [46].
Since this fact is well known, the latency in the speech encoding and the estimated switching and
network delays are designed to be much less than this amount. However, the responses of the base
station/mobile switching center (MSC) or eNodeB in the cellular networks can add significant latency
in call handling and switching that is unmodeled in the engineering estimates, resulting in excessive
latency, particularly across providers.
Another challenge to conversational call quality is transcoding at each end of a cellular call.
Generally, each cell phone encodes the speaker’s voice using a particular voice codec. The codec at
one end of the call need not be the codec at the other end of the call, and in reality, which codec is
being used by the phone at the other end is usually unknown. As a result, the coded speech produced
by the speaker’s cell phone is decoded at the network interface, re-encoded in terms of log PCM and
transmitted to the other end, where the log PCM coded speech is decoded and re-encoded using
a codec that can be decoded by the far end cell phone. These transcoding operations degrade the
voice quality of the call, and in fact, add latency. The requirement to transcode is well-known to
engineers and providers, but is unavoidable, except in the special circumstance that the call remains
entirely within a network where transcoder-free operation is available. While the goal is to move toward
transcoder-free operation, this capability is neither widely deployed nor available across networks [47].
The necessity to transcode can also limit the ability to communicate using wideband speech codecs [48].
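The log PCM leg of this chain can be illustrated with a simplified μ-law companding model (continuous
compression with an 8-bit uniform quantizer, μ = 255), rather than the exact G.711 segmented code; the
sine test signal is an arbitrary stand-in for decoded speech. Even this single stage adds noise before
the far-end codec ever runs.

    import numpy as np

    MU = 255.0

    def mulaw_encode(x):
        """Compress x in [-1, 1] and quantize to 8 bits (simplified mu-law)."""
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return np.round((y + 1.0) * 127.5).astype(np.uint8)

    def mulaw_decode(code):
        """Expand the 8-bit code back to the linear domain."""
        y = code.astype(float) / 127.5 - 1.0
        return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

    # Round trip through the log PCM stage alone
    x = 0.5 * np.sin(2 * np.pi * 440 * np.arange(160) / 8000)
    x_hat = mulaw_decode(mulaw_encode(x))
    print("log PCM stage SNR (dB):",
          10 * np.log10(np.mean(x**2) / np.mean((x - x_hat)**2)))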
Background noises and background speakers are also a great challenge to speech codecs. While the
pre-processing stages have gotten more sophisticated in classifying the types of inputs and identifying
the presence of background impairments [40], this is still a challenging issue. Any background
sound that is not correctly identified as background noise can significantly degrade the speech
coding operation since the CELP codecs are designed primarily to code speech. Additionally, input
background noise and the presence of other speakers can cause the source controlled variable rate
codecs to operate at a higher rate than expected and can be a difficult challenge for VAD and
CNG algorithms.
Another network factor that lowers reconstructed voice quality is the behavior of the BS/MSC or
eNodeB in cellular networks. These switching centers are all-powerful in that they allocate specific
bit rates to speech codecs on the cellular networks. These BS/MSC or eNodeB installations take into
account a wide variety of information when allocating bit rates to a user, including the quality of
the connection with the handset, the loading of the current cell site, the loading of adjacent cell sites,
expected traffic conditions at the particular time of day, and many other data sources. The way all of
this information is used to allocate bit rate is not standardized and can vary widely across cell sites
and networks. One broad statement can be made however: The BS/MSC or eNodeB is conservative in
allocating bit rate to voice calls, often resulting in lower than expected coded speech quality.
For example, a cell phone may measure and report that the channel connecting the cell phone to
the control/switching center is a good one and request a 12.2 kbits/s bit rate for the speech codec (this
is one of the rates available for AMR-NB). Often, unfortunately, the control/switching center will reply
and instruct the cell phone to use the 5.9 or 6.7 kbits/s rate, both of which have quality lower than
achievable at 12.2 kbits/s. Thus, call quality is degraded, particularly when transcoding is necessary,
since a lower codec rate results in poorer transcoding quality. Now, to be fair, the service provider
could reply that using the lower rate guarantees that the user's call will not be dropped or reserves bit
rate for the user to stream video, so such are the tradeoffs.

11. First Responder Voice Communications


Emergency first responder voice communications in the U.S. and Europe rely on entirely
different communications systems than the telephone network or digital cellular systems used by
the public. These emergency first responder systems have much lower transmitted data rates and
as a result, the voice codecs must operate at much lower bit rates. Additionally, the voice codecs
must operate in much more hostile environments, such as those experienced by firefighters for
example, wherein the background noise consists of chain saws, sirens, and alarms, among other noise
types. Furthermore, first responders depend critically on voice communications in these dynamic,
unpredictable environments.
In the U.S., the Emergency First Responder Systems are called Land Mobile Radio (LMR), which
started out as a purely analog system but in recent years has evolved toward digital transmission via the
designation Project 25 (P25) Radio Systems [49]. The standard used for first responder communications
in Europe and the United Kingdom is TETRA, originally Trans European Trunked Radio but now
Terrestrial Trunked Radio, and TETRA includes a comprehensive set of standards for the network and
the air interface. TETRA was created as a standard for a range of applications in addition to public
safety [49].
For P25 in the U.S., the speech codecs used are the IMBE, AMBE, and AMBE +2 codecs, all of which
are based upon the Multiband Excitation (MBE) coding method [49,50]. In P25 Phase I, the Improved
MBE, or IMBE, codec at 4.4 kbits/s is used for speech coding and then an additional 2.8 kbits/s is
added for error control (channel) coding. This 7.2 kbits/s total then has other synchronization and
low-speed data bits incorporated to obtain the final 9.6 kbits/s presented to the modulator. For P25
Phase II, the total rate available for speech and channel coding is half of 7.2 kbits/s or 3.6 kbits/s,
which is split as 2.45 kbits/s for voice and 1.15 kbits/s for channel coding [49,50].
These bit rates, namely, 4 kbits/s and below, are in the range of what is called low bit rate speech
coding [51]. Speech coding at these rates has not been able to achieve quality and intelligibility
sufficient for widespread adoption. In fact, there have been standards activities directed toward
establishing an ITU-T standard at 4 kbits/s for over a decade, and while some very innovative codecs
have been developed, none have yet achieved toll quality across the desired range of conditions.
The public safety first responder requirements include a much harsher operational environment in
terms of background noises as well as a desire for quality equivalent to analog narrowband speech
communications, which is similar to toll quality.
We do not provide block diagrams of the IMBE-based codecs here, but we describe the basic IMBE
codec in the following. The IMBE vocoder models each segment of speech as a frequency-dependent
combination of voiced (more periodic) and unvoiced (more noise-like) speech. The encoder computes
a discrete Fourier transform (DFT) for each segment of speech and then analyzes the frequency content
to extract the model parameters for that segment, which consists of the speaker pitch or fundamental
frequency, a set of Voiced/Unvoiced (V/UV) decisions, which are used to generate the mixture of
voiced and unvoiced excitation energy, and a set of spectral magnitudes, to represent the frequency
response of the vocal tract. These model parameters are then quantized into 88 bits, and the resulting
voice bits are then output as part of the 4.4 kbits/s of voice information produced by the IMBE
encoder [5,49,50].
At the IMBE decoder the model parameters for each segment are decoded and these parameters
are used to synthesize both a voiced signal and an unvoiced signal. The voiced signal represents the
periodic portions of the speech and is synthesized using a bank of harmonic oscillators. The unvoiced
signal represents the noise-like portions of the speech and is produced by filtering white noise.
The decoder then combines these two signals and passes the result through a digital-to-analog converter
to produce the analog speech output.
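A bare-bones sketch of this synthesis model, with the pitch, per-harmonic V/UV flags, and magnitudes
invented for illustration (a real IMBE decoder decodes these from the 88-bit frame and also blends
frames smoothly): voiced harmonics are rendered by oscillators, and unvoiced harmonics by spectrally
shaped white noise.

    import numpy as np

    def mbe_synthesize(f0, v_uv, mags, fs=8000, n=160, rng=None):
        """Toy MBE-style frame synthesis: harmonic oscillators for the voiced
        harmonics plus spectrally shaped noise for the unvoiced ones."""
        if rng is None:
            rng = np.random.default_rng()
        t = np.arange(n) / fs
        voiced_part = np.zeros(n)
        noise_shape = np.zeros(n // 2 + 1)   # desired noise magnitude per FFT bin
        bin_hz = fs / n
        for k, (voiced, mag) in enumerate(zip(v_uv, mags), start=1):
            if voiced:
                voiced_part += mag * np.cos(2 * np.pi * k * f0 * t)
            else:
                center = int(round(k * f0 / bin_hz))
                half = int(f0 / (2 * bin_hz))
                lo = max(0, center - half)
                hi = min(len(noise_shape), center + half + 1)
                noise_shape[lo:hi] = mag     # one noise band per unvoiced harmonic
        # Shape white noise in the frequency domain, then add the two signals
        noise = np.fft.irfft(np.fft.rfft(rng.standard_normal(n)) * noise_shape, n)
        return voiced_part + noise

    frame = mbe_synthesize(120.0, [True, True, False, False], [1.0, 0.6, 0.4, 0.2])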
For TETRA, the voice codec is based on code excited linear prediction (CELP) and the speech is
coded at 4.567 kbits/s, or alternatively, if the speech is coded in the network or in a mobile handset,
the AMR codec at 4.75 kbits/s is used [3,49]. Block diagrams of the TETRA encoder and decoder are
essentially the same as the CELP codecs already discussed. The TETRA codecs based on the CELP
structure are clearly a very different coding method than IMBE.
The algorithmic delay of the TETRA voice codec is 30 ms plus an additional 5 ms look ahead.
Such a delay is not prohibitive, but a more thorough calculation in the standard estimates an end-to-end
delay of 207.2 ms, which is at the edge of what may be acceptable for high quality voice communications.
A round trip delay near 500 ms is known to cause talkers to talk over the user at the other end, thus
causing difficulty in communications, especially in emergency environments [49].
Codec performance in a noisy environment is much more of a challenge than for clean speech; in these
hostile environments, the speech codecs must pass noisy input speech (PASS (Personal Alert Safety
System) alarms, chainsaws, etc.) and speech from inside a mask (the Self-Contained Breathing
Apparatus (SCBA) that is essential in firefighting) [49,52]. Recent extensive test results by the Public
Safety Research Program in the U.S. have shown that the IMBE codecs at these low rates perform poorly
compared to the original analog FM voice systems and that the AMR codecs at a rate of 5.9 kbit/s,
which is higher than the 4.75 kbits/s used in TETRA, perform poorly as well [52]. Emergency first
responder voice communications is clearly an area in need of intensive future research.

12. Future Research Directions


The EVS speech codec is a tremendous step forward for both speech coding and for a codec that
is able to combine speech and audio coding to obtain outstanding performance. Among the many
advances in this codec are the preprocessing and postprocessing modules. Because of the need to fine
tune the coding schemes to the codec input, further advances in preprocessing are needed in order to
identify background disturbances and to separate those disturbances from the desired signals such as
speech and audio. There also appears to be substantial interest in capturing and coding stereo audio
channels for many applications, even handheld devices.
The EVS codec has taken the code-excited linear prediction and transform/filter bank methods
with noise masking paradigms to new levels of performance in a combined codec. The question is how
much further can these ideas be extended? Within these coding structures, some possible research
directions are to incorporate increased adaptivity into the codec designs. Since it is well known that
the perceptual weighting according to the input signal envelope does not always succeed in keeping
the error spectrum below the speech spectrum, adapting the parameters of the perceptual weighting
filters in CELP is one possible research direction. Another research direction is to incorporate adaptive
filter bank/transform structures such as adaptive band combining and adaptive band splitting into
combined speech/audio codecs. Of course, a more difficult, but perhaps much more rewarding
research direction would be to identify entirely new methods for incorporating perceptual constraints
into codec structures.
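For reference, the weighting filter that this research direction would adapt is typically
W(z) = A(z/γ1)/A(z/γ2) with fixed 0 < γ2 < γ1 ≤ 1, where A(z) is the LPC inverse filter; making γ1 and
γ2 frame-adaptive is the kind of change contemplated. A minimal sketch follows, with the default γ
values chosen only for illustration.

    import numpy as np
    from scipy.signal import lfilter

    def perceptual_weighting(x, a, gamma1=0.94, gamma2=0.6):
        """Apply the CELP weighting filter W(z) = A(z/gamma1)/A(z/gamma2).

        a -- LPC polynomial [1, a1, ..., ap]; the gammas control how closely
        the coding error spectrum is allowed to follow the formant peaks.
        """
        a = np.asarray(a, float)
        k = np.arange(len(a))
        num = a * gamma1 ** k    # coefficients of A(z/gamma1)
        den = a * gamma2 ** k    # coefficients of A(z/gamma2)
        return lfilter(num, den, x)

    # Example (illustrative LPC coefficients):
    # weighted = perceptual_weighting(x, [1.0, -1.2, 0.8], gamma1=0.92, gamma2=0.65)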

13. Summary and Conclusions


After reading this paper, one thing should be crystal clear: there has been extraordinary innovation
in speech compression in the last 25 years. If this conclusion is not evident from this paper alone, the
reader is encouraged to review References [1–3,11,17,18,35,44]. A second conclusion is that standards
activities have been the primary drivers of speech coding research during this time period [3,11,35].
Third, the ACELP speech coding structure and the transform/filter bank audio coding structure have
been refined to extraordinary limits by recent standards, and one wonders how much further these
paradigms can be extended to produce further compression gains. However, given the creativity
and technical expertise of the engineers and researchers involved in standards activities, as well as
the continued expansion of the boundaries on implementation complexity, additional performance
improvements and new capabilities are likely to appear in the future.
Rate distortion theory and information theory have motivated the analysis-by-synthesis approach,
including excitation codebook design, and some speech codecs employ vector quantization to transmit
linear prediction coefficients, among other parameters. It is not obvious at present what next
improvement might come out of this theory, unless, for example, speech codecs start to exploit
lossless coding techniques further.
Recent results on rate distortion bounds for speech coding performance may offer some efficiencies
in the codec design process by indicating how much performance gain is still possible, irrespective of
complexity, and may also point the way toward specific techniques to obtain those gains. More work
is needed here both to extend the existing bounds and to demonstrate to researchers that such rate
distortion bounds are a vital tool in arriving at new speech codecs.

Acknowledgments: This research was supported in part by the U. S. National Science Foundation under Grant
Nos. CCF-0728646 and CCF-0917230.
Conflicts of Interest: The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
VoIP Voice over Internet Protocol
CELP Code-Excited Linear Prediction
MOS Mean Opinion Score
ACR Absolute Category Rating
PESQ Perceptual Evaluation of Speech Quality
POLQA Perceptual Objective Listening Quality Assessment
AR Autoregressive
DPCM Differential Pulse Code Modulation
LMS Least Mean Square
RLS Recursive Least Squares
ITU-T International Telecommunication Union—Telecommunication Standardization Sector
FFT Fast Fourier Transform
LPC Linear Predictive Coder (or Coding)
AbS Analysis-by-Synthesis
ISPP Interleaved Single Pulse Permutation
VAD Voice Activity Detection
CNG Comfort Noise Generation
QMF Quadrature Mirror Filter
ACELP Algebraic Code-Excited Linear Prediction
AMR Adaptive Multirate
VoLTE Voice Over Long Term Evolution
NB Narrowband
MSC Mobile Switching Center
WB Wideband
VMR Variable Multirate
EVS Enhanced Voice Services
EVRC-NW Enhanced Variable Rate Codec—Narrowband-Wideband
MDCT Modulated Discrete Cosine Transform
BS Base Station
LMR Land Mobile Radio
TETRA Terrestrial Trunked Radio
MBE Multiband Excitation
IMBE Improved Multiband Excitation
AMBE Advanced Multiband Excitation
DFT Discrete Fourier Transform
V/UV Voiced/Unvoiced
PASS Personal Alert Safety System
SCBA Self-Contained Breathing Apparatus
FM Frequency Modulation

References
1. Gibson, J.D. Speech coding methods, standards, and applications. IEEE Circuits Syst. Mag. 2005, 5, 30–49.
[CrossRef]
2. Gibson, J.D. (Ed.) Speech coding for wireless communications. In Mobile Communications Handbook, 3rd ed.;
CRC Press: Boca Raton, FL, USA, 2012; pp. 539–557.
3. Sinder, J.D.; Varga, I.; Krishnan, V.; Rajendran, V.; Villette, S. Recent speech coding technologies and
standards. In Speech and Audio Processing for Coding, Enhancement and Recognition; Ogunfunmi, T., Togneri, R.,
Narasimha, M., Eds.; Springer: New York, NY, USA, 2014; pp. 75–109.
4. Sayood, K. Introduction to Data Compression, 4th ed.; Morgan-Kaufmann: Waltham, MA, USA, 2012.
5. Gibson, J.D.; Berger, T.; Lookabaugh, T.; Lindbergh, D.; Baker, R.L. Digital Compression for Multimedia:
Principles and Standards; Morgan-Kaufmann: San Francisco, CA, USA, 1998.
6. Cox, R.; de Campos Neto, S.F.; Lamblin, C.; Sherif, M.H. ITU-T coders for wideband, superwideband, and
fullband speech communication. IEEE Commun. Mag. 2009, 47, 106–109. [CrossRef]
7. Recommendation P.862, Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End
Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs; ITU-T: Geneva, Switzerland,
February 2001.
8. Recommendation P.863, Perceptual Objective Listening Quality Assessment; ITU-T: Geneva, Switzerland, 2011.
9. Chan, W.Y.; Falk, T.H. Machine assessment of speech communication quality. In Mobile Communications
Handbook, 3rd ed.; Gibson, J.D., Ed.; CRC Press: Boca Raton, FL, USA, 2012; pp. 587–600.
10. Grancharov, V.; Kleijn, W.B. Speech quality assessment. In Springer Handbook of Speech Processing; Benesty, J.,
Sondhi, M.M., Juang, Y., Eds.; Springer: Berlin, Germany, 2008; pp. 83–99.
11. Chen, J.H.; Thyssen, J. Analysis-by-synthesis coding. In Springer Handbook of Speech Processing; Benesty, J.,
Sondhi, M.M., Juang, Y., Eds.; Springer: Berlin, Germany, 2008; pp. 351–392.
12. Budagavi, M.; Gibson, J.D. Speech coding for mobile radio communications. IEEE Proc. 1998, 86, 1402–1412.
[CrossRef]
13. Atal, B.S.; Hanauer, S.L. Speech analysis and synthesis by linear prediction of the speech wave. J. Acoust.
Soc. Am. 1971, 50, 637–655. [CrossRef] [PubMed]
14. Makhoul, J. Linear prediction: A tutorial review. IEEE Proc. 1975, 63, 561–580. [CrossRef]
15. Markel, J.D.; Gray, A.H., Jr. Linear Prediction of Speech; Springer: New York, NY, USA, 1976.
16. Shetty, N. Tandeming in Multihop Voice Communications. Ph.D. Thesis, ECE Department, University of
California, Santa Barbara, CA, USA, December 2007.
17. Chu, W.C. Speech Coding Algorithms: Foundation and Evolution of Standardized Coders; John Wiley & Sons:
Hoboken, NJ, USA, 2003.
18. Kondoz, A.M. Digital Speech: Coding for Low Bit Rate Communications Systems; John Wiley & Sons: Chichester,
UK, 2004.
19. Tremain, T.E. The government standard linear predictive coding algorithm: LPC-10. Speech Technol. 1982, 1,
40–49.
20. Frantz, G.A.; Wiggins, R.H. Design case history: Speak & Spell learns to talk. IEEE Spectr. 1982, 19, 45–49.
21. Atal, B.S.; Schroeder, M.R. Stochastic coding of speech at very low bit rates. In Proceedings of the International
Conference on Communications, Amsterdam, The Netherlands, May 1984; pp. 1610–1613.
22. Adoul, J.P.; Mabilleau, P.; Delprat, M.; Morissette, S. Fast CELP coding based on algebraic codes.
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX,
USA, 6–9 April 1987; pp. 1957–1960.
23. Salami, R.; Laflamme, C.; Adoul, J.P.; Kataoka, A. Design and description of CS-ACELP: A toll quality 8 kb/s
speech coder. IEEE Trans. Speech Audio Process. 1998, 6, 116–130. [CrossRef]
24. Anderson, J.B.; Bodie, J.B. Tree Encoding of Speech. IEEE Trans. Inform. Theory 1975, 21, 379–387. [CrossRef]
25. Atal, B.S.; Schroeder, M.R. Predictive coding of speech signals and subjective error criteria. IEEE Trans.
Acoust. Speech Signal Process. 1979, 7, 247–254. [CrossRef]
26. Chen, J.H.; Gersho, A. Adaptive postfiltering for quality enhancement of coded speech. IEEE Trans. Speech
Audio Process. 1995, 3, 59–71. [CrossRef]
27. Bessette, B.; Salami, R.; Lefebvre, R.; Jelinek, M. The adaptive multirate wideband speech codec (AMR-WB).
IEEE Trans. Speech Audio Process. 2002, 10, 620–636. [CrossRef]
28. Malvar, H.S. Signal Processing with Lapped Transforms; Artech House: Norwood, MA, USA, 1992.
29. Vaidyanathan, P.P. Multirate Systems and Filter Banks; Prentice-Hall: Englewood Cliffs, NJ, USA, 1993.
30. Zelinski, R.; Noll, P. Adaptive transform coding of speech signals. IEEE Trans. Acoust. Speech Signal Process.
1977, 25, 299–309. [CrossRef]
31. Advanced Audio Distribution Specification Profile (A2DP) Version 1.2. Bluetooth Special Interest Group,
Audio Video WG, April 2007. Available online: http://www.bluetooth.org/ (accessed on 2 June 2016).
32. Bosi, M.; Goldberg, R.E. Introduction to Digital Audio Coding and Standards; Kluwer: Alphen aan den Rijn,
The Netherlands, 2003.
33. Neuendorf, M.; Gournay, P.; Multrus, M.; Lecomte, J.; Bessette, B.; Geiger, R.; Bayer, S.; Fuchs, G.;
Hilpert, J.; Rettelbach, N.; et al. A novel scheme for low bitrate unified speech and audio coding-MPEG
RM0. In Proceedings of the 126th Audio Engineering Society Convention, Convention Paper 7713, Munich, Germany,
7–10 May 2009.
34. Fraunhofer White Paper. The AAC-ELD Family for High Quality Communication Services; Fraunhofer IIS
Technical Paper: Erlangen, Germany, 2013.
35. Herre, J.; Lutzky, M. Perceptual audio coding of speech signals. In Springer Handbook of Speech Processing;
Benesty, J., Sondhi, M.M., Juang, Y., Eds.; Springer: Berlin, Germany, 2008; pp. 393–410.
36. Atal, B.S.; Remde, J.R. A new model of LPC excitation for producing natural sounding speech at low bit rates.
In Proceeding of the International Conference on Acoustics Speech and Signal Processing, Paris, France,
3–5 May 1982; pp. 617–620.
37. Becker, D.W.; Viterbi, A.J. Speech digitization and compression by adaptive predictive coding with delayed
decision. In Proceedings of the National Telecommunications Conference, Conference Record, New Orleans,
LA, USA, 1–3 December 1975; pp. 46-18 through 46-23.
38. Stewart, L.C.; Gray, R.M.; Linde, Y. The Design of Trellis Waveform Coders. IEEE Trans. Commun. 1982, 30,
702–710. [CrossRef]
39. Jelinek, M.; Salami, R. Wideband speech coding advances in VMR-WB standard. IEEE Trans. Audio Speech
Lang. Process. 2007, 15, 1167–1179. [CrossRef]
40. Dietz, M.; Multrus, M.; Eksler, V.; Malenovsky, V.; Norvell, E.; Pobloth, H.; Miao, L.; Wang, Z.; Laaksonen, L.;
Vasilache, A.; et al. Overview of the EVS codec architecture. In Proceeding of the IEEE International
Conference on Acoustics, Speech and Signal Processing, South Brisbane, Australia, 19–24 April 2015;
pp. 5698–5702.
41. Flanagan, J.L. Speech Analysis, Synthesis and Perception, 2nd ed.; Springer: New York, NY, USA, 1972; pp. 3–8.
42. Flanagan, J.L. Parametric representation of speech signals [DSP History]. IEEE Signal Process. Mag. 2010, 27,
141–145. [CrossRef]
43. Johnston, J.D. Estimation of perceptual entropy using noise masking criteria. In Proceedings of the
International Conference on Acoustics, Speech, and Signal Processing, New York, NY, USA, 11–14 April
1988; pp. 2524–2527.
44. Kleijn, W.B.; Paliwal, K.K. An introduction to speech coding. In Speech Coding and Synthesis; Kleijn, W.B.,
Paliwal, K.K., Eds.; Elsevier: Amsterdam, The Netherlands, 1995; pp. 1–47.
45. Gibson, J.D.; Hu, J. Rate distortion bounds for voice and video. Found. Trends Commun. Infor. Theory 2014, 10,
379–514. [CrossRef]
46. Recommendation G.114, One-Way Transmission Time; ITU-T: Geneva, Switzerland, May 2000.
47. Gibson, J.D. The 3-dB transcoding penalty in digital cellular communications. In Proceedings of the
Information Theory and Applications Workshop, University of California, San Diego, La Jolla, CA, USA,
6–11 February 2011.
48. Rodman, J. The Effect of Bandwidth on Speech Intelligibility; Polycom White Paper; Polycom: Pleasanton, CA,
USA, September 2006.
49. Gibson, J.D. (Ed.) Land mobile radio and professional mobile radio: Emergency first responder
communications. In Mobile Communications Handbook, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2012;
pp. 513–526.
50. Hardwick, J.C.; Lim, J.S. The application of the IMBE speech coder to mobile communications. In Proceedings
of the International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada,
14–17 April 1991; pp. 249–252.
51. McCree, A.V. Low-Bit-Rate Speech Coding. In Springer Handbook of Speech Processing; Benesty, J., Sondhi, M.M.,
Juang, Y., Eds.; Springer: Berlin, Germany, 2008; pp. 331–350.
52. Voran, S.D.; Catellier, A.A. Speech Codec Intelligibility Testing in Support of Mission-Critical Voice Applications for
LTE; NTIA Report 15-520; U.S. Department of Commerce: Washington, DC, USA, September 2015.

© 2016 by the author; licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
