GSM Codecs
The GSM standard supports four different but similar compression technologies for
analysing and compressing speech: full-rate, enhanced full-rate (EFR), adaptive
multi-rate (AMR), and half-rate. Although all are lossy (i.e. some data is lost
during compression), these codecs have been optimised to regenerate speech
accurately at the output of a wireless link.
After coding, the bits are re-arranged, convolutionally encoded, interleaved, and
built into bursts for transmission over the air interface. Under extreme error
conditions a frame erasure occurs and the data is lost; otherwise the original data
is re-assembled, potentially with some errors in the less significant bits. The bits
are arranged back into their parametric representation and fed into the decoder,
which uses the data to synthesise the original speech information.
The vocoder model consists of a tone generator (which models the vocal cords) and
a filter that modifies the tone (which models the shape of the mouth and nasal
cavity) (Figure 1). The short-term analysis and filtering determine the filter
coefficients and an error measurement, while the long-term analysis quantifies the
harmonics of the speech.
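As a concrete illustration of the short-term analysis, the sketch below derives the
filter coefficients for one 20 ms frame using the familiar autocorrelation and
Levinson-Durbin approach. It is only an approximation of the real codec: the GSM
full-rate specification uses a fixed-point Schur recursion and transmits the result
as coded log-area ratios, and the function and variable names here are purely
illustrative.

```python
import numpy as np

def short_term_analysis(frame, order=8):
    """Sketch of short-term (LPC) analysis for one 160-sample (20 ms) frame
    using autocorrelation plus the Levinson-Durbin recursion.  The GSM
    full-rate codec itself uses a Schur recursion and log-area ratios."""
    n = len(frame)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                      # guard against an all-zero frame
    for i in range(1, order + 1):
        # Reflection coefficient for this stage
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return a, err                           # A(z) coefficients and residual energy

# The short-term residual (the input to the long-term analysis) is the frame
# filtered through A(z) = 1 + a[1]z^-1 + ... + a[order]z^-order.
```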
The residual signal from the short-term filtering is segmented into four sub-frames
of 40 samples each. The long-term prediction (LTP) filter models the fine harmonics
of the speech using a combination of current and previous sub-frames. The gain
and lag (delay) parameters for the LTP filter are determined by cross-correlating
the current sub-frame with previous residual sub-frames.
The peak of the cross-correlation determines the signal lag, and the gain is
calculated by normalising the peak cross-correlation against the energy of the
matching past segment. The parameters are
applied to the long-term filter, and a prediction of the current short-term residual is
made. The error between the estimate and the real short-term residual signal—the
long-term residual signal—is applied to the RPE analysis, which performs the data
compression.
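The search can be sketched as follows, assuming a 40-sample sub-frame and the
40-to-120-sample lag range of the full-rate codec; the function name, variable
names, and the way the residual history is held are illustrative rather than taken
from the specification.

```python
import numpy as np

def ltp_parameters(subframe, residual_history, lag_min=40, lag_max=120):
    """Sketch of long-term prediction (LTP) analysis: cross-correlate the
    current 40-sample short-term residual sub-frame with past reconstructed
    residual samples, take the lag at the correlation peak, and derive the
    gain by normalising that peak against the energy of the matching segment."""
    n = len(residual_history)
    best_lag, best_corr = lag_min, -np.inf
    for lag in range(lag_min, lag_max + 1):
        past = residual_history[n - lag:n - lag + len(subframe)]
        corr = np.dot(subframe, past)
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    past = residual_history[n - best_lag:n - best_lag + len(subframe)]
    gain = best_corr / (np.dot(past, past) + 1e-12)   # normalised correlation
    ltp_residual = subframe - gain * past             # error signal passed to RPE
    return best_lag, gain, ltp_residual
```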
The Regular Pulse Excitation (RPE) stage reduces the 40 long-term residual
samples down to four candidate sub-sequences of 13 samples each through a
combination of interleaving and sub-sampling. The optimum sub-sequence is
determined as the one having the least error, and is coded using APCM (adaptive
PCM) into 45 bits.
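A minimal sketch of this grid selection is given below. Selecting the sub-sequence
with the highest energy is equivalent to discarding the samples that carry the
least error; the APCM step is shown only as a crude uniform approximation, and the
names are illustrative.

```python
import numpy as np

def rpe_grid_select(ltp_residual):
    """Sketch of RPE grid selection: split the 40-sample long-term residual
    into four candidate sub-sequences of 13 samples (offsets 0..3, every
    third sample) and keep the one with the highest energy."""
    candidates = [ltp_residual[offset::3][:13] for offset in range(4)]
    energies = [np.dot(c, c) for c in candidates]
    grid = int(np.argmax(energies))                 # 2-bit grid position
    best = candidates[grid]
    # APCM: the full-rate codec quantizes the block maximum to 6 bits and each
    # of the 13 samples relative to it to 3 bits; shown here only as a crude
    # uniform approximation.
    xmax = np.max(np.abs(best)) + 1e-12
    levels = np.clip(np.round(best / xmax * 4), -4, 3).astype(int)
    return grid, xmax, levels
```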
The resulting signal is fed back through an RPE decoder and added to the short-
term residual estimate in order to update the long-term analysis filter memory for
the next frame, thereby completing the feedback loop (Table 2).
The EFR codec is an algebraic code excitation linear prediction (ACELP) codec,
which uses a set of similar principles to the RPE-LTP codec, but also has some
significant differences. The EFR codec uses a 10th-order linear-predictive (short-
term) filter and a long-term filter implemented using a combination of adaptive and
fixed codebooks (sets of excitation vectors).
Figure 2: Diagram of the EFR vocoder model
The pre-processing stage for EFR consists of an 80 Hz high-pass filter, and some
downscaling to reduce implementation complexity. Short-term analysis, on the
other hand, occurs twice per frame and consists of autocorrelation with two
different asymmetric windows, each 30 ms long and concentrated around different
sub-frames. The results are converted to short-term filter coefficients, then to line
spectral pairs (for better transmission efficiency), and quantized to 38 bits.
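The sketch below illustrates these two stages under some stated assumptions: a
generic second-order Butterworth filter and an arbitrary scale factor stand in for
the specification's fixed-point pre-processing, and the two Hanning-based
asymmetric windows are stand-ins for the exact window shapes defined for EFR.

```python
import numpy as np
from scipy.signal import butter, lfilter

def efr_preprocess(speech_8khz):
    """Sketch of the EFR pre-processing stage: 80 Hz high-pass filtering plus
    downscaling.  The filter design and scale factor are illustrative."""
    b, a = butter(2, 80.0 / (8000.0 / 2.0), btype='highpass')
    return 0.5 * lfilter(b, a, speech_8khz)

def dual_window_autocorrelation(history_240, order=10):
    """Twice-per-frame short-term analysis: two asymmetric 30 ms (240-sample)
    windows weighted towards different parts of the frame, each followed by
    autocorrelation.  The window shapes are stand-ins, not the EFR curves."""
    n = 240
    # Asymmetric windows: slow rise over most of the buffer, fast decay at the end
    w_late = np.concatenate([np.hanning(2 * 200)[:200], np.hanning(2 * 40)[40:]])
    w_mid = np.concatenate([np.hanning(2 * 160)[:160], np.hanning(2 * 80)[80:]])
    results = []
    for w in (w_mid, w_late):
        x = history_240 * w
        r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
        results.append(r)
    return results   # each set is converted to LPC, then LSPs, and quantized
```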
In the EFR codec, the adaptive codebook contains excitation vectors that model the
long-term speech structure. Open-loop pitch analysis is performed once per half-
frame, giving two estimates of the pitch lag (delay) for each frame.
The open-loop result is used to seed a closed-loop search, reducing the
computational requirements. Candidate pitch lags are applied to a synthesiser and
the results compared against the original (non-synthesised) input, an approach
known as analysis-by-synthesis, to find the lag with the minimum perceptually
weighted error. The results are coded into 34 bits.
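A stripped-down sketch of this closed-loop search is shown below. It assumes the
open-loop lag is at least one sub-frame long, uses the plain synthesis filter in
place of the perceptual weighting filter, and ignores the fractional lags used by
the real codec; the names are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def closed_loop_pitch(target, excitation_history, lpc_a, open_loop_lag, subframe=40):
    """Analysis-by-synthesis pitch search: candidate lags around the open-loop
    estimate are synthesised through 1/A(z) and compared with the target,
    keeping the lag and gain with the smallest squared error.  Assumes
    open_loop_lag >= subframe + 3 (shorter lags would need vector repetition)."""
    n = len(excitation_history)
    best_lag, best_gain, best_err = open_loop_lag, 0.0, np.inf
    for lag in range(open_loop_lag - 3, open_loop_lag + 4):
        start = n - lag
        v = excitation_history[start:start + subframe]     # adaptive codebook vector
        y = lfilter([1.0], lpc_a, v)                       # synthesised contribution
        gain = np.dot(target, y) / (np.dot(y, y) + 1e-12)  # best gain for this lag
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    return best_lag, best_gain    # lag and gain are then quantized for transmission
```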
The residual signal remaining after the quantized adaptive codebook contribution
has been removed is modelled by the algebraic (fixed) codebook, again using an
analysis-by-synthesis approach. The resulting codebook index (pulse positions and
signs) is coded as 35 bits per sub-frame, and the gain as 5 bits per sub-frame.
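The sketch below only illustrates the structure of such a codebook (ten ±1 pulses
on five interleaved position tracks, whose positions and signs make up the 35-bit
index). It places the pulses greedily against the unfiltered target, whereas the
real search is a joint analysis-by-synthesis optimisation; the function name is
illustrative.

```python
import numpy as np

def algebraic_codebook_sketch(residual_target, pulses=10, tracks=5, subframe=40):
    """Greatly simplified fixed-codebook sketch: place +/-1 pulses on
    interleaved position tracks, chosen greedily to match the target."""
    code = np.zeros(subframe)
    for p in range(pulses):
        track = p % tracks                       # two pulses per track
        positions = np.arange(track, subframe, tracks)
        remaining = residual_target - code       # what is still unmodelled
        idx = positions[np.argmax(np.abs(remaining[positions]))]
        code[idx] += np.sign(remaining[idx])     # +/-1 pulse at the best slot
    return code   # the pulse positions and signs form the codebook index
```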
The final stage for the encoder is to update the appropriate memory ready for the
next frame.
Going Adaptive
The principle of the AMR codec is to use a family of closely related codecs, built
from very similar computations, to produce outputs at different rates. In GSM, the
quality of the received air-interface signal is monitored and the speech coding
rate can be modified accordingly. In this way, more protection is applied in areas
of poorer signal by reducing the coding rate and increasing the redundancy, while
in areas of good signal quality the quality of the speech is improved.
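A toy version of this link adaptation is sketched below. The mapping from channel
quality to codec mode and the hysteresis value are invented for illustration only;
real networks use operator-tuned, signalled thresholds.

```python
# Illustrative AMR link adaptation: map a channel-quality measurement (e.g.
# carrier-to-interference ratio) onto one of the eight codec modes.  The
# thresholds below are hypothetical switch points, not standardised values.
AMR_MODES_KBPS = [4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2]
THRESHOLDS_DB = [3.5, 5.0, 6.5, 8.0, 9.5, 11.0, 13.0]

def select_amr_mode(c_over_i_db, current_index, hysteresis_db=1.0):
    """Step down quickly when quality falls below a threshold; step up only
    when quality clears the threshold plus a hysteresis margin."""
    target = 0
    for i, thr in enumerate(THRESHOLDS_DB):
        if c_over_i_db >= thr:
            target = i + 1
    if target > current_index and c_over_i_db < THRESHOLDS_DB[target - 1] + hysteresis_db:
        target = current_index        # not enough margin to step up yet
    return target, AMR_MODES_KBPS[target]
```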
In terms of implementation, an ACELP coder is used. In fact, the 12.2 kbit/s AMR
codec is computationally the same as the EFR codec. For rates lower than 12.2
kbit/s, the short-term analysis is performed only once per frame. For 5.15 kbit/s
and lower, the open-loop pitch lag is estimated only once per frame. The result is
that at lower output bit rates, there are a smaller number of parameters to
transmit, and fewer bits are used to represent them.
The half-rate codec is a vector sum excited linear prediction (VSELP) codec that
uses an analysis-by-synthesis approach similar to the EFR and AMR codecs. The
resulting output is 5.7 kbit/s, which includes 100 bit/s of mode indicator bits
specifying whether the frames are thought to contain voice or not. The mode
indicator allows the codec to operate slightly differently to obtain the best quality.
Half-rate speech coding was first introduced in the mid-1990s, but the public
perception of its speech quality was so poor that it is not generally used today.
However, due to its variable bit-rate output, AMR lends itself nicely to transmission
over a half-rate channel. By limiting the output to the lowest six coding rates
(4.75 to 7.95 kbit/s), the user can still experience the quality benefits of adaptive
speech coding, and the network operator benefits from increased capacity. It is
thought that with the introduction of AMR, use of the half-rate air channel will
become much more widespread.
Computational Complexity
Table 3 shows the time taken to encode and decode a random stream of speech-
like data, and the speed of the operations relative to the GSM full-rate codec.
The process of building the air transmission bursts involves adding redundancy to
the data by convolutional coding. During this process, the most important bits
(Class 1a) are protected most, while the least important bits (Class 2) have no
protection applied.
This frame building process ensures that many errors occurring on the air interface
will be either correctable (using the redundancy), or will have only a small impact
on the speech quality.
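The sketch below shows this unequal error protection for one full-rate frame. The
class sizes (50 class 1a, 132 class 1b and 78 class 2 bits per 260-bit frame)
follow the full-rate channel-coding scheme, but the parity check and the rate-1/2
convolutional encoder are simplified stand-ins rather than the exact GSM
polynomials.

```python
# Sketch of the full-rate channel-coding split for one 20 ms frame of 260
# speech bits: class 1a bits get a parity check plus convolutional coding,
# class 1b bits get convolutional coding only, class 2 bits go unprotected.
CLASS_1A, CLASS_1B, CLASS_2 = 50, 132, 78          # 260 speech bits in total

def parity_bits(bits):
    """Toy 3-bit check over the class 1a bits (stand-in for the GSM CRC)."""
    return [sum(bits[i::3]) % 2 for i in range(3)]

def convolutional_encode(bits):
    """Toy rate-1/2 convolutional encoder (illustrative generators, not the
    polynomials defined for GSM)."""
    state = [0, 0, 0, 0]
    out = []
    for b in bits:
        out.append((b + state[0] + state[3]) % 2)
        out.append((b + state[0] + state[1] + state[3]) % 2)
        state = [b] + state[:3]
    return out

def build_coded_frame(speech_bits):
    """Apply unequal error protection to one frame (a list of 260 ints,
    ordered most important first)."""
    assert len(speech_bits) == CLASS_1A + CLASS_1B + CLASS_2
    c1a = speech_bits[:CLASS_1A]
    c1b = speech_bits[CLASS_1A:CLASS_1A + CLASS_1B]
    c2 = speech_bits[CLASS_1A + CLASS_1B:]
    check = parity_bits(c1a)                        # used to detect frame erasures
    protected = convolutional_encode(c1a + check + c1b + [0, 0, 0, 0])  # + tail bits
    return protected + c2                           # 378 coded + 78 unprotected = 456 bits
```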
Future Outlook
The current focus for speech codecs is to produce a result that has a perceptually
high quality at very low data rates by attempting to mathematically simulate the
mechanics of human voice generation. With the introduction of 2.5G and 3G
systems, it is likely that two different applications of speech coding will be
developed.
The first will be comparatively low bandwidth speech coding, most likely based on
the current generation of CELP codecs. Wideband AMR codecs have already been
standardised for use with 2G and 2.5G technologies and these will utilise the
capacity gains from EDGE deployment.
The second will make more use of the wider bandwidth, employing a range of
different techniques probably based on current psychoacoustic coding, a technique
in widespread use today for MP3 audio compression.
There is no doubt that speech quality over mobile networks will improve, but it may
be some time before wideband codecs are standardised and integrated with fixed
wire-line networks, potentially leading to CD-quality speech communications
worldwide.