


UNIT 5

1.SPEECH COMPRESSION
Speech compression involves the compression of audio data in the form of speech. Speech is a
unique form of audio data, and a number of factors must be considered during compression to
ensure that the reconstructed speech is intelligible and reasonably pleasant to listen to.
The aim of speech compression is to produce a compact representation of speech sounds such
that when reconstructed it is perceived to be close to the original. The two main measures of
closeness are intelligibility and naturalness.
Need for compression:
Raw audio data can take up a great deal of memory. During compression, the data is compressed
so that it will occupy less space. This frees up room in storage, and it also becomes important
when data is being transmitted over a network. On a mobile phone network, for example, if
speech compression is used, more users can be accommodated at a given time because less
bandwidth is needed. Likewise, speech compression becomes important with teleconferencing
and other applications; sending data is expensive, and anything which reduces the volume of data
which needs to be sent can help to cut costs.
TYPES:
1. The μ-law algorithm (often written u-law, ulaw or mu-law) is a companding algorithm,
primarily used in the digital telecommunication systems of North America and Japan.
Companding algorithms reduce the dynamic range of an audio signal. In analog systems this can
increase the signal-to-noise ratio (SNR) achieved during transmission, and in the digital domain
it can reduce the quantization error (hence increasing the signal-to-quantization-noise ratio).
These SNR increases can instead be traded for reduced bandwidth at an equivalent SNR. μ-law
coding does not exploit the (normally large) sample-to-sample correlations found in speech.
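A minimal sketch of μ-law companding and 8-bit quantisation, assuming samples normalised to
the range [-1, 1]; the function names are illustrative rather than taken from any codec library:

```python
import numpy as np

MU = 255.0  # the North American / Japanese standard uses mu = 255

def mu_law_encode(x, mu=MU):
    # Compress the dynamic range: small amplitudes get relatively finer resolution.
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=MU):
    # Expand back to the linear domain.
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# 200 Hz tone at 8 kHz, companded and uniformly quantised to 8 bits (64 kbps telephony).
x = np.sin(2 * np.pi * 200 * np.arange(0, 0.01, 1 / 8000))
q = np.round(mu_law_encode(x) * 127) / 127
x_hat = mu_law_decode(q)
```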
2. ADPCM is the next family of speech coding techniques; it exploits the sample-to-sample
redundancy by using a simple linear filter to predict the next sample of speech. The resulting
prediction error is typically quantised to 4 bits, giving a bit rate of 32 kbps. The advantages of
ADPCM are that it is simple to implement and has very low delay.
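A deliberately simplified ADPCM-style sketch, assuming a first-order predictor and a fixed
quantiser step; real ADPCM coders such as G.726 adapt both the predictor and the step size:

```python
import numpy as np

def simple_adpcm(x, step=0.02):
    """Toy ADPCM: predict each sample from the previous reconstruction, quantise the
    prediction error to 4 bits (codes -8..7). At 8 kHz this corresponds to 32 kbps."""
    pred = 0.0
    codes, recon = [], []
    for s in x:
        err = s - pred                                   # prediction error
        code = int(np.clip(round(err / step), -8, 7))    # 4-bit code
        pred += code * step                              # decoder-side reconstruction, reused as the next prediction
        codes.append(code)
        recon.append(pred)
    return np.array(codes), np.array(recon)

x = np.sin(2 * np.pi * 300 * np.arange(0, 0.02, 1 / 8000))
codes, x_hat = simple_adpcm(x)
```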
3. LPC: To obtain more compression, specific properties of the speech signal must be modelled.
The main assumption is that a source (voicing or fricative excitation) is passed through a filter (the
vocal tract response) to produce the speech. The simplest implementation of this is known as an
LPC synthesiser (e.g. LPC10e). At every frame, the speech is analysed to compute the filter
coefficients, the energy of the excitation, a voicing decision, and a pitch value if voiced. At the
decoder a regular set of pulses for voiced speech, or white noise for unvoiced speech, is passed
through the linear filter and multiplied by the gain to produce the speech. This is a very efficient
system and typically produces speech coded at 1200-2400 bps. With clever acoustic vector
prediction this can be reduced to 300-600 bps. The disadvantages are a loss of naturalness over
most of the speech and occasionally a loss of intelligibility.
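A minimal sketch of the LPC decoder just described: a pulse train (voiced) or white noise
(unvoiced) is passed through the all-pole LPC filter and scaled by the gain. The coefficient and
parameter values here are made up purely for illustration:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesise_frame(a, gain, voiced, pitch_period, n=160):
    """One frame of LPC synthesis with A(z) = 1 - sum_k a_k z^-k."""
    if voiced:
        excitation = np.zeros(n)
        excitation[::pitch_period] = 1.0          # regular impulse train at the pitch period
    else:
        excitation = np.random.randn(n)           # white noise for unvoiced frames
    # All-pole filter 1/A(z), then scale by the frame gain.
    return gain * lfilter([1.0], np.concatenate(([1.0], -np.asarray(a, dtype=float))), excitation)

frame = lpc_synthesise_frame(a=[1.3, -0.6], gain=0.1, voiced=True, pitch_period=40)
```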
4. CELP: Code-Excited Linear Prediction. The CELP family of coders compensates for the lack
of quality of the simple LPC model by using more information in the excitation. Each vector in a
codebook of excitation vectors is tried, and the index of the one that best matches the original
speech is transmitted. This increases the bit rate to typically 4800-9600 bps. Most speech coding
research is currently directed towards CELP coders.

LPC Modeling

Digital speech signals are sampled at a rate of 8000 samples/sec. Typically, each sample
is represented by 8 bits (using mu-law). This corresponds to an uncompressed rate of 64 kbps
(kbits/sec). With current compression techniques (all of which are lossy), it is possible to reduce
the rate to 8 kbps with almost no perceptible loss in quality. Further compression is possible at a
cost of lower quality. All of the current low-rate speech coders are based on the principle of
linear predictive coding (LPC) which is presented in the following sections.

A. Physical Model:

When you speak:


- Air is pushed from your lungs through your vocal tract and out of your mouth, producing speech.
- For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the
vocal cords vibrate determines the pitch of your voice. Women and young children tend to have
high pitch (fast vibration) while adult males tend to have low pitch (slow vibration).
- For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but
remain constantly open.
- The shape of your vocal tract determines the sound that you make.
- As you speak, your vocal tract changes its shape, producing different sounds.
- The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).
- The amount of air coming from your lungs determines the loudness of your voice.
LPC Transmitter and Receiver:

Please refer to the class notes.

B. Mathematical Model:
The above model is often called the LPC Model.
The model says that the digital speech signal is the output of a digital filter (called the
LPC filter) whose input is either a train of impulses or a white noise sequence.
The relationship between the physical and the mathematical models is:

Vocal Tract                  <->  LPC Filter
Air                          <->  Innovations (excitation)
Vocal Cord Vibration         <->  Voiced excitation (impulse train)
Vocal Cord Vibration Period  <->  Pitch period
Fricatives and Plosives      <->  Unvoiced excitation (white noise)
Air Volume                   <->  Gain

The LPC filter is the all-pole filter

H(z) = G / (1 - a_1 z^-1 - a_2 z^-2 - ... - a_p z^-p),

which is equivalent to saying that the input-output relationship of the filter is given by the
linear difference equation

s(n) = a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p) + G u(n),

where s(n) is the speech sample, u(n) is the excitation (innovations) sequence and G is the gain.

The LPC model can be represented in vector form as

A = (a_1, a_2, ..., a_10, G, V/UV, T),

i.e. ten filter coefficients, the gain G, the voiced/unvoiced decision V/UV and the pitch period T:
13 values in all. The parameter vector A changes every 20 msec or so. At a sampling rate of
8000 samples/sec, 20 msec is equivalent to 160 samples.

The digital speech signal is therefore divided into frames of size 20 msec. There are 50
frames/second.

The model says that a frame of 160 speech samples is equivalent to, and can be compactly
represented by, the 13 values of A.

There is almost no perceptual difference in the reconstructed speech if:

o For Voiced Sounds (V): the impulse train is shifted (the ear is insensitive to this phase change).
o For Unvoiced Sounds (UV): a different white noise sequence is used.

LPC Synthesis: Given the parameter vector A, generate the speech frame (this is done using
standard filtering techniques).

LPC Analysis: Given a frame of speech, find the parameter vector A that best matches it
(this is described in the next section).
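A sketch of LPC analysis of one 20 msec frame using the autocorrelation method (solving the
normal equations); windowing, pre-emphasis and the pitch/voicing estimation that a real coder
needs are omitted:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_analyse_frame(frame, order=10):
    """Return the predictor coefficients a_1..a_p and the gain for one frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # autocorrelation r(0), r(1), ...
    a = solve_toeplitz(r[:order], r[1:order + 1])                  # normal equations R a = [r(1)..r(p)]
    gain = np.sqrt(max(r[0] - np.dot(a, r[1:order + 1]), 0.0))     # residual energy -> gain
    return a, gain

frame = np.random.randn(160)            # stands in for one 160-sample (20 msec) speech frame
a, g = lpc_analyse_frame(frame)
```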

2.Adaptive filter
An adaptive filter is a filter that self-adjusts its transfer function according to an optimizing
algorithm. Because of the complexity of the optimizing algorithms, most adaptive filters are
digital filters that perform digital signal processing and adapt their performance based on the
input signal.
Need: When some parameters of the desired processing operation (for instance, the properties of
some noise signal) are not known in advance, adaptive filters are used. An adaptive filter uses
feedback to refine the values of its filter coefficients and hence its frequency response.
As the power of digital signal processors has increased, adaptive filters have become much more
common and are now routinely used in devices such as mobile phones and other communication
devices, camcorders and digital cameras, and medical monitoring equipment.
Example
Suppose a hospital is recording a heart beat (an ECG), which is being corrupted by a 50 Hz noise
(the frequency coming from the power supply in many countries).
One way to remove the noise is to filter the signal with a notch filter at 50 Hz. However, due to
slight variations in the power supply to the hospital, the exact frequency of the power supply
might (hypothetically) wander between 47 Hz and 53 Hz. A static filter would need to remove all
the frequencies between 47 and 53 Hz, which could excessively degrade the quality of the ECG
since the heart beat would also likely have frequency components in the rejected range.
To circumvent this potential loss of information, an adaptive filter could be used. The adaptive
filter would take input both from the patient and from the power supply directly and would thus
be able to track the actual frequency of the noise as it fluctuates. Such an adaptive technique
generally allows for a filter with a smaller rejection range, which means, in our case, that the
quality of the output signal is more accurate for medical diagnoses.
Block diagram
The block diagram, shown in the following figure, serves as a foundation for particular adaptive
filter realisations, such as Least Mean Squares (LMS) and Recursive Least Squares (RLS). The
idea behind the block diagram is that a variable filter extracts an estimate of the desired signal.

To start the discussion of the block diagram we make the following assumptions:

The input signal is the sum of a desired signal d(n) and interfering noise v(n):

x(n) = d(n) + v(n)

The variable filter has a Finite Impulse Response (FIR) structure. For such structures the
impulse response is equal to the filter coefficients. The coefficients of a filter of order p
are defined as

w_n = [w_n(0), w_n(1), ..., w_n(p)]^T

The error signal, or cost function, is the difference between the desired and the estimated
signal:

e(n) = d(n) - d̂(n)

The variable filter estimates the desired signal by convolving the input signal with the impulse
response. In vector notation this is expressed as

d̂(n) = w_n^T x(n),

where

x(n) = [x(n), x(n-1), ..., x(n-p)]^T

is an input signal vector. Moreover, the variable filter updates the filter coefficients at every time
instant:

w_{n+1} = w_n + Δw_n,

where Δw_n is a correction factor for the filter coefficients. The adaptive algorithm generates this
correction factor based on the input and error signals. LMS and RLS define two different
coefficient update algorithms.
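A minimal LMS sketch of the coefficient update just described, followed by a toy version of the
ECG example above; the signals and the step size mu are illustrative assumptions:

```python
import numpy as np

def lms(x, d, order=8, mu=0.05):
    """LMS adaptive filter: x is the reference input, d the signal containing the noise to remove.
    Returns the error signal e(n) (the cleaned output) and the final coefficients w."""
    w = np.zeros(order + 1)
    e = np.zeros(len(d))
    for n in range(order, len(d)):
        xn = x[n - order:n + 1][::-1]        # input vector [x(n), x(n-1), ..., x(n-p)]
        e[n] = d[n] - np.dot(w, xn)          # error between desired and estimated signal
        w = w + mu * e[n] * xn               # coefficient update w_{n+1} = w_n + mu * e(n) * x(n)
    return e, w

# Toy 50 Hz interference canceller: the reference is taken directly from the mains.
t = np.arange(0, 2, 1 / 1000)
ecg = np.sin(2 * np.pi * 1.2 * t)                        # stand-in for the ECG
d = ecg + 0.5 * np.sin(2 * np.pi * 50 * t + 0.3)         # corrupted recording
e, w = lms(np.sin(2 * np.pi * 50 * t), d)                # e(n) approaches the clean ECG
```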
Applications of adaptive filters
Noise cancellation
Signal prediction
Adaptive feedback cancellation
Echo cancellation
Filter implementations
Least mean squares filter
Recursive least squares filter

1.Noise cancellation
Active noise control (ANC), also known as noise cancellation, active noise reduction (ANR) or
antinoise, is a method for reducing unwanted sound. A noise-cancellation speaker emits a sound
wave with the same amplitude but with inverted phase (antiphase) relative to the original sound;
the two waves combine and effectively cancel each other.

A noise-cancellation speaker may be co-located with the sound source to be attenuated. In this
case it must have the same audio power level as the source of the unwanted sound. Alternatively,
the transducer emitting the cancellation signal may be located at the location where sound
attenuation is wanted (e.g. the user's ear). This requires a much lower power level for
cancellation but is effective only for a single user. Noise cancellation at other locations is more
difficult.
The advantages of active noise control methods compared to passive ones are that they are
generally:
More effective at low frequencies.
Less bulky.
Able to block noise selectively.
2.Linear prediction is a mathematical operation where future values of a discrete-time signal
are estimated as a linear function of previous samples. In digital signal processing, linear
prediction is often called linear predictive coding (LPC).
The prediction model
The most common representation is

x̂(n) = a_1 x(n-1) + a_2 x(n-2) + ... + a_p x(n-p),

where x̂(n) is the predicted signal value, x(n-i) are the previous observed values, and a_i are the
predictor coefficients. The error generated by this estimate is

e(n) = x(n) - x̂(n),

where x(n) is the true signal value.
These equations are valid for all types of (one-dimensional) linear prediction. The differences are
found in the way the parameters a_i are chosen.
For multi-dimensional signals the error metric is often defined as

e(n) = || x(n) - x̂(n) ||,

where ||·|| is a suitably chosen vector norm.

3.Adaptive feedback cancellation


Adaptive feedback cancellation is a common method of cancelling audio feedback in a variety
of electro-acoustic systems such as digital hearing aids. The time varying acoustic feedback
leakage paths can only be eliminated with adaptive feedback cancellation. Adaptive feedback
cancellation has its application in echo cancellation. The error between the desired and the actual
output is taken and given as feedback to the adaptive processor for adjusting its coefficients to
minimize the error.
4.Echo cancellation
The term echo cancellation is used in telephony to describe the process of removing echo from a
voice communication in order to improve voice quality on a telephone call. In addition to
improving subjective quality, this process increases the capacity achieved through silence
suppression by preventing echo from traveling across a network.
Two sources of echo have primary relevance in telephony: acoustic echo and hybrid echo.
Echo cancellation involves first recognizing the originally transmitted signal that re-appears,
with some delay, in the transmitted or received signal. Once the echo is recognized, it can be
removed by 'subtracting' it from the transmitted or received signal. This technique is generally
implemented using a digital signal processor (DSP), but can also be implemented in software.

Echo cancellation is done using either echo suppressors or echo cancellers, or in some cases
both.
The Acoustic Echo Cancellation (AEC) process works as follows:
1. A far-end signal is delivered to the system.
2. The far-end signal is reproduced by the speaker in the room.
3. A microphone also in the room picks up the resulting direct path sound, and consequent
reverberant sound as a near-end signal.
4. The far-end signal is filtered and delayed to resemble the near-end signal.
5. The filtered far-end signal is subtracted from the near-end signal.
6. The resultant signal represents sounds present in the room excluding any direct or
reverberated sound produced by the speaker.
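A sketch of steps 4 and 5 above using a normalised LMS (NLMS) filter to model the echo path;
the filter length, step size and signals are illustrative assumptions, not taken from any particular
AEC implementation:

```python
import numpy as np

def aec_nlms(far, near, order=64, mu=0.5, eps=1e-6):
    """The adaptive filter models the speaker-to-microphone echo path; its output is the
    filtered far-end signal, which is subtracted from the near-end (microphone) signal."""
    w = np.zeros(order)
    out = np.zeros(len(near))
    for n in range(order, len(near)):
        x = far[n - order:n][::-1]                        # recent far-end samples
        out[n] = near[n] - np.dot(w, x)                   # step 5: subtract the echo estimate
        w = w + mu * out[n] * x / (np.dot(x, x) + eps)    # NLMS coefficient update
    return out

far = np.random.randn(4000)                               # far-end speech (stand-in)
echo = 0.3 * np.concatenate((np.zeros(10), far[:-10]))    # simple delayed, attenuated echo path
near = echo + 0.1 * np.random.randn(4000)                 # microphone: echo plus local sound
cleaned = aec_nlms(far, near)
```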

3.Musical sound processing


Musical sound is defined as any tone with characteristics such as controlled pitch and timbre.
Such sounds are produced by instruments in which the periodic vibrations can be controlled by
the performer.
Tones are commonly linked with their sources (violin tone, flute tone, etc.), and they possess
controlled pitch, loudness, timbre, and duration, attributes that make them amenable to musical
organization. Instruments that yield musical sounds, or tones, are those that produce periodic
vibrations. Their periodicity is their controllable (i.e., musical) basis.
The pitch, or high-low aspect, created by each of these vibrating bodies is most directly a
product of vibrational frequency. Timbre (tone colour) is a product of the total complement of
simultaneous motions enacted by any medium during its vibration. Loudness is a product of the
intensity of that motion. Duration is the length of time that a tone persists.
Music signal processing methods, that is, the methods used for coding/decoding, synthesis,
composition and content-indexing of music signals, facilitate some of the essential functional
requirements of a modern multimedia communication system.
Some of the applications of music signal processing methods include the following:
- Music coding for efficient storage and transmission of music signals; examples are MP3 and
Sony's adaptive transform acoustic coder.
- Noise reduction and distortion equalization, such as Dolby systems, restoration of old audio
records degraded by hiss, crackles etc., and signal processing systems that model and
compensate for non-ideal characteristics of loudspeakers and music halls.
- Music synthesis, pitch modification, audio mixing, audio morphing, audio editing and
computer music composition.
- Music transcription and content classification, and music search engines for the Internet.
- Music sound effects, as in 3-D spatial surround music and special-effect sounds in cinemas and
theatres.
Music processing can be divided into two main branches:
1.music signal modelling
2.music content creation.
Bandwidths of Music and Voice
The bandwidth of unimpaired hearing is normally between 10 Hz and 20 kHz, although some
individuals may have hearing ability beyond this range of frequencies. Sounds below 10 Hz are
called infra-sound and sounds above 20 kHz are called ultrasound. The information in speech (i.e.
words, speaker identity, accent, intonation, emotional signals etc.) is mainly in the traditional
telephony bandwidth of 300 Hz to 3.5 kHz.

Music Coding (Compression).


The transmission bandwidth and the storage capacity requirement for digital music depend on
the sampling rate and the number of bits per sample. Stereo music with left and right channels
sampled at 44100 Hz and quantised with 16 bits per sample generates data at a rate of
2 x 44100 x 16 = 1,411,200 bits per second, and requires about 5 gigabits, or roughly 635
megabytes, of storage per hour of music.
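A quick check of the figures above (a sketch of the arithmetic only):

```python
channels, fs, bits = 2, 44100, 16
bit_rate = channels * fs * bits              # 1,411,200 bits per second
bits_per_hour = bit_rate * 3600              # about 5.08e9 bits, i.e. roughly 5 gigabits
mbytes_per_hour = bits_per_hour / 8 / 1e6    # roughly 635 megabytes per hour of stereo music
```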
The objective of music compression is to reduce the bit rate as far as possible while maintaining
high fidelity. This is usually achieved through decomposition of the music signal into a series of
de-correlated time-frequency components or a set of source-filter parameters of a synthesizer
model of music. Using a psycho-acoustic model, the various components of the decomposed
music signal are each allocated the minimum number of bits required to keep the quantization
noise masked and inaudible, and so achieve high fidelity of the reconstructed signal. The
quantization noise of a music coder depends on a number of factors that include: (a) the number
of bits per sample, (b) the efficiency of utilisation of the distributions of the music signal in the
time and frequency domains, and (c) the efficiency of utilisation of the psychoacoustics of
hearing. The goal of audio coding is to utilise the time-frequency distribution of the signal and to
shape the time-frequency distribution of the quantisation noise such that the quantisation noise is
made inaudible and the reconstructed signal is indistinguishable from the original signal.
Adaptive Transform Coding
A transform coder, Figure (12.30), consists of the following sections:
(a) Buffer and window: divides the signal into overlapping segments of length N samples. The
segment length may be variable, as it controls the time and frequency resolutions and affects the
severity of the pre-echo distortion described in section 12.20.
(b) Frequency analysis: transforms the signal into the frequency domain. The discrete cosine
transform is often used due to its ability to compact most of the signal energy into a relatively
limited number of principal components. The Fourier transform may also be used.
(c) Pre-echo detector: detects abrupt changes (e.g. attacks) in signal energy, which can cause an
audible spread of the quantisation noise of the high-energy part of a frame of music into the low-
energy part of the frame and hence produce pre-echo distortion.
(d) Psycho-acoustic model: calculates the tonal and non-tonal distortion (JND) levels for each
frequency band of each signal frame.
(e) Quantizer: represents each frequency component with k bits. One quantiser may be used
for all frequencies; however, it may be advantageous to use a set of quantisers to span the
frequency range, one quantiser for each group of frequency bins.
(f) Bit allocation module (the rate-distortion loop): allocates bits to each quantiser in order to
satisfy two requirements: (i) to keep the total number of bits within the intended bit rate, and
(ii) to keep the distortion of each frequency partial below the calculated JND levels.

Fig:Outline of adaptive transform coding.
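A toy transform-coding sketch of this pipeline, assuming a DCT for the frequency analysis and a
hand-picked bit allocation; a real coder would derive the allocation from the psycho-acoustic
model and the JND levels:

```python
import numpy as np
from scipy.fft import dct, idct

def transform_code_frame(frame, bits_per_coeff):
    """Window -> DCT -> uniform quantisation with per-coefficient bit allocation -> reconstruction."""
    X = dct(frame, norm="ortho")
    levels = 2.0 ** np.asarray(bits_per_coeff, dtype=float)   # quantiser levels per coefficient
    step = (2 * np.abs(X).max() + 1e-12) / levels              # crude uniform step sizes
    X_hat = np.round(X / step) * step                          # quantise then dequantise
    return idct(X_hat, norm="ortho")

frame = np.random.randn(64)
alloc = np.concatenate((np.full(16, 8), np.full(48, 3)))       # more bits for the lower frequencies
recon = transform_code_frame(frame, alloc)
```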

High Quality Audio Coding: MPEG-1/2 layer-3 (MP3)


MPEG is the acronym of the Moving Picture Experts Group, established in 1988 to develop open
standards for coders for moving pictures and audio. Open standards are specifications that are
available to developers interested in implementing the standard.
MPEG Structure
Figure 12.32 illustrates the block diagram structure of an MP3 coder. This is a frequency-domain
coder in that segments of 576 time-domain samples are transformed into 576 frequency-domain
samples. The available bits are then allocated non-uniformly among the various frequency
components depending on the just noticeable distortion (JND) levels at different frequencies,
calculated from the psychoacoustic model.

It consists of the following subsystems


The filter bank can be a uniformly spaced filter bank or a non-uniformly spaced filter bank
whose filter bandwidths are matched to the critical bands of hearing, i.e. high resolution at
lower frequencies and lower resolution at higher frequencies. In this example we assume the
filter bank consists of 32 equal-bandwidth poly-phase filters of length 512 taps each. The filters
are equally spaced and split a bandwidth of 24 kHz, at a sampling rate of 48 kHz, into 32 bands of
width 750 Hz. The output of each filter is down-sampled by a factor of 32:1.
After down-sampling each subband has a sampling rate of 1500 Hz, and the total number of
samples across the 32 subbands is 48000 samples per second; the same as the input sampling rate
before band splitting.
Modified discrete cosine transform: Each sub-band output is segmented into segments 18
samples long, corresponding to a segment length of 18 x 32 = 576 samples of the original
signal before down-sampling (12 ms duration), and transformed by a modified discrete cosine
transform (MDCT). Hence the 750 Hz width of each subband is further decomposed into 18
frequency bins with a frequency resolution of 750/18 ≈ 41.7 Hz.
The auditory perceptual model is based on the critical bands of hearing and masking thresholds
as described in section 12.6. A 1024-sample FFT of the music signal (with a frequency
resolution of Fs/N = 48000/1024 = 46.875 Hz) is used to calculate the noise masking thresholds,
the so-called just noticeable distortion (JND) levels; this is the amount of quantization noise in
each frequency band that would be masked and made inaudible by the signal energy at and
around that band, as explained in section 12.6. The frequency bands for calculation of the
masking thresholds are based on the critical bands of hearing. If the quantisation noise energy
can be kept below the masking threshold then the compressed signal will have the same
transparent perceptual audio quality as the original signal.
Quantisation and coding processes aim to distribute the available bits among the DCT
coefficients such that the quantisation noise remains masked. This is achieved through an
iterative two-stage optimization loop. A power law quantiser is used so that large spectral values
are coded with a larger quantization step size, as a higher signal energy masks more quantisation
noise. The quantised values are then Huffman coded. To adapt the coder to the local statistics of
the input audio signal, the best Huffman coding table is selected from a number of choices.
The Huffman coder is a probabilistic coding method that achieves coding efficiency through
assigning shorter length codewords to more probable (i.e. more frequent) signal values and
longer length codewords to less frequent values. Consequently, for audio signals smaller
quantised values, which are more frequent, are assigned shorter length codewords and larger
values, which are less frequent, are assigned longer length codewords.
Quantisation consists of two loops: an inner loop that adjusts the rate to keep the overall bit rate
within the required limit, and an outer loop that aims to keep the distortion in each critical band
masked.

4.Image enhancement
The aim of image enhancement is to improve the interpretability or perception of information in
images for human viewers, or to provide `better' input for other automated image processing
techniques.
Image enhancement techniques can be divided into two broad categories:
1. Spatial domain methods, which operate directly on pixels, and
2. Frequency domain methods, which operate on the Fourier transform of an image.
4.1Spatial domain methods
Suppose we have a digital image which can be represented by a two-dimensional random field
f(x, y).
An image processing operator in the spatial domain may be expressed as a mathematical function
T[·] applied to the image f(x, y) to produce a new image g(x, y), as follows:

g(x, y) = T[ f(x, y) ]

The operator T applied to f(x, y) may be defined over:

(i) A single pixel (x, y). In this case T is a grey-level transformation (or mapping) function.
(ii) Some neighbourhood of (x, y).
(iii) A set of input images instead of a single image.

a)Enhancement by point processing - Intensity transformations


Image Negatives
The negative of a digital image is obtained by the transformation function

s = T(r) = (L - 1) - r,

shown in the accompanying figure, where L is the number of grey levels. The idea is that the
intensity of the output image decreases as the intensity of the input increases. This is useful in
numerous applications, such as displaying medical images.

[Figure: the negative transformation s = T(r) = (L - 1) - r, with r and s in the range 0 to L - 1.]

Contrast Stretching
Low contrast images often occur due to:
i) poor or non-uniform lighting conditions,
ii) nonlinearity of the imaging sensor, or
iii) the small dynamic range of the imaging sensor.
Contrast stretching darkens the grey levels below a chosen level m and brightens the levels above
m in the original image, using a grey-level transformation s = T(r) that is steep around m.
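A short sketch of the two point operations above; the S-shaped contrast-stretching function used
here is one common choice, with m and E as illustrative parameters:

```python
import numpy as np

def negative(img, L=256):
    # Image negative: s = T(r) = (L - 1) - r
    return (L - 1) - img

def contrast_stretch(img, m=128.0, E=4.0):
    # Darken levels below m and brighten levels above m with a smooth S-shaped mapping.
    r = img.astype(np.float64)
    return 255.0 / (1.0 + (m / (r + 1e-12)) ** E)

img = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
neg = negative(img)
stretched = contrast_stretch(img)
```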

b) Histogram processing.
The histogram represents the frequency of occurrence of the various grey levels in the
image. A plot of this function for all values of k provides a global description of the appearance
of the image. By processing (modifying) the histogram of an image we can create a new image
with specific desired properties.
Suppose we have a digital image of size N x N with grey levels in the range [0, L - 1]. The
histogram of the image is defined as the discrete function

p(r_k) = n_k / N²,

where r_k is the kth grey level, k = 0, 1, ..., L - 1,
n_k is the number of pixels in the image with grey level r_k, and
N² is the total number of pixels in the image.
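A minimal sketch of this histogram definition for an 8-bit image:

```python
import numpy as np

def grey_histogram(img, L=256):
    # p(r_k) = n_k / (total number of pixels)
    n_k = np.bincount(img.ravel(), minlength=L)
    return n_k / img.size

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
p = grey_histogram(img)        # p[k] is the fraction of pixels at grey level k; p.sum() == 1
```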

c) Enhancement in the case of a single image


Spatial masks
Many image enhancement techniques are based on spatial operations performed on local
neighbourhoods of input pixels. The image is usually convolved with a finite impulse response
filter called a spatial mask. The use of spatial masks on a digital image is called spatial filtering.
Suppose that we have an image f(x, y) of size N x N and we define a neighbourhood around
each pixel, for example a rectangular window of size 3 x 3 with weights:

w1 w2 w3
w4 w5 w6
w7 w8 w9

If we replace each pixel by a weighted average of its neighbourhood pixels, then the response of
the linear mask for the centre pixel z5 is the sum of w_i z_i for i = 1, ..., 9. We may repeat the
same process for the whole image.
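A sketch of spatial filtering with a 3 x 3 averaging mask (all weights w_i = 1/9); the boundary
handling mode is an arbitrary choice:

```python
import numpy as np
from scipy.ndimage import convolve

w = np.full((3, 3), 1.0 / 9.0)                       # 3x3 averaging mask: response = sum_i w_i * z_i

img = np.random.randint(0, 256, (64, 64)).astype(np.float64)
smoothed = convolve(img, w, mode="reflect")          # apply the mask at every pixel
```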
d) Enhancement in the case of multiple images
Image averaging
Suppose that we have an image f(x, y) of size M x N pixels corrupted by noise n(x, y), so we
obtain a noisy image as follows:

g(x, y) = f(x, y) + n(x, y)

For the noise process n(x, y) the following assumptions are made: (i) the noise process n(x, y) is
ergodic, (ii) it is zero mean, and (iii) it is white, i.e. the autocorrelation function of the noise
process is zero for all non-zero lags.

Suppose now that we have L different noisy realisations of the same image f(x, y),

g_i(x, y) = f(x, y) + n_i(x, y),   i = 1, 2, ..., L,

where each noise process n_i(x, y) satisfies properties (i)-(iii) above. A new image g_bar(x, y) is
formed by averaging these L noisy images. Image averaging produces an image g_bar(x, y)
corrupted by noise whose variance is less than the variance of the noise in the original noisy
images (for uncorrelated zero-mean noise it is reduced by a factor of L).
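A small sketch of image averaging under assumptions (i)-(iii), using synthetic zero-mean white
noise:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.zeros((64, 64))                                           # noise-free image (all zeros for simplicity)
L = 16                                                           # number of noisy realisations
g = [f + rng.normal(0.0, 1.0, f.shape) for _ in range(L)]        # g_i(x, y) = f(x, y) + n_i(x, y)

g_bar = np.mean(g, axis=0)                                       # averaged image
print(np.var(g[0]), np.var(g_bar))                               # noise variance drops by roughly a factor of L
```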

4.2 Frequency domain methods


Let g(x, y) be a desired image formed by the convolution of an image f(x, y) with a linear,
position-invariant operator h(x, y), that is:

g(x, y) = h(x, y) * f(x, y)

The following frequency-domain relationship holds:

G(u, v) = H(u, v) F(u, v)

We can select H(u, v) so that the desired image

g(x, y) = F^-1[ H(u, v) F(u, v) ]

exhibits some highlighted features of f(x, y). For instance, edges in f(x, y) can be accentuated by
using a function H(u, v) that emphasises the high-frequency components of F(u, v).
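A brief sketch of this frequency-domain filtering g = F^-1[H F]; the high-pass H(u, v) below
simply zeroes a small block of the lowest frequencies and is only illustrative:

```python
import numpy as np

def frequency_filter(f, H):
    # g(x, y) = inverse FFT of H(u, v) * F(u, v)
    return np.real(np.fft.ifft2(H * np.fft.fft2(f)))

f = np.random.rand(64, 64)
H = np.ones_like(f)
for rows in (slice(0, 4), slice(-4, None)):          # with fft2, the lowest frequencies sit in the corners
    for cols in (slice(0, 4), slice(-4, None)):
        H[rows, cols] = 0.0
edges = frequency_filter(f, H)                       # emphasises high-frequency content such as edges
```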
