
Speech Compression (2)


SPEECH COMPRESSION

When you speak:

Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
For voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.

Women and young children tend to have high pitch (fast vibration), while adult males tend to have low pitch (slow vibration).
For fricative and plosive (unvoiced) sounds, your vocal cords do not vibrate but remain open.
The shape of your vocal tract determines the sound that you make.
As you speak, your vocal tract changes its shape, producing different sounds.

The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).

The amount of air coming from your lungs determines the loudness of your voice.
VOWELS

Diphthongs
SEMIVOWELS
A semivowel is a sound intermediate between a vowel and a consonant.
Semivowels are quite difficult to characterize; they have a vowel-like nature.
Examples: /w/, /l/, /r/ and /y/
CONSONANT
A nasal consonant is a type of consonant produced with a lowered velum in the mouth, allowing air to come out through the nose while the air is not allowed to pass through the mouth.
The air flows through the nasal tract, with sound being radiated at the nostrils.
The three nasal consonants are distinguished by the place along the oral tract at which a total constriction is made:
o For /m/ the constriction is at the lips
o For /n/ the constriction is behind the teeth
o For /ŋ/ the constriction is just forward of the velum itself
PCM
• Sample Rate
– Nyquist Criterion
– Bandwidth Limitation
• Sample Size
– Quantization Levels
– Quantization Noise / Distortion
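
As a quick illustration of these PCM parameters, here is a minimal Python sketch (the 8 kHz sample rate, 8-bit uniform quantizer, and 440 Hz test tone are illustrative choices, not taken from the slides) showing how the sample rate and sample size set the bit rate and the quantization noise:

# Minimal PCM sketch (assumed parameters: 8 kHz sampling, 8-bit uniform quantizer).
import numpy as np

fs = 8000          # samples per second (Nyquist: signal band-limited below fs/2 = 4 kHz)
bits = 8           # sample size -> 2**bits quantization levels
levels = 2 ** bits

t = np.arange(fs) / fs                    # one second of a test tone
x = 0.8 * np.sin(2 * np.pi * 440 * t)     # analog-like signal in [-1, 1]

step = 2.0 / levels                       # quantizer step size over the range [-1, 1)
xq = np.clip(np.round(x / step) * step, -1.0, 1.0 - step)   # uniform quantization

noise = x - xq                            # quantization noise / distortion
snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

print("bit rate:", fs * bits, "bits per second")   # 8000 * 8 = 64,000 bps
print("quantization SNR: %.1f dB" % snr_db)        # roughly 6 dB per bit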
Compression
• Lossy / Lossless
• Source Independent / Source Dependent
Voice Coding
(Compression)
• Waveform
• Frequency Domain
• Vocoder
Waveform Coding
• DSI - Digital Speech Interpolation
• µ-Law / A-Law
• Differential PCM
– ADPCM – Adaptive Differential PCM
• Delta Modulation
– CVSD – Continuously Variable Slope Delta modulation
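
A minimal sketch of µ-law companding, one of the waveform-coding techniques listed above; µ = 255 is the usual telephony value, and in a real codec the companded value would then be quantized to 8 bits:

# Minimal mu-law companding sketch (mu = 255).
import numpy as np

MU = 255.0

def mu_law_compress(x):
    # x is assumed to be normalized to [-1, 1]
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 9)
y = mu_law_compress(x)          # companded value, uniformly quantized to 8 bits in practice
x_rec = mu_law_expand(y)        # expansion recovers the original (up to quantization)
print(np.allclose(x, x_rec))    # True: compress/expand are inverses before quantization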
Frequency Domain
• SBC – Sub Band Coding
Vocoder
• LPC – Linear Predictive Coding
– CELP – Code Excited Linear Prediction
– VSELP – Vector Sum Excited Linear Prediction
At the transmitter, the speech is divided into segments. Each segment is analyzed to determine an excitation signal and the parameters of the vocal tract filter.

In some of the schemes, a model for the excitation signal is transmitted to the receiver. The excitation signal is then synthesized at the receiver and used to drive the vocal tract filter.

In other schemes, the excitation signal itself is obtained using an analysis-by-synthesis approach. This signal is then used by the vocal tract filter to generate the speech signal.

In the channel vocoder, each segment of input speech is analyzed using a bank of band-pass filters called the analysis filters. The energy at the output of each filter is estimated at fixed intervals and transmitted to the receiver.
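
A minimal sketch of this analysis side of a channel vocoder; the band edges, filter order, and 20 ms frame length are illustrative assumptions, and the filters are ordinary Butterworth band-pass filters from SciPy:

# Channel-vocoder analysis sketch: band-pass filter bank + per-frame energy estimates.
# Band edges, filter order and frame size are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

fs = 8000
frame = 160                                   # 20 ms frames at 8 kHz
edges = [(200, 600), (600, 1200), (1200, 2000), (2000, 3000)]   # example analysis bands

def band_energies(speech):
    energies = []
    for lo, hi in edges:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        y = lfilter(b, a, speech)
        # energy of each filter output, estimated at fixed intervals (one value per frame)
        n_frames = len(y) // frame
        e = [np.sum(y[i * frame:(i + 1) * frame] ** 2) for i in range(n_frames)]
        energies.append(e)
    return np.array(energies)                 # shape (num_bands, num_frames), sent to the receiver

speech = np.random.randn(8000)                # stand-in for one second of speech
print(band_energies(speech).shape)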

Phonemes

Shannon coding theory – 1st coding theorem


Computer - PCM: 8000 samples per second
Audio / speech: 44.1 kHz (120 bits per sample for singing voice)
Bits per sample: 8
Bit rate = 8000 × 8 = 64,000 bits per second

Speech is divided into 20-30 millisecond frames.

10 LPC coefficients
Bits per coefficient: 16 bits (we can give)
10 × 16 = 160 bits
(20 bits for other parameters)

So, 180 bits per frame in total.
Number of frames in a second = 50
Total number of bits for one second = bit rate of 1 second of speech = 50 × 180 = 9000 bits
Shannon coding
Divide into phonemes – about 10 phonemes per second
128 phonemes, so we need only 7 bits each
So 1 second = 10 × 7 = 70 bits is enough
This is called the entropy of speech for English
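
The three bit rates worked out above can be checked with a few lines of arithmetic:

# Bit-rate arithmetic from the slides.
pcm_rate = 8000 * 8                       # 64,000 bits per second

lpc_frame_bits = 10 * 16 + 20             # 10 coefficients x 16 bits + 20 bits of other parameters = 180
lpc_rate = 50 * lpc_frame_bits            # 50 frames per second -> 9,000 bits per second

phoneme_rate = 10 * 7                     # ~10 phonemes/s, 128 phonemes -> 7 bits each = 70 bits per second

print(pcm_rate, lpc_rate, phoneme_rate)   # 64000 9000 70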
What is a phoneme, with examples?
A phoneme, in linguistics, is the smallest unit of speech distinguishing one word (or word element) from another, as the element p in "tap," which separates that word from "tab," "tag," and "tan." A phoneme may have more than one variant, called an allophone (q.v.), which functions as a single sound; for example, the p's of ...
Channel Vocoders

Formants are distinctive frequency components of the acoustic signal produced by speech, musical instruments or singing. The information that humans require to distinguish between speech sounds can be represented purely quantitatively by specifying peaks in the amplitude or frequency spectrum.
Linear Predictive Coder
Instead of the vocal tract being modeled by a bank of filters, in the linear predictive coder the vocal tract is modeled as a single linear filter whose output y_n is related to the input ε_n by

y_n = Σ_{i=1}^{M} a_i y_{n-i} + G ε_n

where G is called the gain of the filter. As in the case of the channel vocoder, the input to the vocal tract filter is either the output of a random noise generator or a periodic pulse generator.
At the transmitter, a segment of speech is analyzed. The parameters obtained
include a decision as to whether the segment of speech is voiced or unvoiced,
the pitch period if the segment is declared voiced, and the parameters of the
vocal tract filter.

The input speech is generally sampled at 8000 samples per second. In the
LPC-10 standard, the speech is broken into 180 sample segments,
corresponding to 22.5 milliseconds of speech per segment.
The Voiced/Unvoiced Decision

The samples of the voiced speech have larger amplitude; that is, there is more energy in the voiced speech. The unvoiced speech contains higher frequencies.
Both speech segments have average values close to zero; because of its higher-frequency content, the unvoiced speech waveform crosses the x = 0 line more often than the voiced speech sample.

The decision as to whether the speech is voiced or unvoiced is based on the energy in the segment relative to the background noise and the number of zero crossings within a specified window.

In the LPC-10 algorithm, the speech segment is first low-pass filtered using a filter with a
bandwidth of 1 kHz. The energy at the output relative to the background noise is used to obtain
a tentative decision about whether the signal in the segment should be declared voiced or
unvoiced.
The estimate of the background noise is basically the energy in the unvoiced speech segments.
This tentative decision is further refined by counting the number of zero crossings and
checking the magnitude of the coefficients of the vocal tract filter.
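
A minimal sketch of such an energy-plus-zero-crossing decision; the two thresholds and the test segments are illustrative stand-ins, not the tuned LPC-10 values:

# Voiced/unvoiced decision sketch: segment energy vs. background noise, plus zero-crossing count.
# The two thresholds below are illustrative; LPC-10 uses its own tuned values.
import numpy as np

def voiced_unvoiced(segment, noise_energy, energy_ratio_thresh=4.0, zc_thresh=60):
    energy = np.mean(segment ** 2)
    zero_crossings = np.sum(np.abs(np.diff(np.sign(segment))) > 0)
    if energy > energy_ratio_thresh * noise_energy and zero_crossings < zc_thresh:
        return "voiced"      # high energy, few zero crossings
    return "unvoiced"        # low energy and/or many zero crossings

fs = 8000
t = np.arange(180) / fs                             # one 180-sample (22.5 ms) LPC-10 segment
voiced_like = np.sin(2 * np.pi * 120 * t)           # low-frequency, high-energy segment
unvoiced_like = 0.1 * np.random.randn(180)          # noise-like, low-energy segment
print(voiced_unvoiced(voiced_like, noise_energy=0.01))    # voiced
print(voiced_unvoiced(unvoiced_like, noise_energy=0.01))  # unvoiced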
Estimating the Pitch Period

The autocorrelation of a periodic function Rxx(k) will have a maximum when k is equal to the pitch period. Coupled with the fact that the estimation of the autocorrelation function generally leads to a smoothing out of the noise, this makes the autocorrelation function a useful tool for obtaining the pitch period.
Voiced speech is not exactly periodic
When there is uncertainty about the magnitude of the maximum value, it is
difficult to select a value for the threshold. Another problem occurs because of
the interference due to other resonances in the vocal tract.
The LPC-10 algorithm uses the average magnitude difference function (AMDF), which can be used to identify the pitch period as well as the voicing condition.
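
A minimal sketch of AMDF-based pitch estimation; unlike the autocorrelation, the AMDF dips at lags equal to the pitch period, so the search looks for a minimum. The lag search range (2.5-16 ms at 8 kHz) and the synthetic test signal are illustrative assumptions:

# AMDF pitch-period sketch: the AMDF has a deep minimum at lags equal to the pitch period.
import numpy as np

def amdf_pitch(segment, min_lag=20, max_lag=130):
    n = len(segment)
    amdf = []
    for lag in range(min_lag, max_lag + 1):
        d = np.mean(np.abs(segment[lag:] - segment[:n - lag]))   # average magnitude difference at this lag
        amdf.append(d)
    return min_lag + int(np.argmin(amdf))                        # lag with the smallest average difference

fs = 8000
t = np.arange(360) / fs
pitch_period = 80                                   # 80 samples -> 100 Hz fundamental
speech = np.sin(2 * np.pi * (fs / pitch_period) * t)
print(amdf_pitch(speech))                           # close to 80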
Obtaining the Vocal Tract Filter
In linear predictive coding, the vocal tract is modeled by a linear filter with the input-output relationship shown in the equation above.

If y_n are the speech samples in that particular segment, then we want to choose the a_i to minimize the average value of e_n^2, where

e_n = y_n - Σ_{i=1}^{M} a_i y_{n-i}

One way of solving for the coefficients is the autocorrelation approach.
In order to compute the filter
coefficients of an Mth-order filter, the
Levinson-Durbin algorithm requires the
computation of all filters of order less
than M. Furthermore, during the
computation of the filter coefficients,
the algorithm generates a set of
constants k known as the reflection
coefficients, or partial correlation
(PARCOR) coefficients.
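
A minimal sketch of the autocorrelation approach with the Levinson-Durbin recursion; the 10th-order filter and the random test segment are illustrative, and the recursion also exposes the reflection (PARCOR) coefficients:

# Autocorrelation approach + Levinson-Durbin recursion for the LPC coefficients a_i.
# The recursion builds up all lower-order filters and yields the reflection (PARCOR) coefficients k.
import numpy as np

def lpc_levinson_durbin(segment, order):
    # autocorrelation values R(0) .. R(order)
    r = np.array([np.dot(segment[:len(segment) - k], segment[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    reflection = []
    for m in range(1, order + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err   # reflection (PARCOR) coefficient
        reflection.append(k)
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]                 # update the lower-order coefficients
        a[m] = k
        err *= (1.0 - k * k)                                # prediction error shrinks at each order
    # sign convention matching the slides: y_n = sum a_i y_{n-i} + G eps_n
    return -a[1:], reflection, err

segment = np.random.randn(180)                              # stand-in for one 22.5 ms LPC-10 segment
coeffs, parcor, pred_err = lpc_levinson_durbin(segment, order=10)
print(coeffs.shape, len(parcor))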
Transmitting the Parameters

The parameters that need to be transmitted include the voicing decision, the pitch period, and the vocal tract filter parameters.
CELP (G.728)

A codebook of excitation patterns is constructed. Each entry in this codebook is an excitation sequence that consists of a few nonzero values separated by zeros.

Given a segment from the speech sequence to be encoded, the encoder obtains the vocal tract filter using the LPC analysis described previously. The encoder then excites the vocal tract filter with the entries of the codebook. The difference between the original speech segment and the synthesized speech is fed to a perceptual weighting filter, which weights the error using a perceptual weighting criterion. The codebook entry that generates the minimum average weighted error is declared to be the best match. The index of the best-match entry is sent to the receiver along with the parameters for the vocal tract filter.
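
A minimal sketch of this analysis-by-synthesis codebook search; the codebook, the vocal tract filter coefficients, and the plain squared error (standing in for the perceptual weighting filter) are illustrative simplifications, and the excitation gain used in real CELP coders is omitted:

# Analysis-by-synthesis codebook search sketch.
# Codebook, LPC coefficients, and the plain squared-error criterion are illustrative stand-ins.
import numpy as np
from scipy.signal import lfilter

frame = 40
rng = np.random.default_rng(0)

# Sparse excitation codebook: a few nonzero values separated by zeros.
codebook = np.zeros((64, frame))
for i in range(64):
    pos = rng.choice(frame, size=4, replace=False)
    codebook[i, pos] = rng.choice([-1.0, 1.0], size=4)

lpc = np.array([1.0, -0.9])                   # synthesis filter 1/A(z), illustrative coefficients
target = rng.standard_normal(frame)           # stand-in for the original speech segment

def search(target, codebook, lpc):
    best_index, best_err = 0, np.inf
    for i, excitation in enumerate(codebook):
        synthesized = lfilter([1.0], lpc, excitation)   # excite the vocal tract filter
        err = np.mean((target - synthesized) ** 2)      # error (perceptual weighting omitted)
        if err < best_err:
            best_index, best_err = i, err
    return best_index           # index sent to the receiver with the vocal tract filter parameters

print(search(target, codebook, lpc))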
Silence Compression
Silence compression provides a way to squeeze redundancy
out of sound files. The silence compression scheme is
essential for efficient voice communication systems. It allows
significant reduction of transmission bandwidth during a period
of silence.
Voice Message Silence Compression
1. Determine a threshold value that can be considered silence, even though it is not pure silence.
2. Extract the data from the sound file to be compressed and pass it through a threshold check; if a sample is below the threshold (considered to be silence), make it pure silence.
3. Apply run-length coding to the manipulated data (sketched below).
4. Store it as a compressed file.
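
A minimal sketch of these four steps; the threshold value and the (silence code, count) byte layout are illustrative assumptions, and a real coder would also need an escape for literal samples equal to the silence code:

# Silence-compression sketch: threshold check -> pure silence -> run-length coding of the silence.
SILENCE_CODE = 0xFF
THRESHOLD = 4          # 8-bit samples this close to the midpoint (128) count as silence

def compress(samples):            # samples: list of 8-bit unsigned values (0..255, midpoint 128)
    out = []
    run = 0
    for s in samples:
        if abs(s - 128) <= THRESHOLD:          # steps 1-2: below threshold -> treat as pure silence
            run += 1
            if run == 255:                     # a single count byte can only hold 255
                out += [SILENCE_CODE, 255]
                run = 0
        else:
            if run:                            # step 3: emit (silence code, run length)
                out += [SILENCE_CODE, run]
                run = 0
            out.append(s)
    if run:
        out += [SILENCE_CODE, run]
    return out                                 # step 4: store this as the compressed file

data = [128, 129, 127, 128, 200, 60, 128, 128, 128, 131]
print(compress(data))                          # [255, 4, 200, 60, 255, 4]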
What are the parameters that are used in silence compression?

- Silence compression is used in compressing sound files.

- It is equivalent to run-length coding on normal data files.

- The parameters are:

1. A threshold value: the level below which samples can be considered silence.
2. A silence code followed by a single byte, which indicates how many consecutive silence samples are present.
3. A threshold that specifies the start of a run of silence.
