Module3 SSP

SPEECH SIGNAL PROCESSING

ECE3028
BY
DR. K. GOWRI
ASSISTANT PROFESSOR,
DEPARTMENT OF ECE,
PRESIDENCY UNIVERSITY,
BANGALORE
Module 3

• Introduction, definitions and properties: Fourier transform
interpretation and z-transform interpretation, sampling rates in
time and frequency, filter bank summation method for short-time
synthesis, spectral estimation of speech using the discrete
Fourier transform
Introduction- Frequency Domain Representations
• In many areas of science and engineering, the representation of signals or other
functions by sums of sinusoids or complex exponentials leads to convenient
solutions to problems.
• Such representations—Fourier representations as they are commonly called—
are useful in signal processing for two basic reasons.
1. For linear systems, it is very convenient to determine the response to a superposition of
sinusoids or complex exponentials.
2. Fourier representations often serve to place in evidence certain properties of the
signal that may be obscure, or at least less evident, in the original signal.
• Speech communication research and technology are areas where the concept of
a Fourier representation has traditionally played a major role.
• The model in Figure 1 consists of a linear system with system function V(z), excited by a
source that is either periodic (A_V p[n] ∗ g[n] for voiced speech) or
random (A_N u[n] for unvoiced speech).
• A transfer function, R(z), represents radiation of sound at the lips.
• In general, the spectrum of the output of such a model would be the product of
the frequency responses of the vocal tract system, the spectrum of the
excitation source, and the spectrum of the model of sound radiation.
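Fig. 1 General discrete-time model of speech production showing explicit sources for voiced and
unvoiced speech sounds.
• For voiced speech such as a sustained vowel, we can write the discrete-time
Fourier transform (DTFT) of the model output as:

$$S(e^{j\omega}) = A_V\,P(e^{j\omega})\,G(e^{j\omega})\,V(e^{j\omega})\,R(e^{j\omega})$$

• and for sustained unvoiced speech, assuming white noise excitation with unit
power, the power spectrum of the output is,

$$\Phi_{ss}(e^{j\omega}) = A_N^2\,|V(e^{j\omega})|^2\,|R(e^{j\omega})|^2$$

• Here P, G, V and R are the DTFTs of p[n], g[n] and of the impulse responses of V(z) and R(z);
the symbols S and Φss for the output are notation assumed here.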
• Thus, it is to be expected that the Fourier spectrum of the output would reflect
the properties of the excitation, the vocal tract and radiation frequency
responses.
• However, although vowels and fricatives can be sustained for several seconds
with little variation, natural speech is continually changing in time.
• Thus the standard Fourier representations that are appropriate for periodic,
transient, or stationary random signals are not directly applicable to the
representation of speech signals.
• However, temporal properties such as energy, zero crossings, and correlation
are slowly varying (the short-time analysis principle), so they can be assumed
to be fixed over time intervals on the order of 10 to 40 msec.
• In order to study spectral properties of speech signals, we will define a time-
varying Fourier transform, which we generally refer to as the short-time Fourier
transform (STFT).
• Use of the STFT is therefore termed short-time Fourier analysis (STFA).
• The STFT is invertible in the sense that, with certain constraints, we can recover
the original sampled signal by a process that we term short-time Fourier
synthesis (STFS).
• Indeed, STFA/STFS provides a representation of the speech waveform that can
serve as the basis for many types of speech processing including coding and
various types of signal enhancement.

Fig. 2 Model of frequency-domain processing of speech via STFA and STFS methods.

• This is depicted in Figure 2, which shows that the processing can be controlled
by “side information” extracted by other means from the speech signal.
Discrete time Fourier Analysis
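• The discrete-time Fourier transform (DTFT) of a discrete-time signal, x[n], is
defined by the pair of equations,

$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\omega n}, \qquad x[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\,e^{j\omega n}\,d\omega$$

• where ω is the normalized frequency variable of X(e^jω) in units of radians.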
• The discrete Fourier transform (DFT) is inherently a representation of
periodic sequences, but it is applicable to finite-length sequences if care is taken
to ensure that one period is precisely equal to the desired finite-length
sequence.
• The DFT and its inverse are given by the equations,

$$X[k] = \sum_{n=0}^{N-1} x[n]\,e^{-j(2\pi/N)kn}, \quad k = 0, 1, \ldots, N-1$$

$$x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,e^{j(2\pi/N)kn}, \quad n = 0, 1, \ldots, N-1$$
• The DFT and DTFT can both be used as mathematical representations of a finite-
length sequence; specifically, the DFT and the DTFT of a finite-length sequence
are related by,

$$X[k] = X(e^{j\omega})\Big|_{\omega = 2\pi k/N}, \quad k = 0, 1, \ldots, N-1$$

that is, the DFT is a sampled (in frequency) version of the DTFT.
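The sampling relation between the DFT and the DTFT can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `dtft` and the test sequence are illustrative and not part of the slides.

```python
import numpy as np

def dtft(x, omegas):
    """DTFT of a finite-length sequence x[0..L-1], evaluated at each frequency in omegas (radians)."""
    n = np.arange(len(x))
    return np.array([np.sum(x * np.exp(-1j * om * n)) for om in omegas])

x = np.array([1.0, 2.0, 0.5, -1.0, 0.25])    # arbitrary finite-length sequence (assumed)
N = len(x)
omega_k = 2 * np.pi * np.arange(N) / N       # DFT frequencies 2*pi*k/N
X_dft = np.fft.fft(x)                        # N-point DFT
X_dtft = dtft(x, omega_k)                    # DTFT sampled at the same frequencies
max_err = np.max(np.abs(X_dft - X_dtft))     # expected to be at rounding-error level
```

Evaluating `dtft` on a denser frequency grid gives the continuous curve that the N DFT values sample.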
SHORT-TIME FOURIER ANALYSIS
• Define the time-dependent, or short-time, Fourier transform (STFT) as,

$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} w[\hat{n}-m]\,x[m]\,e^{-j\hat{\omega}m}$$
• --(1)
• where w[nˆ − m] is a real window sequence whose purpose is to determine the
portion of the input signal that receives emphasis at a particular time index, nˆ.
• The time dependent Fourier transform is a complex function of two variables:
1. the time index, nˆ, which is discrete,
2. the frequency variable, ωˆ, which is continuous and periodic, with period 2π.
• The domain of the two variables is shown in Figure 3:
1. nˆ, for the range 0 ≤ nˆ ≤ 8 (nˆ is defined for all integer values, but only a few
are shown in the figure), and
2. ωˆ, for 0 ≤ ωˆ < 2π (since the STFT is periodic in ωˆ with period 2π). Alternatively,
we could use the range −π < ωˆ ≤ π.
FIGURE 3 Domain of variables nˆ and ωˆ in the definition of the STFT.
• An alternative form of Eq. (1) is obtained by a change of summation index,
which yields the expression,

$$X_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} w[m]\,x[\hat{n}-m]\,e^{-j\hat{\omega}(\hat{n}-m)}$$

• If we define

$$\tilde{X}_{\hat{n}}(e^{j\hat{\omega}}) = \sum_{m=-\infty}^{\infty} x[\hat{n}-m]\,w[m]\,e^{j\hat{\omega}m}$$

• then Xnˆ(e^jωˆ) can be expressed as,

$$X_{\hat{n}}(e^{j\hat{\omega}}) = e^{-j\hat{\omega}\hat{n}}\,\tilde{X}_{\hat{n}}(e^{j\hat{\omega}})$$
• The STFT equations can be interpreted in two distinct ways.
1. nˆ fixed: Xnˆ(e^jωˆ) is simply the DTFT of the sequence w[nˆ − m]x[m],
−∞ < m < ∞. Therefore, for fixed nˆ, Xnˆ(e^jωˆ) has the same properties as a normal DTFT.
2. ωˆ fixed: Xnˆ(e^jωˆ) is a function of the time index nˆ. This corresponds to
points at the intersection of the horizontal dashed line in Figure 3 with the vertical lines.
This interpretation leads us naturally to consider the time-dependent Fourier
representation in terms of linear filtering.
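The STFT definition and its alternative (change-of-index) form can be compared directly in code. This is a minimal sketch under stated assumptions: x[m] is zero outside the stored samples, the window is causal with support 0..L−1, and the function names, toy signal and window length are hypothetical choices for illustration.

```python
import numpy as np

def stft_definition(x, w, n_hat, omega):
    """Sum over m of w[n_hat - m] x[m] e^{-j omega m}, for one time index and one frequency."""
    total = 0j
    for m in range(len(x)):
        j = n_hat - m                          # window index n_hat - m
        if 0 <= j < len(w):                    # window is zero outside 0..L-1
            total += w[j] * x[m] * np.exp(-1j * omega * m)
    return total

def stft_alternative(x, w, n_hat, omega):
    """e^{-j omega n_hat} times the sum over m of x[n_hat - m] w[m] e^{+j omega m}."""
    total = 0j
    for m in range(len(w)):
        i = n_hat - m
        if 0 <= i < len(x):                    # signal is zero outside the stored samples
            total += x[i] * w[m] * np.exp(1j * omega * m)
    return np.exp(-1j * omega * n_hat) * total

x = np.cos(0.3 * np.arange(40))                # toy signal (assumed)
w = np.hamming(16)                             # 16-point causal Hamming window (assumed)
a = stft_definition(x, w, n_hat=20, omega=0.3)
b = stft_alternative(x, w, n_hat=20, omega=0.3)
```

Holding `n_hat` fixed and sweeping `omega` traces out the DTFT of one windowed segment; holding `omega` fixed and sweeping `n_hat` gives the time sequence that the linear filtering view describes.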

DTFT Interpretation
DFT Implementation
DTFT Interpretation
• Consider Xnˆ(e^jωˆ) as the DTFT of the sequence w[nˆ − m]x[m], −∞ < m < ∞, for
fixed nˆ.
• The time-dependent Fourier transform is a function of the time index, nˆ, which
takes on all integer values so as to “slide” the window, w[nˆ − m], along the
sequence, x[m].
• This is depicted in Figure 4, which shows x[m] and w[nˆ − m] as functions of m
for several values of nˆ.

FIGURE 4 Plots of x[m] and w[nˆ − m] for several values of nˆ.
• The conditions for the existence of the STFT representation are identical to
those for the DTFT; i.e., the sequence must be absolutely summable.
• For the STFT, the sequence x[m]w[nˆ − m] must be absolutely summable for all
values of nˆ.
• If, as is generally the case, w[nˆ − m] is of finite duration and |x[m]| < ∞ for all
m, then this condition is satisfied for all nˆ.
• As in the case for normal Fourier transforms of discrete-time signals, the STFT is
periodic in ωˆ with period 2π.
• This is easily seen by substituting ωˆ + 2π into Eq. (1).
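• The STFT can be expressed in terms of analog frequencies through the relation
ωˆ = ΩˆT, where T is the sampling period used to obtain the sequence x[m],
and Ωˆ is the analog radian frequency.
• For fixed nˆ, the STFT Xnˆ(e^jωˆ) has the same properties as a normal DTFT, whose
analysis and synthesis equations are,

$$X(e^{j\omega}) = \sum_{m=-\infty}^{\infty} x[m]\,e^{-j\omega m}$$
• --(2)

$$x[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\,e^{j\omega m}\,d\omega$$
• --(3)
• In particular, the input sequence x[m] can be recovered exactly from the time-varying Fourier
transform.
• Since Xnˆ(e^jωˆ) for fixed nˆ is the DTFT of w[nˆ − m]x[m], Eq. (3) becomes,

$$w[\hat{n}-m]\,x[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} X_{\hat{n}}(e^{j\hat{\omega}})\,e^{j\hat{\omega}m}\,d\hat{\omega}$$
• --(4)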
• Note that the integration could be over any interval of length 2π (e.g., 0 to 2π).
• Now if w[0] ≠ 0, Eq. (4) can be evaluated for m = nˆ, thereby obtaining,

$$x[\hat{n}] = \frac{1}{2\pi\,w[0]}\int_{-\pi}^{\pi} X_{\hat{n}}(e^{j\hat{\omega}})\,e^{j\hat{\omega}\hat{n}}\,d\hat{\omega}$$

• Thus, if w[0] ≠ 0, the sample x[nˆ] can be exactly recovered from Xnˆ(e^jωˆ), if it is
known for all ωˆ over one period.
• More generally, not just the single value x[nˆ] but an entire range of x[m]
can be recovered: wherever the window is nonzero, the sequence x[m] can be recovered as,

$$x[m] = \frac{1}{2\pi\,w[\hat{n}-m]}\int_{-\pi}^{\pi} X_{\hat{n}}(e^{j\hat{\omega}})\,e^{j\hat{\omega}m}\,d\hat{\omega}$$

• for m inside the interval where w[nˆ − m] ≠ 0.
• This demonstrates that it is possible to invert the short-time transform, but it is
not a computationally practical way to do it.
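• The STFT, Xnˆ(e^jωˆ), is, in general, a complex-valued function of nˆ and ωˆ.
Therefore, it can be expressed in terms of its real and imaginary parts,

$$X_{\hat{n}}(e^{j\hat{\omega}}) = a_{\hat{n}}(\hat{\omega}) - j\,b_{\hat{n}}(\hat{\omega})$$

• For the case when x[m] and w[nˆ − m] are both real,
anˆ(ωˆ) = Σ w[nˆ − m]x[m]cos(ωˆm) and bnˆ(ωˆ) = Σ w[nˆ − m]x[m]sin(ωˆm), and they can be
shown to satisfy certain symmetry and periodicity relations that are the properties of any DTFT.
• Another representation for Xnˆ(e^jωˆ) is in terms of magnitude and phase as,

$$X_{\hat{n}}(e^{j\hat{\omega}}) = |X_{\hat{n}}(e^{j\hat{\omega}})|\,e^{j\theta_{\hat{n}}(\hat{\omega})}$$

• The quantities |Xnˆ(e^jωˆ)| and θnˆ(ωˆ) can readily be related to anˆ(ωˆ) and
bnˆ(ωˆ) and vice versa.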
DFT Implementation
• While the DTFT interpretation of the STFT yields useful insights, we must rely on
the DFT and its fast computation algorithms (FFT) to implement the
computation of the STFT as a sequence of Fourier transforms evaluated at a
finite discrete set of frequencies ω_k = 2πk/N, with k = 0, 1, ... , N − 1.
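• By substituting (2πk/N) for ωˆ in Eq. (1), and then replacing the summation index m by nˆ + m, we obtain,

$$X_{\hat{n}}(e^{j(2\pi k/N)}) = \sum_{m=-\infty}^{\infty} w[\hat{n}-m]\,x[m]\,e^{-j(2\pi k/N)m} = e^{-j(2\pi k/N)\hat{n}}\sum_{m=0}^{L-1} x[\hat{n}+m]\,w[-m]\,e^{-j(2\pi k/N)m}$$
• --(5)
• where the window is chosen to be an L-point non-causal window such that
w[−m] ≠ 0 only in 0 ≤ m ≤ L − 1 and L ≤ N.
• It follows that the STFT at time nˆ and frequencies ω_k = 2πk/N is,

$$X_{\hat{n}}(e^{j(2\pi k/N)}) = e^{-j(2\pi k/N)\hat{n}}\,\tilde{X}_{\hat{n}}[k], \qquad \tilde{X}_{\hat{n}}[k] = \sum_{m=0}^{L-1} x[\hat{n}+m]\,w[-m]\,e^{-j(2\pi k/N)m}$$
• --(6)
• The sum in Eq. (5), denoted X˜nˆ[k] in Eq. (6), should be recognized as the N-point DFT of the
windowed sequence x˜nˆ[m] = x[nˆ + m]w[−m] for 0 ≤ m ≤ L − 1.
• Hence X˜nˆ[k] can be computed efficiently by an FFT algorithm if N is a power
of two or some other highly composite number.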
Steps for DFT
• Thus, to compute X˜nˆ[k], we iterate the following steps:
• 1. Select a set of L samples starting at nˆ. (For a causal window, we would take
sample nˆ and the L − 1 samples preceding nˆ.)
• 2. Multiply by the window w[−m] to form x˜nˆ[m] = x[nˆ + m]w[−m], for m = 0,
1, ... , L − 1.
• 3. Compute the N-point DFT of x˜nˆ[m] using a fast (FFT) algorithm.
• 4. If both the magnitude and phase of Xnˆ(e^j(2πk/N)) are required, multiply by
e^−j(2πk/N)nˆ as in Eq. (6).
• Otherwise, note that,

$$|X_{\hat{n}}(e^{j(2\pi k/N)})| = |\tilde{X}_{\hat{n}}[k]|$$
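The steps above can be sketched with an FFT as follows. This is an illustration under assumptions, not a reference implementation: the function name `stft_frame_fft`, the array `w_rev` (holding w[−m] for m = 0..L−1), the random test signal and the sizes L = 32, N = 64 are all hypothetical.

```python
import numpy as np

def stft_frame_fft(x, w_rev, n_hat, N):
    """Return (X_tilde, X) at frequencies 2*pi*k/N for analysis time n_hat.

    w_rev[m] holds w[-m] for m = 0..L-1 (non-causal window convention), with L <= N.
    """
    L = len(w_rev)
    frame = x[n_hat:n_hat + L] * w_rev                   # steps 1 and 2: select and window
    X_tilde = np.fft.fft(frame, n=N)                     # step 3: N-point DFT (zero-padded)
    k = np.arange(N)
    X = np.exp(-2j * np.pi * k * n_hat / N) * X_tilde    # step 4: linear phase factor
    return X_tilde, X

rng = np.random.default_rng(0)
x = rng.standard_normal(200)                             # stand-in for a speech signal
w_rev = np.hamming(32)
X_tilde, X = stft_frame_fft(x, w_rev, n_hat=50, N=64)

# Direct evaluation of the defining sum at the same frequencies, for comparison.
m = np.arange(50, 50 + 32)
k = np.arange(64)[:, None]
X_direct = np.sum(w_rev * x[m] * np.exp(-2j * np.pi * k * m / 64), axis=1)
```

When only a magnitude spectrum is needed (for example, for a spectrogram), the phase factor in step 4 can be skipped because it has unit magnitude.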
Sampling Rates in Time and Frequency
• The STFT is a complex two-dimensional representation of the one-dimensional
real speech signal x[n].
• That is, Xn(e^jωˆ) is a function of both the discrete index n, which represents time,
and the continuous normalized radian analysis frequency ωˆ.
• As such, Xn(e^jωˆ) is like a (complex-valued) two-dimensional image with one
discrete dimension (n) and one continuous dimension (ωˆ).
• Figure 5a shows the region of support for Xn(e^jωˆ) in two dimensions.
• A basic consideration in the digital implementation of systems for STFA is the
rate at which Xn(e^jωˆ) should be sampled in both the time and frequency
dimensions to provide an un-aliased representation from which x[n]
can be exactly recovered.
FIGURE 5 Domain of STFT variables ωˆ and n for (a) the case with no sampling and (b) the case with sampling based
on the bandwidth and timewidth of the lowpass window, w[n].
• Figure 5b shows the discrete region of support when Xn(e^jωˆ) is sampled in time
with interval R samples and in frequency at frequencies ωk = (2πk/N) as in,

$$X_{rR}(e^{j(2\pi k/N)}) = \sum_{m=-\infty}^{\infty} w[rR-m]\,x[m]\,e^{-j(2\pi k/N)m}$$
• --(7)
• where k = 0, 1, ... , N − 1 and r ranges over the integers.
• The limits on the sum are set by the window. For example, for an L-point causal
Hamming window, the limits would be finite, ranging from m = rR − L + 1 to m = rR.
• R and N should be chosen carefully, so that the speech signal can be
reconstructed from its sampled STFT.
• Sampling rates lower than the theoretical minimum rate required to avoid
aliasing can be used in either the time or the frequency dimension,
and x[n] can still be exactly recovered from the aliased (under-sampled)
short-time transform.
• Such under-sampled representations are quite useful in applications such as
spectral estimation, pitch and formant analysis, and digital speech spectrograms,
as well as in speech and audio coding.
Sampling Rate of Xn(e^jωˆ) in Time
• The linear filtering interpretation is used to determine the required sampling rate in
the time dimension.
• For a fixed value of ωˆ, Xn(e^jωˆ) is the output of a linear filter with impulse
response w[n] whose input is the frequency-shifted signal x[n]e^−jωˆn.
• The DTFT of the window, W(e^jω), has the properties of a (non-ideal) low-pass filter
frequency response.
• Let us denote the effective bandwidth of the analysis window as B Hz.
• Thus the sequence Xn(e^jωˆ) (as a function of n with ωˆ fixed) has bandwidth
determined by the DTFT of the window, and therefore, according to the sampling
theorem, Xn(e^jωˆ) must be sampled at a rate of at least 2B samples/sec to avoid
aliasing.
• As an example, consider an (L = 2M + 1)-point Hamming window.
• The approximate filter bandwidth of W(e^jω) in terms of analog
frequencies is,

$$B = \frac{2F_s}{L}\ \text{Hz}$$

• where Fs is the sampling rate of the signal, x[n],
• and thus the required sampling rate of Xn(e^jωˆ) in the time dimension is Fs/R ≥
2B = 4Fs/L samples/sec, or R ≤ L/4, where R is an integer.
• In other words, for the Hamming window, in order to avoid aliasing for the
sequence of time samples of Xn(e^jωˆ) at frequency ωˆ, the spacing between
analysis window positions must be less than or equal to 25% of the window
length.
• Equivalently, the windows must overlap by at least 75%.
• Thus for L = 100 and Fs = 10,000 Hz, we get B = 200 Hz, so that Xn(e^jωˆ) must be
evaluated at least 400 times/sec, i.e., every R = 100/4 = 25 samples at the sampling
rate of the input.
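The arithmetic of this example can be written as a small helper. This is a sketch that assumes the approximate Hamming bandwidth B = 2Fs/L used above; the function name is hypothetical.

```python
def hamming_time_sampling(L, Fs):
    """Time-sampling requirements for an L-point Hamming window, assuming B = 2*Fs/L Hz."""
    B = 2.0 * Fs / L            # approximate window bandwidth in Hz
    min_rate = 2.0 * B          # minimum STFT frame rate (frames/sec) from the sampling theorem
    R_max = L // 4              # largest integer hop with Fs/R >= 2B, i.e. R <= L/4
    overlap = 1.0 - R_max / L   # fractional overlap of adjacent windows at that hop
    return B, min_rate, R_max, overlap

B, min_rate, R_max, overlap = hamming_time_sampling(L=100, Fs=10000)
```

For L = 100 and Fs = 10,000 Hz this reproduces the figures in the text: B = 200 Hz, 400 frames/sec, R = 25 samples, 75% overlap.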
Sampling Rate in STFT Frequency ωˆ
• Since Xn(e^jωˆ) is periodic in ωˆ with period 2π, it is only necessary to sample over
an interval of length 2π.
• To determine an appropriate finite set of frequencies, ωˆk = 2πk/N, k = 0, 1, ... ,
N − 1, at which Xn(e^jωˆ) must be specified to exactly recover the signal x[n], we
use the DTFT interpretation of Xn(e^jωˆ).
• If the window is time-limited, then when Xn(e^jωˆ) is viewed as a DTFT, its inverse
transform is time-limited.
• In this case the sampling theorem requires that we sample Xn(e^jωˆ) in the
frequency dimension at a rate of at least twice its “time-width.”
• Since the inverse DTFT of XrR(e^jωˆ) is the signal x[m]w[rR − m], and this signal is of
duration L samples (again due to the finite-duration window), then according to
the sampling theorem, XrR(e^jωˆ) must be sampled (in frequency ωˆ) at the set of
frequencies,

$$\hat{\omega}_k = \frac{2\pi k}{N}, \quad k = 0, 1, \ldots, N-1$$

• with N ≥ L in order to exactly recover x[n] from XrR(e^jωˆk).
• With this set of samples, Eq. (7) will be recognized as the N-point DFT of the
sequence x[m]w[rR − m], which is assumed to be of finite-length L samples.
• Thus, for an L-point causal window, the inverse DFT yields,

$$\frac{1}{N}\sum_{k=0}^{N-1} X_{rR}(e^{j(2\pi k/N)})\,e^{j(2\pi k/N)m} = w[rR-m]\,x[m], \quad rR-L+1 \le m \le rR$$

• if N ≥ L.
• Otherwise time aliasing will occur; i.e., the inverse DFT evaluated in the interval
rR − L + 1 ≤ m ≤ rR will be composed of a sum of shifted (by N) copies of
x[m]w[rR − m].
• Thus, for the example of a Hamming window of duration L = 100 samples, we
require XrR(e^jωˆ) to be evaluated at no fewer than 100 uniformly spaced frequencies.
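The time aliasing that occurs when N < L can be seen in a small numerical experiment. This sketch assumes a generic length-8 windowed segment (random numbers stand in for x[m]w[rR − m]); the helper `sample_dtft` is illustrative.

```python
import numpy as np

def sample_dtft(frame, N):
    """Samples of the DTFT of frame[0..L-1] at the N frequencies 2*pi*k/N."""
    n = np.arange(len(frame))
    k = np.arange(N)[:, None]
    return np.sum(frame * np.exp(-2j * np.pi * k * n / N), axis=1)

rng = np.random.default_rng(0)
L = 8
frame = rng.standard_normal(L)                            # stand-in for one windowed segment

rec_ok = np.fft.ifft(sample_dtft(frame, N=8)).real        # N >= L: segment is recovered
rec_alias = np.fft.ifft(sample_dtft(frame, N=4)).real     # N < L: shifted copies overlap
expected_alias = frame[:4] + frame[4:]                    # sum of copies shifted by N = 4
```

With N = 4 the inverse DFT returns the sum of the two halves of the segment, so the individual samples can no longer be separated from this one frame alone.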
Total Sampling Rate
• We can determine the total number of samples of Xn(e^jωˆ) that must be computed
per second to give an un-aliased representation of the original signal x[n].
• The minimum sampling rate of Xn(e^jωˆ) in the time dimension is 2B, where B is
the frequency bandwidth of the window, and the minimum number of samples
in the frequency dimension is L, the time width of the window. The total sampling rate is therefore,

$$SR = 2B \cdot L\ \text{samples/sec}$$
• --(8)
• For most practical windows, B can be represented as a multiple of (Fs/L), where
Fs is the sampling frequency of x[n]; i.e.,

$$B = C_b\,\frac{F_s}{L}\ \text{Hz}$$

• where Cb is the proportionality constant.
• Thus Eq. (8) can be written as,

$$SR = 2\,C_b\,F_s\ \text{samples/sec}$$

• The ratio of SR to Fs is therefore,

$$\frac{SR}{F_s} = 2\,C_b$$
• The quantity 2Cb indicates the “oversampling” ratio of the short-time analysis
as compared to the sampling rate used to obtain the sequence x[n].
• As an example, if w[n] is a Hamming window, then 2Cb = 4, whereas if w[n] is a
rectangular window [and if the bandwidth B is defined to be the frequency of
the first zero of W(e^jω)], then 2Cb = 2.
• It should be emphasized that the DTFT of the rectangular window is not a very
good lowpass filter.
• Its sidelobes are only about 13 dB down from the value at ω = 0, while the sidelobes of
the Hamming window are much lower (roughly 40 dB down).
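The difference in sidelobe levels can be estimated numerically from a zero-padded FFT of each window. This is a rough sketch: the window length of 51, the FFT size and the helper `peak_sidelobe_db` are assumptions made for illustration.

```python
import numpy as np

def peak_sidelobe_db(w, nfft=8192):
    """Highest sidelobe of |W(e^{jw})| in dB relative to the value at w = 0."""
    W = np.abs(np.fft.rfft(w, nfft))
    W_db = 20 * np.log10(np.maximum(W, 1e-12) / W[0])
    i = 1
    while i < len(W_db) - 1 and W_db[i + 1] < W_db[i]:
        i += 1                      # walk down the main lobe to its first local minimum
    return W_db[i:].max()           # largest value beyond the main lobe

L = 51
rect_db = peak_sidelobe_db(np.ones(L))       # rectangular window
hamm_db = peak_sidelobe_db(np.hamming(L))    # Hamming window
```

The rectangular window's highest sidelobe comes out near −13 dB, while the Hamming window's sidelobes are tens of dB lower, which is the price paid for its main lobe being twice as wide.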
• However, in return for this expansion of sampling rate, we obtain a very flexible
representation of the signal from which extensive modifications, in both the
time and frequency domains, can be made.
FILTER BANK SUMMATION METHOD OF SYNTHESIS
• It is necessary to sample the STFT in both the time and frequency dimensions
in order to obtain efficient and effective computational realizations of short-
time Fourier analysis and synthesis.
• Exact reconstruction was shown to be possible under the requirement N ≥ L; this
requirement can be relaxed if we choose the window and N properly.
• Assume that the sampling rate in the time dimension is identical to that of the
input signal; i.e., in the notation, R = 1.
• The frequency-sampled STFT is,

$$X_n(e^{j\omega_k}) = \sum_{m=-\infty}^{\infty} w[n-m]\,x[m]\,e^{-j\omega_k m}$$
• --(9)
• where for uniform sampling in frequency, ωk = 2πk/N, with k = 0, 1, ... , N − 1.
These are the standard DFT frequencies.
• The STFS equation is,

$$y[n] = \frac{1}{N}\sum_{k=0}^{N-1} X_n(e^{j\omega_k})\,e^{j\omega_k n}$$
• --(10)
• This is simply the inverse DFT of Xn(e^jωk) evaluated at the particular time n.
• The process of STFS from the point of view of the linear filtering interpretation
of the STFT is discussed in this topic.
• The method of synthesis that emerges from this interpretation is called the filter
bank summation method (FBS) of short-time synthesis.
• Before exploring this interpretation, observe that if N ≥ L, where L is the window length, the
inverse DFT of Xn(e^jωk) produces w[n − m]x[m] for n − m inside the region of support of the
window.
• If m = n,

$$y[n] = \frac{1}{N}\sum_{k=0}^{N-1} X_n(e^{j\omega_k})\,e^{j\omega_k n} = w[0]\,x[n]$$
• --(11)
• That is, Eq. (11) exactly reconstructs x[n] to within a constant multiplier, and if
w[0] > 0, we can divide by w[0] to obtain x[n].
• Therefore, for all n, Eq. (10) is the desired synthesis equation and w[0] is
the scale factor on the synthesized output.
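The reconstruction result y[n] = w[0]x[n] for N ≥ L, and its failure for N < L, can be verified with a direct (slow) implementation. This sketch assumes R = 1, a causal window w[0..L−1], x[m] = 0 for m < 0, and a synthesis sum that includes the 1/N factor of the inverse DFT; the function name and test signal are hypothetical.

```python
import numpy as np

def fbs_dft_synthesis(x, w, N):
    """y[n] = (1/N) * sum_k X_n(e^{j w_k}) e^{j w_k n}, with w_k = 2*pi*k/N and R = 1."""
    omega = 2 * np.pi * np.arange(N) / N
    y = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        X_n = np.zeros(N, dtype=complex)
        for j in range(len(w)):                  # j = n - m runs over the window support
            m = n - j
            if m >= 0:                           # x[m] is taken as zero for m < 0
                X_n += w[j] * x[m] * np.exp(-1j * omega * m)
        y[n] = np.sum(X_n * np.exp(1j * omega * n)) / N
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
w = np.hamming(8)                                # L = 8
y_ok = fbs_dft_synthesis(x, w, N=8)              # N >= L: y[n] = w[0] x[n]
y_alias = fbs_dft_synthesis(x, w, N=4)           # N < L: extra term w[4] x[n - 4] appears
```

For this window the N = 4 output contains a delayed copy of the input weighted by w[4], because the window is nonzero at a multiple of N other than zero.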
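• As a function of n, Xn(e^jωk) has the interpretation of low-pass filtering following
frequency down-shifting by ωk.
• With a change of summation variable (replacing m by n − m), we have the alternative form,

$$X_n(e^{j\omega_k}) = e^{-j\omega_k n}\sum_{m=-\infty}^{\infty} x[n-m]\,w[m]\,e^{j\omega_k m}$$
• --(12)
• With the definition,

$$h_k[n] = w[n]\,e^{j\omega_k n}$$
• --(13)
• Eq. (12) becomes,

$$X_n(e^{j\omega_k}) = e^{-j\omega_k n}\sum_{m=-\infty}^{\infty} x[n-m]\,h_k[m]$$
• --(14)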
• As we know, the window w[n] has the properties of a low-pass filter.
• Eq. (14) can be interpreted as in Figure 6 as a band-pass filter with impulse
response hk[n], followed by frequency down-shifting by modulation with the
complex exponential e^−jωkn.

FIGURE 6 Another interpretation of short-time spectral analysis in terms of linear
filtering: complex operations.
• Figure 7 shows an example of how the corresponding low-pass and band-pass
filter frequency responses are related for the case of a Hamming window for
analysis frequency ωˆ.

FIGURE 7 Filters for STFA based on the Hamming window: (a) lowpass filter
frequency response; (b) bandpass filter frequency response for analysis frequency ωˆ.
• Now, the signal,

$$y_k[n] = X_n(e^{j\omega_k})\,e^{j\omega_k n}$$
• --(15)
is simply one of the terms of the sum in Eq. (10).
• Therefore, using (14) and (15),

$$y_k[n] = \sum_{m=-\infty}^{\infty} x[n-m]\,h_k[m]$$
• --(16)
• That is, yk[n] is the output of a band-pass filter with impulse response hk[n] as
defined by Eq. (13).
• The operations of Eqs. (12), (14) and (16) are depicted in Figure 8.

FIGURE 8 Methods for implementing the synthesis of a single channel in terms of linear
filtering.
• The result summarized in Figure 8 provides a practical method for
reconstructing the input signal from its time-dependent Fourier transform.
• We simply implement N band-pass channels of the form of Eq. (16) and sum the
outputs; i.e.,

$$y[n] = \sum_{k=0}^{N-1} y_k[n] = \sum_{k=0}^{N-1} X_n(e^{j\omega_k})\,e^{j\omega_k n}$$
• --(17)
• This sum carries no 1/N factor, so it is N times the inverse DFT of Xn(e^jωk) at time n.
• This motivates the name FBS for this approach to STFS.
• The combination of analysis followed by synthesis in each channel can be represented as a
band-pass filter centered on the analysis frequency ωk.
• Now consider the set of N frequencies {ωk = 2πk/N}, k = 0, 1, ... , N − 1, and
suppose that the N time sequences Xn(e^jωk) are available, one for each frequency.
• This can be achieved by a “bank” of analysis/synthesis channels.
• The N bandpass filters of the form of Eq. (13) have frequency responses,

$$H_k(e^{j\omega}) = W(e^{j(\omega-\omega_k)}), \quad k = 0, 1, \ldots, N-1$$
• --(18)
• If we consider the entire collection of band-pass filters, each having the same
input and their outputs added together as in Figure 9, the composite frequency
response relating y[n] to x[n] is,

$$\tilde{H}(e^{j\omega}) = \sum_{k=0}^{N-1} H_k(e^{j\omega}) = \sum_{k=0}^{N-1} W(e^{j(\omega-\omega_k)})$$
• --(19)

FIGURE 8 Equivalent linear system relating yk[n] and y[n] to x[n].
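• If Xn(e^jωk) is properly sampled in the frequency dimension (i.e., if N ≥ L where L is
the time duration of the window), then it can be shown that,

$$\frac{1}{N}\sum_{k=0}^{N-1} W(e^{j(\omega-\omega_k)}) = w[0]$$
• --(20)
• To derive Eq. (20), recall that the samples W(e^jωk), k = 0, 1, ... , N − 1, form the
N-point DFT of the window.
• Therefore, the inverse DFT of W(e^jωk) is,

$$\frac{1}{N}\sum_{k=0}^{N-1} W(e^{j\omega_k})\,e^{j\omega_k n} = \sum_{r=-\infty}^{\infty} w[n+rN]$$
• --(21)
• which corresponds to a time-aliased representation of w[n].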
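• If w[n] is nonzero only for 0 ≤ n ≤ L − 1 and L ≤ N, then the shifted copies in Eq. (21)
do not overlap,

$$\sum_{r=-\infty}^{\infty} w[n+rN] = w[n], \quad 0 \le n \le N-1$$

• and no time aliasing occurs due to sampling in frequency.
• Therefore, if Eq. (21) is evaluated for n = 0, we get,

$$\frac{1}{N}\sum_{k=0}^{N-1} W(e^{j\omega_k}) = w[0]$$

• The impulse response of the composite system is,

$$\tilde{h}[n] = \sum_{k=0}^{N-1} h_k[n] = w[n]\sum_{k=0}^{N-1} e^{j\omega_k n} = N\,w[0]\,\delta[n]$$

• so the composite frequency response equals N w[0] for every ω, as stated in Eq. (20),
and the filter bank output is y[n] = N w[0] x[n].
• We have used the concept of a bank of filters to confirm a very important result
that we have already observed from the DFT viewpoint.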
• Under the conditions that
• 1. w[n] has finite duration L, and
• 2. N ≥ L,
• the sequence x[n] can be reconstructed exactly from the time-dependent
Fourier transform sampled in the time dimension (n) at the sampling rate of the
input signal and sampled in the frequency dimension at N ≥ L equally spaced
frequencies over 0 ≤ ωˆ < 2π.
• However, there are many other ways to achieve exact reconstruction, even when N < L.
• For example, if the window is the infinite sequence,

$$w[n] = \frac{\sin(\pi n/N)}{\pi n}, \quad -\infty < n < \infty$$
• --(22)
• corresponding to the impulse response of an ideal lowpass filter with cutoff
frequency π/N,
• then the composite frequency response would be constant,

$$\tilde{H}(e^{j\omega}) = 1$$

independent of ω, 0 ≤ ω < 2π.
• This situation is depicted in Figure 9, which shows the composite response for N
= 6 equally spaced ideal filters each of total width 2π/N.
• It is clear that exact reconstruction is achieved for this window even though it
has infinite length.
FIGURE 9 Composite frequency response for N = 6 equally spaced ideal filters
• To explore this question further and to understand why it is not necessary for
the window to have finite length with L ≤ N, it is helpful to redefine the
synthesis equation as,

$$y[n] = \sum_{k=0}^{N-1} P[k]\,X_n(e^{j\omega_k})\,e^{j\omega_k n}$$
• where we have introduced a set of complex gain coefficients, P[k], for each of
the N filter bank channels.
• These complex gain coefficients are shown in Figure 10
FIGURE 10 Analysis and synthesis operations for short-time spectrum analysis.
• The filters hk[n] are the equivalent band-pass filters for analysis followed by
synthesis.
• The coefficients P[k] can be used to adjust the magnitude and phase of the
individual channels.
• This adds significant flexibility in the design and implementation of STFA/STFS
systems.
• Now the overall composite impulse response of the filter bank becomes,

$$\tilde{h}[n] = \sum_{k=0}^{N-1} P[k]\,h_k[n] = w[n]\sum_{k=0}^{N-1} P[k]\,e^{j\omega_k n}$$

• Defining,

$$p[n] = \frac{1}{N}\sum_{k=0}^{N-1} P[k]\,e^{j(2\pi/N)kn}$$

• to be the inverse DFT of the set of coefficients P[k], k = 0, 1, ... , N − 1,
• h˜[n] can be written as,

$$\tilde{h}[n] = N\,w[n]\,p[n]$$

• Since the sequence p[n] is periodic with period N when evaluated outside the base
interval 0 ≤ n ≤ N − 1, the nature of w[n] outside the base interval can
significantly affect the overall impulse response of the filter bank.
• Specifically, if P[k] = 1 for all k, then p[n] is a periodic train of impulses; i.e.,

$$p[n] = \sum_{r=-\infty}^{\infty} \delta[n-rN]$$

• Therefore, h˜[n] is,

$$\tilde{h}[n] = N\sum_{r=-\infty}^{\infty} w[rN]\,\delta[n-rN]$$

• i.e., the composite impulse response is simply the window sequence sampled at
intervals of N samples (and scaled by N).
• This is depicted in Figure 11.
• Figure 11a shows the sequence p[n].
• Figure 11b shows w[n] as given in Eq. (7.75)
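The statement that the composite impulse response is the window sampled every N samples can be checked for a window much longer than N. The sketch below assumes P[k] = 1, a synthesis sum without a 1/N factor, and a truncated ideal-lowpass (sinc) window with cutoff π/N; the names and the truncation to |n| ≤ 30 are illustrative choices.

```python
import numpy as np

def composite_impulse_response(w, n, N):
    """h~[n] = sum over k of w[n] e^{j 2 pi k n / N}, i.e. all channel gains P[k] = 1."""
    k = np.arange(N)[:, None]
    return (w * np.exp(2j * np.pi * k * n / N)).sum(axis=0)

N = 6
n = np.arange(-30, 31)                 # window support far longer than N (truncated sinc)
w = np.sinc(n / N) / N                 # sin(pi n / N) / (pi n), ideal lowpass with cutoff pi/N
h = composite_impulse_response(w, n, N)
unit_impulse = (n == 0).astype(float)  # what exact reconstruction requires
```

Because this window is zero at every nonzero multiple of N and w[0] = 1/N, the only surviving sample is at n = 0, so the bank passes the input unchanged even though the window is far longer than N.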
FIGURE 12 Typical sequences for p[n] and w[n] for composite filter bank
• The output of the analysis/synthesis system is,

$$y[n] = x[n] * \tilde{h}[n] = N\sum_{r=-\infty}^{\infty} w[rN]\,x[n-rN]$$

• If w[rN] is nonzero only for a single index r = r0, this reduces to
y[n] = N w[r0N] x[n − r0N]. Thus, apart from the scale factor, N w[r0N], and delay of
r0N samples, the output of the system for time-dependent Fourier analysis and
synthesis is an exact replica of the input sequence.
