Methods to Study Speech Synthesis
Pattern Playback
Articulatory synthesis
Parametric synthesis
Analysis by synthesis
Pattern Playback
• The first example of a real speech synthesizer appeared in 1951 at Haskins
Laboratories: the PATTERN PLAYBACK.
• It was essentially a sonagraph machine working in reverse.
• The sonagraph transforms recorded speech into a '3-dimensional' plot: the first two
dimensions are time and frequency, and the third is intensity, represented on a gray scale.
• "Conversely, by drawing schematic evolutions of formant frequencies on a
glass plate, and by scanning this spectrogram along the time axis (using a
set of frequency-modulated light beams and a light collector that is fed
into a loudspeaker), one can actually hear the sound corresponding to the
spectrogram" (Stella, 1985).
• In other words, a hand-drawn spectrogram is played back instead of a printed one (see the sketch below).
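To make the playback principle concrete, here is a minimal sketch in Python (not the actual Haskins hardware) that "plays back" a hand-drawn pattern by summing sinusoids whose frequencies follow the painted formant tracks over time; the sample rate, duration and track values are assumed purely for illustration.

```python
# A rough software analogue of Pattern Playback: sum sinusoids whose
# frequencies follow hand-drawn formant tracks. All values are assumed.
import numpy as np

fs = 8000                               # sample rate (Hz), assumed
dur = 0.5                               # duration of the drawn pattern (s)
t = np.arange(int(fs * dur)) / fs

# Hypothetical hand-drawn tracks: F1 rises while F2 falls.
f1 = np.linspace(300, 700, t.size)
f2 = np.linspace(2200, 1200, t.size)

# Integrate the instantaneous frequency to obtain phase, then sum the "beams".
phase1 = 2 * np.pi * np.cumsum(f1) / fs
phase2 = 2 * np.pi * np.cumsum(f2) / fs
signal = 0.7 * np.sin(phase1) + 0.3 * np.sin(phase2)
# 'signal' can now be written to a WAV file or sent to a loudspeaker.
```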
ARTICULATORY SYNTHESIS
Articulatory synthesis is also a parametric approach, one which attempts to model the physical
properties of the human vocal tract.
The goals of articulatory synthesis:
• Naturalness of the model.
• Accuracy of the model in comparison with the speaker(s) on which it is based.
• Intelligibility of the model.
• Understanding and information gained from the model.
In articulatory synthesis, the speech articulators (e.g. jaw, tongue, lips) are
controlled directly.
The natural speech production process is modeled as accurately as possible in
articulatory synthesis.
This is done by creating a synthetic replica of the human physiology and making it
produce speech.
• The vocal tract geometry is described in one, two or three dimensions, depending on the
articulatory synthesizer.
• In a 1-D model, the vocal tract is represented directly by its area function.
The area function describes the variation of the cross-sectional area of the vocal tract
tube between the glottis and the mouth opening (a small sketch follows this list).
• The advantage of two- and three-dimensional models is that the position and shape of the
articulators can be described in a direct and specific manner.
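As a rough illustration of the 1-D area-function idea, the sketch below treats the vocal tract as a chain of uniform tube sections from glottis to lips and derives the reflection coefficient at each junction from the adjacent cross-sectional areas (in the style of a Kelly-Lochbaum tube model); the area values are hypothetical.

```python
# 1-D vocal tract as an area function: a chain of uniform tube sections.
# Reflection coefficients at each junction follow from adjacent areas.
import numpy as np

# Hypothetical cross-sectional areas (cm^2), ordered glottis -> lips.
areas = np.array([0.5, 1.0, 2.5, 4.0, 3.0, 1.5, 0.8])

# Reflection coefficient at the junction between section i and i+1.
reflection = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])
print(reflection)
```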
ARTICULATORY MODEL (Coker & Fujimura 1966)
PARAMETRIC SYNTHESIS (LPC)
• LPC (Linear Predictive Coding) rests on the hypothesis that any sample is a linear function
of the samples that precede it.
• LPC parameterizes the speech signal; that is, it analyses the complex, constantly changing
speech signal into a few values called "parameters", which change relatively slowly.
• The parameters which represent the signal are the frequencies & bandwidths of a set of
filters which would produce that signal, given a certain excitation.
• Since the speech signal is represented as a set of parameters, one can edit these parameters (see the sketch below).
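As a minimal sketch of the LPC hypothesis, the code below approximates each sample as a linear combination of the p samples preceding it, with the coefficients obtained from the autocorrelation (normal equations). The test signal, the prediction order and the small amount of added noise are assumptions made only for illustration.

```python
# Linear prediction: x[n] is modelled as a weighted sum of past samples.
import numpy as np

def lpc(x, p):
    """Return order-p linear-prediction coefficients for signal x."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]    # lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])                         # predictor coefficients

fs = 8000
t = np.arange(fs) / fs
# Toy "speech-like" signal: two sinusoids plus a little noise
# (the noise keeps the normal equations well conditioned).
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
x += 0.01 * np.random.randn(x.size)

a = lpc(x, p=8)
# Prediction: x[n] is approximately a[0]*x[n-1] + a[1]*x[n-2] + ... + a[p-1]*x[n-p]
```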
Merits:
• LPC represents speech as a set of parameters, hence can be edited easily.
• Variations in frequency parameters are easy to make.
• Extremely accurate estimation of speech parameters.
• High speed of computation.
• Robust, reliable & accurate method.
Demerits:
• LPC models only "resonances", and therefore has difficulty describing anti-resonances,
i.e. nasals.
Formant synthesis
• Recreates the changing formants of speech, each one specified by different parameters that
are updated every 5 ms during an utterance.
• Received a big boost in 1980 with Dennis Klatt’s publication of an elaborate synthesizer
model, complete with a computer program which synthesized speech on a laboratory
computer.
Klatt’s Model
• The basis for this model is the source-filter theory.
• There are two sound sources:
1. Voicing
2. Friction
• The voicing source generates a train of impulses like that produced by the vocal folds.
The filters RGP, RGZ & RGS smooth this simulated glottal waveform & shape its
spectrum.
• AV controls the amplitude of voicing
• The source then enters the resonating system, in which RNP and RNZ represent the nasal
pole and nasal zero.
• R1 to R5 represent formants 1 through 5 (a resonator sketch follows below).
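The sketch below shows a standard second-order digital resonator of the kind used in Klatt-style cascade formant synthesis and applies three of them in series, R1 to R3, to an impulse-train voicing source; the sample rate, formant frequencies and bandwidths are illustrative values, not Klatt's tables.

```python
# One cascade branch in miniature: impulse-train source -> R1 -> R2 -> R3.
import numpy as np

def resonator(x, F, B, fs):
    """Second-order resonator (formant filter) with centre frequency F and bandwidth B."""
    T = 1.0 / fs
    c = -np.exp(-2 * np.pi * B * T)
    b = 2 * np.exp(-np.pi * B * T) * np.cos(2 * np.pi * F * T)
    a = 1 - b - c
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = a * x[n]
        if n >= 1:
            y[n] += b * y[n - 1]
        if n >= 2:
            y[n] += c * y[n - 2]
    return y

fs = 10000
source = np.zeros(fs // 2)          # 0.5 s of signal
source[::fs // 100] = 1.0           # 100 Hz impulse-train "voicing"

out = source
for F, B in [(500, 60), (1500, 90), (2500, 150)]:   # hypothetical F1-F3 and bandwidths (Hz)
    out = resonator(out, F, B, fs)
```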
Parallel system
• It models the production of fricatives, in which the noise source is higher in the vocal
tract, usually in the oral cavity, so that only the part of the vocal tract in front of the
source serves as the resonator.
• MOD provides for mixing the noise and voicing sources for voiced fricatives.
• LPF is a low pass filter which shapes the source spectrum.
• AH and AF control the amplitude of aspiration and friction respectively.
• Aspiration noise goes to the cascade resonators because it is generated at the larynx and,
like voicing, uses the entire vocal tract as a resonator.
• The noise source for fricatives goes through the parallel resonators, each with its own
amplitude control.
• The boxes labeled "first diff" are high-pass filters; the one at the output simulates
the emphasis given to higher frequencies as sound radiates from the lips (see the sketch below).
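For concreteness, a "first diff" block can be sketched as a simple first-difference (high-pass) filter, as below; this illustrates the principle rather than reproducing Klatt's exact implementation.

```python
# First difference: y[n] = x[n] - x[n-1], a simple high-pass filter that
# emphasizes higher frequencies (standing in here for lip radiation).
import numpy as np

def first_diff(x):
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - x[:-1]
    return y

print(first_diff(np.array([0.0, 1.0, 1.0, 0.0])))   # -> [ 0.  1.  0. -1.]
```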
Merits:
• Cascade synthesis yields the correct relative amplitudes of the formant peaks for vowels
without any individual amplitude controls for each formant (Fant, 1956).
• It has been useful in comparative research, such as studying the effect of changing one or
more parameters within a relatively small number of syllables.
Demerits:
• Difficulty in representing abrupt transitions.
• Setting parameters every 5 ms is not a practical way of meeting the commercial
needs for speech synthesis.
• Lack of variation in F0 is one of the sources of unnaturalness of the synthesized speech.
Synthesis by rule
Synthesis by rule encodes well-known bits of the phonology of English and other languages as
rules of thumb:
• In this system, the user types in the sequence of phonemes of an utterance.
• The synthesizer then starts from a table of default parameter values for each phoneme.
• It automatically tailors each of those values according to the context of each phoneme
(a small sketch follows this section).
Such a system can also serve as a reading machine.
• An example of such a synthesizer is DECtalk™. It takes ordinary spelling from a keyboard or
scanner and produces highly intelligible and reasonably natural speech as output.
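A minimal sketch of the synthesis-by-rule idea follows: start from a table of default parameter values per phoneme, then tailor each value to its neighbours. The phoneme labels, default targets and the single context rule are hypothetical, not DECtalk's actual tables.

```python
# Defaults per phoneme plus context-dependent adjustment rules.
DEFAULTS = {            # hypothetical default (F1, F2) targets in Hz
    "AA": (700, 1200),
    "IY": (300, 2300),
    "P":  (0, 0),       # closure: no formant targets
}

def apply_rules(phonemes):
    """Tailor each phoneme's default values to its context."""
    out = []
    for i, ph in enumerate(phonemes):
        f1, f2 = DEFAULTS[ph]
        prev = phonemes[i - 1] if i > 0 else None
        # Example rule: raise F2 of /AA/ when it follows the front vowel /IY/.
        if ph == "AA" and prev == "IY":
            f2 += 200
        out.append((ph, f1, f2))
    return out

print(apply_rules(["IY", "AA", "P"]))
```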
Merits:
• Logan, Greene and Pisoni (1989) studied the intelligibility of 10 synthesis-by-rule
systems and concluded that DECtalk yielded the lowest error rate and came closest to
natural speech.
• The procedure is easy.
Demerits:
• The frequency range of the system is limited to 5 kHz.
• Produces inappropriate intonation.
• Fewer variations in amplitude.
• Abrupt transitions are difficult to represent.
• Certain sounds are more aspirated than in normal speech; e.g. the /p/ in "speech" is
aspirated in the same way as the /p/ in "peach".
ANALYSIS BY SYNTHESIS
The heart of an analysis by synthesis system is a signal generator capable of
synthesizing all and only the signals to be analyzed.
The generated signals are compared with the signal to be analyzed and a measure of error
is calculated.
Different signals are generated until the one with the smallest error value is found.
• A true analysis-by-synthesis coder should synthesize all possible output speech signals
and identify the combination that gives the minimum error (see the sketch below).
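A minimal sketch of the analysis-by-synthesis loop is shown below: candidate signals are generated over a small parameter grid, each is compared with the target, and the parameters giving the smallest error are kept. The sine-wave "synthesizer" and the target are placeholders standing in for a real speech model.

```python
# Analysis by synthesis: search the parameter grid for the minimum-error candidate.
import numpy as np

fs = 8000
t = np.arange(fs // 10) / fs
target = np.sin(2 * np.pi * 440 * t)        # signal to be analyzed

best_err, best_f = np.inf, None
for f in range(100, 1000, 10):              # candidate parameter values
    candidate = np.sin(2 * np.pi * f * t)   # synthesize a candidate
    err = np.mean((candidate - target) ** 2)
    if err < best_err:
        best_err, best_f = err, f

print(best_f)   # parameter with the minimum error (here, 440 Hz)
```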
Advantages:
• Comparatively easy to use.
• Coarticulation patterns are better represented.
• Quality of speech is comparatively good.
Disadvantages:
• Abrupt transitions are difficult to represent.
• Frequency variation is difficult to represent.