
Methods to Study Speech Synthesis


METHODS OF SPEECH SYNTHESIS

Faculty: Dr. Ajith Kumar U
Presenter: Aman Kumar


INTRODUCTION
• The most important use of synthetic speech is to check our analyses of speech.
• The formant pattern is important to the production and comprehension of a particular speech sound.
• The perception of vowels and consonants depends on their distinctive acoustic cues.
• Vowels are periodic, high in energy, and long in duration.
• Consonants are complex and require multiple cues.
• Speech perception varies both within and across subjects (intra- and inter-subject variability).
• To understand these variables, speech needs to be synthesized by artificial means, which lets the researcher manipulate the specific acoustic cues of interest.
Speech synthesis
• Synthetic speech is speech produced by artificial means, especially speech generated by computers or computer-controlled devices.
• Speech synthesizers appear throughout daily life, e.g. railway station announcements, toys, customer care systems, and AAC devices.
• Synthesis is useful for understanding how speech is perceived.
• Modern speech synthesis can control and manipulate any feature of speech that is important for speech production.
• However, not all features of natural speech can be manipulated individually.
Methods used for studying speech perception are:

 Pattern Playback

 Articulatory synthesis

 Parametric Synthesis

 Analysis by synthesis
Pattern Playback
• The first real speech synthesizer appeared in 1951 at Haskins
Laboratories: the PATTERN PLAYBACK.
• It is essentially a sonagraph machine working in reverse.
• "The sonagraph transforms recorded speech into a 'three-dimensional' plot,
the first two dimensions being time and frequency, the third being
intensity, represented on a grey scale. Conversely, by drawing schematic
evolutions of formant frequencies on a glass plate, and by scanning this
spectrogram along the time axis (using a set of frequency-modulated light
beams and a light collector that is fed into a loudspeaker), one can
actually hear the sound corresponding to the spectrogram." Stella (1985).
• The input is thus a hand-drawn spectrogram instead of a printed one.
ARTICULATORY SYNTHESIS

Articulatory synthesis is also a parametric approach, one which attempts to model the
physical properties of the human vocal tract.
The goals of articulatory synthesis:
• Naturalness of the model.
• Accuracy of the model in comparison with the speaker(s) on which it is based.
• Intelligibility of the model.
• Understanding and information gained from the model.
 In this method of synthesizing speech, the speech articulators (e.g. jaw, tongue,
lips) are controlled.
 The natural speech production process is modeled as accurately as possible in
articulatory synthesis.
 This is done by creating an artificial or synthetic replica of human physiology and
making it produce speech.
• The vocal tract geometry is described in one, two, or three dimensions,
depending on the articulatory synthesizer.
• In a 1-D model, the vocal tract is represented directly by the area function.
• The area function describes the variation of the cross-sectional area of the vocal
tract tube between the glottis and the mouth opening.
• The advantage of the two- and three-dimensional models is that the position
and form of the articulators can be described in a very direct and specific manner.
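As an illustration of how a 1-D area function can be used, the sketch below computes the reflection coefficient at each junction between adjacent tube sections from the change in cross-sectional area (the Kelly-Lochbaum relation). The area values are invented for illustration, not taken from any measured vocal tract.

```python
# A minimal sketch of a 1-D area-function model: the vocal tract is a
# chain of uniform tube sections, and the reflection coefficient at each
# junction follows from the change in cross-sectional area
# (the Kelly-Lochbaum relation). The section areas are illustrative.

def reflection_coefficients(areas):
    """Reflection coefficient at each junction of adjacent tube sections.

    areas: cross-sectional areas (cm^2) listed from glottis to lips.
    """
    return [(a_next - a) / (a_next + a)
            for a, a_next in zip(areas, areas[1:])]

# Illustrative area function for a schwa-like, nearly uniform tract
areas = [2.0, 2.2, 2.1, 2.3, 2.0, 2.4, 2.5]
ks = reflection_coefficients(areas)
print(ks)  # small values: little reflection in a near-uniform tube
```

A near-uniform tube gives reflection coefficients close to zero; a strong constriction (a large jump in area) would produce a coefficient near ±1.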
ARTICULATORY MODEL (Coker & Fujimura 1966)

Seven parameters are used


 Position of tongue body (X,Y)
 Lip protrusion (L)
 Lip rounding (W)
 Place and degree of tongue tip constriction (R,B)
 Degree of velar coupling (N)
In summary, an articulatory synthesizer comprises at least the following three
parts:
1. A mechanism to control the parameters during an utterance
2. A geometric description of the vocal tract based on a set of articulatory parameters
3. A model for the acoustic simulation
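The three parts above can be sketched as a minimal pipeline. Everything here is a hypothetical placeholder: the parameter names echo the Coker and Fujimura list, but the mappings are toy stand-ins, not their model.

```python
# A hedged sketch of the three components of an articulatory synthesizer:
# (1) a control model yielding articulatory parameters over time,
# (2) a geometry stage mapping them to an area function, and
# (3) an acoustic simulation stage. All values and mappings are
# illustrative placeholders, not the Coker & Fujimura (1966) model.

def control_model(t):
    """1. Parameter control: articulatory parameters at time t (illustrative)."""
    return {"X": 0.5, "Y": 0.3, "L": 0.1, "W": 0.2, "R": 0.4, "B": 0.1, "N": 0.0}

def geometry(params, n_sections=8):
    """2. Geometric description: map parameters to a crude area function."""
    base = 2.0 + params["X"] - params["Y"]   # toy mapping, for illustration only
    return [base] * n_sections

def acoustic_simulation(areas):
    """3. Acoustic model: a bare stand-in that just reports the tube count."""
    return len(areas)

areas = geometry(control_model(0.0))
print(acoustic_simulation(areas))  # 8 tube sections
```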
Advantages:

• Coarticulation patterns are reflected

• Connected speech can be produced

• Nasal and oral sounds are represented


Disadvantages:

• Complex instrumentation and computation

• Practically impossible to create sounds with 100% accuracy.

• Difficult to use by those who are not well trained.


Applications
• Designed for studying the linguistically and perceptually significant
aspects of articulatory events.
• Speech sounds for use in perceptual tests can be generated through
controlled variations in timing or position parameters,
e.g. /banana/, /bandana/, /badnana/, /baddata/.
• One purpose of this device is to study vocal-tract anatomy and dynamic
behavior.
• Investigation of detailed relationships between velar control and the
perceptual oral–nasal distinction.
Parametric synthesis
• It’s a rule based way of synthesizing speech.
• Parametric synthesis makes the use of either acoustic information (time-domain and
frequency domain) or articulatory information.
• Parametric synthesis that depends upon the acoustic information is known as signal
based (bottom-up) synthesis as it specifies acoustic properties of speech such as
formants, duration of segments and type of noise for fricatives.
• The other name for signal based synthesis is terminal analog because it attempts
to produce an analog of the terminal (acoustic) level of speech and pay little or
no attention to articulatory aspects of speech.
• Articulatory synthesis is another parametric approach which attempts to model
the physical properties of the human vocal tract. (Top- down approach).
Three types of parametric synthesis:
 Linear predictive coding
 Formant synthesis
 Synthesis by rule
 Linear predictive coding (LPC)
• It is a class of methods used to obtain a spectrum.
• LPC comes from two sources:
1. A branch of statistics
2. A branch of engineering
LPC builds on the facts that:
• Any sample in digitized speech is partly predictable from its immediate predecessors.

• Speech does not vary wildly from sample to sample.

• Hence LPC is just the hypothesis that any sample is a linear function of those that
precede it.
• LPC parameterizes the speech signal; that is, it analyses the complex, constantly
changing speech signal into a few values called "parameters", which change relatively slowly.

• The parameters which represent the signal are the frequencies & bandwidths of a set of
filters which would produce that signal, given a certain excitation.

• Since the speech signal is represented as a set of parameters, one can edit these parameters.
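The core LPC hypothesis, that each sample is a linear function of its predecessors, can be sketched directly. Here the predictor coefficients are fitted by ordinary least squares on a synthetic damped sinusoid (a single resonance, which an order-2 predictor captures exactly); real LPC analysis typically uses the autocorrelation method instead.

```python
import numpy as np

# A minimal sketch of the LPC idea: each sample is modeled as a linear
# combination of its p predecessors, with the coefficients found by least
# squares. The test signal is a synthetic damped sinusoid, i.e. a single
# resonance, for which order p = 2 is exact.

def lpc(signal, p):
    """Fit p predictor coefficients a so that s[n] ~ sum_k a[k] * s[n-1-k]."""
    X = np.column_stack(
        [signal[p - k - 1: len(signal) - k - 1] for k in range(p)]
    )
    y = signal[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

n = np.arange(200)
s = 0.99 ** n * np.sin(2 * np.pi * 0.1 * n)   # one damped resonance
a = lpc(s, p=2)

# Predict each sample from its two predecessors and measure the error
pred = a[0] * s[1:-1] + a[1] * s[:-2]
err = np.max(np.abs(s[2:] - pred))
print(err)  # near zero: two coefficients capture one resonance exactly
```

The fitted coefficients encode the resonance itself (its frequency and damping), which is why LPC parameters change slowly even though the waveform oscillates rapidly.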
Merits:
• LPC represents speech as a set of parameters, hence can be edited easily.
• Variations in frequency parameters are easy to make.
• Extremely accurate estimation of Speech Parameters
• High speed of Computation
• Robust, reliable & accurate method

Demerits:
• LPC models only "resonances", and therefore has difficulty describing anti-
resonances, i.e. nasals.
 Formant synthesis
• Recreates the changing formants of speech, each specified by different
parameters that are updated every 5 ms during an utterance.

• Received a big boost in 1980 with Dennis Klatt’s publication of an elaborate synthesizer
model, complete with a computer program which synthesized speech on a laboratory
computer.
Klatt’s Model
• The basis for this model is the source-filter theory.
• There are 2 sound sources:
1.Voicing
2.Friction

These drive 2 resonating systems


1.Cascade resonator for vowels
2.Parallel resonator for fricatives
Cascade system
• In the cascade resonator, the output of the first formant resonator becomes the input to
the second formant resonator.

• The voicing source generates a train of impulses like those produced by the vocal folds.
The filters RGP, RGZ, and RGS smooth this simulated glottal waveform and shape its
spectrum.
• AV controls the amplitude of voicing.
• The source then enters the resonating system, in which RNP and RNZ represent the nasal
pole and nasal zero.
• R1 to R5 represent formants 1 through 5.
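A minimal sketch of the cascade idea: each formant is a second-order digital resonator, and the output of one resonator feeds the next, as with R1 to R5 above. The coefficient formulas follow the standard Klatt (1980) resonator; the formant frequencies, bandwidths, and sample rate are illustrative values, and the source here is a bare impulse train rather than a shaped glottal waveform.

```python
import math

# A small sketch of cascade formant synthesis: each formant is a digital
# second-order resonator, and the output of one feeds the next. The
# coefficient formulas follow the standard Klatt (1980) resonator;
# formant values are illustrative vowel-like settings.

FS = 10000  # sample rate (Hz), illustrative

def resonator(signal, f, bw):
    """Second-order resonator at centre frequency f (Hz) with bandwidth bw (Hz)."""
    c = -math.exp(-2 * math.pi * bw / FS)
    b = 2 * math.exp(-math.pi * bw / FS) * math.cos(2 * math.pi * f / FS)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Impulse train as a crude 100 Hz voicing source (no glottal shaping)
src = [1.0 if n % (FS // 100) == 0 else 0.0 for n in range(2000)]

# Cascade: the output of each formant resonator is the input to the next
y = src
for f, bw in [(700, 60), (1200, 90), (2600, 120)]:  # /a/-like formants
    y = resonator(y, f, bw)

print(len(y))  # a vowel-like waveform, one sample per input sample
```

Because the resonators are in series, the relative formant amplitudes emerge from the filtering itself, which is the property credited to cascade synthesis in the merits below.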
Parallel system
• It models the production of fricatives, in which the noise source is higher in the
vocal tract, usually in the oral cavity, and only the part of the vocal tract in front
of the source serves as the resonator.
• MOD provides for mixing the noise and voicing source for voiced fricatives.
• LPF is a low pass filter which shapes the source spectrum.
• AH and AF control the amplitude of aspiration and friction respectively.

• Aspiration noise goes to the cascade resonator because it is generated at the larynx
and, like voicing, uses the entire vocal tract as a resonator.

• The noise source for fricatives goes through the parallel resonators, each with its own
amplitude control.

• The boxes labeled "first diff" are high-pass filters; the one at the output simulates
the emphasis given to higher frequencies as sound radiates from the lips.
Merits:-
• Cascade synthesis yields the correct relative amplitudes of the formant peaks for
vowels without any individual amplitude controls for each formant (Fant, 1956).

• It has been useful in comparative research, such as studying the effect of changing
one or more parameters within a relatively small number of syllables.
Demerits
• Difficulty in representing abrupt transitions.
• Setting parameters every 5 ms is not a practical way of meeting commercial
needs for speech synthesis.
• Lack of variation in F0 is one of the sources of the unnaturalness of the speech.
 Synthesis by rule
A few rules of thumb, well-known bits of the phonology of English and other
languages, are:

 F0 declines slowly over an utterance.

 F0 declines rapidly at the end of a declarative sentence.

 Vowels are lengthened before voiced consonants.


• If we can quantify those rules, we can automate much of the parameter setting in
synthesis.

• In this system, the user types in the sequence of phonemes in an utterance.

• Synthesizer would then start with a table of default values for each phoneme

• It would automatically tailor each of those values according to the context of each
phoneme.
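A toy sketch of this scheme: start from a table of default values per phoneme, then let rules tailor them to context, here the vowel-lengthening and F0-declination rules mentioned above. The phoneme table and rule constants are invented for illustration and come from no real system.

```python
# A toy sketch of synthesis by rule: begin with default values per
# phoneme, then apply context rules such as "lengthen a vowel before a
# voiced consonant" and "let F0 decline over the utterance". The table
# and rule values are illustrative, not from any real synthesizer.

DEFAULTS = {
    "b":  {"dur_ms": 60,  "voiced": True,  "vowel": False},
    "ae": {"dur_ms": 120, "voiced": True,  "vowel": True},
    "d":  {"dur_ms": 60,  "voiced": True,  "vowel": False},
    "t":  {"dur_ms": 60,  "voiced": False, "vowel": False},
}

def apply_rules(phonemes, f0_start=120.0, f0_slope=-5.0):
    out = []
    for i, p in enumerate(phonemes):
        seg = dict(DEFAULTS[p])
        nxt = DEFAULTS.get(phonemes[i + 1]) if i + 1 < len(phonemes) else None
        # Rule: vowels are lengthened before voiced consonants
        if seg["vowel"] and nxt and nxt["voiced"] and not nxt["vowel"]:
            seg["dur_ms"] = int(seg["dur_ms"] * 1.3)
        # Rule: F0 declines slowly over the utterance
        seg["f0"] = f0_start + f0_slope * i
        out.append((p, seg))
    return out

print(apply_rules(["b", "ae", "d"]))  # "ae" lengthened to 156 ms
print(apply_rules(["b", "ae", "t"]))  # "ae" stays at its default 120 ms
```

The same default table produces different realizations in different contexts, which is the essence of automating the parameter setting.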
For reading ordinary text, a reading machine is required.

• An example of such a synthesizer is DECtalk™. It takes ordinary spelling from a keyboard or
scanner and produces highly intelligible and reasonably natural speech as output.
Merits:
• Logan, Greene, and Pisoni (1989) studied the intelligibility of 10 synthesis-by-rule
systems and concluded that DECtalk yielded the lowest error rate and was
closest to natural speech.
• The procedure is easy.
Demerits:
• The frequency range of the system is limited to 5 kHz .
• Produces inappropriate intonation.
• Fewer variations in amplitude.
• Abrupt transitions are difficult to represent.
• Certain sounds are more aspirated than in normal speech;
e.g. the /p/ in "speech" is aspirated in the same way as the /p/ in "peach".
ANALYSIS BY SYNTHESIS
The heart of an analysis by synthesis system is a signal generator capable of
synthesizing all and only the signals to be analyzed.

 Signals generated are compared with the signals to be analyzed and a measure of
error is calculated.

 Different signals are generated until the one with the smallest error value is found.
• A true analysis by synthesis coder should synthesize all possible output speech signals
and identify the combination that gives the minimum error.
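The search loop described above can be sketched with a deliberately tiny "synthesizer": candidate signals are generated over a parameter grid, each is compared with the target, and the parameters giving the smallest error win. A plain sinusoid generator stands in here for a full speech synthesizer.

```python
import math

# A minimal sketch of analysis by synthesis: generate candidate signals
# over a parameter grid, compare each with the target signal, and keep
# the parameter value with the smallest error. The "synthesizer" is a
# bare sinusoid generator standing in for a real speech synthesizer.

def synthesize(freq, n=200, fs=8000):
    """Generate n samples of a sinusoid at freq Hz (the toy synthesizer)."""
    return [math.sin(2 * math.pi * freq * t / fs) for t in range(n)]

def error(a, b):
    """Sum of squared differences between two signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

target = synthesize(440.0)  # the signal to be analyzed

# Search candidate frequencies for the smallest synthesis error
best = min(range(100, 1000, 20), key=lambda f: error(synthesize(f), target))
print(best)  # 440
```

A true analysis-by-synthesis coder does exactly this, but over all admissible synthesizer parameter combinations rather than a single frequency axis.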
Advantages:
•Comparatively easy to use
•Coarticulation patterns are better represented
•Quality of speech is comparatively good.

Disadvantages:
•Abrupt transitions are difficult to represent
•Frequency variation is difficult to represent
REFERENCES:

• Kent, R. D., & Read, C. (1995). The acoustic analysis of speech.
• Klatt, D. (1980). Software for a cascade/parallel formant synthesizer. Journal of the
Acoustical Society of America, 67, 971–995.
• Harrington, J., & Cassidy, S. (1999). Techniques in speech acoustics. Netherlands: Kluwer
Academic Publishers.
• Flanagan, J. L. (1972). Speech analysis, synthesis and perception.
• Flanagan, J. L., & Rabiner, L. R. (1973). Speech synthesis.
• Benesty, J., Sondhi, M. M., & Huang, Y. (2006). Handbook of speech processing.
