
Speech Chapter 4


MAI-5125

Deep Learning for Speech Recognition


(Introduction)
What is Sound?
• A sound signal is produced by variations in air pressure.
• We can measure the intensity of the pressure variations
and plot those measurements over time.
– Sound intensity is the power carried by sound waves per unit
area.

• Sound signals often repeat at regular intervals so that each
wave has the same shape.
• The height of the wave shows the intensity of the sound and is known
as the amplitude.
What is Sound? (Cont’d)

The time taken for the signal to complete one full wave is
the period.

The number of waves made by the signal in one second is
called the frequency.
The frequency is the reciprocal of the period. The unit of
frequency is Hertz.

f = 1 / T
What is Sound? (Cont’d)
• The majority of sounds we encounter may not
follow such simple and regular periodic
patterns.
• But signals of different frequencies can be
added together to create composite signals
with more complex repeating patterns.
• All sounds that we hear, including our own
human voice, consist of waveforms like these.
What is Sound? (Cont’d)
For instance, such a composite waveform could be the sound of a
musical instrument.

The human ear is able to differentiate between
different sounds based on the ‘quality’ of the
sound, which is also known as timbre.
Only acoustic waves that have frequencies lying between
about 20 Hz and 20 kHz, the audio frequency range, elicit an
auditory percept in humans.
How do we represent sound digitally?
• To digitize a sound wave we must turn the
signal into a series of numbers so that we can
input it into our models.

• This is done by measuring the amplitude of
the sound at fixed intervals of time.
How do we represent sound digitally?
• Each such measurement is called a sample,
and the sample rate is the number of samples
per second.

• For instance, a common sampling rate is
44,100 samples per second (44.1 kHz).

• That means that a 10-second music clip would
have 441,000 samples! (A quick check of this arithmetic is sketched below.)
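As a quick sanity check of that arithmetic, the short sketch below reads the sample rate and sample count from a WAV file using only Python's built-in wave module. 'clip.wav' is a hypothetical file name standing in for any uncompressed recording.

import wave

# Minimal sketch: read the sampling parameters of a (hypothetical) WAV file.
with wave.open("clip.wav", "rb") as wav:
    sample_rate = wav.getframerate()      # samples per second, e.g. 44100
    n_samples = wav.getnframes()          # total samples per channel
    duration = n_samples / sample_rate    # clip length in seconds

print(f"{sample_rate} Hz x {duration:.1f} s = {n_samples} samples")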
Preparing audio data for a deep learning model

• In the days before Deep Learning,
– machine learning applications of Computer Vision used to
rely on traditional image processing techniques to do
feature engineering.
– For instance, we would generate hand-crafted features
using algorithms to detect corners, edges, and faces.
– With NLP applications as well, we would rely on
techniques such as extracting N-grams and computing
Term Frequency.
Preparing audio data … (cont’d)
• Similarly, audio machine learning applications used to
depend on traditional digital signal processing
techniques to extract features.

• For instance, to understand human speech,
– audio signals could be analyzed using phonetics concepts
to extract elements like phonemes.

• All of this required a lot of domain-specific expertise
to solve these problems and tune the system for better
performance.
Preparing audio data … (cont’d)
• However, in recent years, as Deep Learning has become
more and more ubiquitous,
– it has seen tremendous success in handling audio as well.

• With deep learning,
– the traditional audio processing techniques are no longer
needed, and
– we can rely on standard data preparation without
requiring a lot of manual and custom generation of
features.
Preparing audio data … (cont’d)
• What is more interesting is that, with deep learning,
– we don’t actually deal with audio data in its raw form.
• Instead, the common approach used is
– to convert the audio data into images and then use a standard
CNN architecture to process those images!

• This is done by generating Spectrograms from the audio.

• So first let’s learn what a Spectrum is, and use that to
understand Spectrograms.
Spectrum
• As we discussed earlier, signals of different frequencies
can be added together to create composite signals,
– representing any sound that occurs in the real-world.

• This means that any signal consists of many distinct
frequencies and can be expressed as the sum of those
frequencies.
• The Spectrum is the set of frequencies that are
combined together to produce a signal.
– For example, the spectrum of a piece of music shows which
frequencies make up that recording.
Spectrum (cont’d)

• The Spectrum plots all of the frequencies that
are present in the signal along with the
strength or amplitude of each frequency.
Spectrum (cont’d)
• The lowest frequency in a signal is called the
fundamental frequency.

• Frequencies that are whole-number multiples of the
fundamental frequency are known as harmonics.

• For instance, if the fundamental frequency is 200 Hz, then
– its harmonic frequencies are 400 Hz, 600 Hz, and so on
(see the sketch below).
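To make this concrete, here is a small illustrative sketch (not taken from the lecture): it builds a composite signal from a 200 Hz fundamental plus two harmonics and recovers its spectrum with NumPy's Fourier transform. The sample rate and amplitudes are arbitrary choices.

import numpy as np

sr = 8000                                        # samples per second (arbitrary)
t = np.arange(sr) / sr                           # one second of time stamps
signal = (1.00 * np.sin(2 * np.pi * 200 * t)     # fundamental, 200 Hz
          + 0.50 * np.sin(2 * np.pi * 400 * t)   # first harmonic
          + 0.25 * np.sin(2 * np.pi * 600 * t))  # second harmonic

spectrum = np.abs(np.fft.rfft(signal))           # amplitude of each frequency
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)   # frequency value of each bin

# The three strongest bins come out at 600, 400 and 200 Hz.
print(freqs[np.argsort(spectrum)[-3:]])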
Time Domain vs Frequency Domain
• The waveforms that we saw earlier showing Amplitude
against Time are one way to represent a sound signal.
– Since the x-axis shows the range of time values of the
signal, we are viewing the signal in the Time Domain.

• The Spectrum is an alternate way to represent the
same signal.
• It shows Amplitude against Frequency, and
– since the x-axis shows the range of frequency values of the
signal, at a moment in time, we are viewing the signal in
the Frequency Domain.
Spectrograms
• Since a signal produces different sounds as it varies over
time, its constituent frequencies also vary with time.
• In other words, its Spectrum varies with time.
• A Spectrogram of a signal plots its Spectrum over time and
is like a ‘photograph’ of the signal.
• It plots Time on the x-axis and Frequency on the y-axis.
• It is as though we took the Spectrum again and again at
different instances in time, and then joined them all
together into a single plot.
Spectrograms (cont’d)
• It uses different colors to indicate the Amplitude
or strength of each frequency.

• The brighter the color the higher the energy of
the signal.

• Each vertical ‘slice’ of the Spectrogram is
essentially the Spectrum of the signal at that
instant in time and shows how the signal strength
is distributed in every frequency found in the
signal at that instant.
Spectrograms (cont’d)
• In the example below, the first picture displays the signal in
the Time domain.
– i.e. Amplitude vs Time.
• It gives us a sense of how loud or quiet a clip is at any point
in time, but it gives us very little information about which
frequencies are present.

• The second picture is the Spectrogram and displays the
signal in the Frequency domain.
Generating Spectrograms
• Spectrograms are produced using Fourier
Transforms to decompose any signal into its
constituent frequencies.

• We won’t actually need to recall all the
mathematics about Fourier Transforms,
– there are very convenient Python library functions
that can generate spectrograms for us in a single step
(see the sketch below).
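A minimal sketch of that single step, assuming the librosa library is installed; 'clip.wav' is again a hypothetical file path, and 80 mel bands is just an illustrative choice.

import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None)              # waveform and sample rate

# Short-time Fourier transform: a Spectrum computed over many short windows.
stft = librosa.stft(y)                                  # (freq bins, time frames)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# The mel spectrogram is the image-like variant most deep learning models use.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)           # ready to feed to a CNN
print(mel_db.shape)                                     # (80, number of time frames)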
Audio Deep Learning Models
• We realize that a Spectrogram is an equivalent
compact representation of an audio signal,
somewhat like a ‘fingerprint’ of the signal.
• It is an elegant way to capture the essential
features of audio data as an image.
Audio Deep Learning Models (cont’d)
• So most deep learning audio applications use Spectrograms to
represent audio.
• They usually follow a procedure like this:
– Start with raw audio data in the form of a wave file.
– Convert the audio data into its corresponding spectrogram.
– Optionally, use simple audio processing techniques to augment the
spectrogram data, as sketched after this list. (Some augmentation or
cleaning can also be done on the raw audio data before the spectrogram
conversion)
– Now that we have image data, we can use standard CNN architectures
to process them and extract feature maps that are an encoded
representation of the spectrogram image.
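The optional augmentation step is often done with SpecAugment-style masking. The sketch below assumes torchaudio is installed; 'mel_db' is a stand-in for a (mel bands x time frames) spectrogram such as the one computed earlier, and the mask sizes are arbitrary.

import torch
import torchaudio.transforms as T

mel_db = torch.randn(80, 400)                    # placeholder spectrogram

augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),      # blank out up to 15 adjacent mel bins
    T.TimeMasking(time_mask_param=35),           # blank out up to 35 adjacent time frames
)

augmented = augment(mel_db.unsqueeze(0))         # add a channel dim: (1, 80, 400)
print(augmented.shape)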
Audio Deep Learning Models (cont’d)
• The next step is to generate output predictions from this
encoded representation, depending on the problem that
you are trying to solve.

– For instance, for an audio classification problem, you
would pass this through a Classifier usually consisting
of some fully connected Linear layers (a sketch follows below).

– For a Speech-to-Text problem, you could pass it
through some RNN layers to extract text sentences
from this encoded representation.
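The sketch below, in PyTorch, illustrates the classification branch: a small CNN encoder over the spectrogram ‘image’ followed by a fully connected head. Layer sizes and the number of classes are arbitrary, not a reference design; for Speech-to-Text the head would be replaced by recurrent layers as noted above.

import torch
import torch.nn as nn

class SpectrogramClassifier(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # CNN encoder: treats the (1, mel bands, time frames) spectrogram as an image.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),                 # collapse feature maps to one vector
        )
        # Classifier head: fully connected Linear layer(s), as described above.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, spec):
        return self.head(self.encoder(spec))        # raw class logits

model = SpectrogramClassifier()
logits = model(torch.randn(4, 1, 80, 400))           # batch of 4 spectrogram "images"
print(logits.shape)                                  # torch.Size([4, 10])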
What problems does audio deep
learning solve?
• Audio data in day-to-day life can come in innumerable
forms such as

– human speech, music, animal voices, and other natural sounds
as well as man-made sounds from human activity such as cars
and machinery.

• Given the prevalence of sounds in our lives and the range
of sound types, it is not surprising that there are a vast
number of usage scenarios that require us to process and
analyze audio.
• Now that deep learning has come of age, it can be applied
to solve a number of use cases.
Audio Classification
• This is one of the most common use cases and
involves taking a sound and assigning it to one
of several classes.

• For instance, the task could be to identify the
type or source of the sound.

– For example, is this a car starting, a hammer, a
whistle, or a dog barking?
Audio Classification (cont’d)

• Obviously, the possible applications are vast.

• This could be applied to detect the failure of
machinery or equipment based on the sound that it
produces, or in a surveillance system, to detect
security break-ins.
Audio Separation and Segmentation
• Audio Separation involves isolating a signal of
interest from a mixture of signals so that it can
then be used for further processing.
• For instance, you might want to
– separate out individual people’s voices from a lot
of background noise, or
– the sound of the violin from the rest of the
musical performance.
Audio Separation and … (cont’d)

• Audio Segmentation is used to highlight relevant
sections from the audio stream.
• For instance, it could be used for diagnostic
purposes
– to detect the different sounds of the human heart and
detect anomalies.
Music Genre Classification and
Tagging
• With the popularity of music streaming services, another
common application that most of us are familiar with is
– to identify and categorize music based on the audio.
• The content of the music is analyzed to figure out the genre
to which it belongs.
• This is a multi-label classification problem because a given
piece of music might fall under more than one genre (see the
sketch below).
– For example: rock, pop, jazz, salsa, instrumental, as well as other
facets such as ‘oldies’, ‘female vocalist’, ‘happy’, ‘party music’
and so on.
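To illustrate what multi-label means in practice, the sketch below (assuming PyTorch, with an invented 256-dimensional audio embedding and a toy tag list) gives every tag its own independent sigmoid output, so a track can activate several tags at once, unlike softmax which forces a single choice.

import torch
import torch.nn as nn

tags = ["rock", "pop", "jazz", "salsa", "instrumental", "oldies", "party music"]

head = nn.Linear(256, len(tags))                        # one logit per tag
loss_fn = nn.BCEWithLogitsLoss()                        # independent binary decision per tag

embedding = torch.randn(8, 256)                         # batch of 8 track embeddings (dummy)
target = torch.randint(0, 2, (8, len(tags))).float()    # multi-hot tag labels (dummy)

loss = loss_fn(head(embedding), target)
probs = torch.sigmoid(head(embedding))                  # per-tag probabilities; may sum to > 1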
Music Genre Classification … (cont’d)

• Of course, in addition to the audio itself,
– there is metadata about the music such as singer, release date,
composer, lyrics and so on which would be used to add a rich set of
tags to music.
Music Genre Classification … (cont’d)
• This can be used
– to index music collections according to their audio features,
– to provide music recommendations based on a user’s preferences, or
– to search for and retrieve a song that is similar to a song to which
you are listening.
Music Generation and Music
Transcription
• We have seen a lot of news these days about
deep learning being used
– to programmatically generate extremely
authentic-looking pictures of faces and other
scenes, as well as being able

– to write grammatically correct and intelligent
letters or news articles.
Music Generation … (cont’d)

• Similarly, we are now able to generate synthetic music that
matches a particular genre, instrument, or even a given
composer’s style.

• In a way, Music Transcription applies this capability in reverse. It
takes an audio recording and annotates it, to create a music sheet
containing the musical notes that are present in the music.
Voice Recognition
• Technically this is also a classification problem
but deals with recognizing spoken sounds.

• It could be used to identify the gender of a
speaker, or their name
– (eg. is this Bill Gates or Tom Hanks, or is this
Ketan’s voice vs an intruder’s)
Voice Recognition (cont’d)

• We might want to detect human emotion and identify
the mood of the person from the tone of their voice
– eg. is the person happy, sad, angry, or stressed.

• We could apply this to animal voices to identify the type
of animal that is producing a sound, or potentially to
identify whether it is a gentle affectionate purring sound,
a threatening bark, or a frightened yelp.
Speech to Text and Text to Speech
• When dealing with human speech, we can go a step
further, and
– not just recognize the speaker, but understand what they are
saying.
• This involves extracting the words from the audio, in the
language in which it is spoken and transcribing it into text
sentences.

• This is one of the most challenging applications because it
deals not just with analyzing audio, but also with NLP and
requires developing some basic language capability to
decipher distinct words from the uttered sounds (one common
modeling approach is sketched below).
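One common modeling approach, sketched here under the assumption of PyTorch (not necessarily the architecture this course will use), feeds the spectrogram frames to a recurrent network and trains it with CTC loss, which aligns per-frame character probabilities with the target transcript. All sizes and the character set are illustrative.

import torch
import torch.nn as nn

n_mels, hidden, n_chars = 80, 128, 29            # 28 characters + 1 CTC "blank"

rnn = nn.GRU(input_size=n_mels, hidden_size=hidden,
             num_layers=2, bidirectional=True, batch_first=True)
to_chars = nn.Linear(2 * hidden, n_chars)        # per-frame character logits
ctc_loss = nn.CTCLoss(blank=0)

spec = torch.randn(4, 200, n_mels)               # 4 clips, 200 spectrogram frames each
frames, _ = rnn(spec)                            # (4, 200, 2 * hidden)
log_probs = to_chars(frames).log_softmax(-1)     # (4, 200, n_chars)

targets = torch.randint(1, n_chars, (4, 30))     # dummy transcripts, 30 characters each
loss = ctc_loss(log_probs.transpose(0, 1),       # CTC expects (time, batch, classes)
                targets,
                torch.full((4,), 200, dtype=torch.long),   # input lengths
                torch.full((4,), 30, dtype=torch.long))    # target lengths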
Speech to Text and Text to Speech
(cont’d)

• Conversely, with Speech Synthesis, one could
go in the other direction and take written text
and generate speech from it, using, for
instance, an artificial voice for conversational
agents.
Speech to Text and Text to Speech
(cont’d)
• Being able to understand human speech obviously
enables a huge number of useful applications both in
our business and personal lives, and we are only just
beginning to scratch the surface.

• The most well-known examples that have achieved
widespread use are virtual assistants
– like Alexa, Siri, Cortana, and Google Home, which are
consumer-friendly products built around this capability.
