Contents

Part I Introduction
1 Introduction
  1.1 Defining the Research
Part II Background
2 Neural Networks
  2.1 Recurrent Neural Networks
  2.2 Long Short-Term Memory
    2.2.1 Fundamentals
    2.2.2 Information Flow
    2.2.3 Algorithm Outline
      2.2.3.a Forward Pass
      2.2.3.b Backward Pass
      2.2.3.c Update Weights
3 Sound Processing
  3.1 Speech From A Human Perspective
    3.1.1 Speech Production
    3.1.2 Speech Interpretation
  3.2 Automatic Feature Extraction
    3.2.1 The Speech Signal
    3.2.2 Analyzing the Signal
      3.2.2.a Mel Frequency Cepstral Coefficients
Part III Experiment Setup
4 Model
5 Experiments
  5.1 Does size matter?
  5.2 Will the classifications be robust?
Part IV Results and Conclusions
6 Results
7 Discussion
  7.1 Discussion of Results
  7.2 Future Work
Part I
Introduction
Chapter 1
Introduction
There is a set of concepts to get acquainted with regarding speaker recognition. These concepts can be difficult to grasp at first because of their similarity to each other; the following paragraphs will try to explain their differences [4, 6]. Roughly, there are two phases involved in speaker recognition: enrollment and verification. In the enrollment phase speech is collected from speakers and features are extracted from it. In the second phase, verification, a speech sample is compared with the previously recorded speech to figure out who is speaking. How these two steps are carried out differs between applications. It is common to categorize applications as speaker identification and speaker verification. Identification tasks involve identifying an unknown speaker among a set of speakers, whereas verification involves trying to verify that the correct person is speaking. Identification is therefore the bigger challenge.
For the past decades, the most common methods used for recognizing human speech have been based on Hidden Markov Models (HMM) [1–3]. This is because of their proficiency in recognizing temporal patterns. Temporal patterns can be found in most real-life applications where not all data is present at the start but instead is revealed over time, sometimes with a very long time between important events.
Neural networks have become increasingly popular within this field in recent years due to advances in research. Compared to Feed Forward Neural Networks (FFNN), Recurrent Neural Networks (RNN) have the ability to perform better in tasks involving sequence modeling. Nonetheless, RNNs have historically been unable to recognize patterns over longer periods of time because of their gradient-based training algorithms. Usually they cannot connect output sequences to input sequences separated by more than 5 to 10 time steps [30]. The most commonly used recurrent training algorithms are Backpropagation Through Time (BPTT), as used by Rumelhart and McClelland [29], and Real Time Recurrent Learning (RTRL), used by Robinson and Fallside [28]. Even though they give very successful results for some applications, the limited ability to bridge input-output time gaps gives the trained networks difficulties when it comes to temporal information processing.
The designs of the training algorithms are such that previous outputs of the network become more or less significant with each time step. The reason for this is that errors get scaled in the backward pass by a multiple of the network nodes' activations and weights. Therefore, when error signals are propagated through the network they are likely to either vanish and get forgotten by the network, or blow up in proportion in just a few time steps. This flaw can lead to an oscillating behavior of the weights between nodes in a network. It can also make the weight updates so small that the network needs excessive training in order to find the patterns, if it ever does [21, 24]. In those cases it is impossible for the neural network to learn about patterns that repeat themselves with a long time lag between their occurrences.
The Long Short-Term Memory (LSTM) architecture was introduced by Hochreiter and Schmidhuber [21] to overcome this problem by enforcing a constant error flow through special memory cells. However, the very strength of the algorithm proved to also introduce some limitations, as pointed out by Gers et al. [13]. It could be shown that the standard LSTM algorithm, in some situations where it was presented with a continuous input stream, allowed memory cell states to grow indefinitely. These situations can either lead to blocking of errors input to the cell, or make the cell behave as a standard BPTT unit. Gers et al. [13] presented an improvement to the LSTM algorithm called forget gates. By the addition of this new gating unit to the memory block, memory cells were able to learn to reset themselves when their contents had served their purpose, hence solving the issue with indefinitely growing memory cell states.
Building upon LSTM with forget gates, Gers et al. [14] developed the algorithm further by adding so-called peephole connections. The peephole connections give the gating units within a memory block a direct connection to the memory cell, making them able to view its current internal state. The addition of these peephole connections proved to make it possible for the network to learn very precise timing between events. The algorithm was now robust and promising for use in real-world applications where timing is of the essence, for instance in speech- or music-related tasks.
Long Short-Term Memory has risen to the top among speech recognition algorithms, as indicated by several research papers [12, 16, 17, 30, 34]. It has not only been shown to outperform more commonly used algorithms, like Hidden Markov Models, but it has also directed research in this area towards more biologically inspired solutions [20]. Apart from the research made with LSTM within the speech recognition field, the algorithm's ability to learn precise timing has been tested in the area of music composition, by Coca et al. [7] and Eck and Schmidhuber [11], and in handwriting recognition [19], with very interesting results. The thought behind this thesis was inspired by the achievements described above.
The data sets used for training and testing were gathered from a set of audio books. These audio books were narrated by ten English-speaking adult males and contained studio-recorded, emotionally colored speech. The speaker identification system created was text-independent and tested on excerpts of speech from different books read by the same speakers. This was done so as to test the system's robustness.
LSTM is capable of both online and offline learning [20]. However, in this thesis the focus will be on online learning. Thus, the weights of the network will be updated at every time step during training. The network is trained using the LSTM learning algorithm proposed by Gers et al. [14]. Experiments regarding the parameters size, depth and network architecture, as well as the classification robustness, will be carried out within the scope of this thesis. These experiments constitute a path towards optimization of the system, and the results will be evaluated based on the classification error of the networks in use.
The first part of this thesis, Introduction, introduces the subject and defines the research. In the second part, Background, the fundamentals of neural networks and the LSTM architecture are outlined. This part also explains sound processing, and specifically MFCC extraction, in detail. The third part, Experiment Setup, describes which experiments were carried out during this research and how. The fourth part, Results and Conclusions, states the results from the experiments and the conclusions drawn from them.
Part II
Background
Chapter 2
Neural Networks
Figure 2.1: A simple RNN structure. The grey boxes show the boundary of
each layer. Nodes in the network are represented by the blue circles. The
arrows represent the connections between nodes. Recurrent connections are
marked with red color.
2.2.1 Fundamentals
In this part the fundamentals of the LSTM structure will be described along
with the importance of each element and how they work together.
Instead of the hidden nodes in a traditional RNN (see Figure 2.1), an LSTM RNN makes use of so-called memory blocks. The memory blocks are recurrently connected units that in themselves hold a network of units.
Figure 2.2: An LSTM memory block with one memory cell.
Inside these memory blocks is where the solution to the vanishing gradient problem lies. The memory blocks are made up of a memory cell, an input gate, an output gate and a forget gate, see Figure 2.2. The memory cell is the very core of the memory block, containing the information. To be able to preserve its state when no other input is present, the memory cell has a self-recurrent connection. The forget gate guards this self-recurrent connection. In this way it can be used to adaptively learn to discard the cell state when it has become obsolete. This is not only important to keep the network information up to date, but also because not resetting the cell states can, on some occasions with continuous input, make them grow indefinitely. This would defeat the purpose of LSTM [13]. The input gate determines what information to store in the cell, that is, it protects the cell from unwanted input. The output gate, on the other hand, decides what information should flow out of the memory cell and therefore prohibits unwanted flow of information in the network.
The cell's self-recurrent weight and the gating units together construct a constant error flow through the cell. This error flow is referred to as the Constant Error Carousel (CEC) [21]. The CEC is what makes LSTM networks able to bridge inputs to outputs with more than 1000 time steps in between them, thereby extending the long-range memory capacity a hundredfold compared to conventional RNNs. Having access to this long history of information is also the very reason that LSTM networks can solve problems that were earlier impossible for RNNs.
Incoming signals first get summed up and squashed through an input activation function. Traveling further towards the cell, the squashed signal gets scaled by the input gate. The scaling of the signal is the way the input gate can guard the cell state from interference by unwanted signals. So, to prohibit the signal from reaching the cell, the input gate simply multiplies the signal with a scaling factor of, or close to, zero. If the signal is let past the gate, the cell state gets updated. Similarly, the output from the memory cell gets scaled by the output gate in order to prohibit unnecessary information from disturbing other parts of the network. If the output signal is allowed through the output gate, it gets squashed through an output activation function before leaving the memory block.
In the event that an input signal is not let through to update the state of the memory cell, the cell state is preserved to the next time step by the cell's self-recurrent connection. The weight of the self-recurrent connection is 1, so usually nothing gets changed. However, the forget gate can interfere on this connection to scale the cell value to become more or less important. So, if the forget gate finds that the cell state has become obsolete, it can simply reset it by scaling the value on the self-recurrent connection with a factor close to zero.
All the gating units have so-called peephole connections through which they can access the cell state directly. This helps them learn to precisely time different events. The gating units also have connections to other gates, to themselves and to the block inputs and outputs. All this weighted information gets summed up and used to set the appropriate gate opening at every time step. This functionality is optimized in the training process of the network.
Table 2.1: Description of symbols

Symbol    Meaning
wij       Weight to unit i from unit j
τ         The time step at which a function is evaluated (if nothing else is stated)
xk(τ)     Input x to unit k
yk(τ)     Activation y of unit k
E(τ)      Output error of the network
tk(τ)     Target output t of unit k
ek(τ)     Error output e of unit k
εk(τ)     Backpropagated error ε to unit k
S         Input sequence used for training
N         The set of all units in the network that may be connected to other units, that is, all units whose activations are visible outside the memory block they belong to
C         The set of all cells
c         Suffix indicating a cell
ι         Suffix indicating an input gate
φ         Suffix indicating a forget gate
ω         Suffix indicating an output gate
sc        State s of cell c
f         The function squashing the gate activation
g         The function squashing the cell input
h         The function squashing the cell output
α         Learning rate
m         Momentum
Activation y of input gate ι:
\[ y_\iota = f(x_\iota) \tag{2.2} \]

Activation y of forget gate φ:
\[ y_\phi = f(x_\phi) \tag{2.4} \]

Input x to cell c:
\[ \forall c \in C, \quad x_c = \sum_{j \in N} w_{cj}\, y_j(\tau - 1) \tag{2.5} \]

State s of cell c:
\[ s_c = y_\phi\, s_c(\tau - 1) + y_\iota\, g(x_c) \tag{2.6} \]

Input x to output gate ω:
\[ x_\omega = \sum_{j \in N} w_{\omega j}\, y_j(\tau - 1) + \sum_{c \in C} w_{\omega c}\, s_c(\tau) \tag{2.7} \]

Activation y of output gate ω:
\[ y_\omega = f(x_\omega) \tag{2.8} \]

Output y of cell c:
\[ \forall c \in C, \quad y_c = y_\omega\, h(s_c) \tag{2.9} \]
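To make the forward pass concrete, the following is a minimal sketch in Python/NumPy of one time step for a layer of single-cell memory blocks, following equations (2.2)-(2.9). The weight-matrix layout, the function name lstm_forward_step and the net-input expressions for the input and forget gates (including their peephole terms) are illustrative assumptions in the spirit of Gers et al. [14], not the original implementation (which used RNNLIB); the squashing functions here are a logistic sigmoid and tanh, whereas the networks in this thesis used sigmoids ranging from -1 to 1.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward_step(x_prev, s_prev, W, f=sigmoid, g=np.tanh, h=np.tanh):
    """One forward step for a layer of single-cell memory blocks (a sketch).

    x_prev holds the activations y_j(tau-1) of all units feeding the layer,
    s_prev holds the previous cell states s_c(tau-1); W is a dict of weights.
    """
    # Net inputs to the gates; input and forget gates peek at s_c(tau-1)
    # through their peephole weights (an assumption, see lead-in).
    x_iota = W['iota'] @ x_prev + W['iota_peep'] * s_prev
    x_phi  = W['phi']  @ x_prev + W['phi_peep']  * s_prev
    y_iota = f(x_iota)                      # (2.2) input gate activation
    y_phi  = f(x_phi)                       # (2.4) forget gate activation

    x_c = W['cell'] @ x_prev                # (2.5) cell input
    s_c = y_phi * s_prev + y_iota * g(x_c)  # (2.6) new cell state

    # The output gate peeks at the updated state s_c(tau).
    x_omega = W['omega'] @ x_prev + W['omega_peep'] * s_c
    y_omega = f(x_omega)                    # (2.8) output gate activation
    y_c = y_omega * h(s_c)                  # (2.9) block output
    return y_c, s_c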
• First, reset all the partial derivatives, i.e. set their value to 0.
• Then calculate and feed the output errors backwards through the net,
starting from time τ1 .
The errors are propagated throughout the network using the standard BPTT algorithm. See the definitions below, where E(τ) is the output error of the net, tk(τ) is the target value for output unit k and ek(τ) is the error for unit k, all at time τ. εk(τ) is the backpropagation error of unit k at time τ.
Partial derivative δk(τ), defined as:
\[ \delta_k(\tau) := \frac{\partial E(\tau)}{\partial x_k} \]

Error of output unit k at time τ, ek(τ):
\[ e_k(\tau) = \begin{cases} y_k(\tau) - t_k(\tau) & k \in \text{output units} \\ 0 & \text{otherwise} \end{cases} \]

Initial backpropagation error of unit k at time τ1, εk(τ1):
\[ \epsilon_k(\tau_1) = e_k(\tau_1) \]

Backpropagation error of unit k at time τ − 1, εk(τ − 1):
\[ \epsilon_k(\tau - 1) = e_k(\tau - 1) + \sum_{j \in N} w_{jk}\, \delta_j(\tau) \]

Error of cell c, εc:
\[ \forall c \in C, \quad \epsilon_c = \sum_{j \in N} w_{jc}\, \delta_j(\tau + 1) \tag{2.10} \]

Partial derivative of the net's output error E with respect to the state s of cell c, ∂E/∂sc(τ):
\[ \frac{\partial E}{\partial s_c}(\tau) = \epsilon_c\, y_\omega\, h'(s_c) + \frac{\partial E}{\partial s_c}(\tau + 1)\, y_\phi(\tau + 1) + \delta_\iota(\tau + 1)\, w_{\iota c} + \delta_\phi(\tau + 1)\, w_{\phi c} + \delta_\omega\, w_{\omega c} \tag{2.12} \]

Partial error derivative of cell c, δc:
\[ \forall c \in C, \quad \delta_c = y_\iota\, g'(x_c)\, \frac{\partial E}{\partial s_c} \tag{2.13} \]

Partial error derivative of the forget gate φ, δφ:
\[ \delta_\phi = f'(x_\phi) \sum_{c \in C} \frac{\partial E}{\partial s_c}\, s_c(\tau - 1) \tag{2.14} \]
Now, calculate the partial derivative of the cumulative sequence error by summing all the derivatives.

Definition of the total error Etotal when the network is presented with the input sequence S:
\[ E_{total}(S) := \sum_{\tau = \tau_0}^{\tau_1} E(\tau) \]

Definition of the partial derivative of the cumulative sequence error, ∆ij(S):
\[ \Delta_{ij}(S) := \frac{\partial E_{total}(S)}{\partial w_{ij}} = \sum_{\tau = \tau_0 + 1}^{\tau_1} \delta_i(\tau)\, y_j(\tau - 1) \]
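The final step of the algorithm, updating the weights, is not spelled out above. Given the learning rate α and momentum m defined in Table 2.1, a standard gradient-descent update with momentum over the accumulated derivatives ∆ij(S) would look roughly as follows; this is a hedged sketch of the usual rule, not a transcription of the exact update used in this thesis, and the numeric values are placeholders.

import numpy as np

def update_weights(W, delta_W_prev, grad, alpha=1e-4, m=0.9):
    """Gradient descent with momentum over accumulated derivatives (a sketch).

    W, delta_W_prev and grad are dicts of arrays keyed by connection
    (e.g. 'cell', 'iota', ...); grad holds Delta_ij(S) for one sequence.
    alpha is the learning rate, m the momentum.
    """
    delta_W = {}
    for key in W:
        # New step = momentum times the previous step minus the scaled gradient.
        delta_W[key] = m * delta_W_prev[key] - alpha * grad[key]
        W[key] = W[key] + delta_W[key]
    return W, delta_W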
Chapter 3
Sound Processing
This chapter will describe how sound waves from speech can be processed so
that features can be extracted from them. The features of the sound waves
can later be presented to a neural network for speaker recognition.
Roughly, human speech production involves three processes:

• Respiration - where the lungs produce the energy needed, in the form of a stream of air.
• Phonation - where the larynx modifies the air stream to create voicing, or phonation.
• Articulation - where the vocal tract modulates the air stream via a set
of articulators.
Because all people are of different sizes and shapes, no two people's voices sound the same. The way our voices sound also varies depending on the people in our surroundings, as we tend to adapt to the people around us. What is more, our voices change as we grow and change our physical appearance. So, voices are highly personal, but not invariant.
When sound enters the ear it soon reaches the eardrum, or tympanic membrane, which is a cone-shaped membrane that picks up the vibrations created by the sound [36]. Higher and lower frequency sounds make the eardrum vibrate faster and slower respectively, whereas the amplitude of the sound makes the vibrations more or less dramatic. The vibrations are transferred through a set of bones into a structure called the bony labyrinth. The bony labyrinth holds a fluid that starts to move with the vibrations and thereby pushes towards two other membranes. In between these membranes there is a structure called the organ of Corti, which holds specialized auditory nerve cells, known as hair cells. As these membranes move, the hair cells inside the organ of Corti get stimulated and fire electrical impulses to the brain. Different hair cells get stimulated by different frequencies, and the higher the amplitude of the vibrations, the more easily the cells get excited.
The melody with which people speak, prosody, is dependent on language and dialect but also changes depending on, for example, the mood of the speaker. However, every person makes their own variations to it. They use a limited set of words and often say things in a similar way. All these things we learn and attach to the specific person. Because of all this contextual information we can, for instance, more or less easily distinguish one person's voice from another. Unfortunately, all the information about the context is usually not available when a computer tries to identify a speaker from her speech. Therefore automatic speaker recognition poses a tricky problem.

Figure 3.1: The spectrogram representation of the word "acting", pronounced by two different speakers.
The set of phonemes differs between languages, as all languages have their own set of words and specific ways to combine sounds into them.
How the phonemes are pronounced is highly individual. The intuitive way of thinking when beginning to analyze speech signals might be that they can easily be divided into a set of phonemes, and that each phoneme has a distinct start and ending that can be seen just by looking at the signal. Unfortunately, that is not the case. The analog nature of speech makes analyzing it more difficult. Phonemes tend to be interleaved with one another, and therefore there are usually no pauses of silence in between them. Some phonemes, such as /d/, /k/ and /t/, do produce a short silence before they are pronounced, though. This is because the airflow is completely blocked in the process of pronouncing them, so no air can be exhaled from the lungs and hence no sound is produced. This phenomenon can be seen in Figure 3.1.
Also, when we speak we tend to start sentences louder than we end them, as another example. Thus, because speech is highly variable by nature, temporal analysis methods are not used very often in real-life applications, and not in this thesis either.
The more often used technique for examining signals is spectral analysis. Using this method, the waveform itself is not analyzed, but instead its spectral representation. This opens up richer, more complex information to be extracted from the signal. For example, spectral analysis makes it possible to extract the parameters of the vocal tract. It is therefore very useful in speaker recognition applications, where the physical features of one's vocal tract are an essential part of what distinguishes one speaker from another. Furthermore, spectral analysis can be used to construct very robust classification of phonemes, because information that disturbs the valuable information in the signal can be disregarded. For example, excitation and emotional coloring of speech can be peeled off from the signal to leave only the information concerning the phoneme classification. Of course, the information regarding emotional coloring can be used for other purposes.

The facts presented regarding spectral analysis methods make them useful for extracting features for use in real-life applications. In comparison with temporal analysis, however, spectral analysis methods are computationally heavy, so the need for computational power is greater. Spectral analysis can also be sensitive to noise because of its dependency on the spectral form.
There are several commonly used spectral analysis methods to extract valu-
able features from speech signals. Within speaker recognition, Linear Pre-
diction Cepstral Coefficients and Mel Frequency Cepstral Coefficients have
proven to give the best results [23]. The features are used to create feature
vectors that will serve as input to a classification algorithm in speech/speaker
recognition applications. In this thesis the features will serve as input to a
bidirectional Long Short-Term Memory neural network.
The following is a short outline of the steps in the process of acquiring the Mel Frequency Cepstral Coefficients from a speech signal. The steps presented below will be described in more detail further on.

• Divide the signal into short, overlapping frames.
• Estimate the power spectrum of each frame by calculating its periodogram.
• Apply a mel-spaced filterbank to the power spectra and sum the energy in each filter.
• Take the logarithm of the filterbank energies.
• Take the Discrete Cosine Transform of the log filterbank energies and keep the lower coefficients.

The coefficients left are the ones that form the feature vectors exploited for classification purposes. Usually, features called Delta and Delta-Delta features are added to the feature vectors. These features are also known as differential and acceleration coefficients and are the first and second derivatives of the previously calculated coefficients.
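The delta (and delta-delta) features mentioned above are commonly computed with a regression over a few neighboring frames. Using the symbols di and ci±M defined in the symbol table further below (with M usually 2), a typical formula is d_i = Σ_{m=1..M} m (c_{i+m} − c_{i−m}) / (2 Σ_{m=1..M} m²). The sketch below assumes this standard formula; the exact variant used by HTK in this research may differ slightly.

import numpy as np

def delta_features(coeffs, M=2):
    """Compute delta coefficients d_i from static coefficients c_i (a sketch).

    coeffs: array of shape (num_frames, num_ceps).
    Uses the common regression formula with edge frames padded by repetition.
    """
    padded = np.pad(coeffs, ((M, M), (0, 0)), mode='edge')
    denom = 2 * sum(m * m for m in range(1, M + 1))
    deltas = np.zeros_like(coeffs, dtype=float)
    for m in range(1, M + 1):
        deltas += m * (padded[M + m:M + m + len(coeffs)] -
                       padded[M - m:M - m + len(coeffs)])
    return deltas / denom

Applying the function twice, as in delta_features(delta_features(mfcc)), would give the acceleration (delta-delta) coefficients.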
The first step in the process is to divide the signal into short frames. This is done because of the variable nature of the speech signal. To ease the classification process, the signal is therefore divided into time frames of 20-40 milliseconds, where the standard is 25 milliseconds. During this time period the signal is considered not to have changed that much, and therefore the signal will, for instance, not represent two spoken phonemes within this time window. The windows are set with a step of around 10 milliseconds between the starts of two consecutive windows, making them overlap a bit.
When the signal has been split up into frames, the power spectrum of each frame is estimated by calculating its periodogram. This is the process where it is examined which frequencies are present in every slice of the signal. Similar work is done by the hair cells inside the cochlea, in the organ of Corti in the human ear.
Figure 3.2: The Mel scale. Based on people's judgment, it was created by placing sounds of different pitch at what was perceived as equal melodic distance from each other.
Symbol    Meaning
N         Number of samples in one frame.
K         Number of discrete points in a Discrete Fourier Transform of a frame.
i         Indicates frame.
si(n)     Time domain signal si at sample n, in frame i.
Si(k)     Discretized signal Si at point k, in frame i.
h(n)      Analysis window h(n) at sample n.
Pi(k)     Periodogram estimate Pi at point k, in frame i.
di        Delta coefficient di of frame i.
ci±M      Static coefficient c of frame i ± M, where M is usually 2.
First the Discrete Fourier Transform (DFT) of each frame is determined:
\[ S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2\pi k n / N}, \qquad 1 \le k \le K \tag{3.1} \]
Now, the result from this is an estimate of the signal's power spectrum, from which the power of the frequencies present can be extracted.
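As a rough illustration of the framing and periodogram steps, the sketch below frames a signal into 25 ms windows with a 10 ms step, applies a Hamming window, and uses the common periodogram estimate P_i(k) = |S_i(k)|² / N. The choice of analysis window h(n) and the exact scaling are assumptions here, not necessarily what was used in this research.

import numpy as np

def periodogram_frames(signal, sample_rate, frame_ms=25, step_ms=10, nfft=512):
    """Split a signal into overlapping frames and estimate each frame's
    power spectrum with a periodogram (a sketch).

    Returns an array of shape (num_frames, nfft // 2 + 1).
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)         # samples between frame starts
    window = np.hamming(frame_len)                   # assumed analysis window h(n)
    num_frames = 1 + max(0, (len(signal) - frame_len) // step)
    power = np.empty((num_frames, nfft // 2 + 1))
    for i in range(num_frames):
        frame = signal[i * step:i * step + frame_len] * window
        spectrum = np.fft.rfft(frame, n=nfft)         # DFT of the windowed frame
        power[i] = np.abs(spectrum) ** 2 / frame_len  # periodogram estimate
    return power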
The next step in the process is to filter the frequencies of the periodogram, in other words to combine frequencies close to each other into groups of frequencies. This is done to correspond to limitations in the human hearing system. Humans are not very good at distinguishing frequencies in the near vicinity of each other. This is especially true for higher-frequency sounds. At lower frequencies we have a better ability to differentiate between sounds of similar frequency. To better simulate what actually can be perceived by the human ear, the frequencies are therefore grouped together. This also peels away unnecessary information from the signal and hence makes the analysis less computationally heavy.
The standard number of filters applied is 26, but it may vary between 20 and 40 filters. Once the periodogram is filtered, it is known how much energy is present in each of the different frequency groups, also referred to as filterbanks. The energy calculated to be present in each filterbank is then logarithmized to create a set of log filterbank energies. This is done because loudness is not perceived on a linear scale by the human ear. In general, to perceive a sound to be double the volume of another, the energy put into it has to be about eight times as high.
Figure 3.3: The mel-frequency filterbank applied to extract the perceivable frequencies of the sound wave.

The cepstral coefficients are finally acquired by taking the Discrete Cosine Transform (DCT) of the log filterbank energies. The calculation of the DCT is needed because the filterbanks are overlapping, see Figure 3.3, making the filterbank energies correlated with each other. Taking the DCT of the log filterbank energies decorrelates them so that they can be modeled with more ease. Out of the 20-40 coefficients acquired from the filterbanks, only the lower 12-13 are used in speech recognition applications. These are combined into a feature vector that can serve as input to, for instance, a neural network. The reason not to use all of the coefficients is that the higher coefficients have very little, or even a degrading, impact on the success rate of the recognition systems.
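Putting the remaining steps together, the sketch below builds a mel-spaced filterbank using the common conversion mel(f) = 2595 · log10(1 + f/700), applies it to the periodogram frames from the previous sketch, takes the logarithm, and keeps the lower DCT coefficients. The filter-construction details are the usual textbook ones and are assumptions rather than the exact HTK implementation used in this thesis.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power(power, sample_rate, nfft=512, n_filters=26, n_ceps=13):
    """Turn periodogram frames into MFCCs: mel filterbank -> log -> DCT (a sketch)."""
    # Filter centre frequencies equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    # Triangular, overlapping filters.
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, centre):
            fbank[j - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[j - 1, k] = (right - k) / max(right - centre, 1)

    energies = power @ fbank.T                        # energy in each filterbank
    energies = np.where(energies == 0, np.finfo(float).eps, energies)
    log_energies = np.log(energies)                   # log filterbank energies
    # The DCT decorrelates the filterbank energies; keep only the lower coefficients.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

Combined with the previous sketch, something like mfcc_from_power(periodogram_frames(signal, rate), rate) would yield the static coefficients, to which the delta and delta-delta features from the earlier sketch could then be appended.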
Part III
Experiment Setup
Chapter 4
Model
This chapter will describe how the experiments were implemented within this research and what parameters and aids were used in order to carry them out.
The data sets used in this research consisted of excerpts from audio books. 21 books, narrated by ten different people, were chosen as the research base. The first ten books were all randomly selected, one from each of the narrators. From each of these ten books, a ten-minute excerpt was randomly extracted to constitute the training data set. Thus, the total training data set consisted of 100 minutes of speech divided evenly among ten different speakers. Out of these 100 minutes, one minute of speech from every speaker was chosen at random to make up a validation set. The validation set was used to test whether improvement had been made throughout the training process. That way it could be determined early on in the training whether it was worth continuing with the same parameter setup. Time was of the essence. Though chosen at random, the validation set remained the same throughout the whole research, as did the training data set.
For the purpose of this thesis, three different data sets were used for testing the ability of the network. The test sets were completely set apart from the training set, so not a single frame existed in both the training set and any of the test sets. The first test set (1) was composed of five randomly selected one-minute excerpts from each of the ten books used in the training data set. Thus it consisted of 50 minutes of speech, spread evenly among the ten speakers.
The remaining two test sets were used to see if the network could actually recognize the speakers' voices in a slightly different context. The narrators were all the same, but the books were different from the ones used in the training set. The second test set (2) consisted of five randomly chosen one-minute excerpts from eight different books, narrated by eight of the ten speakers. In total, test set (2) consisted of 40 minutes of speech, spread evenly among eight speakers. The third test set (3) was the smallest one and consisted of five randomly selected one-minute excerpts from three of the speakers. Thus it was composed of 15 minutes of speech, spread evenly across three narrators. These excerpts were extracted from three books, different from the ones used in the other data sets. They were books that came from the same series as some of the ones used in the training set. In that sense it was thought that they would be more similar to the ones used for training. Therefore it was the author's belief that test set (3) might be less of a challenge for the network than test set (2), but still a bigger challenge than (1).
The narrators of the selected books were all adult males. It was thought that speakers of the same sex would be a greater challenge for the network, compared to doing the research with a research base of mixed female and male speakers. The language spoken in all of the audio books is English; however, some speakers use a British accent and some an American one. The excerpts contained emotionally colored speech. All the audio files used were studio recorded. Thus, they do not represent a real-life situation with regard to background noise, for example.
The features were extracted from the sound waves by processing a 25-millisecond window of the signal. This 25-millisecond window forms a frame. The window was then moved 10 milliseconds at a time until the end of the signal. Thus, the frames overlap each other to lessen the risk of information getting lost in the transition between frames. From every frame, 13 MFCC coefficients were extracted using a set of 26 filterbank channels. To better model the behavior of the signal, the differentials and accelerations of the MFCC coefficients were calculated. All these features were combined into a feature vector of size 39. The feature vectors served as input to the neural network.
The feature extraction was done using the Hidden Markov Model Toolkit (HTK) [35]. This library can be used on its own as speech recognition software, making use of Hidden Markov Models. However, only the tools regarding MFCC extraction were used during this research. Specifically, the tools HCOPY and HLIST were used to extract the features and aid in the creation of data sets.
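For reference, an HCOPY configuration along the lines below would produce 39-dimensional MFCC_0_D_A vectors (13 cepstra including the zeroth, plus deltas and accelerations) with 25 ms windows, 10 ms steps and 26 filterbank channels. The exact configuration used in this research is not reproduced here, so the parameter values below are plausible assumptions rather than the actual settings.

# Hypothetical HCopy configuration (values are assumptions)
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0_D_A    # 13 cepstra (incl. c0) + deltas + accelerations = 39
TARGETRATE   = 100000.0      # 10 ms frame step (in 100 ns units)
WINDOWSIZE   = 250000.0      # 25 ms analysis window
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26            # mel filterbank channels
NUMCEPS      = 12            # cepstral coefficients in addition to c0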
The difference between an ordinary RNN and an LSTM RNN lies within the hidden layer of the neural network. Ordinary hidden layer units are exchanged for LSTM memory blocks. The memory blocks consist of at least one memory cell, an input gate, an output gate and a forget gate. In this research, only one memory cell per memory block was used. The memory cell consisted of a linear unit, whereas the gates were made up of sigmoid units. The input and output squashing functions were also sigmoid functions. All of the sigmoid units ranged from -1 to 1. The activation of the gates controlled the input to, and output of, the memory cell via multiplicative units. So, for example, the memory cell's output was multiplied with the output gate's activation to give the final output of the memory block.
To create the neural network, a library called RNNLIB [18] was used. The
library was developed by Alex Graves, one of the main contributors to the
LSTM structure of today. It provided the necessary configuration possibili-
ties for the purpose of this thesis in a convenient way.
During the experiments the feature vectors were input to the network, one vector at a time. The feature vectors belonging to one of the audio files, from one speaker, were seen as one whole sequence that corresponded to one target. Thus, the network was trained so that every sequence of speech had one target speaker, and the network was used for sequence classification rather than frame-wise classification.
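As an illustration of the sequence-classification setup (rather than RNNLIB's actual internals), the sketch below feeds the 39-dimensional feature vectors of one audio file through a network one frame at a time and assigns the whole sequence to the speaker with the highest accumulated softmax output. The forward_step function and the way outputs are pooled over the sequence are assumptions made for the example.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sequence(feature_vectors, forward_step, num_speakers):
    """Classify a whole sequence of 39-dimensional frames as one speaker (a sketch).

    feature_vectors: iterable of frame features for one audio file.
    forward_step: hypothetical function mapping (frame, state) -> (output, state),
    standing in for the trained LSTM network.
    """
    state = None
    scores = np.zeros(num_speakers)
    for frame in feature_vectors:
        output, state = forward_step(frame, state)
        scores += softmax(output)          # accumulate per-frame speaker scores
    return int(np.argmax(scores))          # one target speaker per sequence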
Chapter 5
Experiments
This chapter will describe which experiments were carried out during this
thesis.
5.1 Does size matter?

It was the author's belief that a higher network complexity would better model a problem like this. However, it was thought that, at some point, increasing the network complexity would rather slow computations down than aid in making the classifications more accurate.
These experiments were executed with three data sets: one for training, one for validation and one for testing. All audio was excerpted from ten audio books. The training data set consisted of ten minutes of speech from each of ten different speakers, 100 minutes of speech in total. The validation set was composed of one minute of speech from each of the same speakers and was a subset of the training set. The test set consisted of five minutes of speech from each of the ten speakers, in total 50 minutes of speech from the same books as those in the training set. The training set and test set did not overlap, though.
Table 5.1: Network setups

Name        Hidden layers    Memory blocks per layer
LSTM2x5     2                5
LSTM2x25    2                25
LSTM2x50    2                50
LSTM4x5     4                5
LSTM4x25    4                25
LSTM4x50    4                50
LSTM6x5     6                5
LSTM6x25    6                25
LSTM6x50    6                50
5.2 Will the classifications be robust?

These experiments were executed with four data sets: one for training, one for validation and two for testing. All audio was excerpted from 21 audio books. The training data set was made up from ten audio books and consisted of ten minutes of speech from each of ten different speakers, 100 minutes of speech in total. The validation set was composed of one minute of speech from each of the same speakers and was a subset of the training set. Both the training and validation set in this experiment were similar to the ones in the previous experiments. However, this experiment was executed with two test sets that were different from those of the other experiments. This was meant to test the robustness of the network's ability to correctly classify who was speaking.
The first of the test sets used within this experiment consisted of five minutes of speech from each of eight of the ten speakers, in total 40 minutes of speech. These excerpts came from a set of books different from the ones used to build the training set. The books were written by other authors, but had a similar genre. This set was thought to be the greatest challenge to the network, as the books were in a completely different setting from those in the training set.
The second of the test sets used within this experiment consisted of five minutes of speech from each of three of the ten speakers, in total 15 minutes of speech. These excerpts came from a set of books different from the ones used to build the training set and the other test sets. The books were written by the same authors as those in the training set. They were in the same genre and even the same series of books. Therefore, it was thought that they would be quite similar to the books used in the training set. This set was thought to be a smaller challenge to the network, as the books had a similar setting to those in the training set.
For these experiments the network architecture found to give the best results in the previous experiments was used, LSTM4x25 (see Figure 6.5). It consisted of 39 input units, four hidden layers and a softmax output layer.
Part IV
Results and Conclusions
Chapter 6
Results
This chapter will state all the results from the experiments carried out within
the scope of this research.
There were ten speakers used for the speaker identification task. The training set contained ten minutes of speech from each speaker, divided into 5-minute excerpts. The validation set contained one 1-minute sample of speech from each speaker, taken from the training set. The test set consisted of 5 minutes of speech from each speaker, divided into 1-minute excerpts. None of the samples used in the test set were present in the set the networks were trained on; however, they were extracted from the same audio books. The experiments were discontinued when the improvement during training had stagnated or perfect results had been reached.
Figure 6.1: A graph of how the classification error changes during training of an LSTM network using six hidden layers consisting of five memory blocks each.
Increasing the network depth also came with excessive training times. Not only did the time to train a network on one epoch of training data increase, but also the number of epochs required to reach a good result. This was especially true for the training of the more complex six-hidden-layer networks, which took around ten hours to execute on a mid-range laptop computer. This is worth mentioning, as the networks with just two hidden layers were trained in under 15 minutes.
A look at Figures 6.4, 6.5 and 6.6 shows that the networks with a depth of four perform better than those with a depth of six. LSTM4x25, Figure 6.5, was able to identify all ten speakers correctly within the validation set after 100 epochs of training. However, the network's classification error on the test set stagnated at 6 percent. Thus 47 out of the 50 1-minute samples were correctly classified as one of the ten speakers by the LSTM4x25 network.
Figure 6.2: A graph of how the classification error changes during training of an LSTM network using six hidden layers consisting of 25 memory blocks each.
The LSTM4x50 network, Figure 6.6, made progress a lot faster than LSTM4x25, reaching a 0 percent classification error on the validation set after only 40 epochs of training. On the test set, however, the LSTM4x50 network reached a 12 percent classification error after 48 epochs of training, which can be considered a fairly good result. It corresponds to 44 correctly classified 1-minute samples out of 50.
Figure 6.3: A graph of how the classification error changes during training of an LSTM network using six hidden layers consisting of 50 memory blocks each.

The best-performing network setups, with regard to the validation set, turned out to be the ones using only two hidden layers, Figures 6.7, 6.8 and 6.9. They were also the network setups that needed the least amount of training. The LSTM2x5 network, Figure 6.7, achieved 100 percent correct classifications within the validation set after only 19 epochs of training. LSTM2x50 was not long after, as it reached this in 20 epochs. Nevertheless, none of these networks were able to reach perfect results on the test set, where they reached minimum classification errors of 8 and 24 percent respectively. These results correspond to 46 and 38 correctly classified speaker samples out of 50. The LSTM2x5 result can be seen as a good achievement.
The LSTM2x25 network, Figure 6.8, made the fastest progress of all network setups. However, its progress stagnated and never gave a classification error below 10 percent on the validation set, although this result was obtained after only nine epochs of training. The minimum classification error achieved on the test set was 14 percent. This corresponds to 43 out of 50 correct classifications on the 1-minute samples, which is fairly good.
Figure 6.4: A graph of how the classification error changes during training of an LSTM network using four hidden layers consisting of five memory blocks each.
In the first experiment the network was tested on a set of 40 1-minute excerpts divided among eight of the ten speakers present in the training set. This was thought to be the hardest of all tests, as the samples were extracted from books that were completely set apart from the ones utilized in the training set. The results of this experiment were extremely poor. Only two out of the 40 1-minute samples were correctly classified, giving a classification error of 95 percent.
Figure 6.5: A graph of how the classification error changes during training of an LSTM network using four hidden layers consisting of 25 memory blocks each.
In the second experiment it was tested whether the network could identify the speakers' voices within one additional data set. This time the set consisted of 15 1-minute samples excerpted from three books narrated by three of the speakers on which the system had been trained. These books were in the same series as the ones used for training and were therefore of the same genre and writing style. It was thus thought that this set would be less of a challenge than the previous one. Nevertheless, the results from this experiment were even worse than in the first. Not a single one of the 15 voice samples was recognized properly by the network, so the classification error was 100 percent. Thus it turned out that the network was not able to perform well in either of these tasks.
Figure 6.6: A graph of how the classification error changes during training of an LSTM network using four hidden layers consisting of 50 memory blocks each.
Figure 6.7: A graph of how the classification error changes during training of an LSTM network using two hidden layers consisting of five memory blocks each.
Figure 6.8: A graph of how the classification error changes during training of an LSTM network using two hidden layers consisting of 25 memory blocks each.
Figure 6.9: A graph of how the classification error changes during training of an LSTM network using two hidden layers consisting of 50 memory blocks each.
Chapter 7
Discussion
In this chapter the results of the research will be discussed and some con-
clusions will be drawn from the results. The chapter will also cover what
possible future work within this area could be.
The system was trained and tested against a database of ten speakers. The data sets were made up of excerpts from audio books; a total of 21 books were used within this research. The audio files were processed into Mel Frequency Cepstral Coefficients and their first and second derivatives so as to create feature vectors that were used as inputs to the neural network. Each vector contained 39 features that were extracted from the short-time power spectrum of the audio signals.
During the experiments it was investigated whether the size and depth of the neural network had any effect on its ability to identify speakers. Nine network setups were tested, making use of two, four and six hidden layers with 5, 25 or 50 memory blocks within each of them. It turned out that, within this application at least, a greater depth rather degrades the performance of the system than enhances it. The networks using six hidden layers all performed badly, independent of size, even though the one using 50 hidden blocks performed the least badly. Among the networks using four hidden layers, the smallest network did not perform well; however, the larger ones gave good results. In fact, the network using 25 hidden blocks gave the best results out of all network setups, achieving an identification rate of 94 percent on the test set and 100 percent on the validation set.
Overall, the network setups using only two hidden layers performed the best. Among these networks, the smaller ones proved to give better results, though. So, of the least deep networks, the smallest network performed the best; among the middle-depth networks the middle size gave the best results; and among the deepest networks, the largest size performed the best. Thus it seems that the size of the network does not on its own affect the performance of the system, but in conjunction with depth, it does. So, to create a well-functioning system one needs to match the size and the depth of the network. It is also the author's belief that it is important to match the complexity of the network with the complexity of the database and problem.
Another thing found during the experiments is that training time is heavily affected by the complexity of the network. The network complexity, i.e. the number of weights, is in turn mostly dependent on the depth of the network. The difference in training time between the least complex and the most complex network was around ten hours for the same number of training epochs. The general performance of the four- and six-layer networks was also lower than that of the less deep networks. Thus, if time is of the essence, it is not really worth going for a more complex network model. Anyhow, four hidden layers seems to be some kind of ceiling for sorting out a problem of this size.
The experiments carried out to test how well the network could recognize the speakers in a slightly different context did not achieve satisfactory results, to put it mildly. No more than five percent of the audio samples were classified properly, which obviously is not good enough. So, the robustness of the network's speaker recognition ability was low. Nevertheless, the network could identify the speakers with sound accuracy within the previous experiments, where training and test data sets also were different from each other, but built from the same recordings. It is the author's belief that the MFCCs' sensitivity to noise is the reason for this. Probably the background noise and "setting" of each recording affected the MFCCs too much for the network to be able to identify the same speaker in two different recordings.
To sum things up: the Bidirectional Long Short-Term Memory neural network algorithm proved to be useful also within the speaker recognition area. Without any aid in the identification process, a BLSTM network could, from MFCCs, identify speakers text-independently with 93 percent accuracy and text-dependently with 100 percent accuracy. This was done with a network using four hidden layers containing 25 memory blocks each. It should be noted that if the recordings are too different from each other, when it comes to background noise etc., the network will have big problems identifying speakers accurately. Therefore, if great accuracy is needed and the audio data is of diverse quality, this type of system alone may not be suitable for text-independent speaker identification without further speech processing.
Bibliography
[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term
dependencies with gradient descent is difficult. Neural Networks, IEEE
Transactions on, 5(2):157–166, 1994.
[6] Joseph P Campbell Jr. Speaker recognition: A tutorial. Proceedings of
the IEEE, 85(9):1437–1462, 1997.
[7] A.E. Coca, D.C. Correa, and Liang Zhao. Computer-aided music com-
position with lstm neural network and chaotic inspiration. In Neural
Networks (IJCNN), The 2013 International Joint Conference on, pages
1–7, Aug 2013. doi: 10.1109/IJCNN.2013.6706747.
[10] Nidhi Desai, Prof Kinnal Dhameliya, and Vijayendra Desai. Feature ex-
traction and classification techniques for speech recognition: A review.
International Journal of Emerging Technology and Advanced Engineer-
ing, 3(12), December 2013.
[11] Douglas Eck and Jürgen Schmidhuber. A first look at music composition
using lstm recurrent neural networks. Istituto Dalle Molle Di Studi Sull
Intelligenza Artificiale, 2002.
[12] F. Eyben, S. Petridis, B. Schuller, G. Tzimiropoulos, S. Zafeiriou, and
M. Pantic. Audiovisual classification of vocal outbursts in human con-
versation using long-short-term memory networks. In Acoustics, Speech
and Signal Processing (ICASSP), 2011 IEEE International Conference
on, pages 5844–5847, May 2011. doi: 10.1109/ICASSP.2011.5947690.
[13] F.A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: con-
tinual prediction with lstm. In Artificial Neural Networks, 1999. ICANN
99. Ninth International Conference on (Conf. Publ. No. 470), volume 2,
pages 850–855 vol.2, 1999. doi: 10.1049/cp:19991218.
[14] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber.
Learning precise timing with lstm recurrent networks. J. Mach.
Learn. Res., 3:115–143, March 2003. ISSN 1532-4435. doi:
10.1162/153244303768966139. URL http://dx.doi.org/10.1162/
153244303768966139.
[15] A. Graves and J. Schmidhuber. Framewise phoneme classification with
bidirectional lstm networks. In Neural Networks, 2005. IJCNN ’05.
Proceedings. 2005 IEEE International Joint Conference on, volume 4,
pages 2047–2052 vol. 4, July 2005. doi: 10.1109/IJCNN.2005.1556215.
[16] A. Graves, N. Jaitly, and A.-R. Mohamed. Hybrid speech recogni-
tion with deep bidirectional lstm. In Automatic Speech Recognition and
Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278, Dec
2013. doi: 10.1109/ASRU.2013.6707742.
[17] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with
deep recurrent neural networks. In Acoustics, Speech and Signal Process-
ing (ICASSP), 2013 IEEE International Conference on, pages 6645–
6649, May 2013. doi: 10.1109/ICASSP.2013.6638947.
[18] Alex Graves. Rnnlib: A recurrent neural network library for sequence
learning problems. http://sourceforge.net/projects/rnnl/.
[19] Alex Graves and Jürgen Schmidhuber. Offline handwriting recogni-
tion with multidimensional recurrent neural networks. In D. Koller,
D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural
Information Processing Systems 21, pages 545–552. MIT Press, 2009.
[20] Alex Graves, Douglas Eck, Nicole Beringer, and Jürgen Schmidhuber.
Biologically plausible speech recognition with lstm neural nets. In in
Proc. of Bio-ADIT, pages 127–136, 2004.
[21] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[22] Manish P Kesarkar. Feature extraction for speech recognition. Elec-
tronic Systems, EE. Dept., IIT Bombay, 2003.
[23] S. Chougule M. Chavan. Speaker features and recognition techniques: A
review. International Journal Of Computational Engineering Research,
May-June 2012(3):720–728, 2012.
[24] Derek Monner and James A Reggia. A generalized lstm-like training
algorithm for second-order recurrent neural networks. Neural Networks,
25:70–83, 2012.
[25] Climent Nadeu, Dušan Macho, and Javier Hernando. Time and fre-
quency filtering of filter-bank energies for robust hmm speech recogni-
tion. Speech Communication, 34(1):93–114, 2001.
[26] Tuija Niemi-Laitinen, Juhani Saastamoinen, Tomi Kinnunen, and Pasi
Fränti. Applying mfcc-based automatic speaker recognition to gsm and
forensic data. In Proc. Second Baltic Conf. on Human Language Tech-
nologies (HLT’2005), Tallinn, Estonia, pages 317–322, 2005.
[27] Lawrence J Raphael, Gloria J Borden, and Katherine S Harris. Speech
science primer: Physiology, acoustics, and perception of speech. Lippin-
cott Williams & Wilkins, 2007.
[28] A. J. Robinson and F. Fallside. The utility driven dynamic error prop-
agation network. Technical report, 1987.
[29] D. Rumelhart and J. McClelland. Parallel Distributed Processing: Ex-
plorations in the Microstructure of Cognition: Foundations. MIT Press,
1987.
[30] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term
memory based recurrent neural network architectures for large vocabu-
lary speech recognition. arXiv preprint arXiv:1402.1128, 2014.
[31] Urmila Shrawankar and Vilas M Thakare. Techniques for feature extrac-
tion in speech recognition system: A comparative study. arXiv preprint
arXiv:1305.1145, 2013.
[32] Stanley S Stevens and John Volkmann. The relation of pitch to fre-
quency: a revised scale. The American Journal of Psychology, 1940.
[33] Tharmarajah Thiruvaran, Eliathamby Ambikairajah, and Julien Epps.
Fm features for automatic forensic speaker recognition. In INTER-
SPEECH, pages 1497–1500, 2008.
[34] M. Wollmer, F. Eyben, J. Keshet, A. Graves, B. Schuller, and G. Rigoll.
Robust discriminative keyword spotting for emotionally colored sponta-
neous speech using bidirectional lstm networks. In Acoustics, Speech and
Signal Processing, 2009. ICASSP 2009. IEEE International Conference
on, pages 3949–3952, April 2009. doi: 10.1109/ICASSP.2009.4960492.
[35] Steve J Young and Sj Young. The HTK hidden Markov model toolkit:
Design and philosophy. Citeseer, 1993.
[36] Willard R Zemlin. Speech and hearing science, anatomy and physiology.
1968.