Computer Engineering
Bachelor thesis
The Neural Network models were built using the Keras API together with the TensorFlow
library. There are different types of Neural Network architectures. The architecture types
investigated in this project were Residual Neural Network, Visual Geometry Group 16,
Inception V3 and RCNN (Recurrent Neural Network). ImageNet weights were used to
initialize the weights of the Neural Network base models. The ImageNet weights are
provided by the Keras API and are optimized for each base model [2]. The base models
use the ImageNet weights when extracting features from the input data.
The feature extraction using ImageNet weights or random weights together with the base
models showed promising results. Both the Deep Learning approach using dense layers and
the LSTM spatio-temporal sequence prediction were implemented successfully.
Sammanfattning
The AMI Meeting Corpus (AMI) database is used to investigate group activity recognition. The
AMI Meeting Corpus provides researchers with scenario meetings and natural meetings in an
office environment; the meeting scenario takes place in an office room for four people. To
achieve group activity recognition, image sequences from videos and two-dimensional audio
spectrograms from the AMI database were used. The image sequences are RGB-colored images
and the audio spectrograms have one color channel. The image sequences were produced in
batches so that temporal features could be evaluated together with the audio spectrograms. It
has been shown that including temporal features, both during model training and when
predicting the behavior of an activity, increases the validation accuracy compared to models
that only use spatial features [1]. Deep Learning architectures have been implemented to
recognize different human activities in the AMI office environment using data extracted from
the AMI database.
The Neural Network models were built using the Keras API together with the TensorFlow
library. There are different types of Neural Network architectures. The architectures
investigated in this project were Residual Neural Network, Visual Geometry Group 16,
Inception V3 and RCNN (LSTM). ImageNet weights were used to initialize the weights of the
Neural Network base models. The ImageNet weights are provided by the Keras API and are
optimized for each base model [2]. The base models use the ImageNet weights when extracting
features from the input data.
The feature extraction using ImageNet weights or random weights together with the base
models showed promising results. Both the Deep Learning approach using dense layers and the
LSTM spatio-temporal sequence prediction were implemented successfully.
Acknowledgement
The realization of this thesis would not have been possible without the support of our
supervisor Radu-Casian Mihailescu at the Internet of Things and People Research Centre
at Malmö University. His support with ideas and possible solutions to the problems we
faced while producing the results and writing this thesis has been very important. His
enthusiasm and guidance helped us throughout the whole process, and his insight into Neural
Networks was crucial.
Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Activity recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 The AMI Corpus dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Audio and video signals . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Theoretical background 5
2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Artificial Neural Network (ANN) . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Long-Short Term Memory . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . 9
2.4 Deep Learning Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 VGG16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.2 Inception v3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.3 Residual Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Related work 12
3.1 An investigation of transfer learning for deep architectures in group activity
recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.4 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Towards Robust Human Activity Recognition from RGB Video Stream with
Limited Labeled Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Deep Residual Learning for Image Recognition . . . . . . . . . . . . . . . . 14
3.3.1 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Robust Audio Sensing with Multi-Sound Classification . . . . . . . . . . . . 15
3.4.1 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Method 16
4.1 Construct a Conceptual Framework . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Develop a System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 Google Colaboratory . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Analyze and Design the System . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.1 Cifar-10 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.2 Cats Vs. Dogs dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Feature extraction by using transfer learning . . . . . . . . . . . . . . . . . . 17
4.4.1 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4.2 Parameters during model training . . . . . . . . . . . . . . . . . . . . 19
4.5 Build the (prototype) System . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5.1 Frame extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5.2 Video dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.5.3 Audio dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.5.4 Video and Audio combined . . . . . . . . . . . . . . . . . . . . . . . 23
4.6 Observe and Evaluate the System . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Results 25
5.1 Cats vs Dogs dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.1 Implementing the system architecture . . . . . . . . . . . . . . . . . 25
5.2 Cifar-10 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2.1 Benchmarking the system architecture on Cifar-10 dataset . . . . . . 27
5.3 AMI Corpus Video and Audio . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.1 Video dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.2 Audio dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.3 LSTM: Audio and video sequence . . . . . . . . . . . . . . . . . . . . 32
5.4 Looking deeper into ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4.1 Audio: Mixed audio ES and IS meetings . . . . . . . . . . . . . . . . 33
5.4.2 Audio: Mixed ES and IS meetings with random weight initialization 35
5.4.3 Audio: Edinburgh Scenario meetings LSTM audio sequence . . . . . 36
5.4.4 Audio: Edinburgh Scenario meetings LSTM audio sequence with
random weight initialization . . . . . . . . . . . . . . . . . . . . . . . 37
6 Discussion 38
6.1 Shuffling the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2 Audio data compared to video data . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 ResNet compared to VGG16 and InceptionV3 . . . . . . . . . . . . . . . . . 39
6.4 ResNets with random weights . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.5 LSTM results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.6 Parameter optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.7 Model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Abbreviations
1 Introduction
In this chapter, we introduce the concepts of using Deep Learning models for the recognition
of human activities. The chapter also presents the thesis's research questions and aim.
1.1 Background
Neural Networks have successfully been applied to classification problems and have the
capability to solve non-linear problems [3].
Deep learning, a branch of Neural Networks, is a technique that emulates the information
processing of the human brain and has contributed to a breakthrough in object recognition
in image data [4]. Since this breakthrough, Deep Learning has been adopted as an approach
to object recognition in image data. One of the next big challenges in computer vision
regarding video sequence processing is to allow computers not only to recognize objects in a
video, but also to recognize human activities [5] occurring in the video, both during playback
and in live streams. Human activity recognition includes a single human activity, an
interactive activity between a human and an object, or a group activity involving two or
more humans.
1.2 Activity recognition

Group activities are more difficult to classify compared to objects because of the diverse
possible ways that group activities can be carried out.
Activity recognition can be used for surveillance [6], human-machine interaction, multimedia
retrieval, determining human emotional states [7], and smart environments, for example at
home in the kitchen [8] or at the office. Activity recognition can also be used in smart homes
based on IoT solutions [9]. In [10] it is stated that recognizing people's activities from
different views is a difficult task, because research done on activity recognition is usually
view dependent, i.e. not invariant to the view angle; a recognizer trained on one view angle
therefore only works for that view angle, as described in [10]. Implementing a surveillance
system using computer vision that can recognize human activity is also important, as it would
free up the human resources needed to constantly monitor a video feed for certain human
activities [6].
Recognizing group activities in an office environment has been shown to reach a higher
validation accuracy when spatio-temporal features are incorporated [1], compared to only
using spatial features. By considering a short time sequence, the group activity becomes
easier to recognize for the Deep Learning model used in [1]. For example, the temporal
domain makes it possible to recognize whether a human is about to sit down in a chair or is
getting up from it.
1.3 The AMI Corpus dataset
The AMI Corpus dataset is the result of a European-funded project that aims to improve the
effectiveness of meetings1. The AMI Meeting Corpus consists of 100 hours of meeting
recordings. The meetings are annotated with, for example, topics (presentation, discussion
and closing), decisions, and intense discussions. The data can be used for different purposes,
for example linguistics and social psychology, but in this thesis the meetings are specifically
used for video and audio processing in order to extract features.
The AMI Corpus dataset consists of 10-60 minute long synchronized video and audio sequences.
These sequences take place in an office environment where the participants hold a meeting.
Not all meetings are scenario meetings, in which the participants role-play a design project;
some are natural meetings. Both kinds were recorded in a similar manner, so no distinction
was made between scenario meetings and natural meetings. The participants are not always
the same people, but there are always four people present, each given one of four roles:
the project manager (PM) who runs the meetings, the marketing expert (ME), the user
interface designer (UI) and the industrial designer (ID). The different characteristics of
each role result in a variety of behaviors. The participants had no prior professional
training or experience in their roles, so their behavior was expected to differ from that of
expert designers. The decision to use participants with no experience was based on economic
and logistical difficulties. Participants would be affected by past experience, so this was
taken into consideration in order to produce replicable behavior. Randomly assigning the
roles resulted in participants being unhappy with roles that did not fit them, and those
teams performed poorly; instead, the participants were asked who wanted to do what. The
participants were given training at the beginning of the task and were each assigned a
personal coach, who gave hints by e-mail on how to do their job. The disadvantages of
role-playing were taken into consideration: for example, there is no guarantee that the
participants care enough for the data to be comparable to natural interactions. However,
no natural meeting data was used because no annotations were available. The AMI Corpus
group had past experience with similar team tasks, which suggested that the approach
described for the AMI Corpus dataset would result in behavior that generalizes well to
real groups.
1.3.1 Audio and video signals

There are differences in how the audio and video signals were recorded depending on the
room. The different rooms are the Edinburgh Room, Idiap Room, and TNO Room. The
details about the video and audio setup and recording in these rooms can be read on the
AMI Corpus website2. The important details have been taken into account and are presented
in this thesis. The videos used in this thesis are those recorded with a camera angle that
shows the whole room. The camera angles used are the overhead view (top view from above)
and the corner view (camera positioned at the top of a corner), which can be seen in
figure 1.
1 http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml
2 http://groups.inf.ed.ac.uk/ami/corpus/
Figure 1: (a) Corner camera angle; (b) Overhead camera angle.
Audio was used from far-field microphones and video from room-view cameras. The meetings
were recorded in English and include mostly non-native speakers. The audio had a sample
rate of 16 kHz and was saved as WAV files. At each time frame, an omnidirectional microphone
was used to sample the sound with the most energy. An automatic energy threshold derived
from a simple energy-based technique [11] was applied to classify each frame as speech or
silence. The speech and silence segments were smoothed with a low-pass filter.
The scenario meetings range from 700 seconds to 3400 seconds. The video signals from
Edinburgh and Idiap were stored on disk using the DivX AVI codec 5.2.1. The encoding bit
rate was 23 000 Kbps with a maximum interval of 25 frames between two consecutive MPEG
keyframes, in order to reduce redundant information. The image resolution of the videos is
high (720x576), sufficient for person location and facial feature analysis.
Special hardware is used to provide synchronization signals. The recordings use a range of
signals synchronized to a common timeline. The audio and video signals are therefore
synchronized, which makes it possible to combine both signals and thereby answer the
research questions.
1.4 Problem statement

There are difficulties in utilizing computer vision to recognize human activities in everyday
environments. These difficulties include different background lighting, bad angles and lack
of information. In this project, we specifically look into the office environment and try to
classify a few human activities based on video data and/or audio data. The audio data, or
the audio and video data combined, might provide a perspective that could not be achieved
by using video data alone.
The goal of this work is to explore the feasibility of deploying a Deep Learning approach
based on ResNet for group activity recognition, especially in the context of an office
environment including four humans. ResNet was chosen because it is state-of-the-art within
image classification and has been used to win contests [12].
1.5 Limitations
This thesis is limited to the deep learning models ResNet, VGG16 and Inception V3.
We are also limited to one specific set of video recordings, namely the AMI Meeting Corpus
ES and IS video recordings.
2 Theoretical background
This chapter explains the following concepts: Machine Learning, Artificial Neural
Networks (ANN), Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Deep
Learning (DL), Residual Network (ResNet) and Visual Geometry Group (VGG).
2.1 Machine Learning

Machine learning is a field of study that aims to give computers the ability to learn without
being explicitly programmed [13]. Machine learning algorithms build models from sample
data, known as "training data", which are used to make predictions or decisions without being
explicitly programmed to perform the task.
In machine learning, there are different types of tasks, namely supervised and unsupervised
learning. The majority of practical machine learning uses supervised learning. Both types
have an input variable (X), but only supervised learning has a corresponding output variable (Y).
Supervised learning uses an algorithm to learn the mapping function from the input to the
output [13].
Y = f (X) (1)
The goal of supervised learning is to be able to approximate the mapping function so well
that whenever you have new input data (X) the model can predict the output (Y) for that
data.
Since unsupervised learning does not have any output variable (Y), the goal instead be-
comes to model the underlying structure or distribution in the data in order to learn more
about the data [14].
2.2 Artificial Neural Network (ANN)

ANNs are a type of machine learning architecture inspired by observations of how biological
neurons operate; they try to mimic how the human brain operates. An ANN by itself is not an
algorithm, but rather a framework used by numerous different machine learning algorithms.
The ANN architecture consists of layers, neurons, and weights [13].
2.2.1 Multilayer Perceptron

The Multilayer Perceptron (MLP) is a type of ANN architecture that has one or more
hidden layers. There are three different layer types: the input layer, hidden layers, and the
output layer. Hidden layers exist between the input layer and the output layer. Every layer
has one or more neurons. The input layer and each hidden layer also have a bias neuron,
which has a constant input value [13].
The information is propagated forward through the ANN by using weights and transfer
functions. The output from one neuron in a layer l becomes the input for a neuron in the
next layer l + 1; these networks are called feed-forward NNs. The weight ω_{ij}^l is a
connection between a neuron n_i^l and a neuron n_j^{l+1}. The output n_i^l is multiplied
with the weight ω_{ij}^l and becomes an input for the neuron n_j^{l+1}, which can be seen
in figure 2. All inputs to the neuron n_j^{l+1} are summed [13]:

    \nu_j^{l+1} = \sum_{i=0}^{n} n_i^l \, \omega_{ij}^l ,    (2)

where n_i^l is the output of neuron i in layer l, ω_{ij}^l is the neuron weight and
ν_j^{l+1} is the summed input.
The summed input ν_j^{l+1} is fed to the transfer function. The output from the transfer
function becomes the new output of the neuron n_j^{l+1}. This new output is sent as an input
to the next neuron in the next layer, and the process repeats until the output layer is reached.
The transfer function is a non-linear function that is differentiable. The Rectifier transfer
function is a non-linear transfer function that has shown promising results when used in
the domains of computer vision, speech recognition and deep learning [13]. The Rectifier
transfer function:

    \Phi(\nu) = \begin{cases} 0, & \nu \le 0 \\ a\nu, & \nu > 0 \end{cases}    (4)
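As a minimal illustration of equations (2) and (4), the following NumPy sketch (not taken
from the thesis code) computes the summed inputs of one layer and applies the Rectifier
transfer function; the layer sizes and weight values are arbitrary assumptions, and the bias
neuron is omitted for brevity.

```python
import numpy as np

def layer_forward(outputs_prev, weights, a=1.0):
    """outputs_prev: outputs n_i of layer l, shape (n,).
    weights: matrix of w_ij connecting layer l to layer l+1, shape (n, m).
    Returns the outputs of the m neurons in layer l+1."""
    v = outputs_prev @ weights            # summed inputs, equation (2)
    return np.where(v > 0, a * v, 0.0)    # Rectifier transfer function, equation (4)

# Example: three neurons in layer l feeding two neurons in layer l+1
n_l = np.array([0.5, -1.0, 2.0])
w = np.array([[0.1, -0.2],
              [0.4,  0.3],
              [-0.5, 0.6]])
print(layer_forward(n_l, w))
```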
A training dataset with labeled classes is used during supervised learning. If the network
output deviates from the training dataset, the ANN has to adjust its weights. The process of
adjusting the weights of the ANN is called training; the weights are adjusted to minimize
the network error [13].
The network error is calculated by evaluating a loss function. The loss function compares
the network's output to the corresponding output value in the dataset. There are different
algorithms for adjusting the weights. One of them is called backpropagation. The
backpropagation algorithm propagates the error back through the network and adjusts all the
weights. Every weight is adjusted according to how much it contributed to the measured
network error [13].
2.3 Deep Learning

Deep Learning (DL) is a type of machine learning architecture based on ANNs. A DL
architecture is an MLP with many neurons and many layers. DL is computationally demanding
and training a model can take a lot of time. The amount of data required to train a model
scales with the size of the model [15].
2.3.1 Recurrent Neural Network

A recurrent neural network (RNN) is built to let information persist [13]. For example, when
we humans read a text we understand each word based on our understanding of the previous
words; we do not throw everything away and think from scratch between each word.
Traditional neural networks cannot do that, but RNNs address this issue. To do so, RNNs
contain loops [16], see figure 3.
The network, A, takes some input x_t and outputs a value h_t, and the loop allows information
to be passed from one step to the next in the network. Recurrent neural networks are not
that different from normal neural networks: if the loop is unrolled, the RNN looks like
multiple copies of the same network, each passing information on to the next one, as shown
in figure 4.
2.3.2 Long-Short Term Memory

Long-Short Term Memory, or LSTM as it is usually called, is a special kind of RNN that
was introduced in 1997 by Hochreiter & Schmidhuber [17]. This work was later refined
and popularized by many people [16]. LSTMs are widely used in deep learning.
LSTM uses feedback connections that give it the possibility to not only process single data
points (such as images), but also sequences of data (speech and video). The most common
LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The
cells are used to remember values over arbitrary time intervals while the gates are used to
regulate the flow of information that goes in and out of the cell [13], visualised in figure
5.
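A minimal sketch of how an LSTM layer processes a sequence of feature vectors in Keras is
shown below; the sequence length, feature size and number of units are illustrative
assumptions, not values taken from the thesis.

```python
from tensorflow.keras import layers, models

# Sequence of 32 time steps, each described by a 128-dimensional feature vector (assumed sizes)
model = models.Sequential([
    layers.Input(shape=(32, 128)),
    layers.LSTM(64),                        # cell state and input/output/forget gates handled internally
    layers.Dense(3, activation='softmax'),  # three-way classification
])
model.summary()
```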
2.3.3 Convolutional Neural Network
A special class of MLP is known as the Convolutional Neural Network (CNN). CNNs are
neurobiologically inspired and well suited for pattern classification. CNNs are designed
to recognize two-dimensional shapes and can handle a high degree of distortion [13].
The first step of a CNN is feature extraction: each neuron takes its inputs from a local
receptive field in the previous layer [13]. The second step is feature mapping. Each
computational layer of the CNN has multiple feature maps. Each feature map is a plane in
which the individual neurons are required to share the same weights [13]. The third step is
subsampling, which occurs after each convolutional layer. Subsampling is a computational
process that lowers the resolution of the feature map and thereby the sensitivity of its
output. Lowering the sensitivity enables the output to better withstand distortion [13].
The input to a CNN is a two-dimensional array where every element corresponds to a pixel
value. Certain inputs have the RGB color scheme and will then have three input channels
instead of one. The first hidden layer performs convolution. The hidden layer consists of
a number of filters of the same size. Each neuron is assigned a two-dimensional receptive
field. This receptive field is pre-determined and is also called a kernel [13]. The kernel size
will determine the output dimension of the convolution. The second hidden layer performs
subsampling and local averaging. The size of the filters gets smaller compared to the
previous layer.
The above architecture repeats itself until the output layer is reached. The output layer
is flat (1x1) and has as many outputs as there are class labels. The spatial resolution is
reduced while the number of feature maps is increased [13].
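A minimal Keras sketch of the convolution/subsampling pattern described above is given
below; the input size, filter counts and number of classes are illustrative assumptions
rather than the configuration used in this thesis.

```python
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(144, 144, 3)),             # RGB input: three channels
    layers.Conv2D(32, (3, 3), activation='relu'),  # feature extraction with shared weights (kernels)
    layers.MaxPooling2D((2, 2)),                   # subsampling lowers the feature-map resolution
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(3, activation='softmax'),         # one output per class label
])
cnn.summary()
```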
2.4 Deep Learning Architectures

2.4.1 VGG16
VGG16 is a 16-layer CNN model proposed by K. Simonyan and A. Zisserman in the paper
"Very Deep Convolutional Networks for Large-Scale Image Recognition" [18]. Neural networks
prior to VGG used bigger receptive fields (7x7 and 11x11) compared to the 3x3 fields in
VGG16, but they were not as deep as VGG16.
By default, the input to the first layer is a 224x224 RGB image. The image is then passed
through a stack of convolutional layers, where the filters use a very small receptive field:
3x3. Following the convolutional layers there are three Fully-Connected, or dense, layers:
the first two have 4096 channels each, and the third performs the 1000-way ILSVRC
classification and thus contains 1000 channels (one for each class). The third dense layer
should always be changed to match the number of classes being trained. The final layer is
the soft-max layer. This is all visualised in figure 6.
Figure 6: VGG16 architecture based on "Very Deep Convolutional Networks for Large-Scale
Image Recognition" [18]
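As a hedged sketch of how the 1000-way dense layer can be replaced to match the number of
classes being trained, the Keras applications API can be used as below; the exact classifier
head used in this thesis may differ.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the pre-trained convolutional weights fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(3, activation='softmax'),  # replaces the 1000-way ILSVRC classifier
])
```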
2.4.2 Inception v3
Inception v3 is one of multiple iterative improvements on the Inception network. Inception
v3 was introduced in the same paper as Inception v2, "Rethinking the Inception Architecture
for Computer Vision" by Szegedy et al. [19]. Inception v3 is a widely used and popular
image recognition model that has been shown to reach greater than 78.1% accuracy on the
ImageNet dataset [20].
The model is made up of symmetric and asymmetric building blocks, including convolutions,
average pooling, max pooling, concatenations, dropout, and fully connected layers. Batch
normalization is used extensively throughout the model and is applied to the activation
inputs. The loss is computed using Softmax. This can be seen in figure 7.
2.4.3 Residual Network
A residual block learns a residual mapping F(x) = H(x) − x with respect to its input x, so
the original function becomes the sum of the residual function and the input vector:
H(x) = F(x) + x [22].
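A minimal Keras sketch of such a residual block is shown below, where the block output is
the sum of the learned residual F(x) and the identity shortcut x; the filter sizes are
illustrative and the input is assumed to already have the same number of channels as the
block.

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """x is assumed to have `filters` channels so the identity shortcut can be added directly."""
    shortcut = x                                      # identity shortcut
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.Add()([y, shortcut])                   # H(x) = F(x) + x
    return layers.Activation('relu')(y)
```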
3 Related work
Neural networks (NN) have successfully been applied to classification problems and have the
capability to solve non-linear problems. There are different types of NNs, and the type that
will be investigated here is the Residual Network (ResNet), because of the model winning the
image classification competition ILSVRC 2015 [23]. ResNet will be implemented to recognize
different human activities in an office environment by evaluating the data from a video
stream. Similar models will also be evaluated and compared with ResNet with regard to
complexity, training and validation accuracy.
The Residual Neural Network architecture has shown better results than earlier Deep Learning
architectures [22]. The Visual Geometry Group 16 network has been successfully applied to
the AMI office environments with a high degree of validation accuracy [1], correctly
classifying a finite number of well-defined human group activities. Another network
architecture that was used in [1] was Inception V3. The Inception V3 architecture is also
implemented in this thesis.
3.1 An investigation of transfer learning for deep architectures in group activity recognition

3.1.1 Background
In the article [1] the authors investigated how DL architectures with high performance on
activity recognition problems could benefit from transfer learning. Transfer learning means
that a pre-trained DL architecture is reused to solve another problem by peeling away the
last layer.
The researchers tried out the DL architectures in a set of controlled experiments using the
AMI Meeting Corpus database [1]. The database was used because it offered a controlled
environment where the conditions were exactly the same. These conditions include the camera
angles, the lighting, and the participants. The database offers up to six different camera
angles: four cameras in the middle, each directed towards one participant, a centered top
view of the whole room and a top corner view of the whole room. The researchers used the
data from the camera angles that had an overview of the entire room.
3.1.2 Method
Three activity classes were used: presentation, meeting and empty. These three classes were
chosen because they were present during every meeting. From the controlled set of
experiments, the researchers selected the best performing DL architectures. These DL
architectures were applied to a dataset captured from two different cameras in the IoTaP Lab
at Malmo University. That dataset is not relevant for this project, because this project
aims to use only the AMI Meeting Corpus database.
The videos from the AMI Meeting Corpus database were pre-processed before being used to
train the DL networks. Each video was copied once and flipped horizontally, creating a
mirror image and thereby expanding the total volume of data. Each video was exactly five
minutes long. In order to avoid overfitting, the images that belonged to the same video were
kept together. The videos were split into 25 170 frames. Each frame was extracted as a JPEG
and resized to 224x224 pixels. The test set consisted of 15% of each activity class.
The network models used were VGG16 with randomly initialized weights, VGG16 with weights
pre-trained on the ImageNet dataset [23], and Inception V3 pre-trained on ImageNet weights.
Each network model was trained for one hour using a GTX 1060 video card. The activity
classes presentation and meeting have similar features compared to the activity class empty,
so the researchers also excluded the activity class empty in order to investigate
classification bias towards it.
The researchers changed the original models by removing and adding specific layers. The
layers that were not modified had their weights locked during training. To incorporate
temporal features in the experiments, the researchers added RNN elements and an LSTM layer
to the original models [1]. The temporal features of the network models had to be
incorporated into the validation process, which was redesigned so that image frames were
grouped together; the image frames of each video segment were classified independently
during validation.
The researchers investigated temporal features further and therefore implemented a 3D
CNN [1]. The 3D CNN model required the data to be reprocessed, which was done by scaling
the images to a width and height of 32x32 pixels and a depth of 10 images. The previous
models mentioned are not applicable to a 3D CNN, so the researchers used a pre-trained model
given in [24]. The model was trained for 100 epochs.
3.1.3 Results
The models incorporating temporal features scored the highest validation accuracy among the
models tested in the paper [1]. The experimental results in the controlled set of experiments
using the AMI database showed that the model with the highest validation accuracy was the
3D CNN, with 94.8% validation accuracy [1]. The second highest validation accuracy was
achieved by the RCNN combining VGG16 features with LSTM layers, with 88.0% validation
accuracy [1].
The researchers tried changing the dimensional input and using unidirectional or
bidirectional LSTM layers in order to enhance the VGG16 model, and could raise its
validation accuracy to 92%. The changes that resulted in high accuracy were unidirectional
LSTM layers and high dimensional input [1].
3.1.4 Comments
This thesis is based upon the work done in [1] and extends it by incorporating both ResNet
and audio.
3.2 Towards Robust Human Activity Recognition from RGB Video Stream
with Limited Labeled Data
This article [25] investigates how to recognize human activity from an RGB video stream with
limited data. The limitation in this instance is the lack of depth information, as opposed
to RGB-D video which has depth data.
They propose a framework that couples skeleton data extracted from RGB video with a deep
BLSTM model for activity recognition [25]. In order to train this model effectively, they
put forward a set of algorithmic techniques. The solution can even outperform
state-of-the-art RGB-D video stream solutions and can be widely deployed using ordinary
cameras.
Their proposed architecture combines deep BLSTM layers and an MLP, with five consecutive
BLSTM layers with dropout [25]. They apply Batch Normalization after each BLSTM layer and
then feed the output of the BLSTM layers to the MLP.
3.2.1 Comments
This article was chosen as relevant for getting an idea of combining layers and what type
of activation layers and optimizers might be useful.
3.3 Deep Residual Learning for Image Recognition

In this article [22] the authors write about the problem of vanishing/exploding gradients in
DL networks. At a certain network depth, the training error starts to increase when more
layers are added to the network. The researchers investigated the ResNet architectural
approach to avoid vanishing/exploding gradients. Using identity mapping and identity
shortcuts, they added the output of a previous layer to the output of the stacked layers.
The identity shortcuts do not add more parameters or require more computational power.
3.3.1 Comments
This article is relevant because it shows that using ResNet can help avoid
vanishing/exploding gradients, which decreases the training error. This thesis uses ResNet,
which makes the article useful.
3.4 Robust Audio Sensing with Multi-Sound Classification

The authors in [26] explore different approaches to multi-sound classification and propose a
stacked classifier based on recent advancements in deep learning. The proposed approach can
robustly classify sound categories among mixed acoustic signals, without needing to know the
number and signature of sounds in the mixed signal.
To do this, the authors apply a state-of-the-art CNN called VGGish, pre-trained on
AudioSet [26]. To extract MFCCs (Mel Frequency Cepstral Coefficients), the authors compute a
spectrogram from the magnitude of the STFT (Short-Time Fourier Transform) of each frame of
the original time-domain signal. The STFT is configured using a window size of 25 ms, a hop
of 10 ms and a periodic Hanning window. The spectrogram is then mapped to 64 Mel bins, which
produces a Mel spectrogram. In order to stabilize the spectrogram, the authors take the
logarithm with an offset of 0.01 to avoid taking the logarithm of zero. Lastly, the audio
features are extracted as a log Mel spectrogram with a shape of 96x64 (96 frames of 10 ms
each, with a range of 64 Mel bands).
3.4.1 Comments
The window size of 25 ms, hop of 10 ms, mapping to 64 Mel bins and taking the logarithm of
the produced Mel spectrogram are implemented in this thesis. The results in [26] are
promising, as the accuracy never drops by a significant amount when going from 1 to 5 mixed
events of similar and distinct sound categories. On average, the accuracy only dropped 6.75%
when going from 1 to 5 mixed sounds.
4 Method
The method of choice for this research is the system development method by Nunamaker and
Chen [27]. This systems development process is done in five stages [27]. Nunamaker and
Chen's method was chosen because it allows an iterative process until the research questions
are answered. Unforeseen difficulties that are encountered force the project to take a step
back and re-evaluate an earlier stage.
4.1 Construct a Conceptual Framework

The first stage of this research methodology was to break down the problem domain into
smaller research questions and to perform a literature study. The research questions have
already been defined and are presented in section 1.4.1. A literature study was done to gain
more knowledge of the problem domain, which helped to provide information for the theoretical
background presented in chapter 2. The work done in [1] was also analyzed and the same
approach was adopted; that approach is described in section 3.1.
4.2 Develop a System Architecture

With the first stage complete, the next step is to develop a system architecture based on
the conceptual framework. Based on the information from the conceptual framework, the
components and requirements needed were acquired.
The development of the system architecture was done using Google Colab and the Keras
API [2]. Kaggle (competitions based on machine learning problems) was also used for
inspiration on how to implement and build a prototype system. Kaggle provided a variety of
implementations of image classification using the Keras API.
4.2.1 Google Colaboratory

Colaboratory is a so-called Jupyter notebook in the cloud that requires no setup to start
using3. A Jupyter notebook is a web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text. By using Colaboratory
we get access to powerful hardware, which speeds up the training time for our deep learning
models.
3 https://colab.research.google.com/notebooks/welcome.ipynb
4.3 Analyze and Design the System
At this stage, we have a system architecture and it is time to analyze whether the system
architecture meets the requirements, so that results with good validity can be produced.
4.3.1 Cifar-10 dataset

The Cifar-10 dataset is a multiclass image classification problem. The dataset consists of
60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training
images and 10000 validation images. Cifar-10 is a moderately difficult dataset because of
the similarity of some of the classes. The target names of Cifar-10 are: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is provided by
the Keras API, and the models used for feature extraction to answer the research questions
are also provided by the Keras API [2]. The AMI Corpus dataset is a custom dataset and is
not supported by the Keras API like Cifar-10 is. This means that the AMI Corpus data has to
be loaded in a different way than the Cifar-10 dataset in order to use the Keras API.
In order to validate the implemented system design, the results given by the models were
analyzed by comparing them against known official benchmark scores on Cifar-10, which would
show the validity of the system implementation. The image size used for Cifar-10 was 32x32
pixels.
4.3.2 Cats Vs. Dogs dataset

The Cats Vs. Dogs dataset is a two-class image classification problem. The dataset consists
of 8000 training samples and 2000 validation samples. The images vary in size and are
reshaped to a common size.
The Cats Vs. Dogs dataset was used in order to analyze the design of the system. This had to
be done because the folder structure of the Cats Vs. Dogs dataset is not the same as for the
Cifar-10 dataset in 4.3.1, which means that the Cats Vs. Dogs dataset undergoes a different
loading process and the result has to be analyzed.
The Cats Vs. Dogs dataset is thus used to verify that the loading process with the provided
folder structure works. The same loading process will also be used for the AMI dataset.
4.4 Feature extraction by using transfer learning

The base models used for feature extraction were the pre-trained ResNet50 [22], VGG16 [18]
and Inception V3 [19]. The ImageNet weights were used to initialize the base model weights.
Transfer learning was achieved by removing the top layer of the base model. Through transfer
learning the features of each image were extracted. The image features were then used to
train a smaller NN model to classify each image as belonging to one of the three classes
described in subsection 4.5.1.
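A hedged sketch of this feature-extraction step is given below, using ResNet50 with ImageNet
weights and the top layer removed; the pooling choice, image sizes and batch contents are
assumptions rather than the exact thesis configuration.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Base model with the classification top removed; global average pooling gives one vector per image
base = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def extract_features(image_batch):
    """image_batch: array of shape (n, 224, 224, 3) with pixel values in 0-255."""
    return base.predict(preprocess_input(image_batch.astype('float32')))

# Example with random placeholder images
features = extract_features(np.random.randint(0, 256, size=(8, 224, 224, 3)))
print(features.shape)  # (8, 2048) feature vectors, later used to train the smaller model
```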
4.4.1 Data augmentation

Data augmentation is used to increase the sample size by producing a larger variety of unique
data samples [2]. The Keras API has a built-in data augmentation feature, which was used so
that the model would generalize better and thereby reach a higher validation accuracy with a
lower risk of overfitting. Data augmentation was only applied to the training dataset.
Some of the data augmentations used are inspired by [1], but they are also standard
augmentation techniques for images. A horizontal flip does not change the information in the
image but represents it in another way; for instance, in the Cats Vs. Dogs dataset, a cat
image that is flipped horizontally still looks like a cat. Rotation, shear and zoom likewise
create augmented images that show the same visual information represented in another way.
Distorting the image too much could lead to worse performance, because the augmented images
would no longer represent reality. The values used were therefore small, but still large
enough to create augmentations that are distinguishable from the original image.
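A sketch of the Keras built-in augmentation described above is shown below; the directory
path and the exact rotation, shear and zoom values are illustrative assumptions, not the
precise settings used in the thesis.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,   # same visual information, represented another way
    rotation_range=10,      # small rotation, in degrees
    shear_range=0.1,
    zoom_range=0.1,
)
# Only the training data is augmented; the validation generator would omit these options.
train_flow = train_gen.flow_from_directory(
    'data/train',                 # hypothetical path with one subfolder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
)
```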
Keras data generators were used to augment the data and to extract image features through
the transfer learning process [2]. The generators used the base models to extract the image
features through online predictions on the training and testing datasets. The extracted
features were then used for training and evaluating the model. This approach solved the
problem of not being able to use large image sizes due to memory overflow in the Google
Colaboratory environment; memory overflow did not occur because the base models predicted
online and only loaded a small enough number of images at a time. The image size used for
Cats Vs. Dogs and for AMI was 244x244. The loading process sorts the data in alphanumeric
order, and the AMI data has a unique file stamp for every sample in every video batch. The
batches are therefore sorted by the Keras data generators without shuffling the data samples,
so the integrity of the sample order within each batch of data is maintained. Maintaining
the order of the samples within each batch is important for the use of temporal features.
4.4.2 Parameters during model training

The parameters were chosen by considering the paper [28]. The batch size used was 1000 for
Cifar-10. The Cats vs Dogs dataset is smaller, so a smaller batch size of 256 was used during
training.
The optimizer chosen was Stochastic Gradient Descent (SGD) with a learning rate of 0.0001, a
learning rate decay of 0 and a momentum of 0.9 [28].
Figure 11: Neural Network after applying dropout [2]
Dropout sets a fraction of the inputs to 0 and is used to counteract the model overfitting
the data [2]. The inputs that are set to 0 are chosen randomly at each step [2]. The dropout
rate is set to 0.4 unless otherwise specified.
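A hedged sketch of a small classification model on top of the extracted features, using a
dropout rate of 0.4 and the SGD settings given above, is shown below; the layer sizes and the
feature-vector length are assumptions and not necessarily the exact architecture in Fig. 12.

```python
from tensorflow.keras import layers, models, optimizers

clf = models.Sequential([
    layers.Input(shape=(2048,)),           # extracted feature vector (assumed length)
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.4),                   # randomly sets 40% of the inputs to 0 during training
    layers.Dense(3, activation='softmax'),
])
clf.compile(
    optimizer=optimizers.SGD(learning_rate=0.0001, momentum=0.9),  # decay of 0 is the default
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
```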
A variety of approaches were tested to speed up the experiments and make the results
reproducible. The approach that was adopted includes implementing data generators and also
saving the features extracted by the base models. The AMI Corpus data used is available in
the git repository4 for future use.
Figure 12: Model architecture used after the base model feature extraction.
4 https://github.com/Geo90/dataset
4.5 Build the (prototype) System
This stage is all about the implementation of the system that was designed. Here we do
all the programming needed for the system.
4.5.1 Frame extraction

The AMI Corpus NITE XML Toolkit (http://groups.inf.ed.ac.uk/nxt/tutorials/tutorial1.shtml)
was used to extract the annotations for each video. A program was created to read the data
in each Excel file and to merge and save all the data into one Excel file. This file
contained the most important data needed: the name of the video, the topic, the start and end
time for each topic, and the class assigned to each topic. The assumption was made that video
segments whose topic name included the word "presentation" or "drawing" were classified as
presentation. Topic names that included the word "closing" were classified as "empty", and
the rest of the topics were classified as "meeting".
Another program was created to extract batches of frames from the video signals. Each batch
of video frames was stored according to the classification made based on the topic name. For
each batch of video frames, a two-dimensional audio spectrogram was extracted. The video
batch of frames and the audio spectrogram were synchronized, which is a requirement for
answering the research questions.
The extraction of samples was done by reading a video segment containing frames and storing
all the video segments in a list. The list was then randomly shuffled, and the class with the
fewest samples set the limit for how many video segments were extracted from each class. The
samples were then extracted from the shuffled list and chosen randomly. Separate lists were
kept for the class labels, file stamps, spectrograms and batches of video frames. The class
"closing" usually occurs only at the end of a video, or not at all, while the class with the
most occurrences is the "meeting" class. By limiting the extraction, the dataset becomes more
balanced with respect to the samples per class. The extraction process has been thoroughly
tested.
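A hedged sketch of the frame-extraction step is given below: every fifth frame is taken from
a 25 fps video until a batch of 32 frames (about 6.2 seconds) has been collected. The use of
OpenCV, the function name and the file name are assumptions for illustration.

```python
import cv2

def extract_batch(video_path, start_frame=0, step=5, batch_size=32, size=(144, 144)):
    """Collect a batch of `batch_size` frames, keeping every `step`-th frame from `start_frame`."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames, index = [], 0
    while len(frames) < batch_size:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                       # keep every fifth frame
            frames.append(cv2.resize(frame, size))  # re-scale to 144x144
        index += 1
    cap.release()
    return frames

batch = extract_batch('ES2002a.Overview.avi')       # placeholder file name
```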
4.5.2 Video dataset

The same approach is used as in 3.1.2, where three activity classes (presentation, empty and
meeting) are to be classified by the model. The datasets used can be augmented as described
in section 4.4.1 to expand the total volume of samples. The dataset consisted of images of
size 144x144. Each of the three classes had batches of 32 frames. The 32 frames were
captured by taking every fifth frame until a batch size of 32 frames was reached. The frame
rate of the videos was 25 frames per second, meaning that a batch of 32 frames had a duration
of 6.2 seconds. From each video it was possible to extract one or maybe two batches per
class, for training and validation respectively. From some videos no batches were extracted,
because it was necessary to keep the dataset balanced and the class empty occurred only once
in the video or not at all. The empty class also occurred only during a short time span in
the video. We solved this by manually moving batches from validation folders to training
folders. The frames were re-scaled to 144x144 and placed in training and validation folders,
each containing subfolders for the classes 0, 1 and 2. The images were then uploaded to
GitHub.
Google Colab6 was used to implement the base models for transfer learning. Because of limited
RAM, the images had to be resized to 244x244 and predicted online in batches so that the RAM
limit would not be exceeded.
The base models made online predictions on batches of 32 frames. The results of these
predictions are the extracted image features, which were stored in variables to be used by
the model during training. Only the classification model was trained; the base models were
used solely to extract the image features.
Figure 13: Shows what the ideal setting for each of the three classes can look like. The
presentation class should have someone standing up and speaking. The empty/no-activity
class should be an empty room. The meeting class should have all four participants
conducting a conversation.
4.5.3 Audio dataset

The same approach was used as in [26]. Mel spectrograms were extracted from the audio files
that were part of the videos. The FFT window had a time span of 25 ms with a window hop of
10 ms. The audio sample rate was 16000 samples per second, so each 25 ms FFT window contained
400 audio samples, a new FFT window was created every 160 audio samples, and consecutive FFT
windows overlapped by 240 audio samples. The number of Mel bins used was 64 and the
frequencies outside [175, 7500] Hz were cut off. The logarithm was taken of the spectrograms
to stabilize them. The spectrograms were uploaded to GitHub with a size of 144x144. In
Google Colab they were resized to 244x244 for feature extraction by the base model.
6 https://colab.research.google.com/notebooks/welcome.ipynb
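A hedged sketch of this log-Mel extraction is shown below using librosa; the thesis does not
state which audio library was used, so the library choice, function name and file name are
assumptions, while the window, hop, Mel-bin, frequency-range and log-offset values follow the
description above and [26].

```python
import numpy as np
import librosa

def log_mel_spectrogram(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms window -> 400 samples
        hop_length=int(0.010 * sr),  # 10 ms hop    -> 160 samples (240 samples overlap)
        n_mels=64, fmin=175, fmax=7500,
    )
    return np.log(mel + 0.01)        # offset avoids taking the logarithm of zero

spec = log_mel_spectrogram('ES2002a.Mix-Headset.wav')  # placeholder file name
```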
Figure 14: Shows the Mel audio spectra for the three classes (Presentation,
Empty/No-activity, Meeting), corresponding to the figures in Fig. 13.
4.5.4 Video and Audio combined

To combine the extracted video and audio features, the data had to be concatenated. The
concatenation was done along the feature axis of the audio and video data [7]. An LSTM layer
was used to analyze the temporal data. The last layer is an output layer with three outputs,
one for each class, and the whole model outputs three probabilities indicating which class
the model thinks the batch of data belongs to.
The simple architecture used for testing audio only, video only, and audio and video
combined, where one LSTM layer was enough, is shown in Fig. 15.
Each audio frame used for training is 0.2 seconds long. A total of 32 audio frames make up
one sequence and are combined with the corresponding video batch along the extracted feature
axis. The pre-processing of the images was done in the same way as in section 4.5.3 (see
Fig. 16).
Figure 16: shows the Mel audio spectra for three consecutive frames in the meeting class.
Each image spans over 0.2 seconds.
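A hedged sketch of combining the extracted audio and video features along the feature axis
and feeding the sequence to a single LSTM layer with a three-way softmax output is shown
below; the feature dimensions and number of LSTM units are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

steps, video_dim, audio_dim = 32, 2048, 2048          # assumed feature sizes per time step
video_feats = np.random.rand(10, steps, video_dim)    # placeholder extracted video features
audio_feats = np.random.rand(10, steps, audio_dim)    # placeholder extracted audio features
combined = np.concatenate([video_feats, audio_feats], axis=-1)  # concatenate along the feature axis

model = models.Sequential([
    layers.Input(shape=(steps, video_dim + audio_dim)),
    layers.LSTM(64),
    layers.Dense(3, activation='softmax'),             # one probability per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(model.predict(combined).shape)                   # (10, 3) class probabilities
```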
4.6 Observe and Evaluate the System

The final stage is to observe and evaluate the results of our prototype system. Here we can
determine whether the results were as expected or not.
The AMI Corpus dataset was successfully loaded into the system; because of the earlier
testing (see 4.3.1 and 4.3.2), there was no concern that the system itself would be at fault.
Test simulations with the Cats Vs. Dogs dataset showed promising results, with a validation
accuracy of more than 90%. The training accuracy did not change, and the reason could be that
the dataset was too small.
Because online prediction was needed for the image size 244x244, the Cifar-10 and Cats Vs
Dogs datasets were tested with the online predict generator [2]. The results were similar to
the results presented in Fig. 20 and in Fig. 19b.
5 Results
In this section, the results from various experiments will be presented in detail.
5.1 Cats vs Dogs dataset

Figure 17: Resulting images from the Cats Vs Dogs dataset produced by the Keras data
generator.
In Fig. 17 the generator is set to a batch size of 1000 training samples. Because the Cats Vs
Dogs dataset only has 8000 training samples, the sequence repeats itself after every eighth
batch. The generator can be used to generate more samples than the total sample size, and the
extra samples can be augmented, thereby expanding the total training size.
Figure 18: shows the model accuracy for the Cats vs Dogs data set described in 4.3.2. The
data has been augmented. The base models used for feature extraction are shown under
each subplot. The model architecture used for training and prediction is shown in Fig. 15
The implemented system architecture for loading data, creating the dataset described in
Fig. 17, extracting the features and training a model on the extracted features does result
in a high model validation accuracy, shown in Fig. 18. The training accuracy is around 50%
and there is no definite explanation for why it is not higher. There could be a lot of
variance in the Cats Vs Dogs dataset, which reduces overtraining. Dropout is also used, which
lowers the training accuracy so that the model does not become overtrained.
(a) VGG16 (b) ResNet50 (c) Inception V3
Figure 19: shows the confusion matrix for the Cats vs Dogs dataset described in 4.3.2. The
data has not been augmented. The data processing type is shown under each subplot. The model
architecture used for training and prediction after the feature extraction is shown in
Fig. 15.
The confusion matrices in Fig. 19 show that the classification done by the model does indeed
work. They also show that some samples are not classified correctly, meaning the model is not
able to distinguish which class is the correct one for those samples.
Table 1 shows that the model performs better when the dataset is shuffled. The data is read
class by class, and shuffling the data samples seems to have an impact on the learning. If
the data is not shuffled, the model will first learn the class cats and then dogs, which can
make the model more biased towards classifying the data as cats.
Table 1: Shows the F1 score for the resulting confusion matrix in Fig. 19 using no shuffling
with a ratio between training and validation samples of 50/50, shuffling the data with a
ratio of 80/20, and no shuffling with a ratio of 80/20 (data generated by using Keras
generators); the data is not augmented. The accuracy for each class is the same as the F1
score.
5.2 Cifar-10 dataset
Figure 20: The Cifar-10 example data of the first 20 images that are preprocessed and
then undergo feature extraction.
Figure 21: Shows the first 20 images of the Cifar-10 dataset, the same as in Fig. 20, with
the difference that these images are also augmented before doing anything else. The
augmentation is described in section 4.4.1.
5.2.1 Benchmarking the system architecture on Cifar-10 dataset

The system architecture is tested by training the model on the Cifar-10 dataset. These tests
ensure that the base models and transfer learning work correctly and that no anomalies are
present in the system architecture.
Figure 22: Accuracy and loss for Cifar-10 using ResNet50 where (a) shows accuracy and
(b) shows loss
Figure 23: Confusion matrix based on the result from Fig. 22
Table 2: Shows the precision, recall, F1-score and number of samples of the Fig. 22 and
Fig. 23. Data is shuffled but not augmented.
The validation accuracy shown in Fig. 22 shows that the feature extraction works properly
and that the model is able to learn from the extracted features and classify the validation
dataset with a high validation accuracy. There is also clear evidence that the training
accuracy of the model exceeds the validation accuracy at around epoch 60. The confusion
matrix in Fig. 23 shows the misclassifications made by the model: some samples of the classes
dog and cat seem to confuse the model, as do ship and airplane. The model gets confused
because there are similarities between these classes, which causes some samples to be
classified incorrectly.
5.3 AMI Corpus Video and Audio

5.3.1 Video dataset
Figure 24: Shows the accuracy for ResNet50 without augmentation on AMI dataset for the
Edinburgh Scenario meetings. Only two classes to categorize, presentation and meeting.
The data set is shuffled and balanced so that the ratio between training and validation
is 80/20. Figure (a) represents the corner camera, (b) the overhead camera and (c) the
corner and overhead camera angles combined.
The model has a high validation accuracy for the corner camera angle, the overhead camera
angle, and the data from the two camera angles used together in Fig. 24. There are extracted
features in the images that are similar, or the features could be different but unique to
each camera angle, meaning the model finds a pattern that results in a high validation
accuracy even if the camera angles are different. The model's complexity thus allows it to
optimize its weight configuration in such a way that it reaches a high validation accuracy
regardless of the camera angle used in Fig. 24.
Figure 25: Shows the loss for ResNet50 without augmentation on AMI data set for the
Edinburgh Scenario meetings. The network architecture can be seen in the Appendix at
Fig. ?? . Only two classes to categorize, presentation and meeting. The data set is
shuffled and balanced so that the ratio between training and validation is 80/20. Figure
(a) represents the corner camera, (b) the overhead camera and (c) the corner and overhead
camera angles combined.
The model loss shown in Fig. 25 does vary depending on the camera angle. Some camera
angles make it more difficult for the model to learn the classification pattern. Comparing
Fig. 25a and Fig. 25b it can be observed that the loss function is steeper for Fig. 25b.
The steeper loss function indicates that the data in the overhead camera angle is easier
for the model to learn compared to the corner camera angle. Putting the camera angles
together creates a model that can classify data from different camera angles.
Figure 26: Shows the confusion matrix for ResNet50 without augmentation on AMI dataset
showing the confusion matrix for the different datasets used. Only two classes to categorize,
presentation and meeting. The dataset is shuffled and balanced so that the ratio between
training and validation is 80/20. Figure (a) represents the corner camera, (b) the overhead
camera and (c) the corner and overhead camera angles combined.
The confusion matrices in Fig. 26 show that the model is more likely to misclassify presentation as meeting. The data is not clear enough regarding what a "presentation" or a "meeting" setting looks like. The confusion matrix for the corner camera angle in Fig. 26a shows more presentation samples misclassified as meeting compared to Fig. 26b.
Table 3: Shows the accuracy, precision and recall for Fig. 24 with shuffled data and a training/validation ratio of 80/20. Data is not augmented.
Figure 27: Shows the validation accuracy for each of the three base models on the audio data set for the Edinburgh Scenario and Idiap Scenario meetings combined, without augmentation. Figure (a) represents the InceptionV3 base model, figure (b) the ResNet50 and figure (c) the VGG16 base model.
The validation accuracy in Fig. 27 suggests that the audio data does not provide enough unique features for the model to distinguish between the different classes. The curves are not smooth because of the small number of data samples used. Each audio data sample in Fig. 27
spans 6.2 seconds. ResNet50 reaches the highest validation accuracy, which shows that ResNet50 has extracted features that make it easier for the model to classify the audio data correctly.
Figure 28: Shows the loss for the different base models. Same dataset used as in Fig. 27. Figure (a) represents the InceptionV3 base model, figure (b) the ResNet50 and figure (c) the VGG16 base model.
The ResNet50 model in Fig. 28 has a steeper loss function compared to both InceptionV3
and VGG16.
Figure 29: Shows the confusion matrix for each base model. Same data and training parameters as shown in Fig. 27. Figure (a) represents the InceptionV3 base model, figure (b) the ResNet50 and figure (c) the VGG16 base model.
Table 4: Shows the validation accuracy and F1-score for each class for the resulting confusion matrices in Fig. 29 for InceptionV3, ResNet50 and VGG16. The data ratio is 80/20 between training and validation.
According to the confusion matrices in Fig. 29, the InceptionV3 base model misclassified presentation as meeting. VGG16 misclassified empty as meeting and also confused presentation and meeting. The model that excelled is ResNet50, although ResNet50 too confused presentation and meeting. All three base models had difficulties distinguishing between presentation and meeting. It is reasonable to think that audio spanning 6.2 seconds is similar for both presentation and meeting, which makes it difficult for the model to distinguish between one or more persons speaking.
5.3.3 LSTM: Audio and video sequence
Figure 30: Shows the validation accuracy during model training using the LSTM model
architecture. Using ResNet50 as base model for the feature extraction. Figure (a) repre-
sents only audio with 65% accuracy, figure (b) is only video with 100% accuracy and (c)
is both audio and video features combined with 100% accuracy.
The LSTM model reached a high validation accuracy, as seen in Fig. 30, when using only video sequences compared to only audio sequences. Combining the audio and video sequences along their feature axis lowered the validation accuracy by about 5 percentage points. The audio samples span 0.2 seconds and for every video sample there is a corresponding audio sample. One sequence consists of 32 samples. The dataset is shuffled and balanced so that the ratio between training and validation is 80/20. The distribution of validation batches corresponding to the classes (presentation, empty/no activity, meeting) is (3, 13, 15) batches per class, and the distribution of training batches for the same classes is (69, 34, 17) batches per class.
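A minimal sketch of this LSTM setup, assuming per-sample feature vectors have already been extracted by the ResNet50 base model, could look as follows; the feature sizes, layer width, number of sequences and training settings are illustrative assumptions rather than the exact configuration used here.

import numpy as np
import tensorflow as tf

steps, video_dim, audio_dim, num_classes = 32, 2048, 2048, 3

# Placeholder extracted features with shape (sequences, time steps, features).
video_seq = np.random.rand(8, steps, video_dim).astype("float32")
audio_seq = np.random.rand(8, steps, audio_dim).astype("float32")
labels = tf.keras.utils.to_categorical(
    np.random.randint(0, num_classes, 8), num_classes=num_classes)

# Video and audio features are combined along the feature axis, video first.
combined = np.concatenate([video_seq, audio_seq], axis=-1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(steps, video_dim + audio_dim)),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(combined, labels, validation_split=0.2, epochs=2)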
Figure 31: Shows the loss during model training using the LSTM model architecture.
Figure (a) represents only audio, (b) only video and (c) both audio and video features
combined.
The loss function for the audio sequences in Fig. 31 shows that the model has difficulties finding a unique pattern in the data for each class. On the other hand, the loss function for video is steep and indicates that the model does find a pattern in the spatio-temporal video data.
Figure 32: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 30. Figure (a) represents only audio, (b) only video and (c) both audio and video features combined.
Table 5: Shows the validation accuracy and F1-score for each class for the resulting confusion matrix in Fig. 32, using ResNet50 as the base model for feature extraction of the audio data, the video data and the combined video & audio data. The ratio between training and validation data is 90/10.
The confusion matrices in Fig. 32 show that the audio data confuses the model, resulting in misclassifications between empty and meeting.
Figure 33: Figure (a) shows the same data as Fig. 30b, where the video features come first and then the audio features, with 100% validation accuracy. Figure (b) represents the data where the audio features come first and then the video features, with 97% validation accuracy.
Fig. 33 clearly shows that the training efficiency depends on whether the combined data is arranged with the audio features first and then the video features, or the video features first and then the audio features. If the audio data features come first, the training results in a worse validation accuracy compared to when the video data features come first.
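The two feature orderings compared in Fig. 33 amount to concatenating the same per-step feature arrays in different orders along the feature axis, roughly as in the following sketch; the array shapes are placeholders.

import numpy as np

video_feats = np.random.rand(32, 2048).astype("float32")  # placeholder video features
audio_feats = np.random.rand(32, 2048).astype("float32")  # placeholder audio features

# (a) video features first, then audio features along the feature axis.
video_first = np.concatenate([video_feats, audio_feats], axis=-1)
# (b) audio features first, then video features.
audio_first = np.concatenate([audio_feats, video_feats], axis=-1)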
Figure 34: Shows the training and validation accuracy during training. Same data and
training parameters as shown in Fig. 27. The accuracy in case (a) is 75%, in (b) 66% and
(c) 69% validation accuracy. Figure (a) represents base model ResNet50, (b) ResNet101
and (c) ResNet 152.
The validation accuracy does not increase with a deeper ResNet base model, which can be seen in Fig. 34. This means that there are no deeper features hidden in the data for the ResNet to extract by making the base model more complex.
Figure 35: Shows the loss function during training. Same data and training parameters
as shown in Fig. 34. Figure (a) represents base model ResNet50, (b) ResNet101 and (c)
ResNet 152.
The best performing base model among the three shown in Fig. 35 is the ResNet50 base model. It would be expected that increasing the complexity of the ResNet base model would increase the performance. It should be noted, however, that the sample size available for training and validation may be too small.
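The three ResNet depths compared in Fig. 34 and Fig. 37 could be instantiated as feature extractors roughly as in the sketch below, where the input shape and the choice between ImageNet and random weights are assumptions for illustration.

from tensorflow.keras.applications import ResNet50, ResNet101, ResNet152

def build_extractor(depth=50, imagenet=True, input_shape=(224, 224, 3)):
    resnets = {50: ResNet50, 101: ResNet101, 152: ResNet152}
    return resnets[depth](
        weights="imagenet" if imagenet else None,  # None gives random initialization
        include_top=False, pooling="avg", input_shape=input_shape)

# ImageNet-initialized (as in Fig. 34) versus randomly initialized (as in Fig. 37).
pretrained_152 = build_extractor(depth=152, imagenet=True)
random_152 = build_extractor(depth=152, imagenet=False)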
Figure 36: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 34. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
Figure 37: Shows the training and validation accuracy during training using random weight
initialization. Same data and training parameters as shown in Fig. 27. The accuracy in
case (a) is 56%, in (b) 67% and (c) 58% validation accuracy. Figure (a) represents base
model ResNet50, (b) ResNet101 and (c) ResNet 152.
Increasing the depth of the ResNet increases the validation accuracy, which can be seen in Fig. 37.
Figure 38: Shows the confusion matrix after training. Same data and training parameters as shown in Fig. 37. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
Figure 39: Shows the validation accuracy with the same data and training parameters as shown in Fig. 30. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152. The validation accuracy in (a) is 65%, in (b) 55% and in (c) 42%.
The validation accuracy decreases with a deeper ResNet base model, which can be seen in Fig. 39. This means that the feature extraction is of poorer quality and does not enable the model to classify the data better.
Figure 40: Shows the loss during model training using the LSTM model architecture. Same data and training parameters as shown in Fig. 39. Figure (a) represents base model ResNet50, (b) ResNet101 and (c) ResNet152.
Figure 41: Shows the confusion matrix after training. Same data and training parameters
as shown in Fig. 39. Figure (a) represents base model ResNet50, (b) ResNet101 and (c)
ResNet 152.
5.4.4 Audio: Edinburgh Scenario meetings LSTM audio sequence with ran-
dom weight initialization
Figure 42: Shows the training and validation accuracy during training using random weight
initialization. Same data and training parameters as shown in Fig. 30. The accuracy in
case (a) is 42%, in (b) 58% and (c) 71% validation accuracy. Figure (a) represents base
model ResNet50, (b) ResNet101 and (c) ResNet 152.
The validation accuracy also increases for the LSTM sequence data with a deeper ResNet base model, which can be seen in Fig. 42.
6 Discussion
Using the ImageNet weights instead of training the base models provides an advantage regarding model training time [23]. The base models are quite large with many parameters, and randomly initializing their weights would mean time-consuming optimization of the base models' weights. Training the base model after initializing it with ImageNet weights could lead to better results [1]. There is also the option of adjusting the weights of some layers and not all layers in the base model. The benefits are fewer parameters to optimize while still utilizing the already optimized base model for feature extraction. There was no intention of going any deeper into adjusting the weights of the base models or trying to optimize the base models with randomly initialized weights. However, the results on the AMI dataset show the computational power that lies in transfer learning and in utilizing an already optimized base model.
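The partial fine-tuning option mentioned above, keeping most of the ImageNet-initialized base model frozen and leaving only the last few layers trainable, could be sketched as follows with the Keras API; the cut-off point and the size of the classification head are assumptions, not the configuration used in this thesis.

import tensorflow as tf
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))

# Freeze everything except the last 10 layers of the base model.
for layer in base.layers[:-10]:
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),  # three AMI classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # reports trainable versus non-trainable parameters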
Shuffling the data resulted in a higher validation accuracy for the CatsVsDogs and Cifar-10 datasets. Shuffling the AMI datasets also resulted in a higher accuracy compared to unshuffled datasets. The reason why shuffling the data is so successful might be the order in which the data is presented to the model. Without shuffling, the model would train on the first class, then the second class and so on. The result would be that the model weights would approach the local minimum for the first class and then for the second class. Shuffling the data might therefore increase the probability of minimizing the cost function without getting stuck in a local minimum.
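A shuffled, class-balanced 80/20 split of the kind used throughout the experiments could be produced as in the following sketch; scikit-learn is used here purely for illustration and the arrays are placeholders.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 2048).astype("float32")  # placeholder feature vectors
y = np.repeat([0, 1, 2], [40, 30, 30])           # placeholder class labels

# shuffle=True mixes the classes; stratify keeps the class ratio in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)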
In Fig. 33 the order of the features does affect the training. The feature order not only affects the end result but also the behaviour during training; in Fig. 33 it can be observed that the two models train differently.
The audio spectrum used is not dependent on the environment in the same way as the video. The images generated from the audio spectrum are therefore not affected by the different camera angles, nor by people behaving in a way that does not fit the class. Both camera angles share the same audio, so using audio seems more beneficial than video. Audio also has its drawbacks, however: the same audio pattern can be recorded for entirely different visual behavior. This means that presentation and meeting could be interpreted the same by the model if it is only provided with audio data, which might be the case in Fig. 29 and Fig. 32. The problem with only audio might be that the presentation class still has enough similarities to the meeting class for the model to mix the two classes up. The model also seems to confuse the classes meeting and empty in Fig. 32 for the audio-only part. There is not enough data for the class presentation to say anything about the classification pattern between presentation and the other two classes. The validation accuracy when using only audio is around 70% for the dense models in Fig. 27, while for the LSTM model in Fig. 30 audio alone has a validation accuracy of around 50%. The datasets for audio are small but
seem to be sufficient for the model to learn the pattern. In the paper [7] the researchers used short raw-audio sequences of 40 milliseconds, which is a different approach compared to the Mel-spectrograms used by the researchers in [26]. The Mel-spectrograms illustrate a longer time span of 960 milliseconds. In [7] the audio was annotated manually and the analysis was done differently: the raw audio was processed by training a CNN model, while the video part had its features extracted by a ResNet50 base model. In [26] the base model used was an adaptation of the VGG16 model and the data was not annotated; the researchers used cluster training to categorize the audio data into classes. Both [7] and [26] have been implemented in this thesis. Both of them showed promising results, but using Mel-spectrograms that stretch over a larger time span seems more promising. The reason might be that some parts of the video are silent, and shorter audio segments might contain more silent parts, while a longer audio segment represents each class more accurately.
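As an illustration of the spectrogram approach, a Mel-spectrogram for one longer audio segment could be produced roughly as in the sketch below, here using librosa; the segment length, sample rate and number of Mel bands are assumptions and not necessarily the settings used in this thesis or in [26].

import numpy as np
import librosa

sr = 16000
segment = np.random.randn(int(6.2 * sr)).astype("float32")  # placeholder 6.2 s audio clip

# Mel-spectrogram on a decibel scale, which can be stored as a one-channel image.
mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (Mel bands, time frames)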
According to the ResNet paper [22] the ResNet50 base model performs better than VGG16 and InceptionV3, and adding more layers to the ResNet improves the performance further. Comparing the base models in this thesis shows that ResNet50 extracts features that the model can classify with a higher validation accuracy compared to VGG16 and InceptionV3. This can be observed in Fig. 27.
Random weight initialization performed better when using deeper ResNets, shown in Fig. 42, compared to using ImageNet weights, shown in Fig. 39. For the dense model, on the other hand, the results with ImageNet weights are better compared to random weight initialization, which can be seen by comparing Fig. 34 with Fig. 37. These results can be explained by the small dataset used for the LSTM; a larger dataset would enable better training and generalization. The dataset used for the dense model was larger and might have presented another type of features that were not present in the dataset used for the LSTM model.
If the ImageNet dataset differs too much from the AMI dataset, then using ImageNet weights would not give the best possible results and using even deeper ResNets would not improve them. Initializing the weights randomly might then actually do a better job at extracting the features compared to the ImageNet weights. This could be tested but lies outside the scope of this thesis and requires resources that are not available.
The results from using the LSTM model architecture, which can be seen in Fig. 39, are not as good as the results in Fig. 34. This suggests that the longer 6.2-second audio sequences are easier to classify compared to the shorter 0.2-second audio sequences.
6.6 Parameter optimization
Some of the parameters used have not been optimized, because research has already been done on which parameter values result in higher validation accuracy. These parameters are the learning rate for the SGD optimizer, which was inspired by [21], and the batch size, which was inspired by [28]. The models used for the AMI datasets were created in such a way as to minimize the number of parameters while acquiring the highest possible validation accuracy. Batch normalization of the data was used to accelerate the learning of the models and dropout to reduce overfitting. The results without batch normalization and dropout show lower validation accuracy, require more training time and overfit the data; these results are not presented in the thesis. The model architecture can be seen in the Appendix section in Fig. ??.
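A small classification head with batch normalization, dropout and SGD of the kind described above could be sketched as follows; the layer width, dropout rate and learning rate are illustrative assumptions rather than the values used for the reported results.

import tensorflow as tf

num_features, num_classes = 2048, 3  # assumed ResNet50 feature size and three classes

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(num_features,)),
    tf.keras.layers.BatchNormalization(),  # accelerates learning
    tf.keras.layers.Dropout(0.5),          # reduces overfitting
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()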
There are differences in model architecture between the models that were used for audio (see Appendix Fig. ??) and the models used for video (see Appendix Fig. ??) classification in Fig. 24. The model for audio is much less complex than the model for video. Small models require less computational power because of the smaller number of trainable parameters, and a larger model architecture for the audio data led to overfitting the training data more easily. The models used for audio classification also could not reach the same validation accuracy as the models used for video classification. The reason why the audio data performed worse regarding validation accuracy might be that the annotated data provided by the AMI Corpus was not accurate enough regarding the time at which the setting was taking place. Another reason might be that the scenarios taking place did not match the annotations accurately enough. One such example is the "closing" annotation, which was expected to result in participants leaving the room, an empty room and an almost silent audio recording.
7 Conclusion and future work
Classifying human activities using computer vision has always been difficult, in this thesis
we attempted to specifically classify human activity in an office environment (meeting,
presentation and empty/no activity). To do this, one main research question with three
sub research questions were brought forward.
• RQ 1: What is the relationship between the complexity of the ResNet architecture
and the validation accuracy for activity recognition in office environments?
– RQ 1.1: With what accuracy can the ResNet classify the AMI meeting office environment activities?
– RQ 1.2: How do the validation accuracy and complexity compare to the known literature and to already implemented models, the VGG16 and Inception V3 models?
– RQ 1.3: Among the models that are tested in this project, which one has
the highest performance regarding validation accuracy and what might be the
reason?
A system was developed in order to answer these questions and was used on the AMI meeting office environment activities. The results of applying our system to these activities can be seen in chapter 5.3, which answers our first question.
The relevant results from the known literature [1] are the Inception V3 with pre-trained weights with an accuracy of 64.4%, the VGG16 with pre-trained weights with an accuracy of 64.5% and lastly the VGG16 with pre-trained weights and LSTM with an accuracy of 88%. Comparing these results with the results gathered in chapter 5.3, we can see that ResNet surpasses the accuracies reported in the known literature.
To answer our third research question, RQ 1.3, three different models were tested, namely ResNet, VGG16 and Inception V3. For this, only the audio part of the dataset was used, and the results can be found in chapter 5.3.2. The results show ResNet as a clear winner with 78% accuracy compared to 66% and 50% for Inception V3 and VGG16 respectively.
7.1 Contribution
This thesis brings forth some useful information regarding the classification of human
activities, more specifically human activities within an office environment, using the ResNet
architecture.
To continue the work of this thesis the next step would be to expand the dataset to include
more specific and different activities within an office environment. An example of this
could be the activity of working at a computer to see how the architecture would handle
more specific activities.
References
[1] Karl Casserfelt & Radu-Casian Mihailescu. “An investigation of transfer learning
for deep architectures in group activity recognition”. In: (2019). url: https://h-suwa.github.io/percomworkshops2019/papers/p58-mihailescu.pdf.
[2] François Chollet et al. Keras. https://keras.io. 2015.
[3] Jiequn Han, Arnulf Jentzen, and Weinan E. “Solving high-dimensional partial dif-
ferential equations using deep learning”. In: Proceedings of the National Academy
of Sciences 115.34 (2018), pp. 8505–8510. issn: 0027-8424. doi: 10.1073/pnas.1718942115. eprint: https://www.pnas.org/content/115/34/8505.full.pdf.
url: https://www.pnas.org/content/115/34/8505.
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification
with Deep Convolutional Neural Networks”. In: Neural Information Processing Sys-
tems 25 (Jan. 2012). doi: 10.1145/3065386.
[5] “A Review on Human Activity Recognition Using Vision-Based Method”. In: Journal
of Healthcare Engineering 25 (Jan. 2017). doi: 10.1155/2017/3090343.
[6] M. Babiker et al. “Automated daily human activity recognition for video surveil-
lance using neural network”. In: (Nov. 2017), pp. 1–5. doi: 10.1109/ICSIMA.2017.
8312024.
[7] P. Tzirakis et al. “End-to-End Multimodal Emotion Recognition Using Deep Neural
Networks”. In: IEEE Journal of Selected Topics in Signal Processing 11.8 (Dec. 2017),
pp. 1301–1309. issn: 1932-4553. doi: 10.1109/JSTSP.2017.2764438.
[8] S. Whitehouse et al. “Recognition of unscripted kitchen activities and eating be-
haviour for health monitoring”. In: (Oct. 2016), pp. 1–6. doi: 10.1049/ic.2016.
0050.
[9] Thinagaran Perumal et al. “IoT based activity recognition among smart home res-
idents”. In: 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE)
(2017). doi: 10.1109/gcce.2017.8229478.
[10] Roshan Singh, Alok Kumar Singh Kushwaha, and Rajeev Srivastava. “Multi-view
recognition system for human activity based on multiple features for video surveil-
lance system”. In: Multimedia Tools and Applications (Jan. 2019). issn: 1573-7721.
doi: 10.1007/s11042-018-7108-9. url: https://doi.org/10.1007/s11042-018-
7108-9.
[11] Daniel Gatica-Perez et al. “Audio-Visual Speaker Tracking With Importance Particle
Filters”. In: (Aug. 2004).
[12] /@vincent.fung13. An Overview of ResNet and its Variants. July 2017. url: https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035.
[13] Simon Haykin. Neural Networks and Learning Machines. Pearson Prentice Hall,
Third Edition. Clarendon Press, 2009. isbn: 978-0-13-129376-2.
[14] Jason Brownlee. “Supervised and Unsupervised Machine Learning Algorithms”. In:
Machine Learning Mastery (Sept. 2016). url: https://machinelearningmastery.
com/supervised-and-unsupervised-machine-learning-algorithms/.
[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
isbn: 978-0-13-129376-2.
[16] Understanding LSTM Networks. url: https://colah.github.io/posts/2015-08-
Understanding-LSTMs/.
[17] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural
computation 9.8 (1997), pp. 1735–1780.
[18] Simonyan et al. “Very Deep Convolutional Networks for Large-Scale Image Recogni-
tion”. In: arXiv.org (Apr. 2015). url: https://arxiv.org/abs/1409.1556.
[19] Simonyan et al. “Very Deep Convolutional Networks for Large-Scale Image Recogni-
tion”. In: arXiv.org (Apr. 2015). url: https://arxiv.org/abs/1409.1556.
[20] Advanced Guide to Inception v3 on Cloud TPU | Cloud TPU | Google Cloud. url:
https://cloud.google.com/tpu/docs/inception-v3-advanced.
[21] Leonardo Araujo dos Santos. Residual Net. url: https://leonardoaraujosantos.
gitbooks.io/artificial-inteligence/content/residual_net.html.
[22] “Deep Residual Learning for Image Recognition”. In: (10 Dec 2015). url: https://arxiv.org/abs/1512.03385.
[23] Li Fei-Fei. “Imagenet large scale visual recognition challenge”. In: (Dec 2015). url:
https://arxiv.org/pdf/1409.0575.pdf.
[24] kcct-fujimotolab. “Fujimoto laboratory in Kobe City College of Technology”. In: (Feb
6, 2017). url: https://github.com/kcct-fujimotolab/3DCNN.
[25] Krishanu Sarker et al. “Towards Robust Human Activity Recognition from RGB
Video Stream with Limited Labeled Data”. In: 2018 17th IEEE International Con-
ference on Machine Learning and Applications (ICMLA) (2018). doi: 10.1109/icmla.2018.00029.
[26] Peter Haubrick and Juan Ye. “Robust Audio Sensing with Multi-Sound Classifica-
tion”. In: 2019 IEEE International Conference on Pervasive Computing and Com-
munications (PerCom) ().
[27] J.f. Nunamaker and M. Chen. “Systems development in information systems re-
search”. In: Twenty-Third Annual Hawaii International Conference on System Sci-
ences (). doi: 10.1109/hicss.1990.205401.
[28] Pavlo M. Radiuk. “Impact of Training Set Batch Size on the Performance of Con-
volutional Neural Networks for Diverse Datasets”. In: Information Technology and
Management Science 20 (Dec. 2017). doi: 10.1515/itms-2017-0003.
[29] Building powerful image classification models using very little data. https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html. Accessed: 2019-05-18.