EE210 Project Report
Colin Togashi
Meng-Hao Li
Jack Shue
Gabriel Fernandez
Adaptation & Learning, University of California, Los Angeles
Professor A. H. Sayed, UCLA, email: sayed@ucla.edu
This project was done for UCLA's Electrical Engineering Department's Adaptation & Learning class under the supervision of Dr. Ali Sayed. If made public, rights to the photos need to be purchased.
First release, March 2017
Contents
1 Introduction
1.1 Motivation
1.1.1 Kaggle
1.2 Objective
1.3 General Approach
1.3.1 Assumptions
1.3.2 References

2 Theory Overview
2.1 Neural Networks
2.1.1 Perceptrons
2.1.2 Network Training
2.2 Convolutional Networks
2.2.1 Masks

3 Initial Approaches
3.1 Neural Networks
3.1.1 Gaussian Distribution Test
3.1.2 Step Size µ

4 Convolutional Neural Net

5 In Practice
5.1 Fish & Datasets
5.1.1 Curse of Dimensionality
5.2 Weaving Nets
5.3 Challenges
5.3.1 Computational Limits
5.4 Compromises
5.5 Algorithm Adjustments
5.5.1 Architecture 1
5.5.2 Architecture 2
5.6 Network Architecture

6 Results & Thoughts

7 Feedback & Experience

8 Bibliography
8.1 References
1. Introduction
1.1 Motivation
Illegal, unreported, and unregulated fishing practices account for nearly 60% [1] of all tuna caught around the world. If this trend continues, the half of the world's population that depends on seafood may be in danger, as current fishing practices threaten to destroy the earth's fragile marine ecosystem. The Nature Conservancy aims to utilize technology to preserve fisheries and protect nature for future generations [2].
With advancements in image recognition, The Nature Conservancy wants to start deploying cameras to monitor fishing activities and increase compliance by accounting for underreported catches. The hardware and electronics are ready for mass deployment; however, the cost of processing the massive amounts of data is prohibitive. The Nature Conservancy is reaching out to the data science community to reduce that cost through image processing and classification. With an algorithm that can correctly identify the fish in each picture, countries will be able to redirect resources to address issues affecting marine life. Machine learning will help us learn more about marine life as well as maintain a healthy balance in its ecosystem.
1.1.1 Kaggle
This problem comes from a machine learning competition site that hosts real-world problems solved by algorithms such as the one we will be discussing. One of the harder parts of theory is putting it into practice, as you will see in this report. Professionals in the field enter Kaggle competitions, so we have our work cut out for us. If you are interested in entering a competition, follow this link: https://www.kaggle.com/
1.2 Objective
The Nature Conservancy will provide a limited dataset of the type of photos that will be
taken on top of boats. It will be our job to develop an algorithm to predict the likelihood of
fish species in each picture. There are eight target categories that are available in the dataset:
Albacore tuna, Bigeye tuna, Yellowfin tuna, Mahi Mahi, Opah, Sharks, Other (fish present
but none in the category), and No Fish (meaning that there are no fish in the image). Each image in the given dataset has only one fish category, although there may be more than one fish in the picture.
1.3.1 Assumptions
This report is based on the assumption that the audience is familiar with the basic
concepts of neural networks and convolutional networks. We will only be focusing on
some of the basics, analogies, and smaller revelations that we have experienced throughout
lectures and our own research outside of class. Our purpose isn’t to reteach the material. It
is to cover an overview on some of the many things we’ve learned and to provide insight to
approaches through what we’ve found in our own experiences. For more information please
refer to Dr. Ali Sayed’s book: Adaptation, Learning, and Optimization over Networks,
Foundations and Trends in Machine Learning.
Figure 1.1: Simplified diagram of how the algorithm works with the categories of fish being identified. A raw image is fed through the 1st CNN to find only the fish segment of the image. Then, the segmented fish is fed through a 2nd CNN to classify the type of fish. This method reduces the complexity of the problem that each CNN must solve. [3], [4]
1.3.2 References
Different references will be listed throughout the report. For the specific references, please refer to the bibliography section. However, we want to preface that the vast majority of the report is based on Dr. Ali Sayed's books and notes. Due to the vast number of references, it is best to assume that anything without a direct footnote has come from Dr. Ali Sayed.
R All the algorithms were built, tested, and analyzed using Matlab, as requested by Dr. Ali Sayed. The way we talk about the algorithms will be framed with this in mind. For more information you can visit their website: https://www.mathworks.com/

R The code that this report is based on is provided in a file separate from this report.
1.3 General Approach 9
R © 2017. All Rights Reserved. No part of unreferenced sections of this report can be reproduced, posted, or redistributed without written consent from Professor A. H. Sayed (UCLA, email: sayed@ucla.edu).
2. Theory Overview
2.1.1 Perceptrons
The brain is not fully understood, but the scientific community presently has enough knowledge to recreate part of its functionality through neurons. Neurons make up the brain and are all interconnected. They receive inputs, and after a large enough impulse they fire out a signal to all the neurons connected to them. Each of these neurons can be viewed as a perceptron.

In terms of the algorithm, each neuron is represented by a perceptron. The perceptron receives an input, and its output signal is essentially a linear combination of that input with given weights, mapping the input to the output:
$$z = h^T w - \theta \tag{2.1}$$
Equation 2.1, at a very basic level, represents the underlying simplicity of neural networks: $z$ is just a linear combination of the received inputs, $h^T w$, with θ representing a bias. After going through this linear mapping, $z$ is fed through a nonlinearity called the activation function. Using the biological analogy again, the activation function represents the excitation threshold a neuron needs to reach before firing its own signal. Other arguments have been made for different types of activation functions and why they're needed at all.

One interesting argument is that the world is nonlinear, and therefore it is useful to introduce some nonlinearity into the mapping. This seems like a plausible argument. One of the hardest parts of networks is generalization to datasets that the network hasn't trained on before. A pure weighted linear combination would perform well on data it has seen before, but once unfamiliar data is introduced, the likelihood that it will do well is low. Thus, by introducing nonlinearities, the mapping is no longer linear, which helps with generalization to data not seen before by the network. Below is a perceptron with one common type of activation function called a sigmoid function.
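As a concrete illustration, the following Matlab sketch evaluates one perceptron with a sigmoid activation; the numeric values are our own examples, not taken from the notes.

% One perceptron with a sigmoid activation (illustrative values).
h = [0.5; -1.2; 3.0];          % feature vector (input)
w = [0.1; 0.4; -0.2];          % combination weights
theta = 0.05;                  % bias
z = h' * w - theta;            % linear combination, equation (2.1)
y = 1 / (1 + exp(-z));         % sigmoid activation squashes z into (0, 1)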
Figure 2.1: A biological representation of the neuron beside a perceptron. As can be seen, the perceptron aims to mimic the function of a neuron by sharing a similar structure. [5]
The last layer uses a final activation function that allows for some interpretation. In our case, for the first convolutional network, the last layer may tell us whether there is a fish or not. In the second convolutional network, it can be interpreted as the likelihood that the fish belongs to one of the eight categories given by The Nature Conservancy. This is all under the assumption that the neural network has the right weighting factors to map one high-dimensional space to another correctly.
Figure 2.2 depicts exactly how the linear mappings of perceptrons interact with each other. $h$ is the feature vector, the numerical representation of the input. Each yellow circle represents a perceptron. In many constructions of these networks, there can be hundreds if not thousands of these nodes per layer.
Network training will be discussed in more detail later in the report. The main idea here is to find the weights that minimize this cost, hence providing valid weights to map from the feature vector space to the actual labeling space. The bias, θ, also plays an important role when trying to minimize the cost function.
$$J_{emp}(W, \theta) \triangleq \sum_{l=1}^{L-1} \rho \|W_l\|_F^2 + \frac{1}{N}\sum_{n=0}^{N-1} \|\gamma_n - y_n\|^2 \tag{2.2}$$
There are many different algorithmic approaches to train the network properly and arrive at the best weights and biases. The main idea is to algorithmically descend the cost function with stochastic-gradient descent; in other words, finding the gradient at a certain point and moving in the direction opposite to it. Further on in the report we will delve into more details and share our experiences and approaches that were applicable to solving the fish classification problem.

Equation 4.6 shows the recursive gradient descent algorithm. It will be used to descend the cost function to find the weights and bias. This recursive formula describes the descent of the cost function in terms of its gradient, a constant B, the step size µ, and the input argument you are taking the gradient with respect to. The same recursion can be used for the bias with little change to the algorithm. Whenever we refer to gradient descent, we mean this stochastic-gradient recursion.
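To make the recursion concrete, here is a minimal Matlab sketch of one stochastic-gradient pass for the single sigmoid perceptron above under a least-squares cost. It assumes a feature matrix h, labels gamma, and initial w and theta already exist; the variable names are ours, not the notes'.

% One stochastic-gradient pass over N training samples (illustrative).
mu = 0.01;                                   % step size
for n = 1:N
    z = h(:, n)' * w - theta;                % forward pass, equation (2.1)
    y = 1 / (1 + exp(-z));                   % sigmoid output
    e = gamma(n) - y;                        % prediction error
    grad_w = -2 * e * y * (1 - y) * h(:, n); % gradient of (gamma - y)^2 w.r.t. w
    grad_t =  2 * e * y * (1 - y);           % gradient w.r.t. theta (dz/dtheta = -1)
    w = w - mu * grad_w;                     % move opposite the gradient
    theta = theta - mu * grad_t;
end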
2.2.1 Masks
The idea of a mask is that you take a pixel and the corresponding pixels around it and convolve them with a matrix of certain values, the mask. By doing this, certain features become more prominent. For example, the color of the raspberry in the raspberry pie example becomes more noticeable, or its circular shape becomes more apparent. The term convolution comes from convolving the mask with each pixel and its surrounding pixels. What we mean when we say convolve is akin to correlate. This can take into account the spatial relations of certain colors and edges. Images with similar correlation patterns after masking should come from similar original images. Let's take a look at a pixel patch, H, and mask, W.
$$\mathcal{H} = \begin{bmatrix} \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix}, \qquad \mathcal{W} = \begin{bmatrix} \times & \times & \times \\ \times & \boxed{\times} & \times \\ \times & \times & \times \end{bmatrix} \tag{2.5}$$
The boxed element in W represents the pixel on which we center the mask. We could
increase the length and width of the mask and pixel area, depending on the problem and
computational power.
Figure 2.3: An edge detection mask overlaid on a fish image to emphasize the shape of the fish [6]

Looking at Figure 2.3, we see that with a certain mask type, in this case an edge detector, the edges are much more visible and much of the unnecessary information is reduced to zeros. This can greatly reduce the noise factor and assist in creating a good set of features.
$$\text{corr}(\mathcal{H}, \mathcal{W}) \triangleq \sum_{k=-K}^{K} \sum_{l=-K}^{K} \mathcal{H}(k, l)\, \mathcal{W}(k, l) \tag{2.6}$$
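The short Matlab sketch below applies a 3x3 edge-detection mask by 2-D correlation, in the spirit of equation 2.6 and Figure 2.3. The mask values are a standard Laplacian-style example of our own choosing, and coins.png is a sample image that ships with the Image Processing Toolbox.

% Apply a 3x3 edge-detection mask by sliding-window correlation.
% filter2 computes correlation directly; conv2 would flip the mask first.
img = im2double(imread('coins.png'));
W = [-1 -1 -1;
     -1  8 -1;
     -1 -1 -1];                       % emphasizes intensity changes (edges)
edges = filter2(W, img, 'same');      % correlate mask with every pixel patch
imshow(edges, []);                    % bright where the image has sharp edges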
When doing this for different masks or colors, we end up with different features of the
image becoming much more prominent. As in the neural network section we want to create
a linear mapping to a feature map and introduce nonlinearities into the process for the same
reasons discussed earlier.
In Figure 2.4 we see that the construction is very similar to neural networks. These
feature vectors then can be further compressed with pooling. Just as in neural networks we
want to find the best weights and biases that will reduce a cost function that we define. The
way we go about finding these is a similar process of finding the cost function derivative and
moving in the opposite direction. By using essentially the same method, we can stochastically descend the cost function until we reach a minimum that would, in theory, reduce the error while still being able to generalize. The result of this whole process is a set of input features that is representative of whatever is being classified.

Figure 2.4: (Left) RGB channel filters mapped to a filtered image. (Right) Vector representation
3. Initial Approaches

3.1 Neural Networks

3.1.1 Gaussian Distribution Test

For sanity checks and initial testing, we made a very simple self-generated test using two Gaussian distributions. Here we can generate as many labeled data examples as we want and play around with this toy example to gather intuition for what happens with certain parameters. It also reduces the problem to the R2 plane, allowing for relatively fast computation. Here we could also do some analysis to understand how to set up the more complex algorithms.
In the following graphs and analysis of the toy Gaussian distribution example, we use 2000 training data points. The example is set up exactly as laid out in Dr. Ali Sayed's notes, with implementations of stochastic-gradient training, cross-entropy training, and softmax training specifically for neural networks. We also dig into the learning curves, and the testing data is then classified by the trained neural network.
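A minimal Matlab sketch of how such a toy dataset can be generated follows; the means and labels are our own choices, not the exact values from the notes.

% Generate a two-class Gaussian toy dataset in the R^2 plane.
N = 2000;                               % training samples, as used below
mu1 = [1; 1];  mu2 = [-1; -1];          % class means; move them closer or
                                        % farther apart to vary difficulty
X = [mu1' + randn(N/2, 2);              % class +1 cloud
     mu2' + randn(N/2, 2)];             % class -1 cloud
gamma = [ones(N/2, 1); -ones(N/2, 1)];  % labels
scatter(X(:,1), X(:,2), 10, gamma, 'filled');   % visualize both clouds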
Figure 3.1: This figure depicts the two separate Gaussian distributions in a plane. We moved them closer together and farther apart to see how accurately our classifier would work.
A step size that is too small leads to slow learning, requiring much more time, computation, and possibly more samples. There is an optimal step size that we should be searching for; in this case it is µ = 0.01. Of course, with multiple parameters to tune, this can turn into a multidimensional search for the best combination of tuning parameters. For the most part, people in the field test out many different parameter values on training data to find the best combination.
Figure 3.2: (Top) With µ = 0.001, the step size decreases the cost function but takes a long time to get there. (Middle) With µ = 0.01, the optimal choice, the cost function immediately drops and the network classifies quite well. (Bottom) With µ = 0.1, the cost function starts to thrash and misclassification occurs as the cost function increases.
4. Convolutional Neural Net
Figure 4.1: Toy images of a vertical bar (left) and of a horizontal bar (right) used to run initial tests on the convolutional network. These allowed us to iterate on our algorithm at a fast rate while still presenting the CNN with a small classification challenge.
For the first convolutional neural network, we wanted to do a few sanity checks by starting with simple, small images with either a vertical or horizontal line, as seen in Figure 4.1. Just like in the Gaussian distribution example, we wanted to create an experiment we could easily and quickly control. While seemingly trivial, these tests allowed us to iterate quickly and fix issues in the algorithm in a relatively low dimensional problem. Then, we could move to the higher dimensional images with some level of confidence that the underlying algorithms were functioning properly.
This simple test case enabled us to test the inner workings of the convolutional network
at a fast pace. Since each image was only 16x16, we could quickly execute training runs,
adjust parameters, and plot the cost function. We used this in part to gain some intuition
about different cost functions and initial conditions. We played around with softmax and cross-entropy, then reasoned about the results by reading papers and articles.
One of the more spectacular occurrences was that the convolutional neural network still classified correctly with a bug inside it. On the backward pass between the neural network and convolutional network, the indexing prevented a number of sensitivities from passing upstream. Despite this, the convolutional network still trained the remaining weights and was still able to converge about 70% of the time. This bug was only found when the code was optimized to allow the move to larger images.
The result of this test was 100% correct, which shows the network is able to use a small amount of data to classify the Google icon and the Facebook icon. An interesting point: in the Neural Network handout, the bias coefficients, θ, are initialized with a Gaussian distribution with zero mean and variance one, and the combination weights are initialized with a Gaussian distribution with zero mean and customized variance $\frac{1}{\sqrt{n_l}}$, which is a typo and should be $\frac{1}{n_l}$, where $n_l$ is the depth or node count of layer $l$. However, we found that when the variance of the initialization of the bias coefficients equals 1, the output of the CNN saturates to a constant, but if the variance is 0.1, the performance improves considerably. In addition, for the initialization of the combination weights, we found that a variance of $\frac{1}{n_l^2}$ performs better than $\frac{1}{\sqrt{n_l}}$ or $\frac{1}{n_l}$. This means that, in our CNN, the initial values of the bias coefficients and the combination weights need to be small.
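A hedged Matlab sketch of the initialization that worked for us is below; nl and nl_prev are example layer sizes of our own, and a Gaussian with variance v is drawn as sqrt(v)*randn.

% Layer-l initialization that avoided saturation in our tests (example sizes).
nl = 64;                                 % depth/node count of layer l
nl_prev = 32;                            % node count of the previous layer
theta = sqrt(0.1) * randn(nl, 1);        % bias variance 0.1 instead of 1
W = sqrt(1 / nl^2) * randn(nl, nl_prev); % weight variance 1/nl^2, zero mean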
To see how the CNN works, we reduced the variance of the Gaussian noise to 0.05 and generated the noisy images. Then we extracted the feature maps before and after the rectifier activation function. Take the Facebook icon and the feature maps of the first two layers, for example (Figures 4.4 & 4.5). From the feature maps in the first layer, we can see that the CNN separates the useful information from the noise appropriately.

Figure 4.5: Feature maps before (upper) and after (lower) the rectifier activation function in the first layer
At the beginning of the 6-icons test, we used the same variances for the initialization of the bias coefficients and the combination weights, i.e. 0.1 and $\frac{1}{n_l^2}$. However, it turned out that the probability of each icon predicted by the network was equally $\frac{1}{6}$, which means the network just made a random guess. To get insight for tuning the variances, we first looked into the computation of the outputs in the last layer of the neural network. We found that the values of the bias coefficients were about ten times larger than the values of the combination weights. Therefore, we decided to decrease the variance of the bias coefficients to 0.01 and increase the variance of the combination weights by weakening its relation to the node/depth number $n_l$. The result was that the network then always predicted one icon with probability one. From these two classification results, we noticed that they are two opposite extremes, and thus there might be some values of the variances between the two extremes such that the network will learn well. Finally, we used a bias-coefficient variance equal to 0.01 and a combination-weight variance equal to $\frac{1}{n_l^{1.65}}$. It is quite amazing because we only needed 360 training samples and 4 folds to get a good network to classify the 6 icons. This setting gave us a successful prediction with an 87.5% correct classification rate. From this test, we gained the intuition that a good initialization keeps the network from falling into an undesired local minimum and lets it arrive near the global minimum.
Here we present the feature maps (Figure 4.7), before and after the rectifier activation function, of the Twitter icon image in the first two layers. They show that even though the data is heavily corrupted by noise, the network can still classify it as the Twitter icon. Furthermore, it successfully extracts the useful information. For example, the 8th image of the feature maps before the activation function in the second layer has a bird-shaped outline in the middle of the image.

Figure 4.7: The network can extract useful information from the noisy Twitter icon (top). For example, the 8th image of the feature maps before the activation function in the second layer has a bird-shaped outline in the middle of the image.
One drawback, discussed below, involves the output of softmax. It is still nice to have a proper probability distribution over the categories, though. The input is classified as the category whose node has the largest value in the output layer.

$$y_n(q) \triangleq e^{z_n(q)} \left( \sum_{k=1}^{Q} e^{z_n(k)} \right)^{-1} \tag{4.1}$$
The cross-entropy cost has its own distinct benefits. Since its cost function has a logarithmic term as the non-regularized term, it cancels out the plateau that is present with the softmax last layer, thereby speeding up the learning [7]. In other words, since softmax training uses derivatives of the activation functions, you start to see saturation, which leads to poor learning. This is no longer the case for the logarithmic cross-entropy cost function. The only requirement is that the last layer needs to contain sigmoid activation functions to properly implement cross-entropy. As a side note, you don't want to use sigmoids throughout the entire network because that would produce a vanishing gradient.
$$J_{emp}(W, \theta) \triangleq \sum_{l=1}^{L-1} \rho \|W_l\|_F^2 - \frac{1}{N}\sum_{n=0}^{N-1} \sum_{q=1}^{Q} \ln\!\left( y_n(q)^{\gamma_n(q)} \left(1 - y_n(q)\right)^{1-\gamma_n(q)} \right) \tag{4.7}$$
The following graphs show the three different ways to train networks, applied to the over-simplified Gaussian distribution example above. With such a trivial example it's hard to see the finer nuances, since the curves look very similar. If you look closely at the bottom graph of Figure 4.8, at the very beginning cross-entropy descends faster than stochastic descent despite starting at a higher value. If you were to measure gradients, you would indeed see that cross-entropy drops the fastest in the beginning, meaning that it learns from the data at a faster rate. We see this much more pronounced in higher dimensional problems. The other thing to note is that we don't expect the cost functions to converge to the same value; the value simply depends on how each cost function is defined, and the cost functions for these three methods are different, especially if you choose coefficients to adjust the weight of certain terms.

Figure 4.8: (Left) Neural network ensemble average with smaller µ = 0.005. (Right) Neural network ensemble average with slightly bigger µ = 0.05.

Figure 4.9: (Left) Convolutional neural network Jemp at 240 iterations. (Right) CNN learning slope for 240 iterations.
In the final figure, Figure 4.10, you can really see the steepness of the cross-entropy learning slope. At the very beginning it is extremely high, after which it drops. This implies that the learning rate is much higher in the beginning, whether that be good or bad learning. As a side note, just because the cost converges somewhere doesn't mean the network converges to a good classifier. The idea of cross-entropy is to learn at a faster rate, allowing for less computation time and less processing of data.
Figure 4.10: (Left) Convolutional neural network Jemp at 60 iterations. (Right) CNN learning slope for 60 iterations.
Figure 4.11: Perturbing weight $w_j^{(d,2)}$

Suppose we perturb a particular element of $w_j^{(d,2)}$, say $(w_j^{(d,2)})(\alpha, \beta)$, to $(w_j^{(d,2)})(\alpha, \beta) \pm \varepsilon$. We apply both the positive and the negative perturbation to the $(\alpha, \beta)$ entry of $w_j^{(d,2)}$, and the results are $(w_j^{(d,2)})_{+\varepsilon}$ and $(w_j^{(d,2)})_{-\varepsilon}$ respectively. For an incoming feature vector and label, $(h, \gamma)$, the network will propagate forward and give two different outputs, $y_{+\varepsilon}$ and $y_{-\varepsilon}$.
$$\frac{\partial \|\gamma - y\|^2}{\partial W_c} = (\mathbb{1}_{D_c}^T \otimes H_{n,c-1})\,\Delta_{n,c} \tag{4.10}$$
Figure 4.12 above shows figuratively what is meant by perturbing a certain weight and getting a resulting effect in y. This makes sense: since everything is interconnected, we expect to see some change in the output unless there is an entire layer of zero weighting or something highly unusual that would zero out the effect of the perturbation on the output. As we see in the following figure, the numerical checking values are quite close. This confirms that our algorithm and setup work.
One smaller issue we ran into when calculating the numerical gradient was the lower limit of the central difference approximation. The smallest ε value we can use is about $1 \times 10^{-5}$; as we approach that limit, the gradients start to give larger and larger errors for the perturbation. Once we fixed that, we were able to use the same method for convolutional neural networks. The maximum difference in the numerical check turned out to be on the order of $1 \times 10^{-9}$.
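A Matlab sketch of the central-difference check we describe follows; forwardPass and analyticGrad are hypothetical placeholders for a function returning the cost for a given weight matrix W and for the backpropagated gradient.

% Central-difference numerical gradient check (illustrative).
eps = 1e-5;                              % below this, round-off error dominates
numGrad = zeros(size(W));
for i = 1:numel(W)
    Wp = W;  Wp(i) = Wp(i) + eps;        % positive perturbation of one weight
    Wm = W;  Wm(i) = Wm(i) - eps;        % negative perturbation
    numGrad(i) = (forwardPass(Wp) - forwardPass(Wm)) / (2 * eps);
end
maxDiff = max(abs(numGrad(:) - analyticGrad(:)));   % we saw ~1e-9 here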
Figure 4.13: (Top) Numerical gradient checking. (Bottom) The specific structure we are checking.
5. In Practice
5.1 Fish & Datasets

Imagenet offers large libraries of labeled images, including fish and poop decks, that we could use for this task. This intuitive approach is based on the fact that for the first convolutional neural network, which classifies whether there is a fish or not, you just have to know what a fish is. That means Imagenet's library can be used to augment the training data with much cleaner data.
Figure 5.1: (Top) This picture is one of the few whose quality makes it hard to classify. (Bottom) This is representative of the decent quality of some pictures.
The following figure, Figure 5.2, shows the types of data in Imagenet. The one on the far left depicts how the images in the library appear. If time were not a factor, we could very well have used RGB images (color images with three separate intensity layers of information in the red, green, and blue additive primaries). Feeding in one RGB image is computationally costly; it's almost like inputting three separate images. Since the first convolutional neural network only classifies whether there is a fish or not, we concluded that simplifying the problem would help us meet our project deadline without giving up classification performance. After a few discussions, we decided that reducing the images to gray-scale was good enough.
The other big question was how to deal with images of differing sizes. We initially decided to use uniform noise in the background while randomizing where the image sat in the plane. For example, in an R2 space we filled the frame with noise and randomly placed the center of the image somewhere in this space. The result is the center image in Figure 5.2.
We thought the randomization of location and the uniform noise would keep the network from classifying based on the images' dimension edges. This may have been one of the contributing factors to
non-converging cost functions. Despite the randomization, it may have created unnecessary issues due to the sheer size of the feature vectors (512x512 = 262,144 features at the input, 256x256x20 = 1,310,720 at the second layer, 128x128x40 = 655,360 at the third, etc.). The computational challenges we faced are all detailed later in Section 5.3. In addition, the academic papers and resources we found elsewhere described much larger architectures for a smaller problem of 224x224 images [8]. Due to this, we attributed a portion of the error to the CNN under-fitting, as we had much shallower depths for a larger problem.

Figure 5.2: (Left) Original types of images pulled from Imagenet [4]. (Center) A technique to deal with the dimensionality issue. (Right) A second method of dealing with dimensionality: cropping ideal images of fish.
After running the CNN with different parameters and still yielding a poor classifier, we decided to reduce the dimensions again by shrinking the image size. This time, we manually cropped the photos to be as close to a 280x280 image as possible while still capturing the entire fish. Then, we used Matlab to scale the images slightly so that they were all uniform 280x280 images for the CNN to classify.
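A minimal Matlab sketch of this preprocessing step follows; the folder names are hypothetical.

% Convert cropped photos to gray-scale and rescale to a uniform 280x280.
files = dir(fullfile('cropped_images', '*.jpg'));    % hypothetical folder
for k = 1:numel(files)
    img = imread(fullfile('cropped_images', files(k).name));
    if size(img, 3) == 3
        img = rgb2gray(img);             % drop color to cut dimensionality
    end
    img = imresize(img, [280 280]);      % force the uniform CNN input size
    imwrite(img, fullfile('preprocessed', files(k).name));
end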
By reducing the number of initial features, we could now increase the dimensionality of
our classifier to better fit the problem. While we kept a wary eye out for signs of over-fitting,
we were able to increase our depths to {48, 64, 72} on the convolutional side while reducing
the number of neural nodes to {1000, 1000}.
5.3 Challenges
The largest challenge of this classification problem is the size of the images under the given time constraint. As we mentioned earlier, processing all this information to get really accurate results would require months of training given 1240x700 pixel images. This one issue leads into all our other problems; on small toy examples, both of our architectures work great.
With the complexity and dimensionality increasing, there is also an issue of parameter sensitivity. We noticed with smaller images that a large range of parameters and methods converged to a good classifier. However, with much larger images and dimensions, some of the parameters would converge to weak classifiers. Unfortunately, we ran out of time to try the more effective bagging and boosting methods.

Another nuanced challenge was deciding which source to listen to. There are many free online materials with different points of view and techniques. The class lecture material and notes formed a good foundation to build on, but there was definitely a gap that needed to be bridged in terms of intuition and knowledge. In many ways this was challenging, but it was a good learning exercise, which is probably why Dr. Sayed set the class up this way.
5.4 Compromises
Our goal was, and in many ways still is, ambitious. Due to time restrictions, we had to compromise on many fronts. One such compromise, mentioned in Section 5.1, was the dimension of our training data. Not only did we move from color to gray-scale, but we also reduced the image size from 512x512 to 280x280. Even though reducing the dimension of the image prevented us from classifying larger images, we were able to increase our layer depth and the overall performance of the network. It also helped ease the memory issues and reduce the number of computations required for each sample.
Another compromise was that we had to cap the number of samples that any given set of parameters was trained on. The performance of the classifier and its rate of convergence were highly dependent on the parameters and initial conditions. Since images at one point took up to 5 minutes each to run, we simply did not have enough time to let the networks run for long before stopping them. Thus, we were forced to check the performance of each classifier given a training size of 400-800 samples before starting with a new parameter set or initial condition. We were almost always at odds between letting a particular weight set train more or simply resetting the training.
Another area where we compromised was cross-validation tests to search for better parameters. We would also have liked to use bagging and boosting; for such a complex problem, we are very interested in seeing how much improvement we could get from these techniques. There were also techniques for handling images of different sizes, which required some background knowledge in support vector machines and the bag-of-words approach [9], but again, given the time, we had to limit ourselves. We will continue the project after the finish date on our own accord; after all, it is for a good cause. The limiting factor in all of these cases was time. This could be said about learning in all fields: the horizon of knowledge can never be reached. If we keep our learning steps constant, we will always learn [10].
5.5.1 Architecture 1
The 1st CNN was implemented directly from the algorithm in Figure 5.3.
Figure 5.3: Stochastic Gradient Descent Algorithm for Convolutional Neural Networks
In this CNN, an alternating pattern of rectifier linear units and scaled tanh functions was used to allow for fast training while preventing over-saturation. The combination of cascading the two different types of activations proved experimentally more stable than either alone. The activation functions for the neural network were set to rectifiers for the first two layers and sigmoids for the last layer for cross-entropy training. At first, the network used softmax training on the last layer, but it was switched to cross-entropy training after experimental results showed a slightly faster convergence rate. Other important details of this CNN were 3x3 convolutional masks and 2x2 pools at all layers. The network was initialized with zero mean and a variance of 0.01, while the training weights were zero mean and normalized by the square root of the number of depths/nodes, respectively [8].
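For reference, the key settings described above and in Section 5.2 can be summarized in a small Matlab struct; this is our own encoding of the text, not code from the project.

% Architecture 1 summary (our own encoding of the settings in the text).
arch1.maskSize    = [3 3];                 % 3x3 convolutional masks at all layers
arch1.poolSize    = [2 2];                 % 2x2 pooling at all layers
arch1.convDepths  = [48 64 72];            % convolutional depths from Section 5.2
arch1.nnNodes     = [1000 1000];           % neural network layer widths
arch1.activations = {'relu', 'scaled tanh'};      % alternated on the conv side
arch1.lastLayer   = 'sigmoid + cross-entropy';    % after switching from softmax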
There were a few changes, though, to reduce memory/processing requirements or for easier implementation. The first such adjustment was to the partition function denoted by equation 5.1.
The main practical issue with this representation for images is the size of the partitioning transformation matrix, V. Even with reduced 280x280 images, Table 5.2 shows that V would be of size (70x70x64x9) x (70x70x64) = 2,822,400 x 313,600. Even just storing this matrix, let alone using it in a matrix multiplication, would require an extraordinary amount of memory. The condensed form of V given by equation 5.2 is better, but still requires a large amount of memory.
There are a large number of zeros in these two interpretations that end up taking unneeded space. Assuming a constant partition size, we get around this issue by taking advantage of how each element of the image is indexed. For example, the left hand argument of equation 5.3 shows a 2x2 image padded with zeros on the border. Padding ensures that a 3x3 convolutional mask will generate as many partitions as there are pixels in the image. Also, the index() function is simply a placeholder function that returns the index of each element. To keep the algorithm general, we now stack all columns of the image into a feature vector, as shown in equation 5.4. It is easy to see that the indexes on the right hand side of the equations correspond to this stacked representation.
$$\text{index}\left( \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & x_{22} & x_{23} & 0 \\ 0 & x_{32} & x_{33} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \right) = \begin{bmatrix} 1 & 5 & 9 & 13 \\ 2 & 6 & 10 & 14 \\ 3 & 7 & 11 & 15 \\ 4 & 8 & 12 & 16 \end{bmatrix} \tag{5.3}$$
$$\text{index}\left( \begin{bmatrix} 0 \\ 0 \\ \vdots \\ x_{22} \\ x_{32} \\ 0 \\ \vdots \\ x_{23} \\ x_{33} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 2 \\ \vdots \\ 5 \\ 6 \\ 7 \\ \vdots \\ 10 \\ 11 \\ 12 \\ \vdots \\ 16 \end{bmatrix} \tag{5.4}$$
Now, each element in a partition can instead be represented by its respective index. For example, the first partition is a 3x3 square in the upper left hand corner, represented by the argument on the left hand side of equation 5.5. The right hand side shows the corresponding indexes.

$$\text{index}\left( \begin{bmatrix} 0 & 0 & 0 \\ 0 & x_{22} & x_{23} \\ 0 & x_{32} & x_{33} \end{bmatrix} \right) = \begin{bmatrix} 1 & 5 & 9 \\ 2 & 6 & 10 \\ 3 & 7 & 11 \end{bmatrix} \tag{5.5}$$
$$\text{index}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ x_{22} \\ x_{32} \\ 0 \\ x_{23} \\ x_{33} \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 5 \\ 6 \\ 7 \\ 9 \\ 10 \\ 11 \end{bmatrix} \tag{5.6}$$
All the columns of the partition are then stacked to form a feature vector, while the indexes are stacked to form a partition vector, as shown by equation 5.6. The $H_n^{(d,c+1)}$ matrix can then be constructed by placing the feature vectors together. Likewise, the partition vectors can be placed side by side to create a partition matrix denoted $V_{c+1}^{in}$, as shown in equation 5.7.
$$H_n^{(d,c+1)} = \begin{bmatrix} 0 & 0 & 0 & x_{22} \\ 0 & x_{22} & 0 & x_{32} \\ 0 & x_{32} & 0 & 0 \\ 0 & 0 & x_{22} & x_{23} \\ x_{22} & x_{23} & x_{32} & x_{33} \\ x_{32} & x_{33} & 0 & 0 \\ 0 & 0 & x_{23} & 0 \\ x_{23} & 0 & x_{33} & 0 \\ x_{33} & 0 & 0 & 0 \end{bmatrix}, \qquad V_{c+1}^{in} = \begin{bmatrix} 1 & 5 & 2 & 6 \\ 2 & 6 & 3 & 7 \\ 3 & 7 & 4 & 8 \\ 5 & 9 & 6 & 10 \\ 6 & 10 & 7 & 11 \\ 7 & 11 & 8 & 12 \\ 9 & 13 & 10 & 14 \\ 10 & 14 & 11 & 15 \\ 11 & 15 & 12 & 16 \end{bmatrix} \tag{5.7}$$
Each element of the new $H_n^{(d,c+1)}$ at each layer can then simply be found by taking the index of $t_n^{(d,c)}$ at that same position, as shown by equation 5.8.

$$H_n^{(d,c+1)}(i, j) = t_n^{(d,c)}\!\left( V_{c+1}^{in}(i, j) \right), \quad i = 1, \ldots, S_c;\; j = 1, \ldots, P_{c+1} \tag{5.8}$$
In this way, we can construct $H_n^{(d,c+1)}$ using only an $(S_c \times P_{c+1})$ matrix. This reduces the memory storage by a large factor (in the 280x280 case, by a factor of 313,600), as the mask size is constant and there are only as many elements in $V_{c+1}^{in}$ as absolutely necessary. Thus, by keeping a matrix of indexes that correspond to the partition elements of the original image or feature vector, we save most of the space. Also, since we now use array access instead of matrix multiplication, we save a number of calculations.
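A Matlab sketch of this index-based partitioning follows; the variable names are ours, and it relies on Matlab's column-major linear indexing. The index matrix V is built once, after which any padded feature map can be partitioned by pure array access.

% Build the partition index matrix once for a zero-padded image,
% then form H by array access instead of multiplying by a huge V.
n = 280;                                 % image side before padding
padded = n + 2;                          % padded side for a 3x3 mask
V = zeros(9, n * n);                     % one column of 9 indexes per partition
p = 0;
for col = 1:n                            % one partition per unpadded pixel
    for row = 1:n
        p = p + 1;
        [rr, cc] = ndgrid(row:row+2, col:col+2);   % 3x3 window coordinates
        V(:, p) = sub2ind([padded padded], rr(:), cc(:));
    end
end
% For any zero-padded feature map t of size (padded x padded):
%   H = t(V);        % (9 x n^2) partition matrix, no matrix multiplication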
In addition, as long as partitioning is done in the same fashion across images, partitions of any images of the same size will have the same partition index matrix, $V_{c+1}^{in}$. Thus, for our application with one image size of 280x280, we only need to calculate each $V_{c+1}^{in}$ once and can use them throughout the loops at little computational cost. If there were multiple sizes, images of the same size could be batched together, and we would only need to calculate as many $V_{c+1}^{in}$ matrices as there are different image sizes.
While this method employs the default indexing for matrices in Matlab, any code that follows this indexing and has constant partition sizes can use it to its advantage. This is the main benefit of generalizing the features as a vector instead of a matrix. The method can also be extended to non-constant partition sizes by using data structures that allow a variable number of rows. As long as the partitions are known ahead of time, this algorithm will reduce the memory and processing requirements.
The same method of finding indexes for each partition also applies to the permute and permute# transformations. In the case of permute, we replace the 3x3 sliding window with a non-overlapping 2x2 square over which we find indexes for the pool elements. The only difference is that when the pools are separated there is no extra padding, as the image dimensions are assumed to be divisible by 2. Thus, the pools and subsequent indexes effectively divide the number of features per layer by 4. The same concepts still apply in that we use the indexes to build an appropriate index matrix, $V_c^{pool}$, that satisfies equation 5.9.

$$t_n^{(d,c)}(i, j) = y_n^{(d,c)}\!\left( V_c^{pool}(i, j) \right) \tag{5.9}$$
For the 2x2 image example laid out in equation 5.3, the solution is trivial, as we only have one pool of the nonzero elements. Thus, we introduce a 4x4 $y_n^{(d,c)}$ in equation 5.10 to motivate a more developed solution.

$$\text{index}\!\left( y_n^{(d,c)} \right) = \text{index}\left( \begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \\ x_{31} & x_{32} & x_{33} & x_{34} \\ x_{41} & x_{42} & x_{43} & x_{44} \end{bmatrix} \right) = \begin{bmatrix} 1 & 5 & 9 & 13 \\ 2 & 6 & 10 & 14 \\ 3 & 7 & 11 & 15 \\ 4 & 8 & 12 & 16 \end{bmatrix} \tag{5.10}$$
Now, permute takes the 2x2 non-overlapping pools and gives equation 5.11. The pool function is then called on each column vector made by permute. The transpose keeps with the notation that $t_n^{(d,c)}$ should be a column vector of max values. All the same memory and processing savings are still applicable in this case as well.

$$\left( t_n^{(d,c)} \right)^T = \text{pool}\left( \begin{bmatrix} x_{11} & x_{31} & x_{13} & x_{33} \\ x_{21} & x_{41} & x_{23} & x_{43} \\ x_{12} & x_{32} & x_{14} & x_{34} \\ x_{22} & x_{42} & x_{24} & x_{44} \end{bmatrix} \right), \qquad V_c^{pool} = \begin{bmatrix} 1 & 3 & 9 & 11 \\ 2 & 4 & 10 & 12 \\ 5 & 7 & 13 & 15 \\ 6 & 8 & 14 & 16 \end{bmatrix} \tag{5.11}$$
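The following Matlab sketch carries out this pooled indexing on a small example; the variable names are ours, and the index matrix matches equation 5.11.

% 2x2 non-overlapping max pooling through a precomputed index matrix.
y = magic(4);                         % stand-in 4x4 feature map
Vpool = [1 3  9 11;
         2 4 10 12;
         5 7 13 15;
         6 8 14 16];                  % each column lists one pool's indexes
[t, argmax] = max(y(Vpool));          % t(j) = max of pool j
% Linear indexes of the winning elements, needed to route sensitivities
% back through the pool on the backward pass:
winners = Vpool(sub2ind(size(Vpool), argmax, 1:4));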
Then, looking at the permute# transformation in equation 5.12, we can see that it uses more space than necessary. Using a similar method as before, we can map entries from a pool back into $y_n^{(d,c)}$. First, we apply the same index trick as before to the 4x4 matrix argument of pool() representing the pools. More formally, the setup yields equation 5.13.
Rearranging the left hand side to get back the original matrix gives us equation 5.14.

$$\text{index}\left( \begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \\ x_{31} & x_{32} & x_{33} & x_{34} \\ x_{41} & x_{42} & x_{43} & x_{44} \end{bmatrix} \right), \qquad V_c^{permute\#} = \begin{bmatrix} 1 & 3 & 9 & 11 \\ 2 & 4 & 10 & 12 \\ 5 & 7 & 13 & 15 \\ 6 & 8 & 14 & 16 \end{bmatrix} \tag{5.14}$$
Because we are using 2x2 non-overlapping windows, we actually end up with $V_c^{permute\#} = V_c^{pool}$. For the general mapping from any pooling indexes to the permute# matrix, we must first define the intermediate index matrix given by equation 5.15. The mapping is then given by equation 5.16.
$$\text{index\_start} = \begin{bmatrix} 1 & 5 & 9 & 13 \\ 2 & 6 & 10 & 14 \\ 3 & 7 & 11 & 15 \\ 4 & 8 & 12 & 16 \end{bmatrix} \tag{5.15}$$
The final large matrix that needs to be optimized is $V_{n-1}^{(d,d',c+1)}$, as it is $(P_{c'} \times P_c)$. Instead of doing such a large matrix multiplication, only one row of $V_{n-1}^{(d,d',c+1)}$ is used at a time to compute each weight's effect on the sensitivities. Although this uses more computational time, it requires much less memory.
Another adjustment was made just prior to upv in order to compensate for the zero padding of the image prior to partitioning. The sensitivities should not propagate on the indexes that correspond to the padding. Thus, when partitioning is performed, the indexes that correspond to non-zero-padded elements are also saved. Referring back to the 4x4 padded example in equation 5.3, this means indexes 6, 7, 10 and 11 are saved. Then, just before upv, we only fill $V_{n-1}^{(d,d',c+1)}$ with weights that correspond to non-zero-padded elements. That is, we start with an array of zeros and only add the weights that correspond to the non-padded elements.
With these adjustments the algorithm was able to perform faster and require less memory
without sacrificing accuracy.
5.5.2 Architecture 2
The second CNN is based on the concepts and algorithms in the Neural Network
handout and the Convolutional Network handout. However, due to limited time and heavy
computations and the huge memory cost of implementing CCN in the MATLAB, we are
motivated to modify the algorithms to run the program more efficiently. In addition, we
added some rules of thumb from other architectures into our algorithms. Those modifications
are explained as follow:
In the propagation process, we first preset how we do the convolution and pooling so that we know the size of the feature maps and the reduced maps in each layer. In practice, a convolutional filter with a small size and stride is preferred. Therefore, we chose a 3x3 filter with a stride equal to 1, which means the partitions are overlapping, for the convolutional network. This is reasonable in terms of image processing because adjacent pixels are specifically related in forming an image. Moreover, to keep the spatial sizes constant after the convolution, we padded the inputs of each layer with zeros around the border. For pooling, we use a 2x2 max-pooling matrix. This preset enables us to reduce the computation by creating the partition order only once and saving it for future use. The depth of the feature maps in each convolutional layer is set to increase with the propagation, and the number of nodes in each layer of the neural network is set to decrease with the propagation. Last, to get more insight into the prediction made by the network, we use a softmax implementation in the last layer of the neural network to
see the probability of each class. The exponential computations can make the terms $e^{z_n(q)}$ and $\sum_{k=1}^{Q} e^{z_n(k)}$ in the softmax very large, which might cause numerical issues when dividing the two terms. Hence, we multiply the two terms by a constant C and get the following expression:

$$y_n(q) \triangleq e^{z_n(q)} \left( \sum_{k=1}^{Q} e^{z_n(k)} \right)^{-1} = C e^{z_n(q)} \left( C \sum_{k=1}^{Q} e^{z_n(k)} \right)^{-1} = e^{z_n(q) + \log C} \left( \sum_{k=1}^{Q} e^{z_n(k) + \log C} \right)^{-1} \tag{5.17}$$

This improves the numerical stability of the computation but does not affect the resulting value. As the reference recommends, we set $\log C = -\max_k(z_n(k))$. This setup makes the highest entry of the vector $z_n$ zero.
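A minimal Matlab sketch of equation 5.17 follows; the function name is ours.

% Numerically stable softmax: shift by max(z), i.e. log C = -max(z).
function y = softmax_stable(z)
    z = z - max(z);         % largest exponent becomes exp(0) = 1, no overflow
    e = exp(z);
    y = e / sum(e);         % probabilities over the Q classes sum to one
end

For example, softmax_stable([1000; 1001; 1002]) returns finite probabilities, whereas the unshifted form overflows to Inf/Inf = NaN.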
For the training part, the second CNN uses the same kinds of adjustments to reduce the computation and memory cost.
Using convolutional neural networks, our classifier is broken up into two phases. The first phase judges whether there is a fish in the image by way of its transformed feature vectors. We are not constraining our images to one fish per picture; our algorithm handles pictures with multiple fish. The second phase classifies the fish in each image that passed the first test and was judged +1, assigning it to one of the six categories. The second phase receives an input image cropped such that a fish makes up the majority of the picture, making it easier to classify with less noise. In our training data pool, no image contains more than one species of fish. Figure 5.5 depicts how the two parts of the algorithm work together.
In the training procedure of the first phase, image features come with a label γ equal to $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ or $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$, which means there is either at least one fish inside the picture or no fish at all, respectively. Later in this section we talk about how we tried to increase the learning rate of the softmax function.

In the judgment procedure of the first phase, after training, we feed all the raw images into the CNN. Ideally, at this point we will have, for example, $\begin{pmatrix} 0.823 \\ 0.177 \end{pmatrix}$ or $\begin{pmatrix} 0.381 \\ 0.619 \end{pmatrix}$ as the output vector y. The $\begin{pmatrix} 0.823 \\ 0.177 \end{pmatrix}$ will be classified to the group $\gamma = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, as there is at least one fish, and the $\begin{pmatrix} 0.381 \\ 0.619 \end{pmatrix}$ will be classified to the group $\gamma = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$, as there are no fish. Figure 5.6 depicts graphically how the first convolutional neural network operates.
After the new data has been judged for whether there are fish or not, we segment each image judged $\gamma = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ so that we crop out everything except the fish. The way it works is that it cuts the part of the picture containing a fish and sends it back into the first network. The network processes the picture again and, based on whether the new cropped image has a fish or not, cuts a different part. It continues this process until the region of the picture corresponding to the highest likelihood of being a fish is cropped out. If there is more than one fish, it stores the first cropped piece as one fish and then sends on the image without the region of highest fish likelihood. The algorithm continues this process until the likelihood of a fish being in the remaining image falls below the 0.5 threshold. The segmentation process is illustrated in Figure 5.7, and an ideal segmentation result is shown in Figure 5.8.

Figure 5.6: First convolutional neural network, used to classify whether there is a fish or not and, if so, where it might be
After the fish(es) are preprocessed by the segmentation portion, they are fed into the second CNN to generate output vector(s) of the form $y = (a_1\ a_2\ \ldots\ a_6)^T$. Then we classify the fish in each cropped image: the second convolutional neural network assigns the fish to the species associated with the maximum element of the output vector y. This is an advantage of softmax. If none of $a_1$ through $a_6$ is prominent compared to the other elements, we say the fish is not one of the 6 species but instead Other.

In the training procedure of phase two, we feed the fish-part-only images that were cropped in the segmentation stage, with new labels $(1\ 0\ 0\ 0\ 0\ 0)^T, (0\ 1\ 0\ 0\ 0\ 0)^T, \ldots, (0\ 0\ 0\ 0\ 0\ 1)^T$ representing the six kinds of fish, into the second CNN, as shown in Figure 5.9.

After training CNN1 and CNN2, we can run the classification task on the raw testing images.
Figure 5.7: This segmentation algorithm is designed to work in tandem with the first CNN to find exactly where the fish is and crop the image, keeping only the fish and thus reducing the noise
Figure 5.8: (Left) Raw image under testing. (Middle) Image cropped by the segmentation process. (Right) Remainder after one segmentation pass
Figure 5.9: Second convolutional neural network, designed to receive well-cropped images with only relevant information. This network determines the likelihood that the fish is of a certain type
6. Results & Thoughts
6.1 Results
Figure 6.1: All 48 sets of weights for the 1st layer of the 1st CNN used for fish detection. Equivalently, these will be referred to as the masks, filters, or kernels. Each mask is 3x3 and is applied as a sliding window across the feature matrix, H. The lighter portions indicate a more positive correlation, whereas the darker portions show a negative correlation.
Although we haven’t reached our entire goal we are on the cusp. We finally got our
algorithm to train to a decent set of weights that resulted in relatively good classification. The
48 Chapter 6. Results & Thoughts
Figure 6.2: (Left) The original 280x280 gray-scale image that is fed into the 1st CNN. (Middle) The resulting image after the mask at depth 8 is applied in a sliding window fashion to the original image. As can be seen, this mask applies a level of edge detection to very light pixel values against gray borders. (Right) The resulting image after the mask at depth 23 is applied. This mask applies edge detection to very dark pixel values against gray borders. The masks evolved out of the training and return edge detection features that contain more useful information.
One such set of pre-activation signals converted back into images is shown in Figure 6.2. As can be seen, the convolutional network learned varying degrees of edge detection to help in classifying the images. One of the most prominent features of a fish in gray-scale is its shape. Humans easily recognize a fish's body using our own edge detection based on the contrast between the fish and the background. The convolutional network was copying this behavior to some degree, as it is one of the most common yet powerful techniques in image processing.
Another pre-activation signal converted back to an image can be seen in Figure 6.3. In this case, the convolutional network learned to white out large portions of the image, saturating most signals. In doing so, the network clears out a large amount of detail that it has deemed noise. The interesting point is that the prominent black back of the tuna remains intact through the whiteout filter. When asked to classify an image as having a fish, very few people with limited backgrounds in machine learning would suggest looking for the black back as a major cue. This simply goes to show how powerful convolutional networks can be at seeing underlying trends that humans often overlook.
Figure 6.3: (Left) Again, the original 280x280 gray-scale image that is fed into the 1st CNN. (Middle) The resulting image after the mask at depth 15 is applied in a sliding window fashion to the original image. The image is much lighter and only the darkest portions of the image are left over. This throws away most of the details in the image except the dark stripe on the back of the tuna. Thus, the CNN might be able to use this as a definitive feature when determining whether a fish exists in the image or not.

Overall, the cost function was minimized as it should have been over most of the training run, as shown in Figure 6.4. There were instances where certain parameters would yield large shifts in the cost function. For example, when µ was relatively large while ρ was small, the algorithm seemed to react very quickly to misclassifications (when the gradient had a relatively high magnitude). Under these conditions, the network classification was highly dependent on the previous label.
On the other hand, when ρ was very large in comparison to µ, regularization took over the weighting of the cost function. In this regime, the gradient had little effect, as the network was mostly aimed at keeping the weights small. Thus, the values of µ and ρ needed to be within a certain range to yield good classification results.
6.2 Thoughts
Given a dataset with sizes ranging from around 600x600 to 1200x900, issues of computation speed, memory cost, and varying input sizes arise. To deal with those problems, our strategy is to first implement one CNN to determine whether there is a fish in the image. For those images with fish, we use a segmentation algorithm to crop them so that the fish falls within the desired image size. This preprocessing not only reduces the data size but also decreases the effect of noise. Then we use the second CNN to classify the types of fish. The two architectures can be seen as different ways to validate arguments; however, since their makeup is a little different, we can also see them as separate classifiers used in different parts of the algorithm to accomplish slightly different tasks. The slight differences in the parameters and structure of the nets give them the flexibility to cope with a complex problem.
So far, it seems that we are on the right track to set up the entire algorithm quite soon. As the results section shows, we got the first convolutional neural network working properly with relative success given the limited training. The CNN shows how well it extracts the useful information for the classification. We have also set up a convolutional neural network that works well on the multi-class classification problem. Given more time, we can definitely implement the full pipeline and get a good result.

Figure 6.4: Cross-entropy cost function over a training run of 100 samples. The cost function is generally minimized over the run. The bumps occurred when there was a misclassification and a relatively large reaction by the gradient.
This problem is by no means easy to solve. Fortunately, through this experience we have gained some intuition about the parameters and how they affect the overall performance and the descent of the cost function. As Dr. Ali Sayed would put it: there's more art in engineering than you think.
6.3.1 Spatial Pyramid Pooling

With spatial pyramid pooling [11], later layers are weighted more, as they are more prominent in this scheme. By doing this, the bins can be cut to any number, allowing different image sizes to be fed through the network without any major issues. In the coming weeks we plan on looking into this technique further and possibly implementing it.
Another method that we are really excited about trying in the future is bagging and boosting. With such a complex problem, as one could imagine, we have many weak classifiers. We are interested in seeing how far the theory can be put into practice. It's really a shame that we did not hear of this sooner.
6.3.2 Competition
We plan on continuing this project. Kaggle's competition deadline isn't for another month, so we have all of spring break to train and modify the algorithm to get it ready for the competition. If we want any chance of competing with the top entries, we will probably have to read a few more packets from Dr. Ali Sayed.
7. Feedback & Experience
of the algorithms into simpler forms, because in the code the algorithms in the notes would many times result in matrices that were out of bounds for Matlab to compute. He also looked into spatial pyramid pooling and other resources outside of the notes. He organized each meeting and kept the direction and focus of the group intact. He was responsible for managing the report and the responsibilities of everyone.
We had at least three group meetings every week since forming our group on January 23, 2017. Every week we were responsible for reading through, deriving, and discussing the algorithms in the reading. Once we finished with that, we all started coding up smaller examples of the algorithms. From there, people started to specialize in certain areas and look into things outside of the reading. The above is only a short summary of what we all worked on. It is very difficult to say that someone worked on only one small topic, because we all worked on all topics and had discussions among each other to better understand the material. There are portions where individuals had special skills or interests, but it wouldn't be fair to consider only the small list above with its limited scope. We as a group were at the majority of the office hours of Dr. Ali Sayed and Stefan Vlaski. Each and every one of us has made a tremendous effort and has learned much more than anticipated through this project. We all had no prior experience with networks in general. Considering that we originally signed up for neural networks, then went into convolutional networks, and even dabbled with support vector machines, we as a group think that each individual should be commended for the effort shown here in the paper and for the additional work not conveyed here.
It follows from equation (47.1041) that, for the softmax algorithm, the activation function in the last layer is contributed to by all of the elements in the pre-activation vector z.
A few of us felt that towards the end of the Neural Network packet the material and explanations seemed to taper off, as if bits and pieces were hastily put together. There seemed to be less explanation and intuition given. Of course, as mentioned, this encouraged us to go out and find the answers, yet we think the packet changed pace abruptly.
For the most part, though, the packet did what Dr. Ali Sayed expressed early on. It supplied us with enough foundational knowledge to apply it to a real world problem. It also introduced us to the smaller intricacies of the inner workings of much larger machine learning approaches.
This issue of limited time in a quarter and the limited time to train our neural networks are similar in this way.
A really minor inconvenience was the number of typos. At times they were easy to find, but at other times they were crippling in terms of the homework or understanding a certain derivation. We can all agree that a newly written book will have many typos, but there may have been a better solution.
One other thing we wanted to raise was being able to see what other groups did. We agree that full-length presentations take up too much time. However, we would at least like to know what kinds of applications people are attempting. Many of us would agree that if every group had just a two-minute limit to give an informal introduction to what they were working on and their preliminary results, we'd learn something. In addition, it is very interesting to see what everyone is working on.
One of the things we can appreciate is Dr. Ali Sayed's teaching style. It's all about learning. In that sense we think we've learned a ton, much more so than in any other class. Dr. Ali Sayed is also very good at teaching. When he teaches, you can tell that when he pauses he is sometimes thinking about how to relay the information in ways we can understand. The intuition, and the way he stresses important points twice, really helps with learning. We are all first-year mechanical engineering graduate students, and we were all excited when Dr. Ali Sayed talked about the optimal gain being the covariance of x and y over the variance of y. Mind blown! Why haven't other teachers stated this simple but deep line? Small tidbits like that added up to give us not only intuition in his class but also intuition into our own field.
7.2.4 Typos
47.1051 Neural Network Packet
• Issue: Missing $\frac{1}{N}$ averaging term.
• Without it, $J_{emp}$ keeps growing and diverges. This was noted before, but it can be confusing when implementing the algorithm.
• Comment: it would be helpful to note that N is simply the number of samples.
• Corrected version [14][15]:

$$J_{emp}(W, \theta) \triangleq \sum_{l=1}^{L-1} \rho \|W_l\|_F^2 - \frac{1}{N} \sum_{n=0}^{N-1} \sum_{q=1}^{Q} \ln\!\left( y_n(q)^{\gamma_n(q)} \left(1 - y_n(q)\right)^{1 - \gamma_n(q)} \right) \tag{7.2}$$

$$d_l = \theta_{n_{l+1}} \tag{7.4}$$

$$\frac{1}{n_l} \tag{7.7}$$
47.1276 Convolutional Network Packet, pg 2559
• Issue: Extra µ term on the neural network gradient update. There should only be one µ.
• Issue: Wrong sign.
7.3 How Much Did We Learn?

Colin Togashi:
"My training has been all dependent on the parameters and the initial conditions. No, I mean it's fun, it's hard, it's definitely pushing me outside my comfort zone, but I learned a lot."
Meng-Hao Li:
"I really learned a lot from the lectures, homework, and especially the project. I think this course is a very good start for any beginner like me who has not touched learning algorithms before. The lectures not only state the motivation of each algorithm clearly but also provide good intuitions for understanding the algorithms. Thus, after this class, I can learn other learning algorithms quickly because I already know the basic concepts of learning."
Jack Shue:
Gabriel Fernandez:
"I never thought I would learn so much and be trying to increase my step size even more. I'm in pain."
8. Bibliography
8.1 References
[1] The Nature Conservancy Fisheries Monitoring | Kaggle. URL: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring
[2] The Nature Conservancy. URL: http://www.conserveca.org/?c=2
[3] The Nature Conservancy Fisheries Monitoring | Kaggle. URL: https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data
[4] ImageNet Tree View. URL: http://www.image-net.org/synset?wnid=n02512053
[5] Jia-Bin Huang. Lecture 29: Convolutional Neural Networks - Computer Vision Spring 2015. May 2015. URL: https://www.slideshare.net/jbhuang/lecture-29-convolutional-neural-networks-computer-vision-spring2015
[6] OpenCV 3 Tutorial. URL: http://www.bogotobogo.com/python/OpenCV_Python/python_opencv3_Image_Canny_Edge_Detection.php
[7] Michael A. Nielsen. Neural Networks and Deep Learning. URL: http://neuralnetworksanddeeplearning.com/chap3.html
[8] CS231n Convolutional Neural Networks for Visual Recognition. URL: http://cs231n.github.io/convolutional-networks/
[9] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. URL: http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf
[10] Ali Sayed. Adaptation and Learning. Mar. 2017.
[11] Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. Apr. 2015. URL: https://arxiv.org/abs/1406.4729
[12] Junfeng He, Shih-Fu Chang, and Lexing Xie. "Fast kernel learning for spatial pyramid matching". In: (2008), pages 1-7.
[13] Kristen Grauman and Trevor Darrell. "The pyramid match kernel: Efficient learning with sets of features". In: Journal of Machine Learning Research 8.Apr (2007), pages 725-760.
[14] Cross Entropy, Wikipedia. URL: https://en.wikipedia.org/wiki/Cross_entropy
[15] Improving the way neural networks learn. URL: http://neuralnetworksanddeeplearning.com/chap3.html