Deep Learning in Python
Master Data Science and Machine Learning with Modern
Neural Networks written in Python, Theano, and
TensorFlow
By: The LazyProgrammer (http://lazyprogrammer.me)
Introduction
Chapter 1: What is a neural network?
Chapter 2: Biological analogies
Chapter 3: Getting output from a neural network
Chapter 4: Training a neural network with backpropagation
Chapter 5: Theano
Chapter 6: TensorFlow
Chapter 7: Improving backpropagation with modern techniques -
momentum, adaptive learning rate, and regularization
Chapter 8: Unsupervised learning, autoencoders, restricted Boltzmann
machines, convolutional neural networks, and LSTMs
Chapter 9: You know more than you think you know
Conclusion
Introduction
Deep learning is making waves. At the time of this writing (March 2016),
Google's AlphaGo program just beat 9-dan professional Go player Lee Sedol at
Go, an ancient Chinese board game.
Experts in the field of Artificial Intelligence thought we were 10 years away
from achieving a victory against a top professional Go player, but progress
seems to have accelerated!
While deep learning is a complex subject, it is not any more difficult to learn
than any other machine learning algorithm. I wrote this book to introduce you to
the basics of neural networks. You will get along fine with undergraduate-level
math and programming skill.
All the materials in this book can be downloaded and installed for free. We will
use the Python programming language, along with the numerical computing
library Numpy. I will also show you in the later chapters how to build a deep
network using Theano and TensorFlow, which are libraries built specifically for
deep learning and can accelerate computation by taking advantage of the GPU.
Unlike other machine learning algorithms, deep learning is particularly powerful
because it automatically learns features. That means you don’t need to spend
your time trying to come up with and test “kernels” or “interaction effects” -
something only statisticians love to do. Instead, we will let the neural network
learn these things for us. Each layer of the neural network learns a different
abstraction than the previous layers. For example, in image classification, the
first layer might learn simple strokes, the next layer might put strokes together
to learn shapes, the next layer might put shapes together to form facial
features, and the next layer might form a high-level representation of faces.
Do you want a gentle introduction to this “dark art”, with practical code
examples that you can try right away and apply to your own data? Then this
book is for you.
These connections between neurons have strengths. You may have heard the
phrase, “neurons that fire together, wire together”, which is attributed to the
Canadian neuropsychologist Donald Hebb.
Neurons with strong connections will be turned “on” by each other. So if one
neuron sends a signal (action potential) to another neuron, and their connection
is strong, then the next neuron will also have an action potential, which could
then be passed on to other neurons, etc.
If the connection between 2 neurons is weak, then one neuron sending a signal to
another neuron might cause a small increase in electrical potential at the 2nd
neuron, but not enough to cause another action potential.
Thus we can think of a neuron as being "on" or "off" (i.e. it either has an action
potential or it doesn't).
What does this remind you of?
If you said “digital computers”, then you would be right!
Specifically, neurons are the perfect model for a yes/no, true/false, 0/1 type of
problem. We call this “binary classification” and the machine learning analogy
would be the “logistic regression” algorithm.
The above image is a pictorial representation of the logistic regression model. It
takes as inputs x1, x2, and x3, which you can imagine as the outputs of other
neurons or some other input signal (i.e. the visual receptors in your eyes or the
mechanical receptors in your fingertips), and outputs another signal which is a
combination of these inputs, weighted by the strengths of the connections from
those input neurons to this output neuron.
Because we’re going to have to eventually deal with actual numbers and
formulas, let’s look at how we can calculate y from x.
y = sigmoid(w1*x1 + w2*x2 + w3*x3)
Note that in this book, we will ignore the bias term, since it can easily be
included in the given formula by adding an extra dimension x0 which is always
equal to 1.
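To make that trick concrete, here is a small sketch (the numbers and variable names are just made up for illustration):
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # hypothetical inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.2])   # hypothetical weights w1, w2, w3
b = 0.7                          # a bias term

# with an explicit bias
y_with_bias = 1 / (1 + np.exp(-(w.dot(x) + b)))

# absorbing the bias: prepend x0 = 1 and treat b as w0
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))
y_absorbed = 1 / (1 + np.exp(-w_aug.dot(x_aug)))

print y_with_bias, y_absorbed    # identical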
So each input neuron gets multiplied by its corresponding weight (synaptic
strength) and added to all the others. We then apply a “sigmoid” function on top
of that to get the output y. The sigmoid is defined as:
sigmoid(x) = 1 / (1 + exp(-x))
If you were to plot the sigmoid, you would get this:
You can see that the output of a sigmoid is always between 0 and 1. It has two
asymptotes: the output approaches 1 as the input goes to positive infinity, and
approaches 0 as the input goes to negative infinity.
The output is 0.5 when the input is 0.
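If you would like to reproduce the plot yourself, here is a quick sketch (assuming you also have matplotlib installed):
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 100)
y = 1 / (1 + np.exp(-x))   # the sigmoid defined above

plt.plot(x, y)
plt.title("sigmoid")
plt.show()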
You can interpret the output as a probability. In particular, we interpret it as the
probability:
P(Y=1 | X)
Which can be read as "the probability that Y is equal to 1 given X". We usually
use this and "y" interchangeably; they both refer to "the output" of the
neuron.
To get a neural network, we simply combine neurons together. The way we do
this with artificial neural networks is very specific. We connect them in a
feedforward fashion.
I have highlighted in red one logistic unit. Its inputs are (x1, x2) and its output is
z1. See if you can find the other 2 logistic units in this picture.
We call the layer of z’s the “hidden layer”. Neural networks have one or more
hidden layers. A neural network with more hidden layers would be called
“deeper”.
“Deep learning” is somewhat of a buzzword. I have googled around about this
topic, and it seems that the general consensus is that any neural network with
one or more hidden layers is considered “deep”.
Exercise
Using the logistic unit as a building block, how would you calculate the output
of a neural network Y? If you can’t get it now, don’t worry, we’ll cover it in
Chapter 3.
Excitability Threshold
The output of a logistic unit must be between 0 and 1. In a classifier, we must
choose which class to predict (say, is this a picture of a cat or a dog?).
If 1 = cat and 0 = dog, and the output is 0.7, what do we say? Cat!
Why? Because our model is saying, “the probability that this is an image of a cat
is 70%”.
The 50% line acts as the “excitability threshold” of a neuron, i.e. the threshold at
which an action potential would be generated.
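In code, this decision rule is nothing more than a comparison against 0.5. A small sketch (the outputs here are made up):
import numpy as np

output = 0.7                            # a single sigmoid output
prediction = 1 if output > 0.5 else 0   # 1 = cat, 0 = dog

outputs = np.array([0.7, 0.2, 0.51])    # a batch of outputs
predictions = np.round(outputs)         # same 0.5 threshold, vectorized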
Exercise
In preparation for the next chapter, you’ll need to make sure you have the
following installed on your machine: Python, Numpy, and optionally Pandas.
Feedforward action
Let us complete the formula for y. First, we have to compute z1 and z2.
z1 = s(w11*x1 + w12*x2)
z2 = s(w21*x1 + w22*x2)
s() can be any non-linear function (if it were linear, you’d just be doing logistic
regression). The three most common choices are as follows. 1, the sigmoid:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Which we saw earlier.
2, the hyperbolic tangent: np.tanh()
And 3, the rectified linear unit, or ReLU:
def relu(x):
    if x < 0:
        return 0
    else:
        return x
Prove to yourself that this alternative way of writing relu is correct:
def relu(x):
    return x * (x > 0)
This latter form is needed in libraries like Theano which will automatically
calculate the gradient of the objective function.
And then y can be computed as:
y = s’(v1*z1 + v2*z2)
Where s’() can be a sigmoid or softmax, as we discuss in the next sections.
Note that inside the sigmoid functions we simply have the “dot product”
between the input and weights. It is more computationally efficient to use vector
and matrix operations in Numpy instead of for-loops, so we will try to do so
where possible.
This is an example of a neural network using ReLU and softmax in vectorized
form:
def forward(X, W, V):
    Z = relu(X.dot(W))
    Y = softmax(Z.dot(V))
    return Y
Binary classification
As you can see, the last layer of our simple sigmoid network is just a logistic
regression layer. We can interpret the output as the probability that Y=1 given X.
Of course, since binary classification can only output a 0 or a 1, the
probability that Y=0 given X is:
P(Y=0 | X) = 1 - P(Y=1 | X),
because the two probabilities must sum to 1.
Softmax
What if we want to classify more than 2 things? For example, the famous
MNIST dataset contains the digits 0-9, so we have 10 output classes.
In this scenario, we use the softmax function, which is defined as follows:
softmax(a[k]) = exp(a[k]) / { exp(a[1]) + exp(a[2]) + … + exp(a[k]) + … +
exp(a[K]) }
Note that the “little k” and the “big K” are different.
Convince yourself that this always adds up to 1, and thus can also be considered
a probability.
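You can also convince yourself numerically. A quick sketch (K = 5 is arbitrary):
import numpy as np

a = np.random.randn(5)      # activations for K = 5 classes
expa = np.exp(a)
softmax_a = expa / expa.sum()

print softmax_a
print softmax_a.sum()       # always 1.0 (up to floating point error)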
Now in code!
Assuming that you have already loaded your data into Numpy arrays, you can
calculate the output y as we do in this section.
Note that there is a little bit of added complexity since the formulas shown above
only calculate the output for one input sample. When we are doing this in code,
we typically want to do this calculation for many samples simultaneously.
def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def softmax(a):
    expA = np.exp(a)
    return expA / expA.sum(axis=1, keepdims=True)

X, Y = load_csv("yourdata.csv")

W = np.random.randn(D, M)
V = np.random.randn(M, K)

Z = sigmoid(X.dot(W))
p_y_given_x = softmax(Z.dot(V))
Here “M” is the number of hidden units. It is what we call a “hyperparameter”,
which could be chosen using a method such as cross-validation.
Of course, the outputs here are not very useful because they are randomly
initialized. What we would like to do is determine the best W and V so that
when we take the predictions of P(Y | X), they are very close to the actual labels
Y.
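To see just how bad the random weights are, you can turn P(Y | X) into hard predictions and measure the classification rate. A small sketch, assuming Y holds integer class labels from 0 to K-1:
predictions = np.argmax(p_y_given_x, axis=1)
print np.mean(predictions == Y)   # classification rate; around 1/K for random weights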
Exercise
Add the bias term to the above examples.
Exercise
Use gradient descent to optimize the following functions:
maximize J = log(x) + log(1-x), 0 < x < 1
maximize J = sin(x), 0 < x < pi
minimize J = 1 - x^2 - y^2, 0 <= x <= 1, 0 <= y <= 1, x + y = 1
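To get you started, here is a minimal sketch for the second one. Gradient ascent on J = sin(x) just means repeatedly stepping in the direction of dJ/dx = cos(x):
import numpy as np

x = 0.5               # any starting point in (0, pi)
learning_rate = 0.1
for i in xrange(100):
    x += learning_rate * np.cos(x)   # dJ/dx = cos(x); "+" because we are maximizing

print x, np.sin(x)    # x approaches pi/2, where sin(x) = 1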
More Code
Before we start looking at Theano and TensorFlow, I want you to get a neural
network set up with just pure Numpy and Python. Assuming you've gone
through the previous chapters, you should already have code to load the data and
feed the data into the neural network in the forward direction.
# … load data into X, T…
# … initialize W1 and W2
def forward(X, W1, W2):
    Z = sigmoid(X.dot(W1))
    Y = softmax(Z.dot(W2))
    return Y, Z

def grad_W2(Z, T, Y):
    return Z.T.dot(Y - T)

def grad_W1(X, Z, T, Y, W2):
    return X.T.dot((Y - T).dot(W2.T) * Z * (1 - Z))

for i in xrange(epochs):
    Y, Z = forward(X, W1, W2)
    W2 -= learning_rate * grad_W2(Z, T, Y)
    W1 -= learning_rate * grad_W1(X, Z, T, Y, W2)
    print cost(T, Y)
And watch the cost magically decrease on every iteration of the loop! Some
notes about this code:
I have renamed the target variables T and the output of the neural network Y. In
the previous chapter I called the targets Y and the output of the neural network
p_y_given_x.
Notice we return both Z (the hidden layer values) as well as Y in the forward()
function. That’s because we need both to calculate the gradient.
Don’t worry about how I calculated the gradient functions, unless you know
enough calculus to derive them yourself and then implement them in code.
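One piece the training loop needs that I haven't shown is the cost() function. Here is a minimal sketch of the cross-entropy cost, assuming T is an indicator ("one-hot") matrix of targets:
def cost(T, Y):
    return -(T * np.log(Y)).sum()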
So what exactly is backpropagation? It just means the “error” is getting
propagated backward through the neural network. Notice how “Y - T” shows up
in both gradients. If you had more than 1 hidden layer in the neural network, you
would notice more patterns emerge.
Notice that we loop through a number of "epochs", calculating the error on the
entire dataset all at once in each iteration. Refer back to Chapter 2, where I talked about
repetition in biological analogies. We are just repeatedly showing the neural
network the same samples again and again.
Exercise
Use the above code on the MNIST dataset, or whatever dataset you chose to
download. Add the bias terms, or add a column of 1s to the matrix X and Z so
that you effectively have bias terms.
In addition to printing the cost, also print the classification rate or error rate.
Does a lower cost guarantee a lower error rate?
Chapter 5: Theano
Theano is a Python library that is very popular for deep learning. It allows you to
take advantage of the GPU for faster floating point calculations, since, as you
may have seen, gradient descent can take quite a while.
In this book I show you how to write Theano code, but if you want to know the
particulars about how to get a machine that has GPU capabilities and how to
tweak your Theano code and commands to use them, you’ll want to consult my
course at: https://udemy.com/data-science-deep-learning-in-theano-tensorflow
If you would like to view this code in a Python file on your computer, please go
to:
https://github.com/lazyprogrammer/machine_learning_examples/tree/master/ann_class2
The relevant files are:
theano1.py
theano2.py
Theano Basics
Learning Numpy when you already know Python is pretty easy, right? You
simply have a few new functions to operate on special kinds of arrays.
Moving from Numpy to Theano is a whole other beast. There are a lot of new
concepts that just do not look like regular Python.
So let’s first talk about Theano variables. Theano has different types of variable
objects based on the number of dimensions of the object. For example, a
0-dimensional object is a scalar, a 1-dimensional object is a vector, a
2-dimensional object is a matrix, and a 3+ dimensional object is a tensor.
They are all within the theano.tensor module. So in your import section:
import theano.tensor as T
You can create a scalar variable like this:
c = T.scalar('c')
The string that is passed in is the variable’s name, which may be useful for
debugging.
A vector could be created like this:
v = T.vector('v')
And a matrix like this:
A = T.matrix('A')
Since we generally haven’t worked with tensors in this book, we are not going to
look at those. When you start working with color images, this will add another
dimension, so you'll need tensors. (e.g. a 28x28 color image would have the
dimensions 3x28x28, since we need separate matrices for the red, green,
and blue channels.)
What is strange, coming from regular Python, is that none of the variables we
just created have values!
Theano variables are more like nodes in a graph.
(Come to think of it, isn’t the neural network I described in Chapter 1 simply a
graphical model?)
We only “pass in” values to the graph when we want to perform computations
like feedforward or backpropagation, which we haven’t defined yet. TensorFlow
works in the same way.
Despite that, we can still define operations on the variables.
For example, if you wanted to do matrix multiplication, it is similar to Numpy:
u = A.dot(v)
You can think of this as creating a new node in the graph called u, which is
connected to A and v by a matrix multiply.
To actually do the multiply with real values, we need to create a Theano
function.
import theano
matrix_times_vector = theano.function(inputs=[A,v], outputs=[u])
import numpy as np
A_val = np.array([[1,2], [3,4]])
v_val = np.array([5,6])
u_val = matrix_times_vector(A_val, v_val)
Using this, try to think about how you would implement the “feedforward”
action of a neural network.
One of the biggest advantages of Theano is that it links all these variables up
into a graph and can use that structure to calculate gradients for you using the
chain rule, which we discussed in the previous chapter.
In Theano regular variables are not “updateable”, and to make an updateable
variable we create what is called a shared variable.
So let’s do that now:
x = theano.shared(20.0, 'x')
Let's also create a simple cost function that we can solve ourselves, one that we
know has a global minimum:
cost = x*x + x
And let’s tell Theano how we want to update x by giving it an update
expression:
x_update = x - 0.3*T.grad(cost, x)
The grad function takes in 2 parameters: the function you want to take the
gradient of, and the variable you want the gradient with respect to. You can pass
in multiple variables as a list into the 2nd parameter, as we’ll be doing later for
each of the weights of the neural network.
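For example (a small sketch reusing the x we just created; w and cost2 are hypothetical names of my own):
w = theano.shared(1.0, 'w')
cost2 = x*x + w*w + x*w
gx, gw = T.grad(cost2, [x, w])   # one gradient expression per variable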
Now let’s create a Theano train function. We’re going to add a new argument
called the updates argument. It takes in a list of tuples, and each tuple has 2
things in it. The first thing is the shared variable to update, and the 2nd thing is
the update expression to use.
train = theano.function(inputs=[], outputs=cost, updates=[(x, x_update)])
Notice that ‘x’ is not an input, it’s the thing we update. In later examples, the
inputs will be the data and labels. So the inputs param takes in data and labels,
and the updates param takes in your model parameters with their updates.
Now we simply write a loop to call the train function again and again:
for i in xrange(25):
    cost_val = train()
    print cost_val
And print the optimal value of x:
print x.get_value()
Now let’s take all these basic concepts and build a neural network in Theano.
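Here is a minimal sketch of what such a network might look like, so you have something concrete to extend in the exercise below. This is my own summary, not the exact contents of theano2.py; it assumes Xtrain, Ytrain_ind, Xtest, and Ytest_ind are Numpy arrays you have already prepared (inputs and indicator-matrix targets), and that D, M, and K hold the input, hidden, and output sizes:
import numpy as np
import theano
import theano.tensor as T

# symbolic inputs
thX = T.matrix('X')
thT = T.matrix('T')

# shared (updateable) weights
W1 = theano.shared(np.random.randn(D, M), 'W1')
W2 = theano.shared(np.random.randn(M, K), 'W2')

# feedforward expressions
thZ = T.nnet.sigmoid(thX.dot(W1))
thY = T.nnet.softmax(thZ.dot(W2))

# cross-entropy cost and prediction
cost = -(thT * T.log(thY)).sum()
prediction = T.argmax(thY, axis=1)

# one update expression per weight, just like x_update above
learning_rate = 0.0001
update_W1 = W1 - learning_rate * T.grad(cost, W1)
update_W2 = W2 - learning_rate * T.grad(cost, W2)

train = theano.function(
    inputs=[thX, thT],
    updates=[(W1, update_W1), (W2, update_W2)],
)
get_prediction = theano.function(
    inputs=[thX, thT],
    outputs=[cost, prediction],
)

for i in xrange(1000):
    train(Xtrain, Ytrain_ind)
    if i % 100 == 0:
        cost_val, pred = get_prediction(Xtest, Ytest_ind)
        print cost_val, np.mean(pred == np.argmax(Ytest_ind, axis=1))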
Exercise
Complete the code above by adding the following:
A function to convert the labels into an indicator matrix, if you haven't done so
yet. (Note that the examples above refer to the variables Ytrain_ind and
Ytest_ind - that's what these are.)
Add bias terms at the hidden and output layers and add the update expressions
for them as well.
Split your data into training and test sets to conform to the code above.
Try it on a dataset like MNIST.
Chapter 6: TensorFlow
If you would like to view this code in a Python file on your computer, please go
to:
https://github.com/lazyprogrammer/machine_learning_examples/tree/master/ann_class2
The relevant files are:
tensorflow1.py
tensorflow2.py
TensorFlow Basics
TensorFlow is a newer library than Theano, developed by Google. It does a lot of
nice things for us like Theano does, like calculating gradients. In this first
section we are going to cover basic functionality as we did with Theano -
variables, functions, and expressions.
TensorFlow’s web site will have a command you can use to install the library. I
won’t include it here because the version number is likely to change.
If you are on a Mac, you may need to disable “System Integrity Protection”
(rootless) temporarily by booting into recovery mode, typing in csrutil disable,
and then rebooting. You can check if it is disabled or enabled by typing csrutil
status in your console.
Once you have TensorFlow installed, come back to the book and we’ll do a
simple matrix multiplication example like we did with Theano.
Import as usual:
import tensorflow as tf
With TensorFlow we have to specify the type (Theano variable = TensorFlow
placeholder):
A = tf.placeholder(tf.float32, shape=(5, 5), name='A')
But shape and name are optional:
v = tf.placeholder(tf.float32)
We use the ‘matmul’ function in TensorFlow. I think this name is more
appropriate than ‘dot’:
u = tf.matmul(A, v)
Similar to Theano, you need to "feed" the variables values. In TensorFlow you
do the "actual work" in a "session".
with tf.Session() as session:
    # the values are fed in via the argument "feed_dict"
    # v needs to be of shape=(5, 1) not just shape=(5,)
    # it's more like "real" matrix multiplication
    output = session.run(u, feed_dict={A: np.random.randn(5, 5), v: np.random.randn(5, 1)})
    print output, type(output)
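Before the exercise, here is a minimal sketch of what a 1-hidden-layer TensorFlow network might look like. Again, this is my own summary rather than the exact contents of tensorflow2.py; it assumes a TensorFlow 1.x-style API, and it assumes Xtrain and Ytrain_ind are Numpy float32 arrays you have already prepared, with D, M, and K holding the input, hidden, and output sizes:
import numpy as np
import tensorflow as tf

tfX = tf.placeholder(tf.float32, shape=(None, D))
tfT = tf.placeholder(tf.float32, shape=(None, K))

W1 = tf.Variable(np.random.randn(D, M).astype(np.float32))
W2 = tf.Variable(np.random.randn(M, K).astype(np.float32))

Z = tf.nn.sigmoid(tf.matmul(tfX, W1))
logits = tf.matmul(Z, W2)

cost = tf.reduce_sum(
    tf.nn.softmax_cross_entropy_with_logits(labels=tfT, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.0001).minimize(cost)
predict_op = tf.argmax(logits, 1)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in xrange(1000):
        session.run(train_op, feed_dict={tfX: Xtrain, tfT: Ytrain_ind})
        if i % 100 == 0:
            pred = session.run(predict_op, feed_dict={tfX: Xtrain})
            print np.mean(pred == np.argmax(Ytrain_ind, axis=1))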
Exercise
Run your TensorFlow neural network on the MNIST dataset:
Create a 1-hidden layer neural network with 500, 1000, 2000, and 3000 hidden
units. What is the impact on training error and test error?
Create neural networks with 1, 2, and 3 hidden layers, all with 500 hidden units.
What is the impact on training error and test error? (Hint: It should be overfitting
when you have too many hidden layers).
Exercise
Add all of these methods to your Theano code and experiment with different
values. Compare to vanilla backpropagation.
Note that TensorFlow includes many of these methods in its optimizers, so
incorporating them into your training with TensorFlow would be trivial.
Exercise
Send me an email at info@lazyprogrammer.me and let me know which of the
above topics you’d be most interested in learning about in the future. I always
use student feedback to decide what courses and books to create next!
Conclusion
I really hope you had as much fun reading this book as I did making it.
Did you find anything confusing? Do you have any questions?
I am always available to help. Just email me at: info@lazyprogrammer.me
Do you want to learn more about deep learning? Perhaps online courses are
more your style. I happen to have a few of them on Udemy.
A lot of the material in this book is covered in this course, but you get to see me
derive the formulas and write the code live:
Data Science: Deep Learning in Python
https://udemy.com/data-science-deep-learning-in-python
Are you comfortable with this material, and you want to take your deep learning
skillset to the next level? Then my follow-up Udemy course on deep learning is
for you. Similar to this book, I take you through the basics of Theano and
TensorFlow - creating functions, variables, and expressions, and build up neural
networks from scratch. I teach you about ways to accelerate the learning process,
including batch gradient descent, momentum, and adaptive learning rates. I also
show you live how to create a GPU instance on Amazon AWS EC2, and prove
to you that training a neural network with GPU optimization can be orders of
magnitude faster than on your CPU.
Data Science: Practical Deep Learning in Theano and TensorFlow
https://www.udemy.com/data-science-deep-learning-in-theano-tensorflow
When you’ve got the basics of deep learning down, you’re ready to explore
alternative architectures. One very popular alternative is the convolutional neural
network, created specifically for image classification. These have promising
applications in medical imaging, self-driving vehicles, and more. In this course, I
show you how to build convolutional nets in Theano and TensorFlow.
Deep Learning: Convolutional Neural Networks in Python
https://www.udemy.com/deep-learning-convolutional-neural-networks-theano-tensorflow
In part 4 of my deep learning series, I take you through unsupervised deep
learning methods. We study principal components analysis (PCA), t-SNE
(jointly developed by the godfather of deep learning, Geoffrey Hinton), deep
autoencoders, and restricted Boltzmann machines (RBMs). I demonstrate how
unsupervised pretraining on a deep network with autoencoders and RBMs can
improve supervised learning performance.
Unsupervised Deep Learning in Python
https://www.udemy.com/unsupervised-deep-learning-in-python
Would you like an introduction to the basic building block of neural networks -
logistic regression? In this course I teach the theory of logistic regression (our
computational model of the neuron), and give you an in-depth look at binary
classification, manually creating features, and gradient descent. You might want
to check this course out if you found the material in this book too challenging.
Data Science: Logistic Regression in Python
https://udemy.com/data-science-logistic-regression-in-python
To get an even simpler picture of machine learning in general, where we don’t
even need gradient descent and can just solve for the optimal model parameters
directly in “closed-form”, you’ll want to check out my first Udemy course on the
classical statistical method - linear regression:
Data Science: Linear Regression in Python
https://www.udemy.com/data-science-linear-regression-in-python
If you are interested in learning about how machine learning can be applied to
language, text, and speech, you’ll want to check out my course on Natural
Language Processing, or NLP:
Data Science: Natural Language Processing in Python
https://www.udemy.com/data-science-natural-language-processing-in-python
Are you interested in learning SQL - structured query language - a language
that can be applied to databases as small as the ones sitting on your iPhone and
as large as the ones that span multiple continents - and in not only learning
the mechanics of the language but also how to apply it to real-world data
analytics and marketing problems? Then check out my course here:
SQL for Marketers: Dominate data analytics, data science, and big data
https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data
Finally, I am always giving out coupons and letting you know when you can get
my stuff for free. But you can only do this if you are a current student of mine!
Here are some ways I notify my students about coupons and free giveaways:
My newsletter, which you can sign up for at http://lazyprogrammer.me (it comes
with a free 6-week intro to machine learning course)
My Twitter, https://twitter.com/lazy_scientist
My Facebook page, https://facebook.com/lazyprogrammer.me (don’t forget to
hit “like”!)