Lecture 5 - CS50's Introduction to Artificial Intelligence with Python
Brian Yu (https://brianyu.me)
brian@cs.harvard.edu
Lecture 5
Neural Networks
AI neural networks are inspired by neuroscience. In the brain, neurons are cells that are
connected to each other, forming networks. Each neuron is capable of both receiving and
sending electrical signals. Once the electrical input that a neuron receives crosses some
threshold, the neuron activates, thus sending its electrical signal forward.
An Artificial Neural Network is a mathematical model for learning inspired by biological neural
networks. Artificial neural networks model mathematical functions that map inputs to outputs
based on the structure and parameters of the network. The parameters of the network are
shaped through training on data.
When implemented in AI, the parallel of each neuron is a unit that’s connected to other units.
For example, as in the last lecture, the AI might map two inputs, x₁ and x₂, to whether it will
rain today or not. Last lecture, we suggested the following form for this hypothesis
function: h(x₁, x₂) = w₀ + w₁x₁ + w₂x₂, where w₁ and w₂ are weights that modify the inputs, and w₀ is
a constant, also called bias, modifying the value of the whole expression.
Activation Functions
To use the hypothesis function to decide whether it rains or not, we need to create some sort of
threshold based on the value it produces.
One way to do this is with a step function, which gives 0 before a certain threshold is reached
and 1 after the threshold is reached.
Another way to go about this is with a logistic function, which gives as output any real number
from 0 to 1, thus expressing graded confidence in its judgment.
Another possible function is Rectified Linear Unit (ReLU), which allows the output to be any
positive value. If the value is negative, ReLU sets it to 0.
Whichever function we choose to use, we learned last lecture that the inputs are modified by
weights in addition to the bias, and the sum of those is passed to an activation function. This
stays true for simple neural networks.
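For concreteness, these three activation functions might be written as follows (a minimal sketch using only Python's standard library):

import math

def step(z, threshold=0):
    # Gives 0 before the threshold is reached and 1 after
    return 1 if z >= threshold else 0

def logistic(z):
    # Maps any real number to a value between 0 and 1
    return 1 / (1 + math.exp(-z))

def relu(z):
    # Allows any positive value; negative values become 0
    return max(0, z)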
Consider a simple network with two input units and one output unit, where the inputs are
connected to the output by weighted edges. To make a decision, the output unit multiplies the
inputs by their weights, adds the bias (w₀), and then uses function g to determine the
output.
For example, an Or logical connective can be represented as a function f with the following
truth table:
x y f(x, y)
0 0 0
0 1 1
1 0 1
1 1 1
We can visualize this function as a neural network. x₁ is one input unit, and x₂ is another input
unit. They are connected to the output unit by edges, each with a weight of 1. The output unit then
uses the function g(-1 + 1x₁ + 1x₂) with a threshold of 0 to output either 0 or 1 (false or true).
For example, in the case where x₁ = x₂ = 0, the sum is (-1). This is below the threshold, so the
function g will output 0. However, if either or both of x₁ or x₂ are equal to 1, then the sum of all
inputs will be either 0 or 1. Both are at or above the threshold, so the function will output 1.
A similar process can be repeated with the And function (where the bias will be (-2)). Moreover,
inputs and outputs don’t have to be distinct. A similar process can be used to take humidity and
air pressure as input, and produce the probability of rain as output. Or, in a different example,
inputs can be money spent on advertising and the month when it was spent to get the output
of expected revenue from sales. This can be extended to any number of inputs by multiplying
each input x₁ … xₙ by a weight w₁ … wₙ, summing up the resulting values, and adding a bias w₀.
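As a concrete sketch, such a unit can be written in a few lines of Python (the function names here are illustrative):

def g(value, threshold=0):
    # Step activation: output 1 at or above the threshold, 0 below
    return 1 if value >= threshold else 0

def unit(inputs, weights, bias):
    # Multiply each input by its weight, add the bias, and activate
    return g(bias + sum(w * x for w, x in zip(weights, inputs)))

# The Or function: weights of 1 and a bias of -1
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, unit([x1, x2], weights=[1, 1], bias=-1))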
Gradient Descent
Gradient descent is an algorithm for minimizing loss when training neural networks. As was
mentioned earlier, a neural network is capable of inferring knowledge about the structure of the
network itself from the data. Whereas so far we defined the weights by hand, neural networks
allow us to compute these weights based on the training data. To do this, we use the gradient
descent algorithm, which works the following way:
- Start with a random choice of weights. This is our naive starting place, where we don’t know how much we should weight each input.
- Repeat:
  - Calculate the gradient based on all data points that will lead to decreasing loss. Ultimately, the gradient is a vector (a sequence of numbers).
  - Update weights according to the gradient.
The problem with this kind of algorithm is that it requires calculating the gradient based on all
data points, which is computationally costly. There are multiple ways to reduce this cost. For
example, in Stochastic Gradient Descent, the gradient is calculated based on one point chosen
at random. This kind of gradient can be quite inaccurate, which leads to the Mini-Batch Gradient
Descent algorithm, which computes the gradient based on a few points selected at random,
thus striking a compromise between computation cost and accuracy. As is often the case, none of
these solutions is perfect, and different solutions might be employed in different situations.
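As a sketch of the idea, here is mini-batch gradient descent fitting a one-variable linear model (the toy data, batch size, and learning rate are illustrative; a real network has many more weights):

import random

# Toy data that exactly follows y = 3x + 2 (illustrative)
data = [(x / 10, 3 * (x / 10) + 2) for x in range(10)]

# Start with a random choice of weights
w, b = random.random(), random.random()
learning_rate = 0.1

for _ in range(5000):
    # Compute the gradient of squared loss on a few points chosen at random
    batch = random.sample(data, 4)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)

    # Update weights according to the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should approach 3 and 2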
Using gradient descent, it is possible to find answers to many problems. For example, we might
want to know more than “will it rain today?” We can use some inputs to generate probabilities
for different kinds of weather, and then just choose the weather that is most probable.
This can be done with any number of inputs and outputs, where each input is connected to each
output, and where the outputs represent decisions that we can make. Note that in this kind of
neural network the outputs are not connected to each other. This means that each output and its
associated weights from all the inputs can be seen as an individual neural network, and thus can
be trained separately from the rest of the outputs.
So far, our neural networks relied on perceptron output units. These are units that are only
capable of learning a linear decision boundary, using a straight line to separate data. That is,
based on a linear equation, the perceptron could classify an input to be one type or another.
However, often, data are not linearly separable. In this case, we turn to multilayer neural
networks to model data non-linearly. A multilayer neural network has an input layer, an output
layer, and at least one hidden layer in between, whose units are not set directly by us but
compute their own values from the weighted inputs and pass them on.
Backpropagation
Backpropagation is the main algorithm used for training neural networks with hidden layers. It
does so by starting with the errors in the output units, calculating the gradient for the
weights of the previous layer, and repeating the process until the input layer is reached. In
pseudocode, we can describe the algorithm as follows:
- Calculate the error for the output layer.
- For each layer, starting with the output layer and moving back towards the earliest hidden layer:
  - Propagate the error back one layer.
  - Update the weights.
This can be extended to any number of hidden layers, creating deep neural networks, which are
neural networks that have more than one hidden layer.
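To see the mechanics concretely, here is a minimal from-scratch sketch of backpropagation on a tiny network with one hidden layer (sigmoid activations, squared loss, trained on XOR). This is purely illustrative; a library would handle all of this internally:

import numpy as np

# XOR: a classic example that is not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Convergence depends on the random start; more iterations may be needed
for _ in range(10000):
    # Forward pass through the network
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Start with the error in the output units...
    d_output = (output - y) * output * (1 - output)
    # ...and propagate it back one layer, to the hidden units
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)

    # Update the weights according to the gradients
    W2 -= hidden.T @ d_output
    b2 -= d_output.sum(axis=0, keepdims=True)
    W1 -= X.T @ d_hidden
    b1 -= d_hidden.sum(axis=0, keepdims=True)

print(output.round(2))  # close to the XOR truth table: 0, 1, 1, 0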
Overfitting
Overfitting is the danger of modeling the training data too closely, thus failing to generalize to
new data. One way to combat overfitting is by dropout. In this technique, we temporarily remove
units that we select at random during the learning phase. This way, we try to prevent over-
reliance on any one unit in the network. Throughout training, the neural network will assume
different forms, each time dropping some units and then using them again.
Note that after training is finished, the whole neural network is used again.
TensorFlow
As is often the case in Python, multiple libraries already provide an implementation of neural
networks using the backpropagation algorithm, and TensorFlow is one such library. You are
welcome to experiment with TensorFlow neural networks in this web application
(http://playground.tensorflow.org/), which lets you define different properties of the network
and run it, visualizing the output. We will now turn to an example of how we can use
TensorFlow to perform the task we discussed last lecture: distinguishing counterfeit notes from
genuine notes.
import csv
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Read data in from file (assumed here to be named "banknotes.csv",
# with a header row followed by four features and a class label)
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row

    data = []
    for row in reader:
        data.append({
            "evidence": [float(cell) for cell in row[:4]],
            "label": 1 if row[4] == "0" else 0
        })
We provide the CSV data to the model. Much of our work lies in getting the data into the
format that the library requires; the difficult part of actually implementing the model is
already done for us.
Keras is an API, accessible through TensorFlow, for building and training machine learning
models. A sequential model is one where layers follow each other (like the ones we have seen
so far).
A dense layer is one where each node in the current layer is connected to all the nodes from
the previous layer. To generate our hidden layer, we create a dense layer of 8 units, each
receiving the 4 input values and using the ReLU activation function mentioned above.
Finally, we compile the model, specifying which algorithm should optimize it, what type of loss
function we use, and how we want to measure its success (in our case, we are interested in the
accuracy of the output). Finally, we fit the model on the training data with 20 repetitions
(epochs), and then evaluate it on the testing data.
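Following that description, the remaining steps might look like this (a sketch continuing the code above; the 60/40 split and the layer sizes are illustrative choices):

import numpy as np

# Separate data into training and testing groups
evidence = np.array([row["evidence"] for row in data])
labels = np.array([row["label"] for row in data])
X_training, X_testing, y_training, y_testing = train_test_split(
    evidence, labels, test_size=0.4
)

# Create a neural network: one hidden layer of 8 units, one output unit
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Compile: choose the optimizer, the loss function, and how to measure success
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Fit the model on the training data with 20 epochs, then evaluate it
model.fit(X_training, y_training, epochs=20)
model.evaluate(X_testing, y_testing, verbose=2)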
Computer Vision
Computer vision encompasses the different computational methods for analyzing and
understanding digital images, and it is often achieved using neural networks. For example,
computer vision is used when social media employs face recognition to automatically tag
people in pictures. Other examples are handwriting recognition and self-driving cars.
Images consist of pixels, and pixels are represented by three values that range from 0 to 255,
one for red, one for green and one for blue. These values are often referred to with the acronym
RGB. We can use this to create a neural network where each color value in each pixel is an
input, where we have some hidden layers, and the output is some number of units that tell us
what it is that was shown in the image. However, there are a few drawbacks to this approach.
First, by breaking down the image into pixels and the values of their colors, we can’t use the
structure of the image as an aid. That is, as humans, if we see a part of a face we know to expect
to see the rest of the face, and this quickens computation. We want to be able to use a similar
advantage in our neural networks. Second, the sheer number of inputs is very large, which means
that we would have to calculate an enormous number of weights.
Image Convolution
Image convolution is applying a filter that adds each pixel value of an image to its neighbors,
weighted according to a kernel matrix. Doing so alters the image and can help the neural
network process it.
Different kernels can achieve different tasks. For edge detection, the following 3×3 kernel is
often used:

-1 -1 -1
-1  8 -1
-1 -1 -1
The idea here is that when a pixel is similar to all its neighbors, they cancel each other out,
giving a value of 0. Therefore, the more similar the pixels, the darker that part of the
image, and the more different they are, the lighter it is. Applying this kernel to an image
results in an image with pronounced edges.
Let’s consider an implementation of image convolution. We are using the PIL library (stands for
Python Imaging Library) that can do most of the hard work for us.
import math
import sys
from PIL import Image, ImageFilter

# Open image
image = Image.open(sys.argv[1]).convert("RGB")
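From there, the edge-detection kernel from above can be applied; a sketch using PIL's ImageFilter.Kernel:

# Filter image according to the edge detection kernel, then show the result
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1, -1, 8, -1, -1, -1, -1],
    scale=1
))
filtered.show()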
Still, processing the image in a neural network is computationally expensive due to the number
of pixels that serve as input to the neural network. Another way to go about this is Pooling,
where the size of the input is reduced by sampling from regions in the input. Pixels that are
next to each other belong to the same area in the image, which means that they are likely to be
similar. Therefore, we can take one pixel to represent a whole region. One way of doing this is
with Max-Pooling, where the selected value is the highest of all the values in the same region.
For example, if we divide a 4×4 grid of pixels into four 2×2 regions, max-pooling keeps only the
largest value from each region, producing a 2×2 output.
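A minimal sketch of this operation in pure Python (the function name is illustrative):

def max_pool(grid, size=2):
    # Keep the largest value from each size-by-size region of the grid
    pooled = []
    for i in range(0, len(grid), size):
        row = []
        for j in range(0, len(grid[0]), size):
            region = [grid[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(region))
        pooled.append(row)
    return pooled

# A 4x4 input becomes a 2x2 output
grid = [[1, 0, 2, 3],
        [4, 6, 6, 8],
        [3, 1, 1, 0],
        [1, 2, 2, 4]]
print(max_pool(grid))  # [[6, 8], [3, 4]]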
Convolutional Neural Networks
A convolutional neural network is a neural network that uses convolution, usually for analyzing
images. It starts by applying filters that can help distill some features of the image using
different kernels. These filters can be improved in the same way as other weights in the neural
network, by adjusting their kernels based on the error of the output. Then, the resulting images
are pooled, after which the pixels are fed to a traditional neural network as inputs (a process
called flattening).
The convolution and pooling steps can be repeated multiple times to extract additional
features and reduce the size of the input to the neural network. One of the benefits of these
processes is that, by convolving and pooling, the neural network becomes less sensitive to
variation. That is, if the same picture is taken from slightly different angles, the input for the
convolutional neural network will be similar, whereas, without convolution and pooling, the
input from each image would be vastly different.
In code, a convolutional neural network doesn’t differ by much from a traditional neural
network. TensorFlow offers datasets to test our models on. We will be using MNIST, which
contains black-and-white images of handwritten digits. We will train our convolutional neural
network to recognize digits.
import sys
import tensorflow as tf
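The heart of the program is the model definition. A sketch based on the structure described above: convolution, pooling, flattening, a dense hidden layer with dropout, and a softmax output (the specific layer sizes are illustrative choices):

# Create a convolutional neural network for 28x28 grayscale MNIST images
model = tf.keras.models.Sequential([

    # Convolutional layer: learn 32 filters using a 3x3 kernel
    tf.keras.layers.Conv2D(
        32, (3, 3), activation="relu", input_shape=(28, 28, 1)
    ),

    # Max-pooling layer, using a 2x2 pool size
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

    # Flatten units
    tf.keras.layers.Flatten(),

    # Hidden layer, with dropout to reduce overfitting
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),

    # Output layer with one unit for each of the 10 digits
    tf.keras.layers.Dense(10, activation="softmax")
])

Compiling, fitting, and evaluating then proceed much as in the banknotes example, with a categorical cross-entropy loss.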
Since the model takes time to train, we can save the already trained model to use it later.
Now, if we run a program that receives hand-drawn digits as input, it will be able to classify and
output the digit using the model. For an implementation of such a program, refer to
recognition.py in the source code for this lecture.
Recurrent Neural Networks
So far, our networks have been feed-forward neural networks, where input is provided and moves
in one direction through the layers until an output is produced. As opposed to that, Recurrent
Neural Networks have a non-linear structure, where the network uses its own output as input.
For example, Microsoft's captionbot (https://www.captionbot.ai) is capable of describing the
content of an image with words in a sentence. This is different from classification in that the
output can be of varying length based on the properties of the image. While feed-forward neural
networks are incapable of varying the number of outputs, recurrent neural networks are able to
do so due to their structure.
In the captioning task, a network would process the input to produce an output, and then
continue processing from that point on, producing another output, and repeating as much as
necessary.
Recurrent neural networks are helpful in cases where the network deals with sequences and
not a single individual object. Above, the neural network needed to produce a sequence of
words. However, the same principle can be applied to analyzing video files, which consist of a
sequence of images, or in translation tasks, where a sequence of inputs (words in the source
language) is processed to produce a sequence of outputs (words in the target language).