TensorFlow For Machine Intelligence
Preface
Welcome
Since its open source release in November 2015, TensorFlow has become one of the
most exciting machine learning libraries available. It is being used more and more in
research, production, and education. The library has seen continual improvements,
additions, and optimizations, and the TensorFlow community has grown dramatically.
With TensorFlow for Machine Intelligence, we hope to help new and experienced users
hone their abilities with TensorFlow and become fluent in using this powerful library to its
fullest!
Background education
While this book is primarily focused on the TensorFlow API, we expect you to have
familiarity with a number of mathematical and programmatic concepts. These include:
Derivative calculus (single-variable and multi-variable)
Matrix algebra (matrix multiplication especially)
Basic understanding of programming principles
Basic machine learning concepts
In addition to the above, you will get more out of this book if you have the following
knowledge:
Experience with Python and organizing modules
Experience with the NumPy library
Experience with the matplotlib library
Knowledge of more advanced machine learning concepts, especially feed-forward
neural networks, convolutional neural networks, and recurrent neural networks
When appropriate, we'll include refresher information to re-familiarize you with
the concepts you need to know to fully understand the math and/or Python
concepts.
Further reading
If, after reading this book, you're interested in pursuing more with TensorFlow, here are
a couple of valuable resources:
The official TensorFlow website, which will contain the latest documentation, API,
and tutorials
The TensorFlow GitHub repository, where you can contribute to the open-source
implementation of TensorFlow, as well as review the source code directly
Officially released machine learning models implemented in TensorFlow. These
models can be used as-is or be tweaked to suit your own goals
The Google Research Blog provides the latest news from Google related to
TensorFlow applications and updates.
Kaggle is a wonderful place to find public datasets and compete with other data-minded people
Data.gov is the U.S. government's portal to find public datasets all across the United
States
Alright, that's enough of a pep talk. Let's get started with TensorFlow for Machine
Intelligence!
Chapter 1. Introduction
Data is everywhere
We are truly in The Information Age. These days, data flows in from everywhere:
smartphones, watches, vehicles, parking meters, household appliances: almost any piece
of technology you can name is now being built to communicate back to a database
somewhere in the cloud. With access to seemingly unlimited storage capacity, developers
have opted for a more-is-better approach to data warehousing, housing petabytes of data
gleaned from their products and customers.
At the same time, computational capabilities continue to climb. While the growth of
CPU speeds has slowed, there has been an explosion of parallel processing architectures.
Graphics processing units (GPUs), once used primarily for computer games, are now being
used for general purpose computing, and they have opened the floodgates for the rise of
machine learning.
Machine learning, sometimes abbreviated to ML, uses general-purpose mathematical
models to answer specific questions using data. Machine learning has been used to detect
spam email, recommend products to customers, and predict the value of commodities for
many years. In recent years, a particular kind of machine learning has seen an incredible
amount of success across all fields: deep learning.
Deep learning
"Deep learning" has become the term used to describe the process of using multi-layer
neural networks: incredibly flexible models that can use a huge variety and combination
of different mathematical techniques. They are incredibly powerful, but our ability to
utilize neural networks to great effect is a relatively new phenomenon, as we have
only recently hit the critical mass of data availability and computational power necessary
to boost their capabilities beyond those of other ML techniques.
The power of deep learning is that it gives the model more flexibility in deciding how to
use data to best effect. Instead of a person having to make wild guesses as to which inputs
are worth including, a properly tuned deep learning model can take all parameters and
automatically determine useful, higher-order combinations of its input values. This
enables a much more sophisticated decision-making process, making computers more
intelligent than ever. With deep learning, we are capable of creating cars that drive
themselves and phones that understand our speech. Machine translation, facial
recognition, predictive analytics, machine music composition, and countless artificial
intelligence tasks have become possible or significantly improved due to deep learning.
While the mathematical concepts behind deep learning have been around for decades,
programming libraries dedicated to creating and training these deep models have only
been available in recent years. Unfortunately, most of these libraries have a large trade-off
between flexibility and production-worthiness. Flexible libraries are invaluable for
researching novel model architectures, but are often either too slow or incapable of being
used in production. On the other hand, fast, efficient libraries which can be hosted on
distributed hardware are available, but they often specialize in specific types of neural
networks and aren't suited to researching new and better models. This leaves decision
makers with a dilemma: should we attempt to do research with inflexible libraries so that
we don't have to reimplement code, or should we use one library for research and a
completely different library for production? If we choose the former, we may be unable to
test out different types of neural network models; if we choose the latter, we have to
maintain code that may have completely different APIs. Do we even have the resources
for this?
TensorFlow aims to solve this dilemma.
What is TensorFlow?
Let's take a high-level view of TensorFlow to get an understanding of what problems it is
trying to solve.
Just below, in the first paragraph under "About TensorFlow," we are given an
alternative description:
TensorFlow is an open source software library for numerical computation using data flow graphs.
This second definition is a bit more specific, but may not be the most comprehensible
explanation for those with less mathematical or technical backgrounds. Let's break it
down into chunks and figure out what each piece means.
Open source:
TensorFlow was originally created by Google as an internal machine learning tool, but
an implementation of it was open sourced under the Apache 2.0 License in November
2015. As open source software, anyone is allowed to download, modify, and use the code.
Open source engineers can make additions/improvements to the code and propose their
changes to be included in a future release. Due to the popularity TensorFlow has gained,
improvements are being made to the library on a daily basis, by both Google
and third-party developers.
Notice that we say "an implementation" and not "TensorFlow was open sourced." Technically speaking,
TensorFlow is an interface for numerical computation as described in the TensorFlow white paper, and Google
still maintains its own internal implementation of it. However, the differences between the open source
implementation and Google's internal implementation are due to connections to other internal software, and
not Google hoarding "the good stuff." Google is constantly pushing internal improvements to the public
repository, and for all intents and purposes the open source release contains the same capabilities as Google's
internal version.
For the rest of this book, when we say TensorFlow, we are referring to the open source implementation.
Although TensorFlow can represent arbitrary numerical computation as a data flow
graph, because its focus is machine learning (and deep learning in particular), we will
usually talk about TensorFlow being used to create machine learning models.
There are a number of reasons this is useful. First, many common machine learning
models, such as neural networks, are commonly taught and visualized as directed graphs
already, which makes their implementation more natural for machine learning
practitioners. Second, by splitting up computation into small, easily differentiable pieces,
TensorFlow is able to automatically compute the derivative of any node (or "Operation",
as they're called in TensorFlow) with respect to any other node that can affect the first
node's output. Being able to compute the derivative/gradient of nodes, especially output
nodes, is crucial for setting up machine learning models. Finally, by having the
computation separated, it makes it much easier to distribute work across multiple CPUs,
GPUs, and other computational devices. Simply split the whole, larger graph into several
smaller graphs and give each device a separate part of the graph to work on (with a touch
of logic to coordinate sharing information across devices).
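As a concrete illustration of the second point, here is a minimal sketch (using the 0.x-era API this book covers) of asking TensorFlow for a derivative automatically; the names and values are just for illustration:

import tensorflow as tf

# A tiny graph: y = x^2
x = tf.constant(3.0)
y = x * x

# Ask TensorFlow for dy/dx; tf.gradients returns a list of Tensors,
# one gradient per Tensor in the second argument
grads = tf.gradients(y, [x])

sess = tf.Session()
print(sess.run(grads))  # [6.0], since dy/dx = 2x = 6 at x = 3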
A 3-D tensor, for example, can be viewed as a cube array of numbers (l numbers tall, m numbers wide, and n
numbers deep). In general, you can think about tensors the same way you would matrices, if you are more
comfortable with matrix math!
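If it helps, here is the same intuition in NumPy terms (a quick sketch; the shapes are arbitrary):

import numpy as np

scalar = np.array(5)              # 0-D tensor: a single number
vector = np.array([1, 2, 3])      # 1-D tensor: a row of numbers
matrix = np.eye(3)                # 2-D tensor: numbers tall and wide
cube   = np.zeros((2, 3, 4))      # 3-D tensor: tall, wide, and deep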
Distributed
As alluded to when describing data flow graphs above, TensorFlow is designed to be
scalable across multiple computers, as well as multiple CPUs and GPUs within single
machines. Although the original open source implementation did not have distributed
capabilities upon release, as of version 0.8.0 the distributed runtime is available as part of
the TensorFlow built-in library. While this initial distributed API is a bit cumbersome, it is
incredibly powerful. Most other machine learning libraries do not have such capabilities,
and it's important to note that native compatibility with certain cluster managers (such as
Kubernetes) is being worked on.
A suite of software
While "TensorFlow" is primarily used to refer to the API used to build and train
machine learning models, TensorFlow is really a bundle of software designed to be used in
tandem with one another:
TensorFlow is the API for defining machine learning models, training them with
data, and exporting them for further use. The primary API is accessed through
Python, while the actual computation is written in C++. This enables data scientists
and engineers to utilize a more user-friendly environment in Python, while the actual
computation is done with fast, compiled C++ code. There is a C++ API for executing
TensorFlow models, but it is limited at this time and not recommended for most
users.
TensorBoard is graph visualization software that is included with any standard
TensorFlow installation. When a user includes certain TensorBoard-specific
operations in TensorFlow, TensorBoard is able to read the files exported by a
TensorFlow graph and can give insight into a model's behavior. It's useful for
summary statistics, analyzing training, and debugging your TensorFlow code.
Learning to use TensorBoard early and often will make working with TensorFlow
that much more enjoyable and productive.
TensorFlow Serving is software that facilitates easy deployment of pre-trained
TensorFlow models. Using built-in TensorFlow functions, a user can export their
model to a file which can then be read natively by TensorFlow Serving. It is then able
to start a simple, high-performance server that can take input data, pass it to the
trained model, and return the output from the model. Additionally, TensorFlow
Serving is capable of seamlessly switching out old models with new ones, without
any downtime for end-users. While Serving is possibly the least recognized portion
of the TensorFlow ecosystem, it may be what sets TensorFlow apart from its
competition. Incorporating Serving into a production environment enables users to
avoid reimplementing their models; instead, they can simply pass along their TensorFlow
export. TensorFlow Serving is written entirely in C++, and its API is only accessible
through C++.
We believe that using TensorFlow to its fullest means knowing how to use all of the
above in conjunction with one another. Hence, we will be covering all three pieces of
software in this book.
TensorFlow's strengths
Usability
The TensorFlow workflow is relatively easy to wrap your head around, and its
consistent API means that you don't need to learn an entirely new way to work when
you try out different models.
TensorFlows API is stable, and the maintainers fight to ensure that every
incorporated change is backwards-compatible.
TensorFlow integrates seamlessly with NumPy, which will make most Python-savvy
data scientists feel right at home.
Unlike some other libraries, TensorFlow does not have any compile time. This allows
you to iterate more quickly over ideas without sitting around.
There are multiple higher-level interfaces built on top of TensorFlow already, such as
Keras and SkFlow. This makes it possible to use the benefits of TensorFlow even if a
user doesn't want to implement the entire model by hand.
Flexibility
TensorFlow is capable of running on machines of all shapes and sizes. This allows it
to be useful from supercomputers all the way down to embedded systems, and
everything in between.
Its distributed architecture allows it to train models with massive datasets in a
reasonable amount of time.
TensorFlow can utilize CPUs, GPUs, or both at the same time.
Efficiency
When TensorFlow was first released, it was surprisingly slow on a number of popular
machine learning benchmarks. Since that time, the development team has devoted a
great deal of time and effort to improving the implementation of much of TensorFlow's
code. The result is that TensorFlow now boasts impressive times for much of its
library, vying for the top spot amongst the open-source machine learning
frameworks.
TensorFlow's efficiency is still improving as more and more developers work
towards better implementations.
Support
TensorFlow is backed by Google. Google is throwing a ton of resources into
TensorFlow, since it wants TensorFlow to be the lingua franca of machine learning
researchers and developers. Additionally, Google uses TensorFlow in its own work
daily, and is invested in the continued support of TensorFlow.
An incredible community has developed around TensorFlow, and it's relatively easy
to get responses from informed members of the community or from the developers on
GitHub.
2. Use virtual environments: software that creates environments inside of which specific
versions of software can be maintained independently of those contained in other
environments. With Python, there are a couple of options. For the standard
distributions of Python, Virtualenv is available. If you are using Anaconda, it comes
with a built-in environment system with its package manager, Conda. We'll cover how
to install TensorFlow using both of these below.
3. Use containers: Containers, such as Docker, are lightweight ways to package
software with an entire file system, including its runtime and dependencies. Because
of this, any machine (including virtual machines) that can run the container will be
able to run the software identically to any other machine running that container.
Starting up TensorFlow from a Docker container takes a few more steps than simply
activating a Virtualenv or Conda environment, but its consistency across runtime
environments can make it invaluable when deploying code across multiple instances
(either on virtual machines or physical servers). We'll go over how to install Docker
and create your own TensorFlow containers (as well as how to use the official
TensorFlow image) below.
In general, we recommend using either Virtualenv or Conda's environments when
installing TensorFlow for use on a single computer. They solve the conflicting dependency
issue with relatively low overhead, are simple to set up, and require little thought once they
are created. If you are preparing TensorFlow code to be deployed on one or more servers,
it may be worth creating a Docker container image. While there are a few more steps
involved, that cost pays itself back upon deployment across many servers. We do not
recommend installing TensorFlow without using either an environment or container.
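For reference, a rough Conda equivalent of the Virtualenv workflow shown below might look like this (the environment name tensorflow is just a convention we use here; these are the Conda commands as of this writing):

$ conda create --name tensorflow python=2.7
$ source activate tensorflow
(tensorflow)$ # install packages with pip or conda while the environment is active
(tensorflow)$ source deactivate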
Mac OS X
$ sudo easy_install pip
$ sudo pip install --upgrade virtualenv
Now that we're ready to roll, let's create a directory to contain this environment, as well
as any environments you might create in the future:
$ mkdir ~/env
Next, we'll create the environment using the virtualenv command. In this example, it will
be located in ~/env/tensorflow.
$ virtualenv --system-site-packages ~/env/tensorflow
Once it has been created, we can activate the environment using the source command.
$ source ~/env/tensorflow/bin/activate
# Notice that your prompt now has a '(tensorflow)' indicator
(tensorflow)$
We'll want to make sure that the environment is active when we install anything with
pip, as that is how Virtualenv keeps track of various dependencies.
When you're done with the environment, you can shut it off just by using the deactivate
command:
(tensorflow)$ deactivate
Since you'll be using the virtual environment frequently, it will be useful to create a
shortcut for activating it instead of having to write out the entire source command each
time. This next command adds a bash alias to your ~/.bashrc file, which will let you
simply type tensorflow whenever you want to start up the environment:
$ printf '\nalias tensorflow="source ~/env/tensorflow/bin/activate"' >> ~/.bashrc
Mac OS X installation
Technically, there are pre-built binaries for TensorFlow with GPU support, but they require specific versions of
NVIDIA software and are incompatible with future versions.
Installing dependencies
This assumes you've already installed python-pip, python-dev, and python-virtualenv from the previous
section on installing Virtualenv.
Building TensorFlow requires a few more dependencies, though! Run the following
commands, depending on your version of Python:
Python 2.7
$ sudo apt-get install python-numpy python-wheel python-imaging swig
Python 3
$ sudo apt-get install python3-numpy python3-wheel python3-imaging swig
Installing Bazel
Bazel is an open source build tool based on Google's internal software, Blaze. As of
this writing, TensorFlow requires Bazel in order to build from source, so we must install it
ourselves. The Bazel website has complete installation instructions, but we include the
basic steps here.
The first thing to do is ensure that Java Development Kit 8 is installed on your system.
The following commands will add the Oracle JDK 8 repository as a download location for
apt and then install it:
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Ubuntu versions 15.10 and later can install OpenJDK 8 instead of the Oracle JDK. This is easier and
recommended; use the following commands instead of the above to install OpenJDK on your system:
# Ubuntu 15.10
$ sudo apt-get install openjdk-8-jdk
# Ubuntu 16.04
$ sudo apt-get install default-jdk
Next, you'll need to download the Bazel installation script. To do so, you can either go
to the Bazel releases page on GitHub, or you can use the following wget command. Note
that for Ubuntu, you'll want to download bazel-0.3.0-installer-linux-x86_64.sh:
# Downloads Bazel 0.3.0
$ wget https://github.com/bazelbuild/bazel/releases/download/0.3.0/bazel-0.3.0-installer-linux-x86_64.sh
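Once downloaded, make the script executable and run it with the --user flag referenced in the next paragraph (these two lines mirror the standard Bazel installation steps and are not shown in the original excerpt):

$ chmod +x bazel-0.3.0-installer-linux-x86_64.sh
$ ./bazel-0.3.0-installer-linux-x86_64.sh --user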
By using the --user flag, Bazel is installed to the $HOME/bin directory for the user. To ensure that this is
added to your PATH, run the following to update your ~/.bashrc:
$ printf '\nexport PATH="$PATH:$HOME/bin"' >> ~/.bashrc
Restart your bash terminal and run bazel to make sure everything is working properly:
$ bazel version
# You should see some output like the following
Build label: 0.3.0
Build target: ...
...
Great! Next up, we need to get the proper dependencies for GPU support.
In addition to making sure that your GPU is on the list, make a note of the Compute
Capability number associated with your card. For example, the GeForce GTX 1080 has a
compute capability of 6.1, and the GeForce GTX TITAN X has a compute capability of
5.2. You'll need this number when you compile TensorFlow. Once you've determined
that you're able to take advantage of CUDA, the first thing you'll want to do is sign up for
NVIDIA's Accelerated Computing Developer Program. This is required to download all
of the files necessary to install CUDA and cuDNN. The link to do so is here:
https://developer.nvidia.com/accelerated-computing-developer
Once you're signed up, you'll want to download CUDA. Go to the following link and
use the following instructions:
https://developer.nvidia.com/cuda-downloads
After signing in with the account you created above, you'll be taken to a brief survey.
Fill that out and you'll be taken to the download page. Click "I Agree to the Terms" to
be presented with the different download options. Because we installed CUDA 7.5 above,
we'll want to download cuDNN for CUDA 7.5 (as of this writing, we are using cuDNN
version 5.0).
Click "Download cuDNN v5 for CUDA 7.5" to expand a bunch of download options:
Click "cuDNN v5 Library for Linux" to download the zipped cuDNN files:
Navigate to where the .tgz file was downloaded and run the following commands to
place the correct files inside the /usr/local/cuda directory:
$ tar xvzf cudnn-7.5-linux-x64-v5.0-ga.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
And that's it for installing CUDA! With all of the dependencies taken care of, we can
now move on to the installation of TensorFlow itself.
Once inside the TensorFlow repository directory, we need to run the ./configure script,
which will tell Bazel which compiler to use, which version of CUDA to use, and so on.
Make sure that you have the compute capability number (as mentioned previously) for
your GPU card available:
$ ./configure
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python
# NOTE: For Python 3, specify /usr/bin/python3 instead
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N
Do you wish to build TensorFlow with GPU support? [y/N] y
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 7.5
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 5.0.5
Please specify the location where cuDNN 5.0.5 library is installed. Refer to README.md for more details.
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size
[Default is: "3.5,5.2"]: <YOUR-COMPUTE-CAPABILITY-NUMBER-HERE>
Setting up Cuda include
Setting up Cuda lib64
Setting up Cuda bin
Setting up Cuda nvvm
Setting up CUPTI include
Setting up CUPTI lib64
Configuration finished
Google Cloud Platform support is currently in a closed alpha. If you have access to the program, feel free to
answer yes to the Google Cloud Platform support question.
With the configuration finished, we can use Bazel to create an executable that will
create our Python binaries:
$ bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
This will take a fair amount of time, depending on how powerful your computer is.
Once Bazel is done, run the output executable and pass in a location to save the Python
wheel:
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow/bin
This creates a Python .whl file inside of ~/tensorflow/bin/. Make sure that your
tensorflow Virtualenv is active, and install the wheel with pip! (Note that the exact
name of the binary will differ depending on which version of TensorFlow is installed,
which operating system you're using, and which version of Python you installed with):
# Activate the Virtualenv using the alias we created earlier
$ tensorflow
(tensorflow)$ pip install ~/tensorflow/bin/tensorflow-0.9.0-py2-none-any.whl
If you have multiple machines with similar hardware, you can use this wheel to quickly
install TensorFlow on all of them.
You should be good to go! We'll finish up by installing the Jupyter Notebook and
matplotlib.
After that, two simple commands should get you going. First install the build-essential
dependency:
$ sudo apt-get install build-essential
Then use pip to install the Jupyter Notebook (pip3 if you are using Python 3):
# For Python 2.7
$ sudo pip install jupyter
# For Python 3
$ sudo pip3 install jupyter
Installing matplotlib
Installing matplotlib on Linux/Ubuntu is easy. Just run the following:
# Python 2.7
$ sudo apt-get build-dep python-matplotlib python-tk
# Python 3
$ sudo apt-get build-dep python3-matplotlib python3-tk
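With everything installed, you can test things out in a notebook. Create a directory for your notebooks and launch the Jupyter Notebook server from inside it (the tf-notebooks name below is just an example, matching the directory referenced in the next paragraph):

$ mkdir tf-notebooks
$ cd tf-notebooks
$ jupyter notebook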
This will start up a Jupyter Notebook server and open the software up in your default
web browser. Assuming you don't have any files in your tf-notebooks directory, you'll see
an empty workspace with the message "Notebook list is empty". To create a new
notebook, click the "New" button in the upper right corner of the page, and then select
either "Python 2" or "Python 3", depending on which version of Python you've used to
install TensorFlow.
Your new notebook will open up automatically, and you'll be presented with a blank
slate to work with. Let's quickly give the notebook a new name. At the top of the screen,
click the word "Untitled":
This will pop up a window that allows you to rename the notebook. This also changes
the name of the notebook file (with the extension .ipynb). You can call this whatever
you'd like; in this example we're calling it "My First Notebook".
Now, let's look at the actual interface. You'll notice an empty cell with the block In [ ]:
next to it. You can type Python code directly into this cell, and it can include multiple
lines. Let's import TensorFlow, NumPy, and the pyplot module of matplotlib into the
notebook:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
In order to run the cell, simply press Shift-Enter, which will run your code and create a
new cell below. You'll notice the indicator to the left now reads In [1]:, which means that
this cell was the first block of code to run in the kernel. Fill in the notebook with the
following code, using as many or as few cells as you find appropriate. You can use the
breaks in the cells to naturally group related code together.
%matplotlib inline
a = tf.random_normal([2,20])
sess = tf.Session()
out = sess.run(a)
x, y = out
plt.scatter(x, y)
plt.show()
The first line, %matplotlib inline, is a special command that tells the notebook to display
matplotlib charts directly inside the browser.
Let's go over what the rest of the code does, line by line. Don't worry if you don't
understand some of the terminology, as we'll be covering it in the book:
1. Use TensorFlow to define a 2x20 matrix of random numbers and assign it to the
variable a
2. Start a TensorFlow Session and assign it to sess
3. Execute a with the sess.run() method, and assign the output (which is a NumPy array)
to out
4. Split up the 2x20 matrix into two 1x20 vectors, x and y
5. Use pyplot to create a scatter plot with x and y
Assuming everything is installed correctly, you should get an output similar to the
above! It's a small first step, but hopefully it feels good to get the ball rolling.
For a more thorough tutorial on the ins-and-outs of the Jupyter Notebook, check out the
examples page here:
http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/examples_index.html
Conclusion
Voila! You should have a working version of TensorFlow ready to go. In the next
chapter, you'll learn fundamental TensorFlow concepts and build your first models in the
library. If you have any issues installing TensorFlow on your system, the official
installation guide should be the first place to look:
https://www.tensorflow.org/versions/master/get_started/os_setup.html
Graph basics
At the core of every TensorFlow program is the computation graph described in code
with the TensorFlow API. A computation graph is a specific type of directed graph that is
used for defining, unsurprisingly, computational structure. In TensorFlow it is, in essence,
a series of functions chained together, each passing its output to zero, one, or more
functions further along in the chain. In this way, a user can construct a complex
transformation on data by using blocks of smaller, well-understood mathematical
functions. Let's take a look at a bare-bones example.
In the above example, we see the graph for basic addition. The function, represented by
a circle, takes in two inputs, represented as arrows pointing into the function. It outputs the
result of adding 1 and 2 together: 3, which is shown as an arrow pointing out of the
function. The result could then be passed along to another function, or it might simply be
returned to the client.
We can also look at this graph as a simple equation:

add(1, 2) = 1 + 2 = 3
The above illustrates how the two fundamental building blocks of graphs, nodes and
edges, are used when constructing a computation graph. Let's go over their properties:
Nodes, typically drawn as circles, ovals, or boxes, represent some sort of
computation or action being done on or with data in the graph's context. In the above
example, the operation "add" is the sole node.
Edges are the actual values that get passed to and from Operations, and are typically
drawn as arrows. In the add example, the inputs 1 and 2 are both edges leading into
the node, while the output 3 is an edge leading out of the node. Conceptually, we can
think of edges as the link between different Operations as they carry information
from one node to the next.
Now, here's a slightly more interesting example:
There's a bit more going on in this graph! The data is traveling from left to right (as
indicated by the direction of the arrows), so let's break down the graph, starting from the
left.
1. At the very beginning, we can see two values flowing into the graph, 5 and 3. They
may be coming from a different graph, being read in from a file, or entered directly
by the client.
2. Each of these initial values is passed to one of two explicit "input" nodes, labeled a
and b in the graphic. The input nodes simply pass on values given to them: node a
receives the value 5 and outputs that same number to nodes c and d, while node b
performs the same action with the value 3.
3. Node c is a multiplication operation. It takes in the values 5 and 3 from nodes a and b,
respectively, and outputs its result of 15 to node e. Meanwhile, node d performs
addition with the same input values and passes the computed value of 8 along to node
e.
4. Finally, node e, the final node in our graph, is another "add" node. It receives the
values of 15 and 8, adds them together, and spits out 23 as the final result of our
graph.
Here's how the above graphical representation might look as a series of equations:

c = a * b = 5 * 3 = 15
d = a + b = 5 + 3 = 8

and

e = c + d = 15 + 8 = 23

With that, the computation is complete! There are a couple of concepts worth pointing out here:
The pattern of using "input" nodes is useful, as it allows us to relay a single input
value to a huge number of future nodes. If we didn't do this, the client (or whoever
passed in the initial values) would have to explicitly pass each input value to multiple
nodes in our graph. This way, the client only has to worry about passing in the
appropriate values once, and any repeated use of those inputs is abstracted away.
We'll touch a little more on abstracting graphs shortly.
Pop quiz: which node will run first: the multiplication node c, or the addition node d?
The answer: you can't tell. From just this graph, it's impossible to know which of c
and d will execute first. Some might read the graph from left-to-right and top-to-bottom
and simply assume that node c would run first, but it's important to note that
the graph could have easily been drawn with d on top of c. Others may think of these
nodes as running concurrently, but that may not always be the case, due to various
implementation details or hardware limitations. In reality, it's best to think of them as
running independently of one another. Because node c doesn't rely on any
information from node d, it doesn't have to wait for node d to do anything in order to
complete its operation. The converse is also true: node d doesn't need any
information from node c. We'll talk more about dependency later in this chapter.
Next, here's a slightly modified version of the graph:
With both of these graphs, we can begin to see the benefit of abstracting the graph's
input. We were able to manipulate the precise details of what's going on inside of our
graph, but the client only has to know to send information to the same two input nodes.
We can extend this abstraction even further, and can draw our graph like this:
By doing this we can think of entire sequences of nodes as discrete building blocks with
a set input and output. It can be easier to visualize chaining together groups of
computations instead of having to worry about the specific details of each piece.
Dependencies
There are certain types of connections between nodes that aren't allowed, the most
common of which is one that creates an unresolved circular dependency. In order to
explain a circular dependency, we're going to illustrate what a dependency is. Let's take a
look at this graph again:
The concept of a dependency is straightforward: any node, A, that is required for the
computation of a later node, B, is said to be a dependency of B. If a node A and node B do
not need any information from one another, they are said to be independent. To visually
represent this, lets take a look at what happens if the multiplication node c is unable to
finish its computation (for whatever reason):
Predictably, since node e requires the output from node c, it is unable to perform its
calculation and waits indefinitely for node c's data to arrive. It's pretty easy to see that
nodes c and d are dependencies of node e, as they feed information directly into the final
addition function. However, it may be slightly less obvious to see that the inputs a and b
are also dependencies of e. What happens if one of the inputs fails to pass its data on to the
next functions in the graph?
As you can see, removing one of the inputs halts most of the computation from actually
occurring, and this demonstrates the transitivity of dependencies. That is to say, if A is
dependent on B, and B is dependent on C, then A is dependent on C. In this case, the final
node e is dependent on nodes c and d, and the nodes c and d are both dependent on input
node b. Therefore, the final node e is dependent on the input node b. We can apply the same
reasoning to node e being dependent on node a, as well. Additionally, we can make a
distinction between the different dependencies e has:
1. We can say that e is directly dependent on nodes c and d. By this, we mean that data
must come directly from both node c and d in order for node e to execute.
2. We can say that e is indirectly dependent on nodes a and b. This means that the
outputs of a and b do not feed directly into node e. Instead, their values are fed into one or
more intermediary nodes which are also dependencies of e, whether direct or indirect.
This means that a node can be indirectly dependent on another node with many layers of
intermediaries in between (and each of those intermediaries is also a dependency).
Finally, let's see what happens if we redirect the output of a graph back into an earlier
portion of it:
Well, unfortunately it looks like that isn't going to fly. We are now attempting to pass
the output of node e back into node b and, hopefully, have the graph cycle through its
computations. The problem here is that node b now has node e as a direct dependency,
while at the same time, node e is dependent on node b (as we showed previously). The
result of this is that neither b nor e can execute, as they are both waiting for the other node
to complete its computation.
Perhaps you are clever and decide that we could provide some initial state to the value
feeding into either b or e. It is our graph, after all. Let's give the graph a kick-start by
giving the output of e an initial value of 1:
Here's what the first few loops through the graph look like: it creates an endless
feedback loop, and most of the edges in the graph tend towards infinity. Neat! However,
for software like TensorFlow, these sorts of infinite loops are bad for a number of reasons:
1. Because it's an infinite loop, the termination of the program isn't going to be
graceful.
2. The number of dependencies becomes infinite, as each subsequent iteration is
dependent on all previous iterations. Unfortunately, each node does not count as a
single dependency: each time its output changes value, it is counted again. This
makes it impossible to keep track of dependency information, which is critical for a
number of reasons (see the end of this section).
3. Frequently you end up in situations like this scenario, where the values being passed
on either explode into huge positive numbers (where they will eventually overflow),
huge negative numbers (where they will eventually overflow in the negative
direction), or become close to zero (at which point each iteration has little additional
meaning).
Because of this, truly circular dependencies can't be expressed in TensorFlow, which is
not a bad thing. In practical use, we simulate these sorts of dependencies by copying a
finite number of versions of the graph, placing them side-by-side, and feeding them into
one another in sequence. This process is commonly referred to as "unrolling" the graph,
and will be touched on more in the chapter on recurrent neural networks. To visualize
what this unrolling looks like graphically, here's what the graph would look like after
unrolling it five times:
If you analyze this graph, you'll discover that this sequence of nodes and edges is
identical to looping through the previous graph 5 times. Note how the original input values
(represented by the arrows skipping along the top and bottom of the graph) get passed
onto each copy as they are needed for each copied iteration through the graph. By
unrolling our graph like this, we can simulate useful cyclical dependencies while
maintaining a deterministic computation.
Now that we understand dependencies, we can talk about why it's useful to keep track
of them. Imagine for a moment that we only wanted to get the output of node c from the
previous example (the multiplication node). We've already defined the entire graph,
including node d, which is independent of c, and node e, which occurs after c in the graph.
Would we have to calculate the entire graph, even though we don't need the values of d
and e? No! Just by looking at the graph, you can see that it would be a waste of time to
calculate all of the nodes if we only want the output from c. The question is: how do we
make sure our computer only computes the necessary nodes without having to tell it by
hand? The answer: use our dependencies!
The concept behind this is fairly simple, and the only thing we have to ensure is that
each node has a list of the nodes it directly (not indirectly) depends on. We start with an
empty stack, which will eventually hold all of the nodes we want to run. Start with the
node(s) that you want to get the output from. Obviously it must execute, so we add it to
our stack. We then look at our output node's list of dependencies: those nodes must run
in order to calculate our output, so we add all of them to the stack. Now
we look at all of those nodes and see what their direct dependencies are and add those to
the stack. We continue this pattern all the way back in the graph until there are no
dependencies left to run, and in this way we guarantee that we have all of the nodes we
need to run the graph, and only those nodes. In addition, the stack will be ordered in a way
that we are guaranteed to be able to run each node in the stack as we iterate through it. The
main thing to look out for is to keep track of nodes that have already been calculated and to
store their values in memory; that way we don't calculate the same node over and over
again. By doing this, we are able to make sure our computation is as lean as possible,
which can save hours of processing time on huge graphs.
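To make that concrete, here is a minimal pure-Python sketch of the idea (not TensorFlow's actual implementation): each node maps to a list of its direct dependencies, and we walk backward from the node we want.

# Map each node to its direct (not indirect) dependencies
deps = {
    "a": [], "b": [],
    "c": ["a", "b"],  # c = a * b
    "d": ["a", "b"],  # d = a + b
    "e": ["c", "d"],  # e = c + d
}

def nodes_to_run(target):
    """Return only the nodes needed to compute `target`, in a runnable order."""
    ordered, seen = [], set()
    def visit(node):
        if node in seen:            # already scheduled: don't calculate twice
            return
        seen.add(node)
        for dep in deps[node]:      # direct dependencies must run first
            visit(dep)
        ordered.append(node)
    visit(target)
    return ordered

print(nodes_to_run("c"))  # ['a', 'b', 'c']: d and e are never touched

Now let's see how our example graph is defined in actual TensorFlow code (this is the same graph we drew above):

import tensorflow as tf
a = tf.constant(5, name="input_a")
b = tf.constant(3, name="input_b")
c = tf.mul(a, b, name="mul_c")
d = tf.add(a, b, name="add_d")
e = tf.add(c, d, name="add_e")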
Let's break this code down line by line. First, you'll notice this import statement:
import tensorflow as tf
This, unsurprisingly, imports the TensorFlow library and gives it an alias of tf. This is
by convention, as it's much easier to type "tf" rather than "tensorflow" over and over as
we use its various functions!
Next, let's focus on our first two variable assignments:
a = tf.constant(5, name="input_a")
b = tf.constant(3, name="input_b")
Here, we're defining our input nodes, a and b. These lines use our first TensorFlow
Operation: tf.constant(). In TensorFlow, any computation node in the graph is called an
Operation, or Op for short. Ops take in zero or more Tensor objects as input and output
zero or more Tensor objects. To create an Operation, you call its associated Python
constructor; in this case, tf.constant() creates a "constant" Op. It takes in a single tensor
value, and outputs that same value to nodes that are directly connected to it. For
convenience, the function automatically converts the scalar numbers 5 and 3 into Tensor
objects for us. We also pass in an optional string name parameter, which we can use to give
an identifier to the nodes we create.
Don't worry if you don't fully understand what an Operation or Tensor object is at this time, since we'll be
going into more detail later in this chapter.
c = tf.mul(a,b, name="mul_c")
d = tf.add(a,b, name="add_d")
Here, we are defining the next two nodes in our graph, and they both use the nodes we
defined previously. Node c uses the tf.mul Op, which takes in two inputs and outputs the
result of multiplying them together. Similarly, node d uses tf.add, an Operation that outputs
the result of adding two inputs together. We again pass in a name to both of these Ops (it's
something you'll be seeing a lot of). Notice that we don't have to define the edges of the
graph separately from the nodes: when you create a node in TensorFlow, you include all of
the inputs that the Operation needs to compute, and the software draws the connections for
you.
e = tf.add(c,d, name="add_e")
This last line defines the final node in our graph. e uses tf.add in a similar fashion to
node d. However, this time it takes nodes c and d as input, exactly as it's described in the
graph above.
With that, our first, albeit small, graph has been fully defined! If you were to execute
the above in a Python script or shell, it would run, but it wouldn't actually do anything.
Remember: this is just the definition part of the process. To get a brief taste of what
running a graph looks like, we could add the following two lines at the end to get our
graph to output the final node:
sess = tf.Session()
sess.run(e)
If you ran this in an interactive environment, such as the Python shell or the
Jupyter/IPython Notebook, you would see the correct output:
...
>>> sess = tf.Session()
>>> sess.run(e)
23
That's enough talk for now; let's actually get this running in live code!
You could write this as a Python file and run it non-interactively, but the output of running a graph is
not displayed by default when doing so. For the sake of seeing the result of your graph, getting
immediate feedback on your syntax, and (in the case of the Jupyter Notebook) the ability to fix errors
and change code on the fly, we highly recommend doing these examples in an interactive environment.
Plus, interactive TensorFlow is fun!
First, we need to load up the TensorFlow library. Write out your import statement as follows:
import tensorflow as tf
It may think for a few seconds, but afterward it will finish importing and will be ready for the next line of code. If
you installed TensorFlow with GPU support, you may see some output notifying you that CUDA libraries were
imported. If you get an error that looks like this:
ImportError: cannot import name pywrap_tensorflow
Make sure that you didn't launch your interactive environment from the TensorFlow source folder. If you get an
error that looks like this:
ImportError: No module named tensorflow
Double check that TensorFlow is installed properly. If you are using Virtualenv or Conda, ensure that your
TensorFlow environment is active when you start your interactive Python software. Note that if you have
multiple terminals running, one terminal may have an environment active while the other does not.
Assuming the import worked without any hiccups, we can move on to the next portion of the code:
a = tf.constant(5, name="input_a")
b = tf.constant(3, name="input_b")
This is the same code that we saw above; feel free to change the values or name parameters of these constants. In
this book, we'll stick to the same values for the sake of consistency.
c = tf.mul(a,b, name="mul_c")
d = tf.add(a,b, name="add_d")
Next up, we have the first Ops in our code that actually perform a mathematical function. If you're sick and tired
of tf.mul and tf.add, feel free to swap in tf.sub, tf.div, or tf.mod, which perform subtraction, division, and
modulo operations, respectively.
tf.div performs either integer division or floating point division depending on the type of input
provided. If you want to ensure floating point division, try out tf.truediv!
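A quick sketch of the difference (using this book's 0.x-era API; the values are just for illustration, and you would run the nodes with a Session as shown below):

import tensorflow as tf

x = tf.constant(3)
y = tf.constant(2)
int_div  = tf.div(x, y)      # evaluates to 1: integer division on int32 inputs
true_div = tf.truediv(x, y)  # evaluates to 1.5: inputs are cast to floating point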
You probably noticed that there hasn't been any output when calling these Operations. That's because they have
simply been adding Ops to a graph behind the scenes; no computation is actually taking place. In order to run
the graph, we're going to need a TensorFlow Session:
sess = tf.Session()
Session objects are in charge of supervising graphs as they run, and are the primary interface for running graphs.
We're going to discuss Session objects in depth after this exercise, but for now just know that in TensorFlow you
need a Session if you want to run your code! We assign our Session to the variable sess so we can access it
later.
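Now we can pass our final node, e, into the Session's run() method:

sess.run(e)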
Here's where we finally get to see the result! After running this code, you should see the output of your graph. In
our example graph, the output was 23, but it will be different depending on the exact functions and inputs you used.
That's not all we can do, however. Let's try plugging in one of the other nodes in our graph to sess.run():
sess.run(c)
You should see the intermediary value of c as the output of this call (15, in the example code). TensorFlow
doesn't make any assumptions about the graphs you create, and for all the program cares, node c could be the output
you want! In fact, you can call run() on any Operation in your graph. When you pass an Op into sess.run(),
what you are essentially saying to TensorFlow is, "Here is a node I would like to output. Please run all operations
necessary to calculate that node." Play around and try outputting some of the other nodes in your graph!
You can also save the output from running the graph; let's save the output from node e to a Python variable called
output:
output = sess.run(e)
Great! Now that we have a Session active and our graph defined, let's visualize it to confirm that it's structured
the same way we drew it out. To do that, we're going to use TensorBoard, which came installed with
TensorFlow. To take advantage of TensorBoard, we're just going to add one line to our code:
writer = tf.train.SummaryWriter('./my_graph', sess.graph)
Let's break down what this code does. We are creating a TensorFlow SummaryWriter object and assigning it to
the variable writer. In this exercise, we won't be performing any additional actions with the SummaryWriter, but
in the future we'll be using them to save data and summary statistics from our graphs, so we assign it to a variable
to get in the habit. We pass in two parameters to initialize SummaryWriter. The first is a string output directory,
which is where the graph description will be stored on disk. In this case, the files created will be put in a directory
called my_graph, located inside the directory from which we are running our Python code. The second input we
pass into SummaryWriter is the graph attribute of our Session. tf.Session objects, as managers of graphs
defined in TensorFlow, have a graph attribute that is a reference to the graph they are keeping track of. By
passing this on to SummaryWriter, the writer will output a description of the graph inside the my_graph
directory. SummaryWriter objects write this data immediately upon initialization, so once you have executed this
line of code, we can start up TensorBoard.
Go to your terminal and type in the following command, making sure that your present working directory is the
same as where you ran your Python code (you should see the my_graph directory listed):
$ tensorboard --logdir="my_graph"
You should see some log info print to the console, and then the message "Starting TensorBoard on port 6006".
What you've done is start up a TensorBoard server that is using data from the my_graph directory. By default,
the server starts on port 6006; to access TensorBoard, open up a browser and enter http://localhost:6006.
You'll be greeted with an orange-and-white-themed screen:
Don't be alarmed by the "No scalar data was found" warning message. That just means that we didn't save out
any summary statistics for TensorBoard to display; normally, this screen would show us information that we
asked TensorFlow to save using our SummaryWriter. Since we didn't write any additional stats, there's nothing to
display. That's fine, though, as we're here to admire our beautiful graph. Click on the "Graphs" link at the top of
the page, and you should see a screen similar to this:
That's more like it! If your graph is too small, you can zoom in on TensorBoard by scrolling your mouse wheel
up. You can see how each of the nodes is labeled based on the name parameter we passed into each Operation. If
you click on the nodes, you can get information about them, such as which other nodes they are attached to. You'll
notice that the inputs a and b appear to be duplicated, but if you hover over or click on either of the nodes labeled
input_a, you should see that they both get highlighted together. This graph doesn't look exactly like the graph
we drew above, but it is the same graph, since the input nodes are simply shown twice. Pretty awesome!
And that's it! You've officially written and run your first ever TensorFlow graph, and you've checked it out in
TensorBoard! Not bad for a few lines of code!
For more practice, try adding in a few more nodes, experimenting with some of the different math Ops discussed
above, and adding in a few more tf.constant nodes. Run the different nodes you've added and make sure you
understand exactly how data is moving through the graph.
Once you are done constructing your graph, let's be tidy and close the Session and SummaryWriter:
writer.close()
sess.close()
Technically, Session objects close automatically when the program terminates (or, in the interactive case, when
you close/restart the Python kernel). However, it's best to explicitly close out of the Session to avoid any sort of
weird edge case scenarios.
Here's the full Python code after going through this tutorial with our example values:
import tensorflow as tf
a = tf.constant(5, name="input_a")
b = tf.constant(3, name="input_b")
c = tf.mul(a,b, name="mul_c")
d = tf.add(a,b, name="add_d")
e = tf.add(c,d, name="add_e")
sess = tf.Session()
output = sess.run(e)
writer = tf.train.SummaryWriter('./my_graph', sess.graph)
writer.close()
sess.close()
Now, instead of having two separate input nodes, we have a single node that can take in
a vector (or 1-D tensor) of numbers. This graph has several advantages over our previous
example:
1. The client only has to send input to a single node, which simplifies using the graph.
2. The nodes that directly depend on the input now only have to keep track of one
dependency instead of two.
3. We now have the option of making the graph take in vectors of any length, if we'd
like. This would make the graph more flexible. We can also have the graph enforce a
strict requirement, and force inputs to be of length two (or any length we'd like).
We can implement this change in TensorFlow by modifying our previous code:
import tensorflow as tf
a = tf.constant([5,3], name="input_a")
b = tf.reduce_prod(a, name="prod_b")
c = tf.reduce_sum(a, name="sum_c")
d = tf.add(b,c, name="add_d")
Aside from adjusting the variable names, we made two main changes here:
1. We replaced the separate nodes a and b with a consolidated input node (now just a).
We passed in a list of numbers, which tf.constant is able to convert to a 1-D Tensor.
2. Our multiplication and addition Operations, which used to take in scalar values, are
now tf.reduce_prod() and tf.reduce_sum(). These functions, when just given a Tensor as
input, take all of its values and either multiply or sum them up, respectively.
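Running the new graph confirms the result; a quick sanity check using the same Session pattern as before:

sess = tf.Session()
print(sess.run(d))  # 23, since (5 * 3) + (5 + 3) = 15 + 8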
In TensorFlow, all data passed from node to node are Tensor objects. As weve seen,
TensorFlow Operations are able to look at standard Python types, such as integers and
strings, and automatically convert them into tensors. There are a variety of ways to create
Tensor objects manually (that is, without reading them in from an external data source), so let's
go over a few of them.
In this book, when discussing code we will use "tensor" and Tensor interchangeably.
The data types available in TensorFlow include:

tf.float32     32-bit floating point
tf.float64     64-bit floating point
tf.int8        8-bit signed integer
tf.int16       16-bit signed integer
tf.int32       32-bit signed integer
tf.int64       64-bit signed integer
tf.uint8       8-bit unsigned integer
tf.string      String (as bytes array)
tf.bool        Boolean
tf.complex64   Complex number, with 32-bit floating point real portion and 32-bit floating point imaginary portion
tf.qint8       8-bit signed integer (used in quantized Operations)
tf.qint32      32-bit signed integer (used in quantized Operations)
tf.quint8      8-bit unsigned integer (used in quantized Operations)
Using Python types to specify Tensor objects is quick and easy, and it is useful for
prototyping ideas. However, there is an important and unfortunate downside to doing it
this way. TensorFlow has a plethora of data types at its disposal, but basic Python types
lack the ability to explicitly state what kind of data type you'd like to use. Instead,
TensorFlow has to infer which data type you meant. With some types, such as strings, this
is simple, but for others it may be impossible. For example, in Python all integers are the
same type, but TensorFlow has 8-bit, 16-bit, 32-bit, and 64-bit integers available. There
are ways to convert the data into the appropriate type when you pass it into TensorFlow,
but certain data types still may be difficult to declare correctly, such as complex numbers.
Because of this, it is common to see hand-defined Tensor objects as NumPy arrays.
NumPy arrays
TensorFlow is tightly integrated with NumPy, the scientific computing package
designed for manipulating N-dimensional arrays. If you don't have experience with
NumPy, we highly recommend looking at the wealth of tutorials and documentation
available for the library, as it has become part of the lingua franca of data science.
TensorFlow's data types are based on those from NumPy; in fact, the statement np.int32 ==
tf.int32 returns True! Any NumPy array can be passed into any TensorFlow Op, and the
beauty is that you can easily specify the data type you need with minimal effort.
As a bonus, you can use the functionality of the NumPy library both before and after
running your graph, as the tensors returned from Session.run are NumPy arrays. Here's an
example of how to create NumPy arrays, mirroring the above example:
import numpy as np # Don't forget to import NumPy!
# 0-D Tensor with 32-bit integer data type
t_0 = np.array(50, dtype=np.int32)
# 1-D Tensor with byte string data type
# Note: don't explicitly specify dtype when using strings in NumPy
t_1 = np.array([b"apple", b"peach", b"grape"])
# 2-D Tensor with boolean data type
t_2 = np.array([[True, False, False],
                [False, False, True],
                [False, True, False]],
               dtype=np.bool)
# 3-D Tensor with 64-bit integer data type
t_3 = np.array([[ [0, 0], [0, 1], [0, 2] ],
                [ [1, 0], [1, 1], [1, 2] ],
                [ [2, 0], [2, 1], [2, 2] ]],
               dtype=np.int64)
...
Although TensorFlow is designed to understand NumPy data types natively, the converse is not true. Don't
accidentally try to initialize a NumPy array with tf.int32!
Tensor shape
Throughout the TensorFlow library, you'll commonly see functions and Operations that
refer to a tensor's "shape". The shape, in TensorFlow terminology, describes both the
number of dimensions in a tensor as well as the length of each dimension. Tensor shapes can
either be Python lists or tuples containing an ordered set of integers: there are as many
numbers in the list as there are dimensions, and each number describes the length of its
corresponding dimension. For example, the list [2, 3] describes the shape of a 2-D tensor
of length 2 in its first dimension and length 3 in its second dimension. Note that either
tuples (wrapped with parentheses ()) or lists (wrapped with brackets []) can be used to
define shapes. Let's take a look at more examples to illustrate this further:
# Shapes that specify a 0-D Tensor (scalar)
# e.g. any single number: 7, 1, 3, 4, etc.
s_0_list = []
s_0_tuple = ()
# Shape that describes a vector of length 3
# e.g. [1, 2, 3]
s_1 = [3]
# Shape that describes a 3-by-2 matrix
# e.g. [[1, 2],
# [3, 4],
# [5, 6]]
s_2 = (3, 2)
In addition to being able to specify fixed lengths for each dimension, you are also able to
assign a flexible length by passing in None as a dimension's value. Furthermore, passing in
the value None as a shape (instead of using a list/tuple that contains None) will tell
TensorFlow to allow a tensor of any shape- that is, a tensor with any number of
dimensions and any length for each dimension:
# Shape for a vector of any length:
s_1_flex = [None]
# Shape for a matrix that is any amount of rows tall, and 3 columns wide:
s_2_flex = (None, 3)
# Shape of a 3-D Tensor with length 2 in its first dimension, and variable
# length in its second and third dimensions:
s_3_flex = [2, None, None]
# Shape that could be any Tensor
s_any = None
If you ever need to figure out the shape of a tensor in the middle of your graph, you can
use the tf.shape Op. It simply takes in the Tensor object you'd like to find the shape of,
and returns it as an int32 vector:
import tensorflow as tf
# ...create some sort of mystery tensor
# Find the shape of the mystery tensor
shape = tf.shape(mystery_tensor, name="mystery_shape")
Remember that tf.shape, like any other Operation, doesn't run until it is executed inside
of a Session.
REMINDER!
Tensors are just a superset of matrices!
TensorFlow operations
As mentioned earlier, TensorFlow Operations, also known as Ops, are nodes that
perform computations on or with Tensor objects. After computation, they return zero or
more tensors, which can be used by other Ops later in the graph. To create an Operation,
you call its constructor in Python, which takes in whatever Tensor parameters are needed for
its calculation, known as inputs, as well as any additional information needed to properly
create the Op, known as attributes. The Python constructor returns a handle to the
Operation's output (zero or more Tensor objects), and it is this output which can be passed
on to other Operations or Session.run:
import tensorflow as tf
import numpy as np
# Initialize some tensors to use in computation
a = np.array([2, 3], dtype=np.int32)
b = np.array([4, 5], dtype=np.int32)
# Use `tf.add()` to initialize an "add" Operation
# The variable `c` will be a handle to the Tensor output of this Op
c = tf.add(a, b, name="my_add_op")
In this example, we give the name my_add_op to the add Operation, which we'll be
able to refer to when using tools such as TensorBoard.
You may find that you'll want to reuse the same name for different Operations in a graph. Instead of manually
adding prefixes or suffixes to each name, you can use a name_scope to group operations together
programmatically. We'll go over the basic use of name scopes in the exercise at the end of this chapter.
Overloaded operators
TensorFlow also overloads common mathematical operators to make multiplication, addition, subtraction, and
other common operations more concise. If one or more arguments to the operator is a Tensor object, a
TensorFlow Operation will be called and added to the graph. For example, you can easily add two tensors
together like this:
# Assume that `a` and `b` are `Tensor` objects with matching shapes
c = a + b
Unary operators

Operator   Related TensorFlow Operation   Description
-x         tf.neg()                       Returns the negative value of each element in x
~x         tf.logical_not()               Returns the logical NOT of each element in x. Only compatible with Tensor objects with dtype of tf.bool
abs(x)     tf.abs()                       Returns the absolute value of each element in x
Binary operators
Operator            Related TensorFlow Operation   Description
x + y               tf.add()                       Add x and y, element-wise
x - y               tf.sub()                       Subtract y from x, element-wise
x * y               tf.mul()                       Multiply x and y, element-wise
x / y (Python 2)    tf.div()                       Will perform element-wise integer division when given an integer type tensor, and floating point (true) division on floating point tensors
x / y (Python 3)    tf.truediv()                   Element-wise floating point division (casting integer arguments to floating point first)
x // y (Python 3)   tf.floordiv()                  Element-wise floor division, not returning any remainder from the computation
x % y               tf.mod()                       Element-wise modulo
x ** y              tf.pow()                       The result of raising each element in x to its corresponding element in y, element-wise
x < y               tf.less()                      Returns the truth table of x < y, element-wise
x <= y              tf.less_equal()                Returns the truth table of x <= y, element-wise
x > y               tf.greater()                   Returns the truth table of x > y, element-wise
x >= y              tf.greater_equal()             Returns the truth table of x >= y, element-wise
x & y               tf.logical_and()               Returns the truth table of x & y, element-wise. Dtype must be tf.bool
x | y               tf.logical_or()                Returns the truth table of x | y, element-wise. Dtype must be tf.bool
x ^ y               tf.logical_xor()               Returns the truth table of x ^ y, element-wise. Dtype must be tf.bool
Using these overloaded operators can be great when quickly putting together code, but you will not be able to
give name values to each of these Operations. If you need to pass in a name to the Op, call the TensorFlow
Operation directly.
Technically, the == operator is overloaded as well, but it will not return a Tensor of boolean values.
Instead, it will return True if the two tensors being compared are the same object, and False otherwise.
This is mainly used for internal purposes. If you'd like to check for equality or inequality, check out
tf.equal() and tf.not_equal(), respectively.
TensorFlow graphs
Thus far, we've only referenced the graph as some sort of abstract, omnipresent entity in
TensorFlow, and we haven't questioned how Operations are automatically attached to a
graph when we start coding. Now that we've seen some examples, let's take a look at the
TensorFlow Graph object, learn how to create more of them, and use multiple graphs in
conjunction with one another.
Creating a Graph is simple- its constructor doesn't take any arguments:
import tensorflow as tf
# Create a new graph:
g = tf.Graph()
Once we have our Graph initialized, we can add Operations to it by using the
Graph.as_default() method to access its context manager. In conjunction with the with
statement, we can use the context manager to let TensorFlow know that we want to add
Operations to a specific Graph:
with g.as_default():
    # Create Operations as usual; they will be added to graph `g`
    a = tf.mul(2, 3)
    ...
You might be wondering why we haven't needed to specify the graph we'd like to add
our Ops to in the previous examples. As a convenience, TensorFlow automatically creates
a Graph when the library is loaded and assigns it to be the default. Thus, any Operations,
tensors, etc. defined outside of a Graph.as_default() context manager will automatically be
placed in the default graph:
# Placed in the default graph
in_default_graph = tf.add(1,2)

# Placed in graph `g`
with g.as_default():
    in_graph_g = tf.mul(2,3)

# We are no longer in the `with` block, so this is placed in the default graph
also_in_default_graph = tf.sub(5,1)
If you'd like to get a handle to the default graph, use the tf.get_default_graph() function:
default_graph = tf.get_default_graph()
In most TensorFlow programs, you will only ever deal with the default graph. However,
creating multiple graphs can be useful if you are defining multiple models that do not have
interdependencies. When defining multiple graphs in one file, it's best practice to either
not use the default graph or immediately assign a handle to it. This ensures that nodes are
added to each graph in a uniform manner:
Correct - Create new graphs, ignore default graph:
import tensorflow as tf

g1 = tf.Graph()
g2 = tf.Graph()

with g1.as_default():
    # Define g1 Operations, tensors, etc.
    ...

with g2.as_default():
    # Define g2 Operations, tensors, etc.
    ...
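Also correct- the complementary pattern described above, immediately assigning a handle to the default graph with tf.get_default_graph() instead of ignoring it. A sketch:

import tensorflow as tf

# Get a handle to the default graph before adding a second graph
g1 = tf.get_default_graph()
g2 = tf.Graph()

with g1.as_default():
    # Define g1 Operations, tensors, etc.
    ...

with g2.as_default():
    # Define g2 Operations, tensors, etc.
    ...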
TensorFlow Sessions
Sessions, as discussed in the previous exercise, are responsible for graph execution. The
constructor tf.Session() takes in three optional parameters:
target specifies the execution engine to use. For most applications, this will be left at
its default empty string value. When using sessions in a distributed setting, this
parameter is used to connect to tf.train.Server instances (covered in the later chapters
of this book).
graph specifies the Graph object that will be launched in the Session. The default value is
None, which indicates that the current default graph should be used. When using
multiple graphs, it's best to explicitly pass in the Graph you'd like to run (instead of
creating the Session inside of a with block).
config allows users to specify options to configure the session, such as limiting the
number of CPUs or GPUs to use, setting optimization parameters for graphs, and
logging options.
In a typical TensorFlow program, Session objects will be created without changing any
of the default construction parameters.
import tensorflow as tf
# Create Operations, Tensors, etc (using the default graph)
a = tf.add(2, 5)
b = tf.mul(a, 3)
# Start up a `Session` using the default graph
sess = tf.Session()
Once a Session is opened, you can use its primary method, run(), to calculate the value of
a desired Tensor output:
sess.run(b) # Returns 21
Fetches
fetches accepts any graph element (either an Operation or Tensor object), which specifies
what the user would like to execute. If the requested object is a Tensor, then the output of
run() will be a NumPy array. If the object is an Operation, then the output will be None.
In the above example, we set fetches to the tensor b (the output of the tf.mul Operation).
This tells TensorFlow that the Session should find all of the nodes necessary to compute
the value of b, execute them in order, and output the value of b. We can also pass in a list
of graph elements:
sess.run([a, b]) # returns [7, 21]
When fetches is a list, the output of run() will be a list with values corresponding to the
output of the requested elements. In this example, we ask for the values of a and b, in that
order. Since both a and b are tensors, we receive their values as output.
In addition to using fetches to get Tensor outputs, you'll also see examples where we give
fetches a direct handle to an Operation which has a useful side effect when run. An example of
this is tf.initialize_all_variables(), which prepares all TensorFlow Variable objects to be
used (Variable objects will be covered later in this chapter). We still pass the Op as the
fetches parameter, but the result of Session.run() will be None:
# Performs the computations needed to initialize Variables, but returns `None`
sess.run(tf.initialize_all_variables())
Feed dictionary
The parameter feed_dict is used to override Tensor values in the graph, and it expects a
Python dictionary object as input. The keys in the dictionary are handles to Tensor objects
that should be overridden, while the values can be numbers, strings, lists, or NumPy arrays
(as described previously). The values must be of the same type (or able to be converted to
the same type) as the Tensor key. Let's show how we can use feed_dict to overwrite the
value of a in the previous graph:
import tensorflow as tf
# Create Operations, Tensors, etc (using the default graph)
a = tf.add(2, 5)
b = tf.mul(a, 3)
# Start up a `Session` using the default graph
sess = tf.Session()
# Define a dictionary that says to replace the value of `a` with 15
replace_dict = {a: 15}
# Run the session, passing in `replace_dict` as the value to `feed_dict`
sess.run(b, feed_dict=replace_dict) # returns 45
Notice that even though a would normally evaluate to 7, the dictionary we passed into
feed_dict replaced that value with 15. feed_dict can be extremely useful in a number of
situations. Because the value of a tensor is provided up front, the graph no longer needs to
compute any of the tensor's normal dependencies. This means that if you have a large
graph and want to test out part of it with dummy values, TensorFlow won't waste time
with unnecessary computations. feed_dict is also useful for specifying input values, as we'll
cover in the upcoming placeholder section.
After you are finished using the Session, call its close() method to release unneeded
resources:
# Open Session
sess = tf.Session()

# Run the graph, write summary statistics, etc.
...

# Release the resources held by the Session
sess.close()
As an alternative, you can also use the Session as a context manager, which will
automatically close when the code exits its scope:
with tf.Session() as sess:
    # Run graph, write summary statistics, etc.
    ...

# The Session closes automatically
We can also use a Session implicitly with its as_default() method, which returns a
context manager. Similarly to how Graph objects can be used implicitly by certain
Operations, you can set a session to be used automatically by certain functions. The most
common of such functions are Operation.run() and Tensor.eval(), which act as if you had
passed them in to Session.run() directly.
# Define simple constant
a = tf.constant(5)

# Open up a Session
sess = tf.Session()

# Use the Session as a default inside of `with` block
with sess.as_default():
    a.eval()

# Have to close Session manually
sess.close()
MORE ON INTERACTIVESESSION
Earlier in the book, we mentioned that InteractiveSession is another type of TensorFlow session, but that
we wouldn't be using it. All InteractiveSession does is automatically make itself the default session in the
runtime. This can be handy when using an interactive Python shell, as you can use a.eval() or a.run()
instead of having to explicitly type out sess.run([a]). However, if you need to juggle multiple sessions,
things can get a little tricky. We find that maintaining a consistent way of running graphs makes debugging
much easier, so we're sticking with regular Session objects.
Now that we've got a firm understanding of running our graph, let's look at how to
properly specify input nodes and use feed_dict in conjunction with them.
tf.placeholder
tf.placeholder takes in a required dtype parameter, as well as the optional parameter shape:
dtype specifies the data type of values that will be passed into the placeholder. This is
required, as it is needed to ensure that there will be no type mismatch errors.
shape specifies what shape the fed Tensor will be. See the discussion on Tensor shapes
above. The default value of shape is None, which means a Tensor of any shape will be
accepted.
Like any Operation, you can also specify a name identifier to tf.placeholder.
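The code below sketches the graph that the rest of this section refers to- a placeholder a feeding a downstream node d. The vector length, Op choices, and name strings are assumptions chosen to match the feed example that follows:

import tensorflow as tf
import numpy as np

# Creates a placeholder vector of length 2 with data type int32
a = tf.placeholder(tf.int32, shape=[2], name="my_input")

# Use the placeholder as if it were any other Tensor object
b = tf.reduce_prod(a, name="prod_b")
c = tf.reduce_sum(a, name="sum_c")

# Finish off the graph
d = tf.add(b, c, name="add_d")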
In order to actually give a value to the placeholder, we'll use the feed_dict parameter in
Session.run(). We use the handle to the placeholder's output as the key to the dictionary (in
the above code, the variable a), and the Tensor object we want to pass in as its value:
# Open a TensorFlow Session
sess = tf.Session()
# Create a dictionary to pass into `feed_dict`
# Key: `a`, the handle to the placeholder's output Tensor
# Value: A vector with value [5, 3] and int32 data type
input_dict = {a: np.array([5, 3], dtype=np.int32)}
# Fetch the value of `d`, feeding the vector in `input_dict` into `a`
sess.run(d, feed_dict=input_dict)
You must include a key-value pair in feed_dict for each placeholder that is a dependency
of the fetched output. Above, we fetched d, which depends on the output of a. If we had
defined additional placeholders that d did not depend on, we would not need to include
them in the feed_dict.
You cannot fetch the value of a placeholder on its own- TensorFlow will simply raise an exception if you
run a node that depends on a placeholder without feeding it a value through Session.run().
Variables
Creating variables
Tensor and Operation objects are immutable, but machine learning tasks, by their nature,
need a mechanism to save changing values over time. This is accomplished in TensorFlow
with Variable objects, which contain mutable tensor values that persist across multiple calls
to Session.run(). You can create a Variable by using its constructor, tf.Variable():
import tensorflow as tf
# Pass in a starting value of three for the variable
my_var = tf.Variable(3, name="my_variable")
The initial value of Variables will often be large tensors of zeros, ones, or random
values. To make it easier to create these common values, TensorFlow has a number of
helper Ops, such as tf.zeros(), tf.ones(), tf.random_normal(), and tf.random_uniform(), each of
which takes in a shape parameter which specifies the shape of the desired Tensor:
# 2x2 matrix of zeros
zeros = tf.zeros([2, 2])
# vector of length 6 of ones
ones = tf.ones([6])
# 3x3x3 Tensor of random uniform values between 0 and 10
uniform = tf.random_uniform([3, 3, 3], minval=0, maxval=10)
# 3x3x3 Tensor of normally distributed numbers; mean 0 and standard deviation 2
normal = tf.random_normal([3, 3, 3], mean=0.0, stddev=2.0)
You can pass in these Operations as the initial values of Variables as you would a handwritten Tensor:
# Default value of mean=0.0
# Default value of stddev=1.0
random_var = tf.Variable(tf.truncated_normal([2, 2]))
Variable Initialization
Variable objects live in the Graph like most other TensorFlow objects, but their state is
actually managed by a Session. Because of this, Variables have an extra step involved in
order to use them- you must initialize the Variable within a Session. This causes the Session to
start keeping track of the ongoing value of the Variable. This is typically done by passing
the tf.initialize_all_variables() Operation to Session.run():
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
If you'd only like to initialize a subset of Variables defined in the graph, you can use
tf.initialize_variables(), which takes in a list of Variables to be initialized:
var1 = tf.Variable(0, name="initialize_me")
var2 = tf.Variable(1, name="no_initialization")
init = tf.initialize_variables([var1], name="init_var1")
sess = tf.Session()
sess.run(init)
Changing Variables
In order to change the value of a Variable, you can use the Variable.assign() method,
which gives the Variable the new value it should hold. Note that Variable.assign() is an Operation, and
must be run in a Session to take effect:
# Create variable with starting value of 1
my_var = tf.Variable(1)
# Create an operation that multiplies the variable by 2 each time it is run
my_var_times_two = my_var.assign(my_var * 2)
# Initialization operation
init = tf.initialize_all_variables()
# Start a session
sess = tf.Session()
# Initialize variable
sess.run(init)
# Multiply variable by two and return it
sess.run(my_var_times_two)
## OUT: 2
# Multiply again
sess.run(my_var_times_two)
## OUT: 4
# Multiply again
sess.run(my_var_times_two)
## OUT: 8
Because Sessions maintain Variable values separately, each Session can have its own
current value for a Variable defined in a graph:
# Create Ops
my_var = tf.Variable(0)
init = tf.initialize_all_variables()
# Start Sessions
sess1 = tf.Session()
sess2 = tf.Session()
# Initialize Variable in sess1, and increment value of my_var in that Session
sess1.run(init)
sess1.run(my_var.assign_add(5))
## OUT: 5
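To see the second Session maintain its own independent state, you can initialize and modify the same Variable there. A sketch continuing the example above, with arbitrary increments:

# Initialize Variable in sess2, and increment value of my_var in that Session
sess2.run(init)
sess2.run(my_var.assign_add(2))
## OUT: 2

# Increment my_var in each Session again; the values stay separate
sess1.run(my_var.assign_add(5))
## OUT: 10
sess2.run(my_var.assign_add(2))
## OUT: 4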
If you'd like to reset your Variables to their starting value, simply call
tf.initialize_all_variables() again (or tf.initialize_variables() if you only want to reset a subset
of them):
# Create Ops
my_var = tf.Variable(0)
init = tf.initialize_all_variables()
# Start Session
sess = tf.Session()
# Initialize Variables
sess.run(init)
# Change the Variable
sess.run(my_var.assign(10))
# Reset the Variable to 0, its initial value
sess.run(init)
Trainable
Later in this book, you'll see various Optimizer classes, which automatically train machine
learning models. That means that they will change the values of Variable objects without you
explicitly asking them to do so. In most cases, this is what you want, but if there are Variables in
your graph that should only be changed manually and not with an Optimizer, you need to set
their trainable parameter to False when creating them:
not_trainable = tf.Variable(0, trainable=False)
This is typically done with step counters or anything else that isn't going to be involved
in the calculation of the machine learning model.
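The TensorBoard discussion below refers to a small graph organized with two name scopes. A minimal sketch consistent with the add and mul Operations described below (the scope and Op name strings here are assumptions):

import tensorflow as tf

with tf.name_scope("Scope_A"):
    a = tf.add(1, 2, name="A_add")
    b = tf.mul(a, 3, name="A_mul")

with tf.name_scope("Scope_B"):
    c = tf.add(4, 5, name="B_add")
    d = tf.mul(c, 6, name="B_mul")

e = tf.add(b, d, name="output")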
To see the result of these name scopes in TensorBoard, let's open up a SummaryWriter and
write this graph to disk.
writer = tf.train.SummaryWriter('./name_scope_1', graph=tf.get_default_graph())
writer.close()
Because the SummaryWriter exports the graph immediately, we can simply start up
TensorBoard after running the above code. Navigate to where you ran the previous script
and start up TensorBoard:
$ tensorboard --logdir='./name_scope_1'
As before, this will start a TensorBoard server on your local computer at port 6006.
Open up a browser and enter localhost:6006 into the URL bar. Navigate to the Graph tab,
and you'll see something similar to this:
You'll notice that the add and mul Operations we added to the graph aren't immediately
visible- instead, we see their enclosing name scopes. You can expand the name scope
boxes by clicking on the plus + icon in their upper right corner.
Inside of each scope, you'll see the individual Operations you've added to the graph.
You can also nest name scopes within other name scopes:
graph = tf.Graph()
with graph.as_default():
    in_1 = tf.placeholder(tf.float32, shape=[], name="input_a")
    in_2 = tf.placeholder(tf.float32, shape=[], name="input_b")
    const = tf.constant(3, dtype=tf.float32, name="static_value")
    with tf.name_scope("Transformation"):
        with tf.name_scope("A"):
            A_mul = tf.mul(in_1, const)
            A_out = tf.sub(A_mul, in_1)
        with tf.name_scope("B"):
            B_mul = tf.mul(in_2, const)
            B_out = tf.sub(B_mul, in_2)
        with tf.name_scope("C"):
            C_div = tf.div(A_out, B_out)
            C_out = tf.add(C_div, const)
        with tf.name_scope("D"):
            D_div = tf.div(B_out, A_out)
            D_out = tf.add(D_div, const)
    out = tf.maximum(C_out, D_out)
To mix things up, this code explicitly creates a tf.Graph object instead of using the
default graph. Let's look at the code in abbreviated form and focus on the name scopes to see exactly how it's
structured:
graph = tf.Graph()
with graph.as_default():
    in_1 = tf.placeholder(...)
    in_2 = tf.placeholder(...)
    const = tf.constant(...)

    with tf.name_scope("Transformation"):

        with tf.name_scope("A"):
            # Takes in_1, outputs some value
            ...

        with tf.name_scope("B"):
            # Takes in_2, outputs some value
            ...

        with tf.name_scope("C"):
            # Takes the output of A and B, outputs some value
            ...

        with tf.name_scope("D"):
            # Takes the output of A and B, outputs some value
            ...

    # Takes the output of C and D
    out = tf.maximum(...)
Now it's easier to dissect. This model has two scalar placeholder nodes as input, a
TensorFlow constant, a middle chunk called Transformation, and then a final output
node that uses tf.maximum() as its Operation. We can see this high-level overview inside of
TensorBoard:
# Start up TensorBoard in a terminal, loading in our previous graph
$ tensorboard --logdir='./name_scope_2'
Inside of the Transformation name scope are four more name scopes arranged in two
layers. The first layer consists of scopes A and B, which pass their output
values into the next layer of C and D. The final node then uses the outputs from this
last layer as its input. If you expand the Transformation name scope in TensorBoard,
you'll get a look at this:
This also gives us a chance to showcase another feature in TensorBoard. In the above
picture, you'll notice that name scopes A and B have matching colors (blue), as do C
and D (green). This is due to the fact that these name scopes have identical Operations
set up in the same configuration. That is, A and B both have a tf.mul() Op feeding into
a tf.sub() Op, while C and D have a tf.div() Op that feeds into a tf.add() Op. This
becomes handy if you start using functions to create repeated sequences of Operations.
In this image, you can see that tf.constant objects don't behave quite the same way as
other Tensors or Operations when displayed in TensorBoard. Even though we declared
static_value outside of any name scope, it still gets placed inside them. Furthermore,
instead of there being one icon for static_value, TensorBoard creates a small visual wherever it is used.
The basic idea for this is that constants can be used at any time and don't necessarily need
to be used in any particular order. To prevent arrows flying all over the graph from a single
point, TensorBoard just makes a small impression wherever a constant is used.
ASIDE: tf.maximum()
tf.maximum() outputs the element-wise maximum of its two input tensors; here it simply
passes through the larger of the two values handed to it by C and D.
Separating a huge graph into meaningful clusters can make understanding and
debugging your model a much more approachable task.
This next exercise is going to have some important differences that use TensorFlow more
fully:
Our inputs will be placeholders instead of tf.constant nodes
Instead of taking two discrete scalar inputs, our model will take in a single vector of
any length
We're going to accumulate the total value of all outputs over time as we use the graph
The graph will be segmented nicely with name scopes
After each run, we are going to save the output of the graph, the accumulated total of
all outputs, and the average value of all outputs to disk for use in TensorBoard
Here's a rough outline of what we'd like our graph to look like:
Here are the key things to note about reading this model:
Notice how each edge now has either a [None] or [] icon next to it. This represents the
TensorFlow shape of the tensor flowing through that edge, with None representing a
vector of any length, and [] representing a scalar.
The output of node d now flows into an update section, which contains Operations
necessary to update Variables as well as pass data through to the TensorBoard
summaries.
We have a separate name scope to contain our two Variables- one to store the
accumulated sum of our outputs, the other to keep track of how many times we've
run the graph. Since these two Variables operate outside of the flow of our main
transformation, it makes sense to put them in a separate space.
There is a name scope dedicated to our TensorBoard summaries which will hold our
tf.scalar_summary Operations. We place them after the update section to ensure that
the summaries are added after we update our Variables, otherwise things could run
out of order.
Let's get going! Open up your code editor or interactive Python environment.
We're going to explicitly create the graph that we'd like to use instead of using the
default graph, so make one with tf.Graph():
import tensorflow as tf

graph = tf.Graph()
And then we'll set our new graph as the default while we construct our model:
with graph.as_default():
We have two "global" style Variables in our model. The first is a global_step, which
will keep track of how many times we've run our model. This is a common paradigm in
TensorFlow, and you'll see it used throughout the API. The second Variable is called
total_output- it's going to keep track of the total sum of all outputs run on this model
over time. Because these Variables are global in nature, we declare them separately from
the rest of the nodes in the graph and place them into their own name scope:
with graph.as_default():
    with tf.name_scope("variables"):
        # Variable to keep track of how many times the graph has been run
        global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")

        # Variable that keeps track of the sum of all output values over time:
        total_output = tf.Variable(0.0, dtype=tf.float32, trainable=False, name="total_output")
Note that we use the trainable=False setting- it won't have an impact in this model (we
aren't training anything!), but it makes it explicit that these Variables will be set by hand.
Next up, we'll create the core transformation part of the model. We'll encapsulate the
entire transformation in a name scope, transformation, and separate it further into
separate input, intermediate_layer, and output name scopes:
with graph.as_default():
    with tf.name_scope("variables"):
        ...
    with tf.name_scope("transformation"):

        # Separate input layer
        with tf.name_scope("input"):
            # Create input placeholder- takes in a Vector
            a = tf.placeholder(tf.float32, shape=[None], name="input_placeholder_a")

        # Separate middle layer
        with tf.name_scope("intermediate_layer"):
            b = tf.reduce_prod(a, name="product_b")
            c = tf.reduce_sum(a, name="sum_c")

        # Separate output layer
        with tf.name_scope("output"):
            output = tf.add(b, c, name="output")
This is extremely similar to the code written for the previous model, with a few key
differences:
Our input node is a tf.placeholder that accepts a vector of any length (shape=[None]).
Instead of using tf.mul() and tf.add(), we use tf.reduce_prod() and tf.reduce_sum(),
respectively. This allows us to multiply and add across the entire input vector, as the
earlier Ops only accept exactly 2 input scalars.
After the transformation computation, we're going to need to update our two Variables
from above. Let's create an update name scope to hold these changes:
with graph.as_default():
    with tf.name_scope("variables"):
        ...
    with tf.name_scope("transformation"):
        ...
    with tf.name_scope("update"):
        # Increments the total_output Variable by the latest output
        update_total = total_output.assign_add(output)

        # Increments the above `global_step` Variable, should be run whenever the graph is run
        increment_step = global_step.assign_add(1)
Next we add a summaries name scope, sketched below. The first thing we do inside of this section is compute
the average output value over time. Luckily, we have the total value of all outputs with total_output (we use
the output from update_total to make sure that the update happens before we compute avg), as well as
the total number of times we've run the graph with global_step (same thing- we use the
output of increment_step to make sure the graph runs in order). Once we have the average,
we save output, update_total and avg with separate tf.scalar_summary objects.
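A sketch of what that summaries scope can look like- the avg node follows directly from the description above, while the summary tag strings here are assumptions:

with graph.as_default():
    ...
    with tf.name_scope("summaries"):
        # Average output over time: total output divided by the number of runs
        avg = tf.div(update_total, tf.cast(increment_step, tf.float32), name="average")

        # Creates summaries for the latest output, the running total, and the average
        tf.scalar_summary('output', output, name="output_summary")
        tf.scalar_summary('sum_of_outputs_over_time', update_total, name="total_summary")
        tf.scalar_summary('average_of_outputs_over_time', avg, name="average_summary")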
To finish up the graph, we'll create our Variable initialization Operation, as well as a
helper node to group all of our summaries into one Op. Let's place these in a name scope
called global_ops:
with graph.as_default():
    ...
    with tf.name_scope("summaries"):
        ...
    with tf.name_scope("global_ops"):
        # Initialization Op
        init = tf.initialize_all_variables()

        # Merge all summaries into one Operation
        merged_summaries = tf.merge_all_summaries()
You may be wondering why we placed the tf.merge_all_summaries() Op here instead of the
summaries name scope. While it doesn't make a huge difference here, placing
merge_all_summaries() with other global Operations is generally best practice. This graph only
has one section for summaries, but you can imagine a graph having different summaries
for Variables, Operations, name scopes, etc. By keeping merge_all_summaries() separate, it
ensures that you'll be able to find the Operation without having to remember which
specific summary code block you placed it in.
That's it for creating the graph! Now let's get things set up to execute the graph.
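First, we need a Session launching our explicitly created graph, plus a SummaryWriter to save the summary data. A sketch, with the improved_graph log directory name taken from the TensorBoard command used later in this exercise:

sess = tf.Session(graph=graph)
writer = tf.train.SummaryWriter('./improved_graph', graph)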
With the Session started, let's initialize our Variables before doing anything else:
sess.run(init)
To actually run our graph, let's create a helper function, run_graph(), so that we don't have
to keep typing the same thing over and over again. What we'd like is to pass our input
vector in to the function, which will run the graph and save our summaries:
def run_graph(input_tensor):
    feed_dict = {a: input_tensor}
    _, step, summary = sess.run([output, increment_step, merged_summaries],
                                feed_dict=feed_dict)
    writer.add_summary(summary, global_step=step)
Do this as many times as you'd like. Once you've had your fill, go ahead and write the
summaries to disk with the SummaryWriter.flush() method:
writer.flush()
Finally, let's be tidy and close both our SummaryWriter and Session, now that we're done
with them:
writer.close()
sess.close()
And that's it for our TensorFlow code! It was a little longer than the last graph, but it
wasn't too bad. Let's open up TensorBoard and see what we've got. Fire up a terminal
shell, navigate to the directory where you ran this code (make sure the improved_graph
directory is located there), and run the following:
$ tensorboard --logdir="improved_graph"
As usual, this starts up a TensorBoard server on port 6006, hosting the data stored in
improved_graph. Type localhost:6006 into your web browser and let's see what
we've got! Let's first check out the Graph tab:
You'll see that this graph closely matches what we diagrammed out earlier. Our
transformation operations flow into the update block, which then feeds into both the
summary and variable name scopes. The main difference between this and our diagram is
the global_ops name scope, which contains operations that aren't critical to the primary
transformation computation.
You can expand the various blocks to get a more granular look at their structure:
Now we can see the separation of our input, the intermediate layer, and the output. It
might be overkill on a simple model like this, but this sort of compartmentalization is
extremely useful. Feel free to explore the rest of the graph. When you're ready, head over
to the Events page.
When you open up the Events page, you should see three collapsed tabs, named after
the tags we gave our scalar_summary objects above. By clicking on any of them, you'll see
a nice line chart showing the values we stored at various time steps. If you click the blue
rectangle at the bottom left of the charts, they'll expand to look like the image above.
Definitely check out the results of your summaries, compare them, make sure that they
make sense, and pat yourself on the back! That concludes this exercise- hopefully by now
you have a good sense of how to create TensorFlow graphs based on visual sketches, as
well as how to do some basic summaries with TensorBoard.
For reference, here is the portion of the code that runs the graph:
def run_graph(input_tensor):
    """
    Helper function; runs the graph with given input tensor and saves summaries
    """
    feed_dict = {a: input_tensor}
    _, step, summary = sess.run([output, increment_step, merged_summaries],
                                feed_dict=feed_dict)
    writer.add_summary(summary, global_step=step)
# Run the graph with various inputs
run_graph([2,8])
run_graph([3,1,3,3])
run_graph([8])
run_graph([1,2,3])
run_graph([11,4])
run_graph([4,1])
run_graph([7,3,1])
run_graph([6,3])
run_graph([0,2])
run_graph([4,5,6])
# Write the summaries to disk
writer.flush()
# Close the SummaryWriter
writer.close()
# Close the session
sess.close()
Conclusion
That wraps up this chapter! There was a lot of information to absorb, and you should
definitely play around with TensorFlow now that you have a grasp of the fundamentals.
Get yourself fluent with Operations, Variables, and Sessions, and embed the basic loop of
building and running graphs into your head.
Using TensorFlow for simple math problems is fun (for some people), but we haven't
even touched on the primary use case for the library yet: machine learning. In the next
chapter, you'll be introduced to some of the core concepts and techniques for machine
learning and how to use them inside of TensorFlow.
The training loop adjusts the model parameters in order to minimize the loss through a number of training steps. Most commonly,
you can use a gradient descent algorithm for this, which we will explain in the
following section.
The loop repeats this process through a number of cycles, according to the learning rate
that we need to apply, and depending on the model and the data we input to it.
After training, we apply an evaluation phase, where we execute the inference against a
different set of data for which we also have the expected output, and evaluate the loss on it.
Because this dataset contains examples the model has never seen, the evaluation tells you
how well the model predicts beyond its training. A very common practice is to take the
original dataset and randomly split it into 70% of the examples for training and 30% for
evaluation.
Let's use this structure to define some generic scaffolding for the model code.
import tensorflow as tf

# initialize variables/model parameters

# define the training loop operations

def inference(X):
    # compute inference model over data X and return the result

def loss(X, Y):
    # compute loss over training data X and expected outputs Y

def inputs():
    # read/generate input training data X and expected outputs Y

def train(total_loss):
    # train / adjust model parameters according to computed total loss

def evaluate(sess, X, Y):
    # evaluate the resulting trained model

# Launch the graph in a session, setup boilerplate
with tf.Session() as sess:

    tf.initialize_all_variables().run()

    X, Y = inputs()

    total_loss = loss(X, Y)
    train_op = train(total_loss)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    # actual training loop
    training_steps = 1000
    for step in range(training_steps):
        sess.run([train_op])
        # for debugging and learning purposes, see how the loss gets decremented thru training steps
        if step % 10 == 0:
            print "loss: ", sess.run([total_loss])

    evaluate(sess, X, Y)

    coord.request_stop()
    coord.join(threads)
    sess.close()
This is the basic shape of the code for an inference model. First it initializes the model
parameters. Then it defines a method for each of the training loop operations: read input
training data (inputs method), compute the inference model (inference method), calculate the loss
over the expected output (loss method), adjust the model parameters (train method), and evaluate the
resulting model (evaluate method)- plus the boilerplate code to start a session and run
the training loop. In the following sections we will fill in these template methods with the
code required for each type of inference model.
Once you are happy with how the model responds, you can focus on exporting it and
serving it to run inference against the data you need to work with.
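The paragraph below describes checkpoint code built around tf.train.Saver. A minimal sketch of that pattern, with the my-model file name and the 1000-step interval taken from the description that follows:

# model definition code...

# Create a saver before launching the graph in a Session
saver = tf.train.Saver()

with tf.Session() as sess:
    # model setup and variable initialization...
    for step in range(training_steps):
        sess.run([train_op])
        # save a training checkpoint every 1000 steps
        if step % 1000 == 0:
            saver.save(sess, 'my-model', global_step=step)
    # save once more when the training loop finishes
    saver.save(sess, 'my-model', global_step=training_steps)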
In the code above we instantiate a saver before opening the session, inserting code inside
the training loop to call the tf.train.Saver.save method every 1000 training steps, along
with the final step when the training loop finishes. Each call will create a checkpoint file
with the name template my-model-{step}, like my-model-1000, my-model-2000, etc. The file stores the
current values of each variable. By default, the saver will keep only the 5 most recent files
and delete the rest.
If we wish to recover the training from a certain point, we should use the
tf.train.get_checkpoint_state method, which will verify if we already have a checkpoint
saved, and the tf.train.Saver.restore method to recover the variable values.
with tf.Session() as sess:
    # model setup...
    initial_step = 0

    # verify if we don't have a checkpoint saved already
    ckpt = tf.train.get_checkpoint_state(os.path.dirname(__file__))
    if ckpt and ckpt.model_checkpoint_path:
        # Restores from checkpoint
        saver.restore(sess, ckpt.model_checkpoint_path)
        initial_step = int(ckpt.model_checkpoint_path.rsplit('-', 1)[1])

    # actual training loop
    for step in range(initial_step, training_steps):
        ...
In the code above, we first check whether we have a checkpoint file, and then restore the
variable values before starting the training loop. We also recover the global step number
from the checkpoint file name.
Now that we know how supervised learning works in general, as well as how to store
our training progress, let's move on to explain some inference models.
Linear regression
Linear regression is the simplest form of modeling for a supervised learning problem.
Given a set of data points as training data, you are going to find the linear function that
best fits them. In a 2-dimensional dataset, this type of function represents a straight line.
Here is the charting of the linear regression model in 2D: the blue dots are the training data
points and the red line is what the model will infer.
Let's begin with a bit of math to explain how the model will work. The general formula
of a linear function is:
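In matrix form, with X holding one example per row, the model computes:

Y = XW + b

where W is the matrix of weights and b is the bias term. Expressed in TensorFlow, the model parameters must be declared first- a sketch assuming two input features, to match the weight/age dataset used below:

# initialize variables/model parameters
W = tf.Variable(tf.zeros([2, 1]), name="weights")
b = tf.Variable(0., name="bias")

With the parameters declared, the inference method is a single matrix operation: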
def inference(X):
    return tf.matmul(X, W) + b
Now we have to define how to compute the loss. For this simple model we will use the
squared error, which sums the squared difference between each predicted value
and its corresponding expected value over all the training examples. Algebraically, it is the squared
Euclidean distance between the predicted output vector and the expected one. Graphically,
in a 2D dataset it is the length of the vertical line that you can trace from each expected data
point to the predicted regression line. It is also known as the L2 norm or L2 loss function. We
use it squared to avoid computing the square root, since it makes no difference when trying
to minimize the loss and saves us a computing step.
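In symbols, the total squared error is:

L = \sum_i (\hat{y}_i - y_i)^2

A sketch of the corresponding loss method, using tf.squared_difference to compute the per-example differences:

def loss(X, Y):
    Y_predicted = inference(X)
    return tf.reduce_sum(tf.squared_difference(Y, Y_predicted))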
It's now time to actually train our model with data. As an example, we are going to
work with a dataset that relates age in years and weight in kilograms with blood fat
content (http://people.sc.fsu.edu/~jburkardt/datasets/regression/x09.txt).
As the dataset is short enough for our example, we are just going to embed it in our
code. In the following section we will show how to deal with reading the training data
from files, like you would in a real-world scenario.
def inputs():
    # values truncated here; the full list is in the dataset linked above
    weight_age = [[84, 46], [73, 20], [65, 52], [70, 30], [76, 57], [69, 25], [63, 28], [72, 36], ...]
    blood_fat_content = [354, 190, 405, 263, 451, 302, 288, 385, 402, 365, 209, 290, 346, 254, 395, ...]
    return tf.to_float(weight_age), tf.to_float(blood_fat_content)
And now we define the model training operation. We will use the gradient descent
algorithm for optimizing the model parameters, which we describe in the following
section.
def train(total_loss):
    learning_rate = 0.0000001
    return tf.train.GradientDescentOptimizer(learning_rate).minimize(total_loss)
When you run it, you will see the loss printed at each step, getting smaller over the
training steps. Now that we have trained the model, it's time to evaluate it. Let's compute the expected
blood fat for a 25-year-old person who weighs 80 kilograms. This point is not in the
source data, and we will compare it with a person of the same age who weighs 65
kilograms:
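A sketch of the evaluate method making those two queries (Python 2 print, matching the rest of the chapter):

def evaluate(sess, X, Y):
    # expected blood fat for 80 kg and 65 kg, both 25 years old
    print sess.run(inference([[80., 25.]]))
    print sess.run(inference([[65., 25.]]))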
As a quick evaluation, you can check that the model learned how the blood fat decays
with weight, and that the output values lie between the boundaries of the original training
values.
Logistic regression
The linear regression model predicts a continuous value, or any real number. Now we
are going to present a model that can answer a yes-no type of question, like "Is this email
spam?"
There is a function commonly used in machine learning called the logistic function. It
is also known as the sigmoid function, because its shape is an S (and sigma is the Greek
letter equivalent to s).
Here you see the charting of a logistic/sigmoid function, with its S shape.
The logistic function is a probability distribution function that, given a specific input
value, computes the probability of the output being a success, and thus the probability for
the answer to the question to be yes.
This function takes a single input value. In order to feed the function with the multiple
dimensions, or features from the examples of our training datasets, we need to combine
them into a single value. We can use the linear regression model expression for doing this,
like we did in the section above.
To express it in code, you can reuse all of the elements of the linear model; you just
slightly change the prediction to apply the sigmoid.
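A sketch of that change, with the linear combination factored into a combine_inputs method- the same helper the loss code below calls:

def combine_inputs(X):
    return tf.matmul(X, W) + b

def inference(X):
    return tf.sigmoid(combine_inputs(X))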
Now let's focus on the loss function for this model. We could use the squared error: the
logistic function computes the probability of the answer being "yes", and in the training set a
"yes" answer should represent 100% probability, or simply an output value of 1.
The loss would then be how far below 1 the probability our model assigned for that
particular example falls, squared. Consequently, a "no" answer represents 0 probability,
so the loss is whatever probability the model assigned for that example, again squared.
Consider an example where the expected answer is "yes" but the model is predicting a
very low probability for it, close to 0. This means that it is close to 100% sure that the
answer is "no".
The squared error penalizes such a case with the same order of magnitude for the loss as
if the probability had been predicted as 20, 30, or even 50% for the "no" output.
There is a loss function that works better for this type of problem: the cross
entropy function.
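For a single training example with expected output y (0 or 1) and predicted probability \hat{y}, cross entropy is:

L = -( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) )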
We can visually compare the behavior of both loss functions according to the predicted
output for a "yes" example.
When the cross entropy and squared error (L2) functions are charted together, cross entropy
outputs a much greater value (penalizes more) as the output gets farther from what is
expected.
With cross entropy, as the predicted probability comes closer to 0 for the "yes"
example, the penalty increases toward infinity. This makes it impossible for the model to
keep making that misprediction after training, and it makes cross entropy better suited as a loss
function for this model.
There is a TensorFlow method for calculating cross entropy directly for a sigmoid output
in a single, optimized step:
def loss(X, Y):
    return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(combine_inputs(X), Y))
You can actually link this entropy with the thermodynamics concept of entropy, in addition to their mathematical
expressions being analogous.
For instance, let's calculate the entropy for the word "HELLO".
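The entropy of a distribution p over symbols is H(p) = -\sum_x p(x) \log_2 p(x). In "HELLO", the letter L appears with probability 2/5, and H, E, and O each appear with probability 1/5, so:

H = -(2/5) \log_2(2/5) - 3 \cdot (1/5) \log_2(1/5) \approx 1.92 bits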
So you need 2 bits per symbol to encode "HELLO" in the optimal encoding.
If you encode the symbols assuming any other probability for them than the real one, you will
need more bits for encoding each symbol. That's where cross entropy comes into play: it allows you
to calculate the average minimum number of bits needed to encode the same string under another,
suboptimal encoding. ASCII, for example, assigns every symbol the same probability, so you need
8 bits per symbol to encode "HELLO" in ASCII, as you would have expected.
As a loss function, consider p to be the expected training output, a distribution (encoding) in which the
actual value has 100% probability and any other value 0, and consider q to be the output calculated by the model.
Remember that the sigmoid function computes a probability.
It is a theorem that the cross entropy H(p, q) is at its minimum when q = p, so cross entropy measures
how well one distribution fits another. The closer the cross entropy is to the entropy of p, the better q
approximates p. Effectively, then, cross entropy decreases as the model better resembles the expected output,
which is exactly what you need in a loss function.
We can also freely exchange the base-2 logarithm with the natural logarithm when minimizing,
since switching from one to the other only multiplies by the change-of-base constant.
Now let's apply the model to some data. We are going to use the Titanic survivor
Kaggle contest dataset from https://www.kaggle.com/c/titanic/data.
The model will have to infer, based on the passenger's age, sex, and ticket class, whether the
passenger survived or not.
To make it a bit more interesting, let's use data from a file this time. Go ahead and
download the train.csv file.
Here are the code basics for reading the file. This is a new method for our scaffolding:
it loads and parses the file, and creates a batch to read many rows packed in a single tensor
for computing the inference efficiently.
import os

def read_csv(batch_size, file_name, record_defaults):
    filename_queue = tf.train.string_input_producer([os.path.dirname(__file__) + "/" + file_name])
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)

    # decode_csv will convert a Tensor from type string (the text line) into
    # a tuple of tensor columns with the specified defaults, which also
    # sets the data type for each column
    decoded = tf.decode_csv(value, record_defaults=record_defaults)

    # batch actually reads the file and loads "batch_size" rows in a single tensor
    return tf.train.shuffle_batch(decoded,
                                  batch_size=batch_size,
                                  capacity=batch_size * 50,
                                  min_after_dequeue=batch_size)
You have to use categorical data from this dataset. Ticket class and gender are string
features with a predefined set of possible values. To use them in the
inference model we need to convert them to numbers. A naive approach might be
assigning a number to each possible value. For instance, you could use 1 for first ticket
class, 2 for second, and 3 for third. Yet that forces the values into a linear
relationship that doesn't really exist: you can't say that third class is "3
times" first class. Instead, you should expand each categorical feature into N boolean
features, one for each possible value of the original. This allows the model to learn the
importance of each possible value independently. In our example data, first class should
have a greater probability of survival than the others.
When working with categorical data, convert it to multiple boolean features, one for
each possible value. This allows the model to weight each possible value separately.
In the case of categories with only two possible values, like the gender in this dataset, it
is enough to have a single variable. That's because you can express a linear
relationship between the values: for instance, if the possible values are female = 1 and male = 0,
then male = 1 - female, and a single weight can learn to represent both possible states.
def inputs():
    passenger_id, survived, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked = \
        read_csv(100, "train.csv", [[0.0], [0.0], [0], [""], [""], [0.0], [0.0], [0.0], [""], [0.0], [""], [""]])

    # convert categorical data
    is_first_class = tf.to_float(tf.equal(pclass, [1]))
    is_second_class = tf.to_float(tf.equal(pclass, [2]))
    is_third_class = tf.to_float(tf.equal(pclass, [3]))
    gender = tf.to_float(tf.equal(sex, ["female"]))

    # Finally we pack all the features in a single matrix;
    # We then transpose to have a matrix with one example per row and one feature per column.
    features = tf.transpose(tf.pack([is_first_class, is_second_class, is_third_class, gender, age]))
    survived = tf.reshape(survived, [100, 1])

    return features, survived
In the code above, we define our inputs by calling read_csv and converting the data. To
convert to boolean, we use the tf.equal method to compare equality to a certain constant
value. We then have to convert the boolean back to a number to apply inference, with
tf.to_float. Finally, we pack all the booleans into a single tensor with tf.pack.
Now let's train our model.
def train(total_loss):
    learning_rate = 0.01
    return tf.train.GradientDescentOptimizer(learning_rate).minimize(total_loss)
To evaluate the results we are going to run the inference against a batch of the training
set and count the number of examples that were correctly predicted. We call that
measuring the accuracy.
def evaluate(sess, X, Y):
    predicted = tf.cast(inference(X) > 0.5, tf.float32)
    print sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))
As the model computes the probability of the answer being "yes", we convert that to a
positive answer if the output for an example is greater than 0.5. Then we compare for equality
with the actual value using tf.equal. Finally, we use tf.reduce_mean, which counts all of the
correct answers (as each of them adds 1) and divides by the total number of samples in the
batch, calculating the percentage of right answers.
If you run the code above, you should get around 80% accuracy, which is a good
number given the simplicity of this model.
Softmax classification
With logistic regression we were able to model the answer to the yes-no question. Now
we want to be able to answer a multiple choice type of question like: Were you born in
Boston, London, or Sydney?
For that case there is the softmax function, which is a generalization of logistic
regression for C different possible values.
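The function maps a vector of C scores to C probabilities that sum to 1:

softmax(x)_i = e^{x_i} / \sum_{j=1}^{C} e^{x_j}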
Regarding loss computation, the same considerations that applied for logistic regression apply
for choosing a candidate loss function here, since the output is again a probability. We are
therefore going to use cross entropy again, adapted for multiple classes in the computed probability.
For a single training example, cross entropy now becomes:

L = -\sum_c y_c \log(\hat{y}_c)

summing the loss over each output class c for that training example. Note that y_c would
equal 1 for the expected class of the training example and 0 for the rest, so only one loss
value is actually summed: the one measuring how far off the model predicted the probability
for the true class.
Now, to calculate the total loss over the training set, we sum the loss for each training
example:

L = -\sum_i \sum_c y_{i,c} \log(\hat{y}_{i,c})
In code, there are two versions of the softmax cross-entropy function implemented in TensorFlow.
One is specially optimized for training sets with a single class value per example: for instance,
our training data may have a class value that could be either "dog", "person", or "tree".
That function is tf.nn.sparse_softmax_cross_entropy_with_logits.
def loss(X, Y):
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(combine_inputs(X), Y))
The other version lets you work with training sets containing the probability of each
example belonging to every class. For instance, you could use training data like "60%
of the people asked consider that this picture is about dogs, 25% about trees, and the rest
about a person." That function is tf.nn.softmax_cross_entropy_with_logits. You may need such
a function in some real-world usages, but we won't need it for our simple examples. The
sparse version is preferred when possible because it is faster to compute. Note that the
final output of the model will always be one single class value; this version just supports
more flexible training data.
Let's define our input method. We will reuse the read_csv function from the logistic
regression example, but we will call it with the defaults for the values in our dataset, which
are all numeric.
def inputs():
    sepal_length, sepal_width, petal_length, petal_width, label = \
        read_csv(100, "iris.data", [[0.0], [0.0], [0.0], [0.0], [""]])

    # convert class names to a 0 based class index.
    label_number = tf.to_int32(tf.argmax(tf.to_int32(tf.pack([
        tf.equal(label, ["Iris-setosa"]),
        tf.equal(label, ["Iris-versicolor"]),
        tf.equal(label, ["Iris-virginica"])
    ])), 0))

    # Pack all the features that we care about in a single matrix;
    # We then transpose to have a matrix with one example per row and one feature per column.
    features = tf.transpose(tf.pack([sepal_length, sepal_width, petal_length, petal_width]))

    return features, label_number
We don't need to convert each class to its own variable to use
sparse_softmax_cross_entropy_with_logits, but we do need the value to be a number in the range
0..2, since we have 3 possible classes. In the dataset file the class is a string value out of the
possible "Iris-setosa", "Iris-versicolor", or "Iris-virginica". To convert it, we create a tensor
with tf.pack, comparing the file input with each possible value using tf.equal. Then we use
tf.argmax to find the position in that tensor which is valued true, effectively converting the
classes to a 0..2 integer.
The training function is also the same.
For evaluation of accuracy, we need a slight change from the sigmoid version:
def evaluate(sess, X, Y):
    predicted = tf.cast(tf.arg_max(inference(X), 1), tf.int32)
    print sess.run(tf.reduce_mean(tf.cast(tf.equal(predicted, Y), tf.float32)))
The inference will compute the probabilities of each output class for our test examples.
We use the tf.argmax function to choose the one with the highest probability as the
predicted output value. Finally, we compare with the expected class using tf.equal and
apply tf.reduce_mean, just like in the sigmoid example.
Running the code should print an accuracy of about 96%.
In the case of softmax classification, we used a network with C neurons- one for each
possible output class:
Now, in order to resolve more difficult tasks, like reading handwritten digits, or actually
identifying cats and dogs on images, we are going to need a more developed model.
Let's start with a simple example. Suppose we want to build a network that learns how
to fit the XOR (eXclusive OR) boolean operation:
Table 4-1. XOR operation truth table

Input 1   Input 2   Output
0         0         0
0         1         1
1         0         1
1         1         0
It should return 1 when either input equals 1, but not when both do.
This seems to be a far simpler problem than the ones we have tried so far, yet none
of the models that we have presented can solve it.
The reason is that sigmoid-type neurons require our data to be linearly separable to
do their job well. That means that there must exist a straight line in 2-dimensional data (or a
hyperplane in higher dimensional data) that separates all the data examples belonging to a
class on the same side of the plane, which looks something like this:
In the chart we can see example data samples as dots, with their associated class as the
color. As long as we can find that yellow line completely separating the red and the blue
dots in the chart, the sigmoid neuron will work fine for that dataset.
For XOR, we can't find a single straight line that would split the chart, leaving all of the 1s (red
dots) on one side and the 0s (blue dots) on the other. That's because the XOR function's output is
not linearly separable.
This problem actually resulted in neural network research losing importance for about a
decade, around the 1970s. So how did researchers fix the lack of linear separability to keep using
networks? They did it by intercalating more neurons between the input and the output of
the network, as you can see in the figure:
We say that we added a hidden layer of neurons between the input and the output
layers. You can think of it as allowing our network to ask multiple questions of the input
data, one question per neuron in the hidden layer, and finally deciding the output result
based on the answers to those questions.
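A sketch of what such a network can look like in code, reusing our scaffolding's inference method. The two-neuron hidden layer, random initialization, and variable names here are assumptions; two hidden neurons are enough for XOR:

# X is assumed to be a [None, 2] float placeholder of input pairs

# Hidden layer: two neurons, each asking one "question" about the input
W_hidden = tf.Variable(tf.random_uniform([2, 2], -1.0, 1.0), name="hidden_weights")
b_hidden = tf.Variable(tf.zeros([2]), name="hidden_bias")

# Output layer: combines the hidden answers into a single yes/no
W_out = tf.Variable(tf.random_uniform([2, 1], -1.0, 1.0), name="output_weights")
b_out = tf.Variable(tf.zeros([1]), name="output_bias")

def inference(X):
    hidden = tf.sigmoid(tf.matmul(X, W_hidden) + b_hidden)
    return tf.sigmoid(tf.matmul(hidden, W_out) + b_out)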
Graphically, we are allowing the network to draw more than one single separation line:
As you can see in the chart, each line divides the plane for one of the first questions asked of
the input data. Then you can group all of the equal outputs together in a single
area.
You can now guess what the "deep" in deep learning means: we make our networks
deeper by adding more hidden layers to them. We may use different types of connections
between them, and even different activation functions in each layer.
Later in this book we present different types of deep neural networks for different usage
scenarios.
You should think about a partial derivative as if your function received only one
single variable, replacing all of the others with constants, and then applying the usual
single-variable derivation procedure.
The partial derivatives measure the rate of change of the function output with respect to
a particular input variable. In other words, how much the output value will increase if we
increase that input variable's value.
Here is a caveat before going on. When we talk about the input variables of the loss
function, we are referring to the model weights, not the actual dataset feature inputs.
Those are fixed by our dataset and cannot be optimized. The partial derivatives we
calculate are with respect to each individual weight in the inference model.
We care about the gradient because its output vector indicates the direction of maximum
growth for the loss function. You could think of it as a little arrow that indicates, at
every point of the function, which way you should move to increase its value:
Suppose the chart above shows the loss function. The red dot represents the current
weight values, where you are currently standing. The gradient represents the arrow,
indicating that you should go right to increase the loss. Moreover, the length of the arrow
conceptually indicates how much you would gain by moving in that direction.
Now, if we go the opposite direction of the gradient, the loss will also do the opposite:
decrease.
In the chart, if we go in the opposite direction of the gradient (blue arrow) we will go in
the direction of decreasing loss.
If we move in that direction and calculate the gradient again, and then repeat the
process until the gradient length is 0, we will arrive at the loss minimum. That is our goal,
and graphically should look like:
Notice how we added a value to scale the gradient. We call it the learning rate. We
need to add that because the length of the gradient vector is actually an amount measured
in the loss function's units, not in weight units, so we need to scale the gradient to be
able to apply it to our weights.
The learning rate is not a value that the model will infer. It is a hyperparameter: a
manually configurable setting for our model. We need to figure out the right value for it. If
it is too small, then it will take many learning cycles to find the loss minimum. If it is too
large, the algorithm may simply skip over the minimum and never find it, jumping around
it cyclically. That's known as overshooting. In our example loss function chart, it would
look like:
In practice, we can't really plot the loss function because it has many variables. So to
know whether we are trapped in overshooting, we have to look at the plot of the computed
total loss through time, which we can get in TensorBoard by using a tf.scalar_summary on the
loss, as sketched below.
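A minimal sketch (total_loss and step are assumed to be the loss tensor and the current training step, defined elsewhere):
loss_summary = tf.scalar_summary("loss", total_loss)
summary_writer = tf.train.SummaryWriter("./logs", sess.graph)
# Inside the training loop, record the current loss value at each step:
summary_writer.add_summary(sess.run(loss_summary), global_step=step)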
This is how a well-behaved loss should diminish through time, indicating a good
learning rate:
The blue line is the TensorBoard chart, and the red one represents the trend line of
the loss.
This is what it looks like when it is overshooting:
You should play with adjusting the learning rate so it is small enough that it doesn't
overshoot, but large enough to get the loss decaying quickly, so you can learn
faster using fewer cycles.
Besides the learning rate, other issues affect gradient descent. One is the
presence of local optima in the loss function. Going back to the toy example loss
function plot, this is how the algorithm would work if we had our initial weights close to
the right-side valley of the loss function:
The algorithm will find the valley and then stop because it will think that it is where the
best possible value is located. The gradient is valued at 0 at all minima, so the algorithm
can't distinguish whether it stopped at the absolute minimum of the function, the global
minimum, or at a local minimum that is the best value only in its close neighborhood.
We try to fight against this by initializing the weights with random values. Remember that
the first value for the weights is set manually. By using random values, we improve the
chance of starting the descent closer to the global minimum.
In a deep network context like the ones we will see in later chapters, local minima are
very frequent. A simple way to explain this is to think about how the same input can travel
many different paths to the output, thus generating the same outcome. Luckily, there are
papers showing that all of those minima are closely equivalent in terms of loss, and they
are not really much worse than the global minimum.
So far we haven't been explicitly calculating any derivatives here, because we didn't
have to. TensorFlow includes the method tf.gradients to symbolically compute the
gradients of the specified graph steps and output them as tensors. We don't even need to
call it manually, because TensorFlow also includes implementations of the gradient descent
algorithm, among others. That is why we present high-level formulas for how things should
work without going in-depth into the implementation details and the math.
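A minimal sketch (assuming total_loss is the model's loss tensor):
# One optimizer step computes the gradients and applies them to the weights:
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(total_loss)
# The symbolic gradients themselves are also available directly:
gradients = tf.gradients(total_loss, tf.trainable_variables())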
We are now going to walk through backpropagation. It is a technique used for efficiently
computing the gradient in a computational graph.
Let's assume a really simple network, with one input, one output, and two hidden layers,
each with a single neuron. Both the hidden and output neurons will be sigmoids and the loss
will be calculated using cross entropy. Such a network should look like:
To run one step of gradient descent, we need to calculate the partial derivatives of the loss
function with respect to the three weights in the network. We will start from the output
layer weights, applying the chain rule:
Now let's calculate the derivative for the second hidden layer weight:
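As a sketch of the chain rule at work (writing $z_i$ for the weighted input of layer $i$, $y_i = \sigma(z_i)$ for its output, $w_i$ for its weight, and $L$ for the loss):

$\frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_3} = \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial z_3} \cdot y_2$

$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial z_3} \cdot w_3 \cdot \frac{\partial y_2}{\partial z_2} \cdot y_1$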
You should notice a pattern. The derivative for each layer is the product of the
derivatives of the layers after it, multiplied by the output of the layer before it. That's the
magic of the chain rule and what the algorithm takes advantage of.
We go forward from the inputs, calculating the outputs of each hidden layer up to the
output layer. Then we start calculating derivatives going backwards through the hidden
layers, propagating the results in order to do fewer calculations by reusing all of the
elements already calculated. That's the origin of the name backpropagation.
Conclusion
Notice how we did not use the definitions of the sigmoid or cross entropy derivatives.
We could have used a network with different activation functions or a different loss and
the result would be the same.
This is a very simple example, but in a network with thousands of weights, using this
algorithm to calculate their derivatives can save orders of magnitude in training time.
To close, there are a few different optimization algorithms included in TensorFlow,
though all of them are based on this method of computing gradients. Which one works
better depends on the shape of your input data and the problem you are trying to solve.
Sigmoid hidden layers, softmax output layers, and gradient descent with
backpropagation are the most fundamental building blocks that we are going to use for
the more complex models we will see in the next chapters.
If one of the images shown above is loaded into the model, it should output a label of
Siberian Husky. These example images wouldn't be a fair test of the model's accuracy
because they exist in the training dataset. Finding a fair metric to calculate the model's
accuracy requires a large number of images which won't be used in training. The images
which haven't been used in training the model will be used to create a separate test
dataset.
The reason to bring up the fairness of an image for testing a model's accuracy is that it's
part of keeping separate test, train and cross-validation datasets. While processing
input, it is a required practice to separate a large percentage of the data used to train a
network. This separation is to allow a blind test of a model. Testing a model with input
which was used to train it will likely create a model which accurately matches input it has
already seen while not being capable of working well with new input. The testing dataset is
then used to see how well the model performs with data that didn't exist in the training.
Over time and iterations of the model, it is possible that the changes being made to
increase accuracy are making the model better equipped to fit the testing dataset while
performing poorly in the real world. A good practice is to use a cross-validation dataset to
check the final model and receive a better estimate of its accuracy. With images, it's best
to separate the raw dataset before doing any preprocessing (color adjustments or cropping),
keeping the input pipeline the same across all the datasets.
discusses a grouping of cells that extend vertically, combining together to match certain
visual traits. The study of primate brains may seem irrelevant to a machine learning task,
yet it was instrumental in the development of deep learning using CNNs.
CNNs follow a simplified process, matching information similar to the structure found
in the cellular layout of a monkey's striate cortex. As signals are passed through a
monkey's striate cortex, certain layers signal when a visual pattern is highlighted. For
example, one layer of cells activates (increases its output signal) when a horizontal line
passes through it. A CNN will exhibit a similar behavior, where clusters of neurons will
activate based on patterns learned from training. For example, after training, a CNN will
have certain layers that activate when a horizontal line passes through them.
Matching horizontal lines alone would be a useful neural network capability, but CNNs take
it further by layering multiple simple patterns to match complex patterns. In the context of
CNNs, these patterns are known as filters or kernels, and the goal is to adjust these kernel
weights until they accurately match the training data. Training these filters is often
accomplished by combining multiple different layers and learning the weights using gradient
descent.
A simple CNN architecture may combine a convolutional layer (tf.nn.conv2d), a
non-linearity layer (tf.nn.relu), a pooling layer (tf.nn.max_pool) and a fully connected layer
(tf.matmul). Without these layers, it's difficult to match complex patterns because the
network will be filled with too much information. A well designed CNN architecture
highlights important information while ignoring noise. We'll go into details on how these
layers work together later in this chapter.
The input image for this architecture is a complex format designed to support the ability
to load batches of images. Loading a batch of images allows the computation of multiple
images simultaneously, but it requires a more complex data structure. The data structure
used is a rank four tensor including all the information required to convolve a batch of
images. TensorFlow's input pipeline (which is used to read and decode files) has a special
format designed to work with multiple images in a batch, including the required information
for an image ([image_batch_size, image_height, image_width, image_channels]). Using the example
code, it's possible to examine the structure of an example input used while working with
images in TensorFlow.
image_batch = tf.constant([
    [  # First Image
        [[0, 255, 0], [0, 255, 0], [0, 255, 0]],
        [[0, 255, 0], [0, 255, 0], [0, 255, 0]]
    ],
    [  # Second Image
        [[0, 0, 255], [0, 0, 255], [0, 0, 255]],
        [[0, 0, 255], [0, 0, 255], [0, 0, 255]]
    ]
])
image_batch.get_shape()
NOTE: The example code and further examples in this chapter do not include the
common bootstrapping required to run TensorFlow code. This includes importing the
tensorflow module (usually imported as tf for brevity), creating a TensorFlow session as sess,
initializing all variables, and starting thread runners. Undefined variable errors may occur
if the example code is executed without running these steps.
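A minimal sketch of that bootstrapping, using the TensorFlow 0.x-era API this book targets:
import tensorflow as tf

sess = tf.Session()
# Run after the graph (and its variables) have been defined.
sess.run(tf.initialize_all_variables())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)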
In this example code, a batch of images is created that includes two images. Each image
has a height of two pixels and a width of three pixels with an RGB color space. The output
from executing the example code shows the number of images as the size of the first
dimension Dimension(2), the height of each image as the size of the second dimension Dimension(2),
the width of each image as the third dimension Dimension(3), and the size of the color channel
as the final dimension Dimension(3).
It's important to note each pixel maps to the height and width of the image. Retrieving
the first pixel of the first image requires accessing each dimension as follows.
sess.run(image_batch)[0][0][0]
Instead of loading images from disk, the image_batch variable will act as if it were images
loaded as part of an input pipeline. Images loaded from disk using an input pipeline have
the same format and act the same. It's often useful to create fake data similar to the
image_batch example above to test input and output from a CNN. The simplified input will
make it easier to debug any simple issues. It's important to work on simplification of
debugging because CNN architectures are incredibly complex and often take days to train.
The first complexity in working with CNN architectures is how a convolution layer works.
After any image loading and manipulation, a convolution layer is often the first layer in
the network. The first convolution layer is useful because it can simplify the rest of the
network and be used for debugging. The next section will focus on how convolution layers
work.
Convolution
As the name implies, convolution operations are an important component of
convolutional neural networks. The ability of a CNN to accurately match diverse patterns
can be attributed to using convolution operations. These operations require complex input,
which was shown in the previous section. In this section we'll experiment with
convolution operations and the parameters that are available to tune them. Here the
convolution operation convolves two input tensors (input and kernel) into a single output
tensor, which represents information from each input.
The example code creates two tensors. The input_batch tensor has a similar shape to the
image_batch tensor seen in the previous section. This will be the first tensor being convolved,
and the second tensor will be kernel. Kernel is an important term that is interchangeable
with weights, filter, convolution matrix or mask. Since this task is computer vision related,
it's useful to use the term kernel because it is being treated as an image kernel. There is no
practical difference in the term when used to describe this functionality in TensorFlow.
The parameter in TensorFlow is named filter and it expects a set of weights which will be
learned from training. The number of different weights included in the kernel (filter
parameter) will configure the number of kernels that will be learned.
In the example code, there is a single kernel, which is the first dimension of the kernel
variable. The kernel is built to return a tensor that will include one channel with the
original input and a second channel with the original input doubled. In this case, channel is
used to describe the elements in a rank 1 tensor (vector). Channel is a term from computer
vision that describes the output vector; for example, an RGB image has three channels
represented as a rank 1 tensor [red, green, blue]. For now, ignore the strides and padding
parameters, which will be covered later, and focus on the convolution (tf.nn.conv2d) output.
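As a sketch, the two tensors could be defined like this (the exact values are illustrative; the kernel is a 1x1 filter with one input channel and two output channels, the second channel doubling the first):
input_batch = tf.constant([
    [  # First Input
        [[0.0], [1.0]],
        [[2.0], [3.0]]
    ],
    [  # Second Input
        [[2.0], [4.0]],
        [[6.0], [8.0]]
    ]
])

kernel = tf.constant([
    [
        [[1.0, 2.0]]
    ]
])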
conv2d = tf.nn.conv2d(input_batch, kernel, strides=[1, 1, 1, 1], padding='SAME')
sess.run(conv2d)
The output is another tensor of the same rank as the input_batch, but including a channel
dimension that matches the kernel's number of output channels.
In this simplified example, each pixel of every image is multiplied by the corresponding
value found in the kernel and then added to a corresponding layer in the feature map.
Layer, in this context, references a new dimension in the output. With this example, it's
hard to see the value of convolution operations.
Strides
The value of convolutions in computer vision is their ability to reduce the
dimensionality of the input, which is an image in this case. An image's dimensionality (for a
2D image) is its width, height and number of channels. A large image dimensionality requires
a much larger amount of time for a neural network to scan over every pixel and
judge which ones are important. Reducing the dimensionality of an image with convolutions
is done by altering the strides of the kernel.
The strides parameter causes a kernel to skip over pixels of an image and not include
them in the output. It's not entirely fair to say the pixels are skipped, because they may
still affect the output. The strides parameter highlights how a convolution operation works
with a kernel when a larger image and a more complex kernel are used. As a convolution slides
the kernel over the input, it uses the strides parameter to change how it walks over the
input. Instead of going over every element of the input, the strides parameter can
configure the convolution to skip certain elements.
For example, take the convolution of a larger image and a larger kernel. In this case, it's
a convolution between a 6 pixel tall, 6 pixel wide, 1 channel deep image (6x6x1) and a
(3x3x1) kernel.
input_batch = tf.constant([
    [  # First Input (6x6x1)
        [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]],
        [[0.1], [1.1], [2.1], [3.1], [4.1], [5.1]],
        [[0.2], [1.2], [2.2], [3.2], [4.2], [5.2]],
        [[0.3], [1.3], [2.3], [3.3], [4.3], [5.3]],
        [[0.4], [1.4], [2.4], [3.4], [4.4], [5.4]],
        [[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]],
    ],
])

kernel = tf.constant([  # Kernel (3x3x1)
    [[[0.0]], [[0.5]], [[0.0]]],
    [[[0.0]], [[1.0]], [[0.0]]],
    [[[0.0]], [[0.5]], [[0.0]]]
])

# NOTE: the change in the size of the strides parameter.
conv2d = tf.nn.conv2d(input_batch, kernel, strides=[1, 3, 3, 1], padding='SAME')
sess.run(conv2d)
The input_batch was combined with the kernel by moving the kernel over the input_batch,
striding (or skipping) over certain elements. Each time the kernel was moved, it was
centered over an element of input_batch. Then the overlapping values were multiplied
together and the results summed. This is how a convolution combines two inputs
using what's referred to as pointwise multiplication. It may be easier to visualize using the
following figure.
In this figure, the same logic as in the code is followed: two tensors are convolved
together while striding over the input. The strides reduced the dimensionality of the output
a large amount, while the kernel size allowed the convolution to use all the input values.
None of the input data was completely removed by the striding, but now the input is a
smaller tensor.
Strides are a way to adjust the dimensionality of input tensors. Reducing dimensionality
requires less processing power, and will keep receptive fields from completely overlapping.
The strides parameter follows the same format as the input tensor:
[image_batch_size_stride, image_height_stride, image_width_stride, image_channels_stride].
Changing the first or last element of the strides parameter is rare; they'd skip data in a
tf.nn.conv2d operation and not take that input into account. The image_height_stride and
image_width_stride are useful to alter when reducing input dimensionality.
A challenge that comes up often when striding over the input is how to deal with a stride
which doesn't evenly end at the edge of the input. Uneven striding will come up often
due to the image size and kernel size not matching the striding. If the image size, kernel size
and strides can't be changed, then padding can be added to the image to deal with the
uneven area.
Padding
When a kernel is overlapped on an image, it should be set to fit within the bounds of the
image. At times, the sizing may not fit, and a good alternative is to fill the missing area of
the image. Filling the missing area of the image is known as padding the image.
TensorFlow will pad the image with zeros or raise an error when the sizes don't allow a
kernel to stride over an image without going past its bounds. The amount of zeros or the
error state of tf.nn.conv2d is controlled by the parameter padding, which has two possible
values (VALID, SAME).
SAME: The convolution output is the SAME size as the input. This doesn't take the
filter's size into account when calculating how to stride over the image. This may stride
over more of the image than what exists within its bounds, padding all the missing
values with zero.
VALID: Takes the filter's size into account when calculating how to stride over the
image. This will try to keep as much of the kernel inside the image's bounds as possible.
There may be padding in some cases, but it will be avoided when possible.
It's best to consider the size of the input, but if padding is necessary then TensorFlow
has the option built in. In most simple scenarios, SAME is a good choice to begin with. VALID
is preferable when the input and kernel work well with the strides. For further
information, TensorFlow covers this subject well in the convolution documentation.
Data Format
There's another parameter to tf.nn.conv2d, not shown in these examples, named
data_format. The tf.nn.conv2d docs explain how to change the data format so the input, kernel
and strides follow a format other than the one being used thus far. Changing this format
is useful if there is an input tensor which doesn't follow the [batch_size, height, width,
channel] standard. Instead of changing the input to match, it's possible to change the
data_format parameter to use a different layout.
data_format: An optional string from: NHWC, NCHW. Defaults to NHWC. Specify the data format of
the input and output data. With the default format NHWC, the data is stored in the order of: [batch,
in_height, in_width, in_channels]. Alternatively, the format could be NCHW, the data storage order of:
[batch, in_channels, in_height, in_width].
Data Format   Definition
NHWC          [batch_size, image_height, image_width, image_channels]
NCHW          [batch_size, image_channels, image_height, image_width]
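As a hypothetical sketch (note that the NCHW format is typically only supported on GPU devices), the earlier convolution could be fed NCHW-ordered data:
nchw_input = tf.transpose(input_batch, [0, 3, 1, 2])  # Reorder NHWC -> NCHW
conv2d_nchw = tf.nn.conv2d(nchw_input, kernel, strides=[1, 1, 1, 1],
                           padding='SAME', data_format='NCHW')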
Kernels in Depth
In TensorFlow the filter parameter is used to specify the kernel convolved with the
input. Filters are commonly used in photography to adjust attributes of a picture, such as
the amount of sunlight allowed to reach a camera's lens. In photography, filters allow a
photographer to drastically alter the picture they're taking. The reason the photographer is
able to alter their picture using a filter is because the filter can recognize certain attributes
of the light coming into the lens. For example, a red lens filter will absorb (block) every
frequency of light which isn't red, allowing only red to pass through the filter.
The output created from convolving an image with an edge detection kernel is all the
areas where an edge was detected. The code assumes a batch of images is already
available (image_batch) with a real image loaded from disk. In this case, the image is an
example image found in the Stanford Dogs Dataset. The kernel has three input and three
output channels. The channels sync up to RGB values between [0, 255], with 255 being
the maximum intensity. The tf.minimum and tf.nn.relu calls are there to keep the convolution
values within the range of valid RGB colors of [0, 255].
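As a sketch, an edge-detection kernel in the same per-channel diagonal layout as the sharpen kernel shown below could look like this (a center weight of 8 surrounded by -1 on each RGB diagonal; the exact weights are illustrative):
kernel = tf.constant([
    [
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
    ],
    [
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[ 8., 0., 0.], [ 0., 8., 0.], [ 0., 0., 8.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
    ],
    [
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
    ]
])

conv2d = tf.nn.conv2d(image_batch, kernel, [1, 1, 1, 1], padding="SAME")
activation_map = sess.run(tf.minimum(tf.nn.relu(conv2d), 255))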
There are many other common kernels which can be used in this simplified example.
Each will highlight different patterns in an image with different results. The following
kernel will sharpen an image by increasing the intensity of color changes.
kernel = tf.constant([
    [
        [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]]
    ],
    [
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[ 5., 0., 0.], [ 0., 5., 0.], [ 0., 0., 5.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]
    ],
    [
        [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]],
        [[-1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],
        [[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]]
    ]
])

conv2d = tf.nn.conv2d(image_batch, kernel, [1, 1, 1, 1], padding="SAME")
activation_map = sess.run(tf.minimum(tf.nn.relu(conv2d), 255))
The values in this kernel were adjusted with the center of the kernel increased in
intensity and the areas around the center reduced in intensity. The change matches
patterns with intense pixels and increases their intensity, outputting an image which is
visually sharpened. Note that the corners of the kernel are all 0 and don't affect the output;
the kernel operates in a plus-shaped pattern.
These kernels match patterns in images at a rudimentary level. A convolutional neural
network matches edges and more by using complex kernels it learns during training.
The starting values for the kernels are usually random, and over time they're trained by the
CNN's learning layer. During training, each image sent in is convolved with a kernel which
is then changed a small amount based on whether the predicted value matches the
labeled value of the image. For example, if a Sheepdog picture is considered a Pit Bull by
the CNN being trained, the network will change its filters a small amount to try and match
Sheepdog pictures better.
Learning complex patterns with a CNN involves more than a single layer of
convolution. Even the example code included a tf.nn.relu layer used to prepare the output
for visualization. Convolution layers may occur more than once in a CNN, but they'll
likely include other layer types as well. These layers combined form the support network
required for a successful CNN architecture.
Common Layers
For a neural network architecture to be considered a CNN, it requires at least one
convolution layer (tf.nn.conv2d). While there are practical uses for a single-layer CNN (edge
detection), for image recognition and categorization it is common to use different layer
types to support a convolution layer. These layers help reduce over-fitting, speed up
training and decrease memory usage.
The layers covered in this chapter focus on layers commonly used in a CNN
architecture. A CNN isn't limited to only these layers; they can be mixed with layers
designed for other network architectures.
Convolution Layers
One type of convolution layer has been covered in detail (tf.nn.conv2d), but there are a
few notes which are useful to advanced users. The convolution layers in TensorFlow don't
do a full convolution; details can be found in the TensorFlow API documentation. In
practice, the difference between a convolution and the operation TensorFlow uses is
performance. TensorFlow uses a technique to speed up the convolution operation in all the
different types of convolution layers.
There are use cases for each type of convolution layer, but tf.nn.conv2d is a good place
to start. The other types of convolutions are useful but not required in building a network
capable of object recognition and classification. A brief summary of each is included.
tf.nn.depthwise_conv2d
This convolution is used when attaching the output of one convolution to the input of
another convolution layer. An advanced use case is using a tf.nn.depthwise_conv2d to create a
network following the inception architecture.
tf.nn.separable_conv2d
This is similar to tf.nn.conv2d, but not a replacement for it. For large models, it speeds up
training without sacrificing accuracy. For small models, it will converge quickly with
worse accuracy.
tf.nn.conv2d_transpose
This applies a kernel to a new feature map where each section is filled with the same
values as the kernel. As the kernel strides over the new image, any overlapping sections
are summed together. There is a great explanation of how tf.nn.conv2d_transpose is used for
learnable upsampling in Stanford's CS231n Winter 2016: Lecture 13.
Activation Functions
These functions are used in combination with the output of other layers to generate a
feature map. They're used to smooth (or differentiate) the results of certain operations.
The goal is to introduce non-linearity into the neural network. Non-linearity means that
the function's output is a curve instead of a straight line. Curves are capable of representing
more complex changes in input. For example, non-linear output is capable of describing input
which stays small for the majority of the time but periodically has a single point at an
extreme. Introducing non-linearity into a neural network allows it to train on the complex
patterns found in data.
TensorFlow has multiple activation functions available. With CNNs, tf.nn.relu is
primarily used because of its performance, although it sacrifices information. When
starting out, using tf.nn.relu is recommended, but advanced users may create their own.
When considering whether an activation function is useful, there are a few primary
considerations.
1. The function is monotonic, so its output should always be increasing or decreasing
along with the input. This allows gradient descent optimization to search for local
minima.
2. The function is differentiable, so there must be a derivative at any point in the
functions domain. This allows gradient descent optimization to properly work using
the output from this style of activation function.
Any functions that satisfy those considerations could be used as activation functions. In
TensorFlow there are a few worth highlighting which are common to see in CNN
architectures. A brief summary of each is included with a small sample code illustrating
their usage.
tf.nn.relu
A rectifier (rectified linear unit) is called a ramp function in some documentation and
looks like a skateboard ramp when plotted. ReLU is linear and keeps the same input
values for any positive numbers while setting all negative numbers to 0. It has the
benefit that it doesn't suffer from gradient vanishing and has a range of [0, +∞). A
drawback of ReLU is that it can suffer from neurons becoming saturated when too high of
a learning rate is used.
features = tf.range(-2, 3)
# Keep note of the value for negative features
sess.run([features, tf.nn.relu(features)])
In this example, the input is a rank one tensor (vector) of integer values in the range
of [-2, 2]. tf.nn.relu is run over the values, and the output highlights that any value less
than 0 is set to 0. The other input values are left untouched.
tf.sigmoid
A sigmoid function returns a value in the range of [0.0, 1.0]. Larger values sent into
a tf.sigmoid will trend closer to 1.0 while smaller values will trend towards 0.0. The ability
of sigmoids to keep values between 0.0 and 1.0 is useful in networks which train on
probabilities in the range of [0.0, 1.0].
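For example, a minimal sketch following the same setup as the tf.nn.relu example:
# Convert a range of integers into floats before applying the sigmoid.
features = tf.to_float(tf.range(-1, 3))
sess.run([features, tf.sigmoid(features)])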
In this example, a range of integers is converted to float values (1 becomes 1.0) and a
sigmoid function is run over the input features. The result highlights that when a value of
0.0 is passed through a sigmoid, the result is 0.5, which is the midpoint of the sigmoid's
range. It's useful to note that with 0.5 being the sigmoid's midpoint, negative values can
be used as input to a sigmoid.
tf.tanh
A hyperbolic tangent function (tanh) is a close relative to tf.sigmoid with some of the
same benefits and drawbacks. The main difference between tf.sigmoid and tf.tanh is that
tf.tanh has a range of [-1.0, 1.0]. The ability to output negative values may be
useful in certain network architectures.
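For example, a minimal sketch with the same setup as the tf.sigmoid example:
features = tf.to_float(tf.range(-1, 3))
sess.run([features, tf.tanh(features)])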
In this example, all the setup is the same as in the tf.sigmoid example, but the output shows
an important difference. In the output of tf.tanh the midpoint is 0.0, and negative values
appear in the output. This can cause trouble if the next layer in the network isn't expecting
negative input or input of 0.0.
tf.nn.dropout
Sets the output to 0.0 with a configurable probability. This layer performs well in
scenarios where a little randomness helps training. An example scenario is when there are
patterns being learned that are too tied to their neighboring features. This layer will add a
little noise to the output being learned.
NOTE: This layer should only be used during training, because the random noise it adds
will give misleading results while testing.
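For example, a minimal sketch:
features = tf.constant([-0.1, 0.0, 0.1, 0.2])
# Roughly half the values are zeroed out; kept values are scaled up by 1/keep_prob.
sess.run([features, tf.nn.dropout(features, keep_prob=0.5)])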
In this example, each output has a 50% probability of being kept. Each execution of this
layer will have different output (it's somewhat random). When an output is
dropped, its value is set to 0.0.
Pooling Layers
Pooling layers reduce over-fitting and improve performance by reducing the size of
the input. They're used to scale down input while keeping important information for the
next layer. It's possible to reduce the size of the input using a tf.nn.conv2d alone, but
pooling layers execute much faster.
tf.nn.max_pool
Strides over a tensor and chooses the maximum value found within a certain kernel size.
Useful when the intensity of the input data is relevant to its importance in the image.
The same example is modeled using the example code below. The goal is to find the largest
value within the tensor.
# Usually the input would be output from a previous layer and not an image directly.
batch_size = 1
input_height = 3
input_width = 3
input_channels = 1

layer_input = tf.constant([
    [
        [[1.0], [0.2], [1.5]],
        [[0.1], [1.2], [1.4]],
        [[1.1], [0.4], [0.4]]
    ]
])

# The kernel covers the entire input by using the input_height and input_width.
kernel = [batch_size, input_height, input_width, input_channels]
max_pool = tf.nn.max_pool(layer_input, kernel, [1, 1, 1, 1], "VALID")
sess.run(max_pool)
tf.nn.avg_pool
Strides over a tensor and averages all the values at each depth found within the kernel
size. Useful when reducing values where the entire kernel is important; for example, input
tensors with a large width and height but small depth.
The same example is modeled using the example code below. The goal is to find the
average of all the values in the tensor: do a summation of all the values in the tensor,
then divide that by the number of scalars in the tensor:
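A minimal sketch, mirroring the tf.nn.max_pool example (the exact input values are illustrative):
batch_size = 1
input_height = 3
input_width = 3
input_channels = 1

layer_input = tf.constant([
    [
        [[1.0], [1.0], [1.0]],
        [[1.0], [0.5], [0.0]],
        [[0.0], [0.0], [0.5]]
    ]
])

# A kernel covering the entire input averages every value in the tensor.
kernel = [batch_size, input_height, input_width, input_channels]
avg_pool = tf.nn.avg_pool(layer_input, kernel, [1, 1, 1, 1], "VALID")
sess.run(avg_pool)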
This is exactly what the example code above does, but by reducing the size of the kernel,
it's possible to adjust the size of the output.
Normalization
Normalization layers are not unique to CNNs and aren't used as often. When using
tf.nn.relu, it is useful to consider normalization of the output. Since ReLU is unbounded,
it's often useful to utilize some form of normalization to identify high-frequency features.
tf.nn.local_response_normalization (tf.nn.lrn)
Local response normalization is a function which shapes the output based on a
summation operation best explained in TensorFlow's documentation.
Within a given vector, each component is divided by the weighted, squared sum of inputs within
depth_radius.
One goal of normalization is to keep the input in a range of acceptable numbers. For
instance, normalizing input into the range of [0.0, 1.0], where the full range of possible
values is normalized to be represented by a number greater than or equal to 0.0 and less
than or equal to 1.0. Local response normalization normalizes values while taking into
account the significance of each value.
Cuda-Convnet includes further details on why using local response normalization is
useful in some CNN architectures. ImageNet uses this layer to normalize the output from
tf.nn.relu.
# Create a range of 3 floats.
# TensorShape([batch, image_height, image_width, image_channels])
layer_input = tf.constant([
[[[ 1.]], [[ 2.]], [[ 3.]]]
])
lrn = tf.nn.local_response_normalization(layer_input)
sess.run([layer_input, lrn])
In this example code, the layer input is in the format [batch, image_height, image_width,
image_channels]. The normalization reduced the output to be in the range of
[-1.0, 1.0]. For tf.nn.relu, this layer will reduce its unbounded output to be in the
same range.
tf.contrib.layers.convolution2d
The convolution2d layer will do the same logic as tf.nn.conv2d while including weight
initialization, bias initialization, trainable variable output, bias addition and adding an
activation function. Many of these steps haven't been covered for CNNs yet but should be
familiar. A kernel is a trainable variable (the CNN's goal is to train this variable); weight
initialization is used to fill the kernel with values (tf.truncated_normal) on its first run. The
rest of the parameters are similar to what has been used before, except they are reduced to a
short-hand version. Instead of declaring the full kernel, now it's a simple tuple (1,1) for the
kernel's height and width.
image_input = tf.constant([
    [
        [[0., 0., 0.], [255., 255., 255.], [254., 0., 0.]],
        [[0., 191., 0.], [3., 108., 233.], [0., 191., 0.]],
        [[254., 0., 0.], [255., 255., 255.], [0., 0., 0.]]
    ]
])

conv2d = tf.contrib.layers.convolution2d(
    image_input,
    num_output_channels=4,
    kernel_size=(1, 1),  # It's only the filter height and width.
    activation_fn=tf.nn.relu,
    stride=(1, 1),  # Skips the stride values for image_batch and input_channels.
    trainable=True)

# It's required to initialize the variables used in convolution2d's setup.
sess.run(tf.initialize_all_variables())
sess.run(conv2d)
This example sets up a full convolution against a batch of a single image. All the
parameters are based off of the steps done throughout this chapter. The main difference is
that tf.contrib.layers.convolution2d does a large amount of setup without it having to be
written again. This can be a great time-saving layer for advanced users.
NOTE: tf.to_float should not be used if the input is an image; instead, use
tf.image.convert_image_dtype, which will properly change the range of values used to describe
colors. In this example code, float values of 255. were used, which aren't what TensorFlow
expects when it sees an image using float values. TensorFlow expects an image with
colors described as floats to stay in the range of [0, 1).
tf.contrib.layers.fully_connected
A fully connected layer is one where every input is connected to every output. This is a
fairly common layer in many architectures, but for CNNs, the last layer is quite often fully
connected. The tf.contrib.layers.fully_connected layer offers a great short-hand to create this
last layer while following best practices.
Typical fully connected layers in TensorFlow are often in the format of
tf.matmul(features, weight) + bias, where features, weight and bias are all tensors. This short-hand
layer will do the same thing while taking care of the intricacies involved in managing the
weight and bias tensors.
features = tf.constant([
    [[1.2], [3.4]]
])

fc = tf.contrib.layers.fully_connected(features, num_output_units=2)
# It's required to initialize all the variables first or there'll be an error about precondition failures.
sess.run(tf.initialize_all_variables())
sess.run(fc)
This example created a fully connected layer and associated the input tensor with each
neuron of the output. There are plenty of other parameters to tweak for different fully
connected layers.
Layer Input
Each layer serves a purpose in a CNN architecture. It's important to understand them at
a high level (at least), but without practice they're easy to forget. A crucial layer in any
neural network is the input layer, where raw input is sent to be trained and tested. For
object recognition and classification, the input layer is a tf.nn.conv2d layer which accepts
images. The next step is to use real images in training instead of example input in the form
of tf.constant or tf.range variables.
Each scalar can be changed to make the pixel another color or a mix of colors. The rank
1 tensor of a pixel is in the format of [red, green, blue] for an RGB color space. All the
pixels in an image are stored in files on a disk which need to be read into memory so
TensorFlow may operate on them.
Loading images
TensorFlow is designed to make it easy to load files from disk quickly. Loading images
is the same as loading any other large binary file until the contents are decoded. Loading
this example 3x3 pixel RGB JPG image is done using a similar process to loading any
other type of file.
# The match_filenames_once will accept a regex but there is no need for this example.
image_filename = "./images/chapter-05-object-recognition-and-classification/working-with-images/test-input"  # NOTE: filename truncated in the source; point it at a real JPEG.

filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once(image_filename))

image_reader = tf.WholeFileReader()
_, image_file = image_reader.read(filename_queue)
image = tf.image.decode_jpeg(image_file)
The image is assumed to be located in a directory relative to where this code
is run. An input producer (tf.train.string_input_producer) finds the files and adds them to a
queue for loading. Loading an image requires loading the entire file into memory
(tf.WholeFileReader), and once a file has been read (image_reader.read), the resulting image is
decoded (tf.image.decode_jpeg).
Now the image can be inspected; since there is only one file by that name, the queue will
always return the same image.
sess.run(image)
Inspect the output from loading an image and notice that it's a fairly simple rank 3 tensor.
The RGB values are found in nine rank 1 tensors. The format should be
familiar from earlier sections. The format of the image loaded in memory, once batched, is
[batch_size, image_height, image_width, channels].
The batch_size in this example is 1 because there are no batching operations happening.
Batching of input is covered in the TensorFlow documentation in great
detail. When dealing with images, note the amount of memory required to load the raw
images. If the images are too large or too many are loaded in a batch, the system may stop
responding.
Image Formats
It's important to consider aspects of images and how they affect a model. Consider what
would happen if a network is trained with input from a single frame of a RED Weapon
camera, which, at the time of writing, has an effective pixel count of 6144x3160. That'd
be 19,415,040 rank one tensors with 3 dimensions of color information.
Practically speaking, an input of that size will use a huge amount of system memory.
Training a CNN takes a large amount of time, and loading very large files slows it down
further. Even if the increase in time is acceptable, the size of a single image would be hard to
fit in memory on the majority of systems' GPUs.
A large input image is counterproductive to training most CNNs as well. The CNN is
attempting to find inherent attributes in an image, which are unique but generalized so that
they may be applied to other images with similar results. Using a large input floods the
network with irrelevant information which will keep the model from generalizing.
In the Stanford Dogs Dataset there are two extremely different images of the same dog
breed which should both match as a Pug. Although cute, these images are filled with
useless information which misleads a network during training. For example, the hat worn
by the Pug in n02110958_4030.jpg isn't a feature a CNN needs to learn in order to match
a Pug. Most Pugs prefer pirate hats, so the jester hat is training the network to match a hat
which most Pugs don't wear.
dogs. Setting those pieces to black would make them seem of similar importance to other
black colored items in the image. Setting the removed hat to have an alpha of 0 would
help in distinguishing its removal.
When working with JPEG images, don't manipulate them too much, because it'll leave
artifacts. Instead, plan to take raw images and export them to JPEG while doing any
manipulation needed. Try to manipulate images before loading them whenever possible to
save time in training.
PNG images work well if manipulation is required. The PNG format is lossless, so it'll keep
all the information from the original file (unless the images have been resized or downsampled).
The downside to PNGs is that the files are larger than their JPEG counterparts.
TFRecord
TensorFlow has a built-in file format designed to keep binary data and label (category
for training) data in the same file. The format is called TFRecord, and it requires a
preprocessing step to convert images to a TFRecord format before training. The largest
benefit is keeping each input image in the same file as the label associated with it.
Technically, TFRecord files are protobuf formatted files. They are great for use as a
preprocessed format because they aren't compressed and can be loaded into memory
quickly. In this example, an image is written to a new TFRecord formatted file and its
label is stored as well.
# Reuse the image from earlier and give it a fake label
image_label = b'\x01'  # Assume the label data is in a one-hot representation (00000001)

# Convert the tensor into bytes; notice that this will load the entire image file
image_loaded = sess.run(image)
image_bytes = image_loaded.tobytes()
image_height, image_width, image_channels = image_loaded.shape

# Export TFRecord
writer = tf.python_io.TFRecordWriter("./output/training-image.tfrecord")

# The width, height and image channels aren't stored in this Example to save space; they're not required.
example = tf.train.Example(features=tf.train.Features(feature={
    'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_label])),
    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))
}))

# This will save the serialized example to the TFRecord file
writer.write(example.SerializeToString())
writer.close()
The label is in a format known as one-hot encoding, which is a common way to work
with label data for categorization of multi-class data. The Stanford Dogs Dataset is being
treated as multi-class data because each dog is being categorized as a single breed and not
a mix of breeds. In the real world, a multilabel solution would work well to predict dog
breeds because it'd be capable of matching a dog with multiple breeds.
In the example code, the image is loaded into memory and converted into an array of
bytes. The bytes are then added to the tf.train.Example, which is serialized with
SerializeToString before being stored to disk. Serialization is a way of converting the in-memory
object into a format safe to be transferred to a file. The serialized example is now saved in
a format which can be loaded and deserialized back to the example format saved here.
Now that the image is saved as a TFRecord, it can be loaded again, but this time from the
TFRecord file. This is the loading that would be required in a training step to load the image
and label for training. This saves time over loading the input image and its corresponding
label separately.
# Load TFRecord
tf_record_filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once("./output/training-image.tfrecord"))

# Notice the different record reader; this one is designed to work with TFRecord files which may
# have more than one example in them.
tf_record_reader = tf.TFRecordReader()
_, tf_record_serialized = tf_record_reader.read(tf_record_filename_queue)

# The label and image are stored as bytes but could be stored as int64 or float64 values in a
# serialized tf.Example protobuf.
tf_record_features = tf.parse_single_example(
    tf_record_serialized,
    features={
        'label': tf.FixedLenFeature([], tf.string),
        'image': tf.FixedLenFeature([], tf.string),
    })

# Using tf.uint8 because all of the channel information is between 0-255
tf_record_image = tf.decode_raw(
    tf_record_features['image'], tf.uint8)

# Reshape the image to look like the image saved (not required). Use real values for the
# height, width and channels of the image because they're required to reshape the input.
tf_record_image = tf.reshape(
    tf_record_image,
    [image_height, image_width, image_channels])

tf_record_label = tf.cast(tf_record_features['label'], tf.string)
At first, the file is loaded in the same way as any other file. The main difference is that
the file is then read using a TFRecordReader. Instead of decoding the image, the TFRecord is
parsed with tf.parse_single_example, and then the image is read as raw bytes (tf.decode_raw).
After the file is loaded, it is reshaped (tf.reshape) in order to keep it in the same layout as
tf.nn.conv2d expects: [image_height, image_width, image_channels]. It'd be safe to expand the
dimensions (tf.expand_dims) in order to add the batch_size dimension to the input_batch.
In this case a single image is in the TFRecord, but these record files support multiple
examples being written to them. It'd be safe to have a single TFRecord file which stores
an entire training set, but splitting up the files doesn't hurt.
The following code is useful to check that the image saved to disk is the same as the
image which was loaded from TensorFlow.
sess.run(tf.equal(image, tf_record_image))
All of the attributes of the original image and the image loaded from the TFRecord file
are the same. To be sure, load the label from the TFRecord file and check that it is the
same as the one saved earlier.
# Check that the label is still 0b00000001.
sess.run(tf_record_label)
Creating a file that stores both the raw image data and the expected output label
saves complexity during training. It's not required to use TFRecord files, but it's highly
recommended when working with images. If it doesn't work well for a workflow, it's still
recommended to preprocess images and save them before training. Manipulating an image
each time it's loaded is not recommended.
Image Manipulation
CNNs work well when they're given a large amount of diverse, quality training data.
Images capture complex scenes in a way which visually communicates an intended
subject. In the Stanford Dogs Dataset, it's important that the images visually highlight the
importance of dogs in the picture. A picture with a dog clearly visible in the center is
considered more valuable than one with a dog in the background.
Not all datasets have the most valuable images. The following are two images from the
Stanford Dogs Dataset which are supposed to highlight dog breeds. The image on the left,
n02113978_3480.jpg, highlights important attributes of a typical Mexican Hairless Dog, while
the image on the right, n02113978_1030.jpg, highlights the look of inebriated party goers
scaring a Mexican Hairless Dog. The image on the right is filled with
irrelevant information which may train a CNN to categorize party-goer faces instead of
Mexican Hairless Dog breeds. Images like this may still include an image of a dog and
could be manipulated to highlight the dog instead of the people.
Cropping
Cropping an image removes regions of the image without keeping any of their
information. Cropping is similar to tf.slice, where a section of a tensor is cut out of the
full tensor. Cropping an input image for a CNN can be useful if there is extra input along a
dimension which isn't required. For example, cropping dog pictures where the dog is in
the center of the image reduces the size of the input.
sess.run(tf.image.central_crop(image, 0.1))
The example code uses tf.image.central_crop to keep the central 10% of the image and
return it. This method always crops based on the center of the image.
Cropping is usually done in preprocessing, but it can be useful at training time when the
background is useful. When the background is useful, cropping can be done while
randomizing the center offset of where the crop begins.
# This crop method only works on real value input.
real_image = sess.run(image)
bounding_crop = tf.image.crop_to_bounding_box(
real_image, offset_height=0, offset_width=0, target_height=2, target_width=1)
sess.run(bounding_crop)
The example code uses tf.image.crop_to_bounding_box in order to crop the image starting at
the upper left pixel located at (0, 0). Currently, the function only works with a tensor
which has a defined shape, so an input image needs to be executed on the graph first.
Padding
Padding fills an image with zeros in order to make it the same size as an expected image. This
can be accomplished using tf.pad, but TensorFlow has another function useful for resizing
images which are too large or too small. The method will pad an image which is too small
by including zeros along the edges of the image. Often, this method is used to resize small
images, because any other method of resizing would distort the image.
# This padding method only works on real value input.
real_image = sess.run(image)
pad = tf.image.pad_to_bounding_box(
real_image, offset_height=0, offset_width=0, target_height=4, target_width=4)
sess.run(pad)
This example code increases the image's height by one pixel and its width by one pixel as
well. The new pixels are all set to 0. Padding in this manner is useful for scaling up an
image which is too small. This can happen if there are images in the training set with a
mix of aspect ratios. TensorFlow has a useful shortcut, combining pad and crop, for
resizing images which don't match the same aspect ratio.
# This padding method only works on real value input.
real_image = sess.run(image)
crop_or_pad = tf.image.resize_image_with_crop_or_pad(
real_image, target_height=2, target_width=5)
sess.run(crop_or_pad)
The real_image has been reduced in height to be 2 pixels tall and the width has been
increased by padding the image with zeros. This function works based on the center of the
image input.
Flipping
Flipping an image is exactly what it sounds like: each pixel's location is reversed
horizontally or vertically. Technically speaking, flopping is the term used when flipping an
image vertically. Terms aside, flipping images is useful with TensorFlow to give different
perspectives of the same image for training. For example, a picture of an Australian
Shepherd with a crooked left ear could be flipped in order to allow matching of crooked
right ears.
TensorFlow has functions to flip images vertically, horizontally, or randomly.
The ability to randomly flip an image is a useful method to keep a model from overfitting
to flipped versions of images.
top_left_pixels = tf.slice(image, [0, 0, 0], [2, 2, 3])
flip_horizon = tf.image.flip_left_right(top_left_pixels)
flip_vertical = tf.image.flip_up_down(flip_horizon)
sess.run([top_left_pixels, flip_vertical])
This example code flips a subset of the image horizontally and then vertically. The
subset is used with tf.slice because flipping the full original image returns the same image
(for this example only). The subset of pixels illustrates the change which occurs when an
image is flipped. tf.image.flip_left_right and tf.image.flip_up_down both operate on tensors
which are not limited to images. These will flip an image a single time; randomly flipping
an image is done using a separate set of functions.
top_left_pixels = tf.slice(image, [0, 0, 0], [2, 2, 3])
random_flip_horizon = tf.image.random_flip_left_right(top_left_pixels)
random_flip_vertical = tf.image.random_flip_up_down(random_flip_horizon)
sess.run(random_flip_vertical)
This example does the same logic as the example before, except that the output is
random. Every time this runs, a different output is expected. There is a parameter named
seed which may be used to control how random the flipping is.
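For example, a minimal sketch of the brightness adjustment discussed next (the pixel values here are illustrative):
example_red_pixel = tf.constant([254., 2., 15.])
adjust_brightness = tf.image.adjust_brightness(example_red_pixel, 0.2)
sess.run(adjust_brightness)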
This example brightens a single pixel, which is primarily red, with a delta of 0.2.
Unfortunately, in the current version of TensorFlow (0.9), this method doesn't work well
with a tf.uint8 input. It's best to avoid using it when possible and preprocess brightness
changes.
adjust_contrast = tf.image.adjust_contrast(image, -.5)
sess.run(tf.slice(adjust_contrast, [1, 0, 0], [1, 3, 3]))
The example code changes the contrast by -0.5, which makes the new version of the
image fairly unrecognizable. Adjusting contrast is best done in small increments to keep
from blowing out an image. Blowing out an image means the same thing as saturating a
neuron: it has reached its maximum value and can't be recovered. With contrast changes, an
image can become completely white or completely black from the same adjustment.
The tf.slice operation is for brevity, highlighting one of the pixels which has changed. It
is not required when running this operation.
adjust_hue = tf.image.adjust_hue(image, 0.7)
sess.run(tf.slice(adjust_hue, [1, 0, 0], [1, 3, 3]))
The example code adjusts the hue found in the image to make it more colorful. The
adjustment accepts a delta parameter which controls the amount of hue to adjust in the
image.
adjust_saturation = tf.image.adjust_saturation(image, 0.4)
sess.run(tf.slice(adjust_saturation, [1, 0, 0], [1, 3, 3]))
Colors
CNNs are commonly trained using images with a single color. When an image has a
single color, it is said to use a grayscale colorspace, meaning it uses a single channel of
colors. For most computer vision related tasks, using grayscale is reasonable because the
shape of an image can be seen without all the colors. The reduction in colors equates to an
image that is quicker to train on. Instead of a 3-component rank 1 tensor to describe each color
found with RGB, a grayscale image requires a single-component rank 1 tensor to describe
the amount of gray found in the image.
Although grayscale has benefits, it's important to consider applications which require a
distinction based on color. Color in images is challenging to work with in most computer
vision tasks because it isn't easy to mathematically define the similarity of two RGB colors. In
order to use colors in CNN training, it's useful to convert the colorspace the image is
natively in.
Grayscale
Grayscale has a single component, and it has the same range of color as RGB: [0, 255].
gray = tf.image.rgb_to_grayscale(image)
sess.run(tf.slice(gray, [0, 0, 0], [1, 3, 1]))
This example converted the RGB image into grayscale. The tf.slice operation took the
top row of pixels out to investigate how their color has changed. The grayscale conversion
is done by averaging the color values for a pixel (in practice, a luminance-weighted
average) and setting the amount of grayscale to be the result.
HSV
Hue, saturation and value are what make up HSV colorspace. This space is represented
with a 3 component rank 1 tensor similar to RGB. HSV is not similar to RGB in what it
measures, its measuring attributes of an image which are closer to human perception of
color than RGB. It is sometimes called HSB, where the B stands for brightness.
hsv = tf.image.rgb_to_hsv(tf.image.convert_image_dtype(image, tf.float32))
sess.run(tf.slice(hsv, [0, 0, 0], [3, 3, 3]))
RGB
RGB is the colorspace which has been used in all the example code so far. It's broken
up into a three-component rank 1 tensor which includes the amount of red, green and blue.
Most images are already in RGB, but TensorFlow has built-in functions in case the images
are in another colorspace.
rgb_hsv = tf.image.hsv_to_rgb(hsv)
rgb_grayscale = tf.image.grayscale_to_rgb(gray)
The example code is straightforward, except that the conversion from grayscale to RGB
doesn't add any information. RGB expects three colors while grayscale only has one. When
the conversion occurs, each RGB channel is filled with the same value found in the
grayscale pixel.
Lab
Lab is not a colorspace which TensorFlow supports natively. It's a useful
colorspace because it can map to a larger number of perceivable colors than RGB.
Although TensorFlow doesn't support it natively, it is a colorspace which is often used
in professional settings. Another Python library, python-colormath, has support for Lab
conversion as well as other colorspaces not described here.
The largest benefit of using a Lab colorspace is that it maps closer to a human's perception of
the difference in colors than RGB or HSV. The Euclidean distance between two colors in a
Lab colorspace is somewhat representative of how different the colors look to a human.
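As a sketch with the python-colormath library (assuming it is installed via pip install colormath):
from colormath.color_objects import LabColor, sRGBColor
from colormath.color_conversions import convert_color

# Convert an sRGB color to Lab; the Euclidean distance between two LabColor
# values then approximates how different the colors appear to a human.
lab = convert_color(sRGBColor(0.8, 0.2, 0.2), LabColor)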
Casting Images
In these examples, tf.to_float is often used in order to illustrate changing an image's
type to another format. For these examples this works OK, but TensorFlow has a built-in
function to properly scale values as they change types. tf.image.convert_image_dtype(image,
dtype, saturate=False) is a useful shortcut to change the type of an image from tf.uint8 to
tf.float32.
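For example:
# Scales the discrete [0, 255] uint8 range into continuous [0, 1) floats.
float_image = tf.image.convert_image_dtype(image, tf.float32)
# saturate=True clips values that would otherwise overflow when casting back.
int_image = tf.image.convert_image_dtype(float_image, tf.uint8, saturate=True)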
CNN Implementation
Object recognition and categorization using TensorFlow requires a basic understanding
of convolutions (for CNNs), common layers (non-linearity, pooling, fc), image loading,
image manipulation and colorspaces. With these areas covered, it's possible to build a
CNN model for image recognition and classification using TensorFlow. In this case, the
model is trained on a dataset provided by Stanford which includes pictures of dogs and their
corresponding breed. The network needs to train on these pictures, then be judged on how
well it can guess a dog's breed based on a picture.
The network architecture follows a simplified version of Alex Krizhevsky's AlexNet
without all of AlexNet's layers. This architecture was described earlier in the chapter as
the network which won the ILSVRC'12 challenge. The network uses layers and
techniques familiar from this chapter which are similar to the TensorFlow provided tutorial
on CNNs.
The network described in this section includes the output TensorShape after each layer.
The layers are read from left to right and top to bottom, where related layers are grouped
together. As the input progresses further into the network, its height and width are reduced
while its depth is increased. The increase in depth reduces the computation required to use
the network.
The following is an example of how the archive is organized. The glob module allows directory
listing, which shows the structure of the files which exist in the dataset. The eight-digit number is
tied to the WordNet ID of each category used in ImageNet. ImageNet has a browser for
image details which accepts the WordNet ID; for example, the Chihuahua example can be
accessed via http://www.image-net.org/synset?wnid=n02085620.
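A sketch of that listing, assuming the archive was extracted into ./imagenet-dogs:
import glob

# Each directory is named after the WordNet ID and breed, e.g. n02085620-Chihuahua.
image_filenames = glob.glob("./imagenet-dogs/n02*/*.jpg")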
from itertools import groupby
from collections import defaultdict

training_dataset = defaultdict(list)
testing_dataset = defaultdict(list)

# Split up the filename into its breed and corresponding filename. The breed is found by taking the directory name.
image_filename_with_breed = map(lambda filename: (filename.split("/")[2], filename), image_filenames)

# Group each image by the breed which is the 0th element in the tuple returned above
for dog_breed, breed_images in groupby(image_filename_with_breed, lambda x: x[0]):
    # Enumerate each breed's image and send ~20% of the images to a testing set
    for i, breed_image in enumerate(breed_images):
        if i % 5 == 0:
            testing_dataset[dog_breed].append(breed_image[1])
        else:
            training_dataset[dog_breed].append(breed_image[1])

    # Check that each breed includes at least 18% of the images for testing
    breed_training_count = len(training_dataset[dog_breed])
    breed_testing_count = len(testing_dataset[dog_breed])
    assert round(breed_testing_count / (breed_training_count + breed_testing_count), 2) > 0.18, \
        "Not enough testing images."
This example code organizes the directory of images (./imagenet-dogs/n02085620-Chihuahua/n02085620_10131.jpg, ...) into two dictionaries keyed by breed, each including
all the images for that breed. Now each dictionary would include Chihuahua images in the
following format:
training_dataset["n02085620-Chihuahua"] = ["n02085620_10131.jpg", ...]
Organizing the breeds into these dictionaries simplifies the process of selecting each
type of image and categorizing it. During preprocessing, all the image breeds can be
iterated over and their images opened based on the filenames in the list.
def write_records_file(dataset, record_location):
    """
    Fill a TFRecords file with the images found in `dataset` and include their category.

    Parameters
    ----------
    dataset : dict(list)
        Dictionary with each key being a label for the list of image filenames of its value.
    record_location : str
        Location to store the TFRecord output.
    """
    writer = None

    # Enumerating the dataset because the current index is used to breakup the files if they get over 100
    # images to avoid a slowdown in writing.
    current_index = 0
    for breed, images_filenames in dataset.items():
        for image_filename in images_filenames:
            if current_index % 100 == 0:
                if writer:
                    writer.close()

                record_filename = "{record_location}-{current_index}.tfrecords".format(
                    record_location=record_location,
                    current_index=current_index)

                writer = tf.python_io.TFRecordWriter(record_filename)
            current_index += 1

            image_file = tf.read_file(image_filename)

            # In ImageNet dogs, there are a few images which TensorFlow doesn't recognize as JPEGs. This
            # try/catch will ignore those images.
            try:
                image = tf.image.decode_jpeg(image_file)
            except:
                print(image_filename)
                continue

            # Converting to grayscale saves processing and memory but isn't required.
            grayscale_image = tf.image.rgb_to_grayscale(image)
            resized_image = tf.image.resize_images(grayscale_image, 250, 151)

            # tf.cast is used here because the resized images are floats but haven't been converted into
            # image floats where an RGB value is between [0,1).
            image_bytes = sess.run(tf.cast(resized_image, tf.uint8)).tobytes()

            # Instead of using the label as a string, it'd be more efficient to turn it into either an
            # integer index or a one-hot encoded rank one tensor.
            # https://en.wikipedia.org/wiki/One-hot
            image_label = breed.encode("utf-8")

            example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_label])),
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))
            }))

            writer.write(example.SerializeToString())
    writer.close()

write_records_file(testing_dataset, "./output/testing-images/testing-image")
write_records_file(training_dataset, "./output/training-images/training-image")
The example code opens each image, converts it to grayscale, resizes it and then
adds it to a TFRecord file. The logic isn't different from earlier examples, except that the
operation tf.image.resize_images is used. The resizing operation will scale every image to be
the same size, even if it distorts the image. For example, if an image in portrait orientation
and an image in landscape orientation were both resized with this code, then the output of
the landscape image would become distorted. These distortions are caused because
tf.image.resize_images doesn't take the aspect ratio of an image into account. To properly
resize a set of images, cropping or padding is a preferred method because it preserves the
aspect ratio and avoids distortions.
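For instance, tf.image.resize_image_with_crop_or_pad centrally crops or zero-pads to the target size instead of scaling:
# Crops or pads around the center, keeping the original aspect ratio intact.
resized = tf.image.resize_image_with_crop_or_pad(image, 250, 151)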
Load Images
Once the testing and training datasets have been transformed to TFRecord format, they
can be read as TFRecords instead of as JPEG images. The goal is to load the images a few
at a time with their corresponding labels.
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once("./output/training-images/*.tfrecords"))

reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized,
    features={
        'label': tf.FixedLenFeature([], tf.string),
        'image': tf.FixedLenFeature([], tf.string),
    })

record_image = tf.decode_raw(features['image'], tf.uint8)

# Changing the image into this shape helps train and visualize the output by converting it to
# be organized like an image.
image = tf.reshape(record_image, [250, 151, 1])

label = tf.cast(features['label'], tf.string)

min_after_dequeue = 10
batch_size = 3
capacity = min_after_dequeue + 3 * batch_size
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label], batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue)
This example code loads training images by matching all the TFRecord files found in
the training directory. Each TFRecord includes multiple images, but
tf.parse_single_example will take a single example out of the file. The batching operation
discussed earlier is used to train multiple images simultaneously. Batching multiple
images is useful because these operations are designed to work with multiple images the
same as with a single image. The primary requirement is that the system has enough
memory to work with them all.
With the images available in memory, the next step is to create the model used for
training and testing.
Model
The model used is similar to the MNIST convolution example which is often used in
tutorials describing convolutional neural networks in TensorFlow. The architecture of this
model is simple, yet it performs well for illustrating different techniques used in image
classification and recognition. An advanced model may borrow more from Alex
Krizhevsky's AlexNet design, which includes more convolution layers.
# Converting the images to a float of [0,1) to match the expected input to convolution2d
float_image_batch = tf.image.convert_image_dtype(image_batch, tf.float32)

conv2d_layer_one = tf.contrib.layers.convolution2d(
    float_image_batch,
    num_output_channels=32,    # The number of filters to generate
    kernel_size=(5,5),         # It's only the filter height and width.
    activation_fn=tf.nn.relu,
    weight_init=tf.random_normal,
    stride=(2, 2),
    trainable=True)

pool_layer_one = tf.nn.max_pool(conv2d_layer_one,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

# Note, the first and last dimension of the convolution output hasn't changed but the
# middle two dimensions have.
conv2d_layer_one.get_shape(), pool_layer_one.get_shape()
The first layer in the model is created using the shortcut tf.contrib.layers.convolution2d.
It's important to note that weight_init is set to be a random normal, meaning that the
first set of filters is filled with random numbers following a normal distribution (this
parameter is renamed in TensorFlow 0.9 to weights_initializer). The filters are set as
trainable so that as the network is fed information, these weights are adjusted to improve
the accuracy of the model.
After a convolution is applied to the images, the output is downsized using a max_pool
operation. After the operation, the output shape of the convolution is reduced by half due
to the ksize and strides used in the pooling. The reduction didn't change the number of
filters (output channels) or the size of the image batch. The components that were reduced
dealt with the height and width of the image (filter).
conv2d_layer_two = tf.contrib.layers.convolution2d(
    pool_layer_one,
    num_output_channels=64,    # More output channels means an increase in the number of filters
    kernel_size=(5,5),
    activation_fn=tf.nn.relu,
    weight_init=tf.random_normal,
    stride=(1, 1),
    trainable=True)

pool_layer_two = tf.nn.max_pool(conv2d_layer_two,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='SAME')

conv2d_layer_two.get_shape(), pool_layer_two.get_shape()
The second layer changes little from the first except for the depth of the filters. The number
of filters is now doubled while again reducing the size of the height and width of the
image. The multiple convolution and pool layers continue to reduce the height and
width of the input while adding further depth.
At this point, further convolution and pool steps could be taken. In many architectures
there are over 5 different convolution and pooling layers. The most advanced architectures
take longer to train and debug but they can match more sophisticated patterns. In this
example, the two convolution and pooling layers are enough to illustrate the mechanics at
work.
The tensor being operated on is still a fairly complex tensor; the next step is to fully
connect every point in each image with an output neuron. Since this example uses
softmax later, the input to the fully connected layer needs to be changed into a rank two
tensor. The tensor's first dimension will be used to separate each image, while the second
dimension is a rank one tensor holding the flattened values of each input image.
flattened_layer_two = tf.reshape(
    pool_layer_two,
    [
        batch_size,  # Each image in the image_batch
        -1           # Every other dimension of the input
    ])

flattened_layer_two.get_shape()
tf.reshape has a special value, -1, that can be used to signify "use all the dimensions
remaining". In this example code, the -1 is used to reshape the last pooling layer into a giant
rank one tensor. With the pooling layer flattened out, it can be combined with two fully
connected layers which associate the current network state to the breed of dog predicted.
# The weight_init parameter can also accept a callable, a lambda is used here returning a truncated normal
# with a stddev specified.
hidden_layer_three = tf.contrib.layers.fully_connected(
    flattened_layer_two,
    512,
    weight_init=lambda i, dtype: tf.truncated_normal([38912, 512], stddev=0.1),
    activation_fn=tf.nn.relu
)

# Dropout some of the neurons, reducing their importance in the model
hidden_layer_three = tf.nn.dropout(hidden_layer_three, 0.1)

# The output of this are all the connections between the previous layers and the 120 different dog breeds
# available to train on.
final_fully_connected = tf.contrib.layers.fully_connected(
    hidden_layer_three,
    120,  # Number of dog breeds in the ImageNet Dogs dataset
    weight_init=lambda i, dtype: tf.truncated_normal([512, 120], stddev=0.1)
)
This example code creates the final fully connected layer of the network, where every
pixel is associated with every breed of dog. Every step of this network has been reducing
the size of the input images by converting them into filters which are then matched with a
breed of dog (label). This technique reduces the processing power required to train or
test a network while generalizing the output.
Training
Once a model is ready to be trained, the last steps follow the same process discussed in
earlier chapters of this book. The model's loss is computed based on how accurately it
guessed the correct labels in the training data, which feeds into a training optimizer that
updates the weights of each layer. This process continues one iteration at a time while
attempting to increase the accuracy at each step.
An important note related to this model: during training, most classification functions
(tf.nn.softmax) require numerical labels. This was highlighted in the section describing
loading the images from TFRecords. At this point, each label is a string similar to
n02085620-Chihuahua. Instead of using tf.nn.softmax on this string, each label needs to be
converted to a unique number. Converting these labels into an integer
representation should be done in preprocessing.
For this dataset, each label will be converted into an integer which represents the index
of its name in a list of all the dog breeds. There are many ways to accomplish
this task; for this example, a new TensorFlow utility operation will be used (tf.map_fn).
import glob

# Find every directory name in the imagenet-dogs directory (n02085620-Chihuahua, ...)
labels = list(map(lambda c: c.split("/")[-1], glob.glob("./imagenet-dogs/*")))

# Match every label from label_batch and return the index where they exist in the list of classes
train_labels = tf.map_fn(lambda l: tf.where(tf.equal(labels, l))[0, 0:1][0], label_batch, dtype=tf.int64)
This example code uses two different forms of a map operation. The first form of map is
used to create a list including only the dog breed name based on a list of directories. The
second form of map is tf.map_fn, which is a TensorFlow operation that will map a function
over a tensor on the graph. tf.map_fn is used to generate a rank one tensor including
only the integer indexes where each label is located in the list of all the class labels. These
unique integers can now be used with tf.nn.softmax to classify output predictions.
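As an illustrative sketch (the loss setup itself is not shown in this excerpt), these integer labels could feed a sparse cross-entropy loss over the final_fully_connected logits from the model above:
# Hypothetical training setup: sparse cross-entropy over the 120 breed logits.
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    final_fully_connected, train_labels))
train_op = tf.train.AdamOptimizer(0.0001).minimize(loss)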
A single feature map from the first convolution layer would highlight the randomness
in the output at this early stage of training.
Debugging a CNN requires a familiarity with these filters. Currently there isn't
any built-in support in TensorBoard to display filters or feature maps. A simple view of the
filters can be had by using a tf.image_summary operation on the filters being trained and the
feature maps generated. Adding an image summary output to a graph gives a good
overview of the filters being used and the feature maps generated by applying them to the
input images.
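As a sketch, assuming the conv2d_layer_one tensor from the model above, a single feature map can be written as an image summary like this:
# Take only the first output channel so the summary is a grayscale image.
feature_map = tf.slice(conv2d_layer_one, [0, 0, 0, 0], [-1, -1, -1, 1])
tf.image_summary("conv1/feature_map_0", feature_map)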
A Jupyter notebook extension worth mentioning is TensorDebugger, which is in an
early state of development. The extension has a mode capable of viewing changes in
filters as an animated GIF over iterations.
Conclusion
Convolutional Neural Networks are a useful network architecture that can be implemented
with a minimal amount of code in TensorFlow. While they're designed with images in
mind, a CNN is not limited to image input. Convolutions are used in multiple industries,
from music to medicine, and a CNN can be applied in a similar manner. Currently,
TensorFlow is designed for two-dimensional convolutions, but it's still possible to work
with higher-dimensionality input using TensorFlow.
While a CNN could theoretically work with natural language data (text), it isn't
designed for this type of input. Text input is often stored in a SparseTensor, where the
majority of the input is 0. CNNs are designed to work with dense input where each value
is important and the majority of the input is not 0. Working with text data is a challenge
which is addressed in the next chapter on Recurrent Neural Networks and Natural
Language Processing.
Various variants of RNNs have been around since the 1980s, but they were not widely used
until recently because of insufficient computational power and difficulties in training. Since
the invention of architectures like LSTM, we have seen very powerful applications
of RNNs. They work well for sequential tasks in many domains, for example speech recognition.
The state of an RNN depends on the current input and the previous state, which in turn
depends on the input and state before that. Therefore, the state has indirect access to all
previous inputs of the sequence and can be interpreted as a working memory.
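Written as a formula, with state s_t, input x_t and a learned transition function f, the recurrence is s_t = f(s_{t-1}, x_t), so s_t transitively depends on every earlier input x_1, ..., x_t.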
Let's make an analogy to computer programs. Say we want to recognize the letters from
an image of handwritten text. We could try to solve this with a computer program in
Python using variables, loops and conditionals. Feel free to try this, but I think it would be
very hard to get it to work robustly.
The good news is that we can train an RNN from example data instead. Just as we would
store intermediate information in variables, the RNN learns to store intermediate
information in its state. Similarly, the weight matrix of an RNN defines the
program it executes, deciding what inputs to store in the hidden activations and how to
combine activations into new activations and outputs.
In fact, RNNs with sigmoid activations were proven to be Turing-complete by Schäfer
and Zimmermann in 2006. Given the right weights, RNNs can thus compute any
computable program. This is a theoretical property, since there is no method to find the
perfect weights for a task. However, we can already get very good results using gradient
descent, as described in the next section.
Before we look into optimizing RNNs, you might ask why we even need RNNs if we
can write normal programs instead. Well, the space of possible weight matrices is much
easier to explore automatically than the space of possible programs.
The trick for optimizing RNNs is that we can unfold them in time (also referred to as
unrolling) to optimize them the same way we optimize feed-forward networks. Let's say we
want to operate on sequences of length ten. We can then copy the hidden neurons ten times,
spanning their connections from one copy to the next one. By doing this, we get rid of
recurrent connections without changing the semantics of the computation. This yields a
feed-forward network, with the corresponding weights between time steps tied to the
same strengths. Unfolding an RNN in time does not change the computation; it is just
another view.
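A minimal NumPy sketch of this view, assuming a vanilla RNN with a tanh activation and weights W_x, W_h and b shared across all time steps:
import numpy as np

def unrolled_forward(x_seq, W_x, W_h, b):
    """Run the unfolded RNN: one copy of the hidden layer per time step,
    every copy applying the exact same (tied) weights."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in x_seq:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states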
We can now apply standard backpropagation through this unfolded RNN in order to
compute the gradient of the error with respect to the weights. This algorithm is called
Back-Propagation Through Time (BPTT). It will return a derivative for each weight in
time, including those that are tied together. To keep tied weights at the same value, we
handle them as tied weights are normally handled, that is, by summing up their gradients.
Note that this equals the way convolutional filters are handled in convnets.
Sequence labelling is the case you probably thought of during the earlier sections. We
have sequences as input and train the network to produce the right output for each frame.
We are basically mapping from one sequence to another sequence of the same length.
In the sequence classification setting, we have sequential inputs that each have a class.
We can train RNNs in this setting by only selecting the output at the last time frame.
During optimization, the errors will flow back through all time steps to update the weights
in order to collect and integrate useful information at each time step.
Sequence generation is the opposite case where we have a single starting point, for
example a class label, that we want to generate sequences from. To generate sequences,
we feed the output back into the network as next input. This makes sense since the actual
output is often different from the neural network output. For example, the network might
output a distribution over all classes but we only choose the most likely one.
In both sequence classification and sequence generation, we can see the single vector as
a dense representation of information. In the first case, we encode the sequence into a dense
vector to predict a class. In the second case, we decode a dense vector back into a
sequence.
We can combine these approaches for sequence translation where we first encode a
sequence of one domain, for example English. We then decode the last hidden activation
back into a sequence of another domain, for example French. This works with a single
RNN but when input and output are conceptually different, it can make sense to use two
different RNNs and initialize the second one with the last activation of the first one. When
using a single network, we need to pass a special token as input after the sequence so that
the network can learn when it should stop encoding and start decoding.
Most often, we will use a network architecture called RNNs with output projections.
This is an RNN with fully-connected hidden units and inputs and outputs mapping to or
from them, respectively. Another way to look at this is that we have an RNN where all
hidden units are outputs and another feed-forward layer stacked on top. You will see that
this is how we implement RNNs in TensorFlow, because it is both convenient and allows
us to specify different activation functions for the hidden and output units.
Now that we have defined the RNN and unfolded it in time, we can just load some data
and train the network using one of TensorFlow's optimizers, for example
tf.train.RMSPropOptimizer or tf.train.AdamOptimizer. We will see examples of this in the later
sections of this chapter, where we approach practical problems with the help of RNNs.
The reason why it is difficult for an RNN to learn such long-term dependencies lies in
how errors are propagated through the network during optimization. Remember that we
propagate the errors through the unfolded RNN in order to compute the gradients. For
long sequences, the unfolded network gets very deep and has many layers. At each layer,
backpropagation scales the error from above in the network by the local derivatives.
If most of the local derivatives are much smaller than one, the gradient gets
scaled down at every layer, causing it to shrink exponentially and eventually vanish.
Analogously, many local derivatives greater than one cause the gradient to explode.
Let's compute the gradient of this example network with just one hidden neuron per
layer in order to get a better understanding of this problem. With state
$h_t = \sigma(W h_{t-1} + V x_t)$, the local derivatives per layer multiply up so that the
error propagated from time step $t$ back to step $k$ scales as
$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} W^T \, \mathrm{diag}\big(\sigma'(W h_{i-1} + V x_i)\big)$.
As you can see, the error term contains the transposed weight matrix several times as a
multiplicative term. In our toy network, the weight matrix contains just one element and
it's easy to see that the product gets close to zero or infinity when most weights are smaller
or larger than one. In a larger network with real weight matrices, the same problem occurs
when the eigenvalues of the weight matrix are smaller or larger than one.
This problem actually exists in any deep network, not just recurrent ones. However, in
RNNs the connections between time steps are tied to each other. Therefore, the local
derivatives of such weights are either all less or all greater than one, and the gradient is
always scaled in the same direction for each weight in the original (not unfolded) RNN.
The problem of vanishing and exploding gradients is therefore more prominent in RNNs
than in feed-forward networks.
There are a couple of problems with very small or very large gradients. With elements
of the gradient close to zero or infinity, learning stagnates or diverges, respectively.
Moreover, we are optimizing numerically, and floating point precision comes into play,
distorting the gradient. This problem, also known as the fundamental problem of deep
learning, has been studied and approached by many researchers in recent years. The most
popular solution is an RNN architecture called Long Short-Term Memory (LSTM) that we
will look at in the next section.
While the purpose of the internal state is to deliver errors over many time steps, the
LSTM architecture leaves learning to the surrounding gates, which have non-linear, usually
sigmoid, activation functions. In the original LSTM cell, there are two gates: one learns to
scale the incoming activation and one learns to scale the outgoing activation. The cell can
thus learn when to incorporate or ignore new inputs and when to release the feature it
represents to other cells. The input to a cell is fed into all gates using individual
weights.
We also refer to recurrent networks as layers because we can use them as part of
larger architectures. For example, we could first feed the time steps through several
convolution and pooling layers, then process these outputs with an LSTM layer and add a
softmax layer on top of the LSTM activation at the last time step.
TensorFlow provides such an LSTM network with the LSTMCell class that can be used as
a drop-in replacement for BasicRNNCell but also provides some additional switches. Despite
its name, this class represents a whole LSTM layer. In the later sections we will see how to
connect LSTM layers to other networks in order to form larger architectures.
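A minimal sketch, assuming a data tensor of shape [batch_size, max_time, features] (the module path may differ between TensorFlow versions):
cell = tf.nn.rnn_cell.LSTMCell(300)
outputs, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)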
Architecture Variations
A popular extension to LSTM is to add a forget gate scaling the internal recurrent
connection, allowing the network to learn to forget (Gers, Felix A., Jürgen Schmidhuber,
and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural
computation 12.10 (2000): 2451-2471.). The derivative of the internal recurrent
connection is now the activation of the forget gate and can differ from the value of one.
The network can still learn to leave the forget gate closed as long as remembering the cell
context is important.
It is important to initialize the forget gate to a value of one so that the cell starts in a
remembering state. Forget gates are the default in almost all implementations nowadays. In
TensorFlow, we can initialize the bias values of the forget gates by specifying the
forget_bias parameter to the LSTM layer. The default value is one, and usually it's best to
leave it that way.
Another extension is so-called peephole connections, which allow the gates to look at
the cell state (Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning
precise timing with LSTM recurrent networks. The Journal of Machine Learning
Research 3 (2003): 115-143.). The authors claim that peephole connections are beneficial
when the task involves precise timing and intervals. TensorFlow's LSTM layer supports
peephole connections. They can be activated by passing the use_peepholes=True flag to the
LSTM layer.
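Both switches together, as a sketch:
cell = tf.nn.rnn_cell.LSTMCell(300, use_peepholes=True, forget_bias=1.0)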
Based on the idea of LSTM, an alternative memory cell called the Gated Recurrent Unit
(GRU) was proposed in 2014 (Chung, Junyoung, et al. Empirical evaluation of
gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
(2014).). In contrast to LSTM, the GRU has a simpler architecture and requires less
computation while yielding very similar results. The GRU has no output gate and combines
the input and forget gates into a single update gate.
This update gate determines how much the internal state is blended with a candidate
activation. The candidate activation is computed from a fraction of the hidden state
determined by the so-called reset gate and the new input. The TensorFlow GRU layer is
called GRUCell and has no parameters other than the number of cells in the layer. For
further reading, we suggest the 2015 paper by Jozefowicz et al., who empirically explored
recurrent cell architectures (Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. An
empirical exploration of recurrent network architectures. Proceedings of the 32nd
International Conference on Machine Learning (ICML-15). 2015.).
So far we have looked at RNNs with fully connected hidden units. This is the most general
architecture, since the network can learn to set unneeded weights to zero during training.
However, it is common to stack two or more layers of fully-connected RNNs on top of
each other. This can still be seen as one RNN that has some structure in its connections.
Since information can only flow upward between layers, multi-layer RNNs have fewer
weights than a huge fully connected RNN and tend to learn more abstract features.
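A sketch of such a stack using MultiRNNCell (assuming the tf.nn.rnn_cell module of TensorFlow 0.9):
# Three GRU layers stacked; information flows upward between the layers.
stacked = tf.nn.rnn_cell.MultiRNNCell([tf.nn.rnn_cell.GRUCell(300)] * 3)
outputs, state = tf.nn.dynamic_rnn(stacked, data, dtype=tf.float32)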
In 2013, Mikolov et al. came up with a practical and efficient way to compute word
representations from context. The paper is: Mikolov, Tomas, et al. Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). Their
skip-gram model starts with random representations and has a simple classifier that tries to
predict a context word from the current word. The errors are propagated through both the
classifier weights and the word representations, and we adjust both to reduce the prediction
error. It has been found that training this model over a large corpus makes the
representation vectors approximate compressed co-occurrence vectors. We will now
implement the skip-gram model in TensorFlow.
class Wikipedia:

    def __init__(self, url, cache_dir, vocabulary_size=10000):
        pass

    def __iter__(self):
        """Iterate over pages represented as lists of word indices."""
        pass

    @property
    def vocabulary_size(self):
        pass

    def encode(self, word):
        """Get the vocabulary index of a string word."""
        pass

    def decode(self, index):
        """Get back the string word from a vocabulary index."""
        pass

    def _read_pages(self, url):
        """
        Extract plain words from a Wikipedia dump and store them to the pages
        file. Each page will be a line with words separated by spaces.
        """
        pass

    def _build_vocabulary(self, vocabulary_size):
        """
        Count words in the pages file and write a list of the most frequent
        words to the vocabulary file.
        """
        pass

    @classmethod
    def _tokenize(cls, page):
        pass
There are a couple of steps to perform in order to get the data into the right format. As
you might have seen earlier in this book, data collection and cleaning is both a demanding
and an important task. Ultimately, we would like to iterate over Wikipedia pages represented
as one-hot encoded words. We do this in multiple steps:
1. Download the dump and extract pages and their words.
2. Count words to form a vocabulary of the most common words.
3. Encode the extracted pages using the vocabulary.
The whole corpus does not fit into main memory easily, so we have to perform these
operations on data streams by reading the file line by line and writing the intermediate
results back to disk. This way, we have checkpoints between the steps so that we don't
have to start all over if something crashes. We use the following class to handle the
Wikipedia processing. In __init__() you can see the checkpointing logic using
file-existence checks.
def __init__(self, url, cache_dir, vocabulary_size=10000):
    self._cache_dir = os.path.expanduser(cache_dir)
    self._pages_path = os.path.join(self._cache_dir, 'pages.bz2')
    self._vocabulary_path = os.path.join(self._cache_dir, 'vocabulary.bz2')
    if not os.path.isfile(self._pages_path):
        print('Read pages')
        self._read_pages(url)
    if not os.path.isfile(self._vocabulary_path):
        print('Build vocabulary')
        self._build_vocabulary(vocabulary_size)
    with bz2.open(self._vocabulary_path, 'rt') as vocabulary:
        print('Read vocabulary')
        self._vocabulary = [x.strip() for x in vocabulary]
    self._indices = {x: i for i, x in enumerate(self._vocabulary)}

def __iter__(self):
    """Iterate over pages represented as lists of word indices."""
    with bz2.open(self._pages_path, 'rt') as pages:
        for page in pages:
            words = page.strip().split()
            words = [self.encode(x) for x in words]
            yield words

@property
def vocabulary_size(self):
    return len(self._vocabulary)

def encode(self, word):
    """Get the vocabulary index of a string word."""
    return self._indices.get(word, 0)

def decode(self, index):
    """Get back the string word from a vocabulary index."""
    return self._vocabulary[index]
As you may have noticed, we still have to implement two important functions of this class.
The first one, _read_pages(), will download the Wikipedia dump, which comes as a
compressed XML file, iterate over the pages and extract the plain words to get rid of any
formatting. To read the compressed file, we need the bz2 module that provides an open()
function which works similarly to its standard equivalent but takes care of compression and
decompression, even when streaming the file. To save some disk space, we will also use
this compression for the intermediate results. The regex used to extract words just captures
any sequence of consecutive letters and individual occurrences of some special characters.
import re
from lxml import etree

TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

def _read_pages(self, url):
    """
    Extract plain words from a Wikipedia dump and store them to the pages
    file. Each page will be a line with words separated by spaces.
    """
    wikipedia_path = download(url, self._cache_dir)
    with bz2.open(wikipedia_path) as wikipedia, \
            bz2.open(self._pages_path, 'wt') as pages:
        for _, element in etree.iterparse(wikipedia, tag='{*}page'):
            if element.find('./{*}redirect') is not None:
                continue
            page = element.findtext('./{*}revision/{*}text')
            words = self._tokenize(page)
            # Write one page per line, words separated by spaces.
            pages.write(' '.join(words) + '\n')
            # Free the parsed element to keep memory usage low.
            element.clear()
We need a vocabulary of words to use for the one-hot encoding. We can then encode
each word by its index in the vocabulary. To remove misspelled or very uncommon
words, the vocabulary only contains the vocabulary_size - 1 most common words and an
<unk> token that will be used for every word that is not in the vocabulary. This token will
also give us a word vector that we can use for unseen words later.
def _build_vocabulary(self, vocabulary_size):
    """
    Count words in the pages file and write a list of the most frequent
    words to the vocabulary file.
    """
    counter = collections.Counter()
    with bz2.open(self._pages_path, 'rt') as pages:
        for page in pages:
            words = page.strip().split()
            counter.update(words)
    # most_common() returns (word, count) pairs; keep the words and prepend <unk>.
    common = ['<unk>'] + [x[0] for x in counter.most_common(vocabulary_size - 1)]
    with bz2.open(self._vocabulary_path, 'wt') as vocabulary:
        for word in common:
            vocabulary.write(word + '\n')
Since we extracted the plain text and defined the encoding for the words, we can form
training examples from it on the fly. This is nice since storing the examples would require a
lot of storage space. Most of the time will be spent on training anyway, so this doesn't
impact performance much. We also want to group the resulting examples into batches
to train them more efficiently. We will be able to use very large batches with this model
because the classifier does not require a lot of memory.
So how do we form the training examples? Remember that the skip-gram model
predicts context words from current words. While iterating over the text, we create
training examples with the current word as data and its surrounding words as targets. For a
context size of R = 5, we would thus generate ten training examples per word, with the
five words to the left and right being the targets. However, one can argue that close
neighbors are more important to the semantic context than far neighbors. We thus create
fewer training examples with far context words by randomly choosing a context size in the
range [1, R] at each word.
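A sketch of such an example generator (the skipgrams name and signature are illustrative):
import random

def skipgrams(pages, max_context):
    """Form training pairs according to the skip-gram model."""
    for words in pages:
        for index, current in enumerate(words):
            # Randomly reduce the context size to favor close neighbors.
            context = random.randint(1, max_context)
            for target in words[max(0, index - context): index]:
                yield current, target
            for target in words[index + 1: index + context + 1]:
                yield current, target
The pairs can then be grouped into Numpy batches: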
def batched(iterator, batch_size):
    """Group a numerical stream into batches and yield them as Numpy arrays."""
    while True:
        data = np.zeros(batch_size)
        target = np.zeros(batch_size)
        for index in range(batch_size):
            data[index], target[index] = next(iterator)
        yield data, target
Model structure
Now that we have the Wikipedia corpus prepared, we can define a model to compute the
word embeddings.
class EmbeddingModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        self.embeddings
        self.cost
        self.optimize

    @lazy_property
    def embeddings(self):
        pass

    @lazy_property
    def optimize(self):
        pass

    @lazy_property
    def cost(self):
        pass
Each word starts off being represented by a random vector. From the intermediate
representation of a word, a classifier will then try to predict the current representation of
one of its context words. We will then propagate the errors to tweak both the weights and
the representation of the input word. We thus use a tf.Variable for the representations.
@lazy_property
def embeddings(self):
    initial = tf.random_uniform(
        [self.params.vocabulary_size, self.params.embedding_size],
        -1.0, 1.0)
    return tf.Variable(initial)
We use the MomentumOptimizer, which is not very clever but has the advantage of being very
fast. This makes it play nicely with our large Wikipedia corpus and the idea behind
skip-gram of preferring more data over clever algorithms.
@lazy_property
def optimize(self):
    optimizer = tf.train.MomentumOptimizer(
        self.params.learning_rate, self.params.momentum)
    return optimizer.minimize(self.cost)
The only missing part of our model is the classifier. This is the heart of the successful
skip-gram model and we will now take a look at how it works.
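The classifier listing itself is not shown in this excerpt; as a sketch, a noise-contrastive estimation cost (assuming a params.contrastive_examples hyperparameter for the number of sampled noise words) could look like this:
@lazy_property
def cost(self):
    # Look up the current representations of the input words.
    embedded = tf.nn.embedding_lookup(self.embeddings, self.data)
    # Weights and biases of the noise-contrastive classifier.
    weight = tf.Variable(tf.truncated_normal(
        [self.params.vocabulary_size, self.params.embedding_size],
        stddev=1.0 / self.params.embedding_size ** 0.5))
    bias = tf.Variable(tf.zeros([self.params.vocabulary_size]))
    # nce_loss expects integer target indices of shape [batch_size, 1] and
    # compares the true context word against randomly sampled noise words.
    target = tf.expand_dims(self.target, 1)
    return tf.reduce_mean(tf.nn.nce_loss(
        weight, bias, embedded, target,
        self.params.contrastive_examples, self.params.vocabulary_size))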
After about five hours of training, we will get the learned embeddings as a stored
Numpy array. While we will use the embeddings in a later chapter, you don't have to
compute them yourself if you don't want to. There are pre-trained word embeddings
available online, and we will point to them later when we need them.
Sequence Classification
Sequence classification is a problem setting where we predict a class for the whole
input sequence. Such problems are common in many fields, including genomics and
finance. A prominent example from NLP is sentiment analysis: predicting the attitude towards a
given topic from user-written text. For example, one could predict the sentiment of tweets
mentioning a certain candidate in an election and use that to forecast the election results.
Another example is predicting product or movie ratings from written reviews. This is used
as a benchmark task in the NLP community because reviews often contain numerical
ratings that make for convenient target values.
We will use a dataset of movie reviews from the Internet Movie Database with the
binary targets positive and negative. On this dataset, naive methods that just look at the
existence of words tend to fail because of negations, irony and general ambiguity in
language. We will build a recurrent model operating on the word vectors from the last
section. The recurrent network will see a review word-by-word. From the activation at the
last word, we will train a classifier to predict the sentiment of the whole review. Because
we train the architecture end-to-end, the RNN will learn to collect and encode the
information from the words that will be most valuable for the later classification.
class ImdbMovieReviews:

    DEFAULT_URL = \
        'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, cache_dir, url=None):
        self._cache_dir = cache_dir
        self._url = url or type(self).DEFAULT_URL

    def __iter__(self):
        filepath = download(self._url, self._cache_dir)
        with tarfile.open(filepath) as archive:
            for filename in archive.getnames():
                if filename.startswith('aclImdb/train/pos/'):
                    yield self._read(archive, filename), True
                elif filename.startswith('aclImdb/train/neg/'):
                    yield self._read(archive, filename), False

    def _read(self, archive, filename):
        with archive.extractfile(filename) as file_:
            data = file_.read().decode('utf-8')
        data = type(self).TOKEN_REGEX.findall(data)
        data = [x.lower() for x in data]
        return data
class Embedding:

    def __init__(self, vocabulary_path, embedding_path, length):
        self._embedding = np.load(embedding_path)
        with bz2.open(vocabulary_path, 'rt') as file_:
            self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
        self._length = length

    def __call__(self, sequence):
        data = np.zeros((self._length, self._embedding.shape[1]))
        indices = [self._vocabulary.get(x, 0) for x in sequence]
        embedded = self._embedding[indices]
        data[:len(sequence)] = embedded
        return data

    @property
    def dimensions(self):
        return self._embedding.shape[1]
First, we obtain the lengths of the sequences in the current data batch. We need this since
the data comes as a single tensor, padded with zero vectors to the longest review length.
Instead of keeping track of the sequence length of every review, we just compute it
dynamically in TensorFlow. To get the length per sequence, we collapse the word vectors
using the maximum on the absolute values. The resulting scalars will be zero for zero
vectors and larger than zero for any real word vector. We then discretize these values to
zero or one using tf.sign() and sum up the results along the time steps to obtain the length
of each sequence. The resulting tensor has one entry per sequence in the batch, containing
a scalar length for each.
@lazy_property
def length(self):
    used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
    length = tf.reduce_sum(used, reduction_indices=1)
    length = tf.cast(length, tf.int32)
    return length
As of now, TensorFlow only supports indexing along the first dimension, using
tf.gather(). We thus flatten the first two dimensions of the output activations from their
shape of sequences x time_steps x word_vectors and construct an index into this resulting tensor.
The index takes into account the start index for each sequence in the flat tensor and adds
the sequence length to it. Actually, we only add length - 1 so that we select the last valid
time step.
@staticmethod
def _last_relevant(output, length):
    batch_size = tf.shape(output)[0]
    max_length = int(output.get_shape()[1])
    output_size = int(output.get_shape()[2])
    index = tf.range(0, batch_size) * max_length + (length - 1)
    flat = tf.reshape(output, [-1, output_size])
    relevant = tf.gather(flat, index)
    return relevant
We will be able to train the whole model end-to-end with TensorFlow propagating the
errors through the softmax layer and the used time steps of the RNN. The only thing that
is missing for training is a cost function.
Gradient clipping
For sequence classification, we can use any cost function that makes sense for
classification, because the model output is just a probability distribution over the available
classes. In our example, the two classes are positive and negative sentiment, and we will
use a standard cross-entropy cost as explained in the previous chapter on object recognition
and classification.
To minimize the cost function, we use the optimizer defined in the configuration.
However, we will improve on what we've learned so far by adding gradient clipping.
RNNs are quite hard to train, and weights tend to diverge if the hyperparameters do not
play nicely together. The idea of gradient clipping is to restrict the values of the
gradient to a sensible range. This way, we can limit the maximum weight updates.
@lazy_property
def cost(self):
    cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
    return cross_entropy

@lazy_property
def optimize(self):
    gradient = self.params.optimizer.compute_gradients(self.cost)
    if self.params.gradient_clipping:
        limit = self.params.gradient_clipping
        gradient = [
            (tf.clip_by_value(g, -limit, limit), v)
            if g is not None else (None, v)
            for g, v in gradient]
    optimize = self.params.optimizer.apply_gradients(gradient)
    return optimize

@lazy_property
def error(self):
    mistakes = tf.not_equal(
        tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
    return tf.reduce_mean(tf.cast(mistakes, tf.float32))
TensorFlow supports this scenario with the compute_gradients() function that each
optimizer instance provides. We can then modify the gradients and apply the weight
changes with apply_gradients(). For gradient clipping, we set elements to -limit if they are
lower than that, or to limit if they are larger than that. The only tricky part is that
derivatives in TensorFlow can be None, which means there is no relation between a variable
and the cost function. Mathematically, those derivatives should be zero vectors, but using
None allows for internal performance optimizations. We handle those cases by just passing
the None value back in the tuple.
We can easily train the model now. We define the hyperparameters, load the dataset and
embeddings, and run the model on the preprocessed training batches.
params = AttrDict(
    rnn_cell=GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    gradient_clipping=100,  # referenced by the optimize property; value chosen for illustration
    batch_size=20,
)

reviews = ImdbMovieReviews('/home/user/imdb')
length = max(len(x[0]) for x in reviews)

embedding = Embedding(
    '/home/user/wikipedia/vocabulary.bz2',
    '/home/user/wikipedia/embedding.npy', length)
batches = preprocess_batched(reviews, length, embedding, params.batch_size)

data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
target = tf.placeholder(tf.float32, [None, 2])
model = SequenceClassificationModel(data, target, params)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for index, batch in enumerate(batches):
    feed = {data: batch[0], target: batch[1]}
    error, _ = sess.run([model.error, model.optimize], feed)
    print('{}: {:3.1f}%'.format(index + 1, 100 * error))
This time, the training success of this model will not only depend on the network
structure and hyperparameters, but also on the quality of the word embeddings. If you did
not train your own word embeddings as described in the last section, you can load
pre-trained embeddings from the word2vec project, which implements the skip-gram model:
https://code.google.com/archive/p/word2vec/, or from the very similar GloVe model by the
Stanford NLP group: http://nlp.stanford.edu/projects/glove/. In both cases you will be able
to find Python loaders on the web.
We have this model now, so what can you do with it? There is an open learning
competition on Kaggle, a famous website hosting data science challenges. It uses the same
IMDB movie review dataset as we did in this section. So if you are interested in how your
results compare to others, you can run the model on their test set and upload your results at
https://www.kaggle.com/c/word2vec-nlp-tutorial.
Sequence Labelling
In the last section, we built a sequence classification model that uses an LSTM network
and stacked a softmax layer on top of the last activation. Building on this, we will now
tackle the slightly harder problem of sequence labelling. This setting differs from
sequence classification in that we want to predict an individual class for each frame in the
input sequence.
For example, let's think about recognizing handwritten text. Each word is a sequence of
letters, and while we could classify each letter independently, human language has a
strong structure that we can take advantage of. If you take a look at a handwritten sample,
there are often letters that are hard to read on their own, for example n, m, and u.
They can be recognized from the context of nearby letters, however. In this section, we will
use RNNs to make use of the dependencies between letters and build a more robust OCR
(Optical Character Recognition) system.
class OcrDataset:
    """
    Dataset of handwritten words collected by Rob Kassel at the MIT Spoken
    Language Systems Group. Each example contains the normalized letters of the
    word, padded to the maximum word length. Only contains lower case letters,
    capitalized letters were removed.
    From: http://ai.stanford.edu/~btaskar/ocr/
    """
    URL = 'http://ai.stanford.edu/~btaskar/ocr/letter.data.gz'

    def __init__(self, cache_dir):
        path = download(type(self).URL, cache_dir)
        lines = self._read(path)
        data, target = self._parse(lines)
        self.data, self.target = self._pad(data, target)

    @staticmethod
    def _read(filepath):
        with gzip.open(filepath, 'rt') as file_:
            reader = csv.reader(file_, delimiter='\t')
            lines = list(reader)
        return lines

    @staticmethod
    def _parse(lines):
        lines = sorted(lines, key=lambda x: int(x[0]))
        data, target = [], []
        next_ = None
        for line in lines:
            if not next_:
                data.append([])
                target.append([])
            else:
                assert next_ == int(line[0])
            next_ = int(line[2]) if int(line[2]) > -1 else None
            pixels = np.array([int(x) for x in line[6:134]])
            pixels = pixels.reshape((16, 8))
            data[-1].append(pixels)
            target[-1].append(line[1])
        return data, target

    @staticmethod
    def _pad(data, target):
        max_length = max(len(x) for x in target)
        padding = np.zeros((16, 8))
        data = [x + ([padding] * (max_length - len(x))) for x in data]
        target = [x + ([''] * (max_length - len(x))) for x in target]
        return np.array(data), np.array(target)
We first sort the lines by their id values so that we can read the letters of each word in
the correct order. Then, we continue collecting letters until the field holding the next id is not
set, in which case we start a new sequence. After reading the target letters and their data
pixels, we pad the sequences with zero images so that they fit into two big Numpy arrays
containing the target letters and all the pixel data.
Let's implement the methods of our sequence labelling model. First off, we again need
to compute the sequence lengths. We already did this in the last section, so there is not
much to add here.
@lazy_property
def length(self):
    used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
    length = tf.reduce_sum(used, reduction_indices=1)
    length = tf.cast(length, tf.int32)
    return length
Now we come to the prediction, where the main difference from the sequence classification
model lies. There are two ways to add a softmax layer to all frames: we could either
add several different classifiers or share the same one among all frames. Since classifying the
third letter should not be very different from classifying the eighth letter, it makes sense to
take the latter approach. This way, the classifier weights are also trained more often,
because each letter of the word contributes to training them.
In order to implement a shared layer in TensorFlow, we have to apply a little trick. A
weight matrix of a fully-connected layer always has the dimensions in_size x out_size and
thus expects a two-dimensional input of batch_size x in_size. But we now have two input
dimensions along which we want to apply the matrix, batch_size and sequence_steps.
What we can do to circumvent this problem is to flatten the input to the layer, in this
case the outgoing activation of the RNN, to shape batch_size * sequence_steps x in_size. This
way, it just looks like a large batch to the weight matrix. Of course, we have to reshape the
results back to unflatten them.
@lazy_property
def prediction(self):
    output, _ = tf.nn.dynamic_rnn(
        GRUCell(self.params.rnn_hidden),
        self.data,
        dtype=tf.float32,
        sequence_length=self.length,
    )
    # Softmax layer.
    max_length = int(self.target.get_shape()[1])
    num_classes = int(self.target.get_shape()[2])
    weight = tf.Variable(tf.truncated_normal(
        [self.params.rnn_hidden, num_classes], stddev=0.01))
    bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
    # Flatten to apply same weights to all time steps.
    output = tf.reshape(output, [-1, self.params.rnn_hidden])
    prediction = tf.nn.softmax(tf.matmul(output, weight) + bias)
    prediction = tf.reshape(prediction, [-1, max_length, num_classes])
    return prediction
The cost and error functions change slightly compared to what we had for sequence
classification. Namely, there is now a prediction-target pair for each frame in the
sequence, so we have to average over that dimension as well. However, tf.reduce_mean()
doesn't work here, since it would normalize by the tensor length, which is the maximum
sequence length. Instead, we want to normalize by the actual sequence lengths computed
earlier. Thus, we manually use tf.reduce_sum() and a division to obtain the correct mean.
@lazy_property
def cost(self):
    # Compute cross entropy for each frame.
    cross_entropy = self.target * tf.log(self.prediction)
    cross_entropy = -tf.reduce_sum(cross_entropy, reduction_indices=2)
    mask = tf.sign(tf.reduce_max(tf.abs(self.target), reduction_indices=2))
    cross_entropy *= mask
    # Average over actual sequence lengths.
    cross_entropy = tf.reduce_sum(cross_entropy, reduction_indices=1)
    cross_entropy /= tf.cast(self.length, tf.float32)
    return tf.reduce_mean(cross_entropy)
Analogously to the cost, we have to adjust the error function. The axis that tf.argmax()
operates on is now axis two rather than axis one. We then mask padding frames and
compute the average over the actual sequence lengths. The last tf.reduce_mean() averages
over the words in the data batch.
@lazy_property
def error(self):
    mistakes = tf.not_equal(
        tf.argmax(self.target, 2), tf.argmax(self.prediction, 2))
    mistakes = tf.cast(mistakes, tf.float32)
    mask = tf.sign(tf.reduce_max(tf.abs(self.target), reduction_indices=2))
    mistakes *= mask
    # Average over actual sequence lengths.
    mistakes = tf.reduce_sum(mistakes, reduction_indices=1)
    mistakes /= tf.cast(self.length, tf.float32)
    return tf.reduce_mean(mistakes)
The nice thing about TensorFlow's automatic gradient computation is that we can use
the same optimization operation for this model as we used for sequence classification, just
plugging in the new cost function. We will apply gradient clipping in all RNNs from now
on, since it can prevent divergence during training while not having any negative
impact.
@lazy_property
def optimize(self):
    gradient = self.params.optimizer.compute_gradients(self.cost)
    if self.params.gradient_clipping:
        limit = self.params.gradient_clipping
        gradient = [
            (tf.clip_by_value(g, -limit, limit), v)
            if g is not None else (None, v)
            for g, v in gradient]
    optimize = self.params.optimizer.apply_gradients(gradient)
    return optimize
After training on 1000 words, our model classifies about 9% of all letters in the test set
correctly. That's not too bad, but there is also room for improvement here.
Our current model is very similar to the model for sequence classification. This was
intentional, so that you can see the changes needed to adapt existing models to
new tasks. What worked on another problem is more likely to work well on a new
problem than a wild guess would be. However, we can do better! In the next
section, we will try to improve on our results using a more advanced recurrent
architecture.
Bidirectional RNNs
How can we improve the results on the OCR dataset that we got with the RNN plus softmax architecture? Well, let's take a look at our motivation for using RNNs. The reason we chose them for the OCR dataset was that there are dependencies, or mutual information, between the letters within one word. The RNN stores information about all the previous inputs of the same word in its hidden activation.
If you think about it, the recurrency in our model doesn't help much for classifying the first few letters, because the network hasn't had many inputs yet to infer additional information from. In sequence classification, this wasn't a problem, since the network sees all frames before making a decision. In sequence labelling, we can address this shortcoming using bidirectional RNNs, a technique that holds the state of the art in several classification problems.
The idea of bidirectional RNNs is simple. There are two RNNs that look at the input sequence: one going from the left, reading the word in normal order, and one going from the right, reading the letters in reverse order. At each time step, we now have two output activations that we concatenate before passing them up into the shared softmax layer. Using this architecture, the classifier can access information about the whole word at each letter.
The _shared_softmax() function below is easy; we already had the code in the prediction property before. The difference is that this time, we infer the input size from the data tensor that gets passed into the function. This way, we can reuse the function for other architectures if needed. Then we use the same flattening trick to share the same softmax layer across all time steps.
def _shared_softmax(self, data, out_size):
    max_length = int(data.get_shape()[1])
    in_size = int(data.get_shape()[2])
    weight = tf.Variable(tf.truncated_normal(
        [in_size, out_size], stddev=0.01))
    bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
    # Flatten to apply same weights to all time steps.
    flat = tf.reshape(data, [-1, in_size])
    output = tf.nn.softmax(tf.matmul(flat, weight) + bias)
    output = tf.reshape(output, [-1, max_length, out_size])
    return output
Here comes the interesting part, the implementation of bidirectional RNNs. As you can see, we create two RNNs using tf.nn.dynamic_rnn. The forward network should look familiar, while the backward network is new.
Instead of just feeding the data into the backward RNN, we first reverse the sequences. This is easier than implementing a new RNN operation that would go backwards. TensorFlow helps us here with the tf.reverse_sequence() function, which takes care of only reversing the used frames up to sequence_length. Note that at the moment of writing this, the function expects the lengths to be a 64-bit integer tensor. It's likely that it will also work with 32-bit tensors at some point, so that you can just pass in self.length.
def _bidirectional_rnn(self, data, length):
    length_64 = tf.cast(length, tf.int64)
    forward, _ = tf.nn.dynamic_rnn(
        cell=self.params.rnn_cell(self.params.rnn_hidden),
        inputs=data,
        dtype=tf.float32,
        sequence_length=length,
        scope='rnn-forward')
    backward, _ = tf.nn.dynamic_rnn(
        cell=self.params.rnn_cell(self.params.rnn_hidden),
        inputs=tf.reverse_sequence(data, length_64, seq_dim=1),
        dtype=tf.float32,
        sequence_length=length,
        scope='rnn-backward')
    backward = tf.reverse_sequence(backward, length_64, seq_dim=1)
    output = tf.concat(2, [forward, backward])
    return output
We also use the scope parameter this time. Why do we need this? As explained in the TensorFlow Fundamentals chapter, nodes in the compute graph have names. scope is the name of the variable scope used by tf.nn.dynamic_rnn, and it defaults to RNN. This time we have two RNNs that have different parameters, so they have to live in different scopes.
After feeding the reversed sequence into the backward RNN, we again reverse the
network outputs to align with the forward outputs. Then we concatenate both tensors
along the dimension of the output neurons of the RNNs and return this. For example, with
a batch size of 50, 300 hidden units per RNN and words of up to 14 letters, the resulting
tensor would have the shape 50 x 14 x 600.
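For completeness, here is how a prediction property could tie the two helpers together. This is only a sketch, under the assumption that _bidirectional_rnn() and _shared_softmax() live on the same model class as before; the exact wiring is not shown in this section.
@lazy_property
def prediction(self):
    # Concatenated forward and backward outputs,
    # shape batch_size x max_length x (2 * rnn_hidden).
    output = self._bidirectional_rnn(self.data, self.length)
    num_classes = int(self.target.get_shape()[2])
    return self._shared_softmax(output, num_classes)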
Okay cool, we built our first architecture that is composed of multiple RNNs! Let's see how it performs, using the training code from the last section. As you can see from comparing the graphs, the bidirectional model performs significantly better. After seeing 1,000 words, it only misclassifies 4% of the letters in the test split.
To summarize, in this section we learned how to use RNNs for sequence labelling and how it differs from the sequence classification setting. Namely, we want a classifier that takes the RNN outputs and is shared across all time steps.
This architecture can be improved drastically by adding a second RNN that visits the sequence from back to front and combining the outputs at each time step. This is because then, information about the whole sequence is available for the classification of each letter. In the next section, we will take a look at training an RNN in an unsupervised fashion in order to learn language.
Predictive coding
We already learned how to use RNNs to classify the sentiment of movie reviews, and to
recognize hand-written words. These applications have been supervised, meaning that we
needed a labelled dataset. Another interesting learning setting is called predictive coding.
We just show the RNN a lot of sequences and train it to predict the next frame of the
sequence.
Let's take text as an example, where predicting the likelihood of the next word in a sentence is called language modelling. Why would it be useful to predict the next word in a sentence? One group of applications is recognizing language. Let's say you want to build a handwriting recognizer that translates scans of handwriting to typed text. While you can try to recover all the words from the input scans only, knowing the distribution of likely next words narrows down the candidate words to decide between. It's the difference between dumb recognition of shapes and reading, basically.
Besides improving performance in tasks involving natural language, we can also sample from the distribution of what the network thinks should follow next in order to generate text. After training, we can feed a seed word into the RNN and look at the next word prediction. Then we feed the most likely word back into the RNN as the next input to see what it thinks should follow now. Doing this repeatedly, we can generate new content that looks similar to the training data.
The interesting thing is that predictive coding trains the network to compress all the important information of any sequence. The next word in a sentence usually depends on the previous words, their order, and the relations between them. A network that is able to accurately predict the next character in natural language thus needs to capture the rules of grammar and language well.
Since we are interested in machine learning papers, we will search within the categories Machine Learning, Neural and Evolutionary Computing, and Optimization and Control. We further restrict the results to those containing any of the words neural, network, or deep in the metadata. This gives us about 7 MB of text, which is a fair amount of data to learn a simple RNN language model. Using more data would give better results, but we don't want to wait for too many hours of training to pass before seeing some results. Feel free to use a broader search query and train this model on more data, though.
ENDPOINT = 'http://export.arxiv.org/api/query'

def _build_url(self, amount, offset):
    categories = ' OR '.join('cat:' + x for x in self.categories)
    # The rest of this listing was truncated; as a sketch, the arXiv API
    # takes search_query, start, and max_results parameters:
    url = type(self).ENDPOINT + '?search_query=' + categories
    return url + '&start={}&max_results={}'.format(offset, amount)
The _fetch_all() method basically performs pagination. The API only gives us a certain amount of abstracts per request, and we can specify an offset to get the results of the second, third, etc. page. As you can see, we can specify the page size, which gets passed into the next function, _fetch_page(). In theory, we could set the page size to a huge number and try to get all results at once. In practice, however, this makes the request very slow. Fetching in pages is also more fault tolerant and, most importantly, does not stress the arXiv API too much.
PAGE_SIZE = 100

def _fetch_all(self, amount):
    page_size = type(self).PAGE_SIZE
    count = self._fetch_count()
    if amount:
        count = min(count, amount)
    for offset in range(0, count, page_size):
        print('Fetch papers {}/{}'.format(offset + page_size, count))
        yield from self._fetch_page(page_size, offset)
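The _fetch_count() helper used above is not part of this excerpt. As a rough sketch, and assuming the arXiv API reports the total number of results in its opensearch metadata (the tag name below is an assumption), it could be implemented like this:
def _fetch_count(self):
    # Ask for zero results; we only need the total count from the metadata.
    url = self._build_url(0, 0)
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    count = int(soup.find('opensearch:totalresults').string)
    return count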
Here we perform the actual fetching. The result comes in XML, and we use the popular and powerful BeautifulSoup library to extract the abstracts. If you haven't installed it already, you can issue sudo -H pip3 install beautifulsoup4. BeautifulSoup parses the XML result for us so that we can easily iterate over the tags of interest. First, we look for <entry> tags corresponding to publications, and within each of them, we read out the <summary> tag containing the abstract text.
def _fetch_page(self, amount, offset):
    url = self._build_url(amount, offset)
    response = requests.get(url)
    soup = BeautifulSoup(response.text)
    for entry in soup.findAll('entry'):
        text = entry.find('summary').text
        text = text.strip().replace('\n', ' ')
        yield text
class Preprocessing:

    VOCABULARY = \
        " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
        "\\^_abcdefghijklmnopqrstuvwxyz{|}"

    def __init__(self, texts, length, batch_size):
        self.texts = texts
        self.length = length
        self.batch_size = batch_size
        self.lookup = {x: i for i, x in enumerate(self.VOCABULARY)}

    def __call__(self, texts):
        batch = np.zeros((len(texts), self.length, len(self.VOCABULARY)))
        for index, text in enumerate(texts):
            text = [x for x in text if x in self.lookup]
            assert 2 <= len(text) <= self.length
            for offset, character in enumerate(text):
                code = self.lookup[character]
                batch[index, offset, code] = 1
        return batch

    def __iter__(self):
        windows = []
        for text in self.texts:
            for i in range(0, len(text) - self.length + 1, self.length // 2):
                windows.append(text[i: i + self.length])
        assert all(len(x) == len(windows[0]) for x in windows)
        while True:
            random.shuffle(windows)
            for i in range(0, len(windows), self.batch_size):
                batch = windows[i: i + self.batch_size]
                yield self(batch)
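To get a feeling for the preprocessing class, here is a small usage example. The texts are made up and only serve to illustrate the batch shape; our vocabulary above contains 83 characters.
texts = ['we study recurrent networks', 'deep models learn features']
prep = Preprocessing(texts, length=20, batch_size=2)
batch = next(iter(prep))
print(batch.shape)  # (2, 20, 83): batch x time x one-hot vocabulary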
class PredictiveCodingModel:

    def __init__(self, params, sequence, initial=None):
        self.params = params
        self.sequence = sequence
        self.initial = initial
        self.prediction
        self.state
        self.cost
        self.error
        self.logprob
        self.optimize

    @lazy_property
    def data(self):
        pass

    @lazy_property
    def target(self):
        pass

    @lazy_property
    def mask(self):
        pass

    @lazy_property
    def length(self):
        pass

    @lazy_property
    def prediction(self):
        pass

    @lazy_property
    def state(self):
        pass

    @lazy_property
    def forward(self):
        pass

    @lazy_property
    def cost(self):
        pass

    @lazy_property
    def error(self):
        pass

    @lazy_property
    def logprob(self):
        pass

    @lazy_property
    def optimize(self):
        pass

    def _average(self, data):
        pass
In the code example above, you can see an overview of the functions that our model will implement. Don't worry if that looks overwhelming at first: we just want to expose some more values of our model than we did in the previous sections.
Let's start with the data processing. As we said, the model just takes one block of sequences as input. First, we use it to construct input data and target sequences. This is where we introduce a temporal difference: at time step $t$, the model should receive character $x_t$ as input but $x_{t+1}$ as target. An easy way to obtain data or target is to slice the provided sequence and cut away the last or the first frame, respectively.
We do this slicing using tf.slice(), which takes the sequence to slice, a tuple of start indices for each dimension, and a tuple of sizes for each dimension. For the sizes, -1 means to keep all elements from the start index in that dimension until the end. Since we want to slice frames, we only care about the second dimension.
@lazy_property
def data(self):
    max_length = int(self.sequence.get_shape()[1])
    return tf.slice(self.sequence, (0, 0, 0), (-1, max_length - 1, -1))

@lazy_property
def target(self):
    return tf.slice(self.sequence, (0, 1, 0), (-1, -1, -1))

@lazy_property
def mask(self):
    return tf.reduce_max(tf.abs(self.target), reduction_indices=2)

@lazy_property
def length(self):
    return tf.reduce_sum(self.mask, reduction_indices=1)
We also define two properties on the target sequence, as already discussed in earlier sections: mask is a tensor of size batch_size x max_length whose elements are zero or one, depending on whether the respective frame is used or is a padding frame. The length property sums up the mask along the time axis in order to obtain the length of each sequence.
Note that the mask and length properties are also valid for the data sequence since, conceptually, it is of the same length as the target sequence. However, we couldn't compute them on the data sequence, since it still contains the last frame, which is not needed because there is no next character to predict. You are right, we sliced away the last frame of the data tensor, but for most sequences that frame didn't contain the actual last character, just padding. This is the reason why we will use mask below to mask our cost function.
Now we will define the actual network, which consists of a recurrent network and a shared softmax layer, just like the one we used for sequence labelling in the previous section. We don't show the code for the shared softmax layer here again, but you can find it in the previous section.
@lazy_property
def prediction(self):
    prediction, _ = self.forward
    return prediction

@lazy_property
def state(self):
    _, state = self.forward
    return state

@lazy_property
def forward(self):
    cell = self.params.rnn_cell(self.params.rnn_hidden)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * self.params.rnn_layers)
    hidden, state = tf.nn.dynamic_rnn(
        inputs=self.data,
        cell=cell,
        dtype=tf.float32,
        initial_state=self.initial,
        sequence_length=self.length)
    vocabulary_size = int(self.target.get_shape()[2])
    prediction = self._shared_softmax(hidden, vocabulary_size)
    return prediction, state
The new part about the neural network code above is that we want to get both the prediction and the last recurrent activation. Previously, we only returned the prediction, but the last activation allows us to generate sequences more effectively later. Since we only want to construct the graph for the recurrent network once, there is a forward property that returns the tuple of both tensors; prediction and state are just there to provide easy access from the outside.
The next part of our model is the cost and evaluation functions. At each time step, the model predicts the next character out of the vocabulary. This is a classification problem, so we use the cross-entropy cost accordingly. We can easily compute the error rate of character predictions as well.
The logprob property is new. It describes the probability that our model assigned to the correct next character, in logarithmic space. This is basically the negative cross entropy transformed into logarithmic space and averaged there. Converting the result back into linear space yields the so-called perplexity, a common measure to evaluate the performance of language models.
The perplexity is defined as $\text{perplexity} = 2^{-\frac{1}{n} \sum_{i=1}^{n} \log_2 p(x_i)}$, where $p(x_i)$ is the probability the model assigned to the correct character at step $i$. Intuitively, it represents the number of options the model had to guess between at each time step. A perfect model has a perplexity of 1, while a model that always outputs the same probability for each of the $k$ classes has a perplexity of $k$; for example, a model that is uniform over our 83 vocabulary characters would have a perplexity of 83. The perplexity can even become infinite if the model ever assigns a probability of zero to the next character. We prevent this extreme case by clamping the prediction probabilities between a very small positive number and one.
@lazy_property
def cost(self):
    prediction = tf.clip_by_value(self.prediction, 1e-10, 1.0)
    cost = self.target * tf.log(prediction)
    cost = -tf.reduce_sum(cost, reduction_indices=2)
    return self._average(cost)

@lazy_property
def error(self):
    error = tf.not_equal(
        tf.argmax(self.prediction, 2), tf.argmax(self.target, 2))
    error = tf.cast(error, tf.float32)
    return self._average(error)

@lazy_property
def logprob(self):
    logprob = tf.mul(self.prediction, self.target)
    logprob = tf.reduce_max(logprob, reduction_indices=2)
    logprob = tf.log(tf.clip_by_value(logprob, 1e-10, 1.0)) / tf.log(2.0)
    return self._average(logprob)

def _average(self, data):
    data *= self.mask
    # Use the maximum of each length and one to avoid division by zero.
    length = tf.maximum(self.length, 1.0)
    data = tf.reduce_sum(data, reduction_indices=1) / length
    data = tf.reduce_mean(data)
    return data
All three properties above are averaged over the frames of all sequences. With fixed-length sequences, this would be a single tf.reduce_mean(), but since we work with variable-length sequences, we have to be a bit more careful. First, we mask out padding frames by multiplying with the mask. Then we aggregate along the frame dimension; because the functions above already reduce each frame to a scalar, this leaves us with one value per frame that tf.reduce_sum() can sum up per sequence.
Next, we want to average over the frames of each sequence using the actual sequence length. To protect against division by zero in the case of empty sequences, we use the maximum of each sequence length and one. Finally, we can use tf.reduce_mean() to average over the examples in the batch.
We will head directly to training this model. Note that we did not define the optimize operation. It is identical to the ones used for sequence classification and sequence labelling earlier in the chapter.
class Training:

    @overwrite_graph
    def __init__(self, params):
        self.params = params
        self.texts = ArxivAbstracts('/home/user/dataset/arxiv')()
        self.prep = Preprocessing(
            self.texts, self.params.max_length, self.params.batch_size)
        self.sequence = tf.placeholder(
            tf.float32,
            [None, self.params.max_length, len(self.prep.VOCABULARY)])
        self.model = PredictiveCodingModel(self.params, self.sequence)
        self._init_or_load_session()

    def __call__(self):
        print('Start training')
        self.logprobs = []
        batches = iter(self.prep)
        for epoch in range(self.epoch, self.params.epochs + 1):
            self.epoch = epoch
            for _ in range(self.params.epoch_size):
                self._optimization(next(batches))
            self._evaluation()
        return np.array(self.logprobs)

    def _optimization(self, batch):
        logprob, _ = self.sess.run(
            (self.model.logprob, self.model.optimize),
            {self.sequence: batch})
        if np.isnan(logprob):
            raise Exception('training diverged')
        self.logprobs.append(logprob)

    def _evaluation(self):
        self.saver.save(self.sess, os.path.join(
            self.params.checkpoint_dir, 'model'), self.epoch)
        perplexity = 2 ** -(sum(self.logprobs[-self.params.epoch_size:]) /
                            self.params.epoch_size)
        print('Epoch {:2d} perplexity {:5.1f}'.format(self.epoch, perplexity))

    def _init_or_load_session(self):
        self.sess = tf.Session()
        self.saver = tf.train.Saver()
        checkpoint = tf.train.get_checkpoint_state(self.params.checkpoint_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            path = checkpoint.model_checkpoint_path
            print('Load checkpoint', path)
            self.saver.restore(self.sess, path)
            self.epoch = int(re.search(r'-(\d+)$', path).group(1)) + 1
        else:
            ensure_directory(self.params.checkpoint_dir)
            print('Randomly initialize variables')
            self.sess.run(tf.initialize_all_variables())
            self.epoch = 1
The __call__() method runs the training: we iterate over batches of the dataset and keep track of the logarithmic probabilities. We use those at evaluation time between each training epoch to compute and print the perplexity.
In _init_or_load_session() we introduce a tf.train.Saver() that stores the current values of all tf.Variable() objects in the graph in a checkpoint file. While the actual checkpointing is done in _evaluation(), here we create the saver and look for existing checkpoints to load. tf.train.get_checkpoint_state() looks for TensorFlow's metadata file in our checkpoint directory; as of writing, it only contains the path of the most recent checkpoint file. Checkpoint filenames get a number appended that we can specify, in our case the epoch number. When loading a checkpoint, we apply a regular expression with Python's re module to extract that epoch number. With the checkpointing logic set up, we can start training. Here is the configuration:
def get_params():
    checkpoint_dir = '/home/user/model/arxiv-predictive-coding'
    max_length = 50
    sampling_temperature = 0.7
    rnn_cell = GRUCell
    rnn_hidden = 200
    rnn_layers = 2
    learning_rate = 0.002
    optimizer = tf.train.AdamOptimizer(learning_rate)
    gradient_clipping = 5
    batch_size = 100
    epochs = 20
    epoch_size = 200
    return AttrDict(**locals())
To run the code, you can just call Training(get_params())(). On my notebook, the 20 epochs take about one hour. During this training, the model saw 20 epochs * 200 batches * 100 examples * 50 characters = 20M characters.
As you can see on the graph, the model converges at a perplexity of about 1.5 per character. Since the number of bits needed to encode a character equals the base-two logarithm of the perplexity, this corresponds to compressing text at an average of about log2(1.5) ≈ 0.58 bits per character.
For comparison with word-level language models, we would have to average over the number of words rather than the number of characters. As a rough estimate, we can multiply the per-character measure by the average number of characters per word, measured on our test set.
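To make that conversion concrete, here is the arithmetic as a small sketch. The 5.4 characters per word below is a made-up illustration, not a number measured on our data:
import numpy as np

char_perplexity = 1.5                     # per-character perplexity from above
bits_per_char = np.log2(char_perplexity)  # about 0.58 bits per character
chars_per_word = 5.4                      # hypothetical average, with whitespace
word_perplexity = 2 ** (bits_per_char * chars_per_word)
print(word_perplexity)                    # roughly 8.9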
In the constructor, we create an instance of our preprocessing class that we will use to convert the currently generated sequence into a NumPy block to feed into the graph. The sequence placeholder for this only holds one sequence per batch, because we don't want to generate multiple sequences at the same time.
One thing to explain is the sequence length of two. Remember that the model uses all but the last character as input data and all but the first character as targets. We feed in the last character of the current text plus an arbitrary second character as the sequence. The network then predicts what should follow the first character. The second character formally serves as the target, but since we don't train anything, it is ignored.
You may wonder how we can get away with only passing the last character of the current text into the network. The trick here is that we fetch the last activation of the recurrent network and use it to initialize the state in the next run. For this, we make use of the initial state argument of our model. For the GRUCell that we use, the state is a vector of size rnn_layers * rnn_hidden.
@overwrite_graph
def __init__(self, params, length):
    self.params = params
    self.prep = Preprocessing([], 2, self.params.batch_size)
    self.sequence = tf.placeholder(
        tf.float32, [1, 2, len(self.prep.VOCABULARY)])
    self.state = tf.placeholder(
        tf.float32, [1, self.params.rnn_hidden * self.params.rnn_layers])
    self.model = PredictiveCodingModel(
        self.params, self.sequence, self.state)
    self.sess = tf.Session()
    checkpoint = tf.train.get_checkpoint_state(self.params.checkpoint_dir)
    if checkpoint and checkpoint.model_checkpoint_path:
        tf.train.Saver().restore(
            self.sess, checkpoint.model_checkpoint_path)
    else:
        print('Sampling from untrained model.')
    print('Sampling temperature', self.params.sampling_temperature)
The __call__() function defines the logic for sampling a text sequence. We start with the seed and predict one character at a time, always feeding the current text into the network. We use the same preprocessing class to convert the current text into a padded NumPy block and feed it into the network. Since we only have one sequence with a single output frame in the batch, we only care about the prediction at index [0, 0]. We then sample from the softmax output using the _sample() function described next.
def __call__(self, seed, length=100):
    text = seed
    state = np.zeros((1, self.params.rnn_hidden * self.params.rnn_layers))
    for _ in range(length):
        feed = {self.state: state}
        feed[self.sequence] = self.prep([text[-1] + '?'])
        prediction, state = self.sess.run(
            [self.model.prediction, self.model.state], feed)
        text += self._sample(prediction[0, 0])
    return text
How do we sample from the network output? Earlier we said we could generate sequences by taking the network's best bet and feeding that in as the next frame. Actually, we don't just choose the most likely next frame, but randomly sample one from the probability distribution that the RNN outputs. This way, words with a high output probability are more likely to be chosen, but less likely words are still possible. This results in more varied generated sequences; otherwise, we might just generate the same average sentence again and again.
There is a simple mechanism to manually control how adventurous the generation process should be. For example, if we always chose the next word randomly (and ignored the network output completely), we would get very new and unique sentences, but they would not make any sense. If we always choose the network's highest output as the next word, we get a lot of common but meaningless words like the, a, etc.
The way we can control this behavior is by introducing a temperature parameter $T$. We use this parameter to make the predictions of the softmax output distribution more similar or more extreme. This results in more interesting but random sequences on one side of the spectrum, and in more plausible but boring sequences on the other side. The way it works is that we transform the outputs into logarithmic space, scale them there, and then transform them back into exponential space and normalize again:

$p_i' = \frac{\exp(\ln(p_i) / T)}{\sum_j \exp(\ln(p_j) / T)}$

Since the network already outputs a softmax distribution, we undo it by applying the natural logarithm. We don't have to undo the normalization, since we will normalize our results again anyway. Then we divide each value by the chosen temperature and re-apply the softmax function.
def _sample(self, dist):
    dist = np.log(dist) / self.params.sampling_temperature
    dist = np.exp(dist) / np.exp(dist).sum()
    choice = np.random.choice(len(dist), p=dist)
    choice = self.prep.VOCABULARY[choice]
    return choice
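To see the effect of the temperature on a distribution, here is a small demonstration using the same transformation as in _sample(); the three-class distribution is hypothetical:
import numpy as np

def apply_temperature(dist, temperature):
    # Undo the softmax, scale in logarithmic space, and re-normalize.
    dist = np.log(dist) / temperature
    return np.exp(dist) / np.exp(dist).sum()

dist = np.array([0.5, 0.3, 0.2])
print(apply_temperature(dist, 0.7))  # sharper than the input distribution
print(apply_temperature(dist, 2.0))  # closer to uniform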
Let's run the code by calling Sampling(get_params())('We', 500) to have the network generate a new abstract. While you can certainly tell that this text was not written by a human, it is quite remarkable what the network learns from examples.
We study nonconvex encoder in the networks (RFNs) hasding configurations with
non-convex large-layers of images, each directions literatic for layers. More
recent results competitive strategy, in which data at training and more
difficult to parallelize. Recent Newutic systems, the desirmally parametrically
in the DNNs improves optimization technique, we extend their important and
subset of theidesteding and dast and scale in recent advances in sparse
recovery to complicated patterns of the $L_p$
We did not tell the RNN what a space is, but it captured statistical dependencies in the data and placed whitespace accordingly in the generated text. Even between some nonexistent words that the network dreamed up, the whitespace looks reasonable. Moreover, those words are composed of valid combinations of vowels and consonants, another abstract feature learned from the example texts.
Conclusion
RNNs are powerful sequential models that are applicable to a wide range of problems and are responsible for state-of-the-art results. We learned how to optimize RNNs, what problems arise in doing so, and how architectures like LSTM and GRU help to overcome them. Using these building blocks, we solved several problems in natural language processing and related domains, including classifying the sentiment of movie reviews, recognizing hand-written words, and generating fake scientific abstracts.
In the next chapter, we will put our trained models into production so they can be consumed by other applications.
That will mount your home directory at the /mnt/home path of the container and let you work in a terminal inside of it. This is useful, as you can work on your code directly in your favorite IDE/editor and just use the container for running the build tools. It also leaves port 9999 open so you can access it from your host machine for later usage of the server we are going to build.
You can leave the container terminal with exit, which will stop it from running, and start it again as many times as you want using the command above.
Bazel workspace
TensorFlow Serving programs are written in C++ and should be built using Google's Bazel build tool. We are going to run Bazel from inside the recently created container.
Bazel manages third-party dependencies at the code level, downloading and building them, as long as they are also built with Bazel. To define which third-party dependencies our project uses, we must define a WORKSPACE file at the root of our project repository.
The dependencies we need are the TensorFlow Serving repository and, for the case of our example, the TensorFlow Models repository, which includes the Inception model code.
Sadly, at the moment of this writing, TensorFlow Serving does not support being referenced directly through Bazel as a Git repository, so we must include it as a Git submodule in our project:
# on your local machine
mkdir ~/serving_example
cd ~/serving_example
git init
git submodule add https://github.com/tensorflow/serving.git tf_serving
git submodule update --init --recursive
We now define the third-party dependencies as locally stored files using the local_repository rule in the WORKSPACE file. We also have to initialize TensorFlow dependencies using the tf_workspace rule imported from the project:
# Bazel WORKSPACE file
workspace(name = "serving")

local_repository(
    name = "tf_serving",
    path = __workspace_dir__ + "/tf_serving",
)

local_repository(
    name = "org_tensorflow",
    path = __workspace_dir__ + "/tf_serving/tensorflow",
)

load('//tf_serving/tensorflow/tensorflow:workspace.bzl', 'tf_workspace')
tf_workspace("tf_serving/tensorflow/", "@org_tensorflow")

bind(
    name = "libssl",
    actual = "@boringssl_git//:ssl",
)

bind(
    name = "zlib",
    actual = "@zlib_archive//:zlib",
)

# only needed for inception model export
local_repository(
    name = "inception_model",
    path = __workspace_dir__ + "/tf_serving/tf_models/inception",
)
As a last step, we have to run ./configure for TensorFlow from within the container:
# on the docker container
cd /mnt/home/serving_example/tf_serving/tensorflow
./configure
def inference(x):
    # from the original model
    ...

external_x = tf.placeholder(tf.string)
x = convert_external_inputs(external_x)
y = inference(x)
In the code above, we define the placeholder for the input. We call a function to convert the external input, represented in the placeholder, to the format required by the original model's inference method; for example, we will convert from the JPEG string to the image format required by Inception. Finally, we call the original model's inference method with the converted input.
For example, for the Inception model we should have methods like:
import tensorflow as tf
from tensorflow_serving.session_bundle import exporter
from inception import inception_model

def convert_external_inputs(external_x):
    # Transform the external input to the input format required for inference.
    # Convert the image string to a pixels tensor with values in the range 0, 1.
    image = tf.image.convert_image_dtype(
        tf.image.decode_jpeg(external_x, channels=3), tf.float32)
    # Resize the image to the width and height expected by the model.
    images = tf.image.resize_bilinear(tf.expand_dims(image, 0), [299, 299])
    # Convert the pixels to the range -1, 1 required by the model.
    images = tf.mul(tf.sub(images, 0.5), 2)
    return images

def inference(images):
    logits, _ = inception_model.inference(images, 1001)
    return logits
The inference method requires values for its parameters. We will recover those from a training checkpoint. As you may recall from the basics chapter, we periodically save training checkpoint files of our model. Those contain the learned values of the parameters at that time, so in case of disaster we don't lose the training progress.
When we declare the training complete, the last saved training checkpoint will contain the most up-to-date model parameters, which are the ones we wish to put in production.
To restore the checkpoint, the code should be:
saver = tf.train.Saver()
with tf.Session() as sess:
    # Restore variables from training checkpoints.
    ckpt = tf.train.get_checkpoint_state(sys.argv[1])
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, sys.argv[1] + "/" + ckpt.model_checkpoint_path)
    else:
        print("Checkpoint file not found")
        raise SystemExit
For the Inception model, you can download a pretrained checkpoint from
http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
# on the docker container
cd /tmp
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
tar -xzf inception-v3-2016-03-01.tar.gz
Because of dependencies on auto-generated code in the Exporter class, you will need to run our exporter using Bazel, inside the Docker container.
To do so, we save our code as export.py inside the Bazel workspace we started before. We can then run the exporter from inside the container with the command:
# on the docker container
cd /mnt/home/serving_example
bazel run :export /tmp/inception-v3
You can use this same interface definition for any kind of service that receives an image, an audio fragment, or a piece of text.
For a structured input like a database record, just change the ClassificationRequest message. For example, if we were trying to build the classification service for the Iris dataset:
message ClassificationRequest {
    float petalWidth = 1;
    float petalHeight = 2;
    float sepalWidth = 3;
    float sepalHeight = 4;
}
The proto file will be converted to the corresponding class definitions for the client and the server by the proto compiler. To use the protobuf compiler, we have to add a new rule to the BUILD file, like:
load("@protobuf//:protobuf.bzl", "cc_proto_library")
cc_proto_library(
name="classification_service_proto",
srcs=["classification_service.proto"],
cc_libs = ["@protobuf//:protobuf"],
protoc="@protobuf//:protoc",
default_runtime="@protobuf//:protobuf",
use_grpc_plugin=1
)
Notice the load at the top of the code fragment. It imports the cc_proto_library rule
definition from the externally imported protobuf library. Then we use it for defining a
build to our proto file. Lets run the build using bazel build :classification_service_proto and
check the resulting bazel-genfiles/classification_service.grpc.pb.h:
...
class ClassificationService {
  ...
  class Service : public ::grpc::Service {
   public:
    Service();
    virtual ~Service();
    virtual ::grpc::Status classify(::grpc::ServerContext* context, const ::ClassificationRequest*
  };
  ...
};

class ClassificationRequest : public ::google::protobuf::Message {
  ...
  const ::std::string& input() const;
  void set_input(const ::std::string& value);
  ...
};

class ClassificationResponse : public ::google::protobuf::Message {
  ...
  const ::ClassificationClass& classes() const;
  void set_allocated_classes(::ClassificationClass* classes);
  ...
};

class ClassificationClass : public ::google::protobuf::Message {
  ...
  const ::std::string& name() const;
  void set_name(const ::std::string& value);
  float score() const;
  void set_score(float value);
  ...
};
You can see how the proto definition became a C++ class interface for each type. Their implementations are auto-generated too, so we can use them right away.
unique_ptr<SessionBundle> sessionBundle;
bundle_factory->CreateSessionBundle(pathToExportFiles, &sessionBundle);
return sessionBundle;
In the code, we use a SessionBundleFactory to create a SessionBundle configured to load the model exported at the path specified by pathToExportFiles. It returns a unique pointer to the created SessionBundle instance.
We now have to define the implementation of the service, ClassificationServiceImpl, which receives the SessionBundle as a parameter to be used for inference.
class ClassificationServiceImpl final : public ClassificationService::Service {
 private:
    unique_ptr<SessionBundle> sessionBundle;

 public:
    ClassificationServiceImpl(unique_ptr<SessionBundle> sessionBundle) :
        sessionBundle(move(sessionBundle)) {};

    Status classify(ServerContext* context, const ClassificationRequest* request,
                    ClassificationResponse* response) override {
        // Loading the model signature and building the input tensor from the
        // request are omitted in this excerpt.
        if (!signatureStatus.ok()) {
            return Status(StatusCode::INTERNAL, signatureStatus.error_message());
        }
        vector<tensorflow::Tensor> outputs;
        // Run inference.
        const tensorflow::Status inferenceStatus = sessionBundle->session->Run(
            {{signature.input().tensor_name(), input}},
            {signature.classes().tensor_name(), signature.scores().tensor_name()},
            {},
            &outputs);
        if (!inferenceStatus.ok()) {
            return Status(StatusCode::INTERNAL, inferenceStatus.error_message());
        }
        // Copying the outputs into the response message is also omitted here.
        return Status::OK;
    }
};
    ClassificationServiceImpl classificationServiceImpl(move(sessionBundle));
    ServerBuilder builder;
    builder.AddListeningPort(serverAddress, grpc::InsecureServerCredentials());
    builder.RegisterService(&classificationServiceImpl);
    unique_ptr<Server> server = builder.BuildAndStart();
    cout << "Server listening on " << serverAddress << endl;
    server->Wait();
    return 0;
}
To compile this code, we have to define a rule for it in our BUILD file:
cc_binary(
    name = "server",
    srcs = [
        "server.cc",
    ],
    deps = [
        ":classification_service_proto",
        "@tf_serving//tensorflow_serving/servables/tensorflow:session_bundle_factory",
        "@grpc//:grpc++",
    ],
)
With this code we can run the inference server from the container with bazel run :server
9999 /tmp/inception-v3/export/{timestamp}.
To call inference from our webapp server, we need the corresponding Python protocol
buffer client for the ClassificationService. To generate it we will need to run the protocol
buffer compiler for Python:
pip install grpcio cython grpcio-tools
python -m grpc.tools.protoc -I. --python_out=. --grpc_python_out=. classification_service.proto
It will generate the classification_service_pb2.py file that contains the stub for calling the
service.
On POST, the server will parse the submitted form and create a ClassificationRequest from it. Then it sets up a channel to the classification server and submits the request. Finally, it renders the classification response as HTML and sends it back to the user.
def do_POST(self):
    form = cgi.FieldStorage(
        fp=self.rfile,
        headers=self.headers,
        environ={
            'REQUEST_METHOD': 'POST',
            'CONTENT_TYPE': self.headers['Content-Type'],
        })
    request = classification_service_pb2.ClassificationRequest()
    request.input = form['file'].file.read()
    channel = implementations.insecure_channel("127.0.0.1", 9999)
    stub = classification_service_pb2.beta_create_ClassificationService_stub(channel)
    response = stub.classify(request, 10)  # 10 secs timeout
    self.respond_form("<div>Response: %s</div>" % response)
To run the server, we can execute python client.py from outside the container. Then we navigate with a browser to http://localhost:8080 to access its UI. Go ahead and upload an image to try inference on it.
Now, outside the container we have to commit its state into a new Docker image. That
basically means creating a snapshot of the changes in its virtual file system.
# outside the container
docker ps
# grab container id from above
docker commit <container id>
That's it. Now we can push the image to our favorite Docker serving cloud and start serving it.
Conclusion
In this chapter we learned how to adapt our models for serving, exporting them and building fast, lightweight servers that run them. We also learned how to create simple web apps for consuming them, giving us the full toolset for consuming TensorFlow models from other apps.
In the next chapter, we provide code snippets and explanations for some of the helper functions and classes used throughout this book.
Download function
We download several datasets throughout the book. In all cases, there is shared logic, and it makes sense to extract it into a function. First, we determine the filename from the URL if it is not specified. Then, we use the ensure_directory() function to make sure that the directory path of the download location exists.
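The definition of ensure_directory() appeared earlier in the book and is not part of this excerpt; as a minimal sketch of what it does:
import errno
import os

def ensure_directory(directory):
    # Create the directory, tolerating the case where it already exists.
    directory = os.path.expanduser(directory)
    try:
        os.makedirs(directory)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise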
import os
import shutil
from urllib.request import urlopen

def download(url, directory, filename=None):
    """
    Download a file and return its filename on the local file system. If the
    file is already there, it will not be downloaded again. The filename is
    derived from the url if not provided. Return the filepath.
    """
    if not filename:
        _, filename = os.path.split(url)
    directory = os.path.expanduser(directory)
    ensure_directory(directory)
    filepath = os.path.join(directory, filename)
    if os.path.isfile(filepath):
        return filepath
    print('Download', filepath)
    with urlopen(url) as response, open(filepath, 'wb') as file_:
        shutil.copyfileobj(response, file_)
    return filepath
Before starting the actual download, we check if there is already a file with the target name in the download location. If so, we skip the download, since we do not want to repeat large downloads unnecessarily. Finally, we download the file and return its path. In case you need to repeat a download, just delete the corresponding file from the file system.
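Disk caching decorator
The definition of the disk cache decorator itself is not included in this excerpt. As a minimal sketch, assuming the ensure_directory() helper from above and a deterministic cache key derived from the arguments, it could look like this:
import functools
import hashlib
import os
import pickle

def disk_cache(basename, directory, method=False):
    directory = os.path.expanduser(directory)
    ensure_directory(directory)

    def wrapper(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            # For methods, the object instance should not influence the key.
            relevant = args[1:] if method else args
            key = repr((relevant, tuple(sorted(kwargs.items()))))
            digest = hashlib.md5(key.encode()).hexdigest()
            filepath = os.path.join(
                directory, '{}-{}.pickle'.format(basename, digest))
            if os.path.isfile(filepath):
                with open(filepath, 'rb') as handle:
                    return pickle.load(handle)
            result = func(*args, **kwargs)
            with open(filepath, 'wb') as handle:
                pickle.dump(result, handle)
            return result
        return wrapped
    return wrapper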
Here is an example usage of the disk cache decorator to cache the data processing pipeline.
@disk_cache('dataset', '/home/user/dataset/')
def get_dataset(one_hot=True):
    dataset = Dataset('http://example.com/dataset.bz2')
    dataset = Tokenize(dataset)
    if one_hot:
        dataset = OneHotEncoding(dataset)
    return dataset
For methods, there is a method=False argument that tells the decorator whether to ignore the first argument. In methods and class methods, the first argument is the object instance self, which is different for every program run and thus shouldn't determine whether a cache is available. For static methods and functions outside of classes, this should be False.
Attribute Dictionary
This simple class just provides some convenience when working with configuration objects. While you could perfectly well store your configurations in Python dictionaries, it is a bit verbose to access their elements using the config['key'] syntax.
class AttrDict(dict):

    def __getattr__(self, key):
        if key not in self:
            raise AttributeError
        return self[key]

    def __setattr__(self, key, value):
        if key not in self:
            raise AttributeError
        self[key] = value
This class, inheriting from the built-in dict, allows accessing and changing existing elements using attribute syntax: config.key and config.key = value. You can create attribute dictionaries by either passing in a standard dictionary, passing entries as keyword arguments, or using **locals().
params = AttrDict({
    'key': value,
})

params = AttrDict(
    key=value,
)

def get_params():
    key = value
    return AttrDict(**locals())
The locals() built-in just returns a mapping from all local variable names in the scope to their values. While some people who are not that familiar with Python might argue that there is too much magic going on here, this technique also provides some benefits. Mainly, we can have configuration entries that rely on earlier entries:
def get_params():
    learning_rate = 0.003
    optimizer = tf.train.AdamOptimizer(learning_rate)
    return AttrDict(**locals())
This function returns an attribute dictionary containing both the learning_rate and the optimizer. This would not be possible within the declaration of a dictionary literal. As always, just find a way that works for you (and your colleagues) and use that.
Lazy property decorator
Using an instance of such a model from the outside creates a new computation path in the graph every time we access, for example, model.optimize. Moreover, accessing it internally calls model.prediction, creating new weights and biases each time. To address this design problem, we introduce the following @lazy_property decorator.
import functools

def lazy_property(function):
    attribute = '_lazy_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper
The idea is to define a property that is only evaluated once. The result is stored in a member named after the function with some prefix, for example _lazy_ here. Subsequent accesses to the property then return the existing node of the graph. We can now write the above model like this:
class Model:

    def __init__(self, data, target):
        self.data = data
        self.target = target
        self.prediction
        self.optimize
        self.error

    @lazy_property
    def prediction(self):
        data_size = int(self.data.get_shape()[1])
        target_size = int(self.target.get_shape()[1])
        weight = tf.Variable(tf.truncated_normal([data_size, target_size]))
        bias = tf.Variable(tf.constant(0.1, shape=[target_size]))
        incoming = tf.matmul(self.data, weight) + bias
        return tf.nn.softmax(incoming)

    @lazy_property
    def optimize(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        optimizer = tf.train.RMSPropOptimizer(0.03)
        return optimizer.minimize(cross_entropy)

    @lazy_property
    def error(self):
        mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))
Lazy properties are a nice tool to structure TensorFlow models and decompose them into classes. They are useful both for nodes that are needed from the outside and for breaking up internal parts of the computation.
Overwrite Graph Decorator
Even more conveniently, you can put the graph creation in a decorator like the one below and decorate your main function with it. That main function should define the whole graph, for example, defining the placeholders and calling another function to create the model.
import functools
import tensorflow as tf

def overwrite_graph(function):
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        with tf.Graph().as_default():
            return function(*args, **kwargs)
    return wrapper
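For example, a hypothetical main function could be decorated like this; the placeholder shapes and the Model class from above are just illustrative:
@overwrite_graph
def main():
    # The whole graph is defined inside a fresh tf.Graph.
    data = tf.placeholder(tf.float32, [None, 10])
    target = tf.placeholder(tf.float32, [None, 3])
    model = Model(data, target)
    # Create a session and train the model here.

main()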
This is the end of the chapter, but take a look at the next chapter to read our wrap-up of the book.
Chapter 9. Conclusion
You made it! Thank you for reading TensorFlow for Machine Intelligence. You should now have a firm understanding of the core mechanics and the API for building machine learning models in TensorFlow. If you weren't already knowledgeable about deep learning, we hope that you've gained more insight and comfort with some of the most common architectures in convolutional and recurrent neural networks. You've also seen how simple it can be to put a trained model into a production setting and start adding the power of TensorFlow to your own applications.
TensorFlow has the capability to change the way researchers and businesses approach machine learning. With the skills learned in this book, be confident in your ability to build, test, and implement existing models as well as your own newly designed experimental networks. Now that you are comfortable with the essentials, don't be afraid to play around with what's possible in TensorFlow. You now bring a new edge with you into any discussion about creating machine learning solutions.
Stay Updated
The absolute best way to keep up-to-date with the latest functionality and features of TensorFlow is the official TensorFlow Git repository on GitHub. By reading pull requests, issues, and release notes, you'll know ahead of time what will be included in upcoming releases, and you'll even get a sense of when new releases are planned.
https://github.com/tensorflow/tensorflow
Distributed TensorFlow
Although the basic concepts of running TensorFlow in a distributed setting are relatively simple, the details of setting up a cluster to efficiently train a TensorFlow model could fill a book of their own. The first place you should look to get started with distributed TensorFlow is the official how-to on the tensorflow.org website:
https://www.tensorflow.org/versions/master/how_tos/distributed/index.html
Note that we expect many new features to be released in the near future that will make distributed TensorFlow much simpler and more flexible, especially with regard to using cluster management software such as Kubernetes.