3.3 Distributed Execution: Relu Dadd Drelu
This generally means that temporary outputs are consumed soon after being constructed, so their memory can be reused quickly. When the heuristic is ineffective, the user can change the order of graph construction, or add control dependencies as described in Section 5. When gradient nodes are automatically added to the graph, the user has less control, and the heuristics may break down. In particular, because gradients reverse the forward computation order, tensors that are used early in a graph’s execution are frequently needed again near the end of a gradient computation. Such tensors can hold on to a lot of scarce GPU memory and unnecessarily limit the size of computations. We are actively working on improvements to memory management to deal better with such cases. Options include using more sophisticated heuristics to determine the order of graph execution, recomputing tensors instead of retaining them in memory, and swapping out long-lived tensors from GPU memory to more plentiful host CPU memory.

4.2 Partial Execution

Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.

Each node in the graph has a name, and each output of a node is identified by the source node name and the output port from the node, numbered from 0 (e.g., “bar:0” refers to the 1st output of the “bar” node, while “bar:1” refers to the 2nd output).

Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensor values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that the particular output tensor value for the node should be returned to the client if the Run call completes successfully.

The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete. Finally, once the graph has been rewritten with the insertion of these special feed and fetch nodes, the set of nodes to execute can be determined by starting at each of the nodes named by any output and working backwards in the graph using the graph dependencies to determine the full set of nodes that must be executed in the rewritten graph in order to compute the outputs. Figure 6 shows an original graph on the left, and the transformed graph that results when Run is invoked with inputs={b} and outputs={f:0}. Since we only need to compute the output of node f, we will not execute nodes d and e, since they have no contribution to the output of f.

Figure 6: Before and after graph transformation for partial execution. [Both panels show nodes a, b, c, d, e, f; feed and fetch nodes are labeled in the figure.]

4.3 Device Constraints

TensorFlow clients can control the placement of nodes on devices by providing partial constraints for a node about which devices it can execute on. For example, “only place this node on a device of type GPU”, or “this node can be placed on any device in /job:worker/task:17”, or “Colocate this node with the node named variable13”. Within the confines of these constraints, the placement algorithm is responsible for choosing an assignment of nodes to devices that provides fast execution of the computation and also satisfies various constraints imposed by the devices themselves, such as limiting the total amount of memory needed on a device in order to execute its subset of graph nodes.

Supporting such constraints requires changes to the placement algorithm described in Section 3.2.1. We first compute the feasible set of devices for each node, and then use union-find on the graph of colocation constraints to compute the graph components that must be placed together. For each such component, we compute the intersection of the feasible device sets. The computed feasible device set per node fits easily into the placement algorithm’s simulator.
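The constraint handling of Section 4.3 (per-node feasible sets, union-find over colocation constraints, then one intersection per component) can be sketched as follows. This is a minimal illustration under stated assumptions, not the TensorFlow placer: the function name `feasible_after_colocation`, the node names, and the device strings are all hypothetical, and every node named in a colocation pair is assumed to appear in the feasible-set map.

```python
from typing import Dict, Iterable, Set, Tuple


def feasible_after_colocation(
    feasible: Dict[str, Set[str]],
    colocate: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Intersect feasible device sets within each colocation component."""
    parent = {node: node for node in feasible}

    def find(node: str) -> str:
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the components joined by each colocation constraint.
    for a, b in colocate:
        parent[find(a)] = find(b)

    # One intersection of feasible sets per component.
    per_component: Dict[str, Set[str]] = {}
    for node, devices in feasible.items():
        root = find(node)
        if root in per_component:
            per_component[root] &= devices
        else:
            per_component[root] = set(devices)

    return {node: set(per_component[find(node)]) for node in feasible}


# Hypothetical example: "update" must be colocated with "variable13".
feasible = {
    "variable13": {"/cpu:0", "/gpu:0"},
    "update": {"/gpu:0", "/gpu:1"},
    "loss": {"/cpu:0", "/gpu:1"},
}
result = feasible_after_colocation(feasible, [("update", "variable13")])
```

In this example the two colocated nodes end up with the single shared device, while the unconstrained node keeps its original set. An empty intersection would indicate that the constraints cannot be satisfied together.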
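Returning to the partial-execution rewrite of Section 4.2, the backward traversal that selects the nodes to run can be sketched as below. This is a simplified illustration, not the actual Run implementation: ports are ignored, the function name `nodes_to_run` is hypothetical, and the edge list is an assumed six-node graph that only borrows the node names of Figure 6 (its exact edges are not recoverable from the text).

```python
from typing import Dict, List, Set


def nodes_to_run(
    deps: Dict[str, List[str]],
    inputs: Set[str],
    outputs: List[str],
) -> Set[str]:
    """Walk backwards from the outputs, stopping at fed nodes."""
    needed: Set[str] = set()
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        if node in inputs:
            # A fed node acts as a feed: its own dependencies are not needed.
            continue
        stack.extend(deps.get(node, []))
    return needed


# Assumed edges: each key depends on the nodes listed for it.
deps = {
    "c": ["a", "b"],
    "d": ["b"],
    "e": ["c", "d"],
    "f": ["c"],
}
selected = nodes_to_run(deps, inputs={"b"}, outputs=["f"])
```

With these assumed edges, fetching f while feeding b selects a, b, c, and f, and leaves d and e unexecuted, which matches the behavior the text describes for Figure 6.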
4.4 Control Flow

Although dataflow graphs without any explicit control flow are quite expressive, we have observed a number of cases where supporting conditionals and loops can lead to more concise and efficient representations of machine learning algorithms.

Much as in the dataflow-machine approach described by Arvind [3], we introduce a small set of primitive control flow operators into TensorFlow and generalize TensorFlow to handle cyclic dataflow graphs. The Switch and Merge operators allow us to skip the execution of an entire subgraph based on the value of a boolean tensor. The Enter, Leave, and NextIteration operators allow us to express iteration. High-level programming constructs such as if-conditionals and while-loops can be easily compiled into dataflow graphs with these control flow operators.

The TensorFlow runtime implements a notion of tags and frames conceptually similar to the MIT Tagged-Token machine [4]. Each iteration of a loop is uniquely identified by a tag, and its execution state is represented by a frame. An input can enter an iteration whenever it becomes available; thus, multiple iterations can be executed concurrently.

TensorFlow uses a distributed coordination mechanism to execute graphs with control flow. In general, a loop can contain nodes that are assigned to many different devices. Therefore, managing the state of a loop becomes a problem of distributed termination detection. TensorFlow’s solution is based on graph rewriting. During the graph partitioning, we automatically add control nodes to each partition. These nodes implement a small state machine that orchestrates the start and termination of each iteration, and decides the termination of the loop. For each iteration, the device that owns the loop termination predicate sends a tiny control message to every participating device.

As explained above, we often train machine learning models by gradient descent, and represent gradient computations as part of dataflow graphs. When a model includes control-flow operations, we must account for them in the corresponding gradient computation. For example, the gradient computation for a model with an if-conditional will need to know which branch of the conditional was taken, then apply the gradient logic to this branch. Similarly, the gradient computation for a model with a while-loop will need to know how many iterations were taken, and will also rely on the intermediate values computed during those iterations. The basic technique is to rewrite the graph so as to memorize the values needed for the gradient computation. We omit the somewhat intricate details of this encoding.

4.5 Input Operations

Although input data can be provided to a computation via feed nodes, another common mechanism used for training large-scale machine learning models is to have special input operation nodes in the graph, which are typically configured with a set of filenames and which yield a tensor containing one or more examples from the data stored in that set of files each time they are executed. This allows data to be read directly from the underlying storage system into the memory of the machine that will perform subsequent processing on the data. In configurations where the client process is separate from the worker process, if the data were fed, it typically would require an extra network hop (from the storage system to the client and then from the client to the worker vs. directly from the storage system to the worker when using an input node).

4.6 Queues

Queues are a useful feature that we have added to TensorFlow. They allow different portions of the graph to execute asynchronously, possibly at different cadences, and to hand off data through Enqueue and Dequeue operations. Enqueue operations can block until space becomes available in the queue, and Dequeue operations can block until a desired minimum number of elements are available in the queue. One use of queues is to allow input data to be prefetched from disk files while a previous batch of data is still being processed by the computational portion of a machine learning model. They can also be used for other kinds of grouping, including accumulating many gradients in order to compute some more complex combination of gradients over a larger batch, or to group different input sentences for recurrent language models into bins of sentences that are approximately the same length, which can then be processed more efficiently.

In addition to normal FIFO queues, we have also implemented a shuffling queue, which randomly shuffles its elements within a large in-memory buffer. This shuffling functionality is useful, for example, for machine learning algorithms that want to randomize the order in which they process examples.

4.7 Containers

A Container is the mechanism within TensorFlow for managing longer-lived mutable state. The backing store for a Variable lives in a container. The default container is one that persists until the process terminates, but we also allow other named containers. A container
can be reset by clearing it of its contents entirely. Using containers, it is possible to share state even across completely disjoint computation graphs associated with different Sessions.

5 Optimizations

In this section, we describe some of the optimizations in the TensorFlow implementation that improve performance or resource usage of the system.

5.1 Common Subexpression Elimination

Since the construction of computation graphs is often done by many different layers of abstractions in the client code, computation graphs can easily end up with redundant copies of the same computation. To handle this, we have implemented a common subexpression pass similar to the algorithm described by Click [12] that runs over the computation graph and canonicalizes multiple copies of operations with identical inputs and operation types to just a single one of these nodes, and redirects graph edges appropriately to reflect this canonicalization.

5.2 Controlling Data Communication and Memory Usage

Careful scheduling of TensorFlow operations can result in better performance of the system, in particular with respect to data transfers and memory usage. Specifically, scheduling can reduce the time window during which intermediate results need to be kept in memory in between operations and hence the peak memory consumption. This reduction is particularly important for GPU devices where memory is scarce. Furthermore, orchestrating the communication of data across devices can reduce contention for network resources.

While there are many opportunities for scheduling optimizations, here we focus on one that we found particularly necessary and effective. It concerns the scheduling of Receive nodes for reading remote values. If no precautions are taken, these nodes may start much earlier than necessary, possibly all at once when execution starts. By performing an as-soon-as-possible/as-late-as-possible (ASAP/ALAP) calculation, of the kind common in operations research, we analyze the critical paths of graphs, in order to estimate when to start the Receive nodes. We then insert control edges with the aim of delaying the start of these nodes until just before their results are needed.

5.3 Asynchronous Kernels

In addition to normal synchronous kernels that complete their execution at the end of the Compute method, our framework also supports non-blocking kernels. Such non-blocking kernels use a slightly different interface whereby the Compute method is passed a continuation that should be invoked when the kernel’s execution is complete. This is an optimization for environments where having many active threads is relatively expensive in terms of memory usage or other resources, and allows us to avoid tying up an execution thread for unbounded periods of time while waiting for I/O or other events to occur. Examples of asynchronous kernels include the Receive kernel, and the Enqueue and Dequeue kernels (which might need to block if queue space is not available or if no data is available to be read, respectively).

5.4 Optimized Libraries for Kernel Implementations

We often make use of pre-existing highly-optimized numerical libraries to implement kernels for some operations. For example, there are a number of optimized libraries for performing matrix multiplies on different devices, including BLAS [15] and cuBLAS [39], or GPU libraries for convolutional kernels for deep neural nets such as cuda-convnet [28] and cuDNN [9]. Many of our kernel implementations are relatively thin wrappers around such optimized libraries.

We make fairly extensive use of the open-source Eigen linear algebra library [25] for many of the kernel implementations in the system. As one part of the development of TensorFlow, our team (primarily Benoit Steiner) has extended the open source Eigen library with support for arbitrary dimensionality tensor operations.

5.5 Lossy Compression

Some machine learning algorithms, including those typically used for training neural networks, are tolerant of noise and reduced precision arithmetic. In a manner similar to the DistBelief system [14], we often use lossy compression of higher precision internal representations when sending data between devices (sometimes within the same machine but especially across machine boundaries). For example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation (not the proposed IEEE 16-bit floating point standard, but rather just a 32-bit IEEE 754 float format, but with 16 bits less precision in the mantissa), and then convert back to a 32-bit representation on the other side of the communication channel (by just filling in zeroes for the lost portion
of the mantissa, since that’s less computationally expensive than doing the mathematically correct probabilistic rounding when doing this 32 → 16 → 32-bit conversion).

6 Status and Experience

The TensorFlow interface and a reference implementation have been open sourced under an Apache 2.0 license, and the system is available for download at www.tensorflow.org. The system includes detailed documentation, a number of tutorials, and a number of examples demonstrating how to use the system for a variety of different machine learning tasks. The examples include models for classifying hand-written digits from the MNIST dataset (the “hello world” of machine learning algorithms) [32], classifying images from the CIFAR-10 dataset [30], doing language modeling using a recurrent LSTM [22] network, training word embedding vectors [35] and more.

The system includes front-ends for specifying TensorFlow computations in Python and C++, and we expect other front-ends to be added over time in response to the desires of both internal Google users and the broader open-source community.

We have quite a few machine learning models in our previous DistBelief system [14] that we have migrated over to TensorFlow. The rest of this section discusses some lessons we have learned that are generalizable for any such migration of machine learning models from one system to another, and therefore may be valuable to others.

In particular, we focus on our lessons from porting a state-of-the-art convolutional neural network for image recognition termed Inception [23]. This image recognition system classifies 224 × 224 pixel images into one of 1000 labels (e.g., “cheetah”, “garbage truck”, etc.). Such a model comprises 13.6 million learnable parameters and 36,000 operations when expressed as a TensorFlow graph. Running inference on a single image requires 2 billion multiply-add operations.

After building all necessary mathematical operations in TensorFlow, assembling and debugging all 36,000 operations into the correct graph structure proved challenging. Validating correctness is a difficult enterprise because the system is inherently stochastic and only intended to behave in a certain way in expectation, potentially after hours of computation. Given these circumstances, we found the following strategies critical for porting the Inception model to TensorFlow:

1. Build tools to gain insight into the exact number of parameters in a given model. Such tools demonstrated subtle flaws in a complex network architecture specification. In particular we were able to identify operations and variables instantiated incorrectly due to automatic broadcasting in a mathematical operation across a dimension.

2. Start small and scale up. The first convolutional neural network that we ported from our previous system was a small network employed on the CIFAR-10 data set [30]. Debugging such a network elucidated subtle edge cases in individual operations (e.g., max-pooling) within the machine learning system that would have been practically indecipherable in more complex models.

3. Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off. Setting the learning rate to be zero helped us identify unexpected behavior in how we had randomly initialized variables in a model. Such an error would have been difficult to identify in a dynamic, training network.

4. Make a single machine implementation match before debugging a distributed implementation. This strategy helped us delineate and debug discrepancies in training performance between machine learning systems. In particular, we identified bugs due to race conditions and non-atomic operations incorrectly assumed to be atomic.

5. Guard against numerical errors. Numerical libraries are inconsistent in how they handle non-finite floating point values. Convolutional neural networks are particularly susceptible to numerical instability and will tend to diverge quite regularly during experimentation and debugging phases. Guarding against this behavior by checking for non-finite floating point values allows one to detect errors in real time as opposed to identifying divergent behavior post-hoc.

6. Analyze pieces of a network and understand the magnitude of numerical error. Running subsections of a neural network in parallel on two machine learning systems provides a precise method to ensure that a numerical algorithm is identical across two systems. Given that such algorithms run with floating point precision, it is important to predict and understand the magnitude of expected numerical error in order to judge whether a given component is correctly implemented (e.g., distinguishing between “within 1e-2, great!” and “within 1e-2: why is it so incorrect?!”).
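Strategies 5 and 6 above lend themselves to small checking utilities. The sketch below is one possible form, not the tooling we used: the helper names `first_non_finite` and `max_relative_error`, the sample values, and the choice of a relative (rather than absolute) error measure are all assumptions made for illustration.

```python
import math
from typing import Sequence


def first_non_finite(values: Sequence[float]) -> int:
    """Return the index of the first NaN or infinity, or -1 if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return -1


def max_relative_error(
    reference: Sequence[float], candidate: Sequence[float]
) -> float:
    """Largest element-wise relative difference between two outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    worst = 0.0
    for ref, cand in zip(reference, candidate):
        # The small floor avoids dividing by zero when both values are 0.
        scale = max(abs(ref), abs(cand), 1e-12)
        worst = max(worst, abs(ref - cand) / scale)
    return worst


# Hypothetical outputs of the same sub-network on two systems.
old_system = [0.5, 1.0, 2.0]
new_system = [0.5, 1.0 + 1e-7, 2.0]
bad_index = first_non_finite(new_system)
error = max_relative_error(old_system, new_system)
```

Whether a given error is acceptable depends on the expected precision of the operations involved, so the threshold against which `error` is compared should be chosen per component rather than fixed globally.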
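The 32 → 16 → 32-bit conversion described in Section 5.5 can be illustrated at the bit level. The sketch below assumes the 16-bit form is obtained by simply dropping the low 16 bits of the IEEE 754 single-precision encoding (truncation); the text only states that zeroes are filled in when converting back, so the truncating sender is an assumption, as are the function names `truncate_to_16` and `expand_to_32`.

```python
import struct


def truncate_to_16(x: float) -> int:
    """Keep only the upper 16 bits of the float32 encoding of x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def expand_to_32(upper: int) -> float:
    """Rebuild a float32 by filling the lost mantissa bits with zeroes."""
    (x,) = struct.unpack("<f", struct.pack("<I", (upper & 0xFFFF) << 16))
    return x


sent = truncate_to_16(3.14159)      # 16-bit payload sent over the channel
received = expand_to_32(sent)       # value seen by the receiving device
relative_error = abs(received - 3.14159) / 3.14159
```

The sign bit and the full 8-bit exponent survive the round trip, so the dynamic range of the 32-bit format is preserved; only 7 explicit mantissa bits remain, which bounds the relative error of a truncated normal value below 2^-7.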