CNN Text Classification
In this post we will implement a model similar to Yoon Kim’s Convolutional Neural Networks
for Sentence Classification. The model presented in the paper achieves good classification
performance across a range of text classification tasks (like Sentiment Analysis) and has
since become a standard baseline for new text classification architectures.
I’m assuming that you are already familiar with the basics of Convolutional Neural
Networks applied to NLP. If not, I recommend first reading Understanding
Convolutional Neural Networks for NLP to get the necessary background.
I won’t go over the data pre-processing code in this post, but it is available on Github and
does the following:
1. Load positive and negative sentences from the raw data files.
2. Clean the text data using the same code as the original paper.
3. Pad each sentence to the maximum sentence length, which turns out to be 59. We append
special <PAD> tokens to all other sentences to make them 59 words. Padding sentences to
the same length is useful because it allows us to efficiently batch our data since each
example in a batch must be of the same length.
4. Build a vocabulary index and map each word to an integer between 0 and 18,765 (the
vocabulary size). Each sentence becomes a vector of integers.
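To make steps 3 and 4 concrete, here is a minimal sketch in plain Python. It is not the pre-processing code from the repository: the function names (pad_sentences, build_vocab, to_ids), the whitespace tokenization, and the frequency-based index order are all assumptions chosen for illustration.

```python
from collections import Counter

PAD = "<PAD>"

def pad_sentences(sentences, pad_token=PAD):
    """Right-pad every tokenized sentence to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

def build_vocab(sentences):
    """Map each distinct token to an integer index (assumed order: most frequent first)."""
    counts = Counter(tok for s in sentences for tok in s)
    # Break frequency ties alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ordered)}

def to_ids(sentences, vocab):
    """Turn each sentence into a vector of integers."""
    return [[vocab[tok] for tok in s] for s in sentences]

raw = ["a fine film", "dull", "not a fine film at all"]
tokens = [s.split() for s in raw]      # toy tokenization, not the paper's cleaning code
padded = pad_sentences(tokens)
vocab = build_vocab(padded)
ids = to_ids(padded, vocab)
```

After this runs, every row of ids has the same length, which is what lets us stack the examples into a batch.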
The Model
The network we will build in this post looks roughly as follows:
The first layer embeds words into low-dimensional vectors. The next layer performs
convolutions over the embedded word vectors using multiple filter sizes, for example
sliding over 3, 4 or 5 words at a time. Next, we max-pool the result of the convolutional layer
into a long feature vector, add dropout regularization, and classify the result using a softmax
layer.
Because this is an educational post I decided to simplify the model from the original paper a
little:
We will not use pre-trained word2vec vectors for our word embeddings. Instead, we
learn embeddings from scratch.
We will not enforce L2 norm constraints on the weight vectors. A Sensitivity Analysis of
(and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
found that the constraints had little effect on the end result.
The original paper experiments with two input data channels – static and non-static
word vectors. We use only one channel.
It is relatively straightforward (a few dozen lines of code) to add the above extensions to the
code here. Take a look at the exercises at the end of the post.
To allow various hyperparameter configurations we put our code into a TextCNN class,
generating the model graph in the __init__ function.
import tensorflow as tf
import numpy as np

class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size,
      embedding_size, filter_sizes, num_filters):
        # Implementation...
sequence_length – The length of our sentences. Remember that we padded all our
sentences to have the same length (59 for our data set).
num_classes – Number of classes in the output layer, two in our case (positive and
negative).
vocab_size – The size of our vocabulary. This is needed to define the size of our
embedding layer, which will have shape [vocab_size, embedding_size].
embedding_size – The dimensionality of our embeddings.
filter_sizes – The number of words we want our convolutional filters to cover. We
will have num_filters for each size specified here. For example, [3, 4, 5] means
that we will have filters that slide over 3, 4 and 5 words respectively, for a total of 3 *
num_filters filters.
num_filters – The number of filters per filter size (see above).
Input Placeholders
The probability of keeping a neuron in the dropout layer is also an input to the network
because we enable dropout only during training. We disable it when evaluating the model
(more on that later).
Embedding Layer
The first layer we define is the embedding layer, which maps vocabulary word indices into
low-dimensional vector representations. It’s essentially a lookup table that we learn from
data.
W is our embedding matrix that we learn during training. We initialize it using a random
uniform distribution. tf.nn.embedding_lookup creates the actual embedding operation. The
result of the embedding operation is a 3-dimensional tensor of shape [None,
sequence_length, embedding_size].
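To build intuition for the lookup and the resulting shape, here is a small sketch in plain Python rather than TensorFlow. The helper names and the uniform range of [-1, 1] are assumptions for illustration; the post only states that the matrix is initialized from a random uniform distribution.

```python
import random

def make_embedding_matrix(vocab_size, embedding_size, seed=0):
    """One row per vocabulary entry, drawn uniformly (range is an assumed choice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
            for _ in range(vocab_size)]

def embedding_lookup(W, batch_ids):
    """Replace every word index with the corresponding row of W."""
    return [[W[i] for i in sentence] for sentence in batch_ids]

W = make_embedding_matrix(vocab_size=10, embedding_size=4)
batch = [[1, 2, 0, 0], [3, 3, 5, 0]]          # 2 sentences, 4 word indices each
embedded = embedding_lookup(W, batch)

# [batch_size, sequence_length, embedding_size]
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Because the lookup just selects rows, two occurrences of the same word index always share the same vector, and during training the gradient only touches the rows that were actually looked up.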
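pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope("conv-maxpool-%s" % filter_size):
        # Convolution Layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            # Assumed name: the embedded input with a trailing channel dimension added
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="conv")
        # Apply nonlinearity
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
        # Max-pooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="pool")
        pooled_outputs.append(pooled)
# Combine all the pooled features into one flat vector per example.
# tf.concat takes the axis first here, as in the pre-1.0 API used throughout this post.
num_filters_total = num_filters * len(filter_sizes)
self.h_pool_flat = tf.reshape(tf.concat(3, pooled_outputs), [-1, num_filters_total])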
Here, W is our filter matrix and h is the result of applying the nonlinearity to the convolution
output. Each filter slides over the whole embedding, but varies in how many words it covers.
"VALID" padding means that we slide the filter over our sentence without padding the
edges, performing a narrow convolution that gives us an output of shape [batch_size,
sequence_length - filter_size + 1, 1, num_filters]. Performing max-pooling over the
output of a specific filter size leaves us with a tensor of shape [batch_size, 1, 1,
num_filters]. This is essentially a feature vector, where the last dimension corresponds to
our features. Once we have all the pooled output tensors from each filter size we combine
them into one long feature vector of shape [batch_size, num_filters_total]. Using -1
in tf.reshape tells TensorFlow to infer the size of that dimension from the remaining ones.
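If the shapes are hard to picture, the following plain-Python sketch walks through the same steps for a single sentence and a single filter per size. It is a toy stand-in for the TensorFlow ops, not the model code: the function names and the hand-picked filter weights are assumptions for illustration.

```python
def narrow_conv_relu(sentence, filt, bias):
    """Slide one filter over a sentence without padding ("VALID"), then apply ReLU.

    sentence: [sequence_length][embedding_size]
    filt:     [filter_size][embedding_size]
    Returns sequence_length - filter_size + 1 activations.
    """
    filter_size = len(filt)
    activations = []
    for start in range(len(sentence) - filter_size + 1):
        total = bias
        for k in range(filter_size):
            for d in range(len(filt[k])):
                total += sentence[start + k][d] * filt[k][d]
        activations.append(max(0.0, total))
    return activations

def max_over_time(activations):
    """Max-pooling over all positions leaves one number per filter."""
    return max(activations)

sequence_length, embedding_size = 7, 2
sentence = [[float(t), 1.0] for t in range(sequence_length)]
filter_sizes = [3, 4, 5]

conv_lengths, features = [], []
for fs in filter_sizes:
    # Toy filter: sums the first embedding dimension over a window of fs words.
    filt = [[1.0, 0.0] for _ in range(fs)]
    acts = narrow_conv_relu(sentence, filt, bias=0.0)
    conv_lengths.append(len(acts))
    features.append(max_over_time(acts))
```

With one filter per size, the concatenated features list has len(filter_sizes) entries; with num_filters filters per size it would have num_filters * len(filter_sizes), which is num_filters_total.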
Take some time and try to understand the output shapes for each of these operations. You
can also refer back to Understanding Convolutional Neural Networks for NLP to get some
intuition. Visualizing the operations in TensorBoard may help as well (for specific filter sizes
3, 4 and 5 here):
Dropout Layer
Dropout is perhaps the most popular method to regularize convolutional neural networks.
The idea behind dropout is simple. A dropout layer stochastically “disables” a fraction of its
neurons. This prevents neurons from co-adapting and forces them to learn individually useful
features. The fraction of neurons we keep enabled is defined by the dropout_keep_prob
input to our network. We set this to something like 0.5 during training, and to 1 (disable
dropout) during evaluation.
# Add dropout
with tf.name_scope("dropout"):
    self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)
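To see what the keep probability does, here is a rough plain-Python sketch of dropout. It scales the surviving values by 1 / keep_prob so the expected sum stays unchanged, which is also how tf.nn.dropout behaves; the function name and the toy input are assumptions for illustration.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale the survivors."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
features = [1.0] * 10000

# Training: roughly half the features are zeroed, the rest are doubled.
dropped = dropout(features, 0.5, rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / float(len(dropped))
mean_after = sum(dropped) / float(len(dropped))

# Evaluation: keep_prob = 1.0 leaves the features untouched.
unchanged = dropout(features, 1.0, rng)
```

The rescaling is why we can simply feed 1.0 at evaluation time without adjusting the weights.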
Using the feature vector from max-pooling (with dropout applied) we can generate
predictions by doing a matrix multiplication and picking the class with the highest score. We
could also apply a softmax function to convert raw scores into normalized probabilities, but
that wouldn’t change our final predictions.
with tf.name_scope("output"):
    W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
    self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
    self.predictions = tf.argmax(self.scores, 1, name="predictions")
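The claim that a softmax would not change the predictions follows from softmax being monotonic: a larger score always maps to a larger probability. A quick plain-Python check, with made-up scores for one example:

```python
import math

def softmax(scores):
    """Convert raw scores into normalized probabilities (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

scores = [1.2, -0.3]          # hypothetical raw scores for the two classes
probs = softmax(scores)

same_prediction = argmax(scores) == argmax(probs)
```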
Using our scores we can define the loss function. The loss is a measurement of the error our
network makes, and our goal is to minimize it. The standard loss function for categorization
problems is the cross-entropy loss.
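As a worked example of what the cross-entropy loss measures, the sketch below computes it by hand for three made-up examples with one-hot labels. Averaging the per-example losses over the batch is an assumption here (a common choice), and the function name is hypothetical.

```python
import math

def softmax_cross_entropy(scores, one_hot_label):
    """Negative log-probability that the softmax assigns to the true class."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    log_probs = [s - log_sum_exp for s in scores]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))

batch_scores = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
batch_labels = [[1, 0], [1, 0], [0, 1]]   # true class: 0, 0, 1

losses = [softmax_cross_entropy(s, y) for s, y in zip(batch_scores, batch_labels)]
mean_loss = sum(losses) / len(losses)
```

The first example is confidently right and gets a small loss, the second is confidently wrong and gets a large one, and the third (equal scores) gets log(2), the loss of a coin flip between two classes.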
# Calculate Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
That’s it, we’re done with our network definition. The full network definition code is
available here. To get the big picture we can also visualize the network in TensorBoard:
Training Procedure
Before we define the training procedure for our network we need to understand some
basics about how TensorFlow uses Sessions and Graphs. If you’re already familiar with
these concepts feel free to skip this section.
In TensorFlow, a Session is the environment you are executing graph operations in, and it
contains state about Variables and queues. Each session operates on a single graph. If you
don’t explicitly use a session when creating variables and operations you are using the
current default session created by TensorFlow. You can change the default session by
executing commands within a session.as_default() block (see below).
A Graph contains operations and tensors. You can use multiple graphs in your program, but
most programs only need a single graph. You can use the same graph in multiple sessions,
but not multiple graphs in one session. TensorFlow always creates a default graph, but you
may also create a graph manually and set it as the new default, like we do below. Explicitly
creating sessions and graphs ensures that resources are released properly when you no
longer need them.
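with tf.Graph().as_default():
    session_conf = tf.ConfigProto(
        # Flag names are assumed to mirror the setting names
        allow_soft_placement=FLAGS.allow_soft_placement,
        log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # Code that operates on the default graph and session comes here...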
The allow_soft_placement setting allows TensorFlow to fall back on a device with a certain
operation implemented when the preferred device doesn’t exist. For example, if our code
places an operation on a GPU and we run the code on a machine without GPU, not using
allow_soft_placement would result in an error. If log_device_placement is set,
TensorFlow logs which devices (CPU or GPU) it places operations on. That’s useful for
debugging. FLAGS are command-line arguments to our program.
When we instantiate our TextCNN model all the variables and operations defined will be
placed into the default graph and session we’ve created above.
cnn = TextCNN(
    # Other constructor arguments (sequence_length, num_classes, ...) go here.
    filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
)
Next, we define how to optimize our network’s loss function. TensorFlow has several built-in
optimizers. We’re using the Adam optimizer.
Here, train_op is a newly created operation that we can run to perform a gradient
update on our parameters. Each execution of train_op is a training step. TensorFlow
automatically figures out which variables are “trainable” and calculates their gradients. By
defining a global_step variable and passing it to the optimizer we let TensorFlow handle
the counting of training steps for us. The global step will be automatically incremented by
one every time you execute train_op.
TensorFlow has the concept of summaries, which allow you to keep track of and visualize
various quantities during training and evaluation. For example, you probably want to keep
track of how your loss and accuracy evolve over time. You can also keep track of more
complex quantities, such as histograms of layer activations. Summaries are serialized
objects, and they are written to disk using a SummaryWriter.
# Train Summaries
train_summary_op = tf.merge_summary([loss_summary, acc_summary])
train_summary_dir = os.path.join(out_dir, "summaries", "train")
train_summary_writer = tf.train.SummaryWriter(train_summary_dir, sess.graph_def)
# Dev summaries
dev_summary_op = tf.merge_summary([loss_summary, acc_summary])
dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, sess.graph_def)
Here, we are separately keeping track of summaries for training and evaluation. In our case
these are the same quantities, but you may have quantities that you want to track during
training only (like parameter update values). tf.merge_summary is a convenience function
that merges multiple summary operations into a single operation that we can execute.
Another TensorFlow feature you typically want to use is checkpointing – saving the
parameters of your model to restore them later on. Checkpoints can be used to continue
training at a later point, or to pick the best parameter settings using early stopping.
Checkpoints are created using a Saver object.
# Checkpointing
checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
checkpoint_prefix = os.path.join(checkpoint_dir, "model")
# Tensorflow assumes this directory already exists so we need to create it
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)
saver = tf.train.Saver(tf.all_variables())
Before we can train our model we also need to initialize the variables in our graph.
The initialize_all_variables function is a convenience function that runs all of the initializers we’ve
defined for our variables. You can also call the initializer of your variables manually. That’s
useful if you want to initialize your embeddings with pre-trained values for example.
Let’s now define a function for a single training step, evaluating the model on a batch of data
and updating the model parameters.
feed_dict contains the data for the placeholder nodes we pass to our network. You must
feed values for all placeholder nodes, or TensorFlow will throw an error. Another way to
work with input data is using queues, but that’s beyond the scope of this post.
Next, we execute our train_op using sess.run, which returns values for all the
operations we ask it to evaluate. Note that train_op returns nothing; it just updates the
parameters of our network. Finally, we print the loss and accuracy of the current training
batch and save the summaries to disk. Note that the loss and accuracy for a training batch
may vary significantly across batches if your batch size is small. And because we’re using
dropout your training metrics may start out being worse than your evaluation metrics.
We write a similar function to evaluate the loss and accuracy on an arbitrary data set, such as
a validation set or the whole training set. Essentially this function does the same as the
above, but without the training operation. It also disables dropout.
def dev_step(x_batch, y_batch, writer=None):
    """
    Evaluates model on a dev set
    """
    feed_dict = {
      cnn.input_x: x_batch,
      cnn.input_y: y_batch,
      cnn.dropout_keep_prob: 1.0
    }
    step, summaries, loss, accuracy = sess.run(
        [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
        feed_dict)
    # Requires `import datetime` at the top of the script
    time_str = datetime.datetime.now().isoformat()
    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
    if writer:
        writer.add_summary(summaries, step)
Training loop
Finally, we’re ready to write our training loop. We iterate over batches of our data, call the
train_step function for each batch, and occasionally evaluate and checkpoint our model:
# Generate batches
batches = data_helpers.batch_iter(
    zip(x_train, y_train), FLAGS.batch_size, FLAGS.num_epochs)
# Training loop. For each batch...
for batch in batches:
    x_batch, y_batch = zip(*batch)
    train_step(x_batch, y_batch)
    current_step = tf.train.global_step(sess, global_step)
    if current_step % FLAGS.evaluate_every == 0:
        dev_step(x_dev, y_dev, writer=dev_summary_writer)
    if current_step % FLAGS.checkpoint_every == 0:
        path = saver.save(sess, checkpoint_prefix, global_step=current_step)
        print("Saved model checkpoint to {}\n".format(path))
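The batch_iter helper used in the loop is not shown in this post. The sketch below is one plausible way such a generator could work, not the actual data_helpers code: the shuffle and seed parameters and the reshuffling once per epoch are assumptions.

```python
import random

def batch_iter(data, batch_size, num_epochs, shuffle=True, seed=None):
    """Yield mini-batches of (x, y) pairs, covering the whole data set once per epoch."""
    data = list(data)                      # also materializes a zip() iterator
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(len(data)))
        if shuffle:
            rng.shuffle(order)             # a new example order for every epoch
        for start in range(0, len(data), batch_size):
            yield [data[i] for i in order[start:start + batch_size]]

# Toy data: 5 examples, so each epoch yields batches of size 2, 2 and 1.
x_train = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y_train = [[0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]

batches = list(batch_iter(zip(x_train, y_train), batch_size=2, num_epochs=3, seed=0))
x_batch, y_batch = zip(*batches[0])        # same unpacking as in the training loop
```

Note that the last batch of each epoch may be smaller than batch_size, which is one reason the input placeholders leave the batch dimension unspecified.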
Running the training procedure with default parameters (128-dimensional embeddings, filter
sizes of 3, 4 and 5, dropout of 0.5 and 128 filters per filter size) results in the following loss
and accuracy plots (blue is training data, red is 10% dev data).
There are a couple of things that stand out:
Our training metrics are not smooth because we use small batch sizes. If we used larger
batches (or evaluated on the whole training set) we would get a smoother blue line.
Because dev accuracy is significantly below training accuracy it seems like our network
is overfitting the training data, suggesting that we need more data (the MR dataset is
very small), stronger regularization, or fewer model parameters. For example, I
experimented with adding additional L2 penalties for the weights at the last layer and
was able to bump up the accuracy to 76%, close to that reported in the original paper.
The training metrics start out significantly worse than the dev metrics because dropout
is applied only during training.
You can play around with the code and try running the model with various parameter
configurations. Code and instructions are available on Github.
Here are a couple of useful exercises that may improve the performance of the model:
Initialize the embeddings with pre-trained word2vec vectors. To make this work you
need to use 300-dimensional embeddings and initialize them with the pre-trained
values.
Constrain the L2 norm of the weight vectors in the last layer, just like the original paper.
You can do this by defining a new operation that updates the weight values after each
training step.
Add L2 regularization to the network to combat overfitting, and experiment with
increasing the dropout rate. (The code on Github already includes L2 regularization, but
it is disabled by default.)
Add histogram summaries for weight updates and layer activations and visualize them in
TensorBoard.
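As a starting point for the L2 norm exercise, here is the rescaling rule in plain Python. It only shows the arithmetic you would apply to each weight vector after a training step; the function name and the max_norm value are placeholders, and wiring this into a TensorFlow update operation is left as part of the exercise.

```python
import math

def clip_l2_norm(weights, max_norm):
    """Rescale a weight vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)               # already within the constraint
    scale = max_norm / norm
    return [w * scale for w in weights]

max_norm = 3.0                             # placeholder value for illustration
too_large = clip_l2_norm([3.0, 4.0], max_norm)     # norm 5 gets scaled down to 3
small = clip_l2_norm([0.3, 0.4], max_norm)         # norm 0.5 is left alone
clipped_norm = math.sqrt(sum(w * w for w in too_large))
```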
