Chap 6 Embedding Learning
Dr. Sanjay Chatterji
CS 831
Learning Lower-Dimensional Representation
• The convolutional architecture is motivated by a simple argument: the larger our input vector, the larger our model.
• Large models are expressive, but increasingly data hungry; without sufficiently large volumes of training data, they will likely overfit.
• The convolutional architecture helps us cope with the curse of dimensionality by reducing the number of parameters without necessarily diminishing expressiveness.
• Even so, convolutional networks still require large amounts of labeled training data.
• For many problems, labeled data is scarce and expensive to generate, while unlabeled data is plentiful.
• Our goal here is to develop effective learning models for such situations.
Embeddings
• Embeddings are low-dimensional representations learned in an unsupervised fashion.
• We use the generated embeddings to solve learning problems with smaller models.
• We use embeddings to automate feature selection in the case of scarce labeled data.
• The focus of this chapter is how to learn good embeddings.
• We’ll explore other applications of learning lower-dimensional
representations, such as visualization and semantic hashing.
• We’ll start by considering situations where all of the important information
is already contained within the original input vector itself.
• In this case, learning embeddings is equivalent to developing an effective
compression algorithm.
• We’ll introduce principal component analysis (PCA), a classic method for
dimensionality reduction.
• Then we’ll explore more powerful neural methods for learning
compressive embeddings.
Principal Component Analysis (PCA)
• A classic method for dimensionality reduction.
• If we have d-dimensional data, we’d like to find a new set of m < d dimensions that conserves as much valuable information from the original dataset as possible.
• Assuming that variance corresponds to information, we can perform this
transformation through an iterative process.
• First Axis: unit vector along which the dataset has maximum variance.
• Second Axis: From the set of vectors orthogonal to the first axis, we pick a new unit
vector along which the dataset has maximum variance.
• We continue this process until we have found a total of d new vectors that form the new set of axes.
• We project our data onto this new set of axes.
• Let’s choose d = 2, m = 1.
Mathematical Details
• We can view this operation as a projection onto the vector space spanned by the top m eigenvectors of the dataset’s covariance matrix.
• Let us represent the (zero-mean) dataset as a matrix X with dimensions n × d.
• We’d like to create an embedding matrix T with dimensions n × m.
• We compute T = XW, where W is a d × m matrix whose columns are the top m eigenvectors of the matrix XᵀX (a NumPy sketch appears at the end of this PCA discussion).
• PCA spectacularly fails, however, to capture important relationships that are piecewise linear or nonlinear.
• We might hope that PCA would transform two concentric circles of points onto a single new axis that allows us to easily separate the red and blue dots.
• But the information is encoded in a nonlinear way: the distinguishing feature is each point’s distance from the origin (a polar transformation), which no linear projection can recover.
• We need a theory for nonlinear dimensionality reduction.
• Deep learning practitioners have closed this gap with neural methods for learning compressive embeddings.
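A minimal NumPy sketch of the PCA projection described above; the function name and the random toy data are illustrative assumptions rather than part of the original slides.

import numpy as np

def pca(X, m):
    # Center the data so that X^T X is proportional to the covariance matrix.
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / len(X_centered)   # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)               # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]                    # indices by decreasing variance
    W = eigvecs[:, order[:m]]                            # d x m matrix of top-m eigenvectors
    return X_centered @ W                                # the n x m embedding matrix T = XW

X = np.random.randn(100, 2)   # toy dataset with n = 100, d = 2
T = pca(X, m=1)               # project onto the single axis of maximum variance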
Motivating the Autoencoder Architecture
• We have discussed how each layer of a feed-forward network learns progressively more relevant representations of the input.
• We took the output of the final convolutional layer and used that as a lower-dimensional representation of the input image.
• The autoencoder makes this explicit: an encoder compresses the input into a low-dimensional code, and a decoder attempts to reconstruct the original input from that code, so the code is a lower-dimensional representation of the input image.
• A denoising variant (shown in the code below) corrupts the input by randomly zeroing out components and trains the network to reconstruct the uncorrupted image.
import tensorflow as tf

def corrupt_input(x):
    # Randomly zero out roughly half of the input components.
    corrupting_matrix = tf.random_uniform(shape=tf.shape(x), minval=0,
                                          maxval=2, dtype=tf.int32)
    return x * tf.cast(corrupting_matrix, tf.float32)

x = tf.placeholder("float", [None, 784])  # MNIST data image of shape 28*28=784
corrupt = tf.placeholder(tf.float32)      # 1.0 to corrupt the batch, 0.0 to leave it intact
phase_train = tf.placeholder(tf.bool)
c_x = (corrupt_input(x) * corrupt) + (x * (1 - corrupt))
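Below is a minimal sketch of the autoencoder itself in the same TensorFlow 1.x style, continuing from the placeholders above; the layer sizes, code size, and names (encoder, decoder) are illustrative assumptions, not the exact architecture used in the course.

def encoder(x, code_size=30):
    h = tf.layers.dense(x, 256, activation=tf.nn.sigmoid, name="enc_hidden")
    return tf.layers.dense(h, code_size, activation=tf.nn.sigmoid, name="code")

def decoder(code):
    h = tf.layers.dense(code, 256, activation=tf.nn.sigmoid, name="dec_hidden")
    return tf.layers.dense(h, 784, activation=tf.nn.sigmoid, name="reconstruction")

code = encoder(c_x)      # embed the (possibly corrupted) input
output = decoder(code)   # attempt to reconstruct the clean input
loss = tf.losses.mean_squared_error(labels=x, predictions=output)
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)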
Interpretability
• Interpretability is a property of a machine learning model that measures
how easy it is to inspect and explain its process and/or output.
• Deep neural networks are typically more accurate, but their interpretability is very low due to nonlinearities and a massive number of parameters.
• Low interpretability hinders their adoption in valuable but risky applications.
• Example: when a model predicts that a patient has or does not have cancer, the doctor will likely want an explanation to confirm the model’s conclusion.
• We can address one aspect of interpretability by exploring the
characteristics of the output of an autoencoder.
• Autoencoder’s representations are dense, and we can see how the
representation changes as we make modifications to the input.
Sparsity in Autoencoders
• The autoencoder produces a dense representation.
• The representation of the original image is highly compressed, so there are not many dimensions to work with.
• As a result, the activations of a dense representation combine and overlay information from multiple features in ways that are difficult to interpret.
Ideal outcome
• With a dense code, as we add or remove components from the input, the output representation changes in unpredictable ways.
• It’s virtually impossible to interpret how and why the representation is generated the way it is.
• The ideal outcome is a 1-to-1 correspondence (or close to 1-to-1) between high-level features and individual components in the code.
• When we are able to achieve this, we get very close to the system described next.
• Part A of the figure shows how the representation changes as we add and remove
components
• Part B color-codes the correspondence between strokes and the components in the
code.
• In this setup, it’s quite clear how and why the representation changes: the representation is very clearly the sum of the individual strokes in the image.
Code Layer Capacity
• While this is the ideal outcome, we’ll have to think through what mechanisms
we can leverage to enable this interpretability in the representation.
• The issue here is clearly the bottlenecked capacity of the code layer.
• But increasing the capacity of the code layer alone is not sufficient.
• There is no mechanism to prevent each feature from affecting a large fraction of the components.
• If the features are complex, the capacity of the code layer may need to be larger than the dimensionality of the input.
• In the extreme, the code layer has so much capacity that the model can simply perform a “copy” operation, learning nothing useful about the structure of the input.
Sparsity in Autoencoders
• We want the autoencoder to use very few components of the representation vector, while still effectively reconstructing the input.
• This is similar to using regularization to prevent overfitting in neural networks.
• We achieve this by modifying the objective function with a sparsity penalty:
• E_sparse = E + β · SparsityPenalty
• β determines how strongly we favor sparsity.
• The sparsity penalty is a measure of divergence comparing the distribution of each component’s activations (treated as a random variable) with a random variable whose mean is 0.
• A measure that is often used to this end is the Kullback-Leibler (often referred to as KL) divergence (see the sketch after this list).
• k-Sparse autoencoders were shown to be just as effective as other mechanisms of sparsity, despite being simpler to implement and understand.
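A minimal sketch of adding a KL-divergence sparsity penalty to the reconstruction loss, continuing the TensorFlow 1.x autoencoder sketch from earlier (reusing its code and loss tensors). The target sparsity rho and the weight beta are illustrative assumptions, and the code-layer activations are assumed to lie in [0, 1] (e.g., sigmoid).

def kl_sparsity_penalty(code, rho=0.05):
    # Mean activation of each code component across the batch.
    rho_hat = tf.reduce_mean(code, axis=0)
    # KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), per component.
    kl = rho * tf.log(rho / (rho_hat + 1e-8)) \
         + (1 - rho) * tf.log((1 - rho) / (1 - rho_hat + 1e-8))
    return tf.reduce_sum(kl)

beta = 0.5                                             # how strongly we favor sparsity
sparse_loss = loss + beta * kl_sparsity_penalty(code)  # E_sparse = E + beta * SparsityPenalty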
Concluding Autoencoders
• Further discussion on sparsity in autoencoders is covered by Ranzato et al.
• More recently, the theoretical properties and empirical effectiveness of
introducing an intermediate function before the code layer that zeroes out all
but k of the maximum activations in the representation were investigated by
Makhzani and Frey.
• We’ve explored how we can use autoencoders to find strong representations
of data points by summarizing their content.
• This mechanism of dimensionality reduction works well when the
independent data points are rich and contain all of the relevant information
pertaining to their structure in their original representation.
• In the next section, we’ll explore strategies that we can use when the main
source of information is in the context of the data point instead of the data
point itself.
When Context Is More Informative than the Input Vector
• In dimensionality reduction, we generally have rich inputs which contain lots
of noise on top of the core, structural information that we care about.
• We want to extract this underlying information.
• We have to ignore the variations and noise that are extraneous to this
fundamental understanding of the data.
• Sometimes we have input representations that say little about the content.
• Here, our goal is not to extract information, but rather, to gather information
from context to build useful representations.
• This may sound too abstract to be useful at this point, so let’s look at an example.
• Example: Building models for language by finding a good way to represent individual
words.
Generating one-hot vector representations for words using a simple document
• The document has a vocabulary V with |V| words.
• We have |V| -dimensional representation vectors
• We associate each unique word with an index in this vector.
• To represent the unique word wi, we set the ith component of the vector to 1 and zero out all of the other components (a sketch follows below).
• This vectorization does not map similar words to similar vectors.
• Do “jump” and “leap” have similar meanings? Is a word a verb, a noun, or a preposition?
• Naive one-hot encoding of words doesn’t capture any of this information.
• We’ll need to discover these relationships and encode this information into a vector.
• One way is by analyzing each word’s surrounding context.
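A minimal sketch of the one-hot construction described above; the toy vocabulary is an illustrative assumption.

import numpy as np

vocabulary = ["the", "boy", "went", "to", "bank"]          # toy vocabulary V
word_to_index = {w: i for i, w in enumerate(vocabulary)}   # each word gets an index

def one_hot(word):
    v = np.zeros(len(vocabulary), dtype=np.float32)   # |V|-dimensional vector of zeros
    v[word_to_index[word]] = 1.0                      # set the i-th component to 1
    return v

print(one_hot("boy"))   # [0. 1. 0. 0. 0.]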
Example
• Both words generally appear when
a subject is performing the action
over a direct object.
• We can draw conclusions about
what the words “jumps” and
“leaps” mean just by looking at the
words around them.
• The words “jumps” and “leaps”
should have similar vector
representations because they are
virtually interchangeable.
• We can use the same principles we used when building the autoencoder:
A. Pass the target word through an encoder network to create an embedding; the decoder then attempts to predict a word from the target’s context.
B. Do the reverse: the encoder takes a word from the context as input and produces an embedding, from which the decoder attempts to predict the target.
Word2Vec Framework
• A framework for generating word embeddings by Mikolov et al.
• Two strategies for generating embeddings similar to the two strategies for
encoding context
• Continuous Bag of Words (CBOW) model (much like strategy B for encoding context)
• The encoder creates an embedding from the context and predicts the target word.
• Useful for smaller datasets
• Skip-Gram model (inverse of CBOW)
• It takes the target word as input and attempts to predict one of the words in the context.
• Let’s take a toy example to explore what the dataset for a Skip-Gram model looks like (a sketch of this pair construction follows the list):
• Consider the sentence: “the boy went to the bank.”
• Break the sentence into (context, target) pairs => [([the, went], boy), ([boy, to], went), ([went, the], to), ([to, bank], the)]
• Split each (context, target) pair into (input, output) pairs, where the input is the target and the output is one of the words from the context => (boy, the) and (boy, went), (went, boy) and (went, to), (to, went) and (to, the), etc.
• Finally, replace each word with its unique index i ∈ {0, 1, . . . , |V|−1} in the vocabulary.
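A minimal sketch of building the Skip-Gram (input, output) pairs from the toy sentence, using a context window of one word on each side; edge words without a full two-word context are skipped, matching the list above.

sentence = "the boy went to the bank".split()
vocabulary = sorted(set(sentence))
word_to_index = {w: i for i, w in enumerate(vocabulary)}

pairs = []
for pos in range(1, len(sentence) - 1):          # skip edge words, as in the slide
    target = sentence[pos]
    for context_word in (sentence[pos - 1], sentence[pos + 1]):
        # input is the target word's index, output is a context word's index
        pairs.append((word_to_index[target], word_to_index[context_word]))

print(pairs)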
Word2Vec Framework Encoder
• The structure of the encoder is surprisingly simple.
• It is a lookup table with |V| rows, where the ith row is the embedding corresponding to the ith vocabulary word.
• The encoder takes the index of the input word and outputs the appropriate row of the lookup table.
• This is an efficient operation.
• The operation can be represented as a product of the transpose of the lookup table and the one-hot vector representing the input word.
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True)
params is the embedding matrix, and ids is a tensor of indices we want to look up
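A minimal usage sketch; the variable names, vocabulary_size, and embedding_size below are illustrative assumptions.

import tensorflow as tf

vocabulary_size, embedding_size = 10000, 128              # illustrative sizes
embedding_matrix = tf.get_variable(
    "embeddings", shape=[vocabulary_size, embedding_size],
    initializer=tf.random_uniform_initializer(-1.0, 1.0))
input_word_ids = tf.placeholder(tf.int32, shape=[None])   # indices of input words
input_embeddings = tf.nn.embedding_lookup(embedding_matrix, input_word_ids)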
Word2Vec Framework Decoder
• The decoder is slightly trickier.
• This is because we make some modifications for performance.
• The naive way to construct the decoder would be to attempt to reconstruct
the one-hot encoding vector for the output, which we could implement
with a run-of-the-mill feed-forward layer coupled with a softmax.
• The only concern is that it’s inefficient.
• That is because we have to produce a probability distribution over the
whole vocabulary space.
• To avoid computing a full probability distribution over the vocabulary at every training step, Mikolov et al. used a strategy for implementing the decoder known as noise-contrastive estimation (NCE).
Noise-Contrastive Estimation (NCE)
• NCE uses the lookup table to find the embedding for the output word, as well as embeddings for randomly sampled vocabulary words that are not in the context of the input.
• It then employs binary logistic regression, taking the input embedding together with the embedding of the output (or of a random selection), and outputting a value between 0 and 1 corresponding to the probability that the comparison embedding represents a vocabulary word present in the input’s context.
• It then takes the sum of the probabilities corresponding to the non-context comparisons minus the probability corresponding to the context comparison.
• This value is the objective function that we want to minimize.
• In the optimal scenario, where the model has perfect performance, the value will be −1.
Implementing NCE in TensorFlow
tf.nn.nce_loss(weights, biases, inputs, labels, num_sampled, num_classes,
num_true=1, sampled_values=None, remove_accidental_hits=False,
partition_strategy= 'mod', name='nce_loss')
• weights should have the same dimensions as the embedding matrix.
• biases should be a tensor with size equal to the vocabulary size.
• inputs are the results from the embedding lookup.
• num_sampled is the number of negative samples used to compute the NCE loss.
• num_classes is the vocabulary size (see the usage sketch below).
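A minimal usage sketch, continuing the embedding-lookup sketch above; the names and the number of negative samples are illustrative assumptions. Keyword arguments are used so the call does not depend on the positional argument order, which changed across TensorFlow 1.x releases.

nce_weights = tf.get_variable(
    "nce_weights", shape=[vocabulary_size, embedding_size])   # same shape as the embedding matrix
nce_biases = tf.get_variable(
    "nce_biases", shape=[vocabulary_size],
    initializer=tf.zeros_initializer())                       # one bias per vocabulary word
context_ids = tf.placeholder(tf.int64, shape=[None, 1])       # indices of context (output) words

nce = tf.reduce_mean(tf.nn.nce_loss(
    weights=nce_weights, biases=nce_biases,
    labels=context_ids, inputs=input_embeddings,
    num_sampled=64, num_classes=vocabulary_size))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(nce)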
Word2Vec is not a deep machine learning model
• It thematically represents a strategy
(finding embeddings using context)
that generalizes to many deep
learning models.
• When we study sequence analysis, we’ll see this strategy employed for generating vectors that embed sentences.
• Using Word2Vec embeddings instead
of one-hot vectors to represent
words will yield far superior results.
• Out of Syllabus: Implementing the
Skip-Gram Architecture
Thank You