
Deep Learning for Natural Language Processing

Lecture 2: Deep Neural Networks and Word Embedding

Quan Thanh Tho


Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology
Acknowledgement

• Some slides are adapted from Prof. Andrew Ng’s Coursera course.


Agenda

• Intuition of Neural Network for Classification Task


• Neural Network and Deep NN for Text Classification
• Specific DNN Architecture: CNN and AutoEncoder
• From AutoEncoder to Word Embedding
• Doc2vec: combining tf.idf and WE
Classical Machine Learning Techniques for NLP

• Machine Learning Task


• The tf.idf weights
• Neural-based approach
Word Embedding
To compare pieces of text

• We need effective representations of:


– Words
– Sentences
– Text
• Approach 1: Use existing thesauri or ontologies like WordNet and SNOMED CT (for the medical domain). Drawbacks:
– Manual
– Not context specific
• Approach 2: Use co-occurrences for word similarity. Drawbacks:
– Quadratic space needed
– Relative position and order of words not considered

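Before moving on to Approach 3, here is a minimal sketch of Approach 2 (not from the slides): a word–word co-occurrence matrix over a toy two-sentence corpus, with words compared by cosine similarity. The corpus, window size and variable names are illustrative assumptions.

```python
# Sketch of Approach 2: co-occurrence counts + cosine similarity (toy corpus).
import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "floor"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within the window (note the quadratic space in |V|).
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

print(cosine(X[idx["cat"]], X[idx["dog"]]))  # similarity of two count rows
```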
Approach 3: low-dimensional vectors
• Store only the “important” information in a fixed, low-dimensional vector.
• Singular Value Decomposition (SVD) on the co-occurrence matrix X
– The truncated SVD X_k = U_k Σ_k V_kᵀ is the best rank-k approximation to X, in terms of least squares
– Motel = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]
• m = n = size of vocabulary
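A minimal sketch of Approach 3, using a toy symmetric matrix as a stand-in for real co-occurrence counts (in practice X would be built from a corpus as in the previous sketch, and k would be much smaller than the vocabulary size).

```python
# Sketch: truncated SVD of a (toy) co-occurrence matrix.
import numpy as np

rng = np.random.default_rng(0)
V = 6                                         # toy vocabulary size
X = rng.poisson(1.0, size=(V, V)).astype(float)
X = X + X.T                                   # symmetric, like co-occurrence counts

k = 2                                         # target dimensionality (toy value)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation (least squares)
word_vectors = U[:, :k] * S[:k]               # one k-dimensional vector per word
print(word_vectors.shape)                     # (6, 2)
```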
Problems with SVD

• Computational cost scales quadratically for an n × m matrix: O(mn²) flops (when n < m)
• Hard to incorporate new words or documents
• Does not consider order of words
word2vec: an approach to represent the meaning of words
• Represent each word with a low-dimensional vector
• Word similarity = vector similarity
• Key idea: Predict surrounding words of every word
• Faster and can easily incorporate a new sentence/document or add a
word to the vocabulary

Word2Vec

Auto-Encoder

Stacked Auto-Encoder
Represent the meaning of words – word2vec

• Two basic neural network models (a training sketch follows below):

– Continuous Bag-of-Words (CBOW): use a window of context words to predict the middle word
– Skip-gram (SG): use a word to predict the surrounding words in the window
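A minimal training sketch for the two variants, assuming gensim ≥ 4.0 is installed; the toy corpus and hyper-parameters (vector_size, window, min_count) are illustrative choices, not values from the lecture.

```python
# Minimal word2vec training sketch (assumes gensim >= 4.0).
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "floor"],
             ["the", "dog", "sat", "on", "the", "mat"]]

# CBOW (sg=0): the window of context words predicts the middle word.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram (sg=1): the middle word predicts each surrounding word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)             # (50,) word vector
print(skipgram.wv.most_similar("cat"))  # nearest words by cosine similarity
```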
Word2vec – Continuous Bag-of-Words

• E.g. “The cat sat on floor”
– Window size = 2

(Figure: for the centre word “sat”, the context words within the window are “the”, “cat”, “on” and “floor”.)
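A small sketch (not from the slides) that enumerates the CBOW training examples for this sentence: each centre word is paired with the context words inside the window.

```python
# Sketch: CBOW training examples (context window -> centre word)
# for "The cat sat on floor" with window size 2.
sentence = ["the", "cat", "sat", "on", "floor"]
window = 2

for i, centre in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(context, "->", centre)
# e.g. ['the', 'cat', 'on', 'floor'] -> sat
```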
(Figure, “Input layer”: the context words “cat” and “on” enter the network as V-dimensional one-hot vectors, with a single 1 at the word’s index in the vocabulary (e.g. the index of “cat”) and zeros elsewhere; they feed a hidden layer and an output layer, and the target word “sat” is likewise a one-hot vector.)
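A minimal sketch of the one-hot encoding shown in the figure; the five-word vocabulary and the word order inside it are illustrative assumptions.

```python
# Sketch: one-hot vectors for the context words "cat", "on" and the target "sat".
import numpy as np

vocab = ["the", "cat", "sat", "on", "floor"]   # toy vocabulary
V = len(vocab)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    x = np.zeros(V)
    x[index[word]] = 1.0       # a single 1 at the word's vocabulary index
    return x

x_cat, x_on = one_hot("cat"), one_hot("on")
y_sat = one_hot("sat")         # the target is also a one-hot vector
print(x_cat)                   # [0. 1. 0. 0. 0.]
```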
We must learn W and W’

(Figure: each V-dimensional one-hot input (“cat”, “on”) is multiplied by the input weight matrix W (V×N) to give the N-dimensional hidden layer; the output weight matrix W’ (N×V) maps the hidden layer to the V-dimensional output for the target “sat”. N will be the size of the word vectors.)
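A minimal sketch of the two parameter matrices, with toy sizes V = 5 and N = 3; real models use a much larger vocabulary and vector size.

```python
# Sketch: the two CBOW parameter matrices.
# W  (V x N) maps a one-hot input to the N-dim hidden layer;
# W' (N x V) maps the hidden layer back to V-dim output scores.
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input weights (rows will be the word vectors)
W_prime = rng.normal(size=(N, V))  # output weights

x = np.zeros(V); x[1] = 1.0        # one-hot for the word at index 1 (toy "cat")
h = W.T @ x                        # N-dimensional hidden activation
print(h.shape)                     # (3,)
```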
Wᵀ · x_cat = v_cat

(Figure: multiplying the transposed input matrix Wᵀ (N×V, since W is V×N) by the one-hot vector x_cat simply picks out the row of W for “cat”, giving its N-dimensional vector v_cat; the same holds for x_on and v_on. The hidden layer is the average of the context vectors: v̂ = (v_cat + v_on) / 2.)
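A quick numpy check of what the figure illustrates: multiplying Wᵀ by a one-hot vector returns the corresponding row of W, so v_cat is just a table lookup. The sizes and the position of “cat” are toy assumptions.

```python
# Sketch: W^T times a one-hot vector is a row lookup into W.
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, N))

cat_index = 1                    # hypothetical position of "cat" in the vocabulary
x_cat = np.zeros(V)
x_cat[cat_index] = 1.0

v_cat = W.T @ x_cat              # N-dimensional vector for "cat"
print(np.allclose(v_cat, W[cat_index]))   # True: the product just selects a row of W
```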
Wᵀ · x_on = v_on

(Figure: the same multiplication for the one-hot vector x_on selects the row of W for “on”, giving v_on; again the hidden layer is v̂ = (v_cat + v_on) / 2.)
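A one-line version of the averaging step, using 3-dimensional toy values loosely echoing the visible entries in the figure (the real vectors are N-dimensional).

```python
# Sketch: the CBOW hidden layer is the average of the context word vectors.
import numpy as np

v_cat = np.array([2.4, 2.6, 1.8])   # toy values
v_on  = np.array([1.8, 2.9, 1.9])   # toy values

v_hat = (v_cat + v_on) / 2
print(v_hat)                        # [2.1  2.75 1.85]
```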
W’ᵀ · v̂ = z,   ŷ = softmax(z)

(Figure: the N-dimensional hidden vector v̂ is multiplied by the output matrix W’ to give the V-dimensional score vector z, and the softmax turns z into the predicted distribution ŷ over the vocabulary, compared against the one-hot output for “sat”. N will be the size of the word vectors.)
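A minimal sketch of the output step, with toy sizes; the softmax is written in its numerically stable form.

```python
# Sketch: output scores z = W'^T . v_hat and probabilities y_hat = softmax(z).
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(2)
W_prime = rng.normal(size=(N, V))    # output weight matrix W' (N x V)
v_hat = np.array([2.1, 2.75, 1.85])  # hidden layer (toy values)

z = W_prime.T @ v_hat                # V-dimensional scores
y_hat = np.exp(z - z.max())          # numerically stable softmax
y_hat /= y_hat.sum()
print(y_hat, y_hat.sum())            # a probability distribution over the V words
```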
We would prefer ŷ to be close to y_sat

(Figure: the same network, now showing the softmax output ŷ as a probability distribution over the vocabulary (values such as 0.01, 0.02, …, 0.7, …); training should push ŷ towards the one-hot target vector for “sat”. N will be the size of the word vectors.)
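The extracted slide text does not spell out the loss, but this preference is typically expressed by minimising the cross-entropy between ŷ and the one-hot target; a minimal sketch with toy probabilities follows.

```python
# Sketch (assumption: cross-entropy objective): making y_hat close to the
# one-hot target y_sat means minimising -log y_hat[index of "sat"].
import numpy as np

y_hat = np.array([0.01, 0.02, 0.70, 0.02, 0.25])  # toy softmax output (sums to 1)
sat_index = 2                                      # hypothetical position of "sat"

loss = -np.log(y_hat[sat_index])
print(loss)   # smaller when y_hat[sat_index] is close to 1
```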
Wᵀ — contains the word’s vectors

(Figure: the trained input matrix is shown next to the network once more; each column of Wᵀ, i.e. each row of the V×N matrix W, is the N-dimensional vector of one vocabulary word.)

We can consider either W or W’ as the word’s representation, or even take the average.
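A minimal sketch of the three options the slide mentions, using toy matrices in place of trained ones.

```python
# Sketch: reading a word's embedding from W, from W'^T, or as their average.
import numpy as np

V, N = 5, 3
rng = np.random.default_rng(3)
W = rng.normal(size=(V, N))          # trained input matrix (toy stand-in)
W_prime = rng.normal(size=(N, V))    # trained output matrix (toy stand-in)

cat_index = 1                        # hypothetical index of "cat"
vec_from_W  = W[cat_index]           # row of W
vec_from_Wp = W_prime[:, cat_index]  # column of W', i.e. row of W'^T
vec_avg = (vec_from_W + vec_from_Wp) / 2
print(vec_avg.shape)                 # (3,)
```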
Some interesting results

Word analogies

Combination of tf.idf and WE

• Recall, for a document with t words:
– Doc = {tfidf_1, tfidf_2, …, tfidf_t}   (the tf.idf weight of each word)
– W_1 = {x_11, x_12, …, x_1N}   (the word vector of word 1)
– …
– W_t = {x_t1, x_t2, …, x_tN}   (the word vector of word t)
• Then d_1 = W_1 · tfidf_1, d_2 = W_2 · tfidf_2, … etc. (see the sketch below)
• Finally DocVect = the average of all d_i
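A minimal sketch of this combination, assuming the tf.idf weights and word vectors are already available; the names and values are illustrative.

```python
# Sketch: document vector as the tf.idf-weighted average of its word vectors.
import numpy as np

word_vectors = {                      # toy word embeddings (N = 3)
    "cat": np.array([2.4, 2.6, 1.8]),
    "sat": np.array([0.5, 1.5, 3.6]),
    "mat": np.array([1.9, 2.4, 2.0]),
}
tfidf = {"cat": 0.7, "sat": 0.1, "mat": 0.5}          # toy tf.idf weights

doc_words = ["cat", "sat", "mat"]
d = [tfidf[w] * word_vectors[w] for w in doc_words]   # d_i = tfidf_i * W_i
doc_vect = np.mean(d, axis=0)                         # average of all d_i
print(doc_vect)
```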
