NLP DL Lecture2
Approach 3: low dimensional vectors
• Store only "important" information in a fixed, low-dimensional vector.
• Singular Value Decomposition (SVD) on the co-occurrence matrix $X$
– The truncated SVD $\hat{X} = U_k \Sigma_k V_k^\top$ is the best rank-$k$ approximation to $X$, in terms of least squares
– Motel = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]
• $m = n =$ size of vocabulary
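A minimal sketch of this idea in Python with numpy; the toy co-occurrence matrix and the choice of $k$ are illustrative only:

```python
import numpy as np

# Toy co-occurrence matrix X (V x V); in practice it is built by
# counting how often word pairs appear within a context window.
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X)

# Keeping only the top-k singular values/vectors gives the best
# rank-k least-squares approximation to X.
k = 2
word_vectors = U[:, :k] * s[:k]   # one k-dim vector per word
print(word_vectors.shape)         # (4, 2)
```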
Problems with SVD
• Computational cost scales badly with the size of the co-occurrence matrix
• Hard to incorporate a new word or document without recomputing the whole decomposition
word2vec: an approach to represent the meaning of words
• Represent each word with a low-dimensional vector
• Word similarity = vector similarity
• Key idea: predict the surrounding words of every word
• Faster than SVD, and can easily incorporate a new sentence/document or add a word to the vocabulary
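As a usage sketch (not from the slides), training such vectors with the gensim library looks roughly like this; the toy corpus and parameter values are illustrative:

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: one tokenized sentence per list.
sentences = [
    ["the", "cat", "sat", "on", "the", "floor"],
    ["the", "dog", "sat", "on", "the", "mat"],
]

# sg=0 selects CBOW (predict the center word from its context);
# sg=1 would select skip-gram (predict the context from the center word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["cat"].shape)              # (50,) -- the word vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity
```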
Word2Vec
[Figures only on these slides: the Word2Vec architecture, compared with an auto-encoder and a stacked auto-encoder.]
Represent the meaning of words – word2vec
Word2vec – Continuous Bag of Words (CBOW)
Example sentence: "the cat sat on floor" – with a window of size 2, predict the center word "sat" from the context words "the", "cat", "on", "floor".
[Figure: CBOW network. The context words "cat" and "on" enter the input layer as one-hot vectors, each with a 1 at that word's index in the vocabulary; they pass through a hidden layer, and the output layer is the one-hot vector for the center word "sat".]
We must learn W and W′
[Figure: the same network with its parameters. Each one-hot context vector ("cat", "on"; V-dim) is multiplied by the shared input matrix $W_{V\times N}$ to produce the hidden layer (N-dim); the hidden layer is multiplied by the output matrix $W'_{N\times V}$ to produce the output layer (V-dim), the prediction for "sat". N will be the size of the word vector.]
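Concretely, the two parameter matrices are just arrays of these shapes; a minimal sketch, with V and N chosen arbitrarily:

```python
import numpy as np

V, N = 10000, 300   # vocabulary size and word-vector size (illustrative)

W     = np.random.randn(V, N) * 0.01   # input matrix  W,  V x N
W_out = np.random.randn(N, V) * 0.01   # output matrix W', N x V
```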
$W_{V\times N}^{\top} \times x_{cat} = v_{cat}$
[Figure: worked example. Because $x_{cat}$ is one-hot with a 1 at index 2, multiplying $W^{\top}$ by it simply selects column 2 of $W^{\top}$:
$$\begin{bmatrix} 0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \cdots & 3.2 \\ 0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \cdots & 6.1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \cdots & 1.2 \end{bmatrix} \times \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} 2.4 \\ 2.6 \\ \vdots \\ 1.8 \end{bmatrix} = v_{cat}$$]
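A quick numeric check of this lookup in numpy; the toy matrix below matches the idea rather than the slide's full matrix:

```python
import numpy as np

# Toy W with V=4 words and N=3 dimensions.
W = np.array([[0.1, 0.5, 0.6],
              [2.4, 2.6, 1.8],
              [1.6, 1.4, 2.7],
              [1.8, 2.9, 1.9]])   # V x N

x_cat = np.array([0, 1, 0, 0])    # one-hot, "cat" at index 1

v_cat = W.T @ x_cat               # multiplying by a one-hot vector...
print(v_cat)                      # [2.4 2.6 1.8]
print(np.allclose(v_cat, W[1]))   # ...just selects row 1 of W: True
```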
$W_{V\times N}^{\top} \times x_{on} = v_{on}$
[Figure: the same lookup for the context word "on", whose one-hot vector has a 1 at index 4, selecting column 4 of $W^{\top}$:
$$\begin{bmatrix} 0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \cdots & 3.2 \\ 0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \cdots & 6.1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \cdots & 1.2 \end{bmatrix} \times \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} 1.8 \\ 2.9 \\ \vdots \\ 1.9 \end{bmatrix} = v_{on}$$
The hidden layer is then the average of the context vectors: $\hat{v} = \dfrac{v_{cat} + v_{on}}{2}$.]
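With the numbers from the two lookups above, the averaging step works out to:

$$\hat{v} = \frac{v_{cat} + v_{on}}{2} = \frac{1}{2}\left(\begin{bmatrix} 2.4 \\ 2.6 \\ \vdots \\ 1.8 \end{bmatrix} + \begin{bmatrix} 1.8 \\ 2.9 \\ \vdots \\ 1.9 \end{bmatrix}\right) = \begin{bmatrix} 2.1 \\ 2.75 \\ \vdots \\ 1.85 \end{bmatrix}$$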
[Figure: the output side of the CBOW network. The hidden vector $\hat{v}$ (N-dim) is multiplied by the output matrix to give a score for every vocabulary word, and a softmax turns the scores into probabilities:
$$W'^{\top}_{N\times V} \times \hat{v} = z$$
$$\hat{y} = \mathrm{softmax}(z)$$
The target is the one-hot vector $y_{sat}$ (V-dim). N will be the size of the word vector.]
We would prefer $\hat{y}$ to be close to $y_{sat}$
[Figure: the same network with example output probabilities (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, …, 0.7, …, 0.00). Training adjusts W and W′ so that the probability $\hat{y}$ assigns to the true center word "sat" (here 0.7) approaches 1.]
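Putting the whole forward pass together, a minimal numpy sketch; the sizes and random weights are toy values, and the cross-entropy loss at the end is the standard training objective rather than something shown explicitly on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 3                       # toy vocabulary and vector sizes
W  = rng.normal(size=(V, N))      # input matrix  W  (V x N)
Wp = rng.normal(size=(N, V))      # output matrix W' (N x V)

def one_hot(i, V):
    x = np.zeros(V)
    x[i] = 1.0
    return x

x_cat, x_on = one_hot(1, V), one_hot(3, V)   # context words
y_sat = one_hot(2, V)                        # center-word target

v_hat = (W.T @ x_cat + W.T @ x_on) / 2       # hidden layer: average
z = Wp.T @ v_hat                             # scores over vocabulary
y_hat = np.exp(z) / np.exp(z).sum()          # softmax

loss = -np.log(y_hat[2])                     # cross-entropy for "sat"
print(y_hat, loss)
```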
[Figure: the trained network once more, highlighting $W_{V\times N}^{\top}$ (first row $0.1\ 2.4\ 1.6\ 1.8\ 0.5\ 0.9\ \cdots\ 3.2$). After training, each column of $W^{\top}$ (each row of $W$) is the N-dim vector of one vocabulary word: these rows are the word vectors we keep.]
Word analogies
[Figure: analogies recovered by vector arithmetic, e.g. vector("king") − vector("man") + vector("woman") ≈ vector("queen").]
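A self-contained sketch of how such an analogy query works; the hand-made vectors below are illustrative, not trained embeddings:

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    best, best_sim = None, -1.0
    for w, v in vecs.items():
        if w in (a, b, c):        # exclude the query words themselves
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best, best_sim

# Hand-made toy vectors (illustrative only):
vecs = {
    "man":   np.array([1.0, 0.0, 0.1]),
    "woman": np.array([1.0, 1.0, 0.1]),
    "king":  np.array([0.2, 0.0, 1.0]),
    "queen": np.array([0.2, 1.0, 1.0]),
}
print(analogy("man", "king", "woman", vecs))  # ('queen', ...)
```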
Combination of tf-idf and word embeddings (WE)
• Recall, for a document containing $t$ words:
– Doc $= (tfidf_1, tfidf_2, \ldots, tfidf_t)$, the tf-idf weight of each word
– $W_1 = (x_{11}, x_{12}, \ldots, x_{1N})$, the embedding of word 1
– …
– $W_t = (x_{t1}, x_{t2}, \ldots, x_{tN})$, the embedding of word $t$
• Then $d_1 = tfidf_1 \cdot W_1$, $d_2 = tfidf_2 \cdot W_2$, etc.
• Finally DocVect $= \frac{1}{t}\sum_{i=1}^{t} d_i$, the average of all $d_i$
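A minimal sketch of this weighted average in Python, assuming per-word tf-idf weights and word vectors are already available (both hand-made here for illustration):

```python
import numpy as np

# Per-word tf-idf weights and embeddings for one document (toy values).
tfidf = {"cat": 0.8, "sat": 0.3, "floor": 0.9}
emb = {
    "cat":   np.array([2.4, 2.6, 1.8]),
    "sat":   np.array([0.5, 1.5, 3.6]),
    "floor": np.array([1.9, 2.4, 2.0]),
}

# d_i = tfidf_i * W_i, then DocVect = average of all d_i.
doc_vect = np.mean([tfidf[w] * emb[w] for w in tfidf], axis=0)
print(doc_vect)
```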