Unit 2
FastText model – Overview of Deep Learning models – RNN – Transformers – Overview of Text Summarization and Topic Models

COURSE OBJECTIVES: Apply classification algorithms to text documents
COURSE OUTCOME: CO2: Apply deep learning techniques for NLP tasks, language modelling and machine translation
Text Book: "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James H. Martin – Chapter 6

Word embeddings
• Representations of the meaning of words – embeddings
• Computed directly from the word distributions in texts
• Used in every natural language processing application
• The roots of the model – convergence of two big ideas
  1. Use a point in three-dimensional space to represent the connotation of a word (Osgood)
  2. Define the meaning of a word by its distribution in language use – its neighboring words or grammatical environments (Joos, Harris, and Firth)
• The idea: two words that occur in very similar distributions (whose neighboring words are similar) have similar meanings
• Example contexts for ongchoi:
  Ongchoi is delicious sauteed with garlic...
  Ongchoi is superb over rice...
  Ongchoi leaves with salty sauces...
• Contexts of other words in the corpus:
  ...spinach sauteed with garlic over rice...
  ...chard stems and leaves are delicious...
  ...collard greens and other salty leafy greens...
• ongchoi occurs with words like rice, garlic, delicious, and salty, as do words like spinach, chard, and collard
• spinach, chard, and collard – leafy greens
• ongchoi – similarly a leafy green
• Computationally implemented by counting words in the context of ongchoi (see the sketch below)
• word2vec (Chapter 6.8, page 112) – can be compared to the Neural Language Model
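Counting the words that appear in the context of ongchoi can be sketched in a few lines of Python; the tokenized sentences and the ±2 window below are illustrative assumptions, not a prescribed implementation:

from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words that occur within +/- `window` tokens of `target`."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

sentences = [
    "Ongchoi is delicious sauteed with garlic",
    "Ongchoi is superb over rice",
    "Ongchoi leaves with salty sauces",
]
print(context_counts(sentences, "ongchoi").most_common(3))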
Word2Vec model
• Neural Language Model
  • a neural network that learns to predict the next word from prior words
  • uses the next word in running text as its supervision signal
  • learns an embedding representation for each word as part of doing this prediction task
• word2vec – a much simpler model than the neural network language model, in two ways:
  1. word2vec simplifies the task
     • makes it binary classification (the Neural Language Model does word prediction)
  2. word2vec simplifies the architecture
     • trains a logistic regression classifier (the Neural Language Model is a multi-layer neural network with hidden layers, which demands more sophisticated training algorithms)
• Train a classifier on a binary prediction task: "Is word w1 likely to occur near another word w2?"
  • e.g. "Is word w likely to show up near apricot?"
  • instead of counting how often each word w occurs near apricot
  • we do not actually care about this prediction task itself
  • instead, take the learned classifier weights as the word embeddings
  • use running text as implicitly supervised training data for such a classifier
The intuition of skip-gram is:
1. Treat the target word and a neighboring context word as positive examples
2. Randomly sample other words in the lexicon to get negative samples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the learned weights as the embeddings
• a word c that occurs near the target word apricot acts as the gold 'correct answer' to the question "Is word c likely to show up near apricot?"
• often called self-supervision – avoids the need for hand-labeled supervision
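A small sketch of steps 1 and 2 (building positive pairs from a ±2 window and sampling random negatives); the sample text, window size, and uniform negative sampling are simplifying assumptions, since real word2vec samples negatives by weighted unigram frequency:

import random

def skipgram_training_pairs(tokens, window=2, num_neg=2, seed=0):
    """Return (target, context, label) triples: label 1 for words seen in the
    +/- window context, label 0 for randomly sampled negative words."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((target, tokens[j], 1))       # positive example
            for _ in range(num_neg):                   # sampled negatives
                pairs.append((target, rng.choice(vocab), 0))
    return pairs

tokens = "lemon a tablespoon of apricot jam a pinch".split()
for pair in skipgram_training_pairs(tokens)[:6]:
    print(pair)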
The classifier
• Generate embeddings
• Classification task
• Compute similarity – dot product
• Convert the dot product to a probability
  • sigmoid function – converts the dot product to a value in the range 0–1
• Goal – train a classifier
• Given a tuple (w, c)
  w – target word
  c – candidate context word
• Return the probability that c is a real context word
• The probability that word c is not a real context word for w is one minus that probability

Skip-gram model
• target word – apricot
• window – ±2 context words
• Eg: (apricot, jam) – True
• Eg: (apricot, aardvark) – False
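A minimal sketch of this probability computation; the random vectors stand in for learned embeddings and are purely illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 50
w_apricot = rng.normal(size=dim)   # target-word vector w (hypothetical values)
c_jam = rng.normal(size=dim)       # context-word vector c (hypothetical values)

p_real = sigmoid(np.dot(c_jam, w_apricot))   # P(+ | w, c): c is a real context word
p_not = 1.0 - p_real                         # P(- | w, c): c is not a real context word
print(p_real, p_not)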
Training
• To obtain the gradients needed to adjust the weights
• Training set
• Loss function – cross-entropy – the difference between the predicted and the correct distribution
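As a sketch, the cross-entropy loss for one target word w with one positive context c_pos and k sampled negative contexts (the standard negative-sampling formulation; the toy vectors are illustrative) can be written as:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    """L = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]
    Minimizing L pushes w toward its real context word and away from negatives."""
    loss = -np.log(sigmoid(np.dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-np.dot(c_neg, w)))
    return loss

rng = np.random.default_rng(1)
print(sgns_loss(rng.normal(size=50), rng.normal(size=50),
                [rng.normal(size=50) for _ in range(2)]))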
Back Propagation in RNNs
• 3 sets of weights to update
W- weights from the input layer to the hidden layer
U - weights from the previous hidden layer to the current hidden layer
V - weights from the hidden layer to the output layer
• two considerations
  • to compute the loss function for the output at time t we need the hidden layer from time t − 1
  • the hidden layer at time t influences both the output at time t and the hidden layer at time t + 1
• two-pass algorithm for training the weights in RNNs
• first pass - perform forward inference
• compute ht , yt
• accumulate the loss at each step in time
• save the value of the hidden layer at each step for use at the next time step
• second phase – Back Propagation
• process the sequence in reverse
• compute the required gradients
• compute and save the error term for use in the hidden layer for each step backward in time.
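The two-pass procedure can be sketched with an explicit Elman-style RNN; PyTorch is an assumption here (the notes name no framework), and its autograd carries out the reverse pass that accumulates the gradients for W, U and V:

import torch
import torch.nn.functional as F

# An explicit Elman RNN so the three weight matrices are visible:
# W (input -> hidden), U (hidden -> hidden), V (hidden -> output).
d_in, d_h, d_out, T = 8, 16, 5, 10
W = torch.randn(d_h, d_in, requires_grad=True)
U = torch.randn(d_h, d_h, requires_grad=True)
V = torch.randn(d_out, d_h, requires_grad=True)

xs = torch.randn(T, d_in)                 # toy input sequence
targets = torch.randint(0, d_out, (T,))   # toy gold labels, one per time step

# First pass: forward inference, keeping each h_t in the graph
# and accumulating the loss at each step in time.
h = torch.zeros(d_h)
loss = torch.zeros(())
for t in range(T):
    h = torch.tanh(W @ xs[t] + U @ h)     # hidden layer at time t
    y = V @ h                             # output at time t
    loss = loss + F.cross_entropy(y.unsqueeze(0), targets[t].unsqueeze(0))

# Second pass: backpropagation through time; autograd processes the
# sequence in reverse and computes the required gradients.
loss.backward()
print(W.grad.shape, U.grad.shape, V.grad.shape)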
Self-attention
• allows a network to directly extract and use information from arbitrarily large
contexts without the need to pass it through intermediate recurrent
connections
• application of self-attention to the problems of language modeling and autoregressive generation – only past context can be used
• access to all of the inputs up to and including the one under consideration
• no access to information about inputs beyond the current one – access only to past info
• the computation performed for each item is independent of all the other computations – so parallel computing can be used
• softmax calculation – α(i, j) = softmax(score(x_i, x_j)) for j ≤ i, i.e. α(i, j) = exp(score(x_i, x_j)) / Σ_{k ≤ i} exp(score(x_i, x_k)); the output y_i is the weighted sum Σ_{j ≤ i} α(i, j) x_j
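A minimal sketch of this simplest form of self-attention (dot-product scores, no learned query/key/value projections), restricted to past context as described above; the random input matrix is illustrative:

import numpy as np

def causal_self_attention(X):
    """Each output y_i is a weighted sum of inputs x_1..x_i, with weights
    alpha(i, j) given by a softmax over the scores x_i . x_j for j <= i."""
    n, d = X.shape
    Y = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]                 # score(x_i, x_j) for j <= i
        alphas = np.exp(scores - scores.max())     # numerically stable softmax
        alphas /= alphas.sum()
        Y[i] = alphas @ X[: i + 1]                 # y_i = sum_j alpha(i, j) x_j
    return Y

X = np.random.default_rng(0).normal(size=(4, 6))
print(causal_self_attention(X).shape)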
Text summarization
• important concept in text analytics
• practical application of context-based autoregressive generation
• task - take a full-length article and produce an effective summary
• used by businesses and analytical firms to
  • shorten and summarize huge documents of text such that they still retain their key essence or theme
  • present this summarized information to consumers and clients
• To train a transformer - corpus with full-length articles + corresponding summaries
• Append a summary to each full-length article in a corpus, with a unique marker
separating the two
• an article–summary pair (x1, ..., xm), (y1, ..., yn) is converted to (x1, ..., xm, δ, y1, ..., yn), a sequence of length m + n + 1
• these concatenated sequences are treated as long sentences and used to train an autoregressive language model using teacher forcing
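A small sketch of this data preparation; the separator token name and the example tokens are hypothetical:

SEP = "<delta>"   # assumed name for the unique marker separating article and summary

def make_training_sequence(article_tokens, summary_tokens):
    """(x1..xm), (y1..yn) -> (x1..xm, delta, y1..yn), a sequence of length m + n + 1."""
    return article_tokens + [SEP] + summary_tokens

article = ["the", "storm", "closed", "three", "major", "roads", "overnight"]
summary = ["storm", "closes", "roads"]
seq = make_training_sequence(article, summary)
print(len(seq) == len(article) + len(summary) + 1)   # True
# With teacher forcing, seq[:-1] is fed as input and seq[1:] provides
# the next-token targets for the autoregressive language model.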
Frameworks and algorithms to build topic models
1. Latent Semantic Indexing (LSI) – popular
2. Latent Dirichlet Allocation (LDA) – popular
3. Non-negative Matrix Factorization – a very recent technique – extremely effective and gives excellent results

1. Latent Semantic Indexing (LSI)
• used for text summarization, information retrieval, and search
• uses the very popular SVD technique (Singular Value Decomposition)
• main principle – similar terms tend to be used in the same context and hence tend to co-occur more
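A minimal LSI sketch; scikit-learn and the toy documents are assumptions (the notes name only SVD, not a library). A TF-IDF term–document matrix is factorized with truncated SVD to place documents in a low-dimensional topic space:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "ongchoi sauteed with garlic over rice",
    "spinach and chard are leafy greens",
    "collard greens sauteed with garlic",
    "stock prices fell after the earnings report",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)             # documents x terms matrix

lsi = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsi.fit_transform(X)         # documents in the latent topic space
print(doc_topics.shape)                   # (4, 2)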