CCS369- TSS-Unit 2 - Summary Text and speech analysis

Information Technology (Anna University)


UNIT II TEXT CLASSIFICATION
Vector Semantics and Embeddings - Word Embeddings - Word2Vec model - GloVe model - FastText model - Overview of Deep Learning models - RNN - Transformers - Overview of Text summarization and Topic Models

COURSE OBJECTIVES: Apply classification algorithms to text documents
COURSE OUTCOME: CO2: Apply deep learning techniques for NLP tasks, language modelling and machine translation
Text Book: "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky and James H. Martin - Chapter 6

Vector Semantics and Embeddings

Vector Semantics
• is the standard way to represent word meaning in NLP
• helps to model many of the aspects of word meaning
• representations of the meaning of words - embeddings
• computed directly from the word distributions in texts
• used in every natural language processing application
• the roots of the model - convergence of two big ideas
  - use a point in three-dimensional space to represent the connotation of a word (Osgood)
  - define the meaning of a word by its distribution in language use - its neighboring words or grammatical environments (Joos, Harris, and Firth)
• the idea was that two words that occur in very similar distributions (whose neighboring words are similar) have similar meanings
  Ongchoi is delicious sauteed with garlic        spinach sauteed with garlic over rice...
  Ongchoi is superb over rice                     chard stems and leaves are delicious...
  Ongchoi leaves with salty sauces...             collard greens and other salty leafy greens
• ongchoi occurs with words like rice, garlic, delicious, and salty, as do words like spinach, chard, and collard
• spinach, chard, and collard are leafy greens
• so ongchoi is probably a similar leafy green
• this can be computationally implemented by counting words in the context of ongchoi

• vector semantics - represent a word as a point in a multidimensional semantic space that is derived from the distributions of word neighbors
• vectors for representing words - embeddings
• "embedding" - a mapping from one space or structure to another
• can be learned automatically from text without supervision
• visualization of embeddings learned for sentiment analysis
  - shows the location of selected words projected down from the 60-dimensional space into a two-dimensional space
  - distinct regions contain positive words, negative words, and neutral function words
• word similarity in vector semantics - offers enormous power to NLP applications
  - applications like sentiment classifiers depend on the same words appearing in the training and test sets
  - if words are represented as embeddings, classifiers can assign sentiment based on words with similar meanings
• two most commonly used models
  tf-idf model
  - an important baseline
  - the meaning of a word is a function of the counts of nearby words
  - results in very long vectors that are sparse, i.e. mostly zero
  word2vec model
  - constructs short, dense vectors that have useful semantic properties
• cosine
  - the standard way to use embeddings to compute semantic similarity between two words, two sentences, or two documents
  - an important tool in practical applications like question answering, summarization, or automatic essay grading
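A minimal sketch of the cosine computation described above, using numpy; the vectors and values are purely illustrative, not taken from the textbook:

import numpy as np

def cosine(v, w):
    # Cosine similarity: the dot product normalized by the two vector lengths.
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Illustrative 4-dimensional embeddings (real embeddings have 50-1000 dimensions).
apricot = np.array([0.10, 0.90, 0.80, 0.05])
peach   = np.array([0.20, 0.80, 0.70, 0.10])
print(cosine(apricot, peach))   # close to 1.0 for words with similar vectors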


Word Embeddings
• a more powerful word representation
• embeddings - short, dense vectors
  - short - the number of dimensions is in the range 50-1000
  - dense - few zeroes; values can be real-valued numbers, including negative values
• work better in every NLP task than sparse vectors
• fewer dimensions - the classifier needs to learn fewer weights
  Eg: 300-dimensional dense vectors require the classifier to learn fewer weights than words represented as 50,000-dimensional sparse vectors
• the smaller parameter space helps with generalization and avoiding overfitting
• capture synonyms better
  Eg: in a sparse vector representation, the dimensions for synonyms like car and automobile are distinct and unrelated, so it fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor

Method for computing embeddings
Word2vec
• a software package
• implemented using two algorithms
  - skip-gram with negative sampling (SGNS)
  - Continuous Bag of Words (CBOW)
• the skip-gram algorithm is loosely referred to as word2vec
• fast
• efficient to train
• easily available online with code and pretrained embeddings
Static embeddings
• the method learns one fixed embedding for each word in the vocabulary
Dynamic (contextual) embeddings
• the vector for each word is different in different contexts - BERT or ELMo representations
Word2vec embeddings are static embeddings
chapter 6.8 page 112

Word2Vec model
• train a classifier on a binary prediction task: "Is word w1 likely to occur near another word w2?"
  Eg: "Is word w likely to show up near apricot?"
• instead of counting how often each word w occurs near apricot, train a classifier on this binary prediction task
• we do not actually care about the prediction task itself
  - take the learned classifier weights as the word embeddings
• use running text as implicitly supervised training data for such a classifier
  - a word c that occurs near the target word apricot acts as the gold 'correct answer' to the question "Is word c likely to show up near apricot?"
  - often called self-supervision - avoids the need for hand-labeled supervision

word2vec compared to the Neural Language Model
• Neural Language Model
  - a neural network that learns to predict the next word from prior words
  - uses the next word in running text as its supervision signal
  - learns an embedding representation for each word as part of doing this prediction task
• word2vec is a much simpler model than the neural network language model, in two ways:
  1. word2vec simplifies the task - making it binary classification (the Neural Language Model does word prediction)
  2. word2vec simplifies the architecture - training a logistic regression classifier (the Neural Language Model is a multi-layer neural network with hidden layers, which demands more sophisticated training algorithms)

The intuition of skip-gram (see the usage sketch below):
1. Treat the target word and a neighboring context word as positive examples
2. Randomly sample other words in the lexicon to get negative samples
3. Use logistic regression to train a classifier to distinguish those two cases
4. Use the learned weights as the embeddings
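As a hedged illustration (not the textbook's code), the skip-gram with negative sampling training described above can be run with the gensim library, assuming gensim 4.x; the corpus and parameter values here are placeholders:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (placeholder data).
sentences = [["ongchoi", "is", "delicious", "sauteed", "with", "garlic"],
             ["spinach", "sauteed", "with", "garlic", "over", "rice"]]

# sg=1 selects the skip-gram algorithm; negative=2 draws k=2 noise words per
# positive example; window=2 gives the +/-2 context window used in the notes.
model = Word2Vec(sentences, vector_size=100, window=2, sg=1, negative=2, min_count=1)

vec = model.wv["garlic"]                            # the learned static embedding
print(model.wv.similarity("spinach", "ongchoi"))    # cosine similarity of two words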


The classifier
• goal - generate embeddings via a classification task
• train a classifier that, given a tuple (w, c)
  w - target word
  c - candidate context word
  returns the probability that c is a real context word (and the probability that c is not a real context word for w)
• compute similarity - dot product
• convert the dot product to a probability - the sigmoid function maps the dot product into the range 0-1

Skip-gram model
• target word - apricot
• window - ±2 context words
  Eg: (apricot, jam) - True
      (apricot, aardvark) - False

How does the classifier compute the probability P?

• the skip-gram model bases the probability on embedding similarity
  - a word is likely to occur near the target if its embedding is similar to the target embedding
• compute similarity between dense embeddings
  - two vectors are similar if they have a high dot product (cosine is just a normalized dot product)
  - Similarity(w, c) ≈ c · w
• the dot product c · w is not a probability
  - it is a number ranging from -∞ to ∞ and can be negative
• turn the dot product into a probability
  - use the logistic or sigmoid function σ(x) = 1 / (1 + exp(-x)), the fundamental core of logistic regression
• to compute the actual probability, consider the two possible events, whose total probability sums to 1
  - c is a context word
  - c is not a context word
• probability that word c is a real context word for target word w:
  P(+ | w, c) = σ(c · w) = 1 / (1 + exp(-c · w))
• estimate of the probability that word c is not a real context word for w:
  P(- | w, c) = 1 - P(+ | w, c) = σ(-c · w)
• the sigmoid function returns a number between 0 and 1
• the equations above give the probability for one word, but there are many context words in the window
  - skip-gram assumes all context words are independent, so their probabilities are multiplied:
    P(+ | w, c1:L) = Π(i = 1..L) σ(ci · w)
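A minimal numpy sketch of this probability computation; the vectors are illustrative, not the textbook's:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative target and context embeddings.
w     = np.array([0.5, -0.2, 0.8])     # target word, e.g. "apricot"
c_pos = np.array([0.4, -0.1, 0.9])     # real context word, e.g. "jam"
c_neg = np.array([-0.7, 0.6, -0.3])    # noise word, e.g. "aardvark"

p_pos = sigmoid(np.dot(c_pos, w))      # P(+ | w, c): c is a real context word
p_neg = 1 - sigmoid(np.dot(c_neg, w))  # P(- | w, c) = sigma(-c . w)
print(p_pos, p_neg)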


Intuition of the parameters
• skip-gram stores 2 embeddings for each word
  1) the word as a target
  2) the word as a context
• parameters - two matrices W and C, each containing an embedding for every one of the |V| words in the vocabulary V
• example setting: target word w (apricot), with 4 context words from an L = ±2 window

Learning skip-gram embeddings
• skip-gram learns embeddings by starting with random embedding vectors
• it then iteratively shifts the embedding of each word w to be
  - more like the embeddings of words that occur nearby
  - less like the embeddings of words that don't occur nearby

Skipgram with negative sampling (SGNS)


• uses more negative examples than positive examples - the ratio between them is set by a parameter k
• for each (w, cpos) training instance - create k negative samples, each a 'noise word' cneg
• a noise word is a random word from the lexicon, as long as it is not the target word w
• k = 2 means 2 negative examples for each positive example
• the noise words are chosen according to their weighted unigram frequency pα(w), where α is a weight, rather than the unweighted frequency p(w)
• it is common to set α = 0.75, which gives rare words a slightly higher probability than their raw frequency
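A small sketch (with made-up counts) of this weighted unigram distribution and how noise words can be drawn from it:

import numpy as np

counts = {"the": 1000, "jam": 50, "aardvark": 2}    # illustrative unigram counts
alpha = 0.75

words = list(counts)
weighted = np.array([counts[w] ** alpha for w in words])
p_alpha = weighted / weighted.sum()                 # p_alpha(w) = count(w)^0.75 / sum

# alpha = 0.75 gives rare words like "aardvark" slightly more probability
# than their raw (unweighted) frequency would.
k = 2
noise_words = np.random.choice(words, size=k, p=p_alpha)
print(dict(zip(words, p_alpha)), noise_words)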


• maximize the dot product of the word with the actual context words, and
• minimize the dot products of the word with the k negative sampled non-neighbor words
• for one positive pair (w, cpos) with k noise words cneg1 ... cnegk, the loss is
  L = -[ log σ(cpos · w) + Σ(i = 1..k) log σ(-cnegi · w) ]
• minimize this loss function using stochastic gradient descent
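A sketch of this loss for one positive pair and k = 2 noise words; the embedding values are illustrative:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w      = np.array([0.5, -0.2, 0.8])                 # target embedding
c_pos  = np.array([0.4, -0.1, 0.9])                 # actual context word
c_negs = [np.array([-0.7, 0.6, -0.3]),
          np.array([0.1, 0.8, -0.5])]               # k = 2 noise words

# L = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]
loss = -(np.log(sigmoid(np.dot(c_pos, w)))
         + sum(np.log(sigmoid(-np.dot(c, w))) for c in c_negs))
print(loss)   # stochastic gradient descent adjusts w, c_pos, c_negs to lower this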

GloVe model
• GloVe is short for Global Vectors
• a widely used static embedding model
• based on capturing global corpus statistics
• built on ratios of probabilities from the word-word co-occurrence matrix
• combines the intuitions of count-based models and word2vec

FastText model
• uses subword models
• deals with unknown words and sparsity in languages with rich morphology
• each word is represented as itself plus a bag of constituent n-grams, with special boundary symbols < and > added to each word
  Eg: with n = 3, where is represented by <where> plus the character n-grams <wh, whe, her, ere, re>
• a skip-gram embedding is learned for each constituent n-gram
• the word is then represented by the sum of all of the embeddings of its constituent n-grams
• fasttext is an open-source library with pretrained embeddings for 157 languages
• https://fasttext.cc
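A small sketch of the subword decomposition described above (the fastText library handles this internally; this helper is purely illustrative):

def char_ngrams(word, n=3):
    # Return the word's constituent character n-grams plus the whole word,
    # with boundary symbols < and > added, as in the fastText scheme above.
    bounded = "<" + word + ">"
    grams = [bounded[i:i + n] for i in range(len(bounded) - n + 1)]
    return grams + [bounded]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

The word's embedding is then the sum of the learned embeddings of these constituent n-grams, which is how unseen words can still receive a vector.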


Overview of Deep Learning models (Chapter 9)

Deep learning
• uses artificial neural networks
• performs sophisticated computations on large amounts of data
• a type of machine learning that works based on the structure and function of the human brain
Deep learning algorithms
• train machines by learning from examples
• health care, eCommerce, entertainment, and advertising all use deep learning
Neural Networks
• structured like the human brain - consist of artificial neurons (nodes)
• nodes are stacked next to each other in three kinds of layers:
  - the input layer
  - the hidden layer(s)
  - the output layer
Deep learning models
• are trained using a neural network architecture with multiple layers and a set of labeled data
• sometimes exceed human-level performance
• learn features directly from the data, without the need for manual feature extraction

Types of Deep Learning Algorithms
1. Convolutional Neural Networks (CNNs)
2. Long Short Term Memory Networks (LSTMs)
3. Recurrent Neural Networks (RNNs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines (RBMs)
10. Autoencoders

RNN - Recurrent Neural Network

• a network that contains a cycle within its network connections
• the value of a unit is directly, or indirectly, dependent on its own earlier outputs
• can be difficult to reason about and to train
• proven to be effective when applied to spoken and written language
• one class of recurrent networks - Elman networks, or simple recurrent networks
• these serve as the basis for more complex approaches like the Long Short-Term Memory (LSTM) networks

Structure of an RNN
• an input vector represents the current input xt
• it is multiplied by a weight matrix and passed through a non-linear activation function to compute the values for a layer of hidden units
• the hidden layer is then used to calculate a corresponding output yt
• key difference from a feedforward network - a recurrent link (shown as a dashed line in the textbook figure)
• this link augments the input to the computation at the hidden layer with the value of the hidden layer from the preceding point in time
• the hidden layer from the previous time step provides a form of memory, or context, that encodes earlier processing and informs the decisions to be made at later points in time
• it does not impose a fixed-length limit on this prior context; the context embodied in the previous hidden layer includes information extending back to the beginning of the sequence


• the temporal dimension makes RNNs appear more complex than feedforward networks
• the change - a new set of weights, U, that connect the hidden layer from the previous time step to the current hidden layer
• these weights determine how the network makes use of past context in calculating the output for the current input
• like other weights in the network, these connections are trained via backpropagation

Inference in RNN
• forward inference (mapping a sequence of inputs to a sequence of outputs) in an RNN is nearly identical to a feedforward network: perform the standard feedforward calculation at each time step
• to compute an output yt for an input xt, we need the activation value for the hidden layer ht
• to calculate ht, we multiply the input xt with the weight matrix W, and the hidden layer from the previous time step ht-1 with the weight matrix U
• we add these values together and pass them through a suitable activation function, g, to arrive at the activation value for the current hidden layer:
  ht = g(U ht-1 + W xt)
• once we have the values for the hidden layer, we proceed with the usual computation to generate the output vector:
  yt = f(V ht)
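A minimal numpy sketch of this forward-inference step; the dimensions and random weights are illustrative only:

import numpy as np

d_in, d_h, d_out = 4, 3, 2                      # illustrative sizes
W = np.random.randn(d_h, d_in)                  # input -> hidden
U = np.random.randn(d_h, d_h)                   # hidden(t-1) -> hidden(t)
V = np.random.randn(d_out, d_h)                 # hidden -> output

def rnn_step(x_t, h_prev, g=np.tanh):
    h_t = g(W @ x_t + U @ h_prev)               # h_t = g(W x_t + U h_{t-1})
    y_t = V @ h_t                               # output layer (before any softmax)
    return h_t, y_t

h = np.zeros(d_h)                               # initial hidden state
for x_t in np.random.randn(5, d_in):            # a sequence of 5 input vectors
    h, y = rnn_step(x_t, h)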

Training
• to obtain the gradients needed to adjust the weights we need
  - a training set
  - a loss function - cross-entropy - the difference between the predicted and correct distributions
  - backpropagation
• 3 sets of weights to update
  - W - weights from the input layer to the hidden layer
  - U - weights from the previous hidden layer to the current hidden layer
  - V - weights from the hidden layer to the output layer
• two considerations
  - to compute the loss function for the output at time t we use the hidden layer from time t-1
  - the hidden layer at time t influences both the output at time t and the hidden layer at time t+1
• a two-pass algorithm for training the weights in RNNs
  - first pass - perform forward inference
    - compute ht, yt
    - accumulate the loss at each step in time
    - save the value of the hidden layer at each step for use at the next time step
  - second pass - backpropagation
    - process the sequence in reverse
    - compute the required gradients
    - compute and save the error term for use in the hidden layer for each step backward in time


RNNs as Language Models


• process sequences a word at a time, predicting the next word in the sequence using the current word and the previous hidden state as inputs
• the limited-context constraint inherent in N-gram models is avoided - the hidden state has information about all preceding words, back to the beginning of the sequence
• forward inference in a recurrent language model follows the same process as above
• the input sequence x consists of words represented as one-hot vectors of size |V| × 1
• the output predictions y are vectors representing a probability distribution over the vocabulary
• at each step, the model
  - uses the word embedding matrix E to retrieve the embedding for the current word
  - combines it with the hidden layer from the previous step to compute a new hidden layer
  - uses this hidden layer to generate an output layer, which is passed through a softmax to produce a probability distribution over the entire vocabulary
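A sketch of one step of this language-model computation, extending the earlier RNN sketch with an embedding lookup and a softmax over the vocabulary; all sizes, matrices, and the word index are illustrative:

import numpy as np

V_size, d_e, d_h = 10, 4, 3                       # |V|, embedding size, hidden size
E    = np.random.randn(d_e, V_size)               # embedding matrix
W    = np.random.randn(d_h, d_e)                  # embedding -> hidden
U    = np.random.randn(d_h, d_h)                  # previous hidden -> hidden
Vout = np.random.randn(V_size, d_h)               # hidden -> vocabulary scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x_t = np.zeros(V_size); x_t[7] = 1                # one-hot vector for the current word
h_prev = np.zeros(d_h)

e_t = E @ x_t                                     # retrieve the word's embedding
h_t = np.tanh(W @ e_t + U @ h_prev)               # new hidden layer
y_t = softmax(Vout @ h_t)                         # distribution over the vocabulary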

chapter 9.4 Page No:190


Transformers
RNN (limitations)
• recurrent connections can lead to a loss of relevant information and to difficulties in training
• the sequential nature of recurrent networks inhibits the use of parallel computational resources
Transformers
• an approach to sequential processing that eliminates recurrent connections
• map sequences of input vectors (x1, ..., xn) to sequences of output vectors (y1, ..., yn) of the same length
• made up of stacks of network layers
  - simple linear layers
  - feedforward networks
  - custom connections
• the use of self-attention layers is the key innovation


Self-attention
• allows a network to directly extract and use information from arbitrarily large contexts without the need to pass it through intermediate recurrent connections
• when self-attention is applied to the problems of language modeling and autoregressive generation, only past context can be used
  - the model has access to all of the inputs up to and including the one under consideration
  - it has no access to information about inputs beyond the current one
• the computation performed for each item is independent of all the other computations, so parallel computing can be used

Attention-based approach
• compare an item to a collection of other items in a way that reveals their relevance in the current context - the comparison used is a dot product, which produces scores (other comparisons are possible)
• Eg: the computation of y3 is based on a set of comparisons between the input x3 and its preceding elements x1 and x2, and x3 itself
  - compute three scores: x3 · x1, x3 · x2 and x3 · x3
• normalize the scores with a softmax to create a vector of weights αij that indicates the proportional relevance of each input to the input element i (a probability)
• generate an output value yi by taking the sum of the inputs weighted by their respective α values

Limitations of this simple mechanism
• it provides no opportunity for learning - everything is directly based on the original input values x
• there is no opportunity to learn how words contribute to the representation of longer inputs

Transformers handle these issues
• they include additional parameters - sets of weight matrices that operate over the input embeddings
• these capture the roles that each input embedding plays during the attention process
  - query - the current focus of attention
  - key - a preceding input being compared to the current focus
  - value - used to compute the output for the current focus of attention
• introduce three sets of weights - WQ, WK, and WV:
  qi = WQ xi, ki = WK xi, vi = WV xi
• given input embeddings of size dm, the dimensionality of these matrices is dq×dm, dk×dm and dv×dm
• the score between xi and xj is the dot product between xi's query vector qi and the preceding element's key vector kj:
  score(xi, xj) = qi · kj
• a softmax over these scores gives the weights αi,j, and the output yi is the α-weighted sum of the value vectors vj (see the sketch after the scaled dot-product discussion below)


• the result of a dot product can be an arbitrarily large (positive or negative) value
• exponentiating large values
  - can lead to numerical issues
  - can cause an effective loss of gradients during training
• to avoid this, the dot products are scaled - the scaled dot-product approach
  - divides the result of the dot product by a factor related to the size of the embeddings, √dk
• each output yi is computed independently, so the process can be parallelized with matrix multiplications
• the resulting comparisons matrix, however, includes information that follows the query
  - the elements in the upper-triangular portion of the comparisons matrix are zeroed out (set to -∞) to eliminate any knowledge of words that follow in the sequence
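A hedged numpy sketch of causal, scaled dot-product self-attention as described above; the weight matrices, sizes, and inputs are illustrative, not a production implementation:

import numpy as np

def self_attention(X, WQ, WK, WV):
    # Causal scaled dot-product self-attention over a sequence X of shape (n, d_m).
    Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T          # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot products
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                          # mask out words that follow
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # row-wise softmax -> alpha_ij
    return alpha @ V                                # y_i = sum_j alpha_ij * v_j

d_m, d_k = 6, 4                                     # illustrative sizes
X = np.random.randn(5, d_m)                         # 5 input embeddings
WQ, WK, WV = (np.random.randn(d_k, d_m) for _ in range(3))
Y = self_attention(X, WQ, WK, WV)                   # one output vector per input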

Transformer Blocks
• each block combines
  - a self-attention layer
  - feedforward layers
  - residual connections
  - normalizing layers
• blocks can then be stacked

Multihead Attention
• different words in a sentence can relate to each other in many different ways
• it is difficult for the single self-attention layer in a transformer block to capture all of these different relations
• Transformers address this issue with multihead self-attention layers
  - sets of self-attention layers, called heads, that are parallel layers at the same depth in a model
  - each head has its own set of parameters WKi, WQi and WVi
  - each head can learn different aspects of the relationships that exist among inputs at the same level of abstraction
  - the heads are combined by concatenating the outputs from each head and reducing them to the original output dimension with another linear projection (see the sketch below)
• the rest of the Transformer block remains the same - feedforward layer, residual connections, and layer norms
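A compact, hedged continuation of the attention sketch above (it assumes the self_attention function from that sketch is in scope; sizes are illustrative): run the heads in parallel, concatenate their outputs, and project back to the model dimension.

import numpy as np

def multihead(X, heads, WO):
    # heads: list of (WQ, WK, WV) per head; WO: output projection of shape (d_m, h*d_v).
    outs = [self_attention(X, WQ, WK, WV) for WQ, WK, WV in heads]
    return np.concatenate(outs, axis=-1) @ WO.T     # concatenate heads, project to d_m

d_m, d_k, h = 6, 3, 2
X = np.random.randn(5, d_m)
heads = [tuple(np.random.randn(d_k, d_m) for _ in range(3)) for _ in range(h)]
WO = np.random.randn(d_m, h * d_k)                  # maps the concatenation back to d_m
Y = multihead(X, heads, WO)                         # shape (5, d_m), same as the input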


Positional Embeddings
• the order of the inputs does not, by itself, affect the output of the Transformer's attention computation
• the Transformer therefore combines each word embedding with a positional embedding
• positional embeddings are learned along with the other parameters during training
• Eg: the input for the word 'class' at position 3 is the embedding for 'class' + the embedding for position 3 (see the sketch below)

Transformers as Autoregressive Language Models
• train the model to predict the next word in a sequence - teacher forcing
• calculate the cross-entropy loss for each item in the sequence
• each training item can be processed in parallel
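A small sketch of combining a word embedding with a learned positional embedding; the matrices, sizes, and vocabulary index are illustrative:

import numpy as np

V_size, max_len, d_m = 10000, 512, 6              # illustrative sizes
E_word = np.random.randn(V_size, d_m)             # learned word embeddings
E_pos  = np.random.randn(max_len, d_m)            # learned positional embeddings

word_id, position = 42, 3                         # e.g. the word 'class' at position 3
x = E_word[word_id] + E_pos[position]             # input to the first transformer block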

Contextual Generation and Summarization


• simple variation on autoregressive generation

Overview of Text summarization and Topic Models

Text Analytics with Python – chapter 5 - Page 216


Text summarization
• important concept in text analytics
• practical application of context-based autoregressive generation
• task - take a full-length article and produce an effective summary
• used by businesses and analytical firms
  - to shorten and summarize huge documents of text such that they still retain their key essence or theme
  - to present this summarized information to consumers and clients
• to train a transformer for this task - use a corpus with full-length articles and their corresponding summaries
• append the summary to each full-length article in the corpus, with a unique marker δ separating the two (see the sketch below)
• an article-summary pair (x1, ..., xm), (y1, ..., yn) is converted to (x1, ..., xm, δ, y1, ..., yn), of length m + n + 1
• these long "sentences" are used to train an autoregressive language model with teacher forcing
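A sketch of turning one article-summary pair into a single long training sequence with a separator marker; the tokens and the marker name are placeholders, not the textbook's:

article = ["the", "match", "ended", "in", "a", "draw"]      # x1 ... xm (placeholder)
summary = ["match", "drawn"]                                # y1 ... yn (placeholder)
DELTA = "<summarize>"                                       # unique separator marker

training_sequence = article + [DELTA] + summary             # length m + 1 + n
# This long "sentence" is then fed to the autoregressive language model with
# teacher forcing, exactly like ordinary language-model training data.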

Text summarization techniques
• extract the key influential phrases from the documents
• extract the various diverse concepts or topics present in the documents
• summarize the documents to provide a gist that retains the important parts of the whole corpus
• popular techniques are
  - keyphrase extraction
  - topic modeling
  - automated document summarization

Topic modeling
• extract the main topics, themes, or concepts from a corpus of documents
• involves statistical and mathematical modeling techniques
• requires a more diverse set of documents - more topics or concepts are generated
  - a single document with a single concept may not have too many topics
• topic models are also known as probabilistic statistical models
• used extensively in text analytics and even bioinformatics
• use mathematical and statistical techniques to discover hidden and latent semantic structures in a corpus
• extract features from document terms
  - using mathematical structures and frameworks like matrix factorization and SVD (Singular Value Decomposition)
  - generate clusters or groups of terms that are distinguishable from each other
  - these clusters of words form topics or concepts
• used to
  - interpret the main themes of a corpus
  - make semantic connections among words that co-occur together frequently in various documents


Frameworks and algorithms to build topic models
1. Latent Semantic Indexing - popular
2. Latent Dirichlet Allocation - popular
3. Non-negative Matrix Factorization - a more recent technique that is extremely effective and gives excellent results

1. Latent Semantic Indexing (LSI)
• used for text summarization, information retrieval, and search
• uses the very popular SVD technique (Singular Value Decomposition)
• main principle - similar terms tend to be used in the same context and hence tend to co-occur more

2. Latent Dirichlet Allocation (LDA)
• a generative probabilistic model
• each document is assumed to have a combination of topics, similar to a probabilistic latent semantic indexing model
• the latent topics have a Dirichlet prior over them
(Figures: LDA plate notation; end-to-end LDA framework)

Steps involved
1. Initialize the necessary parameters.
2. For each document, randomly initialize each word to one of the K topics.
3. Start an iterative process and repeat it several times.
4. For each document D:
   a. For each word W in the document:
      - For each topic T:
        i. Compute P(T|D) - the proportion of words in D assigned to topic T.
        ii. Compute P(W|T) - the proportion of assignments to topic T over all documents having the word W.
      - Reassign word W to a topic T with probability P(T|D) * P(W|T), considering all other words and their topic assignments.
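A hedged sketch of fitting an LDA topic model with scikit-learn (not the book's code; assumes a recent scikit-learn, and the corpus and parameter values are placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "stock markets fell sharply today",
        "the dog chased the cat"]                      # placeholder corpus

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                            # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # K = 2 topics
doc_topics = lda.fit_transform(X)                      # topic proportions per document

terms = vec.get_feature_names_out()
for t, comp in enumerate(lda.components_):             # term weights per topic
    top = [terms[i] for i in comp.argsort()[-3:]]
    print("topic", t, top)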


3. Non-negative Matrix Factorization (NNMF)
• a matrix decomposition technique similar to SVD
• NNMF operates on non-negative matrices and works well for multivariate data
• NNMF can be formally defined like so:
  - given a non-negative matrix V, the objective is to find two non-negative matrix factors W and H such that, when multiplied, they approximately reconstruct V:
    V ≈ W × H
• to get to this approximation, use a cost function such as the Euclidean distance (L2 norm) between the two matrices, or the Frobenius norm, a slight modification of the L2 norm:
  minimize ||V - W H||
• NNMF often works the best, even with small corpora with few documents, but the right choice of framework depends on the type of data being dealt with
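A hedged sketch of this factorization with scikit-learn's NMF on a tf-idf document-term matrix (placeholder corpus; the Frobenius-norm objective is the library's default):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["the cat sat on the mat",
        "stock markets fell sharply today",
        "the dog chased the cat"]                 # placeholder corpus

vec = TfidfVectorizer(stop_words="english")
V = vec.fit_transform(docs)                       # non-negative document-term matrix

nmf = NMF(n_components=2, random_state=0)         # find W, H with V ~ W @ H
W = nmf.fit_transform(V)                          # document-topic weights
H = nmf.components_                               # topic-term weights
print(W.shape, H.shape)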
Potential Harms from Language Models
• can generate toxic language - hate speech and abuse, negative attitudes toward minority identities such as being Black or gay
• can amplify demographic and other biases in the training data
• can also be a tool for generating text for misinformation, phishing, radicalization, and other socially harmful activities
• privacy issues - models can leak information about their training data
• extra pre-training on non-toxic subcorpora seems to reduce the tendency to generate toxic language
• it is important to analyze the data used to pretrain a model, to understand toxicity and bias in generation, as well as privacy risks
