
11 Text Mining


Mining Text and NLP

Data Mining

Edgar Roman-Rangel.
edgar.roman@itam.mx

Department of Computer Science.


Instituto Tecnológico Autónomo de México, ITAM.

Outline

Statistical NLP

Word Embedding


NLP

Natural Language Processing (NLP).

The discipline concerned with making computers process human language.

Understanding:
▶ Contents of text.
▶ Intention/sentiments.
▶ Languages (translation).

NLP requires mathematical text representations and language models.


Statistical models for text


Several (old) models for text representation were developed using statistical analysis.

Some of the most popular models are:
▶ Bag-of-Words (BoW).
▶ Term frequency - Inverse document frequency (tf-idf).
▶ n-grams.

All of these models generate global descriptors of text. We can adapt them to create local descriptors.


Bag-of-Words

For a given sentence (a.k.a. document), the BoW model creates a histogram of its word distribution, where that distribution is computed over a dictionary D common to the whole corpus (dataset).

Steps:
1. Find the dictionary that is common to the corpus.
2. For each document, count how many times each word appears.
3. Normalize the histogram into a distribution.

Optional: before computing the histogram, remove stop-words and perform stemming or lemmatization.
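
A minimal sketch of these three steps in Python (the toy corpus, the whitespace tokenization, and all variable names are illustrative assumptions, not part of the slides):

    from collections import Counter

    # Toy corpus: each document is a plain string (illustrative only).
    corpus = ["the dog runs fast", "the big dog sleeps", "a fast brown fox"]

    # Step 1: build the dictionary D common to the whole corpus.
    tokenized = [doc.lower().split() for doc in corpus]
    dictionary = sorted(set(word for doc in tokenized for word in doc))

    def bow(doc_tokens, dictionary):
        """Steps 2 and 3: count how many times each word appears, then normalize."""
        counts = Counter(doc_tokens)
        histogram = [counts[word] for word in dictionary]
        total = sum(histogram)
        return [c / total for c in histogram]

    descriptors = [bow(doc, dictionary) for doc in tokenized]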


tf-idf
It might happen that some words are common to several documents, providing no discriminative power. Therefore, we can opt to weight the importance of each word in the dictionary.
This weighting is inversely proportional to the frequency of each word in the whole corpus, i.e., inverse to the number of documents each word appears in.
The tf-idf representation of a sentence is

d_{tf-idf} = d_{BoW} ⊙ w,

where:
▶ d_{tf-idf} and d_{BoW} are the vector descriptors for the tf-idf and BoW models, respectively.
▶ w is the vector of weights, one per word.
▶ ⊙ is the Hadamard (element-wise) product.
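
Continuing the BoW sketch above, a hedged example of this weighting; the logarithmic idf below is a common choice but an assumption, since the slide only states that the weight is inversely related to document frequency:

    import math

    # Document frequency: the number of documents each dictionary word appears in.
    N = len(tokenized)
    doc_freq = [sum(1 for doc in tokenized if word in doc) for word in dictionary]

    # One weight per dictionary word (log-based idf, a common variant).
    w = [math.log(N / df) for df in doc_freq]

    # Hadamard (element-wise) product of each BoW descriptor with the weights.
    tfidf_descriptors = [[d_i * w_i for d_i, w_i in zip(d, w)] for d in descriptors]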

n-grams
Instead of counting single words, the n-gram model counts
co-occurrences of n consecutive words.
For instance, the phrase “The big brown dog runs fast in the yard”
has the following 2-grams:
▶ The big,
▶ big brown,
▶ brown dog,
▶ etc.

and the following 3-grams:
▶ The big brown,
▶ big brown dog,
▶ brown dog runs,
▶ etc.
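
A small sketch of n-gram extraction for this example (the helper function is an illustrative assumption):

    def ngrams(tokens, n):
        # All runs of n consecutive words in the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    phrase = "The big brown dog runs fast in the yard".split()
    bigrams = ngrams(phrase, 2)   # [('The', 'big'), ('big', 'brown'), ...]
    trigrams = ngrams(phrase, 3)  # [('The', 'big', 'brown'), ...]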

Outline

Statistical NLP

Word Embedding


Word embedding
Word embedding: a numeric representation (vector) of words, in which words with similar meanings result in similar representations.
Word embeddings allow for vector space models (VSM), e.g.,

v_{king} − v_{man} + v_{woman} ≈ v_{queen}.

Features for a given word must be extracted from the context of the word itself.
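
As a hedged illustration of this vector arithmetic (the tiny embedding table below is made up; real embeddings come from a trained model, and the analogy is resolved with a nearest-neighbour search under cosine similarity):

    import numpy as np

    # Toy embeddings; the values are illustrative, not learned.
    vocab = {
        "king":  np.array([0.8, 0.9, 0.1]),
        "man":   np.array([0.7, 0.2, 0.1]),
        "woman": np.array([0.6, 0.2, 0.8]),
        "queen": np.array([0.7, 0.9, 0.8]),
    }

    query = vocab["king"] - vocab["man"] + vocab["woman"]

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # The word whose vector is most similar to the query should be "queen".
    nearest = max(vocab, key=lambda word: cosine(vocab[word], query))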

Time-series processing

Once each word is represented as a numeric vector,

▶ A phrase will be represented as a collection of vectors.
▶ Phrases can be processed as a multi-variate time-series.
▶ Estimate global descriptors for the whole phrase.
▶ We can use these time-series for classification, regression, or auto-regression.
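
A minimal sketch of the last two points, assuming we already have an embedding lookup table (the table and its values are illustrative): stack the word vectors of a phrase into a matrix, one row per time step, and reduce it to a global descriptor, e.g., by averaging.

    import numpy as np

    # Hypothetical embedding table: word -> vector (values are illustrative).
    embeddings = {
        "the":  np.array([0.1, 0.0]),
        "dog":  np.array([0.9, 0.3]),
        "runs": np.array([0.2, 0.8]),
        "fast": np.array([0.3, 0.7]),
    }

    phrase = "the dog runs fast".split()

    # The phrase as a multi-variate time-series: shape (time steps, embedding dim).
    series = np.stack([embeddings[word] for word in phrase])

    # A simple global descriptor for the whole phrase: the mean vector.
    global_descriptor = series.mean(axis=0)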


Initial word representation


For the estimation of robust embeddings, we start from simpler
numeric representations.
One-hot encoding

[0, 0, 0, 1, 0, 0]

It requires:
1. Knowing the total number of unique words in the corpus.
2. Ordering all unique words somehow, e.g., alphabetically.
3. Assigning an integer number to each unique word.
4. Representing integer indices as one-hot encoding vectors.
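
These steps translate almost directly into code; a minimal sketch with an assumed toy vocabulary:

    import numpy as np

    # Toy corpus vocabulary (illustrative).
    corpus_words = ["the", "quick", "brown", "fox", "lazy", "dog"]

    # Steps 1-3: unique words, ordered alphabetically, mapped to integer indices.
    vocab = sorted(set(corpus_words))
    index = {word: i for i, word in enumerate(vocab)}

    # Step 4: represent each integer index as a one-hot vector.
    def one_hot(word):
        v = np.zeros(len(vocab))
        v[index[word]] = 1.0
        return v

    one_hot("lazy")   # array([0., 0., 0., 1., 0., 0.])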


Word2Vec
Mikolov et al., 2013. “Efficient Estimation of Word
Representations in Vector Space”.

Given a set of phrases, we analyze their words and, for each word, learn to predict the probability of the following word. E.g.,
The quick brown fox ? over the lazy dog.


Word2Vec network

It is up to us to define the length of the word embedding.

▶ Word embedding: h = x^T W_e,
▶ Next-word probability: y = a(h^T W_o), where a(·) is the softmax function.

There are two main variants: CBOW and Skip-gram.
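
A hedged NumPy sketch of this forward pass (random, untrained weights and illustrative dimensions; learning W_e and W_o by backpropagation is not shown):

    import numpy as np

    vocab_size, embed_dim = 6, 3                     # illustrative sizes
    rng = np.random.default_rng(0)
    W_e = rng.normal(size=(vocab_size, embed_dim))   # input (embedding) matrix
    W_o = rng.normal(size=(embed_dim, vocab_size))   # output matrix

    x = np.zeros(vocab_size)
    x[3] = 1.0                                       # one-hot input word

    h = x @ W_e                                      # word embedding: h = x^T W_e
    scores = h @ W_o                                 # h^T W_o
    y = np.exp(scores) / np.exp(scores).sum()        # softmax: next-word probabilities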



Word2Vec variants

CBOW: continuous bag-of-words
Predict the target word from its context, e.g.,
The quick brown fox ? over the lazy dog.

Skip-gram
Predict context words from a target, e.g.,
? ? ? ? jumps ? ? ? ? .
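
A small sketch of how the two variants turn the same sentence into training examples (the window size of 2 and the pair format are assumptions):

    sentence = "The quick brown fox jumps over the lazy dog".split()
    window = 2

    def context(i):
        # Up to `window` words on each side of position i.
        return sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]

    # CBOW: (context words -> target word).
    cbow_pairs = [(context(i), target) for i, target in enumerate(sentence)]

    # Skip-gram: (target word -> each context word).
    skipgram_pairs = [(target, ctx)
                      for i, target in enumerate(sentence)
                      for ctx in context(i)]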


Other models

▶ GloVe.
▶ fastText.
▶ Sentence2Vec.
▶ Doc2Vec.


Q&A

Thank you!

edgar.roman@itam.mx

