
11 Text Mining


Mining Text and NLP

Data Mining

Edgar Roman-Rangel.
edgar.roman@itam.mx

Department of Computer Science.


Instituto Tecnológico Autónomo de México, ITAM.

Outline

Statistical NLP

Word Embedding


NLP

Natural Language Processing (NLP).

The discipline concerned with making computers process human language.

Understanding:
▶ Contents of text.
▶ Intention/sentiments.
▶ Languages (translation).

NLP requires mathematical text representations and language models.


Statistical models for text


Several (old) models for text representation were developed using statistical analysis.

Some of the most popular models are:
▶ Bag-of-Words (BoW).
▶ Term frequency - Inverse document frequency (tf-idf).
▶ n-grams.

All of these models generate global descriptors of text. We can adapt them to create local descriptors.


Bag-of-Words

For a given sentence (a.k.a. document), the BoW model creates a histogram of its word distribution, where that distribution is computed over a dictionary D common to the whole corpus (dataset).

Steps:
1. Find the dictionary that is common to the corpus.
2. For each document, count how many times each word appears.
3. Normalize the histogram into a distribution.

Optional: before computing the histogram, remove stop-words and perform stemming or lemmatization.
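
A minimal sketch of these three steps in Python (the toy corpus, the whitespace tokenization, and all variable names are illustrative assumptions, not part of the slides):

    from collections import Counter

    # Toy corpus: each document is a plain string (illustrative only).
    corpus = ["the dog runs fast", "the big dog sleeps", "a fast brown fox"]

    # Step 1: build the dictionary D common to the whole corpus.
    tokenized = [doc.lower().split() for doc in corpus]
    dictionary = sorted(set(word for doc in tokenized for word in doc))

    def bow(doc_tokens, dictionary):
        """Steps 2 and 3: count how many times each word appears, then normalize."""
        counts = Counter(doc_tokens)
        histogram = [counts[word] for word in dictionary]
        total = sum(histogram)
        return [c / total for c in histogram]

    descriptors = [bow(doc, dictionary) for doc in tokenized]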


tf-idf
It might happen that some words are common to several documents, providing no discriminative power. Therefore, we can opt to weight the importance of each word in the dictionary.
This weighting is inversely proportional to the frequency of each word in the whole corpus, i.e., inverse to the number of documents each word appears in.
The tf-idf representation of a sentence is

d_{tf-idf} = d_{BoW} ⊙ w,

where:
▶ d_{tf-idf} and d_{BoW} are the vector descriptors for the tf-idf and BoW models, respectively.
▶ w is the vector of weights, one per word.
▶ ⊙ is the Hadamard (element-wise) product.
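
Continuing the BoW sketch above, a hedged example of this weighting; the logarithmic idf below is a common choice but an assumption, since the slide only states that the weight is inversely related to document frequency:

    import math

    # Document frequency: the number of documents each dictionary word appears in.
    N = len(tokenized)
    doc_freq = [sum(1 for doc in tokenized if word in doc) for word in dictionary]

    # One weight per dictionary word (log-based idf, a common variant).
    w = [math.log(N / df) for df in doc_freq]

    # Hadamard (element-wise) product of each BoW descriptor with the weights.
    tfidf_descriptors = [[d_i * w_i for d_i, w_i in zip(d, w)] for d in descriptors]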

n-grams
Instead of counting single words, the n-gram model counts
co-occurrences of n consecutive words.
For instance, the phrase “The big brown dog runs fast in the yard”
has the following 2-grams:
▶ The big,
▶ big brown,
▶ brown dog,
▶ etc.

and the following 3-grams:
▶ The big brown,
▶ big brown dog,
▶ brown dog runs,
▶ etc.
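
A small sketch of n-gram extraction for this example (the helper function is an illustrative assumption):

    def ngrams(tokens, n):
        # All runs of n consecutive words in the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    phrase = "The big brown dog runs fast in the yard".split()
    bigrams = ngrams(phrase, 2)   # [('The', 'big'), ('big', 'brown'), ...]
    trigrams = ngrams(phrase, 3)  # [('The', 'big', 'brown'), ...]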

Outline

Statistical NLP

Word Embedding


Word embedding
Word embedding: a numeric representation (vector) of words, in which words with similar meanings result in similar representations.
Word embeddings allow for vector space models (VSM), e.g.,

v_{king} − v_{man} + v_{woman} ≈ v_{queen}.

Features for a given word must be extracted from the context of the word itself.
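
As a hedged illustration of this vector arithmetic (the tiny embedding table below is made up; real embeddings come from a trained model, and the analogy is resolved with a nearest-neighbour search under cosine similarity):

    import numpy as np

    # Toy embeddings; the values are illustrative, not learned.
    vocab = {
        "king":  np.array([0.8, 0.9, 0.1]),
        "man":   np.array([0.7, 0.2, 0.1]),
        "woman": np.array([0.6, 0.2, 0.8]),
        "queen": np.array([0.7, 0.9, 0.8]),
    }

    query = vocab["king"] - vocab["man"] + vocab["woman"]

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # The word whose vector is most similar to the query should be "queen".
    nearest = max(vocab, key=lambda word: cosine(vocab[word], query))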

Time-series processing

Once each word is represented as a numeric vector,

▶ A phrase will be represented as a collection of vectors.
▶ Phrases can be processed as a multi-variate time-series.
▶ Estimate global descriptors for the whole phrase.
▶ We can use these time-series for classification, regression, or auto-regression.
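
A minimal sketch of the last two points, assuming we already have an embedding lookup table (the table and its values are illustrative): stack the word vectors of a phrase into a matrix, one row per time step, and reduce it to a global descriptor, e.g., by averaging.

    import numpy as np

    # Hypothetical embedding table: word -> vector (values are illustrative).
    embeddings = {
        "the":  np.array([0.1, 0.0]),
        "dog":  np.array([0.9, 0.3]),
        "runs": np.array([0.2, 0.8]),
        "fast": np.array([0.3, 0.7]),
    }

    phrase = "the dog runs fast".split()

    # The phrase as a multi-variate time-series: shape (time steps, embedding dim).
    series = np.stack([embeddings[word] for word in phrase])

    # A simple global descriptor for the whole phrase: the mean vector.
    global_descriptor = series.mean(axis=0)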


Initial word representation


For the estimation of robust embeddings, we start from simpler
numeric representations.
One-hot encoding

[0, 0, 0, 1, 0, 0]

It requires:
1. Knowing the total number of unique words in the corpus.
2. Ordering all unique words somehow, e.g., alphabetically.
3. Assigning an integer number to each unique word.
4. Representing integer indices as one-hot encoding vectors.
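
These steps translate almost directly into code; a minimal sketch with an assumed toy vocabulary:

    import numpy as np

    # Toy corpus vocabulary (illustrative).
    corpus_words = ["the", "quick", "brown", "fox", "lazy", "dog"]

    # Steps 1-3: unique words, ordered alphabetically, mapped to integer indices.
    vocab = sorted(set(corpus_words))
    index = {word: i for i, word in enumerate(vocab)}

    # Step 4: represent each integer index as a one-hot vector.
    def one_hot(word):
        v = np.zeros(len(vocab))
        v[index[word]] = 1.0
        return v

    one_hot("lazy")   # array([0., 0., 0., 1., 0., 0.])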


Word2Vec
Mikolov et al., 2013. “Efficient Estimation of Word
Representations in Vector Space”.

Given a set of phrases, we analyze their words and, for each word, learn to predict the probability of the following word. E.g.,
The quick brown fox ? over the lazy dog.


Word2Vec network

It is up to us to define the length of the word embedding.

▶ Word embedding: h = x^T W_e,
▶ Next-word probability: y = a(h^T W_o), where a(·) is the softmax function.

There are two main variants: CBOW and Skip-gram.
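
A hedged NumPy sketch of this forward pass (random, untrained weights and illustrative dimensions; learning W_e and W_o by backpropagation is not shown):

    import numpy as np

    vocab_size, embed_dim = 6, 3                     # illustrative sizes
    rng = np.random.default_rng(0)
    W_e = rng.normal(size=(vocab_size, embed_dim))   # input (embedding) matrix
    W_o = rng.normal(size=(embed_dim, vocab_size))   # output matrix

    x = np.zeros(vocab_size)
    x[3] = 1.0                                       # one-hot input word

    h = x @ W_e                                      # word embedding: h = x^T W_e
    scores = h @ W_o                                 # h^T W_o
    y = np.exp(scores) / np.exp(scores).sum()        # softmax: next-word probabilities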



Word2Vec variants

CBOW: continuous bag-of-words
Predict the target word from its context, e.g.,
The quick brown fox ? over the lazy dog.

Skip-gram
Predict context words from a target, e.g.,
? ? ? ? jumps ? ? ? ? .
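
A small sketch of how the two variants turn the same sentence into training examples (the window size of 2 and the pair format are assumptions):

    sentence = "The quick brown fox jumps over the lazy dog".split()
    window = 2

    def context(i):
        # Up to `window` words on each side of position i.
        return sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]

    # CBOW: (context words -> target word).
    cbow_pairs = [(context(i), target) for i, target in enumerate(sentence)]

    # Skip-gram: (target word -> each context word).
    skipgram_pairs = [(target, ctx)
                      for i, target in enumerate(sentence)
                      for ctx in context(i)]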


Other models

▶ GloVe.
▶ fastText.
▶ Sentence2Vec.
▶ Doc2Vec.


Q&A

Thank you!

edgar.roman@itam.mx

