11 Text Mining
Data Mining
Edgar Roman-Rangel.
edgar.roman@itam.mx
Outline
Statistical NLP
Word Embedding
NLP
Understanding:
▶ Contents of text.
▶ Intention/sentiments.
▶ Languages (translation).
Bag-of-Words
Steps:
1. Find the dictionary that is common to the corpus.
2. For each document, count how many times each word appears.
3. Normalize the histogram into a distribution.
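A minimal sketch of these three steps in Python, assuming a whitespace-tokenized toy corpus (the corpus and the function name bag_of_words are illustrative):

from collections import Counter

# Toy corpus; in practice each document would be properly tokenized.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog runs fast in the yard",
]

# 1. Dictionary common to the corpus, ordered alphabetically.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def bag_of_words(document):
    """Steps 2 and 3: count word occurrences and normalize into a distribution."""
    counts = Counter(document.split())
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total for w in vocabulary]

for doc in corpus:
    print(bag_of_words(doc))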
tf-idf
It might happen that some words are common to several
documents, providing no discriminative power. Therefore, we can
opt to weight the importance of each word in the dictionary.
This weighting is inversely proportional to the frequency of each
word in the whole corpus, i.e., inverse to the number of documents
each word appears in.
The tf-idf representation of a sentence will be,
$d_{\text{tf-idf}} = d_{\text{BoW}} \odot w$,
where,
▶ $d_{\text{tf-idf}}$ and $d_{\text{BoW}}$ are the vector descriptors for the tf-idf
and BoW models, respectively.
▶ $w$ is the vector of weights, one per word.
▶ $\odot$ is the Hadamard (element-wise) operator.
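A possible sketch of this weighting in Python. The slide only states that the weights are inversely related to document frequency; the common choice w = log(N / df) is assumed here for illustration.

import math
from collections import Counter

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog runs fast in the yard",
]
vocabulary = sorted({w for doc in corpus for w in doc.split()})

# Document frequency: number of documents each word appears in.
df = {w: sum(w in doc.split() for doc in corpus) for w in vocabulary}

# One weight per word, inversely related to its document frequency.
idf = [math.log(len(corpus) / df[w]) for w in vocabulary]

def tf_idf(document):
    """BoW descriptor weighted element-wise (Hadamard product) by the idf weights."""
    counts = Counter(document.split())
    total = sum(counts[w] for w in vocabulary)
    bow = [counts[w] / total for w in vocabulary]
    return [b * w for b, w in zip(bow, idf)]

print(tf_idf(corpus[0]))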
n-grams
Instead of counting single words, the n-gram model counts
co-occurrences of n consecutive words.
For instance, the phrase “The big brown dog runs fast in the yard”
has the following 2-grams,
▶ The big,
▶ big brown,
▶ brown dog,
▶ etc.
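A small sketch that extracts the n-grams of a sentence (whitespace tokenization assumed):

def ngrams(sentence, n=2):
    """Return the sequences of n consecutive words in a sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The big brown dog runs fast in the yard", n=2))
# [('The', 'big'), ('big', 'brown'), ('brown', 'dog'), ...]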
Outline
Statistical NLP
Word Embedding
Word embedding
Word embedding: Numeric representation (vector) of words, in
which words with similar meaning result in similar representation.
They allow for vector space models (VSM), and they enable processing
text as sequences, in the style of time-series processing.
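A minimal illustration of the VSM idea, with made-up 3-dimensional vectors standing in for learned embeddings: words with similar meaning get similar vectors, measured here with cosine similarity.

import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Hypothetical embeddings, for illustration only.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.35]
car = [0.1, 0.9, 0.7]

print(cosine_similarity(cat, dog))  # high: similar meaning
print(cosine_similarity(cat, car))  # low: different meaning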
One-hot encoding
Each word is represented by a vector with a single 1 at its index, e.g., [0, 0, 0, 1, 0, 0].
It requires:
1. Knowing the total number of unique words in the corpus.
2. Ordering all unique words somehow, e.g., alphabetically.
3. Assigning an integer number to each unique word.
4. Representing integer indices as one-hot encoding vectors.
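A minimal sketch of these four steps (toy corpus and alphabetical ordering assumed):

corpus = ["the quick brown fox", "the lazy dog"]

# Steps 1-3: unique words, ordered alphabetically, mapped to integer indices.
vocabulary = sorted({w for doc in corpus for w in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Step 4: represent a word's integer index as a one-hot vector."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(one_hot("fox"))  # [0, 0, 1, 0, 0, 0]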
Word2Vec
Mikolov et al., 2013. “Efficient Estimation of Word
Representations in Vector Space”.
Given a set of phrases, we analyze their words and, for each
word, learn to predict the probability of the following word. E.g.,
The quick brown fox ? over the lazy dog.
Word2Vec network
Word2Vec variants
Skip-gram
Predict context words from a target, e.g.,
? ? ? ? jumps ? ? ? ? .
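One possible way to train both variants is with the gensim library (an assumption; the slides do not prescribe an implementation). The flag sg=1 selects skip-gram, while sg=0 trains the CBOW variant, which predicts a target word from its context.

from gensim.models import Word2Vec

sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the dog runs fast in the yard".split(),
]

# sg=1: skip-gram (predict context words from a target word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["dog"])               # embedding vector for "dog"
print(model.wv.most_similar("dog"))  # words with the most similar vectors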
Other models
▶ GloVe.
▶ fastText.
▶ Sentence2Vec.
▶ Doc2Vec.
Q&A
Thank you!
edgar.roman@itam.mx