From bag of texts to bag of clusters
Evgeniy Terpil / Pavel Khudan (Data Scientists / NLP Engineer at YouScan)
We review modern approaches to text clustering and visualization, from classic K-means over TF-IDF to Deep Learning text representations. As a practical example, we analyze a set of social media messages and try to identify the main topics of discussion.
Все материалы: http://datascience.in.ua/report2017
7. Semantic representation of texts
1. Text (semi/un)supervised classification
2. Document retrieval
3. Topic insights
4. Text similarity/relatedness
8. Requirements
• Vector representation is handy
• Descriptive (not distinctive) features
• Language/style/genre independence
• Robustness to language/speech variance (word- and phrase-level synonymy, word order, newly emerging words and entities)
9. Prerequisites
• Token-based methods, although char-based are more robust
• Preprocessing and unification
• Tokenization
• Lemmatization?
10. BoW, Tf-idf and more
• Bag of Words: one-hot encoding over the observed dictionary
• TF-IDF: ‘term frequency’ × ‘inverse document frequency’ term weighting (including different normalization schemes)
• Bag of n-grams: collocations carry more specific senses
• Singular Value Decomposition (SVD) of the original term-document matrix (compression with little loss of relevant information):
◦ resolves inter-document relations: similarity
◦ resolves inter-term relations: synonymy and polysemy
◦ reduces dimensionality
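The TF-IDF → SVD chain above can be sketched with scikit-learn; the toy documents and parameter values are illustrative, not from the talk:

```python
# TF-IDF term weighting followed by truncated SVD (a.k.a. LSA).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "google acquired kaggle",
    "kaggle hosts data science competitions",
    "new course on machine learning",
    "machine learning course starts today",
]

# Term weighting over word unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)   # sparse term-document matrix

# Compress to a dense low-rank representation with little loss
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)         # shape: (n_docs, 2)
print(X_lsa.shape)
```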
11. BoW, Tf-idf and more
Pros:
- easily interpretable
- easy to implement
- parameters are straightforward
Cons:
- not robust to language variance
- scales badly
- vulnerable to overfitting
12. [Visualization: TF-IDF + SVD + t-SNE map of the social-media messages; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta: women's search queries", "Data Science Lab"]
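A hedged sketch of the final projection step: mapping SVD-compressed document vectors to a 2-D plane with t-SNE. Random vectors stand in for real TF-IDF + SVD features:

```python
# Project high-dimensional document vectors to 2-D for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_lsa = rng.normal(size=(20, 50))   # stand-in for SVD-compressed tf-idf vectors

# perplexity must stay below the sample count; values here are illustrative
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
X_2d = tsne.fit_transform(X_lsa)
print(X_2d.shape)
```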
15. K-means
• Separate all observations into K groups of equal variance
• Iteratively reassign points to the nearest cluster mean to minimize the inertia: the within-cluster sum-of-squares criterion
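A minimal K-means sketch over toy 2-D points (scikit-learn; data and K are illustrative):

```python
# K-means: iteratively reassign points to the nearest mean, minimizing inertia.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # two tight groups of two points each
print(km.inertia_)   # within-cluster sum of squared distances
```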
16. Hierarchical clustering
• Build a hierarchy of clusters
• Bottom-up or top-down approach (agglomerative or divisive clustering)
• Various metrics for cluster dissimilarity
• Cluster count and contents depend on the chosen dissimilarity threshold
[Dendrogram example: cutting at the chosen threshold yields clusters a, bc, def]
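An agglomerative (bottom-up) sketch where the cluster count is not fixed up front but follows from a dissimilarity threshold, as described above (toy 1-D data, illustrative threshold):

```python
# Build the hierarchy bottom-up and cut it at a chosen distance threshold.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])

# n_clusters=None lets the distance threshold decide the cluster count
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                              linkage="average")
labels = agg.fit_predict(X)
print(labels)   # three groups: {0.0, 0.1}, {1.0, 1.1}, {5.0}
```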
17. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Find areas of high density separated by areas of low density of samples
• Involves two parameters: epsilon and minimum points
• Epsilon sets the maximum distance for two points to be considered close enough
• Minimum points is the number of mutually close points required to form a new cluster
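The two parameters can be seen in a minimal scikit-learn sketch (toy 1-D data; the eps and min_samples values are illustrative):

```python
# DBSCAN: eps is the neighborhood radius, min_samples the point count
# needed to form a dense region; sparse points are labeled noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [20.0]])

db = DBSCAN(eps=0.3, min_samples=3).fit(X)
print(db.labels_)   # two dense clusters plus one noise point (-1)
```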
22. Word embeddings
Pros:
- semantics is included
- moderately robust to language variance
- scales better, including OOV
Cons:
- embeddings source and quality?
- vector relations (distance measures, separating planes) carry the meaning, not the vector values themselves
- meaning degrades quickly on moderate-to-large texts
- interpretation is tedious work
23. [Visualization: the same messages mapped via mean Word2Vec vectors; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta: women's search queries", "Data Science Lab"]
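A hedged sketch of the "Word2Vec mean" text representation: a text vector is the average of its word embeddings. Toy 4-dimensional vectors stand in for trained word2vec ones, and the `text_vector` helper is hypothetical, not from the talk:

```python
# Mean-of-embeddings text representation.
import numpy as np

emb = {
    "google":   np.array([1.0, 0.0, 0.0, 0.0]),
    "acquired": np.array([0.0, 1.0, 0.0, 0.0]),
    "kaggle":   np.array([0.0, 0.0, 1.0, 0.0]),
}

def text_vector(tokens, emb):
    """Mean of known-word embeddings; zero vector if every token is OOV."""
    known = [emb[t] for t in tokens if t in emb]
    if not known:
        return np.zeros(next(iter(emb.values())).shape)
    return np.mean(known, axis=0)

v = text_vector(["google", "acquired", "kaggle", "unk"], emb)
print(v)  # mean of the three known vectors; "unk" is skipped
```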
27. Sense clusters
[Figure: the word «картошка» (potato) mapped to a ~3000-dimensional vector of similarities to sense clusters such as «еда» (food), «овощи» (vegetables), «время» (time); most entries are zero]
• Find K cluster centers over target vocabulary embeddings
• Calculate distances (cosine measure) to cluster centers for each vocabulary word, ignore relatively small ones
• Use distances as a new K-dimensional feature vector (word embedding)
• Aggregate embeddings
• Normalize?
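The steps above can be sketched as follows; random vectors stand in for real vocabulary embeddings, and K and the 0.2 similarity cut-off are illustrative assumptions:

```python
# Sense-cluster features: cluster vocabulary embeddings into K centers,
# then represent each word by its cosine similarity to every center,
# zeroing out relatively small values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(100, 16))   # stand-in for word2vec vectors

K = 5
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(vocab_emb)

sims = cosine_similarity(vocab_emb, km.cluster_centers_)   # (n_words, K)
sims[sims < 0.2] = 0.0   # ignore relatively small similarities

print(sims.shape)   # each word now has a K-dimensional sense vector
```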
28. Sense clusters
Pros:
- semantics is now valuable (expressed by concrete values in vectors)
- meaning now accumulates better in text vectors
- it is possible to retrofit clusters onto sense interpretations for readability
Cons:
- inherited from word embeddings
- chained complexity
- additional parameters to fiddle with
- vector length is higher (around 3k dimensions) -> bigger, cumbersome, heavier
29. [Visualization: the same messages mapped via mean Word2Sense vectors; clusters include "ODS course on Habr", "Google acquired Kaggle", "cancer tumor recognition", "Yandex Crypta: women's search queries", "Data Science Lab"]
43. Sequential Denoising Autoencoder (SDAE)
Corrupt the sentence by deleting each word with probability p0 ∈ [0, 1] and swapping bigrams with probability px ∈ [0, 1], then predict the original sentence.
[Figure: the sentence «Google купил сервис для исследователей» ("Google bought a service for researchers") corrupted by word deletion and by bigram swapping]
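A hedged sketch of this corruption step; the `corrupt` helper and its exact swap policy (non-overlapping adjacent bigrams) are assumptions, not the talk's implementation:

```python
# SDAE-style input corruption: delete each word with probability p0 and
# swap adjacent bigrams with probability px; the decoder is then trained
# to reconstruct the original sentence from the corrupted one.
import random

def corrupt(tokens, p0=0.3, px=0.3, rnd=None):
    rnd = rnd or random.Random(0)
    kept = [t for t in tokens if rnd.random() >= p0]   # word deletion
    out = list(kept)
    i = 0
    while i < len(out) - 1:                            # bigram swapping
        if rnd.random() < px:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2                                     # non-overlapping swaps
        else:
            i += 1
    return out

sent = ["Google", "купил", "сервис", "для", "исследователей"]
print(corrupt(sent))
```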
44. [Visualization: the same messages mapped via SDAE representations; clusters include "ODS course on Habr", "Google acquired Kaggle", "Yandex Crypta: women's search queries", "Data Science Lab"]