Topic Modelling Using NLP
ABSTRACT
Topic modeling is a powerful technique for unsupervised analysis of large document collections. Topic models represent latent topics in text using hidden random variables and discover that structure with posterior inference. Typical applications include topic categorization, keyword extraction, and similarity search over broad fields of text. In this work, a dataset of 200 abstracts falling under four topics is collected from journals in two different domains for the task of tagging journal abstracts. The document models are built using LDA (Latent Dirichlet Allocation) with Collapsed Variational Bayes (CVB0) and Gibbs sampling, and the built models are then used to extract appropriate tags for the abstracts. The performance of the built models is analyzed using the perplexity evaluation measure, and it is observed that Gibbs sampling outperforms CVB0.
1.Introduction
One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such large text collections include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and e-mails of customer complaints.
Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns, and it is very hard to manually read through such large volumes and compile the topics.
An automated algorithm that can read through the text documents and automatically output the topics discussed is therefore required.
In this tutorial, we will take a real example of the ’20 Newsgroups’ dataset and use LDA
to extract the naturally discussed topics.
I will be using Latent Dirichlet Allocation (LDA) from the Gensim package along with Mallet's implementation (via Gensim). Mallet has an efficient implementation of LDA; it is known to run faster and to give better topic segregation.
We will also extract the volume and percentage contribution of each topic to get an idea
of how important a topic is.
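As a preview, here is a minimal sketch of that workflow using Gensim's built-in LdaModel (the Mallet wrapper can be swapped in similarly); the preprocessing shown is deliberately simple and the parameter values are illustrative, not tuned:

from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Load the raw newsgroup posts (headers, footers and quotes stripped to reduce noise).
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
texts = [simple_preprocess(doc) for doc in newsgroups.data]

# Map each token to an integer id and build the bag-of-words corpus.
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

# Train LDA; num_topics is a hyperparameter we choose (20 here, matching the 20 newsgroups).
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=20, passes=5, random_state=42)
print(lda.print_topics(num_topics=5, num_words=8))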
What is Topic Modeling?
Topic modelling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts. Although that is indeed true, it is also a pretty useless definition by itself. Let's define topic modeling in more practical terms.
Definitions:
C: collection of documents containing N texts.
V: vocabulary (the set of unique words in the collection)
Dimensionality Reduction
Topic modeling is a form of dimensionality reduction. Rather than representing a text T
in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the
text in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}. Notice that
we’re using Topics to represent the set of all topics.
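As a concrete illustration of the two representations (the words, topics, and weights below are invented purely for this example):

word_space = {"economy": 3, "bank": 2, "inflation": 1}          # {Word_i: count(Word_i, T)}
topic_space = {"finance": 0.7, "politics": 0.2, "sports": 0.1}  # {Topic_i: weight(Topic_i, T)}

# The topic-space representation has one entry per topic rather than one per vocabulary word,
# which is what makes this a form of dimensionality reduction.
print(len(word_space), len(topic_space))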
Unsupervised Learning
Topic modeling can be easily compared to clustering. As in the case of clustering, the
number of topics, like the number of clusters, is a hyperparameter. By doing topic
modeling we build clusters of words rather than clusters of texts. A text is thus a mixture
of all the topics, each having a certain weight.
A Form of Tagging
If document classification is assigning a single category to a text, topic modeling is
assigning multiple tags to a text. A human expert can label the resulting topics with
human-readable labels and use different heuristics to convert the weighted topics to a
set of tags.
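One simple heuristic for that last step, sketched here with an invented threshold and invented topic labels, is to keep every topic whose weight exceeds the threshold:

# Hypothetical heuristic: tag a text with every topic whose weight exceeds a threshold.
def topics_to_tags(topic_weights, threshold=0.15):
    # topic_weights: dict mapping a human-readable topic label to its weight in the text
    return [topic for topic, weight in topic_weights.items() if weight >= threshold]

print(topics_to_tags({"finance": 0.55, "politics": 0.30, "sports": 0.05}))  # ['finance', 'politics']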
Why is Topic Modeling useful?
Topic modeling proves useful in several scenarios, such as topic categorization, tagging and keyword extraction, and similarity search over large document collections.
2.Literature Survey
All topic models are based on the same basic assumption:
• each document consists of a mixture of topics, and
• each topic consists of a collection of words.
In other words, topic models are built around the idea that the semantics of our
document are actually being governed by some hidden, or “latent,” variables that we are
not observing. As a result, the goal of topic modeling is to uncover these latent variables
— topics — that shape the meaning of our document and corpus. The rest of this blog
post will build up an understanding of how different topic models uncover these latent
topics.
There are several algorithms for doing topic modeling. The most popular ones include
• LDA – Latent Dirichlet Allocation – the one we'll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models.
• LSA or LSI – Latent Semantic Analysis or Latent Semantic Indexing – uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra.
• NMF – Non-Negative Matrix Factorization – based on Linear Algebra.
Here are some things all of these algorithms have in common:
• They take the number of topics (n_topics) as a parameter. None of the algorithms can infer the number of topics in the document collection.
• They all take as input the Document-Word Matrix (or Document-Term Matrix), where DWM[i][j] = the number of occurrences of word_j in document_i.
• They all output two matrices: WTM (Word Topic Matrix) and TDM (Topic Document Matrix). These matrices are significantly smaller than the original, and the result of their multiplication should be as close as possible to the original DWM (see the sketch after this list).
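Here is a rough sketch of that factorization using NMF from Scikit-Learn, with a tiny invented corpus just to make the matrix shapes visible:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

docs = ["the cat sat on the mat", "dogs and cats are pets", "stocks fell as markets dropped"]
dwm = CountVectorizer().fit_transform(docs)    # Document-Word Matrix: (n_documents, n_words)

nmf = NMF(n_components=2, random_state=0)      # n_components = number of topics (chosen by us)
doc_topic = nmf.fit_transform(dwm)             # document-topic weights: (n_documents, n_topics)
topic_word = nmf.components_                   # topic-word weights: (n_topics, n_words)

# doc_topic @ topic_word is a low-rank approximation of the original matrix dwm.
print(dwm.shape, doc_topic.shape, topic_word.shape)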
The purpose of this guide is not to describe each algorithm in great detail, but rather to give a practical overview along with concrete implementations in Python using Scikit-Learn and Gensim. We'll go over each algorithm in more depth later in this tutorial.
Next, we’re going to use Scikit-Learn and Gensim to perform topic modeling on a
corpus.
The process of learning, recognizing, and extracting these topics across a collection of
documents is called topic modeling.
3.Methodology
The first step is generating our document-term matrix. Given m documents and n words
in our vocabulary, we can construct an m × n matrix A in which each row represents a
document and each column represents a word. In the simplest version of LSA, each
entry can simply be a raw count of the number of times the j-th word appeared in the i-th
document. In practice, however, raw counts do not work particularly well because they
do not account for the significance of each word in the document. For example, the
word “nuclear” probably informs us more about the topic(s) of a given document than
the word
“test.”
Consequently, LSA models typically replace raw counts in the document-term matrix
with a tf-idf score. Tf-idf, or term frequency-inverse document frequency, assigns a
weight for term j in document i as follows:
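In its standard form (the exact weighting variant can differ between implementations), the tf-idf weight is:

$w_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{\mathrm{df}_j}$

where tf_{i,j} is the number of occurrences of term j in document i, df_j is the number of documents containing term j, and N is the total number of documents.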
Intuitively, a term has a large weight when it occurs frequently in a document but
infrequently across the corpus. The word “build” might appear often in a document, but
because it’s likely fairly common in the rest of the corpus, it will not have a high tf-idf
score. However, if the word “gentrification” appears often in a document, because it is
rarer in the rest of the corpus, it will have a higher tf-idf score.
Once we have our document-term matrix A, we can start thinking about our latent
topics. Here’s the thing: in all likelihood, A is very sparse, very noisy, and very
redundant across its many dimensions. As a result, to find the few latent topics that
capture the relationships among the words and documents, we want to perform
dimensionality reduction on A.
This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of three separate matrices: M = U·S·Vᵀ, where S is a diagonal matrix of the singular values of M. Critically, truncated SVD reduces dimensionality by selecting only the t largest singular values and keeping only the first t columns of U and V. In this case, t is a hyperparameter we can select and adjust to reflect the number of topics we want to find.
Intuitively, think of this as only keeping the t most significant dimensions in our
transformed space.
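A small numpy sketch of this truncation, with a random matrix standing in for A just to show the mechanics:

import numpy as np

m, n, t = 20, 50, 5                            # documents, vocabulary size, topics kept
A = np.random.rand(m, n)                       # stand-in for the tf-idf document-term matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_t, s_t, Vt_t = U[:, :t], s[:t], Vt[:t, :]    # keep only the t largest singular values

doc_vectors = U_t * s_t                        # each row: a document in t-dimensional topic space
term_vectors = Vt_t.T * s_t                    # each row: a term in t-dimensional topic space
print(doc_vectors.shape, term_vectors.shape)   # (20, 5) (50, 5)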
Pros.: With these document vectors and term vectors, we can now easily apply measures such as cosine similarity to evaluate:
• the similarity of terms (or “queries”) and documents (which becomes useful in
information retrieval, when we want to retrieve passages most relevant to our
search query).
Code (the missing lines are reconstructed with assumed values, e.g. n_components=100, so treat this as a sketch):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)
svd_model = TruncatedSVD(n_components=100, algorithm='randomized', n_iter=10)
svd_transformer = Pipeline([('tfidf', vectorizer), ('svd', svd_model)])
svd_matrix = svd_transformer.fit_transform(documents)   # documents: a list of raw text strings
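With svd_matrix in hand, the cosine-similarity use case mentioned above can be sketched like this (reusing the same pipeline; the query string is made up):

from sklearn.metrics.pairwise import cosine_similarity

doc_similarities = cosine_similarity(svd_matrix)                    # pairwise document similarity
query_vec = svd_transformer.transform(["latent semantic analysis"])
query_scores = cosine_similarity(query_vec, svd_matrix)[0]          # relevance of each document to the query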
Cons.: LSA is quick and efficient to use, but it does have a few primary drawbacks:
• lack of interpretable embeddings (we don’t know what the topics are, and the
components may be arbitrarily positive/negative)
• need for really large set of documents and vocabulary to get accurate results
LDA stands for Latent Dirichlet Allocation. LDA is a Bayesian version of pLSA. In
particular, it uses dirichlet priors for the document-topic and word-topic distributions,
lending itself to better generalization.
I am not going to go into an in-depth treatment of dirichlet distributions, since good intuitive explanations are available elsewhere. As a brief overview, however, we can think of a dirichlet as a “distribution over distributions.” In essence, it answers the question: “given this type of distribution, what are some actual probability distributions I am likely to see?”
For example, suppose we have three topics A, B, and C. A dirichlet distribution over topic mixtures might yield samples such as:
• Mixture X: 90% topic A, 5% topic B, 5% topic C
• Mixture Y: 5% topic A, 90% topic B, 5% topic C
• Mixture Z: 5% topic A, 5% topic B, 90% topic C
Figure 3
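A quick numpy sketch of this idea, drawing a few topic mixtures from a symmetric dirichlet prior (the alpha values are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
alpha = [0.1, 0.1, 0.1]                  # small alpha -> mixtures concentrated on a single topic
samples = rng.dirichlet(alpha, size=3)   # each row is a distribution over 3 topics and sums to 1
print(samples.round(2))                  # rows look roughly like mixtures X, Y and Z above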
In pLSA, we sample a document, then a topic based on that document, then a word
based on that topic. Here is the model for LDA:
Figure 4
From a dirichlet distribution Dir(α), we draw a random sample representing the topic
distribution, or topic mixture, of a particular document. This topic distribution is θ. From
θ, we select a particular topic Z based on the distribution.
Next, from another dirichlet distribution Dir(𝛽), we select a random sample representing
the word distribution of the topic Z. This word distribution is φ. From φ, we choose the
word w.
Formally, the process for generating each word in a document is as follows:
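In the standard formulation, using z for the topic and the symbols defined above:
1. For each topic k, draw a word distribution φ_k ~ Dir(β).
2. For the document, draw a topic distribution θ ~ Dir(α).
3. For each word position in the document:
   a. Draw a topic z ~ Multinomial(θ).
   b. Draw a word w ~ Multinomial(φ_z).

A toy numpy sketch of this generative process (vocabulary, document length, and hyperparameters are all invented for illustration):

import numpy as np

rng = np.random.default_rng(1)
vocab = ["economy", "bank", "match", "goal", "election", "vote"]   # invented vocabulary
n_topics, alpha, beta = 3, 0.5, 0.1

phi = rng.dirichlet([beta] * len(vocab), size=n_topics)  # one word distribution per topic
theta = rng.dirichlet([alpha] * n_topics)                # topic mixture for one document
doc = []
for _ in range(10):                                      # generate a 10-word document
    z = rng.choice(n_topics, p=theta)                    # draw a topic from the mixture
    doc.append(rng.choice(vocab, p=phi[z]))              # draw a word from that topic
print(doc)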
Pros. :LDA typically works better than pLSA because it can generalize to new
documents easily. In pLSA, the document probability is a fixed point in the dataset. If we
haven’t seen a document, we don’t have that data point. In LDA, the dataset serves as
training data for the dirichlet distribution of document-topic distributions. If we haven’t
seen a document, we can easily sample from the dirichlet distribution and move forward
from there.
Code:
LDA is easily the most popular (and typically most effective) topic modeling technique
out there. It’s available in gensim for easy use:
# load id->word mapping (the dictionary) and the tf-idf corpus in Matrix Market format
from gensim.corpora import Dictionary, MmCorpus

id2word = Dictionary.load_from_text('wiki_en_wordids.txt')
mm = MmCorpus('wiki_en_tfidf.mm')
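From there, training the model is one more call; num_topics and passes below are assumed values, not taken from the original listing:

from gensim.models import LdaModel

# Train LDA on the loaded corpus; num_topics is the hyperparameter we choose.
lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100, passes=1)
lda.print_topics(5)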
Recall the basic assumption of topic models: each document consists of a mixture of
topics, and each topic consists of a collection of words. pLSA adds a probabilistic spin
to these assumptions:
• given a document d, topic z is present in that document with probability P(z | d)
• given a topic z, word w is drawn from z with probability P(w | z)
Formally, the joint probability of seeing a given document and word together is:
P(D, W) = P(D) Σ_Z P(Z | D) P(W | Z)
Intuitively, the right-hand side of this equation tells us how likely it is to see some document and then, based on that document's distribution of topics, how likely it is to find a certain word within that document.
In this case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be
determined directly from our corpus. P(Z|D) and P(W|Z) are modeled as multinomial
distributions, and can be trained using the expectation-maximization algorithm (EM).
Without going into a full mathematical treatment of the algorithm, EM is a method of
finding the likeliest parameter estimates for a model which depends on unobserved,
latent variables (in our case, the topics).
Figure 6
pLSA can equivalently be parameterized by starting from the topic rather than the document: P(D, W) = Σ_Z P(Z) P(D | Z) P(W | Z). The reason this new parameterization is so interesting is that we can see a direct parallel between our pLSA model and our LSA model, where the probability of our topic P(Z) corresponds to the diagonal matrix of our singular topic probabilities, the probability of our document given the topic P(D|Z) corresponds to our document-topic matrix U, and the probability of our word given the topic P(W|Z) corresponds to our term-topic matrix V.
Pros.: So what does that tell us? Although it looks quite different and approaches the problem in a very different way, pLSA really just adds a probabilistic treatment of topics and words on top of LSA, and it is a far more flexible model.
Cons.: pLSA still has a few problems. In particular:
• Because we have no parameters to model P(D), we don’t know how to assign
probabilities to new documents
• The number of parameters for pLSA grows linearly with the number of documents
we have, so it is prone to overfitting
Code: We will not look at any code for pLSA because it is rarely used on its own. In general, when people are looking for a topic model beyond the baseline performance LSA gives, they turn to LDA. LDA, the most common type of topic model, extends pLSA to address these issues.