
Topic Modeling Using NLP

BY

BHAKTA BALLAV GARAI

ABSTRACT
Topic modeling is a powerful technique for unsupervised analysis of large document collections. Topic models represent latent topics in text using hidden random variables, and discover that structure with posterior inference. Topic models have a wide range of applications, such as tag recommendation, text categorization, keyword extraction and similarity search, in the broad fields of text mining, information retrieval and statistical language modeling. In this work, a dataset of 200 abstracts falling under four topics is collected from journals in two different domains for the task of tagging journal abstracts. The document models are built using LDA (Latent Dirichlet Allocation) with Collapsed Variational Bayes (CVB0) and Gibbs sampling. The built models are then used to extract appropriate tags for the abstracts. The performance of the built models is analyzed with the evaluation measure perplexity, which shows that Gibbs sampling outperforms CVB0 sampling. The tags extracted by the two algorithms remain almost the same.

1. Introduction
One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such large text collections include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and e-mails of customer complaints.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns, but it is really hard to manually read through such large volumes and compile the topics.

An automated algorithm that can read through the text documents and automatically output the topics discussed is therefore required.

In this tutorial, we will take a real example of the ’20 Newsgroups’ dataset and use LDA
to extract the naturally discussed topics.

I will be using the Latent Dirichlet Allocation (LDA) implementation from the Gensim package, along with Mallet’s implementation (via Gensim). Mallet has an efficient implementation of LDA; it is known to run faster and to give better topic segregation.

We will also extract the volume and percentage contribution of each topic to get an idea
of how important a topic is.

What is Topic Modeling?
Topic modelling, in the context of Natural Language Processing, is described as a method of uncovering hidden structure in a collection of texts. Although that is indeed true, it is also a pretty useless definition. Let’s define topic modeling in more practical terms.

Definitions:
• C: collection of documents containing N texts.
• V: vocabulary (the set of unique words in the collection)

Dimensionality Reduction
Topic modeling is a form of dimensionality reduction. Rather than representing a text T
in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the
text in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}. Notice that
we’re using Topics to represent the set of all topics.
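
To make this concrete, here is a toy sketch of the same text in its feature space and in its topic space (the words, topics and weights are all hypothetical):

# feature space: one dimension per vocabulary word (hypothetical counts)
text_in_word_space = {"oil": 4, "gas": 2, "election": 0, "vote": 1}

# topic space: one dimension per latent topic (hypothetical weights)
text_in_topic_space = {"Topic_energy": 0.85, "Topic_politics": 0.15}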
Unsupervised Learning
Topic modeling can be easily compared to clustering. As in the case of clustering, the
number of topics, like the number of clusters, is a hyperparameter. By doing topic
modeling we build clusters of words rather than clusters of texts. A text is thus a mixture
of all the topics, each having a certain weight.

A Form of Tagging
If document classification is assigning a single category to a text, topic modeling is
assigning multiple tags to a text. A human expert can label the resulting topics with
human-readable labels and use different heuristics to convert the weighted topics to a
set of tags.
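
For instance, one simple heuristic (a sketch with hypothetical topic labels and an arbitrary cutoff) keeps every topic whose weight clears a threshold:

# weighted topics for one text (hypothetical values)
doc_topics = {"energy": 0.70, "politics": 0.25, "sports": 0.05}

# keep every topic above an arbitrary 0.2 cutoff as a tag
tags = [topic for topic, weight in doc_topics.items() if weight > 0.2]
print(tags)   # ['energy', 'politics']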

Figure 1: Topic Modelling Mechanism

Why is Topic Modeling useful?
There are several scenarios where topic modeling can prove useful. Here are some of them:

• Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature.
• Recommender Systems – Using a similarity measure, we can build recommender systems. If our system recommends articles to readers, it will recommend articles with a topic structure similar to the articles the user has already read.
• Uncovering Themes in Texts – Useful for detecting trends in online publications, for example.

2. Literature Survey
All topic models are based on the same basic assumptions:

• each document consists of a mixture of topics, and
• each topic consists of a collection of words.

In other words, topic models are built around the idea that the semantics of our document are actually being governed by some hidden, or “latent,” variables that we are not observing. As a result, the goal of topic modeling is to uncover these latent variables — topics — that shape the meaning of our document and corpus. The rest of this tutorial will build up an understanding of how different topic models uncover these latent topics.

Topic Modeling Algorithms

There are several algorithms for doing topic modeling. The most popular ones include:

• LDA – Latent Dirichlet Allocation – The one we’ll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models.
• LSA or LSI – Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra.
• NMF – Non-Negative Matrix Factorization – Based on Linear Algebra.

Here are some things all these algorithms have in common:

• The number of topics (n_topics) is a parameter. None of the algorithms can infer the number of topics in the document collection.
• All of the algorithms take as input the Document-Word Matrix (or Document-Term Matrix), where DWM[i][j] = the number of occurrences of word_j in document_i; see the sketch after this list.
• All of them output 2 matrices: WTM (Word Topic Matrix) and TDM (Topic Document Matrix). These matrices are significantly smaller, and the result of their multiplication should be as close as possible to the original DWM matrix.
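
As a small illustration, the Document-Word Matrix can be built with scikit-learn’s CountVectorizer; a minimal sketch (hypothetical three-document corpus, assuming scikit-learn 1.0+):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "stocks fell on trade fears"]

vectorizer = CountVectorizer()
dwm = vectorizer.fit_transform(corpus)   # shape: (n_documents, n_words)

# DWM[i][j] = number of occurrences of word_j in document_i
print(dwm.toarray())
print(vectorizer.get_feature_names_out())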

The purpose of this guide is not to describe each algorithm in great detail, but rather to give a practical overview and concrete implementations in Python using Scikit-Learn and Gensim. We’ll go over every algorithm later in this tutorial to understand them better.

Next, we’re going to use Scikit-Learn and Gensim to perform topic modeling on a
corpus.

Figure 2: Topic Modelling with LSA, LDA and NMF

In natural language understanding (NLU) tasks, there is a hierarchy of lenses through which we can extract meaning — from words to sentences to paragraphs to documents. At the document level, meaning is carried by topics, and the process of learning, recognizing, and extracting these topics across a collection of documents is called topic modeling.

3. Methodology

3.1. Latent Semantic Analysis (LSA)

Latent Semantic Analysis, or LSA, is one of the foundational techniques in topic modeling. The core idea is to take a matrix of what we have — documents and terms — and decompose it into a separate document-topic matrix and a topic-term matrix.

The first step is generating our document-term matrix. Given m documents and n words in our vocabulary, we can construct an m × n matrix A in which each row represents a document and each column represents a word. In the simplest version of LSA, each entry can simply be a raw count of the number of times the j-th word appeared in the i-th document. In practice, however, raw counts do not work particularly well because they do not account for the significance of each word in the document. For example, the word “nuclear” probably informs us more about the topic(s) of a given document than the word “test.”

Consequently, LSA models typically replace raw counts in the document-term matrix with a tf-idf score. Tf-idf, or term frequency-inverse document frequency, assigns a weight for term j in document i as follows:

w(i, j) = tf(i, j) * log(N / df(j))

where tf(i, j) is the number of times term j occurs in document i, N is the total number of documents, and df(j) is the number of documents containing term j.

Intuitively, a term has a large weight when it occurs frequently in a document but infrequently across the corpus. The word “build” might appear often in a document, but because it is likely fairly common in the rest of the corpus, it will not have a high tf-idf score. However, if the word “gentrification” appears often in a document, then because it is rarer in the rest of the corpus, it will have a higher tf-idf score.
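
For a rough sense of the numbers, a minimal sketch of the weights for the two words above (all counts hypothetical):

import math

# "gentrification": 5 occurrences in the document, present in 2 of 1,000 docs
print(5 * math.log(1000 / 2))     # ~31.1 -> high tf-idf weight

# "build": 5 occurrences in the document, present in 800 of 1,000 docs
print(5 * math.log(1000 / 800))   # ~1.1 -> low tf-idf weight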

Once we have our document-term matrix A, we can start thinking about our latent
topics. Here’s the thing: in all likelihood, A is very sparse, very noisy, and very
redundant across its many dimensions. As a result, to find the few latent topics that
capture the relationships among the words and documents, we want to perform
dimensionality reduction on A.

This dimensionality reduction can be performed using truncated SVD. SVD, or singular value decomposition, is a technique in linear algebra that factorizes any matrix M into the product of 3 separate matrices: M = U*S*V^T, where S is a diagonal matrix of the singular values of M. Critically, truncated SVD reduces dimensionality by selecting only the t largest singular values, and only keeping the first t columns of U and V. In this case, t is a hyperparameter we can select and adjust to reflect the number of topics we want to find. Intuitively, think of this as keeping only the t most significant dimensions in our transformed space.

In this case, U ∈ ℝ^(m × t) emerges as our document-topic matrix, and V ∈ ℝ^(n × t) becomes our term-topic matrix. In both U and V, the columns correspond to one of our t topics. In U, rows represent document vectors expressed in terms of topics; in V, rows represent term vectors expressed in terms of topics.
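
A minimal sketch of truncated SVD on a toy matrix (random illustrative values, using scipy) shows the shapes of these two factor matrices:

import numpy as np
from scipy.sparse.linalg import svds

# toy document-term matrix A: m = 4 documents, n = 6 terms
A = np.random.default_rng(1).random((4, 6))

t = 2                     # number of latent topics to keep
U, S, Vt = svds(A, k=t)   # keeps only the t largest singular values

print(U.shape)            # (4, 2): document-topic matrix
print(Vt.T.shape)         # (6, 2): term-topic matrix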

Pros: With these document vectors and term vectors, we can now easily apply measures such as cosine similarity to evaluate:

• the similarity of different documents

• the similarity of different words

• the similarity of terms (or “queries”) and documents (which becomes useful in
information retrieval, when we want to retrieve passages most relevant to our
search query).

Code:

In sklearn, a simple implementation of LSA might look something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

# raw text of the documents (placeholder strings)
documents = ["text of the first document ...",
             "text of the second document ...",
             "text of the third document ..."]

# raw documents to tf-idf matrix:
vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True,
                             smooth_idf=True)

# SVD to reduce dimensionality:
svd_model = TruncatedSVD(n_components=100,  # num dimensions/topics (must be < vocabulary size)
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents:
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])

svd_matrix = svd_transformer.fit_transform(documents)

# svd_matrix can later be used to compare documents, compare words,
# or compare queries with documents
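
Continuing from the pipeline above, a small sketch of the cosine-similarity use case from the list of pros (assuming the svd_matrix and svd_model just defined):

from sklearn.metrics.pairwise import cosine_similarity

# similarity of the first two documents in topic space
print(cosine_similarity(svd_matrix[0:1], svd_matrix[1:2]))  # close to 1.0 = similar topics

# the term-topic matrix is available as svd_model.components_ (t x n),
# so term vectors can be compared in exactly the same way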

Cons: LSA is quick and efficient to use, but it does have a few primary drawbacks:

• lack of interpretable embeddings (we don’t know what the topics are, and the components may be arbitrarily positive/negative)
• need for a really large set of documents and vocabulary to get accurate results
• less efficient representation

3.2. Latent Dirichlet Allocation (LDA)

LDA stands for Latent Dirichlet Allocation. LDA is a Bayesian version of pLSA (covered in Section 3.3). In particular, it uses Dirichlet priors for the document-topic and word-topic distributions, lending itself to better generalization.

I am not going to go into an in-depth treatment of Dirichlet distributions, since there are very good intuitive explanations elsewhere. As a brief overview, however, we can think of a Dirichlet as a “distribution over distributions.” In essence, it answers the question: “given this type of distribution, what are some actual probability distributions I am likely to see?”

Consider the very relevant example of comparing probability distributions of topic mixtures. Let’s say the corpus we are looking at has documents from 3 very different subject areas. If we want to model this, the type of distribution we want will be one that very heavily weights one specific topic, and doesn’t give much weight to the rest at all. If we have 3 topics, then some specific probability distributions we’d likely see are:

• Mixture X: 90% topic A, 5% topic B, 5% topic C
• Mixture Y: 5% topic A, 90% topic B, 5% topic C
• Mixture Z: 5% topic A, 5% topic B, 90% topic C

If we draw a random probability distribution from this Dirichlet distribution, parameterized so that each sample concentrates most of its weight on a single topic, we would likely get a distribution that strongly resembles either mixture X, mixture Y, or mixture Z. It would be very unlikely for us to sample a distribution that is 33% topic A, 33% topic B, and 33% topic C.
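
NumPy can make this concrete; a minimal sketch (illustrative concentration values):

import numpy as np

rng = np.random.default_rng(0)

# small, sparse concentration parameters push each sample's weight
# onto (mostly) one topic, like mixtures X, Y, and Z above
print(rng.dirichlet([0.1, 0.1, 0.1]))     # e.g. [0.93 0.02 0.05]

# large, uniform parameters give balanced mixtures like 33/33/33 instead
print(rng.dirichlet([50.0, 50.0, 50.0]))  # e.g. [0.34 0.32 0.34]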

That’s essentially what a Dirichlet distribution provides: a way of sampling probability distributions of a specific type. Compare this with the model for pLSA (described in Section 3.3):

Figure 3

In pLSA, we sample a document, then a topic based on that document, then a word
based on that topic. Here is the model for LDA:

Figure 4

From a Dirichlet distribution Dir(α), we draw a random sample representing the topic distribution, or topic mixture, of a particular document. This topic distribution is θ. From θ, we select a particular topic Z based on the distribution.

Next, from another Dirichlet distribution Dir(β), we select a random sample representing the word distribution of the topic Z. This word distribution is φ. From φ, we choose the word w.

Formally, the process for generating each word of a document is as follows:

1. Draw a topic mixture θ ~ Dir(α) for the document, and a word distribution φ ~ Dir(β) for each topic.
2. For each word position in the document, draw a topic Z ~ Multinomial(θ), then draw the word w ~ Multinomial(φ_Z).
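
As an illustration of this generative process, a minimal sketch (toy vocabulary and hyperparameters, all values hypothetical):

import numpy as np

rng = np.random.default_rng(42)
vocab = ["oil", "gas", "energy", "game", "team", "score"]
n_topics, alpha, beta = 2, 0.5, 0.5

# one word distribution phi per topic, each drawn from Dir(beta)
phi = rng.dirichlet([beta] * len(vocab), size=n_topics)

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * n_topics)      # topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)          # draw a topic from theta
        words.append(rng.choice(vocab, p=phi[z]))  # draw a word from that topic's phi
    return words

print(generate_document(8))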

Pros: LDA typically works better than pLSA because it can generalize to new documents easily. In pLSA, the document probability is a fixed point in the dataset: if we haven’t seen a document, we don’t have that data point. In LDA, the dataset serves as training data for the Dirichlet distribution of document-topic distributions; if we haven’t seen a document, we can easily sample from the Dirichlet distribution and move forward from there.

Code:

LDA is easily the most popular (and typically most effective) topic modeling technique
out there. It’s available in gensim for easy use:

from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamodel import LdaModel

document = "This is some document..."

# load id->word mapping (the dictionary)
id2word = Dictionary.load_from_text('wiki_en_wordids.txt')

# load corpus iterator
mm = MmCorpus('wiki_en_tfidf.mm')

# extract 100 LDA topics, updating once every 10,000 documents
lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1,
               chunksize=10000, passes=1)

# use LDA model: transform new doc to bag-of-words, then apply lda
doc_bow = id2word.doc2bow(document.split())
doc_lda = lda[doc_bow]

# doc_lda is a list of (topic_id, weight) pairs representing the
# weighted presence of each topic in the doc
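
Once trained, the topics and the document’s mixture can be inspected; a small sketch continuing from the model above:

# show the top words of the first 5 topics
for topic_id, words in lda.print_topics(num_topics=5, num_words=6):
    print(topic_id, words)

# topic mixture of the new document, e.g. [(2, 0.7), (5, 0.12), ...]
print(doc_lda)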

Conclusion: With LDA, we can extract human-interpretable topics from a document corpus, where each topic is characterized by the words it is most strongly associated with. For example, topic 2 could be characterized by terms such as “oil, gas, drilling, pipes, Keystone, energy,” etc. Furthermore, given a new document, we can obtain a vector representing its topic mixture, e.g. 5% topic 1, 70% topic 2, 10% topic 3, etc. These vectors are often very useful for downstream applications.

3.3. Probabilistic Latent Semantic Analysis (pLSA)

pLSA, or Probabilistic Latent Semantic Analysis, uses a probabilistic method instead of SVD to tackle the problem. The core idea is to find a probabilistic model with latent topics that can generate the data we observe in our document-term matrix. In particular, we want a model P(D, W) such that for any document d and word w, P(d, w) corresponds to that entry in the document-term matrix.

Recall the basic assumption of topic models: each document consists of a mixture of
topics, and each topic consists of a collection of words. pLSA adds a probabilistic spin
to these assumptions:

• given a document d, topic z is present in that document with probability P(z|d)

• given a topic z, word w is drawn from z with probability P(w|z)

Formally, the joint probability of seeing a given document and word together is:

P(d, w) = P(d) * Σ_z P(z|d) * P(w|z)

Intuitively, the right-hand side of this equation is telling us how likely it is to see some document, and then, based upon the distribution of topics of that document, how likely it is to find a certain word within that document.

Figure 5

In this case, P(D), P(Z|D), and P(W|Z) are the parameters of our model. P(D) can be
determined directly from our corpus. P(Z|D) and P(W|Z) are modeled as multinomial
distributions, and can be trained using the expectation-maximization algorithm (EM).
Without going into a full mathematical treatment of the algorithm, EM is a method of
finding the likeliest parameter estimates for a model which depends on unobserved,
latent variables (in our case, the topics).
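
Concretely, the EM iteration for pLSA alternates between the following two steps (a standard sketch of the updates, stated here for reference rather than taken from the original text):

E-step: P(z|d, w) = P(z|d) * P(w|z) / Σ_z' P(z'|d) * P(w|z')

M-step: P(w|z) ∝ Σ_d n(d, w) * P(z|d, w)   and   P(z|d) ∝ Σ_w n(d, w) * P(z|d, w)

where n(d, w) is the number of times word w occurs in document d.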

Interestingly, P(D, W) can be equivalently parameterized using a different set of 3 parameters:

P(d, w) = Σ_z P(z) * P(d|z) * P(w|z)

We can understand this equivalency by looking at the model as a generative process. In our first parameterization, we were starting with the document with P(d), then generating the topic with P(z|d), and then generating the word with P(w|z). In this parameterization, we are starting with the topic with P(z), and then independently generating the document with P(d|z) and the word with P(w|z).

Figure 6

The reason this new parameterization is so interesting is that we can see a direct parallel between our pLSA model and our LSA model:

P(D, W) = Σ_Z P(Z) * P(D|Z) * P(W|Z)   ↔   A = U * S * V^T

where the probability of our topic P(Z) corresponds to the diagonal matrix S of our singular topic probabilities, the probability of our document given the topic P(D|Z) corresponds to our document-topic matrix U, and the probability of our word given the topic P(W|Z) corresponds to our term-topic matrix V.

So what does that tell us? Although it looks quite different and approaches the problem in a very different way, pLSA really just adds a probabilistic treatment of topics and words on top of LSA.

Pros: It is a far more flexible model than LSA.

Cons: It still has a few problems. In particular:

• Because we have no parameters to model P(D), we don’t know how to assign probabilities to new documents
• The number of parameters for pLSA grows linearly with the number of documents we have, so it is prone to overfitting

Code: We will not look at any code for pLSA because it is rarely used on its own. In general, when people are looking for a topic model beyond the baseline performance LSA gives, they turn to LDA. LDA, the most common type of topic model, extends pLSA to address these issues.
