Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)

Gropp, Chris; Herzog, Alexander; Safro, Ilya; Wilson, Paul W.; Apon, Amy W.

Computer Science > Information Retrieval

arXiv:1610.07703 (cs)

[Submitted on 25 Oct 2016 (v1), last revised 4 Oct 2019 (this version, v3)]

Abstract: Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems that make sense of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to the next. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition: the data is partitioned into segments, and topic modeling is then performed on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization lead to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time, and it can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper, CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents and 3,280,697 words), seventeen years of computer science journal abstracts (533,560 documents and 32,551,540 words), and forty years of the PubMed corpus (4,025,978 documents and 273,853,980 words).
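The combining step described in the abstract can be illustrated with a minimal sketch. It assumes that the per-segment topic models have already been fitted and are available as topic-word probability vectors over a shared vocabulary. The abstract only says "clustering", so the use of k-means here is an assumption, and the toy vocabulary, the segment keys, and the names `kmeans` and `cluster_local_topics` are hypothetical and chosen for illustration only.

```python
from math import dist


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next seed: the vector farthest from its nearest existing centroid.
        far = max(vectors, key=lambda v: min(dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


def cluster_local_topics(local_topics, k):
    """Merge per-segment topics into k global topics.

    local_topics maps a segment key (here a year) to a list of topic-word
    probability vectors. Returns (membership, centroids), where membership
    maps each global topic id to the (segment, local topic index) pairs
    assigned to it.
    """
    owners, vectors = [], []
    for segment in sorted(local_topics):
        for idx, topic in enumerate(local_topics[segment]):
            owners.append((segment, idx))
            vectors.append(topic)
    labels, centroids = kmeans(vectors, k)
    membership = {j: [] for j in range(k)}
    for owner, lab in zip(owners, labels):
        membership[lab].append(owner)
    return membership, centroids


# Toy input (hypothetical): vocabulary = [neuron, spike, kernel, margin],
# three yearly segments with two local topics each. In a real pipeline the
# segments are independent, so each local model could be fitted in parallel.
local_topics = {
    1990: [[0.60, 0.30, 0.05, 0.05], [0.10, 0.10, 0.40, 0.40]],
    1991: [[0.50, 0.40, 0.05, 0.05], [0.05, 0.05, 0.50, 0.40]],
    1992: [[0.40, 0.50, 0.10, 0.00], [0.00, 0.10, 0.30, 0.60]],
}

membership, centroids = cluster_local_topics(local_topics, k=2)
for topic_id, members in membership.items():
    print(topic_id, members)
```

Each global topic is a cluster centroid, and the list of local topics assigned to it, ordered by segment, gives one view of how that topic's word composition shifts from one time step to the next.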

Subjects:	Information Retrieval (cs.IR); Machine Learning (stat.ML)
Cite as:	arXiv:1610.07703 [cs.IR]
	(or arXiv:1610.07703v3 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1610.07703

Submission history

From: Ilya Safro
[v1] Tue, 25 Oct 2016 01:50:24 UTC (720 KB)
[v2] Sun, 15 Oct 2017 04:06:39 UTC (887 KB)
[v3] Fri, 4 Oct 2019 14:37:41 UTC (377 KB)
