My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

Risch, Julian; Krestel, Ralf

doi:10.1145/3197026.3197038


arXiv:1911.11240 (cs)

[Submitted on 25 Nov 2019]


Abstract: Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
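The abstract says that words are split into collection-specific and collection-independent ones based on information entropy. The sketch below illustrates that general idea only and is not the authors' model: it computes the normalized Shannon entropy of each word's relative frequency across collections and splits the vocabulary at a cutoff. The function names, the toy token lists, and the threshold of 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_collection_entropy(collections):
    """Map each word to the normalized entropy of its spread over collections.

    `collections` is a list of token lists, one per collection (assumed input
    format). A value near 1 means the word is used about equally everywhere;
    a value of 0 means it occurs in a single collection only.
    """
    if len(collections) < 2:
        raise ValueError("need at least two collections to compare")

    # Relative frequencies per collection, so collection size does not dominate.
    rel_freqs = []
    for tokens in collections:
        counts = Counter(tokens)
        total = sum(counts.values())
        rel_freqs.append({w: c / total for w, c in counts.items()})

    vocab = set().union(*rel_freqs)
    max_entropy = math.log2(len(collections))
    entropies = {}
    for word in vocab:
        freqs = [rf.get(word, 0.0) for rf in rel_freqs]
        norm = sum(freqs)
        probs = [f / norm for f in freqs if f > 0]
        h = sum(-p * math.log2(p) for p in probs)
        entropies[word] = h / max_entropy
    return entropies


def split_vocabulary(entropies, threshold=0.5):
    """Split words into disjoint general and specific sets.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    general = {w for w, h in entropies.items() if h >= threshold}
    specific = set(entropies) - general
    return general, specific


# Toy data (made up): a "patent" and a "paper" collection.
patents = ["apparatus", "method", "sensor", "apparatus"]
papers = ["approach", "method", "sensor", "approach"]

entropies = word_collection_entropy([patents, papers])
general, specific = split_vocabulary(entropies)
print("general:", sorted(general))
print("specific:", sorted(specific))
```

Because every word falls on exactly one side of the cutoff, the two sets are disjoint by construction, which mirrors the property of disjunct general and specific word distributions that the abstract highlights. How the actual model integrates entropy into topic inference is not described in the abstract.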

Subjects:	Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Cite as:	arXiv:1911.11240 [cs.IR]
	(or arXiv:1911.11240v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1911.11240
Journal reference:	Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 283-292 (2018)
Related DOI:	https://doi.org/10.1145/3197026.3197038

Submission history

From: Julian Risch
[v1] Mon, 25 Nov 2019 21:29:59 UTC (1,780 KB)
