ProSiT! Latent Variable Discovery with PROgressive SImilarity Thresholds

Fornaciari, Tommaso; Hovy, Dirk; Bianchi, Federico

arXiv:2210.14763 (cs)

[Submitted on 26 Oct 2022]

Abstract: The most common ways to explore latent document dimensions are topic models and clustering methods. However, topic models have several drawbacks: for example, they require us to choose the number of latent dimensions a priori, and their results are stochastic. Most clustering methods have the same issues and lack flexibility in various ways, such as not accounting for the influence of different topics on single documents, forcing word descriptors to belong to a single topic (hard clustering), or necessarily relying on word representations. We propose PROgressive SImilarity Thresholds (ProSiT), a deterministic and interpretable method, agnostic to the input format, that finds the optimal number of latent dimensions and has only two hyper-parameters, which can be set efficiently via grid search. We compare this method with a wide range of topic models and clustering methods on four benchmark data sets. In most settings, ProSiT matches or outperforms the other methods in terms of six metrics of topic coherence and distinctiveness, producing replicable, deterministic results.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2210.14763 [cs.CL]
	(or arXiv:2210.14763v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.14763

Submission history

From: Tommaso Fornaciari
[v1] Wed, 26 Oct 2022 14:52:44 UTC (2,861 KB)
