Objective Functions to Determine the Number of Topics for Topic Modeling

Authors: Silvio Peikert, Clemens Kubach, Jamal Al Qundus, Le Duyen Sandra Vu, Adrian Paschke

iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence

Pages 328 - 332

https://doi.org/10.1145/3487664.3487710

Published: 30 December 2021

Abstract

Topic modeling is a well-known task in unsupervised machine learning in which clustering algorithms are used to find latent topics. Several algorithms have been presented in the literature, but the best known of them require extensive hyperparameter tuning to achieve good results. In particular, the number of latent topics or clusters (k) must be known in advance. This paper therefore analyses objective functions that help evaluate models in order to determine optimal hyperparameters. An empirical qualitative study was conducted using the non-negative matrix factorization (NMF) algorithm on different datasets to determine experimentally which numerical properties of topic models indicate an optimal k. Based on this study, we propose objective functions for selecting optimal topic models and discuss their results on different datasets.
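
To make the idea of scanning candidate values of k concrete, the following is a minimal sketch and not the authors' method. It fits NMF with the standard multiplicative update rules for several k on a made-up document-term matrix and records one numerical property per model, the Frobenius reconstruction error. The matrix V, the helper names (matmul, transpose, nmf, reconstruction_error), and the choice of reconstruction error as the recorded property are all assumptions for illustration; the abstract does not state which objective functions the paper proposes.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor v (n x m) into non-negative w (n x k) and h (k x m).

    Uses multiplicative updates for the Frobenius objective.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def reconstruction_error(v, w, h):
    """Frobenius norm of v - w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Hypothetical toy document-term matrix: 6 documents x 6 terms,
# built from three disjoint groups of terms (illustrative data only).
V = [
    [3, 2, 0, 0, 0, 0],
    [2, 3, 0, 0, 0, 0],
    [0, 0, 4, 1, 0, 0],
    [0, 0, 1, 4, 0, 0],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 3],
]

# Fit one model per candidate k and record the chosen property.
errors = {}
for k in range(1, 5):
    W, H = nmf(V, k)
    errors[k] = reconstruction_error(V, W, H)

for k, err in errors.items():
    print(f"k={k}: reconstruction error {err:.3f}")
```

Reconstruction error generally keeps shrinking as k grows, so on its own it cannot identify an optimal k; it only shows the pattern of fitting a model per candidate k and comparing a score. A practical objective function would combine or replace it with other properties, such as a topic coherence measure.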

Cited By

Ford J, Leeds J, Kubosumi S (2024). Mission Statement Topic Models: Coherence, Diversity, and Utility. Digital Humanities Looking at the World, 153-166. Online publication date: 20-Apr-2024.
https://doi.org/10.1007/978-3-031-48941-9_12

Index Terms

Objective Functions to Determine the Number of Topics for Topic Modeling
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.


Published In

iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence

November 2021

658 pages

ISBN:9781450395564

DOI:10.1145/3487664

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Bundesministerium für Bildung und Forschung

Conference

iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence

November 29 - December 1, 2021

Linz, Austria
