DOI: 10.1145/3487664.3487710
research-article

Objective Functions to Determine the Number of Topics for Topic Modeling

Published: 30 December 2021

Abstract

Topic modeling is a well-known task in unsupervised machine learning in which clustering algorithms are used to find latent topics. Several algorithms have been presented in the literature, but the best known of them require extensive hyperparameter tuning to achieve good results. In particular, the number of latent topics or clusters (k) must be known in advance. This paper therefore analyses objective functions that help evaluate topic models in order to determine optimal hyperparameters. An empirical qualitative study was conducted using the non-negative matrix factorization (NMF) algorithm on different datasets to determine numerical properties of topic models that indicate an optimal k. Based on this study, we propose objective functions for selecting optimal topic models and discuss their results on different datasets.
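
As a rough illustration of the kind of model-selection loop the abstract describes, the sketch below fits NMF for several candidate values of k and records two numerical properties per model. It is not the paper's method: the objective functions proposed by the authors are not given on this page, so the two scores used here (reconstruction error and top-term overlap between topics) are stand-in assumptions. The function names `nmf`, `topic_overlap`, and `scan_k`, the toy matrix `V`, and the simple multiplicative-update solver are all hypothetical choices made for this example.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into W (docs x k) and H (k x terms)
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def topic_overlap(H, top_n=3):
    """Mean pairwise Jaccard overlap of the top_n terms of each topic.
    0.0 means the topics share no top terms (assumed stand-in score)."""
    tops = [set(np.argsort(row)[-top_n:].tolist()) for row in H]
    pairs = [(a, b) for i, a in enumerate(tops) for b in tops[i + 1:]]
    if not pairs:  # a single topic has nothing to overlap with
        return 0.0
    return float(np.mean([len(a & b) / len(a | b) for a, b in pairs]))

def scan_k(V, ks, top_n=3):
    """Fit one NMF model per candidate k and collect both scores."""
    results = []
    for k in ks:
        W, H = nmf(V, k)
        error = float(np.linalg.norm(V - W @ H))  # Frobenius reconstruction error
        results.append({"k": k, "error": error, "overlap": topic_overlap(H, top_n)})
    return results

# Toy document-term matrix: 9 documents, 9 terms, 3 non-overlapping topics.
V = np.zeros((9, 9))
for t in range(3):
    V[3 * t:3 * t + 3, 3 * t:3 * t + 3] = 1.0

results = scan_k(V, ks=range(1, 6))
for r in results:
    print(r)
```

In practice one would inspect how such scores change as k grows and pick k where they indicate a good trade-off. Which properties actually signal an optimal k is the question the paper investigates empirically.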


Cited By

  • (2024) Mission Statement Topic Models: Coherence, Diversity, and Utility. Digital Humanities Looking at the World, 153-166. DOI: 10.1007/978-3-031-48941-9_12. Online publication date: 20-Apr-2024


      Published In

      iiWAS2021: The 23rd International Conference on Information Integration and Web Intelligence
      November 2021
      658 pages

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. hyperparameter tuning
      2. latent dirichlet allocation
      3. non-negative matrix factorization
      4. topic model coherence
      5. topic model evaluation
      6. topic modeling

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Bundesministerium für Bildung und Forschung

      Conference

      iiWAS2021
