research-article

Topic Modeling Using Latent Dirichlet allocation: A Survey

Published: 17 September 2021
    Abstract

    A massive text corpus cannot be handled without summarizing it into a relatively small subset, and computational tools are needed to understand such a large pool of text. Probabilistic topic modeling discovers and explains an enormous collection of documents by reducing it to a topical subspace. In this work, we study the background and advancement of topic modeling techniques. We first introduce the preliminaries of topic modeling and then review its extensions and variations, such as topic modeling over various domains, hierarchical topic modeling, word-embedded topic models, and topic models from a multilingual perspective. We also explore research on topic modeling in distributed environments and on topic visualization approaches, and we briefly cover implementation and evaluation techniques for topic models. Comparison matrices are presented over the experimental results of the various categories of topic modeling. Finally, diverse technical challenges and future directions are discussed.

    Supplementary Material

    a145-chauhan-apndx.pdf (chauhan.zip)
    Supplemental movie, appendix, image, and software files for "Topic Modeling Using Latent Dirichlet allocation: A Survey"

    References

    [1]
    Nikolaos Aletras and Mark Stevenson. 2013. Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics. 13–22.
    [2]
    Rubayyi Alghamdi and Khalid Alfalqi. 2015. A survey of topic modeling in text mining. Int. J. Adv. Comput. Sci. Appl. 6, 1 (2015).
    [3]
    Loulwah AlSumait, Daniel Barbará, and Carlotta Domeniconi. 2008. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE, 3–12.
    [4]
    Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. 2009. On smoothing and inference for topic models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. 27–34.
    [5]
    Hazeline U. Asuncion, Arthur U. Asuncion, and Richard N. Taylor. 2010. Software traceability with topic modeling. In Proceedings of the ACM/IEEE 32nd International Conference on Software Engineering, Vol. 1. IEEE, 95–104.
    [6]
    D. K. JinYeong Bak and A. Oh. 2012. Distributed online learning for latent Dirichlet allocation. In Proceedings of the NIPS Workshop on Big Learning. 1–8.
    [7]
    Parantapa Bhattacharya, Muhammad Bilal Zafar, Niloy Ganguly, Saptarshi Ghosh, and Krishna P. Gummadi. 2014. Inferring user interests in the Twitter social network. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 357–360.
    [8]
    David M. Blei. 2012. Probabilistic topic models. Commun. ACM 55, 4 (2012), 77–84.
    [9]
    David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. 2010. The nested chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 2 (2010), 7.
    [10]
    David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 113–120.
    [11]
    David M. Blei and John D. Lafferty. 2007. A correlated topic model of science. Ann. Appl. Statist. (2007), 17–35.
    [12]
    David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, Jan. (2003), 993–1022.
    [13]
    Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. Vol. 225255. CRC Press, Boca Raton, FL.
    [14]
    Samuel Brody and Mirella Lapata. 2009. Bayesian word sense induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 103–111.
    [15]
    Stefan Bunk and Ralf Krestel. 2018. WELDA: Enhancing topic models by incorporating local word context. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. 293–302.
    [16]
    George Casella and Edward I. George. 1992. Explaining the Gibbs sampler. Amer. Statist. 46, 3 (1992), 167–174.
    [17]
    Jonathan Chang. 2012. Collapsed Gibbs sampling methods for topic models. R package: lda (version 1.3.2). http://cran.r-project.org/web/packages/lda/index.html.
    [18]
    Jonathan Chang and David Blei. 2009. Relational topic models for document networks. In Artificial Intelligence and Statistics. PMLR, 81–88.
    [19]
    Ying-Lang Chang and Jen-Tzung Chien. 2009. Latent Dirichlet learning for document summarization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 1689–1692.
    [20]
    Tse-Hsun Chen, Weiyi Shang, Meiyappan Nagappan, Ahmed E. Hassan, and Stephen W. Thomas. 2017. Topic-based software defect explanation. J. Syst. Softw. 129 (2017), 79–106.
    [21]
    Xueqi Cheng, Xiaohui Yan, Yanyan Lan, and Jiafeng Guo. 2014. BTM: Topic modeling over short texts. IEEE Trans. Knowl. Data Eng. 26, 12 (2014), 2928–2941.
    [22]
    Jason Chuang, Christopher D. Manning, and Jeffrey Heer. 2012. Termite: Visualization techniques for assessing textual topic models. In Proceedings of the International Working Conference on Advanced Visual Interfaces. ACM, 74–77.
    [23]
    Raphael Cohen, Iddo Aviram, Michael Elhadad, and Noémie Elhadad. 2014. Redundancy-aware topic modeling for patient record notes. PloS One 9, 2 (2014), e87555.
    [24]
    Mário Cordeiro. 2012. Twitter event detection: Combining wavelet analysis and topic inference summarization. In Doctoral Symposium on Informatics Engineering. 11–16.
    [25]
    Christopher S. Corley, Kostadin Damevski, and Nicholas A. Kraft. 2020. Changeset-based topic modeling of software repositories. IEEE Trans. Softw. Eng. 46, 10 (2020), 1068–1080.
    [26]
    Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings. In Proceedings of the Meeting of the Association for Computational Linguistics. 795–804.
    [27]
    Ali Daud, Juanzi Li, Lizhu Zhou, and Faqir Muhammad. 2010. Knowledge discovery through directed probabilistic topic models: A survey. Front. Comput. Sci. China 4, 2 (2010), 280–301.
    [28]
    Wim De Smet and Marie-Francine Moens. 2009. Cross-language linking of news stories on the web using interlingual topic modelling. In Proceedings of the 2nd ACM Workshop on Social Web Search and Mining. ACM, 57–64.
    [29]
    Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
    [30]
    Stefan Debortoli, Oliver Müller, Iris Junglas, and Jan vom Brocke. 2016. Text mining for information systems researchers: An annotated topic modeling tutorial. Commun. Assoc. Inf. Syst. 39, 1 (2016), 7.
    [31]
    Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inf. Sci. 41, 6 (1990), 391.
    [32]
    Mohamed Dermouche, Julien Velcin, Leila Khouas, and Sabine Loudcher. 2014. A joint model for topic-sentiment evolution over time. In Proceedings of the IEEE International Conference on Data Mining (ICDM’14). IEEE, 773–778.
    [33]
    Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. 2019. The dynamic embedded topic model. arXiv preprint arXiv:1907.05545 (2019).
    [34]
    Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. 2020. Topic modeling in embedding spaces. Trans. Assoc. Comput. Ling. 8 (2020), 439–453.
    [35]
    Tarek Elguebaly and Nizar Bouguila. 2013. Simultaneous Bayesian clustering and feature selection using RJMCMC-based learning of finite generalized Dirichlet mixture models. Sig. Process. 93, 6 (2013), 1531–1546.
    [36]
    Katayoun Farrahi and Daniel Gatica-Perez. 2011. Discovering routines from large-scale human locations using probabilistic topic models. ACM Trans. Intell. Syst. Technol. 2, 1 (2011), 3.
    [37]
    Xianghua Fu, Kun Yang, Joshua Zhexue Huang, and Laizhong Cui. 2015. Dynamic non-parametric joint sentiment topic mixture model. Knowl.-based Syst. 82 (2015), 102–114.
    [38]
    Debasis Ganguly, Manisha Ganguly, Johannes Leveling, and Gareth J. F. Jones. 2013. TopicVis: A GUI for topic-based feedback and navigation.
    [39]
    Debasis Ganguly, Johannes Leveling, and Gareth J. F. Jones. 2012. Cross-lingual topical relevance models.
    [40]
    Brynjar Gretarsson, John O’Donovan, Svetlin Bostandjiev, Tobias Höllerer, Arthur Asuncion, David Newman, and Padhraic Smyth. 2012. Topicnets: Visual analysis of large text corpora with topic modeling. ACM Trans. Intell. Syst. Technol. 3, 2 (2012), 23.
    [41]
    Tom Griffiths. 2002. Gibbs sampling in the generative model of latent Dirichlet allocation.
    [42]
    Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proc. Nat. Acad. Sci. 101, suppl 1 (2004), 5228–5235.
    [43]
    Loni Hagen. 2018. Content analysis of e-petitions with topic modeling: How to train and evaluate LDA models?Inf. Proc. Manag. 54, 6 (2018), 1292–1307.
    [44]
    Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 362–370.
    [45]
    Xingwei He, Hua Xu, Jia Li, Liu He, and Linlin Yu. 2017. FastBTM: Reducing the sampling time for biterm topic model. Knowl.-Based Syst 132 (2017), 11–20.
    [46]
    Gregor Heinrich. 2008. Parameter Estimation for Text Analysis. Technical Report. University of Leipzig. 1–32.
    [47]
    Go Eun Heo, Keun Young Kang, Min Song, and Jeong-Hoon Lee. 2017. Analyzing the field of bioinformatics with the multi-faceted topic modeling technique. BMC Bioinf 18, 7 (2017), 251.
    [48]
    Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online learning for latent Dirichlet allocation. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 856–864.
    [49]
    Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 289–296.
    [50]
    Thomas Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42, 1 (2001), 177–196.
    [51]
    Liangjie Hong, Ovidiu Dan, and Brian D. Davison. 2011. Predicting popular messages in Twitter. In Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 57–58.
    [52]
    Pengfei Hu, Wenju Liu, Wei Jiang, and Zhanlei Yang. 2014. Latent topic model for audio retrieval. Pattern Recog. 47, 3 (2014), 1138–1143.
    [53]
    Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling. Mach. Learn. 95, 3 (2014), 423–469.
    [54]
    Dongping Huang, Shuyu Hu, Yi Cai, and Huaqing Min. 2014. Discovering event evolution graphs based on news articles relationships. In Proceedings of the IEEE 11th International Conference on e-Business Engineering (ICEBE’14). IEEE, 246–251.
    [55]
    Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao. 2019. Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimedia Tools. Applic. 78, 11 (2019), 15169–15211.
    [56]
    Do-Heon Jeong and Min Song. 2014. Time gap analysis by the topic model-based temporal technique. J. Informet. 8, 3 (2014), 776–790.
    [57]
    Di Jiang, Yongxin Tong, and Yuanfeng Song. 2016. Cross-lingual topic discovery from multilingual search engine query log. ACM Trans. Inf. Syst. 35, 2 (2016), 9.
    [58]
    Efsun Sarioglu Kayi, Kabir Yadav, James M. Chamberlain, and Hyeong-Ah Choi. 2017. Topic modeling for classification of clinical reports. arXiv preprint arXiv:1706.06177 (2017).
    [59]
    Muhammad Taimoor Khan, Mehr Durrani, Shehzad Khalid, and Furqan Aziz. 2016. Online knowledge-based model for big data topic extraction. Comput. Intell. Neurosci.
    [60]
    Milad Kharratzadeh, Benjamin Renard, and Mark J. Coates. 2015. Bayesian topic model approaches to online and time-dependent clustering. Dig. Sig. Process. 47 (2015), 25–35.
    [61]
    Dongwoo Kim and Alice Oh. 2011. Accounting for data dependencies within a hierarchical Dirichlet process mixture model. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 873–878.
    [62]
    Dongwoo Kim and Alice Oh. 2011. Topic chains for understanding a news corpus. Comput. Ling. Intell. Text Process.
    [63]
    Dongwoo Kim and Alice Oh. 2014. Hierarchical Dirichlet scaling process. In Proceedings of the International Conference on Machine Learning. 973–981.
    [64]
    Joon Hee Kim, Dongwoo Kim, Suin Kim, and Alice Oh. 2012. Modeling topic hierarchies with the recursive Chinese restaurant process. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. 783–792.
    [65]
    Younghoon Kim and Kyuseok Shim. 2014. TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation. Inf. Syst. 42 (2014), 59–77.
    [66]
    Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
    [67]
    Julian F. P. Kooij, Gwenn Englebienne, and Dariu M. Gavrila. 2015. Identifying multiple objects from their appearance in inaccurate detections. Comput. Vis. Image Underst. 136 (2015), 103–116.
    [68]
    Guy Lansley and Paul A. Longley. 2016. The geography of Twitter topics in London. Comput. Environ. Urb. Syst. 58 (2016), 85–96.
    [69]
    Jey Han Lau and Timothy Baldwin. 2016. The sensitivity of topic coherence evaluation to topic cardinality. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.483–487.
    [70]
    Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality.Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.530–539.
    [71]
    Jure Leskovec, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 497–506.
    [72]
    Chenliang Li, Yu Duan, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2017. Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans. Inf. Syst. 36, 2 (2017), 11.
    [73]
    Chenliang Li, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2016. Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
    [74]
    Weifeng Li, Junming Yin, and Hsinchsun Chen. 2017. Supervised topic modeling using hierarchical Dirichlet process-based inverse regression: Experiments on e-commerce applications. IEEE Trans. Knowl. Data Eng. 30, 6 (2017), 1192–1205.
    [75]
    Tianyi Lin, Wentao Tian, Qiaozhu Mei, and Hong Cheng. 2014. The dual-sparse topic model: Mining focused topics and focused terms in short text. In Proceedings of the 23rd International Conference on World Wide Web. 539–550.
    [76]
    Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. 2007. Mining concepts from code with probabilistic topic models. In Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering. ACM, 461–464.
    [77]
    Jun S. Liu. 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Amer. Statist. Assoc. 89, 427 (1994), 958–966.
    [78]
    Shuhua Liu and Patrick Jansson. 2017. Topic Modelling Analysis of Instagram Data for the Greater Helsinki Region.
    [79]
    Xiaodong Liu, Kevin Duh, and Yuji Matsumoto. 2015. Multilingual topic models for bilingual dictionary extraction. ACM Trans. Asian Low-resour. Lang. Inf. Process. 14, 3 (2015), 11.
    [80]
    Xiao Liu, Mingli Song, Qi Zhao, Dacheng Tao, Chun Chen, and Jiajun Bu. 2012. Attribute-restricted latent topic model for person re-identification. Pattern Recog. 45, 12 (2012), 4204–4213.
    [81]
    Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, and Maosong Sun. 2011. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol. 2, 3 (2011), 26.
    [82]
    Kun Lu and Dietmar Wolfram. 2012. Measuring author research relatedness: A comparison of word-based, topic-based, and author cocitation approaches. J. Amer. Soc. Inf. Sci. Technol. 63, 10 (2012), 1973–1986.
    [83]
    Zhiwu Lu and Yuxin Peng. 2013. Latent semantic learning with structured sparse representation for human action recognition. Pattern Recog. 46, 7 (2013), 1799–1809.
    [84]
    Stacy K. Lukins, Nicholas A. Kraft, and Letha H. Etzkorn. 2010. Bug localization using latent Dirichlet allocation. Inf. Softw. Technol. 52, 9 (2010), 972–990.
    [85]
    Minnan Luo, Feiping Nie, Xiaojun Chang, Yi Yang, Alexander Hauptmann, and Qinghua Zheng. 2017. Probabilistic non-negative matrix factorization and its robust extensions for topic modeling. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
    [86]
    Baizhang Ma, Dongsong Zhang, Zhijun Yan, and Taeha Kim. 2013. An LDA and synonym lexicon based approach to product feature extraction from online consumer product reviews. J. Electron. Commer. Res. 14, 4 (2013), 304.
    [87]
    Hui-Fang Ma. 2011. Hot topic extraction using time window. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC’11). IEEE, 56–60.
    [88]
    Masoud Makrehchi. 2011. Social link recommendation by learning hidden topics. In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, 189–196.
    [89]
    James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela H. Byers. 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
    [90]
    Jon D. Mcauliffe and David M. Blei. 2008. Supervised topic models. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 121–128.
    [91]
    Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. (2002). Retrieved from http://mallet.cs.umass.edu.
    [92]
    Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. 2007. Topic sentiment mixture: Modeling facets and opinions in weblogs. In Proceedings of the 16th International Conference on World Wide Web. ACM, 171–180.
    [93]
    David Mimno and Andrew McCallum. 2007. Expertise modeling for matching papers with reviewers. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 500–509.
    [94]
    David Mimno and Andrew McCallum. 2007. Organizing the OCA: Learning faceted subjects from a library of digital books. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 376–385.
    [95]
    David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 880–889.
    [96]
    Christopher E. Moody. 2016. Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019 (2016).
    [97]
    Gordon E. Moon, Israt Nisa, Aravind Sukumaran-Rajam, Bortik Bandyopadhyay, Srinivasan Parthasarathy, and P. Sadayappan. 2018. Parallel latent Dirichlet allocation on GPUs. In Proceedings of the International Conference on Computational Science. Springer, 259–272.
    [98]
    N. K. Nagwani. 2015. Summarizing large text collection using topic modeling and clustering based on MapReduce framework. J. Big Data 2, 1 (2015), 6.
    [99]
    Ramesh Nallapati, William Cohen, and John Lafferty. 2007. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In Proceedings of the International Conference on Data Mining Workshops (ICDMW’07). IEEE, 349–354.
    [100]
    David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2009. Distributed algorithms for topic models. J. Mach. Learn. Res. 10, Aug. (2009), 1801–1828.
    [101]
    David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 100–108.
    [102]
    David Newman, Padhraic Smyth, and Mark Steyvers. 2006. Scalable parallel topic models. J. Intell. Commun. Res. Devel. 5 (2006).
    [103]
    David Newman, Padhraic Smyth, Max Welling, and Arthur U. Asuncion. 2008. Distributed inference for latent Dirichlet allocation. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 1081–1088.
    [104]
    Zhenxing Niu, Gang Hua, Le Wang, and Xinbo Gao. 2017. Knowledge-based topic model for unsupervised object discovery and localization. IEEE Trans. Image Process. 27, 1 (2017), 50–63.
    [105]
    Michael J. Paul and Mark Dredze. 2014. Discovering health topics in social media using topic models. PloS One 9, 8 (2014), e103408.
    [106]
    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
    [107]
    Nanyun Peng, Yiming Wang, and Mark Dredze. 2014. Learning polylingual topic models from code-switched social media documents. In Proceedings of the 52nd Meeting of the Association for Computational Linguistics. 674–679.
    [108]
    James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, and Alex J. Smola. 2010. Word features for latent Dirichlet allocation. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 1921–1929.
    [109]
    Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 569–577.
    [110]
    Jipeng Qiang, Zhenyu Qian, Yun Li, Yunhao Yuan, and Xindong Wu. 2020. Short text topic modeling techniques, applications, and performance: A survey. IEEE Transactions on Knowledge and Data Engineering.
    [111]
    Xiaojun Quan, Chunyu Kit, Yong Ge, and Sinno Jialin Pan. 2015. Short and sparse text topic modeling via self-aggregation. In Proceedings of the 24th International Joint Conference on Artificial Intelligence.
    [112]
    Daniel Ramage, Susan Dumais, and Dan Liebling. 2010. Characterizing microblogs with topic models. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media.
    [113]
    Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 248–256.
    [114]
    Daniel Ramage, Christopher D. Manning, and Susan Dumais. 2011. Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 457–465.
    [115]
    Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC Workshop on New Challenges for NLP Frameworks. ELRA, 45–50.
    [116]
    Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney. 2010. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML’10). 903–910.
    [117]
    Yafeng Ren, Ruimin Wang, and Donghong Ji. 2016. A topic-enhanced word embedding for Twitter sentiment classification. Inf. Sci. 369 (2016), 188–198.
    [118]
    Philip Resnik and Eric Hardisty. 2010. Gibbs sampling for the uninitiated. Maryland Univ College Park Inst for Advanced Computer Studies.
    [119]
    Kirk Roberts, Michael A. Roach, Joseph Johnson, Josh Guthrie, and Sanda M. Harabagiu. 2012. EmpaTweet: Annotating and detecting emotions on Twitter. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12). Citeseer, 3806–3813.
    [120]
    Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487–494.
    [121]
    Karim Sayadi, Quang Vu Bui, and Marc Bui. 2016. Distributed implementation of the latent Dirichlet allocation on Spark. In Proceedings of the 7th Symposium on Information and Communication Technology. ACM, 92–98.
    [122]
    Alexandra Schofield, Måns Magnusson, and David Mimno. 2017. Pulling out the stops: Rethinking stopword removal for topic models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 432–436.
    [123]
    Karthick Seshadri, S. Mercy Shalinie, and Chidambaram Kollengode. 2015. Design and evaluation of a parallel algorithm for inferring topic hierarchies. Inf. Proc. Manag. 51, 5 (2015), 662–676.
    [124]
    Carson Sievert and Kenneth Shirley. 2014. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 63–70.
    [125]
    Bradley Skaggs and Lise Getoor. 2014. Topic modeling for Wikipedia link disambiguation. ACM Trans. Inf. Syst. 32, 3 (2014), 10.
    [126]
    Alison Smith, Jason Chuang, Yuening Hu, Jordan Boyd-Graber, and Leah Findlater. 2014. Concurrent visualization of relationships between words and topics in topic models. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 79–82.
    [127]
    Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proc. VLDB Endow. 3, 1-2 (2010), 703–710.
    [128]
    Padhraic Smyth, Max Welling, and Arthur U. Asuncion. 2009. Asynchronous distributed learning of topic models. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 81–88.
    [129]
    Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. Handb. Latent Semant. Anal. 427, 7 (2007), 424–440.
    [130]
    Xiaobing Sun, Bixin Li, Hareton Leung, Bin Li, and Yun Li. 2015. MSR4SM: Using topic models to effectively mining software repositories for software maintenance tasks. Inf. Softw. Technol. 66 (2015), 1–12.
    [131]
    Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 1385–1392.
    [132]
    Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in MPICH. Int. J. High Perf. Comput. Applic. 19, 1 (2005), 49–66.
    [133]
    Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. 2014. Studying software evolution using topic models. Sci. Comput. Prog. 80 (2014), 457–479.
    [134]
    Kai Tian, Meghan Revelle, and Denys Poshyvanyk. 2009. Using latent Dirichlet allocation for automatic categorization of software. In Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories. IEEE, 163–166.
    [135]
    Zhongyuan Tian, Harumichi Yokoyama, and Takuya Araki. 2019. Parallel latent Dirichlet allocation using vector processors. In Proceedings of the IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 1548–1555.
    [136]
    Calin Rares Turliuc, Luke Dickens, Alessandra Russo, and Krysia Broda. 2016. Probabilistic abductive logic programming using Dirichlet priors. Int. J. Approx. Reas. 78 (2016), 223–240.
    [137]
    Duc-Thuan Vo and Cheol-Young Ock. 2015. Learning to classify short text from scientific documents using topic models with various types of knowledge. Exp. Syst. Applic. 42, 3 (2015), 1684–1698.
    [138]
    Konstantin Vorontsov, Oleksandr Frei, Murat Apishev, Peter Romov, and Marina Dudarenko. 2015. BigARTM: Open source library for regularized multimodal topic modeling of large collections. In Proceedings of the International Conference on Analysis of Images, Social Networks and Texts. Springer, 370–381.
    [139]
    Konstantin Vorontsov and Anna Potapenko. 2015. Additive regularization of topic models. Mach. Learn. 101, 1–3 (2015), 303–323.
    [140]
    Nicholas Vretos, Nikos Nikolaidis, and Ioannis Pitas. 2012. Video fingerprinting using latent Dirichlet allocation and facial images. Pattern Recog. 45, 7 (2012), 2489–2498.
    [141]
    Ivan Vulić, Wim De Smet, and Marie-Francine Moens. 2013. Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora. Inf. Retr. 16, 3 (2013), 331–368.
    [142]
    Ivan Vulić, Wim De Smet, Jie Tang, and Marie-Francine Moens. 2015. Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Inf. Proc. Manag. 51, 1 (2015), 111–147.
    [143]
    Martin J. Wainwright, Michael I. Jordan et al. 2008. Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 1, 1–2 (2008), 1–305.
    [144]
    Hanna M Wallach. 2006. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 977–984.
    [145]
    Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning.1105–1112.
    [146]
    Chong Wang, David Blei, and David Heckerman. 2012. Continuous time dynamic topic models. arXiv preprint arXiv:1206.3298 (2012).
    [147]
    Di Wang and Ahmad Al-Rubaie. 2015. Incremental learning with partial-supervision based on hierarchical Dirichlet process and the application for document classification. Appl. Soft Comput. 33 (2015), 250–262.
    [148]
    Jin Wang, Xiangping Sun, Mary F. H. She, Abbas Kouzani, and Saeid Nahavandi. 2013. Unsupervised mining of long time series based on latent topic model. Neurocomputing 103 (2013), 93–103.
    [149]
    Xuerui Wang and Andrew McCallum. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 424–433.
    [150]
    Xuerui Wang, Andrew McCallum, and Xing Wei. 2007. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). IEEE, 697–702.
    [151]
    Xiang Wang, Kai Zhang, Xiaoming Jin, and Dou Shen. 2009. Mining common topics from multiple asynchronous text streams. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining. ACM, 192–201.
    [152]
    Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. 2009. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Proceedings of the International Conference on Algorithmic Applications in Management. 301–314.
    [153]
    Yu Wang, Jiebo Luo, Richard Niemi, Yuncheng Li, and Tianran Hu. 2016. Catching fire via “likes”: Inferring topic preferences of Trump followers on Twitter. In Proceedings of the 10th International AAAI Conference on Web and Social Media.
    [154]
    Yi Wang, Xuemin Zhao, Zhenlong Sun, Hao Yan, Lifeng Wang, Zhihui Jin, Liubin Wang, Yang Gao, Jia Zeng, Qiang Yang et al. 2014. Towards topic modeling for big data. arXiv preprint arXiv:1405.4402 (2014).
    [155]
    Lino Wehrheim. 2019. Economic history goes digital: Topic modeling the Journal of Economic History. Cliometrica 13, 1 (2019), 83–125.
    [156]
    Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. 2010. Twitterrank: Finding topic-sensitive influential Twitterers. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. ACM, 261–270.
    [157]
    Erik Wiener, Jan O. Pedersen, Andreas S. Weigend, et al. 1995. A neural network approach to topic spotting. In Proceedings of the 4th Symposium on Document Analysis and Information Retrieval.
    [158]
    Andrew T. Wilson and Peter A. Chew. 2010. Term weighting schemes for latent Dirichlet allocation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 465–473.
    [159]
    Yueshen Xu, Jianwei Yin, Jianbin Huang, and Yuyu Yin. 2018. Hierarchical topic modeling with automatic knowledge mining. Exp. Syst. Applic. 103 (2018), 106–117.
    [160]
    Yueshen Xu, Yuyu Yin, and Jianwei Yin. 2017. Tackling topic general words in topic modeling. Eng. Applic. Artif. Intell. 62 (2017), 124–133.
    [161]
    Guangxu Xun, Yaliang Li, Wayne Xin Zhao, Jing Gao, and Aidong Zhang. 2017. A correlated topic model using word embeddings. In Proceedings of the International Joint Conference on Artificial Intelligence. 4207–4213.
    [162]
    Feng Yan, Ningyi Xu, and Yuan Qi. 2009. Parallel inference for latent Dirichlet allocation on graphics processing units. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 2134–2142.
    [163]
    Shuang Yang, Chunfeng Yuan, Weiming Hu, and Xinmiao Ding. 2014. A hierarchical model based on latent Dirichlet allocation for action recognition. In Proceedings of the 22nd International Conference on Pattern Recognition. IEEE, 2613–2618.
    [164]
    Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. 2019. A multilingual topic model for learning weighted topic links across corpora with low comparability. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). 1243–1248.
    [165]
    Yi Yang, Doug Downey, and Jordan Boyd-Graber. 2015. Efficient methods for incorporating knowledge into topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 308–317.
    [166]
    Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 937–946.
    [167]
    Liang Yao, Yin Zhang, Baogang Wei, Lei Li, Fei Wu, Peng Zhang, and Yali Bian. 2016. Concept over time: The combination of probabilistic topic model with Wikipedia knowledge. Exp. Syst. Applic. 60 (2016), 27–38.
    [168]
    Chyi-Kwei Yau, Alan Porter, Nils Newman, and Arho Suominen. 2014. Clustering scientific documents with topic modeling. Scientometrics 100, 3 (2014), 767–786.
    [169]
    Hsiang-Fu Yu, Cho-Jui Hsieh, Hyokun Yun, S. V. N. Vishwanathan, and Inderjit S. Dhillon. 2015. A scalable asynchronous distributed algorithm for topic modeling. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1340–1350.
    [170]
    Bo Yuan, Xinbo Gao, Zhenxing Niu, and Qi Tian. 2019. Discovering latent topics by Gaussian latent Dirichlet allocation and spectral clustering. ACM Trans. Multimedia Comput. Commun. Applic. 15, 1 (2019), 25.
    [171]
    Lele Yu, Ce Zhang, Yingxia Shao, and Bin Cui. 2017. LDA*: A robust and large-scale topic modeling system. Proc. VLDB Endow. 10, 11 (2017), 1406–1417.
    [172]
    Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. 2017. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequential data. In Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 3967–3976.
    [173]
    Jianping Zeng, Jiangjiao Duan, Wenjun Cao, and Chengrong Wu. 2012. Topics modeling based on selective Zipf distribution. Exp. Syst. Applic. 39, 7 (2012), 6541–6546.
    [174]
    Ke Zhai and Jordan Boyd-Graber. 2013. Online latent Dirichlet allocation with infinite vocabulary. In Proceedings of the International Conference on Machine Learning. 561–569.
    [175]
    Ke Zhai, Jordan Boyd-Graber, Nima Asadi, and Mohamad L. Alkhouja. 2012. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In Proceedings of the 21st International Conference on World Wide Web. ACM, 879–888.
    [176]
    Jianwen Zhang, Yangqiu Song, Changshui Zhang, and Shixia Liu. 2010. Evolutionary hierarchical Dirichlet processes for multiple correlated time-varying corpora. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1079–1088.
    [177]
    Tao Zhang, Kang Liu, Jun Zhao, et al. 2013. Cross lingual entity linking with bilingual topic model. In Proceedings of the International Joint Conference on Artificial Intelligence. 2218–2224.
    [178]
    Bing Zhao and Eric P. Xing. 2006. BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the COLING/ACL on Main Conference Poster Sessions. Association for Computational Linguistics, 969–976.
    [179]
    Bing Zhao and Eric P. Xing. 2007. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. Advances in Neural Information Processing Systems 20 (2007), 1689–1696.
    [180]
    Feng Zhao, Yajun Zhu, Hai Jin, and Laurence T. Yang. 2016. A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Fut. Gen. Comput. Syst. 65 (2016), 196–206.
    [181]
    Huasha Zhao, Biye Jiang, John F. Canny, and Bobby Jaros. 2015. Same but different: Fast and high quality Gibbs parameter estimation. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1495–1502.
    [182]
    Wenjun Zhu, Liqing Zhang, and Qianwei Bian. 2012. A hierarchical latent topic model based on sparse coding. Neurocomputing 76, 1 (2012), 28–35.
    [183]
    Elaine Zosa and Mark Granroth-Wilding. 2019. Multilingual dynamic topic model. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’19). 1388–1396.
    [184]
    Jialing Zou, Qixiang Ye, Yanting Cui, Fang Wan, Kun Fu, and Jianbin Jiao. 2016. Collective motion pattern inference via locally consistent latent Dirichlet allocation. Neurocomputing 184 (2016), 221–231.
    [185]
    Yuan Zuo, Junjie Wu, Hui Zhang, Hao Lin, Fei Wang, Ke Xu, and Hui Xiong. 2016. Topic modeling of short texts: A pseudo-document view. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2105–2114.

    Information

    Published In

    ACM Computing Surveys  Volume 54, Issue 7
    September 2022
    778 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/3476825
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 September 2021
    Accepted: 01 April 2021
    Revised: 01 March 2021
    Received: 01 April 2020
    Published in CSUR Volume 54, Issue 7

    Author Tags

    1. Topic modeling
    2. Gibbs sampling
    3. Latent Dirichlet allocation
    4. Probabilistic model
    5. Statistical inference

    Qualifiers

    • Research-article
    • Research
    • Refereed
