Probabilistic Topic Modeling for Comparative Analysis of Document Collections

Authors:

Chandan K. Reddy

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 14, Issue 2

Article No.: 24, Pages 1 - 27

https://doi.org/10.1145/3369873

Published: 04 March 2020

Abstract

Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for “comparative thinking” analysis in real-world document collections.
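The idea of mixing common and distinctive topics can be illustrated with a toy generative sketch. This is not the authors' model: the priors, the inference procedure, and the full generative process are not given on this page, so the function names (`dirichlet`, `sample_document`), the symmetric Dirichlet parameter `alpha`, and the tiny vocabularies below are all assumptions made purely for illustration.

```python
import random


def dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_document(collection, common_topics, distinctive_topics, n_words, rng, alpha=0.5):
    """Generate one toy document for `collection` (hypothetical helper).

    The document mixes the shared topics with only the topics owned by its
    own collection; topics of other collections are never available to it.
    """
    topics = common_topics + distinctive_topics[collection]
    theta = dirichlet([alpha] * len(topics), rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # pick a word from it
    return words


rng = random.Random(0)
# Assumed toy topics: one shared topic and one exclusive topic per collection.
common_topics = [{"model": 0.5, "data": 0.5}]
distinctive_topics = {
    "A": [{"protein": 0.6, "gene": 0.4}],
    "B": [{"market": 0.7, "stock": 0.3}],
}
doc_a = sample_document("A", common_topics, distinctive_topics, 20, rng)
doc_b = sample_document("B", common_topics, distinctive_topics, 20, rng)
```

In this sketch, a document from collection A can only contain words from the shared topic or from A's own topic, which is the sense in which a distinctive topic is exclusive to one dataset.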

Index Terms

Probabilistic Topic Modeling for Comparative Analysis of Document Collections
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document topic models
    2. Retrieval tasks and goals
      1. Clustering and classification

Published In

ACM Transactions on Knowledge Discovery from Data Volume 14, Issue 2

April 2020

322 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3382774

Editors:
Charu Aggarwal (IBM T. J. Watson Research, USA),
Xindong Wu (Mininglamp Academy of Sciences, China)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2020

Accepted: 01 October 2019

Revised: 01 August 2019

Received: 01 March 2018

Published in TKDD Volume 14, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Article Metrics

Total Citations: 18
Total Downloads: 952
Downloads (Last 12 months): 237
Downloads (Last 6 weeks): 35

Reflects downloads up to 15 Oct 2024

Cited By

Jacaruso L. (2024). Insights into the nutritional prevention of macular degeneration based on a comparative topic modeling approach. PeerJ Computer Science 10 (e1940). Online publication date: 20-Mar-2024
https://doi.org/10.7717/peerj-cs.1940
Cherradi M. (2024). Data lake management using topic modeling techniques. Data and Metadata 3 (282). Online publication date: 15-Apr-2024
https://doi.org/10.56294/dm2024282
Koochemeshkian P., Ihou Koffi E., Bouguila N. (2024). Hidden Variable Models in Text Classification and Sentiment Analysis. Electronics 13:10 (1859). Online publication date: 10-May-2024
https://doi.org/10.3390/electronics13101859
Koochemeshkian P., Bouguila N. (2024). Integration of Neural Embeddings and Probabilistic Models in Topic Modeling. Applied Artificial Intelligence 38:1. Online publication date: 4-Oct-2024
https://doi.org/10.1080/08839514.2024.2403904
Luo Z., Amayri M., Fan W., Ihou K., Bouguila N. (2024). Parallel inference for cross-collection latent generalized Dirichlet allocation model and applications. Expert Systems with Applications 238 (121720). Online publication date: Mar-2024
https://doi.org/10.1016/j.eswa.2023.121720
Cheng G., You Q., Shi L., Wang Z., Luo J., Li T. (2023). A survey of topic models. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:6 (9929-9953). Online publication date: 1-Jan-2023
https://dl.acm.org/doi/10.3233/JIFS-233551
Koochemeshkian P., Bouguila N. (2023). Advancing Multinomial Regression and Topic Modeling with Beta-Liouville Distributions. 2023 International Conference on Machine Learning and Applications (ICMLA) (1928-1935). Online publication date: 15-Dec-2023
https://doi.org/10.1109/ICMLA58977.2023.00292
Koochemeshkian P., Bouguila N. (2023). Generalized Dirichlet-Multinomial Regression: Leveraging Arbitrary Features for Topic Modelling. 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (884-891). Online publication date: 17-Dec-2023
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00128
Prabha S., Sardana N. (2023). Question Tags or Text for Topic Modeling. Procedia Computer Science 218:C (2172-2180). Online publication date: 1-Jan-2023
https://dl.acm.org/doi/10.1016/j.procs.2023.01.193
Luo Z., Amayri M., Fan W., Bouguila N. (2023). Cross-collection latent Beta-Liouville allocation model training with privacy protection and applications. Applied Intelligence 53:14 (17824-17848). Online publication date: 13-Jan-2023
https://dl.acm.org/doi/10.1007/s10489-022-04378-3
