Probabilistic Topic Modeling for Comparative Analysis of Document Collections

Published: 04 March 2020

Abstract

Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for “comparative thinking” analysis in real-world document collections.
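The idea of mixing shared and collection-specific topics can be illustrated with a toy generative sketch. This is not the model proposed in the article: the per-word switch probability `p_common`, the symmetric Dirichlet parameter `alpha`, the function names, and the tiny vocabularies are all assumptions made for illustration only.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw a symmetric Dirichlet sample by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(collection, common_topics, distinctive_topics,
                      n_words, rng, p_common=0.5, alpha=0.5):
    """Generate one toy document belonging to `collection`.

    common_topics: list of word distributions shared by every collection.
    distinctive_topics: dict mapping a collection name to the list of word
        distributions used only by that collection.
    Each word distribution is a dict mapping word -> probability.
    """
    own_topics = distinctive_topics[collection]
    # Per-document topic proportions within each pool (assumed Dirichlet).
    theta_common = sample_dirichlet(alpha, len(common_topics), rng)
    theta_own = sample_dirichlet(alpha, len(own_topics), rng)
    words = []
    for _ in range(n_words):
        # Hypothetical switch: use the shared pool or the collection's own pool.
        if rng.random() < p_common:
            topic = rng.choices(common_topics, weights=theta_common)[0]
        else:
            topic = rng.choices(own_topics, weights=theta_own)[0]
        vocab = list(topic)
        word = rng.choices(vocab, weights=[topic[w] for w in vocab])[0]
        words.append(word)
    return words


# Toy topics: one common topic and one distinctive topic per collection.
COMMON = [{"data": 0.5, "model": 0.5}]
DISTINCTIVE = {
    "A": [{"protein": 0.5, "gene": 0.5}],
    "B": [{"market": 0.5, "stock": 0.5}],
}

rng = random.Random(0)
doc_a = generate_document("A", COMMON, DISTINCTIVE, 20, rng)
doc_b = generate_document("B", COMMON, DISTINCTIVE, 20, rng)
```

In this sketch, documents from collections A and B can share the words of the common topic, while words from a distinctive topic appear only in documents of the collection that owns it. Inference, which recovers the topics from observed documents, is not shown here.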

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 2
April 2020
322 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3382774
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2020
Accepted: 01 October 2019
Revised: 01 August 2019
Received: 01 March 2018
Published in TKDD Volume 14, Issue 2

Author Tags

  1. Probabilistic topic modeling
  2. text mining

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 237
  • Downloads (Last 6 weeks): 35
Reflects downloads up to 15 Oct 2024.

Cited By

  • (2024) Insights into the nutritional prevention of macular degeneration based on a comparative topic modeling approach. PeerJ Computer Science 10 (e1940). DOI: 10.7717/peerj-cs.1940. Online publication date: 20-Mar-2024
  • (2024) Data lake management using topic modeling techniques. Data and Metadata 3 (282). DOI: 10.56294/dm2024282. Online publication date: 15-Apr-2024
  • (2024) Hidden Variable Models in Text Classification and Sentiment Analysis. Electronics 13:10 (1859). DOI: 10.3390/electronics13101859. Online publication date: 10-May-2024
  • (2024) Integration of Neural Embeddings and Probabilistic Models in Topic Modeling. Applied Artificial Intelligence 38:1. DOI: 10.1080/08839514.2024.2403904. Online publication date: 4-Oct-2024
  • (2024) Parallel inference for cross-collection latent generalized Dirichlet allocation model and applications. Expert Systems with Applications 238 (121720). DOI: 10.1016/j.eswa.2023.121720. Online publication date: Mar-2024
  • (2023) A survey of topic models. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:6 (9929-9953). DOI: 10.3233/JIFS-233551. Online publication date: 1-Jan-2023
  • (2023) Advancing Multinomial Regression and Topic Modeling with Beta-Liouville Distributions. 2023 International Conference on Machine Learning and Applications (ICMLA) (1928-1935). DOI: 10.1109/ICMLA58977.2023.00292. Online publication date: 15-Dec-2023
  • (2023) Generalized Dirichlet-Multinomial Regression: Leveraging Arbitrary Features for Topic Modelling. 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (884-891). DOI: 10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00128. Online publication date: 17-Dec-2023
  • (2023) Question Tags or Text for Topic Modeling. Procedia Computer Science 218:C (2172-2180). DOI: 10.1016/j.procs.2023.01.193. Online publication date: 1-Jan-2023
  • (2023) Cross-collection latent Beta-Liouville allocation model training with privacy protection and applications. Applied Intelligence 53:14 (17824-17848). DOI: 10.1007/s10489-022-04378-3. Online publication date: 13-Jan-2023
