DOI: 10.5555/3023638.3023709
Article

Integrating document clustering and topic modeling

Published: 11 August 2013

Abstract

Document clustering and topic modeling are two closely related tasks that can benefit each other. Topic modeling can project documents into a topic space, which facilitates effective document clustering. Cluster labels discovered by document clustering can be incorporated into topic models to extract local topics specific to each cluster and global topics shared by all clusters. In this paper, we propose a multi-grain clustering topic model (MGCTM) that integrates document clustering and topic modeling into a unified framework and performs the two tasks jointly to achieve the best overall performance. Our model tightly couples two components: a mixture component for discovering latent groups in a document collection, and a topic model component for mining multi-grain topics, including local topics specific to each cluster and global topics shared across clusters. We employ variational inference to approximate the posterior of the hidden variables and to learn the model parameters. Experiments on two datasets demonstrate the effectiveness of our model.
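To make the coupling of the two components concrete, the following is a minimal sketch of how a generative process of this kind could be simulated. It is an illustration only, not the model specification from the paper: the symmetric Dirichlet priors, the per-document Beta-distributed switch between local and global topics, and every name and default value (`sample_corpus`, `n_local`, `n_global`, and so on) are assumptions introduced here. The mixture component picks a cluster for each document; each word then comes either from a local topic of that cluster or from a global topic shared by all clusters.

```python
import random


def dirichlet(rng, alpha, dim):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def draw(rng, probs):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_corpus(n_docs=6, doc_len=20, n_clusters=2, n_local=3,
                  n_global=2, vocab_size=30, seed=0):
    """Simulate documents from a cluster mixture with local and global topics.

    All priors and sizes are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    # Mixture component: prior over latent document clusters.
    cluster_prior = dirichlet(rng, 1.0, n_clusters)
    # Topic component: each cluster owns its local topics ...
    local_topics = [[dirichlet(rng, 0.5, vocab_size) for _ in range(n_local)]
                    for _ in range(n_clusters)]
    # ... while global topics are shared across all clusters.
    global_topics = [dirichlet(rng, 0.5, vocab_size) for _ in range(n_global)]

    docs, labels = [], []
    for _ in range(n_docs):
        cluster = draw(rng, cluster_prior)
        theta_local = dirichlet(rng, 1.0, n_local)
        theta_global = dirichlet(rng, 1.0, n_global)
        # Assumed switch: probability that a word uses a local topic.
        p_local = rng.betavariate(1.0, 1.0)
        words = []
        for _ in range(doc_len):
            if rng.random() < p_local:
                topic = draw(rng, theta_local)
                words.append(draw(rng, local_topics[cluster][topic]))
            else:
                topic = draw(rng, theta_global)
                words.append(draw(rng, global_topics[topic]))
        docs.append(words)
        labels.append(cluster)
    return docs, labels


if __name__ == "__main__":
    docs, labels = sample_corpus()
    for label, words in zip(labels, docs):
        print(label, words)
```

The sketch only samples data forward. Recovering the cluster labels and topics from observed documents is the inference problem, which the abstract says is handled with variational inference; that part is not shown here.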


Published In

UAI'13: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence
August 2013
722 pages
Editors: Ann Nicholson, Padhraic Smyth

Publisher

AUAI Press, Arlington, Virginia, United States

Qualifiers

• Article


Cited By

• (2019) CluWords. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 753-761. DOI: 10.1145/3289600.3291032. Online publication date: 30-Jan-2019.
• (2018) Finding communities with hierarchical semantics by distinguishing general and specialized topics. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3648-3654. DOI: 10.5555/3304222.3304275. Online publication date: 13-Jul-2018.
• (2018) Seed-Guided Topic Model for Document Filtering and Classification. ACM Transactions on Information Systems, 37(1):1-37. DOI: 10.1145/3238250. Online publication date: 6-Dec-2018.
• (2017) An effective and interpretable method for document classification. Knowledge and Information Systems, 50(3):763-793. DOI: 10.1007/s10115-016-0956-6. Online publication date: 1-Mar-2017.
