Group topic model: organizing topics into groups

Published: 01 February 2015

Abstract

Latent Dirichlet allocation defines hidden topics to capture latent semantics in text documents. However, it assumes that all the documents are represented by the same topics, resulting in the “forced topic” problem. To solve this problem, we developed a group latent Dirichlet allocation (GLDA). GLDA uses two kinds of topics: local topics and global topics. The highly related local topics are organized into groups to describe the local semantics, whereas the global topics are shared by all the documents to describe the background semantics. GLDA uses variational inference algorithms for both offline and online data. We evaluated the proposed model for topic modeling and document clustering. Our experimental results indicated that GLDA can achieve a competitive performance when compared with state-of-the-art approaches.
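The split between grouped local topics and shared global topics can be illustrated with a toy generative sketch. This is only an illustration under stated assumptions, not the model defined in the paper: it assumes each document belongs to exactly one group, that its topic proportions are drawn from a symmetric Dirichlet over that group's local topics plus the global topics, and that words are then drawn as in standard LDA. The names `dirichlet`, `generate_document`, `groups`, `global_topics`, and `topic_word` are hypothetical.

```python
import random


def dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, groups, global_topics, topic_word, rng, alpha=1.0):
    """Toy generative sketch (an assumption, not the paper's exact model).

    groups:        list of lists of local topic ids, one list per group
    global_topics: list of topic ids shared by every document
    topic_word:    dict mapping topic id -> word probabilities over the vocabulary
    """
    # Assumption: each document is assigned to a single group, uniformly at random.
    group = rng.randrange(len(groups))
    # The document may only use its group's local topics plus the global topics,
    # so it is never forced to spread mass over unrelated topics.
    allowed = groups[group] + global_topics
    theta = dirichlet([alpha] * len(allowed), rng)
    vocab_size = len(next(iter(topic_word.values())))
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(allowed, weights=theta)[0]                    # topic for this word
        w = rng.choices(range(vocab_size), weights=topic_word[z])[0]  # word id from topic z
        topics.append(z)
        words.append(w)
    return group, topics, words


rng = random.Random(0)
groups = [[0, 1], [2, 3]]   # two groups of local topics
global_topics = [4]         # one background topic shared by all documents
topic_word = {k: dirichlet([1.0] * 6, rng) for k in range(5)}
group, topics, words = generate_document(20, groups, global_topics, topic_word, rng)
print(group, sorted(set(topics)))
```

In this sketch a document drawn from one group never uses another group's local topics, which is one way to read how grouping avoids the "forced topic" problem; the actual GLDA priors and inference procedure are those given in the paper.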

Published In

Information Retrieval, Volume 18, Issue 1
Feb 2015
94 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 February 2015
Accepted: 27 August 2014
Received: 12 March 2014

Author Tags

  1. Topic modeling
  2. Latent Dirichlet allocation
  3. Group
  4. Variational inference
  5. Online learning
  6. Document clustering

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 0
  • Downloads (last 6 weeks): 0

Reflects downloads up to 04 Oct 2024

Cited By

  • (2024) A two-stage clustering ensemble algorithm applicable to risk assessment of railway signaling faults. Expert Systems with Applications: An International Journal, 249:PA. DOI: 10.1016/j.eswa.2024.123500. Online publication date: 1-Sep-2024
  • (2023) Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 27:9 (5397-5410). DOI: 10.1007/s00500-022-07771-9. Online publication date: 6-Jan-2023
  • (2019) Pseudo Topic Analysis for Boosting Pseudo Relevance Feedback. Web and Big Data (345-361). DOI: 10.1007/978-3-030-26072-9_26. Online publication date: 1-Aug-2019
  • (2018) Short text topic modeling by exploring original documents. Knowledge and Information Systems, 56:2 (443-462). DOI: 10.1007/s10115-017-1099-0. Online publication date: 1-Aug-2018
  • (2017) An effective and interpretable method for document classification. Knowledge and Information Systems, 50:3 (763-793). DOI: 10.1007/s10115-016-0956-6. Online publication date: 1-Mar-2017
