DOI: 10.1145/1458082.1458202
research-article

Modeling hidden topics on document manifold

Published: 26 October 2008

Abstract

Topic modeling has been a key problem for document analysis. One of the canonical approaches to topic modeling is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document over the hidden topics independently, and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious overfitting. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. It is therefore more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on the intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and performs topic modeling directly on this document manifold. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides a better representation in the sense of semantic structure.
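
To make the PLSI baseline and the idea of a graph-based smoothness term concrete, the following is a minimal sketch, not the authors' implementation. It fits plain PLSI by EM on a small count matrix and separately evaluates a generic graph Laplacian penalty on the per-document topic distributions. The function names (`plsi_em`, `log_likelihood`, `laplacian_penalty`), the dense list-of-lists layout, and the document affinity matrix `weights` are all assumptions made for illustration; the abstract does not specify how LapPLSI combines the two terms, so the penalty is only reported here, not optimized.

```python
import math
import random


def _normalize(row, eps=1e-12):
    """Scale a non-negative row to sum to 1; eps guards against zeros."""
    total = sum(row) + eps * len(row)
    return [(v + eps) / total for v in row]


def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z|d) and P(w|z) by EM on a document-term count matrix."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # One topic distribution per document: n_docs * n_topics parameters,
    # which is the linear growth with corpus size noted in the abstract.
    p_z_d = [_normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                post = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(post)
                for z in range(n_topics):
                    resp = counts[d][w] * post[z] / total
                    acc_z_d[d][z] += resp
                    acc_w_z[z][w] += resp
        # M-step: renormalize the expected counts.
        p_z_d = [_normalize(row) for row in acc_z_d]
        p_w_z = [_normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum of n(d,w) * log P(w|d); the document-prior term is omitted."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p_w_d = sum(p_w_z[z][w] * p_z_d[d][z]
                            for z in range(len(p_w_z)))
                ll += n_dw * math.log(p_w_d)
    return ll


def laplacian_penalty(p_z_d, weights):
    """Generic graph smoothness term: 0.5 * sum_ij W_ij * ||p_i - p_j||^2.

    `weights` is a hypothetical symmetric document affinity matrix,
    for example from a nearest-neighbor graph.
    """
    total = 0.0
    for i, p_i in enumerate(p_z_d):
        for j, p_j in enumerate(p_z_d):
            dist2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j))
            total += weights[i][j] * dist2
    return 0.5 * total
```

The penalty is zero when all connected documents share the same topic distribution and grows as neighboring documents are assigned different topics, which is one common way to encode the assumption that nearby documents on the manifold should have similar representations.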

Published In

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN: 9781595939913
DOI: 10.1145/1458082
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document representation
  2. generative model
  3. manifold regularization
  4. probabilistic latent semantic indexing

Qualifiers

  • Research-article

Conference

CIKM08: Conference on Information and Knowledge Management
October 26 - 30, 2008
Napa Valley, California, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) Robust multilayer bootstrap networks in ensemble for unsupervised representation learning and clustering. Pattern Recognition 156 (110739). DOI: 10.1016/j.patcog.2024.110739. Online publication date: Dec-2024
  • (2023) A Comparative Study of K-means and NMF Clustering Algorithms. 2023 2nd International Engineering Conference on Electrical, Energy, and Artificial Intelligence (EICEEAI) (1-4). DOI: 10.1109/EICEEAI60672.2023.10590510. Online publication date: 27-Dec-2023
  • (2023) Error bound and exact penalty method for optimization problems with nonnegative orthogonal constraint. IMA Journal of Numerical Analysis 44:1 (120-156). DOI: 10.1093/imanum/drac084. Online publication date: 2-Feb-2023
  • (2023) Deep NMF topic modeling. Neurocomputing 515 (157-173). DOI: 10.1016/j.neucom.2022.10.002. Online publication date: Jan-2023
  • (2023) Graph neural topic model with commonsense knowledge. Information Processing & Management 60:2 (103215). DOI: 10.1016/j.ipm.2022.103215. Online publication date: Mar-2023
  • (2022) Broadcast news story segmentation using sticky hierarchical dirichlet process. Applied Intelligence 52:11 (12788-12800). DOI: 10.1007/s10489-021-03098-4. Online publication date: 14-Feb-2022
  • (2022) An exact penalty approach for optimization with nonnegative orthogonality constraints. Mathematical Programming 198:1 (855-897). DOI: 10.1007/s10107-022-01794-8. Online publication date: 25-Mar-2022
  • (2021) Generalized Separable Nonnegative Matrix Factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 43:5 (1546-1561). DOI: 10.1109/TPAMI.2019.2956046. Online publication date: 1-May-2021
  • (2021) Deep Topic Modeling by Multilayer Bootstrap Network and Lasso. 2020 25th International Conference on Pattern Recognition (ICPR) (2470-2475). DOI: 10.1109/ICPR48806.2021.9412751. Online publication date: 10-Jan-2021
  • (2021) Quartic First-Order Methods for Low-Rank Minimization. Journal of Optimization Theory and Applications. DOI: 10.1007/s10957-021-01820-3. Online publication date: 23-Mar-2021
