research-article

Modeling hidden topics on document manifold

Authors:

Chengxiang Zhai

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Pages 911 - 920

https://doi.org/10.1145/1458082.1458202

Published: 26 October 2008

Abstract

Topic modeling is a key problem in document analysis. One of the canonical approaches is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates each document's probability distribution over the hidden topics independently, so the number of model parameters grows linearly with the size of the corpus, which leads to serious overfitting. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating each document's distribution over topics as a hidden random variable. Both methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. It is therefore more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on the intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI). LapPLSI models the document space as a submanifold embedded in the ambient space and performs topic modeling directly on this document manifold. We compare LapPLSI with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides a better representation in the sense of semantic structure.
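
To make the PLSI baseline described in the abstract concrete, the following is a minimal sketch of plain PLSI fitted by EM on a toy count matrix, written in pure Python. It is not the paper's LapPLSI algorithm. The function names (`plsi_em`, `log_likelihood`, `n_parameters`, `laplacian_penalty`), the toy corpus, and the neighbour-weight matrix are all assumptions made for illustration. The likelihood is written in the conditional form over P(w|d), which leaves out the P(d) factor of the joint probability because that factor does not depend on the topic parameters. The `laplacian_penalty` helper shows a generic graph-smoothness term in the style of manifold regularization; this page does not give the paper's exact objective, so it should be read only as one plausible way such a term could look.

```python
import math
import random


def normalize(row, eps=1e-12):
    """Scale a non-negative row so it sums to 1 (eps guards all-zero rows)."""
    total = sum(row) + eps * len(row)
    return [(v + eps) / total for v in row]


def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit plain PLSI by EM. counts[d][w] is the count of word w in document d.

    Returns (p_z_d, p_w_z): P(z|d) per document and P(w|z) per topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) P(z|d).
                post = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(post)
                # M-step accumulation: expected counts per topic.
                for z in range(n_topics):
                    r = n * post[z] / total
                    new_z_d[d][z] += r
                    new_w_z[z][w] += r
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over (d, w) of n(d,w) * log sum_z P(w|z) P(z|d)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                ll += n * math.log(p)
    return ll


def n_parameters(n_docs, n_words, n_topics):
    """Number of table entries in PLSI; the first term grows with the corpus."""
    return n_docs * n_topics + n_topics * n_words


def laplacian_penalty(p_z_d, weights):
    """Assumed graph-smoothness term: 0.5 * sum_ij W_ij * ||P(.|d_i) - P(.|d_j)||^2."""
    total = 0.0
    for i, row_i in enumerate(p_z_d):
        for j, row_j in enumerate(p_z_d):
            sq = sum((a - b) ** 2 for a, b in zip(row_i, row_j))
            total += weights[i][j] * sq
    return 0.5 * total


if __name__ == "__main__":
    # Hypothetical toy corpus: two documents per vocabulary block.
    toy = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
    p_z_d, p_w_z = plsi_em(toy, n_topics=2)
    print("log-likelihood:", log_likelihood(toy, p_z_d, p_w_z))
    print("entries for 4 docs:", n_parameters(4, 4, 2))
    print("entries for 8 docs:", n_parameters(8, 4, 2))
```

The `n_parameters` helper makes the overfitting concern visible: doubling the number of documents adds one new row of P(z|d) values per document, while the P(w|z) table stays fixed. A penalty such as `laplacian_penalty` is small when documents that are neighbours in the graph have similar topic distributions, which is the intuition behind modeling topics on a document manifold.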

Cited By

Zhang X, Li X (2024). Robust multilayer bootstrap networks in ensemble for unsupervised representation learning and clustering. Pattern Recognition, 156 (110739). Online publication date: Dec-2024.
https://doi.org/10.1016/j.patcog.2024.110739
Basiri F, Amer A, Ranjbar Naserabadi M, Moghimi M (2023). A Comparative Study of K-means and NMF Clustering Algorithms. 2023 2nd International Engineering Conference on Electrical, Energy, and Artificial Intelligence (EICEEAI), (1-4). Online publication date: 27-Dec-2023.
https://doi.org/10.1109/EICEEAI60672.2023.10590510
Qian Y, Pan S, Xiao L (2023). Error bound and exact penalty method for optimization problems with nonnegative orthogonal constraint. IMA Journal of Numerical Analysis, 44:1 (120-156). Online publication date: 2-Feb-2023.
https://doi.org/10.1093/imanum/drac084

Index Terms

Modeling hidden topics on document manifold
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
2. Mathematics of computing
  1. Probability and statistics
    1. Statistical paradigms
      1. Statistical graphics

Published In

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

October 2008

1562 pages

ISBN:9781595939913

DOI:10.1145/1458082

General Chair:
James G. Shanahan, Church and Duncan Group Inc, USA

Program Chairs:
Sihem Amer-Yahia, Yahoo! Research, USA
Ioana Manolescu, INRIA, France
Yi Zhang, University of California, Santa Cruz, USA
David A. Evans, JustSystems Evans Research, USA
Alek Kolcz, Microsoft Live Labs, USA
Key-Sun Choi, KAIST, Korea
Abdur Chowdury, Twitter, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM08: Conference on Information and Knowledge Management

October 26 - 30, 2008

Napa Valley, California, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
