
Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

Published: 01 January 2013

Abstract

Topic modeling provides a powerful way to analyze the content of a collection of documents. It has become a popular tool in many research areas, such as text mining, information retrieval, and natural language processing. In real-world applications, however, the usefulness of topic modeling is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps, such as vastly reducing the input vocabulary. In this article we introduce Regularized Latent Semantic Indexing (RLSI)---including a batch version and an online version, referred to as batch RLSI and online RLSI, respectively---to scale up topic modeling. Batch RLSI and online RLSI are as effective as existing topic modeling techniques and can scale to larger datasets without reducing the input vocabulary. Moreover, online RLSI can be applied to streaming data and can capture the dynamic evolution of topics. Both versions of RLSI formalize topic modeling as the problem of minimizing a quadratic loss function regularized by the ℓ1 and/or ℓ2 norm. This formulation allows the learning process to be decomposed into multiple suboptimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose adopting the ℓ1 norm on topics and the ℓ2 norm on document representations to create a model with compact and readable topics that is also useful for retrieval. In learning, batch RLSI processes all the documents in the collection as a whole, while online RLSI processes them one by one. We also prove the convergence of online RLSI learning. Relevance ranking experiments on three TREC datasets show that batch RLSI and online RLSI perform better than LSI, PLSI, LDA, and NMF, with improvements that are sometimes statistically significant. Experiments on a Web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar boost in performance.
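The formulation described in the abstract — approximating a term-document matrix D by topics U and document representations V, with an ℓ1 penalty on U and an ℓ2 penalty on V — can be sketched as alternating minimization. The sketch below is illustrative, not the paper's actual algorithm: the V-step uses the closed-form ridge solution, while the U-step is solved here with a simple ISTA (iterative soft-thresholding) loop rather than the coordinate descent and MapReduce decomposition the paper employs; all function names and hyperparameter values are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def batch_rlsi(D, K, lam1=0.1, lam2=0.1, n_iter=20, inner=30):
    """Toy sketch of batch-RLSI-style factorization: D (terms x docs) ~ U @ V,
    minimizing 0.5*||D - UV||_F^2 + lam1*||U||_1 + 0.5*lam2*||V||_F^2."""
    M, N = D.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((M, K))
    V = np.zeros((K, N))
    for _ in range(n_iter):
        # V-step: ridge regression with a closed-form solution.
        # This decomposes over documents (columns of D), hence parallelizable.
        V = np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ D)
        # U-step: l1-regularized least squares, solved by ISTA.
        # This decomposes over terms (rows of D), hence parallelizable too.
        L = np.linalg.norm(V @ V.T, 2) + 1e-12  # Lipschitz const. of gradient
        for _ in range(inner):
            G = (U @ V - D) @ V.T  # gradient of 0.5*||D - UV||_F^2 w.r.t. U
            U = soft_threshold(U - G / L, lam1 / L)
    return U, V
```

The ℓ1 penalty drives many entries of U to exactly zero (compact, readable topics), while the ℓ2 penalty on V keeps document representations dense and stable, matching the design choice advocated in the abstract.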


Cited By

  • (2023) "Influence of the Spatial Distribution of Jobs in Intervening Opportunities Models". Transportation Research Record: Journal of the Transportation Research Board 2677(5), 1441-1454. DOI: 10.1177/03611981221143374. Online publication date: 6 Jan 2023.
  • (2023) "A Summary of Unsupervised Learning Methods". In Machine Learning Methods, 493-498. DOI: 10.1007/978-981-99-3917-6_22. Online publication date: 6 Dec 2023.
  • (2023) "Improved Evolutionary Approach for Tuning Topic Models with Additive Regularization". In Hybrid Artificial Intelligent Systems, 409-420. DOI: 10.1007/978-3-031-40725-3_35. Online publication date: 29 Aug 2023.


Published In

ACM Transactions on Information Systems, Volume 31, Issue 1
January 2013
163 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/2414782

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2013
    Accepted: 01 November 2012
    Revised: 01 October 2012
    Received: 01 September 2011
    Published in TOIS Volume 31, Issue 1


    Author Tags

1. topic modeling
2. distributed learning
3. online learning
4. regularization
5. sparse methods

    Qualifiers

    • Research-article
    • Research
    • Refereed

