Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

Published: 01 January 2013
Abstract

    Topic modeling provides a powerful way to analyze the content of a collection of documents. It has become a popular tool in many research areas, such as text mining, information retrieval, natural language processing, and other related fields. In real-world applications, however, the usefulness of topic modeling is limited due to scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps, such as vastly reducing the input vocabulary. In this article we introduce Regularized Latent Semantic Indexing (RLSI)---including a batch version and an online version, referred to as batch RLSI and online RLSI, respectively---to scale up topic modeling. Batch RLSI and online RLSI are as effective as existing topic modeling techniques and can scale to larger datasets without reducing the input vocabulary. Moreover, online RLSI can be applied to streaming data and can capture the dynamic evolution of topics. Both versions of RLSI formalize topic modeling as a problem of minimizing a quadratic loss function regularized by the ℓ1 and/or ℓ2 norm. This formulation allows the learning process to be decomposed into multiple suboptimization problems that can be optimized in parallel, for example, via MapReduce. In particular, we propose adopting the ℓ1 norm on topics and the ℓ2 norm on document representations to create a model with compact, readable topics that is also useful for retrieval. In learning, batch RLSI processes all the documents in the collection as a whole, while online RLSI processes the documents in the collection one by one. We also prove the convergence of online RLSI learning. Relevance ranking experiments on three TREC datasets show that batch RLSI and online RLSI perform better than LSI, PLSI, LDA, and NMF, and the improvements are sometimes statistically significant. Experiments on a Web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar boost in performance.
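
    The loss-plus-regularization formulation described in the abstract can be made concrete with a minimal sketch. The code below assumes the usual matrix-factorization notation (D for the term-document matrix, U for the term-topic matrix, V for the topic-document matrix, with regularization weights lambda1 and lambda2); these names, and the use of scikit-learn's Lasso solver, are illustrative assumptions rather than the authors' implementation. The sketch shows why the formulation decomposes: the V-update is a closed-form ridge regression that treats every document independently, and the U-update splits into one independent ℓ1-regularized least-squares problem per term, which is the property that allows parallel (e.g., MapReduce-style) optimization.

        import numpy as np
        from sklearn.linear_model import Lasso

        def rlsi_batch(D, K, lambda1=0.1, lambda2=0.1, iters=10, seed=0):
            """Sketch of alternating minimization for
            ||D - U V||_F^2 + lambda1 * sum_k ||u_k||_1 + lambda2 * sum_n ||v_n||_2^2."""
            M, N = D.shape
            rng = np.random.default_rng(seed)
            U = 0.01 * rng.standard_normal((M, K))   # term-topic matrix (l1-regularized)
            V = np.zeros((K, N))                     # topic-document matrix (l2-regularized)
            for _ in range(iters):
                # V-update: ridge regression with a closed form; each document
                # (column of D) could be solved independently and in parallel.
                V = np.linalg.solve(U.T @ U + lambda2 * np.eye(K), U.T @ D)
                # U-update: one l1-regularized least-squares problem per term
                # (row of D); the rows are independent, hence parallelizable.
                # alpha is rescaled to match scikit-learn's 1/(2N) loss scaling.
                lasso = Lasso(alpha=lambda1 / (2 * N), fit_intercept=False, max_iter=1000)
                for m in range(M):
                    U[m, :] = lasso.fit(V.T, D[m, :]).coef_
            return U, V

    In this sketch, partitioning D by rows (terms) for the U-update and by columns (documents) for the V-update means each subproblem touches only its own slice of the data, which mirrors the decomposition the abstract relies on for large-scale parallelization.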

      Published In

      ACM Transactions on Information Systems, Volume 31, Issue 1
      January 2013
      163 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      DOI: 10.1145/2414782

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 January 2013
      Accepted: 01 November 2012
      Revised: 01 October 2012
      Received: 01 September 2011
      Published in TOIS Volume 31, Issue 1

      Author Tags

      1. Topic modeling
      2. distributed learning
      3. online learning
      4. regularization
      5. sparse methods

      Qualifiers

      • Research-article
      • Research
      • Refereed
