DOI: 10.1145/2009916.2010008
Research article

Regularized latent semantic indexing

Published: 24 July 2011

Abstract

Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization. It is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary. RLSI formalizes topic modeling as the minimization of a quadratic loss function regularized by the ℓ₂ and/or ℓ₁ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose applying the ℓ₂ norm to topics and the ℓ₁ norm to document representations, yielding a model with compact, readable topics that is also useful for retrieval. Relevance-ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, with improvements that are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar performance boost on a larger corpus and vocabulary than in previous studies.
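The formulation described above — a quadratic reconstruction loss with an ℓ₂ penalty on topics and an ℓ₁ penalty on document representations, minimized by alternating over independent sub-problems — can be sketched roughly as follows. This is a minimal single-machine illustration under assumptions of ours: the function name, parameter values, and the specific update scheme (closed-form ridge for topics, coordinate-descent lasso for documents) are illustrative, not the paper's published algorithm.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm: shrink toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rlsi_sketch(D, k, lam1=0.05, lam2=0.01, iters=15, inner=10, seed=0):
    """Alternating minimization of
        ||D - U V||_F^2 + lam1 * sum|V| + lam2 * ||U||_F^2,
    where D is the term-document matrix (m x n), U holds k topics
    (m x k), and V holds sparse document representations (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    U = rng.standard_normal((m, k))
    V = np.zeros((k, n))
    for _ in range(iters):
        # V-step: lasso per document via coordinate descent. Each
        # column of V depends only on its own document, so these
        # sub-problems are independent and could run in parallel
        # (e.g. one map task per batch of documents).
        S = U.T @ U          # k x k Gram matrix of topics
        R = U.T @ D          # k x n topic-document correlations
        for _ in range(inner):
            for j in range(k):
                # correlation of topic j with the residual that
                # excludes topic j's own contribution
                rho = R[j] - S[j] @ V + S[j, j] * V[j]
                V[j] = soft_threshold(rho, lam1 / 2) / (S[j, j] + 1e-12)
        # U-step: ridge regression with a closed-form solution,
        # U = D V^T (V V^T + lam2 I)^{-1}.
        G = V @ V.T + lam2 * np.eye(k)
        U = np.linalg.solve(G, V @ D.T).T
    return U, V
```

The ℓ₁ penalty drives many entries of V to exactly zero (compact document representations), while the ℓ₂ penalty keeps the topic matrix U dense but well-conditioned, which matches the abstract's motivation for that particular choice of norms.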


    Published In

    SIGIR '11: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2011, 1374 pages
    ISBN: 9781450307574
    DOI: 10.1145/2009916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. regularization
    2. sparse methods
    3. topic modeling


    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

