DOI: 10.1145/2009916.2010008
Research article

Regularized latent semantic indexing

Published: 24 July 2011

Abstract

Topic modeling can boost the performance of information retrieval, but its real-world application is limited by scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing the input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization. It is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary. RLSI formalizes topic modeling as the minimization of a quadratic loss function regularized by the ℓ₂ and/or ℓ₁ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems that can be solved in parallel, for example via MapReduce. In particular, we propose applying the ℓ₂ norm to topics and the ℓ₁ norm to document representations, yielding a model with compact, readable topics that is also useful for retrieval. Relevance-ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, with improvements that are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar performance boost on a larger corpus and vocabulary than in previous studies.
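The formulation described above — a quadratic reconstruction loss with an ℓ₂ penalty on topics and an ℓ₁ penalty on document representations, minimized by alternating over independent sub-problems — can be sketched roughly as follows. This is a minimal single-machine illustration under assumptions of ours: the function name, parameter values, and the specific update scheme (closed-form ridge for topics, coordinate-descent lasso for documents) are illustrative, not the paper's published algorithm.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm: shrink toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rlsi_sketch(D, k, lam1=0.05, lam2=0.01, iters=15, inner=10, seed=0):
    """Alternating minimization of
        ||D - U V||_F^2 + lam1 * sum|V| + lam2 * ||U||_F^2,
    where D is the term-document matrix (m x n), U holds k topics
    (m x k), and V holds sparse document representations (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    U = rng.standard_normal((m, k))
    V = np.zeros((k, n))
    for _ in range(iters):
        # V-step: lasso per document via coordinate descent. Each
        # column of V depends only on its own document, so these
        # sub-problems are independent and could run in parallel
        # (e.g. one map task per batch of documents).
        S = U.T @ U          # k x k Gram matrix of topics
        R = U.T @ D          # k x n topic-document correlations
        for _ in range(inner):
            for j in range(k):
                # correlation of topic j with the residual that
                # excludes topic j's own contribution
                rho = R[j] - S[j] @ V + S[j, j] * V[j]
                V[j] = soft_threshold(rho, lam1 / 2) / (S[j, j] + 1e-12)
        # U-step: ridge regression with a closed-form solution,
        # U = D V^T (V V^T + lam2 I)^{-1}.
        G = V @ V.T + lam2 * np.eye(k)
        U = np.linalg.solve(G, V @ D.T).T
    return U, V
```

The ℓ₁ penalty drives many entries of V to exactly zero (compact document representations), while the ℓ₂ penalty keeps the topic matrix U dense but well-conditioned, which matches the abstract's motivation for that particular choice of norms.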


    Published In

    SIGIR '11: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2011, 1374 pages
    ISBN: 9781450307574
    DOI: 10.1145/2009916
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. regularization
    2. sparse methods
    3. topic modeling


    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

