DOI: 10.1145/1148170.1148204
Article

LDA-based document models for ad-hoc retrieval

Published: 06 August 2006

Abstract

Search algorithms that incorporate some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 1960s and has recently produced good results in the language modeling framework. Latent Dirichlet Allocation (LDA), an approach to building topic models based on a formal generative model of documents, is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to use LDA efficiently to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework and evaluate it on several TREC collections. Gibbs sampling is used for approximate inference in LDA, and its computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
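
The abstract names the ingredients (an LDA-based document model, the language modeling framework, Gibbs sampling) but does not give formulas. The sketch below is one plausible reading under stated assumptions, not the paper's implementation: it runs a small collapsed Gibbs sampler for LDA, then scores a query by linearly interpolating a Dirichlet-smoothed document language model with the LDA document model. The function names, the interpolation form, and the values of `alpha`, `beta`, `mu`, and `lam` are all illustrative assumptions, and the toy corpus is invented.

```python
import math
import random
from collections import Counter


def gibbs_lda(docs, vocab_size, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Point estimates of P(topic | doc) and P(word | topic) from the final counts.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi


def query_log_likelihood(query, doc, coll_counts, coll_len, theta_d, phi,
                         mu=1000.0, lam=0.7):
    """Query log-likelihood under an assumed interpolated document model."""
    counts = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = coll_counts[w] / coll_len
        # Dirichlet-smoothed maximum-likelihood estimate.
        p_dir = (counts[w] + mu * p_coll) / (len(doc) + mu)
        # LDA estimate: sum over topics of P(w | topic) * P(topic | doc).
        p_lda = sum(t * phi_k[w] for t, phi_k in zip(theta_d, phi))
        score += math.log(lam * p_dir + (1.0 - lam) * p_lda)
    return score


# Invented toy corpus over a 4-word vocabulary (ids 0..3).
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
theta, phi = gibbs_lda(docs, vocab_size=4, num_topics=2, iters=50)
coll_counts = Counter(w for doc in docs for w in doc)
coll_len = sum(len(doc) for doc in docs)
scores = [
    query_log_likelihood([0, 1], doc, coll_counts, coll_len, theta[d], phi)
    for d, doc in enumerate(docs)
]
print(scores)
```

Sampling cost in this sketch grows with the number of iterations, the number of tokens in the collection, and the number of topics, since each token is resampled once per iteration against every topic.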


    Published In

    SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
    August 2006
    768 pages
    ISBN: 1595933697
    DOI: 10.1145/1148170
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. document model
    2. information retrieval
    3. language model
    4. latent Dirichlet allocation (LDA)
    5. topic model

    Qualifiers

    • Article

    Conference

    SIGIR06: The 29th Annual International SIGIR Conference
    August 6 - 11, 2006
    Seattle, Washington, USA

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


