DOI: 10.1145/1871437.1871519

Using the past to score the present: extending term weighting models through revision history analysis

Published: 26 October 2010

Abstract

    The generative process underlies many information retrieval models, notably statistical language models. Yet these models examine only one version of a document (the current one), effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and that this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency, a key indicator of document topic and relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results on standard retrieval tasks and benchmarks show that RHA provides consistent improvements for state-of-the-art retrieval models.
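
The idea of redefining term frequency from a revision history can be illustrated with a minimal sketch. The abstract does not give the RHA formulas, so the weighting scheme here (a decay over revision age), the function names `revision_weighted_tf` and `bm25_term_score`, and all parameter values are illustrative assumptions rather than the authors' method. The BM25 function uses the common textbook form of the scoring formula.

```python
import math


def revision_weighted_tf(revisions, term, decay=1.0):
    """Hypothetical revision-aware term frequency (not the paper's formula).

    revisions: list of token lists, oldest first, current version last.
    decay: 0.0 weights all revisions equally; larger values favor recent ones.
    """
    n = len(revisions)
    if n == 0:
        return 0.0
    weighted_sum = 0.0
    weight_total = 0.0
    for j, tokens in enumerate(revisions, start=1):
        # Assumed weighting: the current revision (j == n) gets weight 1,
        # and a revision k steps older gets 1 / (k + 1) ** decay.
        weight = 1.0 / (n - j + 1) ** decay
        weighted_sum += weight * tokens.count(term)
        weight_total += weight
    return weighted_sum / weight_total


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 term score; tf may be a fractional, revision-weighted value."""
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    # Toy revision history: "retrieval" persists across edits,
    # while "ranking" appears only in the current version.
    revisions = [
        ["retrieval", "model"],
        ["retrieval", "retrieval", "model"],
        ["retrieval", "retrieval", "retrieval", "ranking"],
    ]
    for term in ("retrieval", "ranking"):
        tf = revision_weighted_tf(revisions, term, decay=1.0)
        score = bm25_term_score(tf, doc_len=4, avg_doc_len=4.0,
                                doc_freq=10, num_docs=1000)
        print(term, round(tf, 3), round(score, 3))
```

In this sketch a term that survives many revisions keeps a weight close to its current count, while a term added only in the latest edit is discounted. Swapping the resulting value in for the raw term frequency leaves the rest of the BM25 computation unchanged.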


      Published In

      CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
      October 2010
      2036 pages
      ISBN:9781450300995
      DOI:10.1145/1871437

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. collaboratively generated content
      2. retrieval models
      3. term weighting

      Qualifiers

      • Research-article

      Conference

      CIKM '10

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
