
Using the past to score the present: extending term weighting models through revision history analysis

Authors: Eugene Agichtein, Evgeniy Gabrilovich

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 629 - 638

https://doi.org/10.1145/1871437.1871519

Published: 26 October 2010

Abstract

The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency - a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
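
As a rough illustration of the idea in the abstract, the sketch below blends a term's frequency in the current version of a document with its decay-weighted average frequency across the revision history, then feeds the blended value into a common BM25 term score. The abstract does not state the RHA formula, so the blending weight `lam`, the exponential `decay`, and the function names `revision_tf` and `bm25_term_score` are assumptions made for illustration only, not the authors' model.

```python
import math
from collections import Counter


def revision_tf(term, revisions, lam=0.5, decay=0.8):
    """Blend current-version TF with a decay-weighted mean TF over all revisions.

    `revisions` is a list of token lists ordered oldest to newest; the last
    entry is the current version. `lam` and `decay` are hypothetical knobs,
    not parameters taken from the paper.
    """
    if not revisions:
        return 0.0
    counts = [Counter(rev)[term] for rev in revisions]
    n = len(counts)
    # The newest revision gets weight 1; each older one is scaled by `decay`.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    history_tf = sum(w * c for w, c in zip(weights, counts)) / sum(weights)
    current_tf = counts[-1]
    return lam * current_tf + (1.0 - lam) * history_tf


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """BM25 score for one term (non-negative IDF variant); tf may be fractional."""
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / norm if norm > 0 else 0.0


if __name__ == "__main__":
    # Three toy revisions of one document, oldest first.
    revs = [
        ["search", "engine"],
        ["search", "search", "ranking"],
        ["search", "ranking", "ranking"],
    ]
    tf = revision_tf("search", revs)
    print(bm25_term_score(tf, doc_len=3, avg_doc_len=3.0, df=10, num_docs=1000))
```

With `lam=1.0` the sketch reduces to ordinary term frequency on the current version, so the history term acts purely as an add-on to an existing ranker. A term that appeared only in early revisions still receives a small non-zero weight, which is one way to read the claim that authoring history complements word occurrence statistics.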



Index Terms

Using the past to score the present: extending term weighting models through revision history analysis
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking



Published In

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

October 2010

2036 pages

ISBN: 9781450300995

DOI: 10.1145/1871437

General Chair: Jimmy Huang (York University, Canada)

Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

CIKM '10: International Conference on Information and Knowledge Management

October 26 - 30, 2010

Toronto, ON, Canada

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Cited By

Vasilisky Z, Kurland O, Tennenholtz M, Raiber F, Yoshioka M, Kiseleva J, Aliannejadi M (2023). Content-Based Relevance Estimation in Retrieval Settings with Ranking-Incentivized Document Manipulations. Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 205-214. Online publication date: 9-Aug-2023.
https://dl.acm.org/doi/10.1145/3578337.3605124
Sun N, Chen T, Ran L, Guo W (2022). Dynamic and Static Features-Aware Recommendation with Graph Neural Networks. Computational Intelligence and Neuroscience, 2022. Online publication date: 21-Apr-2022.
https://dl.acm.org/doi/10.1155/2022/5484119
Rizzo S, Brucato M, Montesi D (2022). Ranking Models for the Temporal Dimension of Text. ACM Transactions on Information Systems, 41:2, pp. 1-34. Online publication date: 21-Dec-2022.
https://dl.acm.org/doi/10.1145/3565481
Sarrafzadeh B, Jauhar S, Gamon M, Lank E, White R (2021). Characterizing Stage-aware Writing Assistance for Collaborative Document Authoring. Proceedings of the ACM on Human-Computer Interaction, 4:CSCW3, pp. 1-29. Online publication date: 5-Jan-2021.
https://dl.acm.org/doi/10.1145/3434180
Vasilisky Z, Tennenholtz M, Kurland O, Huang J, Chang Y, Cheng X, Kamps J, Murdock V, Wen J, Liu Y (2020). Studying Ranking-Incentivized Web Dynamics. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2093-2096. Online publication date: 25-Jul-2020.
https://dl.acm.org/doi/10.1145/3397271.3401300
Sharma D, Pamula R, Chauhan D (2019). Combined techniques based query expansion approach for document retrieval system. 2019 International Conference on contemporary Computing and Informatics (IC3I), pp. 101-105. Online publication date: Dec-2019.
https://doi.org/10.1109/IC3I46837.2019.9055709
Fafalios P, Kasturia V, Nejdl W, Chen J, Gonçalves M, Allen J, Fox E, Kan M, Petras V (2018). Ranking Archived Documents for Structured Queries on Semantic Layers. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 155-164. Online publication date: 23-May-2018.
https://dl.acm.org/doi/10.1145/3197026.3197049
Xu Z, Iwaihara M (2018). Fast Identification of Topic Burst Patterns Based on Temporal Clustering. 2018 7th International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 548-553. Online publication date: Jul-2018.
https://doi.org/10.1109/IIAI-AAI.2018.00117
Gupta Y, Saini A (2017). A novel Fuzzy-PSO term weighting automatic query expansion approach using combined semantic filtering. Knowledge-Based Systems, 136:C, pp. 97-120. Online publication date: 15-Nov-2017.
https://dl.acm.org/doi/10.1016/j.knosys.2017.09.004
Liu Y, Gao Z, Iwaihara M (2017). Identifying Evolutionary Topic Temporal Patterns Based on Bursty Phrase Clustering. Web and Big Data, pp. 276-284. Online publication date: 3-Aug-2017.
https://doi.org/10.1007/978-3-319-63564-4_22
