
Individual Judgments Versus Consensus: Estimating Query-URL Relevance

Published: 09 January 2016

Abstract

Query-URL relevance, which measures the relevance of each retrieved URL with respect to a given query, is one of the fundamental criteria for evaluating the performance of commercial search engines. The traditional way to collect reliable and accurate query-URL relevance requires multiple annotators to provide their individual judgments based on their subjective expertise (e.g., understanding of user intents). In this setting, the subjectivity reflected in each annotator individual judgment (AIJ) inevitably affects the quality of the ground truth relevance (GTR). To the best of our knowledge, however, the potential impact of AIJs on estimating GTRs has not been quantitatively studied or exploited in existing work. This article first studies how multiple AIJs and GTRs are correlated. Our empirical studies find that multiple AIJs can provide additional cues that improve the accuracy of estimating GTRs. Inspired by this finding, we then propose a novel approach to integrating the multiple AIJs with the features characterizing query-URL pairs for estimating GTRs more accurately. Furthermore, we conduct experiments in a commercial search engine—Baidu.com—and report significant gains in terms of normalized discounted cumulative gain.
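To make the two notions in the abstract concrete, the sketch below shows (a) a common consensus baseline — deriving a single GTR label from multiple AIJs by majority vote — and (b) the NDCG metric the paper reports. These are generic illustrative helpers, not the paper's proposed method; the function names and the tie-breaking rule are assumptions for illustration.

```python
import math
from collections import Counter

def consensus_label(aij_labels):
    """Majority vote over annotator individual judgments (AIJs).

    A common baseline for collapsing AIJs into one ground truth
    relevance (GTR) grade; ties here are broken toward the higher
    grade (an illustrative choice, not the paper's)."""
    counts = Counter(aij_labels)
    # Prefer the most frequent grade; on ties, the higher grade wins.
    grade, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return grade

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain for a ranked list of
    graded relevance labels (higher grade = more relevant),
    using the standard (2^rel - 1) / log2(rank + 1) gain form."""
    if k is None:
        k = len(relevances)

    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `consensus_label([2, 2, 1])` returns the majority grade 2, and `ndcg` returns 1.0 only when the retrieved URLs are already ranked in ideal (descending-relevance) order.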



    Published In

    ACM Transactions on the Web, Volume 10, Issue 1
    February 2016
    198 pages
    ISSN: 1559-1131
    EISSN: 1559-114X
    DOI: 10.1145/2870642

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 January 2016
    Accepted: 01 October 2015
    Revised: 01 September 2015
    Received: 01 May 2013

    Author Tags

    1. Web search
    2. performance evaluation
    3. relevance feedback

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Baidu research
    • JSPS Fellow Program
