
Individual Judgments Versus Consensus: Estimating Query-URL Relevance

Published: 09 January 2016

Abstract

Query-URL relevance, which measures the relevance of each retrieved URL with respect to a given query, is one of the fundamental criteria for evaluating the performance of commercial search engines. The traditional way to collect reliable and accurate query-URL relevance requires multiple annotators to provide their individual judgments based on their subjective expertise (e.g., understanding of user intents). In this setting, the subjectivity reflected in each annotator individual judgment (AIJ) inevitably affects the quality of the ground truth relevance (GTR). To the best of our knowledge, however, the potential impact of AIJs on estimating GTRs has not been quantitatively studied or exploited in existing work. This article first studies how multiple AIJs and GTRs are correlated. Our empirical studies find that multiple AIJs can provide additional cues that improve the accuracy of estimating GTRs. Inspired by this finding, we then propose a novel approach to integrating the multiple AIJs with the features characterizing query-URL pairs for estimating GTRs more accurately. Furthermore, we conduct experiments in a commercial search engine—Baidu.com—and report significant gains in terms of normalized discounted cumulative gain.
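To make the two notions in the abstract concrete, the sketch below shows (a) a common consensus baseline — deriving a single GTR label from multiple AIJs by majority vote — and (b) the NDCG metric the paper reports. These are generic illustrative helpers, not the paper's proposed method; the function names and the tie-breaking rule are assumptions for illustration.

```python
import math
from collections import Counter

def consensus_label(aij_labels):
    """Majority vote over annotator individual judgments (AIJs).

    A common baseline for collapsing AIJs into one ground truth
    relevance (GTR) grade; ties here are broken toward the higher
    grade (an illustrative choice, not the paper's)."""
    counts = Counter(aij_labels)
    # Prefer the most frequent grade; on ties, the higher grade wins.
    grade, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return grade

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain for a ranked list of
    graded relevance labels (higher grade = more relevant),
    using the standard (2^rel - 1) / log2(rank + 1) gain form."""
    if k is None:
        k = len(relevances)

    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `consensus_label([2, 2, 1])` returns the majority grade 2, and `ndcg` returns 1.0 only when the retrieved URLs are already ranked in ideal (descending-relevance) order.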



    Published In

    ACM Transactions on the Web, Volume 10, Issue 1
    February 2016
    198 pages
    ISSN: 1559-1131
    EISSN: 1559-114X
    DOI: 10.1145/2870642

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 January 2016
    Accepted: 01 October 2015
    Revised: 01 September 2015
    Received: 01 May 2013

    Author Tags

    1. Web search
    2. performance evaluation
    3. relevance feedback

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Baidu research
    • JSPS Fellow Program
