DOI: 10.1145/2063576.2063601

Query sampling for learning data fusion

Published: 24 October 2011

Abstract

Data fusion merges the results of multiple independent retrieval models into a single ranked list. Several earlier studies have shown that combining different models can yield better retrieval performance than any individual model alone. Although supervised fusion methods have produced many promising results, the sampling of training data has attracted little attention in previous work on data fusion. By examining evaluations on TREC and NTCIR datasets, we found that the performance of a model varied considerably from one training example to another, so not all training examples were equally effective. In this paper, we propose two novel approaches, a greedy approach and a boosting approach, which select effective training data by query sampling to improve the performance of supervised data fusion algorithms such as BayesFuse, probFuse, and MAPFuse. Extensive experiments were conducted on five datasets: TREC-3, TREC-4, TREC-5, NTCIR-3, and NTCIR-4. The results show that our sampling approaches can significantly improve the retrieval performance of these data fusion methods.
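The full algorithms are given in the paper itself; as a rough illustration of the greedy idea described in the abstract, the sketch below performs forward selection over candidate training queries, at each step keeping the query whose addition most improves a fusion model's score on a validation query set. The callables `train_fusion` and `evaluate` are hypothetical placeholders (e.g. fitting a probFuse-style model and scoring it with MAP), not the authors' actual interfaces.

```python
# A minimal sketch of greedy query sampling for supervised data fusion,
# based only on the abstract's description. Assumptions (hypothetical):
#   train_fusion(queries) -> a fusion model fitted on the given queries
#   evaluate(model, queries) -> mean effectiveness (e.g. MAP) on those queries
def greedy_query_sampling(candidate_queries, validation_queries,
                          train_fusion, evaluate, budget):
    """Greedily pick up to `budget` training queries, adding at each step
    the query whose inclusion most improves validation performance."""
    selected = []
    remaining = list(candidate_queries)
    best_score = float("-inf")

    while remaining and len(selected) < budget:
        step_best_query, step_best_score = None, best_score
        for q in remaining:
            model = train_fusion(selected + [q])
            score = evaluate(model, validation_queries)
            if score > step_best_score:
                step_best_query, step_best_score = q, score
        if step_best_query is None:
            break  # no remaining candidate improves the current selection
        selected.append(step_best_query)
        remaining.remove(step_best_query)
        best_score = step_best_score

    return selected
```

The sketch stops early once no candidate query improves validation performance, which reflects the abstract's observation that some training examples contribute little or even hurt the fused ranking.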



Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adafuse
  2. data fusion
  3. query sampling

Qualifiers

  • Research-article

Conference

CIKM '11

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
