Abstract
The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed only to those collections that are likely to contain relevant documents, so it is necessary first to obtain information about the content of the target collections. In an uncooperative environment, query probing — where randomly chosen queries are used to retrieve a sample of the documents, and thus of the lexicon — has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
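The query-probing process described above, together with an adaptive stopping criterion, can be sketched as follows. This is an illustrative simulation, not the authors' implementation: the collection is modeled as a list of term sets standing in for an uncooperative search interface, and the function names, parameters, and the 5% new-term threshold are hypothetical choices for the sketch. Sampling stops when probes cease to surface new vocabulary, rather than after a fixed 300 documents.

```python
import random


def query_based_sample(collection, docs_per_query=4,
                       min_new_term_rate=0.05, max_queries=200, seed=0):
    """Hypothetical sketch of query-based sampling with adaptive stopping.

    collection: list of documents, each a set of terms (a stand-in for a
    remote collection that can only be accessed by issuing queries).
    Returns the sampled documents and the vocabulary learned from them.
    """
    rng = random.Random(seed)
    # Seed the probe vocabulary with terms from one randomly chosen document.
    vocab = set(rng.choice(collection))
    sampled, seen_ids = [], set()
    for _ in range(max_queries):
        # Probe with a single term drawn from the vocabulary learned so far.
        term = rng.choice(sorted(vocab))
        # Simulated search engine: top-k documents containing the term.
        hits = [i for i, doc in enumerate(collection) if term in doc]
        hits = hits[:docs_per_query]
        new_terms = total_terms = 0
        for i in hits:
            if i in seen_ids:
                continue
            seen_ids.add(i)
            sampled.append(collection[i])
            for t in collection[i]:
                total_terms += 1
                if t not in vocab:
                    new_terms += 1
                    vocab.add(t)
        # Stop when the rate of previously unseen terms per probe plateaus.
        if total_terms and new_terms / total_terms < min_new_term_rate:
            break
    return sampled, vocab
```

Because the stopping rule is driven by observed vocabulary growth, the resulting sample size naturally varies with the lexical diversity of each collection, which is the behaviour the paper argues a fixed 300-document sample cannot capture.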
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Shokouhi, M., Scholer, F., Zobel, J. (2006). Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_7
DOI: https://doi.org/10.1007/11610113_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9