When one sample is not enough: improving text database selection using shrinkage

Authors:

Panagiotis G. Ipeirotis,

Luis Gravano

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Pages 767 - 778

https://doi.org/10.1145/1007568.1007655

Published: 13 June 2004

Abstract

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
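The core idea of shrinkage can be illustrated with a minimal sketch. It assumes each content summary is a simple word-to-probability map and that the mixture weights are fixed by hand; the abstract does not say how the weights are chosen, so the function name `shrink_summary`, the weights, and the toy data below are all hypothetical and not the authors' implementation.

```python
from typing import Dict, List

Summary = Dict[str, float]  # word -> estimated probability of the word


def shrink_summary(
    db_summary: Summary,
    category_summaries: List[Summary],
    weights: List[float],
) -> Summary:
    """Mix a database summary with the summaries of its ancestor categories.

    weights[0] applies to the database itself and weights[1:] to the
    categories, in the same order as category_summaries. The weights are
    hypothetical fixed values here; they must be non-negative and sum to 1.
    """
    if len(weights) != len(category_summaries) + 1:
        raise ValueError("need one weight per summary")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")

    components = [db_summary] + category_summaries
    # The shrunk summary covers every word seen in any component summary.
    vocabulary = set().union(*components)
    return {
        word: sum(w * s.get(word, 0.0) for w, s in zip(weights, components))
        for word in vocabulary
    }


# Toy data (made up for illustration): the sample missed "hypertension".
sampled = {"heart": 0.6, "blood": 0.4}
health_category = {"heart": 0.3, "blood": 0.3, "hypertension": 0.4}
root_category = {"heart": 0.1, "blood": 0.1, "hypertension": 0.1, "soccer": 0.7}

shrunk = shrink_summary(sampled, [health_category, root_category], [0.7, 0.2, 0.1])
```

In this toy setup the word "hypertension", absent from the sampled summary, receives a small nonzero probability from the category summaries, which is the coverage gain the abstract describes. The same mixing also gives weight to the unrelated word "soccer", which hints at why deciding per query whether to apply shrinkage can matter.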



Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

June 2004

988 pages

ISBN: 1581138598

DOI: 10.1145/1007568

Conference Chairs: Arnd Christian König (Microsoft Research), Stefan Dessloch (University of Kaiserslautern, Germany)

General Chair: Patrick Valduriez (INRIA, France)

Program Chair: Gerhard Weikum (University of the Saarland)
Copyright © 2004 ACM.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS04: International Conference on Management of Data and Symposium on Principles of Database Systems

Sponsor: SIGMOD

June 13 - 18, 2004

Paris, France

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


