DOI: 10.1145/1007568.1007655

When one sample is not enough: improving text database selection using shrinkage

Published: 13 June 2004

Abstract

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
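
To make the shrinkage idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a content summary stores, for each word, the fraction of sampled documents containing it, that the hierarchy has a single category level, and that the mixture weights (0.7 and 0.3) are fixed by hand. The function names `word_probs` and `shrink` and the toy documents are hypothetical, introduced only for illustration.

```python
from collections import Counter


def word_probs(sample_docs):
    """Estimate, for each word, the fraction of sampled documents containing it."""
    n = len(sample_docs)
    doc_freq = Counter(w for doc in sample_docs for w in set(doc.split()))
    return {w: count / n for w, count in doc_freq.items()}


def shrink(summaries, weights):
    """Mix several summaries into one.

    `summaries` lists the database's own summary first, followed by the
    summaries of its ancestor categories; `weights` are the corresponding
    mixture weights and must sum to 1 (assumed given here, not learned).
    """
    assert len(summaries) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = set().union(*summaries)
    return {
        w: sum(lam * s.get(w, 0.0) for lam, s in zip(weights, summaries))
        for w in vocab
    }


# Toy data (hypothetical): a small sample from one database, plus a sample
# from a topically related database in the same category.
db_sample = ["heart surgery recovery", "heart disease risk"]
sibling_sample = ["hypertension drug trial", "heart disease hypertension"]

unshrunk = word_probs(db_sample)
category = word_probs(db_sample + sibling_sample)
shrunk = shrink([unshrunk, category], [0.7, 0.3])

# "hypertension" never appeared in the database's own sample, yet the
# shrunk summary assigns it a nonzero estimate borrowed from the category.
print("hypertension" in unshrunk, shrunk["hypertension"])
```

In this toy setting the shrunk summary covers every word seen anywhere in the category, which mirrors the coverage gain the abstract describes. How to choose the weights, and when shrinkage should be skipped for a given query, are exactly the questions the paper addresses and are not modeled here.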

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
June 2004
988 pages
ISBN: 1581138598
DOI: 10.1145/1007568

Publisher

Association for Computing Machinery
New York, NY, United States

Conference

SIGMOD/PODS04

Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions, 20%
