research-article

Google's Deep Web crawl

Published: 01 August 2008

Abstract

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages to a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.
Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
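
To make the third challenge concrete, the following is a minimal sketch of why enumerating every input combination scales poorly, and of what pre-computing submissions for a chosen subset of inputs could look like. It is an illustration only, not the algorithm from the paper: the form action URL, the input names and values, and the helpers `cartesian_size` and `surface_urls` are all hypothetical, and the sketch assumes a GET form that tolerates missing parameters.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical form: action URL and candidate values per input (not from the paper).
FORM_ACTION = "http://example.com/search"
FORM_INPUTS = {
    "make": ["honda", "toyota", "ford"],
    "state": ["CA", "NY", "WA", "TX"],
    "max_price": ["5000", "10000", "20000"],
}


def cartesian_size(inputs):
    """Count the submissions produced if every input is bound to every value."""
    return prod(len(values) for values in inputs.values())


def surface_urls(action, inputs, bound):
    """Yield GET URLs that bind only the inputs named in `bound`.

    Unbound inputs are left out of the query string, which assumes the
    form accepts submissions with missing parameters.
    """
    names = sorted(bound)  # fixed order so the generated URLs are deterministic
    for combo in product(*(inputs[name] for name in names)):
        yield f"{action}?{urlencode(dict(zip(names, combo)))}"


if __name__ == "__main__":
    print(cartesian_size(FORM_INPUTS))  # 3 * 4 * 3 = 36 full submissions
    for url in surface_urls(FORM_ACTION, FORM_INPUTS, {"make"}):
        print(url)
```

Binding only `make` yields 3 URLs instead of 36, and the gap widens multiplicatively with each extra input. Deciding which subsets of inputs are worth binding is the part the paper's algorithm addresses; this sketch leaves that choice to the caller.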

Published In

Proceedings of the VLDB Endowment  Volume 1, Issue 2
August 2008
461 pages

Publisher

VLDB Endowment

Cited By

  • (2024) Data distribution tailoring revisited: cost-efficient integration of representative data. The VLDB Journal — The International Journal on Very Large Data Bases, 33:5 (1283-1306). DOI: 10.1007/s00778-024-00849-w. Online publication date: 1-Sep-2024
  • (2023) Automated Selection of Web Form Text Field Values Based on Bayesian Inferences. International Journal of Information Retrieval Research, 13:1 (1-13). DOI: 10.4018/IJIRR.318399. Online publication date: 16-Feb-2023
  • (2023) Effective Entity Augmentation by Querying External Data Sources. Proceedings of the VLDB Endowment, 16:11 (3404-3417). DOI: 10.14778/3611479.3611535. Online publication date: 24-Aug-2023
  • (2023) Synthesis of multilevel knowledge graphs. Engineering Applications of Artificial Intelligence, 123:PA. DOI: 10.1016/j.engappai.2023.106244. Online publication date: 1-Aug-2023
  • (2022) Characterizing "permanently dead" links on Wikipedia. Proceedings of the 22nd ACM Internet Measurement Conference (388-394). DOI: 10.1145/3517745.3561451. Online publication date: 25-Oct-2022
  • (2022) DuMapper: Towards Automatic Verification of Large-Scale POIs with Street Views at Baidu Maps. Proceedings of the 31st ACM International Conference on Information & Knowledge Management (3063-3071). DOI: 10.1145/3511808.3557097. Online publication date: 17-Oct-2022
  • (2021) Tailoring data source distributions for fairness-aware data integration. Proceedings of the VLDB Endowment, 14:11 (2519-2532). DOI: 10.14778/3476249.3476299. Online publication date: 27-Oct-2021
  • (2021) A third-party replication service for dynamic hidden databases. Service Oriented Computing and Applications, 15:4 (323-338). DOI: 10.1007/s11761-020-00313-x. Online publication date: 1-Dec-2021
  • (2020) Generic schema matching, ten years later. Proceedings of the VLDB Endowment, 4:11 (695-701). DOI: 10.14778/3402707.3402710. Online publication date: 3-Jun-2020
  • (2019) Best practices for publishing, retrieving, and using spatial data on the web. Semantic Web, 10:1 (95-114). DOI: 10.3233/SW-180305. Online publication date: 1-Jan-2019
