research-article

Google's Deep Web crawl

Authors: Jayant Madhavan, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy

Proceedings of the VLDB Endowment, Volume 1, Issue 2

Pages 1241 - 1252

https://doi.org/10.14778/1454159.1454163

Published: 01 August 2008

Abstract

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.

Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
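
To make the third challenge concrete, the following is a minimal sketch of the naive strategy the abstract warns against, not the algorithm the paper proposes. It enumerates the full Cartesian product of candidate values for a GET form and shows how the URL count is the product of the per-input value counts. The form, its input names, its values, and the `example.com` action URL are all hypothetical, as are the helper names `count_submissions` and `enumerate_submissions`.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode


def count_submissions(form_inputs):
    """Number of URLs a naive Cartesian enumeration would generate."""
    return prod(len(values) for values in form_inputs.values())


def enumerate_submissions(action_url, form_inputs):
    """Yield one GET URL per combination of input values (naive strategy)."""
    names = list(form_inputs)
    for combo in product(*(form_inputs[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical used-car search form: 3 * 2 * 4 = 24 combinations.
form = {
    "make": ["honda", "toyota", "ford"],
    "state": ["CA", "NY"],
    "max_price": ["5000", "10000", "20000", "50000"],
}

urls = list(enumerate_submissions("http://example.com/search", form))
print(count_submissions(form))  # 24
print(urls[0])  # http://example.com/search?make=honda&state=CA&max_price=5000
```

Even this small form yields 24 URLs. Each additional input multiplies the count by its number of values, which is why enumerating everything does not scale to forms with many inputs.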

Index Terms

Google's Deep Web crawl
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
  2. Information systems applications
    1. Data mining

Published In

Proceedings of the VLDB Endowment Volume 1, Issue 2

August 2008

461 pages

ISSN:2150-8097

Publisher

VLDB Endowment
