
Google's Deep Web crawl

Published: 01 August 2008
    Abstract

    The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.
    Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
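
To make the third challenge concrete, the following is a minimal Python sketch of the naive strategy the abstract warns against: pre-computing one GET submission URL for every combination of candidate input values. It is not the paper's algorithm, whose details the abstract does not give. The form, its input names, the candidate values, the `example.com` action URL, and the helper names `naive_submission_urls` and `naive_url_count` are all hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode


def naive_submission_urls(action, candidate_values):
    """Yield one GET URL per combination of input values (naive enumeration)."""
    names = list(candidate_values)
    for combo in product(*(candidate_values[name] for name in names)):
        yield f"{action}?{urlencode(dict(zip(names, combo)))}"


def naive_url_count(candidate_values):
    """Size of the Cartesian product, computed without materializing it."""
    return prod(len(values) for values in candidate_values.values())


# Hypothetical form: two select menus and one keyword text box.
form = {
    "make": ["honda", "ford", "toyota"],
    "zip": ["94043", "10001"],
    "q": ["hybrid", "sedan"],
}

urls = list(naive_submission_urls("http://example.com/search", form))
print(len(urls))  # 12
print(urls[0])    # http://example.com/search?make=honda&zip=94043&q=hybrid
```

Even this toy form yields 3 x 2 x 2 = 12 URLs. A form with four inputs of 50 candidate values each would yield 50^4 = 6,250,000, which suggests why exhaustive enumeration does not scale to millions of forms and why a selective search over input combinations is needed.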

    Published In

    Proceedings of the VLDB Endowment  Volume 1, Issue 2
    August 2008
    461 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 August 2008
    Published in PVLDB Volume 1, Issue 2

    Qualifiers

    • Research-article

    Cited By

    • (2023) Automated Selection of Web Form Text Field Values Based on Bayesian Inferences. International Journal of Information Retrieval Research, 13:1 (1-13). DOI: 10.4018/IJIRR.318399. Online publication date: 16-Feb-2023
    • (2023) Effective Entity Augmentation by Querying External Data Sources. Proceedings of the VLDB Endowment, 16:11 (3404-3417). DOI: 10.14778/3611479.3611535. Online publication date: 24-Aug-2023
    • (2023) Synthesis of multilevel knowledge graphs. Engineering Applications of Artificial Intelligence, 123:PA. DOI: 10.1016/j.engappai.2023.106244. Online publication date: 1-Aug-2023
    • (2022) Characterizing "permanently dead" links on Wikipedia. Proceedings of the 22nd ACM Internet Measurement Conference (388-394). DOI: 10.1145/3517745.3561451. Online publication date: 25-Oct-2022
    • (2022) DuMapper: Towards Automatic Verification of Large-Scale POIs with Street Views at Baidu Maps. Proceedings of the 31st ACM International Conference on Information & Knowledge Management (3063-3071). DOI: 10.1145/3511808.3557097. Online publication date: 17-Oct-2022
    • (2021) Tailoring data source distributions for fairness-aware data integration. Proceedings of the VLDB Endowment, 14:11 (2519-2532). DOI: 10.14778/3476249.3476299. Online publication date: 27-Oct-2021
    • (2021) A third-party replication service for dynamic hidden databases. Service Oriented Computing and Applications, 15:4 (323-338). DOI: 10.1007/s11761-020-00313-x. Online publication date: 1-Dec-2021
    • (2020) Generic schema matching, ten years later. Proceedings of the VLDB Endowment, 4:11 (695-701). DOI: 10.14778/3402707.3402710. Online publication date: 3-Jun-2020
    • (2019) Best practices for publishing, retrieving, and using spatial data on the web. Semantic Web, 10:1 (95-114). DOI: 10.3233/SW-180305. Online publication date: 1-Jan-2019
    • (2019) Google Dataset Search: Building a search engine for datasets in an open Web ecosystem. The World Wide Web Conference (1365-1375). DOI: 10.1145/3308558.3313685. Online publication date: 13-May-2019
