research-article

WebTables: exploring the power of tables on the web

Published: 01 August 2008

    Abstract

    The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude.
    We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus?
    First, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. Second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB) that records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer to choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically-generated join links.
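
To make the idea of corpus-wide co-occurrence statistics concrete, the following is a minimal sketch and not the WEBTABLES implementation. It assumes each extracted table is reduced to a list of attribute labels, counts how often attributes and attribute pairs appear across schemas, and uses those counts for a greedy form of schema auto-complete. The function names (`build_acsdb`, `cond_prob`, `autocomplete`), the pairwise approximation, the `threshold` parameter, and the toy schemas are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def build_acsdb(schemas):
    """Count, over all schemas, how many contain each attribute and each pair."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        # Normalize labels and drop duplicates within one schema.
        attrs = sorted({a.strip().lower() for a in schema})
        attr_counts.update(attrs)
        # Pairs come out in sorted order, so (a, b) always has a < b.
        pair_counts.update(combinations(attrs, 2))
    return attr_counts, pair_counts


def cond_prob(attr, given, attr_counts, pair_counts):
    """Estimate p(attr | given) as count(attr and given) / count(given)."""
    if attr_counts[given] == 0:
        return 0.0
    pair = tuple(sorted((attr, given)))
    return pair_counts[pair] / attr_counts[given]


def autocomplete(seed, attr_counts, pair_counts, threshold=0.5):
    """Greedily extend a seed schema with strongly co-occurring attributes.

    Assumption: a candidate is scored by its weakest pairwise conditional
    probability against the attributes chosen so far.
    """
    chosen = [a.strip().lower() for a in seed]
    if not chosen:
        return []

    def score(candidate):
        return min(cond_prob(candidate, c, attr_counts, pair_counts)
                   for c in chosen)

    while True:
        candidates = sorted(a for a in attr_counts if a not in chosen)
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) < threshold:
            break
        chosen.append(best)
    return chosen


# Hypothetical toy corpus of five extracted schemas.
schemas = [
    ["make", "model", "year", "price"],
    ["make", "model", "year"],
    ["make", "model", "color"],
    ["name", "address", "phone"],
    ["name", "phone", "email"],
]
attr_counts, pair_counts = build_acsdb(schemas)
print(autocomplete(["make"], attr_counts, pair_counts))
# prints ['make', 'model', 'year']
```

In this sketch the threshold controls how aggressively a schema grows: lowering it admits rarer attributes such as `price`, while raising it keeps only attributes that almost always accompany the seed.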



    Information

    Published in: Proceedings of the VLDB Endowment (PVLDB), Volume 1, Issue 1, August 2008, 1216 pages
    Publisher: VLDB Endowment
    Published: 01 August 2008
    Qualifier: Research-article


