Schema extraction for tabular data on the web

Published: 01 April 2013

Abstract

    Tabular data is an abundant source of information on the Web, but remains mostly isolated from the latter's interconnections since tables lack links and computer-accessible descriptions of their structure. In other words, the schemas of these tables -- attribute names, values, data types, etc. -- are not explicitly stored as table metadata. Consequently, the structure that these tables contain is not accessible to the crawlers that power search engines and thus not accessible to user search queries. We address this lack of structure with a new method for leveraging the principles of table construction in order to extract table schemas. Discovering the schema by which a table is constructed is achieved by harnessing the similarities and differences of nearby table rows through the use of a novel set of features and a feature processing scheme. The schemas of these data tables are determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which is specifically designed for the data table extraction task. Our method provides considerable improvement over the well-known WebTables schema extraction method. In contrast with previous work that focuses on extracting individual relations, our method excels at correctly interpreting full tables, thereby being capable of handling general tables such as those found in spreadsheets, instead of being restricted to HTML tables as is the case with the WebTables method. We also extract additional schema characteristics, such as row groupings, which are important for supporting information retrieval tasks on tabular data.
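
The abstract names logarithmic binning as its feature encoding but does not define it here. As a rough illustration only, the sketch below assumes one plausible reading: a count of cells in a row that share some property is mapped to a bin whose width doubles as the count grows, so that rows of different lengths yield comparable discrete features for a sequence classifier such as a conditional random field. The function names `log_bin` and `encode_row_feature`, and the "matching/non-matching" label format, are hypothetical and are not taken from the paper.

```python
def log_bin(count: int) -> int:
    """Map a non-negative count to a logarithmic bin index.

    Assumed scheme (not from the abstract): 0 -> 0, 1 -> 1, 2-3 -> 2,
    4-7 -> 3, 8-15 -> 4, ... i.e. floor(log2(count)) + 1 for count > 0.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # int.bit_length() equals floor(log2(count)) + 1 for positive integers
    # and 0 for zero, which avoids floating-point log errors.
    return count.bit_length()


def encode_row_feature(matching_cells: int, total_cells: int) -> str:
    """Encode 'matching_cells out of total_cells have a property' as a label.

    The label pairs the bin of the matching count with the bin of the
    non-matching count (a hypothetical format chosen for this sketch).
    """
    if not 0 <= matching_cells <= total_cells:
        raise ValueError("matching_cells must lie between 0 and total_cells")
    non_matching = total_cells - matching_cells
    return f"{log_bin(matching_cells)}/{log_bin(non_matching)}"


if __name__ == "__main__":
    # Rows with 6 of 7 and 7 of 8 numeric cells fall into the same bins.
    print(encode_row_feature(6, 7))
    print(encode_row_feature(7, 8))
    # A row with no numeric cells gets a clearly different label.
    print(encode_row_feature(0, 5))
```

Under this assumed scheme, nearly-all-numeric rows of slightly different widths share a label, while a header-like row with no numeric cells does not, which is the kind of row-to-row contrast the abstract says the method exploits.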

        Published In

        Proceedings of the VLDB Endowment, Volume 6, Issue 6
        April 2013
        144 pages

        Publisher

        VLDB Endowment

        Cited By

        • (2023) Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning. Proceedings of the VLDB Endowment 16:7 (1726-1739). DOI: 10.14778/3587136.3587146. Online publication date: 8-May-2023
        • (2023) SANTOS: Relationship-based Semantic Table Union Search. Proceedings of the ACM on Management of Data 1:1 (1-25). DOI: 10.1145/3588689. Online publication date: 30-May-2023
        • (2023) Olio: A Semantic Search Interface for Data Repositories. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (1-16). DOI: 10.1145/3586183.3606806. Online publication date: 29-Oct-2023
        • (2023) MORPHER: Structural Transformation of Ill-formed Rows. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (5051-5055). DOI: 10.1145/3583780.3614747. Online publication date: 21-Oct-2023
        • (2022) WebFormer: The Web-page Transformer for Structure Information Extraction. Proceedings of the ACM Web Conference 2022 (3124-3133). DOI: 10.1145/3485447.3512032. Online publication date: 25-Apr-2022
        • (2022) Extracting data models from background knowledge graphs. Knowledge-Based Systems 237:C. DOI: 10.1016/j.knosys.2021.107818. Online publication date: 15-Feb-2022
        • (2022) Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems 71:C (123-136). DOI: 10.1016/j.is.2017.08.004. Online publication date: 13-Apr-2022
        • (2022) Automatic Machine Learning-Based OLAP Measure Detection for Tabular Data. Big Data Analytics and Knowledge Discovery (173-188). DOI: 10.1007/978-3-031-12670-3_15. Online publication date: 22-Aug-2022
        • (2021) Semantic table structure identification in spreadsheets. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (283-295). DOI: 10.1145/3460319.3464812. Online publication date: 11-Jul-2021
        • (2021) From Tables to Knowledge. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (4060-4061). DOI: 10.1145/3447548.3470809. Online publication date: 14-Aug-2021
