DOI: 10.1145/2723372.2723725
research-article

TEGRA: Table Extraction by Global Record Alignment

Published: 27 May 2015

Abstract

Web pages are known to contain a large number of content-rich relational tables. Several efforts have systematically extracted such tables to power important applications such as table search and schema discovery. However, a significant fraction of relational tables are not embedded in standard HTML table tags and are thus difficult to extract. In particular, many relational tables are known to appear in "list" form: a list of clearly separated rows that are not separated into columns.
In this work, we address the important problem of automatically extracting multi-column relational tables from such lists. Our key intuition is the simple observation that in correctly extracted tables, values in the same column are coherent at both a syntactic and a semantic level. Using a background corpus of over 100 million tables crawled from the Web, we quantify semantic coherence with a statistical measure of how often values co-occur in the same column in the corpus. We then model table extraction as a principled optimization problem: we allocate the tokens in each row sequentially to a fixed number of columns such that the sum of coherence across all pairs of values in the same column is maximized. Borrowing ideas from A* search and metric distance, we develop an efficient 2-approximation algorithm. We conduct large-scale table extraction experiments using both real Web data and proprietary enterprise spreadsheet data. Our approach considerably outperforms state-of-the-art approaches in terms of quality, achieving over 90% F-measure in many cases.
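
To make the optimization objective concrete, the following is a minimal sketch, not the paper's algorithm. It enumerates every way to split each row into a fixed number of contiguous, non-empty cells and keeps the combination with the highest sum of pairwise same-column coherence. The exhaustive search stands in for the A*-based 2-approximation, which the abstract does not spell out. The tiny `CORPUS_COLUMNS` background corpus, the `coherence` scoring rule, and all function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations, product

# Hypothetical stand-in for the background corpus: each set is one column
# observed in some previously extracted table.
CORPUS_COLUMNS = [
    {"Los Angeles", "Boston", "New York", "Chicago"},
    {"California", "Massachusetts", "New York", "Illinois"},
]

def is_number(s):
    """Return True if the whole cell parses as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False

def coherence(u, v):
    """Toy coherence (an assumption, not the paper's measure).

    Syntactic part: two numeric cells are coherent.
    Semantic part: count the corpus columns that contain both values.
    """
    if is_number(u) and is_number(v):
        return 1.0
    return float(sum(u in col and v in col for col in CORPUS_COLUMNS))

def segmentations(tokens, m):
    """Yield every split of `tokens` into m contiguous, non-empty cells."""
    n = len(tokens)
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0, *cuts, n)
        yield tuple(" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:]))

def table_score(table):
    """Sum coherence over all pairs of rows, column by column."""
    return sum(
        coherence(r1[j], r2[j])
        for r1, r2 in combinations(table, 2)
        for j in range(len(r1))
    )

def extract_table(lines, m):
    """Exhaustively pick one segmentation per row to maximize table_score."""
    candidates = [list(segmentations(line.split(), m)) for line in lines]
    best = max(product(*candidates), key=table_score)
    return list(best), table_score(best)

LINES = [
    "Los Angeles California 3.9",
    "Boston Massachusetts 0.6",
    "New York New York 8.3",
]

table, score = extract_table(LINES, 3)
for row in table:
    print(row)
print("score:", score)
```

In this toy input, the row "New York New York 8.3" is ambiguous when viewed alone; only the coherence with cells in the other rows settles where the column boundaries fall. The search space grows exponentially with the number of rows, which is why an efficient approximation is needed at scale.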

    Published In

    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN:9781450327589
    DOI:10.1145/2723372

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. html lists
    2. information extraction
    3. table extraction
    4. web tables

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, Victoria, Australia

    Acceptance Rates

    SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024
    • (2022) Web Record Extraction with Invariants. Proceedings of the VLDB Endowment 16:4 (959-972). DOI: 10.14778/3574245.3574276. Online publication date: 1-Dec-2022
    • (2022) On extracting data from tables that are encoded using HTML. Knowledge-Based Systems 190:C. DOI: 10.1016/j.knosys.2019.105157. Online publication date: 22-Apr-2022
    • (2022) TOMATE. Information Sciences: an International Journal 577:C (49-68). DOI: 10.1016/j.ins.2021.04.087. Online publication date: 22-Apr-2022
    • (2022) A coral-reef approach to extract information from HTML tables. Applied Soft Computing 115:C. DOI: 10.1016/j.asoc.2021.107980. Online publication date: 6-May-2022
    • (2021) A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering 33:4 (1328-1347). DOI: 10.1109/TKDE.2019.2946162. Online publication date: 1-Apr-2021
    • (2020) TULIP: A Five-Star Table and List - From Machine-Readable to Machine-Understandable Systems. Linked Open Data - Applications, Trends and Future Developments. DOI: 10.5772/intechopen.91406. Online publication date: 19-Nov-2020
    • (2020) Pytheas. Proceedings of the VLDB Endowment 13:12 (2075-2089). DOI: 10.14778/3407790.3407810. Online publication date: 14-Sep-2020
    • (2020) Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1539-1554). DOI: 10.1145/3318464.3389738. Online publication date: 11-Jun-2020
    • (2020) DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence 50:2 (271-295). DOI: 10.1007/s10489-019-01499-0. Online publication date: 1-Feb-2020
