research-article

TEGRA: Table Extraction by Global Record Alignment

Authors:

Kaushik Chakrabarti,

Kris Ganjam

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1713 - 1728

https://doi.org/10.1145/2723372.2723725

Published: 27 May 2015

Abstract

It is well known today that pages on the Web contain a large number of content-rich relational tables. Such tables have been systematically extracted in a number of efforts to empower important applications such as table search and schema discovery. However, a significant fraction of relational tables are not embedded in the standard HTML table tags, and are thus difficult to extract. In particular, a large number of relational tables are known to be in a "list" form, which contains a list of clearly separated rows that are not separated into columns.

In this work, we address the important problem of automatically extracting multi-column relational tables from such lists. Our key intuition lies in the simple observation that in correctly-extracted tables, values in the same column are coherent, both at a syntactic and at a semantic level. Using a background corpus of over 100 million tables crawled from the Web, we quantify semantic coherence based on a statistical measure of value co-occurrence in the same column from the corpus. We then model table extraction as a principled optimization problem: we allocate tokens in each row sequentially to a fixed number of columns, such that the sum of coherence across all pairs of values in the same column is maximized. Borrowing ideas from $A^\star$ search and metric distance, we develop an efficient 2-approximation algorithm. We conduct large-scale table extraction experiments using both real Web data and proprietary enterprise spreadsheet data. Our approach considerably outperforms the state-of-the-art approaches in terms of quality, achieving over 90% F-measure across many cases.
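The optimization objective described above can be illustrated with a small sketch. This is not the paper's algorithm: it is an exhaustive search over a toy input, and every name in it (`CORPUS_COLUMNS`, `coherence`, `segmentations`, `table_score`, `extract_table`) is hypothetical. The coherence score is assumed here to be a Jaccard-style co-occurrence ratio over a tiny hand-made corpus, standing in for the statistical measure the abstract mentions, and each cell is assumed to be non-empty.

```python
from itertools import combinations, product

# Toy stand-in for the background corpus: each set is one column
# taken from some table (hypothetical data, for illustration only).
CORPUS_COLUMNS = [
    {"Boston", "New York", "Seattle"},
    {"Boston", "Seattle", "Chicago"},
    {"New York", "Chicago"},
    {"MA", "NY", "WA"},
    {"MA", "WA", "IL"},
    {"NY", "IL"},
]


def coherence(a, b):
    """Assumed measure: share of corpus columns containing a or b that contain both."""
    both = sum(1 for col in CORPUS_COLUMNS if a in col and b in col)
    either = sum(1 for col in CORPUS_COLUMNS if a in col or b in col)
    return both / either if either else 0.0


def segmentations(tokens, k):
    """Yield every split of tokens into k contiguous, non-empty cells."""
    n = len(tokens)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield tuple(" ".join(tokens[i:j]) for i, j in zip(bounds, bounds[1:]))


def table_score(rows):
    """Sum of coherence over all pairs of values that share a column."""
    return sum(
        coherence(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


def extract_table(lines, k):
    """Brute force: choose one segmentation per row so that table_score is maximal."""
    candidates = [list(segmentations(line.split(), k)) for line in lines]
    return max(product(*candidates), key=table_score)


if __name__ == "__main__":
    for row in extract_table(["New York NY", "Boston MA", "Seattle WA"], 2):
        print(row)
```

The search space of this sketch grows exponentially with the number of rows, which is why it is only usable on toy inputs. The abstract's 2-approximation algorithm, built on ideas from $A^\star$ search and metric distance, addresses that cost.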

Index Terms

TEGRA: Table Extraction by Global Record Alignment
1. Information systems
  1. Information systems applications
    1. Data mining

Information & Contributors

Information

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)
Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Total Citations: 22
Total Downloads: 448
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 5

Reflects downloads up to 16 Oct 2024

Citations

Cited By

Li P, He Y, Yashar D, Cui W, Ge S, Zhang H, Rifinski Fainman D, Zhang D, Chaudhuri S (2024). Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654979
Chen Z, Meng W, Dragut E (2022). Web Record Extraction with Invariants. Proceedings of the VLDB Endowment 16:4 (959-972). DOI: 10.14778/3574245.3574276. Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.14778/3574245.3574276
Roldán J, Jiménez P, Corchuelo R (2022). On extracting data from tables that are encoded using HTML. Knowledge-Based Systems 190:C. DOI: 10.1016/j.knosys.2019.105157. Online publication date: 22-Apr-2022
https://dl.acm.org/doi/10.1016/j.knosys.2019.105157
Roldán J, Jiménez P, Szekely P, Corchuelo R (2022). TOMATE. Information Sciences: an International Journal 577:C (49-68). DOI: 10.1016/j.ins.2021.04.087. Online publication date: 22-Apr-2022
https://dl.acm.org/doi/10.1016/j.ins.2021.04.087
Jiménez P, Roldán J, Corchuelo R (2022). A coral-reef approach to extract information from HTML tables. Applied Soft Computing 115:C. DOI: 10.1016/j.asoc.2021.107980. Online publication date: 6-May-2022
https://dl.acm.org/doi/10.1016/j.asoc.2021.107980
Roh Y, Heo G, Whang S (2021). A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering 33:4 (1328-1347). DOI: 10.1109/TKDE.2019.2946162. Online publication date: 1-Apr-2021
https://doi.org/10.1109/TKDE.2019.2946162
Nandakwang J, Chongstitvatana P (2020). TULIP: A Five-Star Table and List - From Machine-Readable to Machine-Understandable Systems. Linked Open Data - Applications, Trends and Future Developments. DOI: 10.5772/intechopen.91406. Online publication date: 19-Nov-2020
https://doi.org/10.5772/intechopen.91406
Christodoulakis C, Munson E, Gabel M, Brown A, Miller R (2020). Pytheas. Proceedings of the VLDB Endowment 13:12 (2075-2089). DOI: 10.14778/3407790.3407810. Online publication date: 14-Sep-2020
https://dl.acm.org/doi/10.14778/3407790.3407810
Yan C, He Y, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1539-1554). DOI: 10.1145/3318464.3389738. Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3389738
Yuliana O, Chang C (2020). DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence 50:2 (271-295). DOI: 10.1007/s10489-019-01499-0. Online publication date: 1-Feb-2020
https://dl.acm.org/doi/10.1007/s10489-019-01499-0
