research-article

TOMATE: A heuristic-based approach to extract data from HTML tables

Authors:

Juan C. Roldán,

Patricia Jiménez,

Rafael Corchuelo

Volume 577, Issue C

Pages 49 - 68

https://doi.org/10.1016/j.ins.2021.04.087

Published: 01 October 2021

Abstract

Extracting data from user-friendly HTML tables is difficult because of their different layouts, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution concerns functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to tell the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables, and our results confirm that our proposal can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table. We compared our proposal with several competitors, and the statistical analysis confirmed its superiority in terms of effectiveness, while it remains very competitive in terms of efficiency.
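
The functional-analysis idea in the abstract (map each cell to a feature vector, then cluster to separate meta-data cells from data cells) can be illustrated with a minimal sketch. This is not the authors' implementation: the three features, the two-cluster k-means with a fixed initialisation, and the rule that the cluster containing the top-left cell is the meta-data cluster are all assumptions made for illustration, and the names `CellCollector`, `features`, `two_means`, and `split_cells` are hypothetical.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects (row, col, tag, text) for every <th>/<td> cell it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.row, self.col = -1, 0
        self.tag, self.text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row, self.col = self.row + 1, 0
        elif tag in ("th", "td"):
            self.tag, self.text = tag, []

    def handle_data(self, data):
        if self.tag is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.tag is not None:
            text = "".join(self.text).strip()
            self.cells.append((self.row, self.col, self.tag, text))
            self.col += 1
            self.tag = None


def features(cell):
    """Toy feature vector (assumed, not taken from the paper)."""
    row, _col, tag, text = cell
    digits = sum(ch.isdigit() for ch in text)
    return [
        1.0 if tag == "th" else 0.0,          # marked up as a header cell
        digits / len(text) if text else 0.0,  # share of digit characters
        1.0 if row == 0 else 0.0,             # located in the first row
    ]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=20):
    """Plain 2-means, seeded with the first and last points (an assumption)."""
    centroids = [list(points[0]), list(points[-1])]
    labels = [0] * len(points)
    for _ in range(iterations):
        new_labels = [
            min((0, 1), key=lambda k: sq_dist(p, centroids[k])) for p in points
        ]
        for k in (0, 1):
            members = [p for p, l in zip(points, new_labels) if l == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(v) / len(members) for v in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def split_cells(html):
    """Returns (meta_data_texts, data_texts) for the cells found in html."""
    parser = CellCollector()
    parser.feed(html)
    cells = parser.cells
    if not cells:
        return [], []
    labels = two_means([features(c) for c in cells])
    meta_label = labels[0]  # assumption: the top-left cell is meta-data
    meta = [c[3] for c, l in zip(cells, labels) if l == meta_label]
    data = [c[3] for c, l in zip(cells, labels) if l != meta_label]
    return meta, data


SAMPLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Seville</td><td>684234</td></tr>
  <tr><td>Madrid</td><td>3305408</td></tr>
</table>
"""

meta, data = split_cells(SAMPLE)
print(meta)  # ['City', 'Population']
print(data)  # ['Seville', '684234', 'Madrid', '3305408']
```

The sketch covers only the clustering step on a clean, single table; the pre-processing and post-processing heuristics mentioned in the abstract, as well as spanning cells and nested tables, are outside its scope.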

Cited By

Jiménez P., Corchuelo R. (2022). On validating web information extraction proposals. Expert Systems with Applications: An International Journal, 199:C. DOI: 10.1016/j.eswa.2022.116700. Online publication date: 23-May-2022
https://dl.acm.org/doi/10.1016/j.eswa.2022.116700

Published In

Information Sciences: an International Journal Volume 577, Issue C

Oct 2021

902 pages

ISSN:0020-0255

Elsevier Inc.

Publisher

Elsevier Science Inc.

United States

