research-article

TOMATE: A heuristic-based approach to extract data from HTML tables

Published: 01 October 2021

Abstract

Extracting data from user-friendly HTML tables is difficult because of their varied layouts, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution concerns functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to set the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables, and our results confirm that our proposal can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table. We compared our proposal with several competitors, and the statistical analysis confirmed its superiority in terms of effectiveness, while it remains very competitive in terms of efficiency.
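
The functional-analysis idea in the abstract (map each cell to a feature vector, then cluster to separate meta-data cells from data cells) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the three features (header-tag flag, digit ratio, first-row flag), the plain 2-means clustering, the rule that the cluster containing the top-left cell is the meta-data cluster, and the names CellCollector, features, two_means, and split_cells. The actual feature space and clustering technique used by TOMATE are not described in this abstract.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects row index, tag, and text for every <td>/<th> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._row = -1
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
        elif tag in ("td", "th"):
            self._current = {"row": self._row, "tag": tag, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.cells.append(self._current)
            self._current = None


def features(cell):
    """Hypothetical features: header-tag flag, digit ratio, first-row flag."""
    text = cell["text"]
    digits = sum(ch.isdigit() for ch in text)
    return [
        1.0 if cell["tag"] == "th" else 0.0,
        digits / len(text) if text else 0.0,
        1.0 if cell["row"] == 0 else 0.0,
    ]


def two_means(points, iterations=10):
    """Plain 2-means, seeded with the first point and the point farthest from it."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [points[0], max(points, key=lambda p: dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_cells(html):
    """Returns (meta_texts, data_texts) for the cells found in the HTML."""
    parser = CellCollector()
    parser.feed(html)
    cells = parser.cells
    if not cells:
        return [], []
    labels = two_means([features(c) for c in cells])
    # Assumption: the cluster containing the top-left cell holds the meta-data.
    meta_label = labels[0]
    meta = [c["text"] for c, lab in zip(cells, labels) if lab == meta_label]
    data = [c["text"] for c, lab in zip(cells, labels) if lab != meta_label]
    return meta, data
```

Calling split_cells on a small table whose first row uses th cells returns the header texts in the first list and the remaining cell texts in the second. A realistic extractor would need many more features and a more careful way of deciding which cluster is the meta-data one; this sketch only shows the shape of the technique.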

Cited By

  • (2022) On validating web information extraction proposals, Expert Systems with Applications: An International Journal 199:C, 10.1016/j.eswa.2022.116700. Online publication date: 23-May-2022

Published In

Information Sciences: an International Journal, Volume 577, Issue C
Oct 2021
902 pages

Publisher

Elsevier Science Inc.

United States

Author Tags

  1. HTML tables
  2. Data extraction

Qualifiers

  • Research-article
