DOI: 10.1145/3464509.3464892
research-article

Pre-Trained Web Table Embeddings for Table Discovery

Published: 20 June 2021

Abstract

Pre-trained word embedding models have become the de facto standard for modeling text in state-of-the-art analysis tools and frameworks. However, while massive amounts of textual data are stored in tables, word embedding models are usually pre-trained on large documents. This mismatch can lead to reduced performance on tasks that analyze text values in tables. To improve analysis and retrieval tasks on tabular data, we propose a novel embedding technique that is pre-trained directly on a large Web table corpus. In an experimental evaluation, we employ our models for various data analysis tasks on different data sources. Our evaluation shows that models using pre-trained Web table embeddings outperform the same models when applied to embeddings pre-trained on text. Moreover, we show that Web table embeddings make it possible to outperform state-of-the-art models for the investigated tasks.
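To make the mismatch between running text and tables concrete, the following minimal sketch derives (target, context) training pairs from a table instead of from a sliding window over sentences. It is an illustration under stated assumptions, not the method proposed in the paper: the function name `table_context_pairs`, the rule that cells sharing a row or a column count as context, and the toy table are all hypothetical.

```python
from itertools import permutations


def table_context_pairs(rows):
    """Return (target, context) cell pairs that share a row or a column.

    Assumption for illustration only: co-occurrence in a row or a column
    plays the role that a sliding text window plays for word embeddings.
    """
    pairs = []
    # Cells in the same row usually describe the same entity.
    for row in rows:
        pairs.extend(permutations(row, 2))
    # Cells in the same column usually share an attribute.
    # zip(*rows) transposes the table (and truncates to the shortest row).
    for column in zip(*rows):
        pairs.extend(permutations(column, 2))
    return pairs


# Hypothetical two-column toy table: city, country.
table = [
    ["Berlin", "Germany"],
    ["Paris", "France"],
]
pairs = table_context_pairs(table)
```

In this sketch "Berlin" is paired with "Germany" (same row) and with "Paris" (same column) but never with "France", a distinction that is lost if the table is flattened into one token sequence. Such pairs could then be fed to any skip-gram style trainer.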

Published In

aiDM '21: Proceedings of the Fourth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management
June 2021
44 pages
ISBN:9781450385350
DOI:10.1145/3464509
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Web tables
  2. learned representations
  3. table discovery

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 19 of 26 submissions, 73%

Article Metrics

  • Downloads (Last 12 months): 63
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 03 Oct 2024

Cited By

  • (2024) Leveraging Large Language Models for Sensor Data Retrieval. Applied Sciences 14:6 (2506). DOI: 10.3390/app14062506. Online publication date: 15-Mar-2024
  • (2024) ENTRANT: A Large Financial Dataset for Table Understanding. Scientific Data 11:1. DOI: 10.1038/s41597-024-03605-5. Online publication date: 13-Aug-2024
  • (2022) Towards practical approximate lineage. Proceedings of the 14th International Workshop on the Theory and Practice of Provenance (1-8). DOI: 10.1145/3530800.3534530. Online publication date: 17-Jun-2022
  • (2022) Qualitative measures for ad hoc table retrieval. Information Sciences: an International Journal 607:C (1-26). DOI: 10.1016/j.ins.2022.05.080. Online publication date: 1-Aug-2022
