SANTOS: Relationship-based Semantic Table Union Search

Authors:

Aamod Khatiwada,

Wolfgang Gatterbauer,

Renée J. Miller, and

Mirek Riedewald

Proceedings of the ACM on Management of Data, Volume 1, Issue 1

Article No.: 9, Pages 1 - 25

https://doi.org/10.1145/3588689

Published: 30 May 2023

Abstract

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
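
To make the contrast between column-based and relationship-based unionability concrete, here is a minimal toy sketch in Python. It is not the SANTOS scoring function and uses neither an existing KB nor a synthesized one; it only assumes that a binary relationship can be approximated by the set of value pairs that co-occur in a row. The function names (`column_score`, `relationship_score`), the Jaccard measure, and the example tables are all hypothetical choices made for illustration.

```python
from itertools import combinations


def column_values(table):
    """Map each column name to the set of values that appear in it."""
    cols = {}
    for row in table:
        for name, value in row.items():
            cols.setdefault(name, set()).add(value)
    return cols


def pair_values(table):
    """Map each column pair to the set of value pairs that co-occur in a row."""
    pairs = {}
    for row in table:
        for a, b in combinations(sorted(row), 2):
            pairs.setdefault((a, b), set()).add((row[a], row[b]))
    return pairs


def jaccard(x, y):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def column_score(query, candidate):
    """Column-only view: best value overlap per query column, averaged."""
    q, c = column_values(query), column_values(candidate)
    if not q or not c:
        return 0.0
    return sum(max(jaccard(qv, cv) for cv in c.values()) for qv in q.values()) / len(q)


def relationship_score(query, candidate):
    """Relationship view: best value-pair overlap per query column pair, averaged.

    Both orientations of each pair are tried, because the candidate's
    columns may be stored in the opposite order.
    """
    q, c = pair_values(query), pair_values(candidate)
    if not q or not c:
        return 0.0

    def best(qv):
        flipped = {(b, a) for a, b in qv}
        return max(max(jaccard(qv, cv), jaccard(flipped, cv)) for cv in c.values())

    return sum(best(qv) for qv in q.values()) / len(q)


# Hypothetical toy tables: same column values, different pairings.
query = [
    {"park": "Hyde Park", "city": "London"},
    {"park": "Central Park", "city": "New York"},
]
same_relationship = [
    {"site": "Hyde Park", "town": "London"},
    {"site": "Central Park", "town": "New York"},
]
different_relationship = [
    {"site": "Hyde Park", "town": "New York"},
    {"site": "Central Park", "town": "London"},
]

for name, cand in [("same", same_relationship), ("different", different_relationship)]:
    print(name, column_score(query, cand), relationship_score(query, cand))
```

In this toy setting both candidates look identical to the column-only score, because each column draws from the same values, while the pair-based score separates them. Exact value-pair overlap is a deliberately crude stand-in for the semantic relationships described in the abstract.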

Index Terms

SANTOS: Relationship-based Semantic Table Union Search
1. Information systems
  1. Data management systems
    1. Information integration

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 1

PACMMOD

May 2023

2807 pages

EISSN:2836-6573

DOI:10.1145/3603164

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Data Availability

Presentation video https://dl.acm.org/doi/10.1145/3588689#20230520-SANTOS_SIGMOD.mp4

Funding Sources

NSF (National Science Foundation)
