
SANTOS: Relationship-based Semantic Table Union Search

Published: 30 May 2023

Abstract

    Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
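    The idea of scoring unionability on relationships between column pairs, rather than on single columns, can be illustrated with a toy sketch. This is not the SANTOS algorithm: the function names, the dictionary-based table representation, the exact-match lookup, and the scoring rule are all assumptions made for illustration. The "synthesized KB" here is simply an index from row-level value pairs to the column pairs in the lake that contain them.

    ```python
    from collections import defaultdict
    from itertools import combinations
    from typing import Dict, List, Set, Tuple

    # Hypothetical table representation: column name -> values, rows aligned by index.
    Table = Dict[str, List[str]]


    def value_pairs(table: Table, col_a: str, col_b: str) -> Set[Tuple[str, str]]:
        """Return the set of (a, b) values that co-occur in the same row."""
        return set(zip(table[col_a], table[col_b]))


    def synthesize_kb(lake: Dict[str, Table]):
        """Index every row-level value pair by the (table, col_a, col_b) that holds it.

        Assumption: this stands in for a KB synthesized from the data lake itself.
        """
        kb = defaultdict(set)
        for name, table in lake.items():
            for col_a, col_b in combinations(table, 2):
                for pair in value_pairs(table, col_a, col_b):
                    kb[pair].add((name, col_a, col_b))
        return kb


    def relationship_score(query: Table, q_cols: Tuple[str, str],
                           candidate: str, kb) -> float:
        """Fraction of the query's value pairs found in some column pair of `candidate`."""
        pairs = value_pairs(query, *q_cols)
        if not pairs:
            return 0.0
        hits = sum(
            1 for p in pairs
            if any(table == candidate for table, _, _ in kb.get(p, ()))
        )
        return hits / len(pairs)


    # Two lake tables share the same "city" values but express different relationships.
    lake = {
        "parks": {"park": ["River Park", "Hyde Park"], "city": ["Chicago", "London"]},
        "films": {"title": ["Chicago", "Notting Hill"], "city": ["Chicago", "London"]},
    }
    query = {"park": ["River Park", "Hyde Park"], "located_in": ["Chicago", "London"]}

    kb = synthesize_kb(lake)
    parks_score = relationship_score(query, ("park", "located_in"), "parks", kb)
    films_score = relationship_score(query, ("park", "located_in"), "films", kb)
    ```

    In this toy lake, a purely column-based comparison would see identical city values in both candidate tables. The pair-level score separates them: only "parks" contains the same (park, city) pairs as the query.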










      Published In

      Proceedings of the ACM on Management of Data, Volume 1, Issue 1
      May 2023
      2807 pages
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Association for Computing Machinery

      New York, NY, United States

      Published in PACMMOD Volume 1, Issue 1



      Author Tags

      1. data lakes
      2. knowledge graphs
      3. table discovery


      • Research-article



      Cited By

      • (2024) Chorus: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17:8 (2104-2114). DOI: 10.14778/3659437.3659461. Online publication date: 31-May-2024
      • (2024) Digging Up Threats to Validity: A Data Marshalling Approach to Sensitivity Analysis. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (1-5). DOI: 10.1145/3665601.3669850. Online publication date: 9-Jun-2024
      • (2024) CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (16-25). DOI: 10.1145/3665601.3669846. Online publication date: 9-Jun-2024
      • (2024) Unstructured Data Fusion for Schema and Data Extraction. Proceedings of the ACM on Management of Data 2:3 (1-26). DOI: 10.1145/3654984. Online publication date: 30-May-2024
      • (2024) A Large Scale Test Corpus for Semantic Table Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1142-1151). DOI: 10.1145/3626772.3657877. Online publication date: 10-Jul-2024
      • (2024) A Study on Efficient Indexing for Table Search in Data Lakes. 2024 IEEE 18th International Conference on Semantic Computing (ICSC) (245-252). DOI: 10.1109/ICSC59802.2024.00046. Online publication date: 5-Feb-2024
      • (2024) Pyramid: A Heterogeneous Data Integration Algorithm Based on Hierarchical Graph. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (6220-6224). DOI: 10.1109/ICASSP48485.2024.10447879. Online publication date: 14-Apr-2024
      • (2023) R2D2: Reducing Redundancy and Duplication in Data Lakes. Proceedings of the ACM on Management of Data 1:4 (1-25). DOI: 10.1145/3626762. Online publication date: 12-Dec-2023
