SANTOS: Relationship-based Semantic Table Union Search

Published: 30 May 2023

    Abstract

    Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
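
    To make the distinction between column-based and relationship-based unionability concrete, the following is a minimal sketch and not the SANTOS algorithm itself. It assumes a tiny hand-written knowledge base (`TOY_KB`) and two hypothetical helpers (`column_pair_relationships`, `relationship_overlap`); all names, data, and the scoring rule are illustrative assumptions. The point it tries to show is that two candidate tables can both contain a country column, yet only one of them shares the query table's relationship between its columns.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy KB: (subject value, object value) -> relationship label.
# A real KB would be far larger; this is only for illustration.
TOY_KB = {
    ("Toronto", "Canada"): "locatedIn",
    ("Boston", "USA"): "locatedIn",
    ("Paris", "France"): "locatedIn",
    ("Lyon", "France"): "locatedIn",
    ("Justin Trudeau", "Canada"): "leaderOf",
    ("Emmanuel Macron", "France"): "leaderOf",
}


def column_pair_relationships(table):
    """Annotate each ordered column pair with its most frequent KB relationship."""
    n_cols = len(table[0])
    annotations = {}
    for i, j in permutations(range(n_cols), 2):
        # Each row whose value pair appears in the KB votes for a relationship.
        votes = Counter(
            TOY_KB[(row[i], row[j])]
            for row in table
            if (row[i], row[j]) in TOY_KB
        )
        if votes:
            annotations[(i, j)] = votes.most_common(1)[0][0]
    return annotations


def relationship_overlap(query, candidate):
    """Fraction of the query's relationship labels also found in the candidate."""
    q = set(column_pair_relationships(query).values())
    c = set(column_pair_relationships(candidate).values())
    return len(q & c) / len(q) if q else 0.0


# Query table: (city, country).
query = [("Toronto", "Canada"), ("Boston", "USA")]
# Candidate A: (city, country), same relationship as the query.
cities = [("Paris", "France"), ("Lyon", "France")]
# Candidate B: (person, country), shares a country column but not the relationship.
leaders = [("Justin Trudeau", "Canada"), ("Emmanuel Macron", "France")]

print(relationship_overlap(query, cities))   # 1.0
print(relationship_overlap(query, leaders))  # 0.0
```

    A purely column-based measure could rate both candidates as partially unionable because each has a country column; the relationship label is what separates them in this sketch.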

      Published In

      Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1
      May 2023
      2807 pages
      EISSN: 2836-6573
      DOI: 10.1145/3603164
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data lakes
      2. knowledge graphs
      3. table discovery

      Qualifiers

      • Research-article

      Cited By

      • (2024) Chorus: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17:8 (2104-2114). https://doi.org/10.14778/3659437.3659461. Online publication date: 31-May-2024
      • (2024) Digging Up Threats to Validity: A Data Marshalling Approach to Sensitivity Analysis. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (1-5). https://doi.org/10.1145/3665601.3669850. Online publication date: 9-Jun-2024
      • (2024) CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (16-25). https://doi.org/10.1145/3665601.3669846. Online publication date: 9-Jun-2024
      • (2024) Unstructured Data Fusion for Schema and Data Extraction. Proceedings of the ACM on Management of Data 2:3 (1-26). https://doi.org/10.1145/3654984. Online publication date: 30-May-2024
      • (2024) A Large Scale Test Corpus for Semantic Table Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1142-1151). https://doi.org/10.1145/3626772.3657877. Online publication date: 10-Jul-2024
      • (2024) A Study on Efficient Indexing for Table Search in Data Lakes. 2024 IEEE 18th International Conference on Semantic Computing (ICSC) (245-252). https://doi.org/10.1109/ICSC59802.2024.00046. Online publication date: 5-Feb-2024
      • (2024) Pyramid: A Heterogeneous Data Integration Algorithm Based on Hierarchical Graph. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (6220-6224). https://doi.org/10.1109/ICASSP48485.2024.10447879. Online publication date: 14-Apr-2024
      • (2023) R2D2: Reducing Redundancy and Duplication in Data Lakes. Proceedings of the ACM on Management of Data 1:4 (1-25). https://doi.org/10.1145/3626762. Online publication date: 12-Dec-2023
