DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models

Published: 26 March 2024

    Abstract

    Many organizations rely on data from government and third-party sources, and those sources rarely follow the same data formatting. This introduces challenges in integrating data from multiple sources or aligning external sources with internal databases. Commercial database systems do not offer adequate support for integrating data from heterogeneous sources, and manual integration is both time-consuming and inefficient. State-of-the-art data integration approaches that rely on similarity functions and textual transformations often fail to handle challenging cases where multiple mappings are required or where the mappings go beyond simple textual transformations.
    In this paper, we study the potential of deep neural models for transforming tables for joinability. In particular, we cast the problem as a prediction task and develop a framework that leverages large deep-learning language models to transform tabular data from a source formatting to a desired target representation. Our framework can efficiently learn the patterns for mapping a source formatting to an expected target using just a few examples, and the learned mapping can then be used for tasks such as table joining, filling in missing values, and error detection. Compared to state-of-the-art mapping and joining approaches, our framework delivers noticeably more accurate and scalable performance on both real-world and synthetic datasets. Our experimental evaluation also shows that the performance of the proposed framework with our fine-tuned model is on par with or better than that of large language models such as GPT-3, despite the significant difference in size, and that using large language models within our framework improves their performance.
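
    To make the example-driven setup concrete, the sketch below shows how a handful of (source, target) example pairs can be serialized into a single prompt for a sequence-to-sequence language model, which then predicts the target representation of a new source value. This is a minimal illustration only: the checkpoint, the <s>/<t> prompt markers, and the transform helper are assumptions made for this sketch, not the authors' exact implementation, and a base (non-fine-tuned) checkpoint would need task-specific fine-tuning before producing useful output.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Placeholder checkpoint; the paper fine-tunes its own model. A byte-level
    # model such as ByT5 suits character-level formatting transformations.
    MODEL_NAME = "google/byt5-small"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    def transform(examples, source_value):
        """Serialize a few (source, target) pairs plus a new source value into
        one prompt, and let the model generate the value in target formatting."""
        prompt = "".join(f"<s>{src}<t>{tgt}" for src, tgt in examples)
        prompt += f"<s>{source_value}<t>"
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=32)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Learn a name-reordering pattern from two examples, then apply it to a
    # third value; after fine-tuning, the expected output is "Alice Brown".
    examples = [("Doe, John", "John Doe"), ("Smith, Jane", "Jane Smith")]
    print(transform(examples, "Brown, Alice"))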


    Cited By

    • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2, 3 (2024), 1--28. https://doi.org/10.1145/3654979. Online publication date: 30 May 2024.

    Published In

    Proceedings of the ACM on Management of Data, Volume 2, Issue 1 (SIGMOD)
    February 2024, 1874 pages
    EISSN: 2836-6573
    DOI: 10.1145/3654807
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 March 2024
    Published in PACMMOD Volume 2, Issue 1

    Author Tags

    1. data integration
    2. language models
    3. table transformation
    4. unequal join

    Qualifiers

    • Research-article

    Funding Sources

    • Natural Sciences and Engineering Research Council
    • Servus Credit Union
