
Generation of Training Examples for Tabular Natural Language Inference

Published: 12 December 2023

Abstract

Tabular data is becoming increasingly important in Natural Language Processing (NLP) tasks such as Tabular Natural Language Inference (TNLI): given a table and a hypothesis expressed in NL text, the goal is to assess whether the structured data supports or refutes the hypothesis. In this work, we focus on the role played by annotated data in training the inference model. We introduce Tenet, a system for the automatic augmentation and generation of training examples for TNLI. Given the tables, existing approaches either rely on human annotators, and are thus expensive, or on methods that produce simple examples lacking data variety and complex reasoning. Our approach is instead built around the intuition that SQL queries are the right tool to achieve variety in the generated examples, both in the data involved and in the reasoning complexity. The first is achieved by evidence-queries that identify cell values over tables according to different data patterns. Once the data for an example is identified, semantic-queries describe the different ways such data can be selected with standard SQL clauses. These rich descriptions are then verbalized as text to create the annotated examples for the TNLI task. The same approach is also extended to create counterfactual examples, i.e., examples where the hypothesis is false, with a method based on injecting errors in the original (clean) table. For all steps, we introduce generic generation algorithms that take as input only the tables. In our experimental study, we use three datasets from the TNLI literature and two that we crafted from more complex tables. Tenet generates human-like examples, which lead to the effective training of several inference models, with results comparable to those obtained by training the same models with manually written examples.
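The pipeline described in the abstract can be sketched with a toy example. The table, queries, and verbalization template below are illustrative assumptions, not Tenet's actual generation algorithms: an evidence-query selects the cells an example will talk about, a semantic-query describes them with a standard SQL clause, the result is verbalized into a supported hypothesis, and an injected error yields the refuted counterfactual.

```python
import sqlite3

# Toy table standing in for the input relational data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany("INSERT INTO players VALUES (?, ?, ?)", [
    ("Alice", "Red", 31),
    ("Bob", "Red", 12),
    ("Carla", "Blue", 27),
])

# 1. Evidence-query: identify the cell values the example is built on.
evidence = conn.execute(
    "SELECT name, points FROM players WHERE team = 'Red'"
).fetchall()

# 2. Semantic-query: describe those values with a standard SQL clause
#    (here an aggregate over the same selection).
(total,) = conn.execute(
    "SELECT SUM(points) FROM players WHERE team = 'Red'"
).fetchone()

# 3. Verbalization: turn the query semantics into a supported hypothesis.
supported = f"The {len(evidence)} players of team Red scored {total} points in total."

# 4. Counterfactual: inject an error so the same sentence becomes a
#    refuted hypothesis for the original (clean) table.
refuted = supported.replace(str(total), str(total + 5))

print(supported)  # supported example (label: entailed)
print(refuted)    # counterfactual example (label: refuted)
```

In the paper's terms, varying the evidence-query's data pattern (different filters, columns, or rows) drives data variety, while varying the semantic-query's SQL clauses (aggregates, comparisons, joins) drives reasoning complexity; the sketch above fixes one choice of each.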


Cited By

  • (2024) BUNNI: Learning Repair Actions in Rule-driven Data Cleaning. Journal of Data and Information Quality 16(2), 1--31. DOI: 10.1145/3665930. Online publication date: 24-Jun-2024.
  • (2024) Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research. Natural Language Processing Journal 7, 100066. DOI: 10.1016/j.nlp.2024.100066. Online publication date: Jun-2024.


Published In

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 4
December 2023, 1317 pages
EISSN: 2836-6573
DOI: 10.1145/3637468

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. SQL-based NL generation
  2. data augmentation
  3. natural language processing (NLP) for databases
  4. query generation
  5. tabular natural language inference (TNLI)
  6. text generation

Qualifiers

  • Research-article


