mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer

José, Marcelo Archanjo; Cozman, Fabio Gagliardi

doi:10.1007/978-3-030-91699-2_35

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13074)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we therefore investigated translation to SQL when input questions are given in Portuguese. To do so, we adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system to rely on a multilingual BART model (we also report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with the original and translated training datasets together, even if a single target language is desired. The multilingual BART model fine-tuned with a double-size training dataset (English and Portuguese) achieved 83% of the baseline when making inferences on the Portuguese test dataset. This investigation can help other researchers produce Machine Learning results in languages other than English. Our multilingual-ready version of RAT-SQL+GAP and the data are available as open source, under the name mRAT-SQL+GAP, at: https://github.com/C4AI/gap-text2sql.
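
The double-size training dataset mentioned in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that each example is a dictionary with `db_id`, `question`, and `query` fields and that the Portuguese list is aligned one-to-one with the English list; the `lang` tag and the function name `build_double_size` are hypothetical and are not taken from the mRAT-SQL+GAP code.

```python
from typing import Dict, List

# Assumed record shape: {"db_id": ..., "question": ..., "query": ...}
Example = Dict[str, str]


def build_double_size(english: List[Example], portuguese: List[Example]) -> List[Example]:
    """Merge original and translated examples into one training list.

    Each output record is a copy of its input record with an added,
    hypothetical "lang" field so the two sources stay distinguishable.
    """
    if len(english) != len(portuguese):
        raise ValueError("expected one translated example per original example")
    combined: List[Example] = []
    for en, pt in zip(english, portuguese):
        combined.append({**en, "lang": "en"})
        combined.append({**pt, "lang": "pt"})
    return combined


# Toy data made up for this sketch, not copied from the released dataset.
english = [{
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
}]
portuguese = [{
    "db_id": "concert_singer",
    "question": "Quantos cantores temos?",
    "query": "SELECT count(*) FROM singer",
}]

train = build_double_size(english, portuguese)
print(len(train))
```

With one English and one Portuguese example, the merged list holds two records, which is the sense in which the combined dataset is "double-size". The actual file layout and fields used by the authors should be checked in the repository linked above.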

Supported by IBM and FAPESP (São Paulo Research Foundation).

Notes

1. Spider dataset: https://yale-lily.github.io/spider.
2. Spider test suite evaluation GitHub: https://github.com/taoyds/test-suite-sql-eval.
3. Spider leaderboard rank: https://yale-lily.github.io/spider.
4. RAT-SQL+GAP GitHub: https://github.com/awslabs/gap-text2sql.
5. Dev results are obtained locally by the developer; to get the official score and Test results, it is necessary to submit the model following the guidelines in “Yale Semantic Parsing and Text-to-SQL Challenge (Spider) 1.0 Submission Guideline” at https://worksheets.codalab.org/worksheets/0x82150f426cb94c17b861ef4162817399/.
6. mRAT-SQL+GAP GitHub: https://github.com/C4AI/gap-text2sql.
7. Cloud Translation API: https://googleapis.dev/python/translation/latest/index.html.
8. Simplemma: a simple multilingual lemmatizer for Python at https://github.com/adbar/simplemma.
9. Facebook BART-large: https://huggingface.co/facebook/bart-large.
10. Facebook mBART-50 many for different multilingual machine translations: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt.
11. BERTimbau-base: https://huggingface.co/neuralmind/bert-base-portuguese-cased.
12. Spider dataset translated to Portuguese and double-size (English and Portuguese together): https://github.com/C4AI/gap-text2sql.
13. mRAT-SQL+GAP GitHub: https://github.com/C4AI/gap-text2sql.
14. mRAT-SQL+GAP GitHub: https://github.com/C4AI/gap-text2sql.
15. BERTimbau-large: https://huggingface.co/neuralmind/bert-large-portuguese-cased.

Acknowledgment

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), supported by the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and by the IBM Corporation. The second author is partially supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), grant 312180/2018-7.

Author information

Authors and Affiliations

Instituto de Estudos Avançados, Universidade de São Paulo and Center for Artificial Intelligence (C4AI), São Paulo, Brazil
Marcelo Archanjo José
Escola Politécnica, Universidade de São Paulo and Center for Artificial Intelligence (C4AI), São Paulo, Brazil
Fabio Gagliardi Cozman

Corresponding author

Correspondence to Marcelo Archanjo José.

Editor information

Editors and Affiliations

Universidade Federal de Sergipe, São Cristóvão, Brazil
André Britto
Universidade de São Paulo, São Paulo, Brazil
Karina Valdivia Delgado

About this paper

Cite this paper

José, M.A., Cozman, F.G. (2021). mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_35

DOI: https://doi.org/10.1007/978-3-030-91699-2_35
Published: 28 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2
eBook Packages: Computer Science, Computer Science (R0)
