mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer

  • Conference paper
Intelligent Systems (BRACIS 2021)

Abstract

The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. Most techniques target the English language; in this work, we therefore investigate translation to SQL when the input questions are given in Portuguese. To do so, we adapted state-of-the-art tools and resources. We modified the RAT-SQL+GAP system to rely on a multilingual BART model (we also report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with the original and translated training datasets together, even if a single target language is desired. The multilingual BART model, fine-tuned with a double-size training dataset (English and Portuguese), achieved 83% of the baseline when making inferences on the Portuguese test dataset. This investigation can help other researchers produce Machine Learning results in languages other than English. Our multilingual-ready version of RAT-SQL+GAP and the data are available as open source, under the name mRAT-SQL+GAP, at: https://github.com/C4AI/gap-text2sql.
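The idea of training on the original and translated datasets together can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes Spider-style records with `db_id`, `question`, and `query` fields, assumes that only the question text is translated while the SQL stays unchanged, and uses a hypothetical lookup table (`TOY_TRANSLATIONS`) as a stand-in for a real machine translation call.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def build_double_size_dataset(
    english: List[Example],
    translate_question: Callable[[str], str],
) -> List[Example]:
    """Return the English examples followed by translated copies.

    Assumption: only the natural language question is translated; the SQL
    query and the database identifier are copied unchanged.
    """
    portuguese = [
        {**ex, "question": translate_question(ex["question"])}
        for ex in english
    ]
    return english + portuguese


# Hypothetical stand-in for a machine translation service.
TOY_TRANSLATIONS = {
    "How many singers do we have?": "Quantos cantores nós temos?",
}


def toy_translate(question: str) -> str:
    # Fall back to the untranslated question when no entry exists.
    return TOY_TRANSLATIONS.get(question, question)


# A single illustrative record in a Spider-like layout.
english_train = [
    {
        "db_id": "concert_singer",
        "question": "How many singers do we have?",
        "query": "SELECT count(*) FROM singer",
    }
]

combined = build_double_size_dataset(english_train, toy_translate)
for ex in combined:
    print(ex["question"], "->", ex["query"])
```

The combined list holds each SQL query twice, once paired with the English question and once with its Portuguese counterpart, which is one plausible reading of the "double-size" training dataset described above.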

Supported by IBM and FAPESP (São Paulo Research Foundation).

Notes

  1. Spider dataset: https://yale-lily.github.io/spider.

  2. Spider test suite evaluation GitHub: https://github.com/taoyds/test-suite-sql-eval.

  3. Spider leaderboard rank: https://yale-lily.github.io/spider.

  4. RAT-SQL+GAP GitHub: https://github.com/awslabs/gap-text2sql.

  5. Dev results are obtained locally by the developer; to get the official score and Test results, it is necessary to submit the model following the guidelines in “Yale Semantic Parsing and Text-to-SQL Challenge (Spider) 1.0 Submission Guideline” at https://worksheets.codalab.org/worksheets/0x82150f426cb94c17b861ef4162817399/.

  6. mRAT-SQL+GAP GitHub: https://github.com/C4AI/gap-text2sql.

  7. Cloud Translation API: https://googleapis.dev/python/translation/latest/index.html.

  8. Simplemma: a simple multilingual lemmatizer for Python at https://github.com/adbar/simplemma.

  9. Facebook BART-large: https://huggingface.co/facebook/bart-large.

  10. Facebook mBART-50 many-to-many multilingual machine translation: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt.

  11. BERTimbau-base: https://huggingface.co/neuralmind/bert-base-portuguese-cased.

  12. Spider dataset translated to Portuguese and double-size (English and Portuguese together): https://github.com/C4AI/gap-text2sql.

  13. mRAT-SQL+GAP GitHub: https://github.com/C4AI/gap-text2sql.

  14. mRAT-SQL+GAP GitHub: https://github.com/C4AI/gap-text2sql.

  15. BERTimbau-large: https://huggingface.co/neuralmind/bert-large-portuguese-cased.

Acknowledgment

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), supported by the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and by the IBM Corporation. The second author is partially supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), grant 312180/2018-7.

Author information

Corresponding author

Correspondence to Marcelo Archanjo José.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

José, M.A., Cozman, F.G. (2021). mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_35

  • DOI: https://doi.org/10.1007/978-3-030-91699-2_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91698-5

  • Online ISBN: 978-3-030-91699-2

  • eBook Packages: Computer Science, Computer Science (R0)
