Few-shot Text-to-SQL Translation Using Structure and Content Prompt Learning
Authors’ addresses: Zihui Gu, Renmin University of China, China, guzh@ruc.edu.cn; Ju Fan, Renmin University of China,
China, fanj@ruc.edu.cn; Nan Tang, QCRI / HKUST (GZ), Qatar / China, ntang@hbku.edu.qa; Lei Cao, MIT CSAIL / University
of Arizona, USA, lcao@csail.mit.edu; Bowen Jia, Renmin University of China, China, bowenjia@ruc.edu.cn; Sam Madden,
MIT CSAIL, USA, madden@csail.mit.edu; Xiaoyong Du, Renmin University of China, China, duyong@ruc.edu.cn.
https://doi.org/10.1145/3589292
1 INTRODUCTION
Text-to-SQL, which translates natural language (NL) questions to SQL queries, promises to enable
sophisticated data querying to non-technical end users. Supporting NL questions over tables has
many applications. For example, data visualization companies, such as Tableau and Power BI,
use NL questions to assist the construction of dashboards, where the required structured queries
over tables are relatively simple, e.g., most queries do not need joins or nested loops. In addition,
supporting Text-to-SQL translation has high demand from the database industry, e.g., Oracle [15]
and SalesForce [21].
The “Pre-train, Fine-tune” Paradigm. Pre-trained language models (PLMs), e.g., GPT-3 [2] and T5 [18], are pre-trained on very large corpora to learn general knowledge. Fine-tuning adapts these PLMs to
downstream tasks using task-specific objective functions and datasets, e.g., question answering [32]
and text summarization [41]. Recent studies, based on Text-to-SQL benchmarks such as Spider [39],
show that the state-of-the-art (SOTA) performance on Text-to-SQL is achieved by fine-tuning large
PLMs, such as Picard [24], which fine-tunes T5 [18] in a simple end-to-end fashion.
Few-shot Text-to-SQL. A common scenario for Text-to-SQL is that, for a new dataset or domain, sufficient high-quality training data is not available, and obtaining it requires expert knowledge that is expensive and labor-intensive to acquire. In this paper, we refer to the
case of limited training data as few-shot Text-to-SQL. The “pre-train, fine-tune” paradigm tends to
be ineffective in this few-shot setting for Text-to-SQL, because the PLMs have poor generalization
given limited training data on new datasets.
One possible strategy to mitigate poor generalization is through prompting [14], where textual
prompts are used to explicitly guide PLMs to better reason about the tasks, e.g., “generate an SQL
query” or “write a Java program”. Unfortunately, for the few-shot Text-to-SQL scenario when there
is not enough training data, traditional prompting techniques also cannot solve the problem, as we
illustrate in the following example.
Example 1 (Prompting for Text-to-SQL). Consider the database 𝐷 in Figure 1(a) with two
tables, highschooler and friend. Let 𝑁 be an NL question that asks for high school students who
do not have friends. As shown in the figure, the ground truth SQL query 𝑄 uses the EXCEPT operator
to return the student IDs from the table highschooler that are not in the table friend.
We fine-tuned the pre-trained T5-large [18] model with the textual prompt 𝑃 in Figure 1(b) using 5%
of the training data of the database 𝐷 from the Spider [39] Text-to-SQL benchmark, which is around
300 (NL, SQL) pairs. During testing, the fine-tuned model still outputs the incorrect SQL query 𝑄′,
where the column name student_id should be id. After careful analysis, we find that the cause of
these failures is a discrepancy between the pre-training tasks and our Text-to-SQL task, where the
limited training data is not sufficient to adapt the PLM to new datasets. □
Example 2 (SC-Prompt for Text-to-SQL). Consider the example in Figure 1 (a) again. Figure 1
(c) depicts our two-stage strategy.
(1) Structure stage. In this stage, we use a structure prompt P𝑆 to generate an SQL structure 𝑆
containing only SQL commands (e.g., SELECT, FROM, WHERE) and operators (e.g., <, >), while
leaving all contents as placeholders (e.g., [col], [tab], etc.).
(2) Content stage. In this stage, we use a content prompt P𝐶 , the previously generated SQL
structure 𝑆, together with the NL question 𝑁 and the database 𝐷 to guide a PLM to fill the
content placeholders (e.g., the first [col] should be column “id”, the first [tab] should be table
“highschooler”, and so on).
Finally, we combine the SQL structure 𝑆 and SQL content 𝐶 to generate an SQL query 𝑄 ′′ . Compared
with the ground truth query 𝑄 in Figure 1 (a), we note that SC-Prompt may predict a different
but equivalent SQL structure (e.g., EXCEPT vs. NOT IN), and correctly predicts the SQL query, i.e.,
𝑄 (𝐷) = 𝑄 ′′ (𝐷). □
Next, based on Example 2, we discuss why SC-Prompt is better than traditional end-to-end Text-
to-SQL translation in the few-shot learning setting. The structure stage focuses only on predicting
an SQL structure from an NL question, which is much easier than predicting a complete SQL query.
For the content stage, the task is to fill the content placeholders, a fill-in-the-blank task, which is
well aligned with the self-supervised masked language modeling objective used to pre-train many PLMs (e.g.,
BERT [4] and T5 [18]).
Challenges and Solutions. Effectively realizing the proposed SC-Prompt framework raises two
main technical challenges.
The first challenge is that manually designed textual prompts [2] are not flexible enough to
effectively guide PLMs to solve different prediction problems in the two stages. An intuitive way to
improve predictions is to provide more contextual information. To this end, we propose a hybrid
prompt strategy that combines learnable vectors [13] and fixed vectors (i.e., word embeddings of
textual prompts). The learnable vectors are tuned during training and are used to provide contextual
information to better guide PLMs for prediction in both stages.
The second challenge is how to improve the decoding process of the PLMs. A typical auto-
regressive decoding process may not be effective, because large PLMs, such as T5 [18] used in our
paper, usually have a large number of (sub-word) tokens, which increases the possibility of
generating invalid SQL queries. To this end, we propose a fine-grained constrained decoding strategy
to help prune the search space of both stages to make invalid SQL queries less likely. In particular,
we design keyword-constrained decoding to ensure the validity of generated SQL structures, and
structure-guided decoding to guarantee that the model fills the correct content.
Contributions. We summarize our contributions as follows.
(1) We study the problem of few-shot Text-to-SQL translation (Section 2), and introduce a divide-
and-conquer framework SC-Prompt that divides the Text-to-SQL task into two simpler stages
(sub-tasks), namely a structure stage and a content stage (Section 3).
(2) We propose two novel techniques to tackle the challenges of this problem: structure and content
prompt construction (Section 4) and fine-grained constrained decoding (Section 5).
(3) We conduct extensive experiments on three Text-to-SQL benchmarks with different difficulty
levels. Our experimental results show that SC-Prompt significantly outperforms the advanced
Text-to-SQL solutions in the few-shot scenario (Section 7).
2 FEW-SHOT TEXT-TO-SQL
This section first formalizes Text-to-SQL translation in Section 2.1, and then presents how to adopt
PLMs to solve the problem (Sections 2.2 and 2.3). Next, we propose a decomposition strategy, which
divides an SQL query into SQL structure and SQL content, to address the challenge of few-shot
Text-to-SQL (Section 2.4).
2.1 Text-to-SQL
SQL Queries. We consider standard SQL queries in relational databases, which consist of the
following elements.
• Commands: These are SQL reserved keywords such as SELECT, WHERE, COUNT, etc.
• Operators: These include SQL arithmetic operators (e.g., “+”, “−”, “∗”), bit-wise operators
(e.g., “&”, “|”), comparison operators (e.g., “>”, “<”), and logical operators (e.g., “NOT IN”).
• Identifiers: These refer to database specific objects, such as tables, columns, etc (e.g., column
student_id in Figure 1).
• Constant Values: These are fixed data values, which can be categorical, numerical and
textual values. Note that the textual values would be enclosed in single quote marks.
SQL Content. SQL content includes SQL identifiers and constant values. More specifically, we
concatenate placeholders and corresponding identifiers or constant values as the representation of
SQL content. For example, the representation of SQL content in Figure 1(c) is: “[col] id [tab]
highschooler [col] id [col] student_id [tab] friend”.
SQL Query = SQL Structure + SQL Content. The SQL structure and the SQL content correspond-
ing to one NL query can be trivially combined as an SQL query (see Figure 1(c)).
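To make this combination step concrete, the following minimal Python sketch (an illustration, not the paper's actual code) parses the content representation into (placeholder, value) pairs and fills the placeholders of the structure from left to right; the [val] placeholder for constant values is assumed alongside [col] and [tab].

```python
def parse_content(content: str):
    """Parse the content representation, e.g. "[col] id [tab] highschooler ...",
    into an ordered list of (slot, value) pairs."""
    pairs, slot, value = [], None, []
    for tok in content.split():
        if tok in ("[col]", "[tab]", "[val]"):
            if slot is not None:
                pairs.append((slot, " ".join(value)))
            slot, value = tok, []
        else:
            value.append(tok)
    if slot is not None:
        pairs.append((slot, " ".join(value)))
    return pairs

def combine(structure: str, content: str) -> str:
    """Fill the placeholders of the SQL structure, left to right."""
    query = structure
    for slot, value in parse_content(content):
        query = query.replace(slot, value, 1)   # fill the leftmost open slot
    return query

structure = "SELECT [col] FROM [tab] WHERE [col] NOT IN (SELECT [col] FROM [tab])"
content = "[col] id [tab] highschooler [col] id [col] student_id [tab] friend"
print(combine(structure, content))
# SELECT id FROM highschooler WHERE id NOT IN (SELECT student_id FROM friend)
```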
3 SOLUTION OVERVIEW
Fig. 2. An overview of SC-Prompt. First, the structure stage steers the PLM to generate an SQL structure 𝑆.
Then, the content stage guides the PLM to generate SQL content 𝐶 based on the structure 𝑆. Finally, the
structure 𝑆 and content 𝐶 are combined to generate an SQL query 𝑄.
Figure 2 shows an overview of our SC-Prompt framework for supporting few-shot Text-to-SQL
in two stages.
(1) The structure stage (Stage-S) steers the PLM to generate an SQL structure 𝑆, based on the
NL question 𝑁 , the database schema 𝐷, and a structure prompt.
(2) The content stage (Stage-C) guides the PLM to populate the placeholders in the generated SQL structure with specific values, resulting in the SQL content 𝐶.
We then combine the predicted SQL structure 𝑆 and SQL content 𝐶 to generate an SQL query 𝑄
as described in Section 2.4.
There are two core modules in our SC-Prompt framework, as shown in Figure 2: Prompt Con-
struction (blue rectangles) and Constrained Decoding (red rectangles), which we describe next.
Prompt Construction. Intuitively, the aim of prompt construction is to design a function 𝑓prompt that transforms the NL question 𝑁 and the database schema 𝐷 into an instruction for a PLM, which
we denote as prompt P where P = 𝑓prompt (𝑁 , 𝐷).
A straightforward implementation for 𝑓prompt is to use a natural language template, with input
and output placeholders, e.g.,
Translate NL to SQL, [𝑁 ] [𝐷], [𝑍 ]
where [𝑁 ] and [𝐷] are input placeholders, which can be populated with each input instance (𝑁 , 𝐷),
and [𝑍 ] is the output blank, which will be predicted by the PLM. Obviously, this prediction task
can be formulated as the masked language modeling (MLM) problem, which is the pre-training
objective of most PLMs.
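For illustration, such a template-based 𝑓prompt can be written as a simple string function; the exact wording, the “|” separator, and the schema serialization below are illustrative assumptions rather than the template used in the paper.

```python
def f_prompt(question: str, schema: str) -> str:
    # [N] and [D] are filled with the input instance; the output blank [Z]
    # is left for the PLM to generate.
    return f"Translate NL to SQL: {question} | {schema}"

question = "What are the ids of high school students who do not have friends?"
schema = ("Database: network. Table highschooler: id, name, grade; "
          "Table friend: student_id, friend_id")
print(f_prompt(question, schema))
```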
However, it is non-trivial to design an effective prompting function 𝑓prompt that will result in the
PLM making accurate predictions. In SC-Prompt, we divide the procedure into two
stages: Stage-S and Stage-C, shown in Figure 2. Our approach is to design different prompts for
different stages, although, even then, it is difficult to manually design appropriate prompts that
effectively guide the PLM in each stage.
To address these challenges, we introduce a novel hybrid prompt strategy that combines learnable
vectors for learning more contextual information and fixed vectors (i.e., word embeddings of
manually designed textual prompts) for capturing domain knowledge. This hybrid prompt, in the
form of a high-dimensional vector, is fed to a PLM. More details of prompt construction for both
stages are given in Section 4.
Constrained Decoding. Given a constructed prompt P in the form of a vector, a typical (encoder-
decoder) PLM first uses its encoder to convert P into a high-dimensional hidden vector x, and then
uses its decoder to generate an SQL query 𝑄, i.e.,
𝑄 = 𝑓decode (x) = 𝑓decode (𝑓encode (P)) (2)
A typical solution for decoder 𝑓decode is the auto-regressive method, as described in Section 2.2.
Large PLMs, such as T5 [18] used in our paper, usually have a large number of (sub-word) tokens,
which increases the possibilities of generating invalid SQLs. For this reason, constrained decoding
is introduced to prevent the generation of invalid SQL queries [24]. However, existing solutions are
designed for directly generating SQL, instead of the structures and contents in our SC-Prompt
framework. Therefore, we introduce a fine-grained constrained decoding method, which takes
full advantages of our structure and content prompting mechanism. More details of constrained
decoding are given in Section 5.
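To make Equation (2) concrete, the following sketch runs an off-the-shelf encoder-decoder PLM (Hugging Face T5, an assumed stand-in) with auto-regressive beam-search decoding over a plain textual prompt; the hybrid prompts of SC-Prompt are vectors rather than text, so this is only an illustration of the encode-then-decode pipeline.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = ("Translate NL to SQL: What are the ids of high school students "
          "who do not have friends? | Database: network. "
          "Table highschooler: id, name, grade; Table friend: student_id, friend_id")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # f_decode(f_encode(P)): auto-regressive decoding with beam search.
    output_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```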
4 PROMPT CONSTRUCTION
In this section, we first discuss two conventional prompt construction methods and their limitations
in Section 4.1. Then, we introduce our novel hybrid prompt construction method and its learning
method in Section 4.2 and Section 4.3 respectively.
[Figure 3: hybrid prompt construction. The fixed embeddings 𝑒𝑆 , 𝑒𝐶 , 𝑒𝑁 and 𝑒𝐷 (❶–❸) are combined with the learnable vectors 𝑣𝐵 , 𝑣𝑁 , 𝑣𝐷 and 𝑣𝐸 into the sequence V (❹), which yields the hybrid structure prompt 𝐼𝑆 (❺) and the hybrid content prompt 𝐼𝐶 (❻) fed to the PLM.]
Textual Prompts. This kind of prompt is manually designed as a natural language description of the task, and has been widely applied to NLP tasks including text summarization and machine translation. This method constructs prompts in a human-interpretable way, which is currently the most intuitive method.
Note that, a textual prompt needs to be converted into a sequence of vectors, in order to be used
by the encoder-decoder Text-to-SQL architecture. This is often performed using word embeddings
of PLMs where each input token will be converted into a vector. For example, the NL question
𝑁 , after being tokenized as tok(𝑁 ) = 𝑡 1, . . . , 𝑡𝑛 , will be converted into a sequence of vectors as
𝑒 𝑁 = 𝑣 1, . . . , 𝑣𝑛 , similar for the embeddings 𝑒 𝐷 of the database schema 𝐷 (see Figure 3 for tok(𝑁 ),
tok(𝐷), 𝑒 𝑁 and 𝑒 𝐷 ). These prompts are considered to be fixed, meaning that they will not be
updated (or tuned) during training on new datasets.
Learnable Vectors. This prompt consists of learnable vectors (a.k.a. continuous vectors [11, 13]),
which can be optimized during training. For example, prompt tuning [11] prepends a series of
continuous vectors to the input as a prompt; this shows good performance on the SuperGLUE
benchmark [29]. This method expands the design space of prompt beyond human-interpretable
language, allowing PLMs to learn which features should be included in prompts.
Limitations. The main limitation of textual prompts is that they are designed manually, and thus
highly rely on the designers’ expertise. Even the most experienced designers struggle to discover
optimal prompt templates [10]. The main limitation of learnable vectors, as empirically verified by
PPT [7], is that prompt tuning on learnable vectors performs poorly with few-shot training and is
greatly influenced by the prompt initialization methods (see Section 4.2 for different initialization
methods).
Input Preprocessing Step. We first tokenize the NL question 𝑁 into a sequence of tokens tok(𝑁) = 𝑡1, 𝑡2, . . . , 𝑡𝑛, and flatten the database schema 𝐷 containing table names and column names into another sequence of tokens tok(𝐷) = 𝑡′1, 𝑡′2, . . . , 𝑡′𝑚.
Afterwards, we use a PLM to convert the tokenized sequence to a vector sequence, where each
token will be converted to one high-dimensional vector. That is, the tokenized question tok(𝑁 )
will be converted into 𝑒 𝑁 , and the tokenized database schema tok(𝐷) will be converted into 𝑒 𝐷 , as
shown in Figure 3-❶. Note that we use the snowflake symbol (❄) to denote that these vectors are
fixed, which will not be tuned during training.
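A plausible realization of this preprocessing step is sketched below using Hugging Face T5 (the model name and the schema flattening format are assumptions): the question and the flattened schema are tokenized, and their tokens are mapped to frozen embedding sequences 𝑒𝑁 and 𝑒𝐷 through the PLM's input embedding layer.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
embed = model.get_input_embeddings()            # the PLM's token embedding layer

question = "What are the ids of high school students who do not have friends?"
schema = ("Database: network. Table highschooler: id, name, grade; "
          "Table friend: student_id, friend_id")

tok_n = tokenizer(question, return_tensors="pt").input_ids   # tok(N)
tok_d = tokenizer(schema, return_tensors="pt").input_ids     # tok(D)

with torch.no_grad():                            # "snowflake": kept frozen
    e_n = embed(tok_n)                           # e_N, shape (1, |tok(N)|, d_model)
    e_d = embed(tok_d)                           # e_D, shape (1, |tok(D)|, d_model)
```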
Hybrid Prompt Construction Step. Based on the above tokenization results, we construct the
textual prompts and the learnable vectors, respectively, and finally combine them to obtain the
hybrid prompt, as described as follows.
Sub-task Specific Textual Prompt. Since the two sub-tasks have different goals, we add textual
prompts composed of specific task descriptions to guide the model to produce the desired re-
sult. The structure prompt P𝑠 and the content prompt P𝑐 each consist of a short textual description of the corresponding sub-task.
Each of the above prompts will first be tokenized and then converted into a vector sequence
(same as the process of obtaining 𝑒 𝑁 ), where 𝑒 𝑆 is the vector sequence for P𝑠 (see Figure 3-❷) and
𝑒𝐶 is the vector sequence for P𝑐 (see Figure 3-❸). Note that, these vector sequences are also fixed.
Learnable Vectors. As it is hard to design appropriate fixed prompts to effectively guide PLMs, we
introduce learnable vectors, which can be tuned during training, for learning more contextual
information. The basic idea is to introduce four learnable vectors. Each one is designed for a
different purpose, e.g., question, database, and task, to help the PLM understand different aspects of
the input.
Formally, we introduce four learnable vectors {𝑣 𝐵 , 𝑣 𝑁 , 𝑣 𝐷 , 𝑣 𝐸 }, which are designed for different
purposes: (1) 𝑣 𝑁 is for learning NL question 𝑁 specific features, (2) 𝑣 𝐷 is for learning database 𝐷
specific features, and (3) 𝑣 𝐵 and 𝑣 𝐸 are for learning task specific features at the beginning and end of
the input context, respectively. The main intuition is that due to the positional encoding mechanism
of the transformer-based PLMs, the learnable vectors at the specific points can implicitly learn
the specific contextual information during training, so as to help the PLM understand different
parts of the input (e.g., the database schema) during prediction. Note that we only put the learnable
vectors in the input embedding layer, which is different from prefix-tuning [13] that prepends the
vectors to each layer in the encoder stack, including the input layer. Our experiments show that this
simpler solution is sufficient to achieve good performance in SC-Prompt. We conduct an ablation
study to demonstrate that our method works better than using one single learnable vector with
respect to one purpose (e.g., question, database, or task). Please refer to Section 7.2 for more details.
Next we describe three methods to initialize these four vectors (a short code sketch of all three follows the list).
(1) Random Initialization randomly initializes all the vectors based on a normal distribution.
(2) Vocabulary Initialization randomly samples words from the vocabulary of the PLM to initialize
the vectors.
(3) Keyword Initialization limits the initialization space to keywords related to our Text-to-SQL
task, e.g., “database”, “table”, “question”, etc.
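The sketch below illustrates all three initialization strategies, reusing the tokenizer and the embedding matrix from the preprocessing sketch above; the vectors are created with shape (1, 1, 𝑑model) so that they can later be concatenated with the embedding sequences, and the keyword-to-vector assignment at the end is an illustrative assumption.

```python
import torch

embed_weight = model.get_input_embeddings().weight      # (vocab_size, d_model)
d_model = embed_weight.size(1)

def random_init():
    # (1) Random: draw from a normal distribution.
    return torch.nn.Parameter(torch.randn(1, 1, d_model) * 0.02)

def vocab_init():
    # (2) Vocabulary: copy the embedding of a randomly sampled vocabulary token.
    idx = torch.randint(0, embed_weight.size(0), (1,)).item()
    return torch.nn.Parameter(embed_weight[idx].detach().clone().view(1, 1, -1))

def keyword_init(word: str):
    # (3) Keyword: average the embeddings of a task-related word's sub-tokens.
    ids = tokenizer(word, add_special_tokens=False).input_ids
    return torch.nn.Parameter(
        embed_weight[ids].detach().clone().mean(dim=0).view(1, 1, -1))

# One vector per purpose; the keyword choices here are illustrative.
v_B, v_N, v_D, v_E = (keyword_init(w) for w in
                      ("question", "question", "database", "table"))
```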
Prompts Combination. Let the embedding of the database be 𝑒𝐷 = 𝑒𝐷𝑛𝑎𝑚𝑒 ⊕ 𝑒𝑇 , where “⊕” denotes vector concatenation, 𝑒𝐷𝑛𝑎𝑚𝑒 is 𝑒𝐷 ’s sub-sequence w.r.t. the database name, and 𝑒𝑇 is 𝑒𝐷 ’s sub-sequence w.r.t. the table schema.
We first combine ❶ the embedding for the NL question and database schema with the learnable
vectors {𝑣 𝑁 , 𝑣 𝐷 , 𝑣 𝐸 } as follows:
V = 𝑣 𝑁 ⊕ 𝑒 𝑁 ⊕ 𝑣 𝐷 ⊕ 𝑒 𝐷𝑛𝑎𝑚𝑒 ⊕ 𝑒𝑇 ⊕ 𝑣 𝐸 (3)
The combined vector sequence V is depicted in Figure 3-❹. As discussed above, the reason to do
such combination is to learn features for each learnable vector with different semantics.
For the structure stage, we concatenate the above vector sequence V, the task-specific vector 𝑣 𝐵 ,
and the vectors for the structured prompt 𝑒 𝑆 as the hybrid structure prompt (see Figure 3-❺):
𝐼𝑆 = 𝑣 𝐵 ⊕ 𝑒 𝑆 ⊕ V (4)
This hybrid structure prompt 𝐼𝑆 will be used as the input to the structure stage.
For the content stage, we concatenate the above vector sequence V, the task-specific vector 𝑣 𝐵 ,
and the vectors for the content prompt 𝑒𝐶 as the hybrid content prompt (see Figure 3-❻):
𝐼𝐶 = 𝑣 𝐵 ⊕ 𝑒𝐶 ⊕ V (5)
This hybrid content prompt 𝐼𝐶 will be used as the input to the content stage.
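Putting Equations (3)–(5) together, the combination step amounts to a few tensor concatenations, as sketched below; batch-size-1 tensors of shape (1, length, 𝑑model) are assumed for the frozen embeddings, and the result can be fed to an encoder-decoder PLM via its inputs_embeds argument.

```python
import torch

def build_hybrid_prompt(e_task, e_n, e_dname, e_t, v_b, v_n, v_d, v_e):
    # V = v_N ⊕ e_N ⊕ v_D ⊕ e_Dname ⊕ e_T ⊕ v_E                  (Eq. 3)
    v_seq = torch.cat([v_n, e_n, v_d, e_dname, e_t, v_e], dim=1)
    # I_S = v_B ⊕ e_S ⊕ V   and   I_C = v_B ⊕ e_C ⊕ V            (Eq. 4 / Eq. 5)
    return torch.cat([v_b, e_task, v_seq], dim=1)

# Usage (illustrative): e_S / e_C are the fixed embeddings of the textual
# prompts; the result is passed to the PLM as inputs_embeds.
# I_S = build_hybrid_prompt(e_S, e_N, e_Dname, e_T, v_B, v_N, v_D, v_E)
# outputs = model(inputs_embeds=I_S, labels=gold_structure_token_ids)
```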
Example 3 (Hybrid prompt construction for Text-to-SQL). Consider the NL question 𝑁 and
the database 𝐷 in Figure 1.
Input preprocessing. First, the database 𝐷 is flattened to “Database: Network. Table highschooler:
id, name, . . . , grade; Table friend: student_id, . . . ”. And the flattened database is tokenized to a token
list: tok(𝐷) = [“database”, “:”, “network”, . . .]. Similarly, the NL question 𝑁 is also tokenized to
tok(𝑁 ) = [“what”, “are”, . . .]. Second, the base PLM is used to convert the tokens to high-dimensional
vectors: tok(𝐷) is converted to 𝑒 𝐷 = 𝑒 𝑑 ⊕ 𝑒𝑇ℎ𝑖𝑔ℎ𝑠𝑐ℎ𝑜𝑜𝑙𝑒𝑟 ⊕ 𝑒𝑇𝑓 𝑟𝑖𝑒𝑛𝑑 ; tok(𝑁 ) is converted to 𝑒 𝑁 .
Hybrid structure prompt construction. First, we combine 𝑒 𝐷 and 𝑒 𝑁 with the four learnable
vectors {𝑣 𝐵 , 𝑣 𝑁 , 𝑣 𝐷 , 𝑣 𝐸 } to get V = 𝑣 𝑁 ⊕ 𝑒 𝑁 ⊕ 𝑣 𝐷 ⊕ 𝑒 𝐷𝑛𝑎𝑚𝑒 ⊕ 𝑒𝑇ℎ𝑖𝑔ℎ𝑠𝑐ℎ𝑜𝑜𝑙𝑒𝑟 ⊕ 𝑒𝑇𝑓 𝑟𝑖𝑒𝑛𝑑 ⊕ 𝑣 𝐸 . Second, we
convert textual prompt P𝑠 to a vector sequence 𝑒 𝑆 and concatenate it with 𝑉 and 𝑣 𝐵 to get input 𝐼𝑆 for
predicting SQL structure: “SELECT [col] FROM [tab] EXCEPT SELECT [col] FROM [tab]”.
Hybrid content prompt construction. This stage is very similar to the previous stage, except
that the textual prompt P𝑐 contains the SQL structure 𝑆 generated above. We convert the textual prompt P𝑐 to a vector sequence 𝑒𝐶 and concatenate it with 𝑣𝐵 and V to form the input 𝐼𝐶 for predicting the SQL content:
“[col] id [tab] highschooler [col] id [col] student_id [tab] friend”.
Given the hybrid structure prompt 𝐼𝑆 , the tokens of the SQL structure representation are generated through auto-regressive decoding. The trainable parameters here include the SQL structure generator 𝐺𝑠 (i.e., the PLM used in Stage-S) and the learnable vectors, which are denoted as Θ𝑠 . Next, we learn Θ𝑠 by minimizing the negative log-likelihood loss, i.e.,
𝐿(Θ𝑠 ) = −(1/𝑛) ∑_{𝑖=1}^{𝑛} log 𝑃 (𝑆_gold^𝑖 | 𝐼_𝑆^𝑖 ; Θ𝑠 )
       = −(1/𝑛) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑚_𝑖} log 𝑃 (𝑆_gold^{𝑖,𝑗} | 𝐼_𝑆^𝑖 ; 𝑆_gold^{𝑖,1}, . . . , 𝑆_gold^{𝑖,𝑗−1} ; Θ𝑠 )    (6)
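With teacher forcing, the loss in Equation (6) is exactly the token-level cross-entropy that encoder-decoder PLMs compute over the gold structure, so a training step can be sketched as below; the few-shot data loader, the pre-computed embeddings, and the learning rate are illustrative assumptions carried over from the earlier sketches.

```python
import torch

params = list(model.parameters()) + [v_B, v_N, v_D, v_E]
optimizer = torch.optim.AdamW(params, lr=1e-4)      # learning rate is illustrative

for batch in train_loader:                           # hypothetical few-shot loader
    # e_S: fixed embeddings of the structure prompt; e_N, e_Dname, e_T: fixed
    # embeddings of the question and database schema (see the sketches above).
    I_S = build_hybrid_prompt(e_S, batch["e_N"], batch["e_Dname"], batch["e_T"],
                              v_B, v_N, v_D, v_E)
    # labels: token ids of the gold SQL structure S_gold; the PLM's built-in
    # cross-entropy over these labels is the negative log-likelihood of Eq. (6).
    loss = model(inputs_embeds=I_S, labels=batch["gold_structure_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```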
Next, we combine constrained decoding and beam search to design decoding methods for the two
stages: keyword constrained decoding for SQL structure generation and structure guided decoding
for SQL content prediction.
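One way to realize keyword-constrained decoding is Hugging Face's prefix_allowed_tokens_fn hook, as sketched below: the beam search is restricted to the sub-tokens of a keyword vocabulary V𝑘𝑒𝑦 of SQL commands, operators, and placeholders. The keyword list and the reuse of the tokenizer, model, and inputs from the earlier decoding sketch are assumptions; the paper's own implementation may differ.

```python
# SQL commands, operators, and placeholders allowed in the structure stage.
KEYWORDS = ["select", "from", "where", "group by", "order by", "having",
            "except", "intersect", "union", "not in", "count", "avg", "sum",
            "(", ")", ",", "<", ">", "=", "[col]", "[tab]", "[val]"]

allowed_ids = set()
for kw in KEYWORDS:
    allowed_ids.update(tokenizer(kw, add_special_tokens=False).input_ids)
allowed_ids.add(tokenizer.eos_token_id)
allowed_ids = sorted(allowed_ids)

def keep_keywords(batch_id, input_ids):
    # Called at every decoding step; returning this fixed list restricts the
    # beam search to sub-tokens of the keyword vocabulary V_key.
    return allowed_ids

structure_ids = model.generate(**inputs, num_beams=4, max_length=64,
                               prefix_allowed_tokens_fn=keep_keywords)
print(tokenizer.decode(structure_ids[0], skip_special_tokens=True))
```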
Fig. 4. Example of our fine-grained constrained decoding for the question and database schema in Figure 1.
Slot-Level. The goal of slot-level constraints is to ensure that the predicted content is consistent
with the slot (i.e., placeholder). For example, [col] can only be followed by the column name.
Technically, we check each time step of beam search, and set the score of candidate tokens that do
not match the slot to −∞. This allows the decoder to search over only valid content.
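One possible realization of the slot-level constraint is a custom logits processor, sketched below with Hugging Face's LogitsProcessor interface: whenever the decoded prefix ends with [col] or [tab], every candidate token that cannot start a valid column or table name receives a score of −∞. The matching here is deliberately simplified (it only constrains the first sub-token of a name); the paper's implementation is more fine-grained.

```python
import torch
from transformers import LogitsProcessor

class SlotConstraint(LogitsProcessor):
    """Mask tokens that cannot start a valid name right after [col] / [tab]."""

    def __init__(self, tokenizer, columns, tables):
        self.tokenizer = tokenizer
        # First sub-token id of every valid column / table name in the database.
        self.col_starts = {tokenizer(c, add_special_tokens=False).input_ids[0]
                           for c in columns}
        self.tab_starts = {tokenizer(t, add_special_tokens=False).input_ids[0]
                           for t in tables}

    def __call__(self, input_ids, scores):
        for i, seq in enumerate(input_ids):
            prefix = self.tokenizer.decode(seq, skip_special_tokens=True).strip()
            allowed = (self.col_starts if prefix.endswith("[col]") else
                       self.tab_starts if prefix.endswith("[tab]") else None)
            if allowed is not None:
                mask = torch.full_like(scores[i], float("-inf"))
                mask[list(allowed)] = 0.0        # keep only slot-consistent tokens
                scores[i] = scores[i] + mask
        return scores

# Usage (illustrative):
# from transformers import LogitsProcessorList
# processors = LogitsProcessorList([SlotConstraint(
#     tokenizer, columns=["id", "name", "grade", "student_id", "friend_id"],
#     tables=["highschooler", "friend"])])
# content_ids = model.generate(**inputs, num_beams=4, logits_processor=processors)
```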
Sentence-Level. The goal of sentence-level decoding is to ensure that the generated SQL query is
semantically correct at the sentence-level. We use four consistency checks:
• Column-Table Consistency avoids the use of column names that are not in the FROM list
of SQL queries.
• Operator-Value Consistency avoids operations on values of incompatible types.
• Aggregation-Column Consistency avoids aggregation operations whose type does not
apply to its target column.
• Column-Column Consistency avoids operations on two columns whose value types do
not match each other.
Unlike the slot-level checks, sentence-level checks are performed after the beam search. That
is, after beam search outputs the 𝐾 hypotheses with the highest probability score, we check their
correctness. The one that is correct and has the highest probability score will eventually be selected.
If all the top-𝐾 hypotheses fail to pass these checks, we default to the hypothesis with the highest
probability.
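A self-contained sketch of this selection step is shown below: among the top-𝐾 hypotheses (assumed sorted by decreasing probability score), the first one that passes all checks is kept, and otherwise the most probable hypothesis is returned. Only a crude column-table consistency check is implemented; the other three checks from the list above are left as stubs.

```python
import re

def column_table_consistent(sql: str, schema: dict) -> bool:
    """schema maps table name -> set of column names."""
    used_tables = {t for t in schema if re.search(rf"\b{t}\b", sql, re.I)}
    visible = set().union(*(schema[t] for t in used_tables)) if used_tables else set()
    used_cols = {c for cols in schema.values() for c in cols
                 if re.search(rf"\b{c}\b", sql, re.I)}
    return used_cols <= visible

# Placeholder stubs for the remaining three checks described above.
def operator_value_consistent(sql, schema):      return True
def aggregation_column_consistent(sql, schema):  return True
def column_column_consistent(sql, schema):       return True

CHECKS = (column_table_consistent, operator_value_consistent,
          aggregation_column_consistent, column_column_consistent)

def select_hypothesis(hypotheses, schema):
    """hypotheses: top-K SQL strings sorted by decreasing probability score."""
    for sql in hypotheses:
        if all(check(sql, schema) for check in CHECKS):
            return sql
    return hypotheses[0]    # default: the highest-probability hypothesis

schema = {"highschooler": {"id", "name", "grade"},
          "friend": {"student_id", "friend_id"}}
print(select_hypothesis(
    ["SELECT student_id FROM highschooler",      # fails: student_id not visible
     "SELECT id FROM highschooler"], schema))    # passes the checks
```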
Example 4 (Fine-grained constrained decoding). Consider the NL question 𝑁 and the database
𝐷 in Figure 1.
Keyword constrained decoding for the structure stage. The target output is “SELECT [col]
FROM [tab] EXCEPT SELECT [col] FROM [tab]”. The SQL structure decoder uses beam search
to keep the most likely 𝐾 hypotheses at each time step and leverages vocabulary V𝑘𝑒𝑦 to limit each
hypothesis to contain only valid SQL commands, operators and three placeholders. As shown in the
constrained beam search process in Figure 4, “the” in time step-1 is excluded from the subsequent beam
search.
Structure guided decoding for the content stage. The output is “[col] id [tab] highschooler
[col] id [col] student_id [tab] friend”. The SQL content decoder first ensures that each
slot is followed by the correct content. If the slot is [col], the next possible token “friend” will be
rejected because it is not a column name of the database 𝐷. Then after the beam search outputs the
top-𝐾 hypotheses, the decoder checks the semantic correctness of each hypothesis and rejects the
hypotheses containing the following four errors: (i) Column-Table inconsistency. The column name “id”
appears in the hypothesis, but its corresponding table “highschooler” does not appear. (ii) Operator-
Value inconsistency. The comparison operator < operates on a textual value. (iii) Aggregation-Column
inconsistency. The aggregation operator SUM operates on the column “name”. (iv) Column-Column
inconsistency. The hypothesis is “[col] id [tab] highschooler [col] name [col] student_id
[tab] friend”, where “name” and “student_id” do not match because “student_id” is a numeric
column and “name” is a string column.
6 EXPERIMENTAL SETUP
6.1 Datasets
Table 1. Statistics of Text-to-SQL benchmarking datasets: #Instance, #DB, and #Domain represent numbers
of annotated instances, relational databases, and domains respectively, and #Table/DB denotes the average
number of tables per database.
Dataset #Instance #DB #Domain #Table/DB ORDER BY GROUP BY NESTED HAVING
Spider [39] 10181 200 138 5.1 1335 1491 844 388
CoSQL [38] 15598 200 138 5.1 2290 2171 840 688
GeoQuery [40] 877 1 1 6 20 46 167 9
GeoQuery. GeoQuery is a simple Text-to-SQL benchmark with only one database about US
geography [40]. It has 877 annotated instances, where each instance contains an NL question and
an SQL query. For fair comparison, we follow the existing approach SeqZero [36] (which will be
introduced later) to split the GeoQuery dataset based on the patterns of SQL queries, and thus
obtain three training sets with sizes: 536, 150 and 50.
6.3 Baselines
To comprehensively evaluate the SC-Prompt framework, we used 10 state-of-the-art Text-to-SQL
methods at the time of writing as baselines: DT-Fixup [35], LGESQL [3], SmBoP [20] are specifically
designed for the Spider dataset; R2SQL [8] is specific to the CoSQL dataset; T5 [18] and PICARD [24]
can be applied to both Spider and CoSQL datasets; finally, Iyer et al. [9], GECA [1], Zheng and
Lapata [42], and SeqZero [36] are designed for and reported to perform well on GeoQuery.
DT-Fixup [35] proposes a theoretically reasonable optimization strategy. It trains a deep-transformer
model on Text-to-SQL task and the reading comprehension task, which has good generalization
ability and fast convergence speed.
LGESQL [3] applies a line-graph enhanced relational graph attention network (RGAT) as the
encoder, and employs an additional edge-centric line graph constructed from a question-database
node-centric graph.
SmBoP [20] develops a semi-autoregressive bottom-up parser. We report the results of SmBoP
combined with Grappa [37], which is a BERT-like language model pre-trained on a large number
of question-SQL pairs, and has achieved superior performance.
7 EXPERIMENTAL RESULTS
In this section, we first report the overall performance of our SC-Prompt framework on the three
benchmarks (Section 7.1). Then, we respectively evaluate our proposed hybrid prompt construction
method (Section 7.2) and the fine-grained constrained decoding strategy (Section 7.3). Finally, we
discuss the evaluation with full training data (Section 7.4), and provide a case study to qualitatively
analyze the effectiveness of SC-Prompt (Section 7.5).
Table 2. Overall results of few-shot Text-to-SQL on the Spider dataset. We randomly select 5%, 10%, 15%
and 20% of the official training set as the few-shot training sets for evaluation. Note that the notation “-”
indicates that the source code of a method does not consider the evaluation metric.
Model EM (%) EX (%)
% Official training data 5% 10% 15% 20% 5% 10% 15% 20%
Existing Methods with PLMs
DT-Fixup [35] 22.9 36.1 41.3 48.2 - - - -
LGESQL + BERT-Large [3] 26.4 35.1 39.4 48.5 - - - -
LGESQL + ELECTRA-Large [3] 38.3 47.8 53.7 58.2 - - - -
SmBoP + Grappa [20] 36.9 48.2 56.2 58.7 - - - -
T5-Large [26] 34.0 42.7 45.3 50.1 37.3 46.2 48.1 52.5
Picard + T5-Large [24] 40.1 46.4 50.2 56.4 43.7 52.7 55.4 57.9
Conventional Prompt Learning Methods
Textual Prompts + T5-Large 34.7 43.5 46.1 50.4 37.9 47.1 48.3 53.1
Learnable Vectors + T5-Large 28.4 38.7 39.8 40.3 32.3 41.3 44.1 44.5
SC-Prompt (Our Approach)
SC-Prompt + T5-Large 45.2 (+5.1) 52.8 (+4.6) 57.2 (+1.0) 59.5 (+0.8) 49.1 (+5.4) 56.9 (+4.2) 59.3 (+3.9) 62.0 (+4.1)
w/o Hybrid Prompt Learning 43.7 49.5 54.2 56.6 46.2 54.4 57.1 59.6
w/o Constrained Decoding 39.5 48.1 50.2 52.4 42.7 50.3 51.9 55.1
Table 3. Overall results of few-shot Text-to-SQL on the CoSQL dataset. We randomly select 5%, 10%, 15% and
20% of the official training set as the few-shot training sets for evaluation.
Model EM (%) IM (%)
% Official training data 5% 10% 15% 20% 5% 10% 15% 20%
Existing Methods with PLMs
R2SQL [8] 21.3 27.9 31.9 34.8 4.1 6.5 7.2 9.6
T5-Large [26] 27.4 33.1 37.5 39.7 4.1 5.8 11.0 14.0
Picard + T5-Large [24] 30.2 35.6 39.8 42.1 6.9 8.3 13.4 16.2
Conventional Prompt Learning Methods
Textual Prompts + T5-Large 28.0 33.5 37.6 39.9 4.4 6.2 10.9 13.7
Learnable Vectors + T5-Large 22.9 28.1 32.5 33.1 4.1 5.1 8.6 9.9
SC-Prompt (Our Approach)
SC-Prompt + T5-Large 33.4 (+3.2) 38.3 (+2.7) 41.2 (+1.2) 44.2 (+2.1) 10.0 (+3.1) 10.1 (+1.8) 13.7 (+0.3) 16.1 (-0.1)
w/o Hybrid Prompt Learning 32.2 37.1 40.3 42.9 9.2 9.7 13.0 15.2
w/o Constrained Decoding 30.7 35.8 38.7 42.0 7.6 8.6 11.2 13.5
As shown in Figure 5, SC-Prompt performs better than the baseline methods at all difficulty levels. In particular, we can see that on the Spider
dataset, LGESQL + ELECTRA-large [3] slightly outperforms the general PLM based method Picard
+ T5-Large on the Easy and Medium level. This is because the task-specific schema-linking layers
in LGESQL + ELECTRA-large can understand the training data more effectively. However, the
performance of SC-Prompt also exceeds these task-specific methods, showing that the divide-
and-conquer approach of SC-Prompt can address the problem of limited training data in few-shot
Text-to-SQL.
Conclusion: (1) With less training data, SC-Prompt is more effective. The main reason is that the
PLM is more prone to the poor generalization problem when there is very limited training data
Table 4. Overall results of few-shot Text-to-SQL methods on the GeoQuery dataset. Following SeqZero [36],
we use 50, 150 and 536 training instances as the few-shot training sets for evaluation. The notation “-” indicates
the corresponding setting is not evaluated in the original paper.
Model EM (%)
#Training data 50 150 536
Existing Methods with PLMs
Iyer et al. [9] - - 40.0
GECA [1] - - 49.0
Zheng and Lapata [42] - - 69.6
BART-Large [12] 41.2 73.1 72.5
SeqZero + Bart-Large [36] 48.9 74.2 74.7
T5-Large 39.5 69.7 71.4
Conventional Prompt Learning Methods
Textual Prompts + T5-Large 40.3 71.9 73.1
Learnable Vectors + T5-Large 14.7 44.3 49.5
Our SC-Prompt Framework
SC-Prompt + T5-Large 49.0 (+0.1) 76.7 (+2.5) 79.7 (+5.0)
w/o Hybrid Prompt Learning 46.0 73.4 76.5
w/o Constrained Decoding 47.5 74.9 78.1
[Figure 5 plots EM (%) by difficulty level: (a) on the Spider dataset for DT-Fixup, LGESQL + ELECTRA-large, Picard + T5-Large, and SC-Prompt + T5-Large; (b) on the CoSQL dataset for R2SQL, Picard + T5-Large, and SC-Prompt + T5-Large.]
Fig. 5. Approach comparisons on the Spider and CoSQL datasets with respect to difficulty levels. As our
evaluation metrics have similar comparative results, we only show the results on EM here.
available (i.e., 5%). (2) SC-Prompt outperforms the baseline methods at all difficulty levels in the
few-shot setting.
The prompt tuning method based on learnable vectors [11] cannot achieve performance comparable to the standard full-PLM fine-tuning method. However, since learnable vectors can learn semantic features specific to different Text-to-SQL datasets, which can complement textual prompts, SC-Prompt proposes a
hybrid prompt construction method to combine the best of these two methods. The ablation results
show that, even only with our hybrid prompt, SC-Prompt can effectively improve the Text-to-SQL
performance, and outperforms conventional prompt construction methods.
Evaluation on initialization methods for learnable vectors. We first compare the three
initialization methods for learnable vectors (see Section 4). In particular, we conduct experiments
on the Spider dataset and the CoSQL dataset using 10% of the official training data. As shown in
Table 5, initialization using keywords related to our Text-to-SQL task (e.g., “database”, “table”, and “question”) is slightly better than vocabulary initialization, which samples from the PLM vocabulary. The main reason is that the vocabulary initialization strategy may introduce a lot of noise, which would affect the model’s understanding of the original input. In contrast, the keyword initialization uses task-related words to initialize the vectors, which reduces the possibility of introducing noise and achieves better performance.
Evaluation on prompt combination methods. For different prompt combination methods, we
perform an ablation study to compare the methods discussed in Section 4.2: only before the NL
question 𝑒 𝑁 , only before the database 𝑒 𝐷 , and only at the tail. We conduct experiments on both
the Spider dataset and the CoSQL dataset using 10% of the official training data. The experimental
results in Table 6 show that our method performs better than other methods. The main reason is
that putting the learnable vectors in different positions during the prompt construction process
can obtain prompts for different purposes (e.g., learning NL question specific features), and they
can better guide the PLM to output the desired results.
Conclusion: (1) The combination of the textual prompts and the learnable vectors (i.e., hybrid
prompt learning) achieves the best results. (2) The three initialization methods perform similarly,
with the keyword initialization method slightly better. (3) Constructing the learnable vectors by
putting them in different positions works better than putting them in a single position.
[Figure 6 plots EM (%) against beam sizes 1, 2, 4, and 8 for four variants: w/o Constraint, Keyword Constrained, Structure Guided, and SC-Prompt.]
Fig. 6. The performance of different components under different beam sizes in our decoding method. This
experiment is conducted on the Spider dataset, using 10% of the training data. EX Acc. (%) and EM Acc. (%)
have similar comparative results, so we only show EM Acc. (%) here.
We evaluate our constrained decoding on the Spider dataset (using 10% of the training set). Specifically, we evaluate the performance of these two task-specific decoding methods under
different beam sizes. Recall that the beam size refers to the number of hypotheses retained by the
model during decoding. Thus, the larger the beam size is, the more likely the model can avoid the
risk of missing hidden high probability sequences. Note that, when the beam size is 1, the method
degenerates into a greedy search strategy. The experimental results are reported in Figure 6. We
can see that the performance of all four decoding methods increases as the beam size grows, especially for the three decoding methods with constraints. We can also find that SC-Prompt, which is equipped with both keyword-constrained and structure-guided decoding, achieves the best performance, which validates the effectiveness of our fine-grained constrained decoding strategies.
Conclusion: (1) The effectiveness of the decoding methods increases with the beam size. (2) Our fine-grained constrained decoding method, equipped with both keyword-constrained and structure-guided decoding, achieves the best performance.
Although the goal of SC-Prompt is mainly to solve the poor generalization problem in the few-
shot scenario, we also conduct experiments on the full training data setting to verify the effectiveness
of our proposed framework under different training data sizes. We conduct experiments on the
T5-Large model and the T5-3B model respectively, and compare them with the best baseline model
Picard [24]. Table 7 shows the experimental results on the Spider dataset and the CoSQL dataset.
Note that the GeoQuery dataset released by [5] only contains 536 training samples, and the results
in Section 7.1 have already considered the full training data. We can see that SC-Prompt performs
better than the baseline method on the two base PLMs. This shows that even if a large amount of
training data is used, the poor generalization problem still exists due to the large complexity of the
Text-to-SQL task; and our divide-and-conquer framework of SC-Prompt is still effective under the
full training data setting.
Conclusion: When more training data is available (e.g., 100%), SC-Prompt still outperforms
baselines, but the gap becomes smaller, as expected.
8 RELATED WORK
8.1 Text-to-SQL Semantic Parsing
The task of Text-to-SQL, which provides a natural language (NL) interface that allows a non-expert user to easily access a relational database, has been studied by the community over the last few
decades. In recent years, deep learning models using large-scale training data have substantially
improved the accuracy of text-to-SQL. Several methods propose complex encoder architectures
with graph-based layers to model the relationship between the natural language question and the
database schema. For example, RATSQL [30] proposes a relation-aware self-attention mechanism to
encode the hybrid question-schema graph. LGESQL [3] takes the topological structure of edges into
account and models the 1-hop edge features in text-to-SQL with a line graph. R2SQL [8] is proposed
to solve the conversational text-to-SQL task (e.g., CoSQL [38]), which encodes the relationship
in context-dependent questions via a dynamic graph framework. More recently, large PLMs (e.g., GPT-3 [2], T5 [18]), which take advantage of the extensive knowledge in large-scale corpora, have been shown to be effective in improving text-to-SQL and achieving state-of-the-art performance. Different from
the previous methods with task-specific design, PLM-based methods [24, 26] can be applied to
different text-to-SQL datasets without specific modifications to the base PLMs. More importantly,
many studies [2, 6, 16] have shown the good performance of large PLMs in few-shot scenarios, which
is very important for Text-to-SQL that requires high-quality training data. Therefore, many works
have used large PLMs for few-shot Text-to-SQL. To bridge the gap between the natural language and
Table 8. Case study of the generated SQL queries. Questions 1-2 are sampled from the Spider dataset (10%),
questions 3-4 are sampled from the CoSQL dataset (10%), and question 5 is sampled from the GeoQuery
dataset (536). The errors in the generated SQLs from T5-Large are highlighted in red.
Question-1: What are the ids of high school students who do not have friends?
★Ground Truth: SELECT id FROM highschooler EXCEPT SELECT student_id FROM friend
•T5-Large: SELECT student_id FROM highschooler EXCEPT SELECT student_id FROM
friend (×)
•SC-Prompt: SELECT id FROM highschooler WHERE id NOT IN (SELECT student_id FROM
friend) (✓)
Question-2: What are the different years in which there were cars produced that weighed less
than 4000 and also cars that weighted more than 3000?
★Ground Truth: SELECT DISTINCT year FROM cars_data WHERE weight BETWEEN 3000
AND 4000
•T5-Large: SELECT DISTINCT year FROM cars_data WHERE weight > 3000 (×)
•SC-Prompt: SELECT DISTINCT year FROM cars_data WHERE weight < 4000 INTERSECT
SELECT year FROM cars_data WHERE weight > 3000 (✓)
Question-3: Can you tell me the age of each teacher?
★Ground Truth: SELECT name, age FROM teacher
•T5-Large: SELECT t2.name, t2.age FROM course AS t1 JOIN teacher AS t2 ON t1.teacher_id
= t2.teacher_id GROUP BY t2.name (×)
•SC-Prompt: SELECT name, age FROM teacher (✓)
Question-4: What is the average GNP and total population in all nations whose government is
US territory?
★Ground Truth: SELECT AVG(gnp), SUM(population) FROM country WHERE governmentform = “US Territory”
•T5-Large: SELECT AVG(gnp), AVG(population) FROM country WHERE governmentform =
“US Territory” (×)
•SC-Prompt: SELECT AVG(gnp), SUM(population) FROM country WHERE governmentform
= “US Territory” (✓)
Question-5: What is the population of new mexico?
★Ground Truth: SELECT statealias0.population FROM state AS statealias0 WHERE
statealias0.state_name = “new mexico”
•T5-Large: SELECT cityalias0.population FROM city AS cityalias0 WHERE
cityalias0.city_name = “new mexico” (×)
•SC-Prompt: SELECT statealias0.population FROM state AS statealias0 WHERE
statealias0.state_name = “new mexico” (✓)
SQL queries, [27] and [25] first generate canonical natural language utterances with large PLMs and
then transform them to the target SQL queries. Next, considering that it is difficult for the PLM to
directly output a long canonical utterance, SeqZero [36] decomposes the original canonical utterance
into multiple sub-clauses (e.g., SELECT-FROM-WHERE) through several manually designed natural
language templates and guides the model to predict each sub-clause sequentially.
Different from the above methods, SC-Prompt does not rely on any manually predefined tem-
plates or grammar rules. To tackle the challenge of limited training data, SC-Prompt decomposes
the complicated Text-to-SQL task into two simpler sub-tasks via structure and content prompt
learning. In addition, to generate semantically correct and executable SQL queries, SC-Prompt
introduces effective prompt construction and decoding strategies to enhance the performance of
the base PLM on the two sub-tasks.
ACKNOWLEDGMENTS
This work was partly supported by the NSF of China (62122090, 62072461 and 62072458), and
National Key Research and Development Program of China (2020YFB2104101).
REFERENCES
[1] Jacob Andreas. 2020. Good-Enough Compositional Data Augmentation. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter,
and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 7556–7566. https://doi.org/10.18653/v1/2020.acl-
main.676
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[3] Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: Line Graph Enhanced Text-to-SQL
Model with Mixed Local and Non-Local Relations. In Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP
2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli
(Eds.). Association for Computational Linguistics, 2541–2555. https://doi.org/10.18653/v1/2021.acl-long.198
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association
for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
[5] Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang,
and Dragomir R. Radev. 2018. Improving Text-to-SQL Evaluation Methodology. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1:
Long Papers, Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, 351–360. https:
//doi.org/10.18653/v1/P18-1033
[6] Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, and Ting Liu. 2020. TableGPT: Few-shot
Table-to-Text Generation with Table Structure Reconstruction and Content Matching. In Proceedings of the 28th
International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020,
Donia Scott, Núria Bel, and Chengqing Zong (Eds.). International Committee on Computational Linguistics, 1978–1988.
https://doi.org/10.18653/v1/2020.coling-main.179
[7] Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2022. PPT: Pre-trained Prompt Tuning for Few-shot Learning. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association
for Computational Linguistics, 8410–8423. https://doi.org/10.18653/v1/2022.acl-long.576
[8] Binyuan Hui, Ruiying Geng, Qiyu Ren, Binhua Li, Yongbin Li, Jian Sun, Fei Huang, Luo Si, Pengfei Zhu, and Xiaodan
Zhu. 2021. Dynamic Hybrid Relation Exploration Network for Cross-Domain Context-Dependent Semantic Parsing. In
Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of
Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021,
Virtual Event, February 2-9, 2021. AAAI Press, 13116–13124. https://ojs.aaai.org/index.php/AAAI/article/view/17550
[9] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a Neural
Semantic Parser from User Feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, Regina Barzilay and Min-Yen Kan
(Eds.). Association for Computational Linguistics, 963–973. https://doi.org/10.18653/v1/P17-1089
[10] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How Can We Know What Language Models Know.
Trans. Assoc. Comput. Linguistics 8 (2020), 423–438. https://doi.org/10.1162/tacl_a_00324
[11] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event /
Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and
Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 3045–3059. https://doi.org/10.18653/v1/2021.
emnlp-main.243
[12] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov,
and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.).
and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 7699–7715. https://doi.org/10.18653/v1/2021.
emnlp-main.608
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and
Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon,
Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.).
5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[29] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances
in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS
2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence
d’Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 3261–3275. https://proceedings.neurips.cc/paper/2019/hash/
4496bf24afe7fab6f046bf4923da8de6-Abstract.html
[30] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware
Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R.
Tetreault (Eds.). Association for Computational Linguistics, 7567–7578. https://doi.org/10.18653/v1/2020.acl-main.677
[31] Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a Semantic Parser Overnight. In Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China,
Volume 1: Long Papers. The Association for Computer Linguistics, 1332–1342. https://doi.org/10.3115/v1/p15-1129
[32] Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A Globally
Normalized BERT Model for Open-domain Question Answering. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing,
EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan
(Eds.). Association for Computational Linguistics, 5877–5881. https://doi.org/10.18653/v1/D19-1599
[33] Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learning as Beam-Search Optimization. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas,
USA, November 1-4, 2016, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). The Association for Computational Linguistics,
1296–1306. https://doi.org/10.18653/v1/d16-1137
[34] Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming
Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao,
Dragomir R. Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2022.
UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. CoRR
abs/2201.05966 (2022). arXiv:2201.05966 https://arxiv.org/abs/2201.05966
[35] Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J. D. Prince,
and Yanshuai Cao. 2021. Optimizing Deeper Transformers on Small Datasets. In Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing,
ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, Chengqing Zong, Fei Xia, Wenjie Li, and
Roberto Navigli (Eds.). Association for Computational Linguistics, 2089–2102. https://doi.org/10.18653/v1/2021.acl-
long.163
[36] Jingfeng Yang, Haoming Jiang, Qingyu Yin, Danqing Zhang, Bing Yin, and Diyi Yang. 2022. SEQZERO: Few-shot
Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models. In Findings of the Association
for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, Marine Carpuat, Marie-
Catherine de Marneffe, and Iván Vladimir Meza Ruíz (Eds.). Association for Computational Linguistics, 49–60. https:
//doi.org/10.18653/v1/2022.findings-naacl.5
[37] Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir R. Radev, Richard
Socher, and Caiming Xiong. 2021. GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing. In 9th
International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
https://openreview.net/forum?id=kyaIeYj4zZ
[38] Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li,
Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander R. Fabbri, Zifan Li, Luyao Chen, Yuwen
Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter S. Lasecki, and Dragomir R. Radev.
2019. CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to
Databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November
3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics,
1962–1979. https://doi.org/10.18653/v1/D19-1204
[39] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle
Roman, Zilin Zhang, and Dragomir R. Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and
Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia
Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, 3911–3921. https://doi.org/10.
18653/v1/d18-1425
[40] John M. Zelle and Raymond J. Mooney. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In
Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial
Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4-8, 1996, Volume 2, William J. Clancey and
Daniel S. Weld (Eds.). AAAI Press / The MIT Press, 1050–1055. http://www.aaai.org/Library/AAAI/1996/aaai96-156.php
[41] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with Extracted Gap-
sentences for Abstractive Summarization. In Proceedings of the 37th International Conference on Machine Learning,
ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 11328–11339.
http://proceedings.mlr.press/v119/zhang20ae.html
[42] Hao Zheng and Mirella Lapata. 2021. Compositional Generalization via Semantic Tagging. In Findings of the Association
for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021,
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational
Linguistics, 1022–1032. https://doi.org/10.18653/v1/2021.findings-emnlp.88