5.3 Study Procedure
In our study, each participant experienced the four conditions described in Section 5.1. Because the goal of this study was to investigate how users discover and repair errors, the example SQL queries for each participant were randomly selected from the dataset of incorrect queries generated by the three NL2SQL models used in the error analysis study. Each query that a participant encountered was also randomly assigned to one of the experimental conditions or the baseline condition.
To facilitate the user experiment, we implemented a web application that automatically selects SQL tasks and assigns conditions to study participants. After finishing one SQL query, a participant clicks the “Next” button, and the application randomly selects the next query and assigns a condition to it; both the query assignment and the condition assignment were randomized. For each query, the web application renders the task description, the database and its tables, and the assigned error-handling mechanism.
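The assignment logic can be sketched as follows (a minimal illustration only; the names `CONDITIONS`, `query_pool`, and `next_task` are ours and do not come from the actual application):

```python
import random

# Labels are ours for illustration: B1 = baseline direct editing,
# C1-C3 = the three experimental error-handling mechanisms.
CONDITIONS = ["B1", "C1", "C2", "C3"]

def next_task(query_pool):
    """Randomly pick the next erroneous SQL query and assign it a condition."""
    query = random.choice(query_pool)      # random query selection
    query_pool.remove(query)               # do not repeat a query within a session
    condition = random.choice(CONDITIONS)  # random condition assignment
    return query, condition
```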
Each experiment session began with the informed consent process. Each participant then watched a tutorial video about how to interact with the system to solve an SQL task and fix NL2SQL errors under the different conditions. Each participant was then given a total of 45 minutes to solve as many SQL tasks as possible. On average, each participant completed 22.0 SQL tasks in 45 minutes (5.5 in each condition). After each session, the participant completed a post-study questionnaire, which asked them to rate their overall experience, the usefulness of the interactive tool support under the different conditions, and their preferences on Likert-scale questions. We ended each session with a 10-minute semi-structured interview, in which we asked follow-up questions about their questionnaire responses, whether they encountered any difficulties with the interaction mechanisms under the different conditions, and which parts they found useful. We also asked participants about their general workflow when approaching the tasks and the features they wished they had when handling NL2SQL errors. All user study sessions were video recorded with the consent of the participants.
Following established open coding methods [11, 36], an author conducted a thematic analysis of the interview transcripts to identify common themes about user experiences, the challenges participants encountered while using the different error-handling mechanisms, and their suggestions for new features. Specifically, the coder went through the transcripts of the interview sessions and coded them using an inductive approach. For user quotes that did not include straightforward key terms, the coder assigned researcher-denoted concepts as codes.
5.5 Results
In this section, we report the key findings on the efficiency, effectiveness, and usability of the different error-handling mechanisms, as well as the corresponding user experiences. For each statistical test, the data were sampled evenly and randomly across conditions.
F1: The error handling mechanisms do not significantly improve the accuracy of fixing erroneous SQL queries. To start with, we conducted a one-way ANOVA test (\(\alpha = 0.05\)) among tasks that used different error-handling mechanisms. The p-value for accuracy was 0.82, indicating that there were no significant differences between the error-handling methods. The average accuracy and standard deviation among the participants are shown in Table 13.
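For illustration, such a one-way ANOVA can be run with a standard statistics library (a minimal sketch assuming Python with SciPy; the per-condition accuracy arrays below are placeholders, not our study data):

```python
from scipy import stats

# Placeholder per-condition accuracy samples (1 = query fixed correctly, 0 = not);
# the real study data are not reproduced here.
acc_b1 = [1, 0, 1, 1, 0]  # baseline direct editing
acc_c1 = [1, 1, 0, 1, 0]  # explanation- and example-based approach
acc_c2 = [0, 1, 1, 0, 1]  # explanation-based visualization approach
acc_c3 = [1, 0, 0, 1, 1]  # conversational dialog system

# One-way ANOVA across the four conditions (alpha = 0.05).
f_stat, p_value = stats.f_oneway(acc_b1, acc_c1, acc_c2, acc_c3)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")  # p > 0.05 -> no significant difference
```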
We then analyzed the effect of the different mechanisms on the accuracy of fixing specific error types, including five common syntactic error types (A: WHERE error; B: JOIN error; C: ORDER BY error; D: SELECT error; E: GROUP BY error), as well as the six semantic error types shown in Table 7. Using the same statistical test, we found that the p-values for all error types were higher than the 0.05 threshold, indicating no significant differences in accuracy when participants used different error-handling mechanisms (Table 14).
Furthermore, we found that the different error-handling mechanisms did not significantly influence the accuracy of SQL query error handling at various difficulty levels (Table 15). These findings suggest that existing interaction mechanisms are not very effective for handling the NL2SQL errors that state-of-the-art deep learning NL2SQL models make on complex datasets like Spider. We further discuss the reasons behind these results and their implications in the rest of Sections 5.5 and 6.
F2: The error handling mechanisms do not significantly impact the overall time of completion. To study the impact of the different error-handling mechanisms on time usage, we analyzed the time of completion (ToC) of the queries that participants solved correctly. We applied the same ANOVA test as in the previous analysis to compare the mean ToC across the error-handling mechanisms (Table 13); no significant difference was found among the groups (\(p = 0.52\)).
Similarly, we analyzed the impact of the different error-handling mechanisms on ToC for the selected error types. In general, the baseline method was more efficient for solving a task, while the conversational dialog system took more time than the other methods. The results are shown in Table 16.
Additionally, the results on SQL queries of various difficulty levels revealed a significant difference among the error-handling mechanisms for easy queries (\(p=0.04\)). Specifically, direct editing was the fastest method when the query was easy, followed by the explanation- and example-based approach (C1), the explanation-based visualization approach (C2), and the conversational dialog system (C3).
F3: Users perform better on error types with fewer variants. We analyzed the impact of error types on task accuracy and ToC and report the results in Table 18. The results revealed that, among the syntactic error types, A: WHERE errors and E: GROUP BY errors had high accuracy, while among the semantic error types, d: Value errors and e: Condition errors had high accuracy. As shown in the error taxonomy (Table 4), value errors occur only in WHERE clauses; they usually require fewer steps to fix and have little relationship with the other syntactic parts of an SQL query. Similarly, condition errors such as wrong sorting directions and wrong boolean operators (AND, OR, etc.) involve relatively independent components of a query. The better user performance on these error types may indicate that users face challenges in handling semantically complicated errors, such as joining tables and selecting columns from multiple tables, but are more successful in discovering and repairing error types where the error is more local (i.e., with little interdependency with other parts of the query). This conclusion is also evidenced by the user interviews, which we analyze in the following section.
F4: The explanation- and example-based methods are more useful for non-expert users. When participants were asked to rate their preferences among the different interaction mechanisms (shown in Table 19), we found that the explanation- and example-based approach (C1) was the most preferred, while the explanation-based visualization approach (C2) was rated similarly to the baseline method (B1). In contrast, the conversational dialog system (C3) was generally rated as less useful than the others.
We found that a user’s level of expertise significantly impacts their adoption rate of the different error-handling mechanisms. The adoption rate measures how likely a user is, when a mechanism is available, to actually use that mechanism (instead of just using the baseline method) to handle the error. We calculated the adoption rate for each condition (C1, C2, and C3) and each level of expertise by dividing the number of SQL queries in which the participant used the provided error-handling mechanism by the total number of queries provided with the corresponding mechanism in the participant’s study session. The results are shown in Table 20.
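Written as a formula (the notation is ours, introduced only for clarity), a participant’s adoption rate for a condition \(C_i \in \{\text{C1}, \text{C2}, \text{C3}\}\) is
\[
\text{AdoptionRate}(C_i) = \frac{\#\,\text{queries under } C_i \text{ in which the mechanism was used}}{\#\,\text{queries assigned to } C_i \text{ in the participant's session}}.
\]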
The primary factor contributing to expert participants’ lower interest in using the error-handling mechanisms under the experimental conditions was their ability to efficiently identify and repair errors on their own. For example, P2 stated that “It (the step-by-step execution function in C1) is very redundant and time-consuming to break down the SQL queries and execute the sub-queries, since most errors can be found at first glance.” Another reason why expert users were less interested in using the error-handling mechanisms was that they were not confident in the intermediate results these mechanisms provided. P3, for example, noted that “Though the chatbot is capable of revising the erroneous SQL queries, I found it sometimes gives an incorrect answer and provides no additional clues for me to validate the new query.” Therefore, several expert participants chose to repair the original SQL query rather than validate and repair the newly generated one.
The study also showed that the conversational dialog system (C3) was the least preferred mechanism among users at all levels of expertise. One reason for this is the relatively low accuracy of the model in recognizing user intents from the dialog and automatically repairing the errors in the query. For example, P3 stated that “Though it sometimes predicts the correct query, for most of the times, the prediction is still erroneous.” In addition, the chatbot did not provide explanations for its suggestions, so users had to spend significant effort to validate and repair the newly generated SQL queries. Furthermore, while the chatbot allowed manual input from users to intervene in the prediction process, such as pointing out erroneous parts and providing correct answers, it often introduced new errors while generating the revised SQL query. As noted by P7: “In one example, when I asked the chatbot to change the column name that was in SELECT, it somehow changes the column in JOIN as well.” As a result, many users quickly became frustrated after using it for a few SQL queries.
F5: The explanation- and example-based methods are more effective in helping users identify errors in the SQL query than in repairing errors. In the post-study questionnaire, we asked participants to evaluate the usefulness of each condition in terms of its ability to help (1) identify and (2) repair errors, respectively (Figure 6). The results indicate that most of the participants found C1 effective for identifying incorrect parts of the SQL query, while half of them thought it was not useful for repairing errors. Meanwhile, a notable proportion of participants (12 out of 26) affirmed C2’s effectiveness in identifying errors, but fewer found it helpful for repairing them. As for C3, a substantial number of participants (16 and 18, respectively) had a negative perception of its effectiveness in both identifying and repairing errors in the SQL query.
Furthermore, we learned that the recursive natural language explanations may help reduce the comprehension barrier for long and syntactically complicated SQL queries. For example, P8 stated that “By looking at the shorter sentences first at the beginning, I could finally understand the meaning that the original long sentence were trying to convey.” P17 also mentioned: “Those shorter sentences usually did not have complex grammatical structures and confusing inter-table relationships, so that the problems were easier to be spotted.” Additionally, executing subqueries and displaying their results were deemed helpful for localizing the erroneous parts of the original SQL query. For example, P23 stated: “When I noticed that the retrieved result was empty, I realized that some problems should exist in the current query.” In terms of C2, participants affirmed the effectiveness of the graph-based SQL visualization in helping them better understand the relationships between the syntactic components of a query. The learning barrier of this approach was also the lowest among all experimental conditions: users could view the connections to a table by simply clicking its widget on the canvas.
We then investigated why participants were less satisfied with C1’s effectiveness for repairing errors in an SQL query. There were two main factors. First, the repair strategies supported by the error-handling mechanism were limited: participants could only replace incorrect parts with their correct substitutions using the drop-down menu of entity mappings, and for queries that required the addition, deletion, or reorganization of clauses, they had to manually edit the query. This limitation led to frustration among participants and ultimately resulted in their not prioritizing this error-handling mechanism in later tasks. Second, the approach provided little assistance for validating their edits. As a result, one participant stated: “I did not trust my own edits nor the suggested changes from the approach.” (P20).