1 Introduction
The
Structured Query Language (
SQL) is ubiquitous, both in practice and in education. It is an established language, introduced in 1970 for querying relational databases [
12] and has been an ISO standard since 1987 [
1]. Research in Database Education has shown that SQL is difficult for learners, as indicated by the high prevalence of errors in SQL queries. Various researchers have studied these errors; one of the first such studies was done by Reisner [
36], while more recent work has provided a more extensive understanding of errors [
2,
4,
29,
46].
Only recently have researchers started to explore the underlying reasons for these SQL errors: misconceptions. In total, there are four types of conceptions, and all can be applied in instructional approaches to help students learn: correct, incomplete, incorrect, and misconceptions [
25]. Margulieux et al. propose the
multiple conceptions theory, which suggests that “learners develop better conceptual knowledge when they are guided to compare multiple conceptions of a concept during instruction” [
25, page 194]. For this reason, it is important to explore all existing conceptions of SQL. Since all types of conceptions for SQL are underexposed, much work needs to still be done.
A recent study that employed a think-aloud protocol to identify student misconceptions on SQL resulted in fourteen misconceptions in four categories: misconceptions based on previous course knowledge, generalization-based misconceptions, misconceptions based on language, and misconceptions due to an incomplete or incorrect mental model [
27]. Building on these findings, we aim to explore the misconceptions of novices from a different viewpoint: that of experts. SQL experts can be educators, researchers, or practitioners. Each of these groups has a distinct view on SQL errors and their underlying causes, which strengthens the content validity of the study. Furthermore, as experts have worked with SQL significantly longer than novices, they should have correct and highly detailed mental models. This, in combination with their experience in teaching SQL, can give us unique insights into underlying causes of errors.
Research goals. Our aim is to explore the underlying causes of student errors from the perspective of SQL experts. The method through which we do this is a derivative of the Delphi technique [
11]. We selected eleven errors of different categories identified by Miedema et al. [
27] and formed a panel of nineteen SQL experts from diverse backgrounds. We then presented these errors to our panel and asked them to provide their hypotheses as to why students might make these errors. We aggregated their answers to find all distinct hypotheses and set up a second questionnaire in which we asked all experts to vote on the likelihood of each hypothesis posed by them and their colleagues. In particular, we aim to answer the following two research questions.
Q1.
How do experts interpret the underlying causes of student errors in SQL?
Q2.
Which are the differences in the perspectives of students and experts on errors in SQL?
Answers to these questions provide insights complementary to that of the student perspective, towards the design of effective educational approaches addressing and correcting the misconceptions underlying the SQL errors made by students.
The remainder of the article is structured as follows: we first introduce related work to our study in Section
2. Then, we introduce the errors that this article is analyzing, by putting them in the context of existing literature in Section
3. We present our research method in Section
4 and our results in Section
5. Finally, we discuss our findings, their validity, limitations and future work in Section
6.
5 Results
In this section we aim at answering the research questions raised in Section
1:
Q1.
How do experts interpret the underlying causes of student errors in SQL?
Q2.
Which are the differences in the perspectives of students and experts on errors in SQL?
Before we dive into the participants’ hypotheses per error, we first introduce our results with a basic overview. In Table
3, we present each error, its reference in existing literature, the number of proposed hypotheses by our participants, the most popular expert hypothesis, and the proposed student hypothesis from Miedema et al. [
27].
In the following subsections, we cover each of these errors and hypotheses in turn to answer Research Questions 1 and 2. To give the reader the appropriate context, we also present the elements our participants received, to support the understanding of the error: (1) the question the students tried to answer, (2) a (partial) query containing the error, and (3) context on the error. The database schema that the questions concern can be found in Table
2. The material presented in each section is exactly the same as the presentation to our participants.
The participants voted on each of the aggregated hypotheses in part two through a five-point Likert scale. We visualize this by means of stacked bar charts with a diverging color scale, where the width of each segment corresponds to the proportion of answers. Each category of votes is shown in a different color (likely hypotheses in blue, unlikely hypotheses in orange), with the neutral answers centered on the midline of the plot. Below each hypothesis, a grey badge indicates the corresponding category as introduced in Section
4.1.
The participants have been assigned random numbers and are referred to as Participant 1 to Participant 19.
5.1 Using ≠ Instead of != or <>
Findings for this error are displayed in Figure
2. Overall, the trend in these expert hypotheses is that
\(\ne\) is a common and intuitive symbol. Students may have used it for a number of reasons, reflected in the hypotheses presented above. Our Delphi participants are sympathetic:
[.] in general, I sympathize with this one, \(\ne\) is just the way it is written in all normal math on paper (and is also the way it should be in SQL if it wasn’t for the missing \(\ne\)-key on the keyboard...(and the lack of support for \(\ne\) in ASCII))—Participant 1
It is a common symbol that will have been used in other courses, and the meaning is actually correct. Plus the <> and != in SQL are not that intuitive in their form in connection with their meaning.—Participant 2
The fact that \(\ne\) is common in other areas of study is also reflected in the mapping of hypotheses: two of them can be classified as previous course knowledge and one under generalizations. Furthermore, one hypothesis can be mapped to lack of practice, and two to mental model.
Regarding Q2, the students and experts have similar perspectives. The fact that students have learned of the \(\ne\) symbol months to years before learning SQL, means it is more accessible in memory than more newly introduced symbols such as <> or !=.
5.2 AVG in WHERE Instead of HAVING
Provided problem statement
Question: Find the names of store chains that on average sell products in quantities of more than 4.
Error context: This placement of AVG does not lead to a correct result as the average should be taken over the grouped attribute.
Student (partial) answer:
Findings for this error are displayed in Figure
3. The hypotheses cover a very wide spectrum of underlying issues. Again, we see a reflection of the “this is more common” approach, but we also see considerations of the concept of groups, the concept of aggregates, and the concept of a filter. Furthermore, a translation could be an issue, as well as plain forgetfulness. Interestingly enough, there seems to be no correlation between the background of our participants, and the type of hypothesis they present for this error.
Selection at different stages was central to many participants’ answers, for example:
WHERE and HAVING both do some form of selection, but just at a different stage. The type of statements (Boolean combinations) also look very similar. So it is not so strange to confuse them. [\(\ldots\)]—Participant 2
The difference between processing individual rows and groups is the main distinction between WHERE and HAVING. This processing difference—combined with having highly similar syntax and functionality—might be confusing, so Participant 18 suggests renaming HAVING to “WHERE GROUP HAS” in the SQL syntax.
Also related to this is the “flow” or the order of processing of a query, and how students may not be familiar with it. Not knowing the order of processing and the implicit restrictions that come with it, may lead to this error. Participant 8 illustrated this by writing:
[\(\ldots\)] student did not understand that WHERE clause is evaluated before GROUP BY is applied–Participant 8
As mentioned above, these hypotheses cover a wide spectrum of issues. As a result, most of them fit into different cause categories: only two were mapped to the same category of lack of practice. The other categories of causes for this error are language, procedural fixedness, ignorance, mental model, and generalizations.
The top hypothesis of the experts and the students reflect similar thoughts. The concept of filter seems to be mainly associated with WHERE by both student and expert users. If there is a condition, students will write this in the WHERE as that is consistent with what they know. The consistency of this misconception is also reflected in the hypotheses listed 5th and 7th in Figure
3.
5.3 JOIN in All Tables that you Need the Primary Keys of
Provided problem statement
Question: List all customer IDs, dates, and quantities of transactions containing products named Apples.
Error context: Here the participant joined the customer table for cID, although cID is already available in the transaction as a foreign key.
Student answer:
Findings for this error are displayed in Figure
4. For the first time, we see some hypotheses that skew to neutral or even disagreement.
Most of the expert answers in round one discussed ways in which students approach the problem. Participants 2 and 16 suggest an approach that students may take:
The thinking about the result is probably in two tiers: (1) which entities do I need, and (2) which attributes of these entities do I need. So if the question asks for the IDs of customers in the result, which it more or less does, then the reflex is to ensure that the record for customer is in the result, and from that select that the ID.—Participant 2
The question is for the *customer* IDs. Therefore, I think the student automatically decides to use the customer table (and the transaction table for the other part of the question), because that is the table that has information on customers. Not realizing that ID’s are often stored in other tables as well. Possibly this also points to a very rigorous application of a step-by-step solution plan, without thinking what you are doing.—Participant 16
The first two hypotheses reflect a template-based way of working. The template (or step-by-step work plan) comes out of long-term memory when it seems applicable to the question, but the reuse of queries draws from similarity to a recently answered question, which is closer to short-term memory. Therefore, one of these hypotheses is classified under generalizations, whereas the other falls under procedural fixedness, a category that is closely related because it focuses on the student becoming accustomed to writing one type of query. Other hypotheses capture lack of practice (2), mental model, and ignorance . Finally, one hypothesis is classified as not an error .
The hypothesis of the students is similar to the two most agreed-upon hypotheses by experts. They both suppose there is some query template being applied here, either from memory or from previous questions.
5.4 Missing Second Table for Self-join
Provided problem statement
Question: List all pairs of customer IDs who live on a street with the same name but in a different city.
Error context: Here, the student should have applied a self-join, but only took the customer table once.
Student answer:
Findings for this error are displayed in Figure
5. For this error, we see a high level of agreement between experts. For the first three hypotheses, none of the experts disagreed. What makes it even more interesting is that the answers represent a wide variety of hypotheses on misconceptions. We see misconceptions on tables/relations, on restrictions in the FROM clause, on the WHERE clause, and on attributes. Participants also mention possible influences of programming:
In the world of programming languages, such a construct would be implemented with two nested iterators defined over the same (shared) data structure. Students may apply this concept to tables as well.—Participant 18
Some of the hypotheses in Figure
5 explore a more black-box mental model of the DBMS, such as the top hypothesis. They reflect incomplete knowledge on the part of the student regarding how tables are accessed. Incomplete knowledge matches the student misconception by Miedema et al. [
27]. Their other proposed misconception for this error is misunderstanding the scope of elements in a query, which is mostly agreed upon by the experts.
We categorize two of the expert hypotheses as previous course knowledge, two as mental model issues, two as ignorance and one as shown examples were too limited. Finally, as Taipalus maps this error to absence of retrieval cue, we apply that category for the hypothesis Students do not realize they need a JOIN.
In the extra comment space on this error, Participant 1 mentioned:
I would also think that because SQL allows for “implicit” joins via the WHERE clause, the students do not clearly see whether something is a filter on a row, vs. a join condition. I’d blame this on SQL’s syntax, not on the students.
Such an error is also partially due to the teaching style of the students’ lecturers and instructors. Some are strict on teaching explicit JOINs whereas others mostly teach implicit JOINs through Cartesian products.
5.5 Lack of Understanding of when to Apply GROUP BY
Provided problem statement
Question: Return the stores table ordered alphabetically on the city.
Error context: The student added GROUP BY when it was not required, leading to an incorrect answer.
Student answer:
Findings for this error are displayed in Figure
6. All hypotheses for this error skew towards neutrality, with many experts both agreeing and disagreeing with each hypothesis. Perhaps the error title was too general, and we should have focused on this specific error context. Nevertheless, we received some interesting insights in the first questionnaire:
Student expecting the relationship between GROUP BY and ORDER BY, possibly caused by the misinterpretation of examples where both co-occur (I guess it’s not unlikely to have ORDER BY clauses over columns that have been grouped because groups are often the “primary” sort key—which of course doesn’t imply that GROUP BY is a pre-requisite for ORDER BY).—Participant 9
It is very likely that a biased selection of examples sampled by the instructor (or read by the student) introduces misconceptions; as such examples have a significant impact on the students’ mental model of the concept.
[...] the student could have seen something was wrong with their answer by seeing that the select statement asks values from non-grouped columns, leading to a malformed query. So maybe the biggest misconception is that you can group on column A and then still project values from a different column B (depending on the DBMS this is only true if B has the same values for every identical value of A).—Participant 16
Participant 16 focuses on how the student could have evaluated their result, and therefore fixed their problem. As the student apparently did not do that, they must have a misconception of the order of operations during SQL execution. This leads us to the previously identified misconception of misunderstanding SQL syntax and its internal (in)consistency in [
27]. The student may not have realized that after applying GROUP BY, not all actions are possible anymore, which is also reflected in the top hypothesis by the experts. Furthermore, the explanation of participant 16 shows that the student did not reflect on the result table. This may show the ignorance that Taipalus maps this error to [
44]. Although different wording is used in all three cases, the students and experts most likely feel similarly about this error.
Mapping of the hypotheses reveals that various themes were touched upon: we categorized two hypotheses as absence of retrieval cue, two hypotheses as shown examples were too limited, one as ignorance (as in Taipalus’ work mentioned above), one as language and one as mental model.
5.6 Missing Parts of Keywords
Provided problem statement
Question: Return the stores table ordered alphabetically on the city.
Error context: Missing parts of keywords (GROUP BY, ORDER BY, INSERT INTO). The student wrote ORDER instead of ORDER BY.
Student answer:
Findings for this error are displayed in Figure
7. In round one, most of our participants mentioned students being careless, not knowing the syntax, or not remembering it correctly. For example, participant 3 writes:
Student simply forgot the concrete syntax of SQL (where ORDER BY is a single token).—Participant 3
The first four hypotheses in Figure
7 reflect this sentiment, although for different underlying reasons. Hypothesis 5 is the only one that falls outside of this category, but a large number of votes for (highly) unlikely shows that this is an unpopular opinion. Regardless of the hypotheses having similar sentiments, they do represent different types of underlying causes: we mapped them to
previous course knowledge, generalizations, language, lack of practice, and
sloppiness .
This is the first error that the students and experts interpret in different ways. For the experts, the focus is on commonalities (Hypothesis 1) and practice (Hypothesis 2). However, from the student’s perspective, this may have more to do with Natural Language effects. Although this hypothesis was posed by the experts, many of them disagreed with this statement.
5.7 Extra Condition when Not Required
Provided problem statement
Question: List all pairs of customer IDs who live on a street with the same name but in a different city.
Error context: In this case, the participant added cID <> cID even though that could not occur with our schema.
Student answer:
Findings for this error are displayed in Figure
8. The most agreed-upon hypothesis reflects on either student’s conception of the schema in general or their conception of foreign keys. The other four hypotheses focus on the process of writing and editing. Participant 2 explicitly focuses on the writing process and working memory in their round one answer:
Mental compartmentalization: the student wants to ensure that (1) the two customers are not the same customer and (2) have different city fields. And they do not realize that (1) follows from (2). What in my experience also plays a role is that in another question there was also asked for two <something> where it was important they were distinct, and where this did not follow from the additional selection criterium.—Participant 2
The hypotheses of the experts again capture a broad range of categories: mental model, procedural fixedness, lack of practice, sloppiness, and not an error all occurred once.
An extra remark from Participant 7 leads us to believe that the hypothesis on laziness and remembering the schema could have had more experts agreeing:
I actually kind of like parts of [...] “students forget the db schema”, but I cannot get behind the value judgment of “Laziness”—Participant 7
This is a valuable comment. In retrospect, we should have framed this hypothesis in a more discerning manner.
Regarding Q2, Miedema et al. [
27] identified this as lacking knowledge on primary keys, a hypothesis that has not been explicitly mentioned by our experts. However, the primary keys are mentioned in the highest-ranking hypothesis. Therefore, although the angle is not the same, both experts and students attribute this error to primary keys.
5.8 Ambiguous Column Name Because of Missing Alias
Provided problem statement
Question: Find the names of all inventory items that have a higher unit price than Bananas.
Error context: It is not clear which pID is meant in pID=pID, because the participant did not add the alias.
Student (partial) answer:
Findings for this error are displayed in Figure
9. In the first three hypotheses, the experts propose that students assume that the DBMS works similarly to a human brain. In that case, if something is interpretable to the user, it should also be clear to the DBMS. This explains why the student missed that there is ambiguity in the query. As Participant 8 says it:
Not double checking all of the details—assuming that the query has the same context as the query writer’s brain.—Participant 8
For this error we have only four hypotheses, over three categories: mental model (2), ignorance, and lack of practice. In the case of the hypotheses here, these three categories are closely related: a lack of practice may lead to an incorrect mental model, and ignorance may lead to similar mistakes.
The student may not have been able to evaluate the query from the correct perspective because of a lack of knowledge or experience. This is the explanation given by Miedema et al. [
27] for this error. It is not a misconception in itself, but more of an explanation of why the misconception hypotheses as suggested by the experts occur.
5.9 Using Comma Instead of AND in WHERE Clause
Provided problem statement
Question: Find the names of all inventory items that have a higher unit price than Bananas.
Error context: Error occurs in the main query (comma after i.unit_price) but not in the subquery.
Student answer:
Findings for this error are displayed in Figure
10. In Round 1, many of our participants mentioned that this mistake comes from sloppiness or confusion of what the student was writing where signaled by to lack of indentation. For example, participant 13 writes:
Possibly a careless mistake. The student knows how to write a where clause but probably got confused due to the way he/she placed the subquery in the where clause before the join over pid. That is, the mistake possibly arises from poorly-structured where clause in the main query.—Participant 13
In Round 2, the experts benefited from the introduction of other experts’ opinions and found other hypotheses relatively likely too. These describe how existing knowledge interferes with learning and the writing process. We can categorize the hypotheses as follows: two hypotheses reference previous course knowledge, two hypotheses reference generalizations, one is related to language, and one relates to sloppiness .
Regarding Q2, here is another agreed-upon hypothesis that matches the findings of Miedema et al. [
27]. They attribute this error to a misunderstanding of SQL syntax and its internal (in)consistency, where the inconsistency is between the SELECT, FROM, and WHERE clauses. This hypothesis is similar to the “list clause” hypothesis, which is broadly agreed upon by the panel of experts.
5.10 Thinking DISTINCT will Take the First Item of a List
Provided problem statement
Question: For this question: Find the names of all inventory items that have a higher unit price than Bananas. There was a follow-up question: What happens if there are two products named bananas?
Student’s original query:
On the question What happens if there are two products named Bananas? they then elaborated:
[Suppose] There will be two prices [for two bananas]. If I would have used a DISTINCT before, on the unit_price, then you’d have taken the first unit_price and compared it with the unit_price of the primary query. [\(\ldots\)] If I used DISTINCT over this C.unit_price, I’d have given only one unit_price and probably that would have been the first one, because that is what it will find first. And then it would just compare [the price] with that one.
Findings for this error are displayed in Figure
11. In Round 1, most of the participants considered misunderstandings, carelessness, or low familiarity as reasons for this error. It is in such cases that the Delphi methodology is advantageous: there were some participants that went beyond conclusions of carelessness, and the inclusion of their hypotheses led our other participants to reflect on a deeper level.
For example, Participant 4 questioned the examples that were used to teach DISTINCT:
Was the example with which they were taught distinct done using just one type of distinct value?—Participant 4
This questioning led to the inclusion of Hypothesis 3, as did the following text from Participant 2:
A DISTINCT does have the connotation of reducing a multitude to something smaller, so if I get many, and only want it to be one (so fewer), it sort of seems to come close to that. The student may have seen an example where a query returned a result with all duplicates, and the DISTINCT was used to reduce that to one record, which comes close to what they want here.—Participant 2
Overall, these four hypotheses capture three categories of causes: language (2), lack of practice and shown examples were too limited.
We have seen this focus on examples already, suggested by our experts for the errors with GROUP BY. It was also suggested by Miedema et al. [
27], who described the misconception as
Believing DISTINCT takes the first item of a list. As the upper three hypotheses in the figure are ranked similarly, we can conclude that the students and experts agree on this hypothesis.
5.11 Lack of Understanding of Subqueries
Provided problem statement
Question: List all customer IDs, dates, and quantities of transactions containing products named Apples.
Error context: There is no need for the authors of these queries to write a subquery. It increases complexity and reduces readability.
Student answers:
Findings for this error are displayed in Figure
12. For this section, our participants debated whether this is actually an error. Do we expect our students to consider the efficiency of their queries? Participant 18 did not think so:
Not sure why this is considered an error. Semantically, the query is correct. Subqueries might execute slower in some DB engines, but I would not expect the students to know this.—Participant 18
Participant 6 was of the same opinion, arguing that efficiency depends on which DBMS is used:
I do not agree with the premise of the question. The efficiency argument is not robust: this varies on different DBMSs. In fact, there are DBMSs where the subquery variant is actually more efficient, and so it might be that the student was exposed earlier to information from that context.—Participant 6
However, these participants only focus on the efficiency of this example, and did not touch upon the readability of subqueries. Two experts who did, are participants 1 and 16.
Subqueries using identifiers from the upper-level query are hard and confusing. They are simply difficult to read.—Participant 1
Honestly, I find (especially) the second query immensely readable. I prefer subqueries over joins mostly for readability. This is a common split among students as well. About a third typically like subqueries, while the majority prefer joins. If you consider the (previously introduced) idea that there can be multiple product with the same name, then I really like the second query.—Participant 16
These participants disagree on whether the queries presented here are readable or not, which might have to do with the query’s size.
Participants also mentioned translating of queries, either from Natural Language (Hypothesis 5) or First Order Logic (Hypothesis 7). The plot in Figure
12 does not indicate high agreement within the participant group, but it represents an interesting question nonetheless: how do students decompose a more complex problem, and where do they start the formulation of their query?
Overall, these seven hypotheses can be grouped into five categories: previous course knowledge (2), lack of practice (2), ignorance, mental model, and the new category of avoidance and database anxiety.
Regarding Q2, Miedema et al. [
27] did not attribute this error to a specific misconception but instead explained this by a lack of knowledge or experience. Given the discussion by experts of whether this should even count as an error, and whether that is DBMS-dependent or not, a lack of student insight could be a fitting explanation.
6 Discussion AND Conclusion
In this article, we contribute by providing additions to the list of misconceptions held by students learning SQL. After validation, these misconceptions provide further insight on how we can design interventions to improve SQL education.
To recap, in this article we posed the following research questions:
Q1.
How do experts interpret the underlying causes of student errors in SQL?
Q2.
Which are the differences in the perspectives of students and experts on errors in SQL?
To answer these questions, we took a Policy Delphi approach, in which we asked a panel of SQL experts for their opinions on novice errors. The subsequent votes on the likelihood of each hypothesis gave us a list of possible misconceptions that we can use as a basis for the design of educational approaches to reduce the prevalence of these misconceptions, consequently improving the learning process of novices.
We answered Q1 by iterating over eleven errors made by novices, and listing the hypotheses as generated by our panel of experts. The different backgrounds of our participants, and their expertise in various aspects of SQL, led to the generation of a wide range of misconception hypotheses per question in round 1. Sometimes the participants were all focused on the same type of hypothesis but approaching it from different angles, such as in the case of Using \(\ne\) instead of != or <> or Using comma instead of AND in WHERE clause. Other times, the participants were divided in their perspective, such as for AVG in WHERE instead of HAVING. In that case, they suggested hypotheses related to the concepts of groups, filters, and aggregates, translation issues, or forgetfulness. Hypotheses related to forgetting, lack of experience, and lack of focus were common. However, the mistakes students make should not just be attributed to such factors.
When answering Q2, we found that in most cases, experts and students were in agreement on the underlying causes, although sometimes there was a difference in viewpoint. For eight out of eleven errors tested in this study, there was an agreement between the groups to some extent, although the student explanation may not have been the most agreed-upon expert hypothesis. For two errors, no student misconceptions were suggested and thus the question could not be answered. Finally, for error 6, Missing parts of keywords, the explanations were different. The experts focused on what is common in SQL, and on student practice, whereas the student perspective focused on Natural Language.
The method that we applied in this study resulted in capturing more input per error than presented by Miedema et al. [
27]. Each error had between four and eight misconception hypotheses. In the mapping of these hypotheses to causes introduced by Miedema et al. and Taipalus [
44], we found that they covered several different topics. Each error had hypotheses from three to six categories, scaling with the number of hypotheses. We found representations of all classes identified in previous work, except for
working memory overload. This is likely due to the setup of the experiment, providing small snippets of information to our experts without a broader context. Beyond these classes from literature, there were representations of the classes
lack of practice,
sloppiness, and
not an error.
Finally, in this work, we introduce two new causes of errors. The first is
shown examples were too limited. This reflects a lack of depth in the material that the teachers use to introduce concepts to their students. Examples may be too simplistic, and as such do not showcase the full range of opportunities for the concept. The second is
avoidance and database anxiety. This is the database equivalent of
math anxiety. Math anxiety has been shown to disrupt cognitive processes by monopolizing resources of working memory [
5]. A similar process may occur in SQL querying when students need to apply more complex concepts such as JOINs, aggregation and subqueries. Future work should examine the extent to which database anxiety influences student performance.
6.1 Implications for Teaching
To improve upon our teaching of SQL, the first step is to identify where possible problems in knowledge transfer occur. This identification of faulty transfer, through the uncovering of misconceptions, is an indication of where course material and instruction need to be adapted to the students’ previous knowledge. Teachers who have an understanding of misconceptions can work towards knowledge refinement and reorganization, to build students’ knowledge of the taught concepts [
39].
One central theme that came up regularly in our findings was both a lack of practice, and a lack of appropriate examples demonstrating SQL concepts. Such issues can be relieved by increasing the opportunities that students have to interact with SQL, as Participant 4 expressed:
Students need more practical experience, just like any other programming language. We need to stop treating databases as an extension of math.
Typically, in Computer Science curricula, there are several courses centered around programming. Students are taught about programming languages, design patterns, algorithms and data structures, and they get the opportunity to practice their programming skills through several courses and projects. Why then do we teach only the basics with regards to Databases? With the increasing need for Data Science topics in Computer Science curricula, there should be more focus on teaching students about data storage and retrieval.
6.2 Comparing Prior Theories and Research Findings
Existing research is conflicted about whether teachers and other experts have enough insight to discuss their students’ misconceptions. Goldman et al. find that teachers have an incomplete understanding of student learning [
15]. From a practical perspective, Qian et al. dicuss the latent nature of misconceptions, and how this may make it difficult for teachers to detect them [
33]. Does this then mean that educators have no insight into where students may misinterpret material? Brown et al. [
8] instead hypothesize that although educators are not accurate in predicting the frequency of (programming) mistakes, they may still be accurate about the cause of mistakes and a student’s conception of the mistake [
8]. To validate this, we recommend finding the prevalence of the misconceptions presented in this article in a future study.
6.3 Strengths and Limitations
Our main limitation is the number of errors we consider in this article. As participants in any study should not be overwhelmed, and our format required open-ended questions, we decided to limit ourselves to 11 errors. Although these errors have been shown in Section
3 to be common and persistent, there are many additional errors that we could have examined, generating more hypotheses for misconceptions. Future work should consider examining misconceptions for these errors. Additional hypotheses might have also come up if we had more participants in our study.
Although our study only included two rounds versus the typical three or more of traditional Delphi, we do not think adding extra rounds would have changed our findings. An extra round could show how our participant population thought of the various hypotheses, such that participants could change their minds. But, as our focus is on qualitative over quantitative results, the scores themselves are not the most important thing, the hypotheses and quotes are.
Instead of the Policy Delphi approach we adopted, we could have also run a focus group. An advantage of Policy Delphi over a focus group is that it allowed asynchronous “discussion” between participants on various continents, with different time zones, who may not have been able to match up schedules otherwise. Furthermore, the anonymous nature of Delphi was an advantage to make sure that each voice counted equally.
The anonymous approach of our participants with regards to filling in the survey means that we did not have the exact same group of respondents for Rounds 1 and 2. We cannot tell to which extent our groups overlapped, although the different number of respondents indicates that the groups do not match. Although Delphi is about anonymity, the participants only need to be anonymous to the other respondents, not necessarily to the researchers. However, we do not think this is a threat to validity, as our main contribution is the list of hypotheses presented above. This is a result of Round 1 only, and thus the differing groups does not have a large impact.
Finally, we could have strengthened our results for Q1 by asking the participants to indicate which group they identified most heavily with (educator, researcher, and practitioner), instead of indicating all they identified with. As a result of this choice, we could not run statistical tests on our data to see if different groups had different opinions.
6.4 Validity
In this discussion, we follow the descriptions of validity for qualitative work by Joseph Maxwell [
26]. He defines the nature of five dimensions of validity: descriptive validity, interpretive validity, theoretical validity, (internal and external) generalizability, and evaluative validity [
26].
First of all, the setup of the study, with participants writing down their answers in a Google Form, means we record the participants’ opinions as quotes. As such, our results are an accurate representation of our participants’ opinions, meaning our study has high descriptive validity. Using quotes also means that we do not have to interpret what our participants mean with their actions, meaning the interpretative validity of the study is high too. One area where some interpretive validity may have been lost is the aggregation step between rounds one and two. An alternative setup to avoid this would have transferred all participants’ quotes directly from the round one form to round two. However, this would have led to many overlapping answers, which would distort the hypotheses voting results. As all three authors translated and aggregated the answers in an iterative process, and the participants voted on the newly generated hypotheses, we believe the interpretative validity of the study was upheld.
Given the direct reporting of participant perspectives in this article, presented as a stepping stone towards more research on SQL misconceptions, the authors believe that the concepts of theoretical and evaluative validity may not be applicable.
Finally, the generalizability of the study has been taken into account by inviting participants of varying backgrounds. The participants are of different geographical and cultural backgrounds, have different types of jobs with various levels of experience, work at different companies and institutions, and have different ages and gender identities. This provides us with confidence regarding the external generalizability (to different communities). The study was not designed to have internal generalizability (the hypotheses do not apply to different errors), future work should consider possible misconceptions for errors that we did not discuss here. One note on the external generalizability is that student misconceptions most likely depend on the teaching method (types of examples, database schema, teacher preference etcetera). Therefore, not all misconceptions identified in a study such as ours will be applicable to all students. However, the significant hindrance that misconceptions cause, makes it important to keep all possibilities in mind.
6.5 Future Work
We consider four avenues for future work. A first avenue is to follow up on this work by validating the proposed hypotheses. This can be done by generating multiple-choice questions testing for each of the misconceptions with distractor answers that indicate the misconception. For more information on this method, we refer to the work by Treagust [
48].
Second, the SQL errors could be researched from a cognitive process point of view, following a model similar to the one developed for event-based systems [
22]. The causes of SQL errors, as interpreted by experts in our work, could be similarly tied to knowledge, strategic, and attentional causes for activities that relate specifically to query formulation. This line of work could highlight how these problems in the various stages of query formulation affect the novices’ problem-solving processes.
Third, additional study is needed to gain deeper insights into which misconceptions are responsible for the biggest obstacles in learning SQL and their prevalence and interactions. Furthermore, the design of materials and interventions to correspondingly address these misconceptions should be investigated. We sketch two possible ways to go about designing new materials. The first is a discussion on knowledge refinement and reorganization, as argued by Smith et al. [
39]. However, discussion material is relatively difficult to design, as discussion content, and context depends on the student, the material and the teacher. Other options are counter examples and refutation texts. Refutation texts consist of either two or three elements: a misconception, an explanation of the correct concept, and (optionally) a cue [
47]. The cue helps the student understand that the misconception is incorrect. These refutation texts for common SQL misconceptions can then be distributed to SQL teachers and students, improving the accuracy of SQL education.
Finally, an intrinsic complication in teaching SQL is its declarative nature. The research on programming misconceptions in imperative languages has shown that decomposition of problem descriptions is difficult for students [
34]. One way in which a student can decompose a problem is to trace the execution of that problem. For SQL, this is even more difficult as execution in declarative languages cannot be traced because query plan generation is non-deterministic. This characteristic of SQL may limit the extent to which a student can generate an appropriate mental model of query execution, which in turn might lead to a “black box” perception of the DBMS. One direction of research that can help students explore query execution is to develop query visualization tools for query explanation [
9,
21], interpretation [
23], and building [
28]. On the other hand, it could be highly useful to research a notional machine for declarative languages, to gain more insight into the link between student misconceptions and DBMS processing.