
Pausing while programming

Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Software Engineering Education and Training

2022 IEEE/ACM 44th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET)

Pausing While Programming: Insights From Keystroke Analysis

Raj Shrestha (raj.shrestha@usu.edu), Utah State University, Logan, Utah
Juho Leinonen (juho.2.leinonen@aalto.fi), Aalto University, Espoo, Finland
Albina Zavgorodniaia (albina.zavgorodniaia@aalto.fi), Aalto University, Espoo, Finland
Arto Hellas (arto.hellas@aalto.fi), Aalto University, Espoo, Finland
John Edwards (john.edwards@usu.edu), Utah State University, Logan, Utah

ABSTRACT

Pauses in typing are generally considered to indicate cognitive processing and so are of interest in educational contexts. While much prior work has looked at typing behavior of Computer Science students, this paper presents results of a study specifically on the pausing behavior of students in Introductory Computer Programming. We investigate the frequency of pauses of different lengths, what last actions students take before pausing, and whether there is a correlation between pause length and performance in the course. We find evidence that frequency of pauses of all lengths is negatively correlated with performance, and that, while some keystrokes initiate pauses consistently across pause lengths, other keystrokes more commonly initiate short or long pauses. Clustering analysis discovers two groups of students, one that takes relatively fewer mid-to-long pauses and performs better on exams than the other.

CCS CONCEPTS
• Social and professional topics → Computing education.

KEYWORDS
pauses, pausing, breaks, keystroke data, digraphs, programming process data

ACM Reference Format:
Raj Shrestha, Juho Leinonen, Albina Zavgorodniaia, Arto Hellas, and John Edwards. 2022. Pausing While Programming: Insights From Keystroke Analysis. In 44th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET '22), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510456.3514146

This work is licensed under a Creative Commons Attribution 4.0 International License. © 2022 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9225-9/22/05.

1 INTRODUCTION

Pausing during work is a natural behaviour which allows a person to reflect on their task, plan what they are going to do next, revise, or take a rest. Pauses, however, can also be initiated by distraction and hinder one's working process. In this paper we aim to study pauses that students take while doing programming projects in Introductory Computer Programming (CS1). In a typical CS1 course, instructors and graders look at the final program a student has produced for assessment, but the code gives no indication of how the student wrote it. It is possible that information on the number and types of pauses students take could be mined to shed more light on the processes that underlie programming in CS1. Since pauses can be products of various activities (e.g., thinking, disengagement), we investigate whether pause length may hold insights into these activities.

Our study features four types of pauses. Micro pauses (2-15 seconds) may indicate the student is thinking about the code on a low level or "locally" (e.g. syntax). Short pauses (15 to 180 seconds) may indicate that the student is involved in a higher-level process such as planning or revision. Mid pauses (3-10 minutes) may indicate that the student is disengaged or is going to an outside resource for help (e.g. YouTube, Stack Overflow, or course materials). Finally, long pauses (greater than 10 minutes) may indicate disengagement from the task.

In this paper, we look at the relative number of pauses over the course and correlate it with outcomes (exam score). That is, if a student takes more or fewer pauses relative to their total number of keystrokes, could it suggest better or worse course performance? Many very small pauses may indicate that the student is planning their typing carefully rather than writing without a clear direction, and may have a positive correlation with performance. Many medium-sized pauses may indicate the same thing, but the measurement may be confounded by students who are easily distracted. Many long pauses may indicate distraction. The research questions we investigate in this paper are:

RQ1 Is there a correlation between the relative number of pauses a student takes and their performance (exam score)?
RQ2 What groups of students exist when clustering on pausing behavior?
RQ3 What events initiate a pause and how does this correlate with the performance of the student?

We seek to answer these research questions using analysis of keystroke data collected in two CS1 courses at different universities on different continents. The closest matches to our work come from two different research streams. One research stream has studied syntax errors and identified pauses or breaks when correcting such errors (e.g. [2, 10, 16, 30, 59]). The other has focused on the analysis of keystroke data, which has been shown to be effective at gaining insights into student behavior [26, 32, 37] as well as predicting student outcomes [19, 40, 55]. However, little work exists at the intersection of these research streams, where our work lies. The novelty of our analysis is that we look less at typing behavior and more at pausing behavior, which might indicate more or less of a student's cognitive processing, examination of external resources, or disengagement.
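The four pause categories above, together with the pauses-per-event measure used later in the paper, can be sketched as a small classifier over inter-event latencies. This is our own illustrative sketch under the paper's stated thresholds (2 s, 15 s, 180 s, 600 s); the function names and the example timestamps are hypothetical, not the authors' code or data.

```python
def classify_pause(latency_s):
    """Map the latency (seconds) between two consecutive keystroke
    events to the paper's pause categories; latencies under 2 s are
    ordinary typing, not pauses."""
    if latency_s < 2:
        return None       # not a pause
    if latency_s < 15:
        return "micro"    # low-level, "local" thinking (e.g. syntax)
    if latency_s < 180:
        return "short"    # planning / revising the next code segment
    if latency_s < 600:
        return "mid"      # outside resources, short breaks
    return "long"         # likely disengagement from the task


def pause_frequency(timestamps_s):
    """Number of pauses (>= 2 s) divided by the total number of events."""
    latencies = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    pauses = sum(1 for lat in latencies if classify_pause(lat) is not None)
    return pauses / len(timestamps_s)


# Hypothetical event timestamps (seconds) for one student
ts = [0.0, 0.4, 1.1, 5.1, 200.1, 900.1]
labels = [classify_pause(b - a) for a, b in zip(ts, ts[1:])]
# labels: [None, None, "micro", "mid", "long"]
freq = pause_frequency(ts)  # 3 pauses over 6 events -> 0.5
```

A frequency of 0.09 under this measure would mean roughly one pause per 11 events, matching the interpretation given in the results.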
In this paper, we report on several findings: students who pause more often generally show worse performance in the course; students who take more shorter pauses perform better than students who take more longer pauses; mid pauses have the strongest negative correlation with exam scores; and specific events that precede pauses have a more evident correlation with performance and thus allow conjecture about the underlying processes.

2 RELATED WORK

2.1 Pausing behavior

In the research literature, pauses are prevalently discussed in relation to language production – written or oral speech/narration [13, 21, 49], language translation [33] and editing [43]. Relying on cognitive psychology, researchers associate pauses with cognitive processing of various types [43]. For example, in writing, it is thought that pauses at higher-level text units (e.g. between sentences) are likely to be conditioned by higher-level subprocesses, such as planning and organization of the content, whereas pauses at lower-level units (e.g. between and inside words) are conditioned by lower-level subprocesses, such as morphology encoding and retrieval of items from one's memory [54]. A pause is also considered to signify the cognitive effort imposed by the mental processes of language production [33, 34]. Butterworth [11] hypothesised that the more cognitive operations are needed for output production, the more pauses arise. Damian and Stadthagen-Gonzalez [14] and Révész et al. [48] argued that the length of a pause taken before a textual unit reflects the mental effort made with respect to production of this forthcoming unit. Reflecting on pausing in post-editing, O'Brien [43] concluded that pausing patterns do, to an extent, indicate cognitive processing; however, they are ultimately subject to individual differences.

Pausing has also gained attention in the medical training domain. Lee et al. [35] studied pauses and their relation to cognitive load. Students had to complete a medical game that simulated emergency medicine under two conditions: pause-available and pause-unavailable. In the study, pauses of two types were identified: reflection and relaxation. The first type is argued to enhance task-related cognitive processes and therefore increase mental effort (or cognitive load). The second type reflects the opposite process, when the load lowers due to the resting state.

That being said, pauses during problem-solving can signify not only ongoing mental work but also a suspension of it caused by various things. Gould [23] defines three types of interruptions: those that are relevant to the task and reinforce processes in the working memory, those that are relevant to the task and interrupt processes in the working memory, and those that are not relevant to the task. The author states that how these interruptions affect the following resumption and productivity depends on "contextual factors at the moment of interruption". Borst et al. [9] also relate the length of interruption to the time of subsequent resumption and the number of possible errors in a task.

2.2 Pausing behavior in CER

Pausing behavior has been studied in Computing Education Research (CER) both directly and indirectly in the context of computer programming. Similar to written language, where pauses between and within sentences are likely conditioned by different subprocesses [54], code writing has its own milestones and units of different levels of complexity. When considering the mental effort needed to write code, one stream of research has focused on identifying and discussing plans and schemas for programming [15, 51]. It has been suggested that programmers who know the solution to a problem write their solution in a linear manner, while solving a new problem is done using means-ends analysis with the use of existing related schemas [15, 51]. Over time and through practice, accumulation and evolution of schemas allow programmers to solve problems more fluently, and also to learn to solve new problems with more ease [51, 58].

As discussed in Section 2.1, pauses can signify cognitive effort and are a natural part of the learning process. In programming, however, an additional contributor to pauses, especially for novice programmers, is the syntactic and semantic errors related to writing computer programs in the chosen programming language. These errors may be highlighted by the programming environment in use – similar to a word processing engine that shows spelling errors – but they may also become visible only through specific actions such as compiling the source code. These errors have been discussed especially in the context of Java programming, where researchers have studied the frequency of different types of errors [2, 10, 16, 30, 59] and the amount of time that it takes to fix such errors [3, 16]. Denny et al. [16] and Altadmri and Brown [2], for example, have noticed that there are significant differences in the time that it takes to fix specific errors, and that over time students learn to avoid specific errors [2]. At the same time, the granularity of the data used in the analysis has an influence on the observed errors [59] – different data granularity will lead to different observed syntax errors. In practice, collecting typing data with timestamps can provide more insight into the programming process than snapshot or submission data [60].

When considering syntax errors, pauses, and performance, the ability to fix syntax errors between successive compilation events has been linked with students' performance in programming [31], although it is unclear what the underlying factors that contribute to the observation are [46]. Including an estimate of the amount of time that individual students spend on fixing specific errors can increase the predictive power of such models [1, 64], highlighting the effect of time (or pause duration) on the learning process.

While the previous examples are specific to syntax errors and time, little effort has been invested in looking into pauses in programming. Perhaps the closest prior work to ours is that of Leppänen et al. [41], who studied students' pausing behavior in two courses and identified that a larger quantity of short (1-2 second) pauses was positively correlated with course outcomes, while a larger quantity of longer (over 1 minute) pauses was negatively correlated with course outcomes. Our work builds on this by – in addition to correlation analysis – looking at pausing behavior over different contexts and also by investigating which characters precede pauses. Leppänen et al. hypothesized that one explanation for the correlation between long pauses and poorer course outcomes could be related to task switching between reading the course materials and solving the programming problems, but noted also that the pauses from writing code could be construed as instances of the student engaging in planning, reviewing, and translating the next ideas into code. Another possible hypothesis is related to differences in cognitive flexibility, i.e. the ability to fluently switch between two tasks; for example, Leinikka et al. [36] observed that students with better cognitive flexibility are faster at solving programming errors, although they did not observe links between cognitive flexibility and introductory programming course exam outcomes.

2.3 Typing and performance in programming

In CER, a multitude of data sources has been used for identification of factors and behaviors that contribute to course outcomes [25] – clicker data [42, 47], programming process snapshots [1, 12, 31, 39, 64], background questionnaires and survey data [7, 8, 53, 57, 65], and so on – but our focus is on keystroke data collected from programming environments [27, 28]. Keystroke data, or typing data, has been used, for example, for predicting academic performance [19, 40, 55], for detecting emotional states such as stress [20, 32], and for identifying possible plagiarism [26]. Much of the analysis of typing data that relates to students' performance has focused on between-character latencies, i.e. the time that it takes for the student to type two subsequent characters. This analysis has often focused on small latencies, as pauses have been considered noise. For example, both Leinonen et al. [40] and Edwards et al. [19] used 750 milliseconds as an upper boundary for between-character latencies. In general, these studies have found that faster typing correlates with previous programming experience and performance in the ongoing programming course.

Not all characters are equally important, however. For example, Leinonen et al. [40] identified differences in the time that moving from 'i' to '+' took for novices and more experienced students, while differences in some other character pairs were more subtle. Similarly, Thomas et al. [55] noted that the use of control functionality (e.g. using the control and C keys) was in general slower than the use of e.g. alphanumeric keys, and the use of special keys such as delete and space was also slower than alphanumeric keys. Acknowledging that some of these latencies may also be influenced by the keyboard layout, they hypothesised that some of the latencies may be influenced by the thought processes related to the ongoing problem solving [55]. Our work builds on this prior work by examining which characters precede pauses, i.e. whether all characters are equally important when analyzing pausing behavior.

3 METHODOLOGY

3.1 Context and data

Our study was conducted in two separate contexts for purposes of generalization of the results.

3.1.1 University A. University A is a mid-sized public university in the Western United States. In a 2019 CS1 course, students used a custom, web-based Python IDE called Phanon [17] for their programming projects. Phanon logged keystrokes and compile/run events. Five programming projects, one per week, were assigned to the students during the study period. Each project consisted of two parts: a text-based mathematical or logical problem, such as writing an interest calculator; and a turtle graphics-based portion requiring students to draw a picture or animation, such as a snowman. A midterm exam was administered between the fourth and fifth projects. There were three sections of the course, all taught by the same instructor. Projects and instruction were the same for all three sections. In-person instruction was conducted three times per week. At the beginning of the semester, students were given the opportunity to opt into the study according to our Institutional Review Board protocol, and this paper uses data only from students who opted in. The course was identical for students who chose to participate in the study and those who chose not to. Gender information on participants was not collected, but in the course participants were recruited from, 19% self-identified as women and 81% self-identified as men. No information on previous programming experience, race, or ethnicity was available for this study.

3.1.2 University B. University B is a research-oriented university in Northern Europe. The data for this study was collected from a 7-week introductory programming course in the Fall of 2016. The introductory programming course is given in Java and covers the basics of programming, including procedural programming, object-oriented programming, and functional programming. During each week of the course, there was a 2-hour lecture that introduced the core concepts of the week using live coding. The emphasis in live coding was on providing examples of how programming problems were solved with the concepts learned during the week, and on helping to create a mental model of an abstraction of the internals of the computer as programs are executed (introduction of variables, changing variable values, objects, the call stack, in a line-by-line fashion). In addition to the lectures, 25 hours of weekly support was available in reserved computer labs with teaching assistants and faculty. The programming assignments in the course are completed using a desktop IDE accompanied by an automated assessment plugin [61] that provides students feedback as they are working on the course assignments. Combined with an automatic assessment server, the plugin also provides functionality for sending assignments for automatic assessment. In addition to the support and assessment, the plugin collects keystroke data from the students' working process, which allows fine-grained plagiarism detection [26] and makes it possible to provide more fine-grained feedback on students' progress. Students can opt out of the data collection if they wish to do so; the data collection was conducted according to the ethical protocols of the university. Out of the 244 students at University B included in the study, approximately 40% self-identified as women and 60% as men. No information on previous programming experience nor race or ethnicity was available for this study.

Because of data availability, we use a single measure of outcome – exam score. In the US context we use the score of an exam that falls just before the last project in the study. In the European context, we use the exam score from the first of three programming exams. The first exam was organized in the third week of the seven-week course. Similar to University A, University B had a midterm exam in the course. For the analyses conducted in this article, we focus on students' performance in the midterm examination. The contexts are summarized in Table 1.

Table 1: Summary of contexts.

  Attribute          University A          University B
  Instruction        Lectures w/ sections  Lectures, sessions
  Language (prog.)   Python                Java
  Language (inst.)   English               Finnish
  Participants       231                   244
  Prog. Environment  Web-based             Desktop

Table 2: Descriptive statistics of the study.

  Context        Students  Events/student  Pauses/student
  Python (US)    231       25186 ± 11243   2183 ± 1009
  Java (Europe)  244       54698 ± 25538   6774 ± 3189

3.2 Event and pause categories

Keystroke data was collected in both contexts. In the analysis, every keystroke of a student was categorized as an event. We consider eight event categories: (1) Alphanumeric keystroke, (2) Delete keystroke, (3) Return keystroke, (4) Spacebar keystroke, (5) Special character keystroke, (6) Tab keystroke, (7) Successful compile/run, and (8) Failed compile/run. Since the US context uses Python, in this paper we will call a compile/run event a "Run". The reasoning behind these categories is that they represent different tasks of the student: Alphanumeric events represent typing, Delete events indicate the student is preparing to make a correction, Run events represent a completion point where the student is ready to test the code, etc. The European context does not have information on tabs or the status of run events, so analyses relating to the status of run events or tabs will only use data from the US context.

For the analysis of pauses, we chose to use four types of pauses. While pausing analyses in the context of programming have been exploratory [41], research on pausing in language production varies in terms of pause thresholds. A lower bound of 1-2 seconds appears to be the most common [5, 44, 62], and thus we adopted a 2-second lower bound for our study. Taking into account research on working (short-term) memory time capacity [50], a meaningful upper bound is at 15 seconds – pauses between 2 and 15 seconds may reflect thinking about the code on a low, "local" level, including thinking about the syntax, and could be tied to working memory. We call these pauses micro pauses. Short pauses may reflect higher-level processing, like planning the following code segment, setting the next sub-goal, and revising code – similar to the production stage of written language – or some kind of distraction. We chose 180 seconds as the upper bound for short pauses. Mid pauses, up to 10 minutes, may reflect voluntary or problem-solving-related breaks. We hypothesise that students who have difficulties may consult learning materials or visit other resources in search of help or to refresh their memory. Such a continuous pause would cause longer task resumption [9]. Finally, long pauses, greater than 10 minutes, are most likely to stand for task disengagement, as noted by prior work [38]. We expect such pauses to take place after finished code segments or compilation.

For simplicity, we also refer to pauses initiated by a certain event type using the name of the event type. For example, a delete pause is a pause initiated by the student pressing the delete key; the last event before a failed run pause is a failed run event. Pauses preceded by other event types are named similarly.

3.3 Statistical tests

We report p values of all statistical significance tests, of which there are 95. We follow the American Statistical Association's recommendation to use p values as one piece of evidence of significance, to be considered in context [63], though we do suggest p < 0.05/95 ≈ 0.0005 = 5e−4 as a reasonable guideline for credible p values [24]. When considering the claims in our work, we suggest taking into account the additional supports beyond single p values.
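The two statistics used throughout Section 3.3 can be computed directly. The following stdlib-only sketch of Pearson's r and Cohen's d (with pooled standard deviation) is our own illustration; the toy pause-frequency and score values are made up, not the study's data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples,
    e.g. per-student pause frequency vs. exam score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cohens_d(a, b):
    """Cohen's d effect size using the pooled standard deviation,
    e.g. for comparing exam scores of two student clusters."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Illustrative only: higher pause frequency, lower exam score
freqs = [0.07, 0.09, 0.11, 0.13]
scores = [85, 80, 74, 70]
r = pearson_r(freqs, scores)  # close to -1 for this toy data
```

Under the multiple-comparison guideline above, an observed p value would be treated as credible evidence only below 0.05/95 ≈ 5e−4, alongside consistency across contexts.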
For example, claiming that delete pauses are negatively correlated with exam score is based on consistent negative correlation across pause types in both studied contexts. For distribution comparisons we use the t-test (as the data appears normal) with Cohen’s 𝑑 effect sizes and for correlation we use the Pearson 𝑟 statistic. 4 RESULTS 4.1 Descriptive statistics Table 2 shows descriptive statistics of our study and Figure 1 shows the distribution of event types for each of the two contexts. Alphanumeric keystrokes are the most common event, with space (spacebar) and special characters also being common. Run events and the tab keystroke are less common. Both contexts use an editor that automatically indents the next line of code after a return key press, which likely contributes to the observed lack of tab keystrokes. A difference between the contexts is the relative frequency of run events – students in the Java context run their code far more often than those in the Python context. This is likely not due to the language, but the way the courses are organized. In the Python context, students had one large assignment due each week, while Java students had tens of smaller assignments due each week. We conjecture that the smaller assignments induced the students to run/compile their code more often. 4.2 Frequency of pauses 𝑝 We calculated a measure of pause frequency as 𝑛𝑙 where 𝑝𝑙 is the number of pauses of length 𝑙 and 𝑛 is the total number of events. In the Python context, on average, student pause frequency 190 (a) US/Python To identify student types based on the vector representations, we use 𝑘-means clustering to cluster students into student types. Using the elbow method (visually finding the “elbow” of a line chart of number of clusters against explained variance [56]) to identify a good number of clusters, we chose 𝑘 = 2 for interpretability, though, as we will see in Section 5.2, the choice of 𝑘 is not particularly important in this case. 
We see in Table 3 and Figure 5 that one group of students in each context took relatively more short, mid, and long pauses than the other group, although in the Java context, the difference is less pronounced. We call the clusters the longer pause and shorter pause groups, respectively. When examining the groups and exam scores, we observe that the students in the shorter pause group had higher exam scores than those students who took longer pauses. The distributions of pause frequencies are approximately normal and t-tests suggest that there is a difference between short, mid, and long distributions. (b) European/Java Figure 1: Log-scale bar chart showing the total number of events for each type. (a) Number of events in the US/Python context. (b) Number of events in the European/Java context. is 0.09 ± 0.02, meaning, on average, students execute 11 events before pausing for two seconds or more. Most pauses are micro pauses, which have a frequency of 0.07, followed by short, mid, and long pauses with frequencies of 0.02, 0.001, and 0.001, respectively. The Java context was somewhat different: on average, student pause frequency is 0.13 ± 0.03, meaning, on average, students execute 8 events before pausing for two seconds or more. Most pauses are micro pauses, which have a frequency of 0.09, followed by short, mid, and long pauses with frequencies of 0.03, 0.002, and 0.001, respectively. As might be expected, Figure 2 shows a negative correlation between pause frequency and exam score, meaning students who are pausing more often are performing worse on the exams. In the Python context, this correlation is consistent across micro (𝑟 = −0.30, 𝑝 = 3.16e-6), short (𝑟 = −0.35, 𝑝 = 5.57e-8), and mid (𝑟 = −0.38, 𝑝 = 3.71e-9) pause lengths, with a weaker correlation for long (𝑟 = −0.18, 𝑝 = 0.0061) pauses. 
The Java context, in contrast, has a weaker correlation for the micro pause (𝑟 = −0.11, 𝑝 = 0.0654) than for the short (𝑟 = −0.20, 𝑝 = 0.0013), mid (𝑟 = −0.23, 𝑝 = 0.0003), or long (𝑟 = −0.22, 𝑝 = 0.0006) pauses. In general, the correlations for the Python context are stronger than for the Java context, although as seen in Figure 2, the Java context has a noticeable ceiling effect in the exam. Figure 3 shows correlations between the number of different type of pauses that students take, i.e., whether students who are pausing for short amounts of time are also taking longer pauses. All types of pauses are at least moderately correlated with all other types of pauses (see Figure 4). Interestingly, the correlations weaken as the pause lengths grow for the Python context, while the Java context shows a strong correlation for the mid/long pause pair. 4.3 4.4 Initiating pauses In Figure 6 we see relative frequencies of event types by pause length. We define relative frequency for an event type E as the percentage of pauses of a given length initiated by an event of type E. For example, in Figure 6 we see that in the Python context, alphanumeric keystrokes initiate 27% of all micro pauses (2-15 seconds) and 15% of long pauses (> 10 minutes) while accounting for 57% of all events, regardless of whether the events initiated a pause or not. In the Java context, the distribution related to alphanumeric events that start a pause is very similar. Roughly 29% of micro pauses and 18% of long pauses are initiated by the events while they account for 28% of all events. Certain types of events in both contexts decrease in frequency with increasing pause length. Alphanumeric, return, space and special characters seem to follow this trend preceding to a greater extent shorter pauses. In Table 4 we see that alphanumeric events initializing micro pauses have a positive correlation with exam score, but that the correlation weakens until it is not detectable for long pauses. 
Conversely, pausing after special characters is not necessarily correlated with success. In fact, a weak negative correlation exists with special characters initializing micro pauses. In the Python context, the percentage of pauses preceded by the delete, return, and space keystroke events remains roughly the same across pause lengths (Figure 6). The return keystroke is unique among the three in that, despite being so infrequent in the data, it precedes so many pauses (11-13%). This tendency does not repeat in case of delete and space events. In the Java context, the situation is different. Return and space events show steady decline in percentages of preceding pauses. The longer the pause, the less common it is for those event to precede it. The opposite applies to the delete events. This could be accounted for differences in programming languages and their relations to students’ native languages [18]. Even though the deleting behaviour differs across the contexts, correlation of most delete pauses with exam scores remains negative in both cases. Student types To characterize students, we represent each student using a vector that contains the relative proportions of each pause type. For example, a student represented with a vector [0.80, 0.15, 0.03, 0.02] has 80% micro pauses, 15% short pauses, 3% mid pauses, and 2% long pauses. Since the vector is a partition of unity, the feature vector has only three degrees of freedom, though, for clarity, we represent it here with all four coordinates. 191 (a) US/Python (b) European/Java Figure 2: Frequency of the different type of pause correlated with exam score. Frequency is calculated as number of pauses divided by total number of events. (a) US/Python (b) European/Java Figure 3: Correlations of total pauses with each other per student. As expected, the number of micro pauses a student takes has a strong positive correlation with the number of short pauses. 
While there are still strong and medium correlations between shorter and longer pauses, the correlations become weaker. 192 Centroid Short Mid Context Cluster Micro Python (US) Python (US) shorter longer 0.81 0.75 0.17 0.23 Java (Europe) Java (Europe) shorter longer 0.79 0.72 0.20 0.26 Long Students Average keystrokes ×104 Average pauses ×103 Average exam 0.010 0.016 0.0060 0.0077 71% (164) 29% (67) 2.6 ± 1.1 2.4 ± 1.3 2.3 ± 1.0 2.1 ± 1.1 80.2 ± 11.3 76.2 ± 11.5 0.01 0.01 0.01 0.01 57% (139) 43% (105) 5.2 ± 2.1 5.8 ± 3.0 6.2 ± 2.8 7.6 ± 3.5 9.59 ± 0.90 9.35 ± 0.88 Table 3: Statistics of the clusters for the two contexts, US and European. For the Cluster column, “shorter” means “shorter pause” and similar with longer. A t-test for the two distributions of exam scores yields (𝑡 = 2.2, 𝑝 = 0.026, 𝑑 = 0.35) in the US context and (𝑡 = 2.1, 𝑝 = 0.034, 𝑑 = 0.28) in the European context. (a) US/Python appears that these activities are in the minority and are dominated by negative-effect activities. Frequency of mid pauses (3-10 minutes) in both contexts have the strongest negative correlation with exam score of all the pause types. Comparing to the long pause which does not have an upper bound, it is clear that after at most 10-minutes long mid pause students get back to typing. We conjecture that mid pause may be the most harmful because it potentially can cause the longest resumption. If the activity taking place during the pause is not related to the task, the pause may be treated as irrelevant interruption [23]. According to Altmann and Trafton [4] and many others (for example, see [6, 22, 29]), the length of interruption correlates with the time of task resumption and numbers of possible errors. (b) European/Java Figure 4: Similar to Fig. 3, this figure shows correlation coefficients for the different pause lengths. 5.2 Both successful and failed run attempts have disproportionate prominence among events preceding pauses relative to their overall frequency. 
Student types
Our second research question is: What groups of students exist when clustering on pausing behavior? We clustered students into two types: longer pause students and shorter pause students. The shorter pause group tends to take proportionally more micro pauses, whereas longer pause students take fewer micro pauses but more short, mid, and long pauses. The shorter pause students appear to perform better on the exam in both contexts. In a sense, grouping students into clusters is arbitrary: Figure 2 shows that pauses of all lengths are negatively correlated with exam score, indicating that the 4-dimensional feature vectors are not linearly independent, effectively making our clustering single-dimensional and not particularly interesting regardless of the choice of k. Nevertheless, the analysis reveals one difference between the contexts that may be of interest: in Table 3, we see that more events correspond to more pauses across the contexts. However, the groups which produce more events are not the same. In the US/Python context, the shorter pause group tends to type and pause more, whereas in the European/Java context the opposite applies. Additionally, the proportion of pauses to events in the US/Python context remains roughly consistent across the groups, at about 0.09; the European/Java context is similar, at 0.12 and 0.13. This observation could be due to a number of context-specific factors, such as the way each context uses programming assignments.

5 DISCUSSION

5.1 Frequency of pauses
Before answering the first research question, Is there a correlation between the relative number of pauses a student takes and their performance (exam score)?, we checked whether our bucketing was sensible by performing a correlation test. As we can see from Figs. 3 and 4, there are correlations between all types of pauses, which is not surprising since the pause lengths lie on a continuum.
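The two-cluster analysis can be sketched like this: a minimal, deterministic k-means with k = 2 over pause-proportion vectors. The paper does not specify its clustering implementation, so the initialization scheme and the toy student vectors below are assumptions made for illustration.

```python
def kmeans2(points, iters=20):
    """Tiny k-means with k = 2, initialized from the two most distant points."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # deterministic init: the pair of points farthest apart
    c0, c1 = max(((p, q) for p in points for q in points),
                 key=lambda pq: dist(*pq))
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if dist(p, c0) <= dist(p, c1) else 1].append(p)
        def centroid(group, fallback):
            if not group:
                return fallback
            return [sum(xs) / len(group) for xs in zip(*group)]
        c0, c1 = centroid(groups[0], c0), centroid(groups[1], c1)
    labels = [0 if dist(p, c0) <= dist(p, c1) else 1 for p in points]
    return labels, (c0, c1)

# invented pause-proportion vectors [micro, short, mid, long] for four students
students = [[0.82, 0.16, 0.01, 0.01], [0.80, 0.17, 0.02, 0.01],
            [0.72, 0.24, 0.02, 0.02], [0.70, 0.26, 0.02, 0.02]]
labels, centroids = kmeans2(students)  # first two vs. last two students
```

On real data, one would then compare exam-score distributions between the two label groups (e.g., with a t-test), as in Table 3.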
However, neighbouring types of pauses do not show a very high degree of similarity, which justifies our choice. Moreover, micro and short pauses, which have the highest correlation coefficient with each other, yield quite different correlation coefficients in terms of their relationships with exam scores (see Figure 2). In general, we observe that students who pause more often perform more poorly on exams (Figure 2), which is in line with the results observed by Leppänen et al. [41]. This effect is not large, but it is consistent across pause types and contexts. We note that this measurement is the frequency of pauses, so it is normalized across students regardless of the total number of events they execute. In this paper we do not make any claims regarding what students were doing during their pauses, whether they were thinking, drawing on other resources, or disengaged. But the correlations in our data indicate that, regardless of pause activity, pauses correlate negatively with exam score, at least in the aggregate. We note that certain activities may not cause a negative correlation with achievement.

5.3 How pauses are initiated
Our third research question is: What events initiate a pause and how does this correlate with the performance of the student? The first thing to note is that the distributions of event types, for each of the four pause lengths, do not match the overall distribution of events (Figure 6). This confirms, as one might expect, that, in general, students are not pausing at arbitrary times, meaning that pauses are generally purposeful and not taken at random times while typing.

(a) US/Python (b) European/Java
Figure 5: Clustering. In the Python context, t-test statistics (t, p) and effect sizes (d) between the two distributions are: Micro (t = 1.5, p = 0.12, d = 0.23), Short (t = −6.9, p = 3.6e−11, d = −1.01), Mid (t = −6.4, p = 8.4e−10, d = −0.93), and Long (t = −3.2, p = 0.0014, d = −0.47). In the Java context, t-test statistics (t, p) and effect sizes (d) between the two distributions are: Micro (t = 1.7, p = 0.08, d = −0.22), Short (t = −10.8, p = 2.4e−22, d = −1.35), Mid (t = −9.3, p = 1.1e−17, d = −1.15), and Long (t = 2.1, p = 2.6e−9, d = −0.77).

Python (US)   Enter           Alphanum        Delete          Special         Space           Tab             (Success) run   Fail run
Micro         0.27 (1e−5)     0.18 (4e−3)     −0.33 (1e−7)    −0.20 (9e−4)    0.21 (0.001)    −0.029 (0.64)   −0.17 (0.015)   −0.38 (1e−7)
Short         0.11 (0.1)      −0.012 (0.84)   −0.245 (1e−4)   −0.16 (0.010)   0.19 (3e−3)     0.021 (0.73)    −0.05 (0.45)    −0.38 (1e−7)
Mid           0.076 (0.29)    −0.028 (0.66)   −0.13 (0.05)    −0.23 (2e−4)    0.14 (0.08)     0.034 (0.60)    −0.007 (0.91)   −0.22 (8e−4)
Long          −0.022 (0.80)   −0.012 (0.86)   −0.12 (0.11)    0.067 (0.28)    0.037 (0.73)    0.014 (0.86)    0.067 (0.32)    −0.06 (0.37)
All           0.2687 (3e−5)   0.082 (0.18)    −0.36 (1e−6)    −0.24 (1e−4)    0.22 (5e−4)     0.12 (0.042)    −0.13 (0.04)    −0.43 (1e−6)

Java (Europe) Enter           Alphanum        Delete          Special         Space           Run
Micro         0.28 (1e−5)     0.31 (0.0)      −0.23 (4e−4)    −0.18 (3e−3)    0.058 (0.41)    −0.034 (0.58)
Short         0.19 (0.004)    0.21 (5e−4)     0.01 (0.81)     0.055 (0.36)    0.062 (0.42)    −0.024 (0.69)
Mid           0.14 (0.04)     0.23 (2e−4)     0.13 (0.07)     0.023 (0.72)    0.22 (0.23)     0.079 (0.19)
Long          0.005 (0.95)    0.033 (0.62)    −0.03 (0.74)    −0.24 (2e−4)    0.13 (0.57)     0.070 (0.25)
All           0.29 (5e−6)     0.23 (1e−4)     −0.20 (0.002)   0.21 (5e−4)     0.073 (0.30)    −0.088 (0.15)

Table 4: Pearson r correlations, with p values in parentheses, between a student's tendency to initiate a given length of pause with a given event type and exam score. "All" indicates the percentage across all events (both those initiating pauses and not). We do not have data on tabs or on whether a run was successful in the Java context, so tab values are not included for the Java context, and the (Success) run column should be interpreted as successful runs for the Python context and all runs (the Run column) for the Java context.

One would expect the most natural way to take a break would be a successful run.
Yet, in the Python context, only 30% of long pauses are initiated as such. Another intuitive, natural break would be a failed run, as the student might need a break or an extended session of reviewing external materials after a failure. Yet failed runs account for only 4% of long pauses. This means that 66% of long pauses are initiated with a keystroke. The most common event for long pauses, the delete keystroke, initiates 25% of the pauses.

5.3.1 Deletes and failed runs. There is some abruptness regarding what initiates a long pause. Long pauses, those of 10 minutes or more, may indicate that the student is disengaged from working on the project [38].

The fact that the frequency of alphanumeric events preceding a pause decreases with increasing pause length as much as it does (Figure 6) suggests that students are completing their lower-level processing thoughts before taking longer breaks. Indeed, it appears that students are deliberate in taking longer breaks rather than getting interrupted, as would be the case if alphanumeric pauses were more common. In Python, statements generally do not end with a special character as they do in Java (e.g., a semicolon for a single-line statement and a closing brace for a block). So it is not surprising that pauses initiated by special characters decrease in frequency with increasing pause length in the Python context. What is surprising, however, is that the Java context shows a very similar phenomenon. We expected longer pauses to be frequently initiated by special characters in the Java context, as ending a line with a semicolon seems like a natural stopping point. We do not know why this is not the case, but we suspect that this is, again, a consequence of the difference in instructional methods between the two contexts. The Java students work on smaller projects and run more often, and so they may be more likely to complete their thought or work session with a run event.
In Table 4 we see that, in both the Python and Java contexts, alphanumeric micro pauses are positively correlated with exam score while special character micro pauses are negatively correlated. As these two event types behave similarly in other respects, we discuss a possible explanation for this difference. Roughly half of special characters require further processing: an open parenthesis expects formal parameters for a function call; quotes expect a string; an open bracket expects list/array indices; etc. It may be that a micro pause, which may last as long as 15 seconds, indicates student hesitancy and a lack of fluency with Python or Java syntax. This lack of fluency with a fundamental aspect of programming may be why these students' exam scores are lower. Problems with special characters being indicative of struggling have also been hypothesized in previous work [19, 40]. If special character pauses do indicate uncertainty with syntax, then instructors may consider an increased focus on syntax fluency for students initiating pauses with special characters.

(a) US/Python (b) European/Java
Figure 6: Grouped bar chart showing normalized/relative frequencies of keystrokes by pause length. "All" are all events, whether they precede a pause or not.

The Java context is similar, with 22% of long pauses initiated by delete. This seems remarkable. A delete press often indicates an error, and so, after the delete press, the student needs to execute keystrokes to replace the incorrect code. At times, however, students take a break instead of completing the correction. If this happens, it could indicate that the student may lack motivation, diligence, or the corrective know-how without consulting external help. Other types of pauses were also rather often preceded by a delete event (26-27%). From Table 4, we can see that the correlations of the exam score with such pauses are negative.
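The normalized frequencies shown in Figure 6 can be computed from a keystroke log along these lines. This is an illustrative sketch, not the authors' pipeline; the event-type names and the sample log are invented.

```python
from collections import Counter, defaultdict

def preceding_event_distribution(events):
    """For each pause length, the normalized distribution of the event type
    that immediately precedes the pause.  `events` is a chronological list
    of (timestamp_seconds, event_type) pairs."""
    def bucket(gap):
        if gap < 2:
            return None           # ordinary typing, not a pause
        if gap < 15:
            return "micro"
        if gap < 180:
            return "short"
        if gap < 600:
            return "mid"
        return "long"
    dists = defaultdict(Counter)
    for (t0, kind), (t1, _) in zip(events, events[1:]):
        b = bucket(t1 - t0)
        if b is not None:
            dists[b][kind] += 1   # the earlier event initiates the pause
    return {b: {k: v / sum(c.values()) for k, v in c.items()}
            for b, c in dists.items()}
```

Comparing these per-bucket distributions against the distribution of all events gives the "disproportionate prominence" observation from the text.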
This may signify that deletes are used less often for removing unneeded code (e.g., print statements or comments) and more often when students are confused and do not know how to proceed. The same reasoning could be used to explain the negative correlation of failed runs initiating pauses, i.e., that an extended pause after a failed run indicates that the student does not know how to fix the problem and has to take time to either consult other materials or take a break. Indeed, the correlations of failed runs with exam score closely mirror those of delete key presses. One could suggest that the consistent negative correlations of delete and failed run events initiating pauses with exam score simply reflect the overall correlation of these event types with exam score. We note, however, that the distributions of the two events are different and that they demonstrate different degrees of involvement in long pause initiation. Deletes constitute 20%/24% of all events and precede 25%/22% of long pauses, whereas failed runs account for only 0.5% of all events but precede 4% of long pauses. Relative to its overall frequency, a failed run event is thus roughly six times more likely than a delete keystroke to initiate a long pause. A plausible interpretation of this observation is that students are deliberately pausing after failed runs, at least more often than after deletes.

5.3.3 Events constant across pause length. In the Python context, the percentage of pauses initiated by the return event remains roughly the same across pause lengths (Figure 6). This makes sense in the context of both shorter and longer pauses: pressing return requires short-term planning for the next line, so its prevalence among micro and short pauses is logical; pressing return is also a natural stopping point before taking a break, so it is frequent among mid and long pauses.
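The "six times" figure follows directly from comparing each event type's share of long pauses with its share of all events, using the percentages quoted above (Python context):

```python
def enrichment(share_of_long_pauses, share_of_all_events):
    """How over-represented an event type is among long-pause initiators."""
    return share_of_long_pauses / share_of_all_events

# Percentages from the text: deletes are ~20% of all events and precede
# ~25% of long pauses; failed runs are ~0.5% of all events but precede
# ~4% of long pauses.
delete_ratio = enrichment(25, 20)     # 1.25
fail_run_ratio = enrichment(4, 0.5)   # 8.0
print(fail_run_ratio / delete_ratio)  # 6.4, i.e. roughly "six times"
```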
Although return is only every 33rd event in the Python data and every 7th in the Java data, it initiates approximately 12% of pauses of every type in the Python context and as much as 12-24% in the Java context. This seems to confirm that the return keypress is a natural stopping point.

5.3.4 Run events. Despite being rather rare in the typing data in both contexts, run events are notably evident among events preceding pauses, especially short, mid, and long pauses. This is not unexpected: it would be highly unusual for a run event to take fewer than two seconds, so the great majority of run events would precede at least a micro pause. A large proportion of long pauses are initiated by successful runs in the Python context and by runs in the Java context. It is instructive to consider why students would pause after a successful run. It is possible that a student takes a pause to consult external resources (e.g., the internet, a textbook, another person) regarding how to proceed with their program, but it seems more likely that the student would have at least an idea of what to do next after a successful run. Therefore, we suggest that the more likely scenario is that the student is instead disengaging from working on their assignment. If this is the case, then we could possibly use the percentage of long pauses initiated by successful runs in the Python context as a lower bound for the number of long pauses in which students are disengaging. In our data, this indicates that students are disengaging during at least roughly 30% of long pauses. We expect that this is a conservative lower bound.

5.3.2 Events decreasing in frequency with pause length. Alphanumeric keystrokes are what we might call "middle" events: they are the most common while being somewhat less significant in terms of reflecting thinking processes.
In addition, as previous studies that have used keystroke data for predicting course outcomes have mainly focused on latencies smaller than 750ms [19, 40], future research should seek to combine such keystroke data with pausing data and study whether these phenomena have the same underlying tacit factors. RQ2 What groups of students exist when clustering on pausing behavior? We found in a cluster analysis that students whose pausing behavior tended toward short pauses performed better in general on exams. The cluster analysis primarily indicated a correlation between typical pause length for a student and exam score. When considering the identified student types in the light of CER studies that have identified student types such as the tinkerers, stoppers, and movers [30, 45], most of the students in the studied contexts could be categorized as movers, despite the differences in their pausing behavior. As pausing is linked with cognition and thought processes, and as writing code is linked with a multitude of factors including understanding syntax and the given problem [52, 66], further research is needed to understand the lack of stopping and the differences in pausing. RQ3 What events initiate a pause and how does this correlate with the performance of the student? We have presented evidence that pauses do not occur randomly while a student is programming – students tend to finish their thoughts and pause after a natural stopping point. This observation is in line with the studies on student cognition and programming and how students solve programming problems [15, 51], where students write constructs informed by schemas that engage procedural memory. Fully 25% (Python) and 22% (Java) of long pauses (>10 minutes) are initiated by delete events. 
We suggest that students who pause after delete are possibly less engaged (taking a break instead of writing the code to replace the deleted characters) or that they lack the knowledge to write a fix (consulting external resources to learn how to fix the problem). This presents interesting questions for future research, such as what percentage of delete pauses indicate a disengaged student. Beyond the identification of at-risk students, the negative correlations of special character and failed compile pauses suggest possible pedagogical and material innovations to improve student fluency after special characters and to minimize the number of failed runs. In addition to the directions for future research discussed above, there are further avenues to explore. As an example, while previous research on syntax errors has noted that there are differences in the time that it takes to fix syntax errors [2, 16], our study highlights that pause durations are related to the pressed keys. Combining information on present syntax errors (or the lack of them) with information on pauses could create a more in-depth understanding of students' knowledge and actions: for example, pauses preceded by a syntax error likely indicate different thought processes than pauses not preceded by one. Similarly, the syntactic construct that was just typed, or is being typed, could affect pausing behavior. While our definitions of pauses were based on related literature (e.g., [41]), future work could explore alternative bins, including higher-resolution bins for the micro pause, which spans lengths from 2 to 15 seconds in the work reported in this paper. Language-specific differences should also be studied further. As an example, we noted that in the European/Java context students took a micro pause on average after 8 keystrokes.

5.4 Threats to validity
Internal validity. As is natural in educational studies, our study comes with an inherent self-selection bias.
It is possible that the way the studied courses were organized and the way the student population at both universities is formed influence the observed outcomes. It is unclear, for example, whether similar results would be observed if the study had been conducted in the context of primary or secondary education, or in life-long learning. When considering the outcome of the courses, we used exam score as a proxy for performance, which can be affected by factors such as exam stress. In addition, the European/Java context had a noticeable ceiling effect in the exam outcomes. It is possible that this also influenced some of our findings and that lifting the ceiling effect would affect the correlations.

External validity. We studied keystrokes in two contexts to increase the degree to which our findings can be generalized to other contexts (see Section 3.1). The strength of the correlations and the p values varied somewhat between the contexts, and we cannot state which context-specific factors contributed to the differences.

6 CONCLUSIONS
In this article, we presented an analysis of keystrokes with an eye toward understanding the pausing behavior of CS1 students and its implications for academic outcomes. In this section we draw conclusions from our results for each of our three research questions.

RQ1 Is there a correlation between the relative number of pauses a student takes and their performance (exam score)? We observe that negative correlations between pause frequency and exam score exist, as illustrated in Fig. 2. The most illustrative is the frequency of mid pauses, those of length 3-10 minutes. We suggest that these pauses indicate that a student may be easily distracted, but they could also indicate students who are spending time using external resources for help on their projects. Révész et al.
[48] suggest that, since keystroke logs alone do not allow us to "make inferences about the specific cognitive processes that underlie pausing behaviors", combining event logs with "other techniques such as verbal reports and eyetracking" could be helpful in obtaining more detailed information. Further study could help us understand what these students are doing during pauses and what they were working on when they paused. But in the meantime, the pause/exam score correlation appears actionable. We suggest that a tool that allows practitioners to visualize students' pausing behavior could be particularly useful. In comparison, students in the US/Python context took a micro pause on average after 11 keystrokes. It would be meaningful to understand where this difference stems from. If it is simply the language, then one possible implication is that the relative verbosity of Java when compared to Python would not only require the students to type more, but also to pause to think more. On the other hand, if it is a product of a contextual factor, then it could be something that could be disseminated to other contexts as well. Future studies could also focus on differences between the beginning and end of the course to see if programming behavior changes with experience. [19] John Edwards, Juho Leinonen, and Arto Hellas. 2020. A Study of Keystroke Data in Two Contexts: Written Language and Programming Language Influence Predictability of Learning Outcomes. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education. 413–419. [20] Clayton Epp, Michael Lippold, and Regan L Mandryk. 2011. Identifying emotional states using keystroke dynamics. In Proceedings of the SIGCHI conference on human factors in computing systems. 715–724. [21] Jean-Noël Foulin. 1995. Pauses et débits : les indicateurs temporels de la production écrite. L'année psychologique 95, 3 (1995), 483–504. [22] Tony Gillie and Donald Broadbent. 1989.
What makes interruptions disruptive? A study of length, similarity, and complexity. Psychological research 50, 4 (1989), 243–250. [23] Alexander JJ Gould. 2014. What makes an interruption disruptive? Understanding the effects of interruption relevance and timing on performance. Ph.D. Dissertation. UCL (University College London). [24] Winston Haynes. 2013. Bonferroni Correction. Springer New York, New York, NY, 154–154. https://doi.org/10.1007/978-1-4419-9863-7_1213 [25] Arto Hellas, Petri Ihantola, Andrew Petersen, Vangel V Ajanovski, Mirela Gutica, Timo Hynninen, Antti Knutas, Juho Leinonen, Chris Messom, and Soohyun Nam Liao. 2018. Predicting academic performance: a systematic literature review. In Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education. 175–199. [26] Arto Hellas, Juho Leinonen, and Petri Ihantola. 2017. Plagiarism in take-home exams: Help-seeking, collaboration, and systematic cheating. In Proceedings of the 2017 ACM conference on innovation and technology in computer science education. 238–243. [27] C. D. Hundhausen, D. M. Olivares, and A. S. Carter. 2017. IDE-Based Learning Analytics for Computing Education: A Process Model, Critical Review, and Research Agenda. ACM Trans. Comput. Educ. 17, 3, Article 11 (Aug. 2017), 26 pages. https://doi.org/10.1145/3105759 [28] Petri Ihantola, Arto Vihavainen, Alireza Ahadi, Matthew Butler, Jürgen Börstler, Stephen H. Edwards, Essi Isohanni, Ari Korhonen, Andrew Petersen, Kelly Rivers, Miguel Ángel Rubio, Judy Sheard, Bronius Skupas, Jaime Spacco, Claudia Szabo, and Daniel Toll. 2015. Educational Data Mining and Learning Analytics in Programming: Literature Review and Case Studies. In Proc. of the 2015 ITiCSE on Working Group Reports (Vilnius, Lithuania) (ITICSE-WGR ’15). ACM, 41–63. [29] Shamsi T Iqbal and Brian P Bailey. 2006. Leveraging characteristics of task structure to predict the cost of interruption. 
In Proceedings of the SIGCHI conference on Human Factors in computing systems. 741–750. [30] Matthew C Jadud. 2005. A first look at novice compilation behaviour using BlueJ. Computer Science Education 15, 1 (2005), 25–40. [31] Matthew C Jadud. 2006. Methods and tools for exploring novice compilation behaviour. In Proceedings of the second international workshop on Computing education research. ACM, 73–84. [32] Agata Kołakowska. 2016. Towards detecting programmers' stress on the basis of keystroke dynamics. In 2016 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 1621–1626. [33] Minna Kumpulainen. 2015. On the operationalisation of 'pauses' in translation process research. Translation & Interpreting 7, 1 (2015), 47–58. [34] Isabel Lacruz and Gregory M. Shreve. 2014. Pauses and Cognitive Effort in Post-Editing. In Post-editing of Machine Translation: Processes and Applications. Cambridge Scholars Publishing. [35] Joy Yeonjoo Lee, Jeroen Donkers, Halszka Jarodzka, Géraldine Sellenraad, and Jeroen J.G. van Merriënboer. 2020. Different effects of pausing on cognitive load in a medical simulation game. Computers in Human Behavior 110 (Sept. 2020), 106385. [36] Marianne Leinikka, Arto Vihavainen, Jani Lukander, and Satu Pakarinen. 2014. Cognitive flexibility and programming performance. In Psychology of programming interest group workshop. 1–11. [37] Juho Leinonen. 2019. Keystroke Data in Programming Courses. Ph.D. Dissertation. University of Helsinki. [38] Juho Leinonen, Francisco Enrique Vicente Castro, and Arto Hellas. 2021. Fine-Grained Versus Coarse-Grained Data for Estimating Time-on-Task in Learning Programming. In Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021). The International Educational Data Mining Society. [39] Juho Leinonen, Leo Leppänen, Petri Ihantola, and Arto Hellas. 2017. Comparison of time metrics in programming.
In Proceedings of the 2017 ACM Conference on International Computing Education Research. ACM, 200–208. [40] Juho Leinonen, Krista Longi, Arto Klami, and Arto Vihavainen. 2016. Automatic inference of programming performance and experience from typing patterns. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education. 132–137. [41] Leo Leppänen, Juho Leinonen, and Arto Hellas. 2016. Pauses and spacing in learning to program. In Proceedings of the 16th Koli Calling International Conference on Computing Education Research. ACM, 41–50. [42] Soohyun Nam Liao, Daniel Zingaro, Kevin Thai, Christine Alvarado, William G. Griswold, and Leo Porter. 2019. A Robust Machine Learning Technique to Predict REFERENCES [1] Alireza Ahadi, Raymond Lister, Heikki Haapala, and Arto Vihavainen. 2015. Exploring machine learning methods to automatically identify students in need of assistance. In Proceedings of the eleventh annual international conference on international computing education research. 121–130. [2] Amjad Altadmri and Neil C.C. Brown. 2015. 37 Million Compilations: Investigating Novice Programming Mistakes in Large-Scale Student Data. In Proceedings of the 46th ACM Technical Symposium on Computer Science Education (Kansas City, Missouri, USA) (SIGCSE ’15). Association for Computing Machinery, New York, NY, USA, 522–527. https://doi.org/10.1145/2676723.2677258 [3] Amjad Altadmri, Michael Kolling, and Neil CC Brown. 2016. The cost of syntax and how to avoid it: Text versus frame-based editing. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC). IEEE, 748–753. [4] Erik M Altmann and J Gregory Trafton. 2007. Timecourse of recovery from task interruption: Data and a model. Psychonomic Bulletin & Review 14, 6 (2007), 1079–1084. [5] Rui A Alves, São Luís Castro, Liliana de Sousa, and Sven Strömqvist. 2007. Chapter 4: Influence of Typing Skill on Pause–Execution Cycles in Written Composition. In Writing and Cognition. 
BRILL, 55–65. [6] Mark B. Edwards and Scott D Gronlund. 1998. Task interruption and its effects on memory. Memory 6, 6 (1998), 665–687. [7] Jens Bennedsen and Michael E Caspersen. 2006. Abstraction ability as an indicator of success for learning object-oriented programming? ACM Sigcse Bulletin 38, 2 (2006), 39–43. [8] Susan Bergin and Ronan Reilly. 2005. Programming: factors that influence success. In Proceedings of the 36th SIGCSE technical symposium on Computer science education. 411–415. [9] Jelmer P Borst, Niels A Taatgen, and Hedderik van Rijn. 2015. What makes interruptions disruptive?: A process-model account of the effects of the problem state bottleneck on task interruption and resumption. In Proceedings of the 33rd annual ACM conference on human factors in computing systems. ACM, 2971–2980. [10] Neil Christopher Charles Brown, Michael Kölling, Davin McCall, and Ian Utting. 2014. Blackbox: a large scale repository of novice programmers’ activity. In Proceedings of the 45th ACM technical symposium on Computer science education. ACM, 223–228. [11] Brian L. Butterworth. 1980. Evidence from pauses in speech. New York: Academic Press. [12] Adam S Carter, Christopher D Hundhausen, and Olusola Adesope. 2015. The normalized programming state model: Predicting student performance in computing courses based on programming behavior. In Proceedings of the eleventh annual international conference on international computing education research. 141–150. [13] Jasone Cenoz. 2000. Pauses and hesitation phenomena in second language production. ITL - International Journal of Applied Linguistics 127-128 (Jan. 2000), 53–69. [14] Markus F Damian and Hans Stadthagen-Gonzalez. 2009. Advance planning of form properties in the written production of single and multiple words. Language and Cognitive Processes 24, 4 (2009), 555–579. [15] Simon P Davies. 1991. 
The role of notation and knowledge representation in the determination of programming strategy: a framework for integrating models of programming behavior. Cognitive Science 15, 4 (1991), 547–572. [16] Paul Denny, Andrew Luxton-Reilly, and Ewan Tempero. 2012. All Syntax Errors Are Not Equal. In Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education (Haifa, Israel) (ITiCSE '12). ACM, New York, NY, USA, 75–80. https://doi.org/10.1145/2325296.2325318 [17] John Edwards, Joseph Ditton, Dragan Trninic, Hillary Swanson, Shelsey Sullivan, and Chad Mano. 2020. Syntax exercises in CS1. In Proceedings of the 16th Annual Conference on International Computing Education Research (Dunedin, New Zealand) (ICER '20). [18] John Edwards, Juho Leinonen, Chetan Birthare, Albina Zavgorodniaia, and Arto Hellas. 2020. Programming Versus Natural Language: On the Effect of Context on Typing in CS1. In Proceedings of the 2020 ACM Conference on International Computing Education Research. 204–215. Low-Performing Students. ACM Trans. Comput. Educ. 19, 3, Article 18 (Jan. 2019), 19 pages. https://doi.org/10.1145/3277569 [43] Sharon O'Brien. 2006. Pauses as Indicators of Cognitive Effort in Post-editing Machine Translation Output. Across Languages and Cultures 7, 1 (June 2006), 1–21. [44] Thierry Olive, Rui Alexandre Alves, and São Luís Castro. 2009. Cognitive processes in writing during pause and execution periods. European Journal of Cognitive Psychology 21, 5 (Aug. 2009), 758–785. [45] David N Perkins, Chris Hancock, Renee Hobbs, Fay Martin, and Rebecca Simmons. 1986. Conditions of learning in novice programmers. Journal of Educational Computing Research 2, 1 (1986), 37–55. [46] Andrew Petersen, Jaime Spacco, and Arto Vihavainen. 2015. An exploration of error quotient in multiple contexts. In Proceedings of the 15th Koli Calling Conference on Computing Education Research. 77–86. [47] Leo Porter, Daniel Zingaro, and Raymond Lister. 2014.
Predicting Student Success Using Fine Grain Clicker Data. In Proceedings of the Tenth Annual Conference on International Computing Education Research (Glasgow, Scotland, United Kingdom) (ICER '14). Association for Computing Machinery, New York, NY, USA, 51–58. https://doi.org/10.1145/2632320.2632354 [48] Andrea Révész, Marije Michel, and MinJin Lee. 2017. Investigating IELTS Academic Writing Task 2: Relationships between cognitive writing processes, text quality, and working memory. British Council, Cambridge English Language Assessment and IDP. [49] Andrea Révész, Marije Michel, and Minjin Lee. 2019. Exploring Second Language Writers' Pausing and Revision Behaviors. Studies in Second Language Acquisition 41, 3 (July 2019), 605–631. [50] Russell Revlin. 2013. Cognition: theory and practice. Worth Publishers, New York, NY. [51] Robert S Rist. 1989. Schema creation in programming. Cognitive Science 13, 3 (1989), 389–414. [52] Robert S Rist. 1995. Program structure and design. Cognitive Science 19, 4 (1995), 507–562. [53] Nathan Rountree, Janet Rountree, Anthony Robins, and Robert Hannah. 2004. Interacting factors that predict success and failure in a CS1 course. ACM SIGCSE Bulletin 36, 4 (2004), 101–104. [54] Joost Schilperoord. 1996. It's about time: Temporal aspects of cognitive processes in text production. Vol. 6. Rodopi. [55] Richard C Thomas, Amela Karahasanovic, and Gregor E Kennedy. 2005. An investigation into keystroke latency metrics as an indicator of programming performance. In Proceedings of the 7th Australasian conference on Computing education-Volume 42. 127–134. [56] Robert L Thorndike. 1953. Who belongs in the family? Psychometrika 18, 4 (1953), 267–276. [57] Markku Tukiainen and Eero Mönkkönen. 2002. Programming Aptitude Testing as a Prediction of Learning to Program. In PPIG. 4. [58] Jeroen JG Van Merrienboer and Fred GWC Paas. 1990. Automation and schema acquisition in learning elementary computer programming: Implications for the design of practice.
Computers in Human Behavior 6, 3 (1990), 273–289. [59] Arto Vihavainen, Juha Helminen, and Petri Ihantola. 2014. How Novices Tackle Their First Lines of Code in an IDE: Analysis of Programming Session Traces. In Proceedings of the 14th Koli Calling International Conference on Computing Education Research (Koli, Finland) (Koli Calling '14). ACM, New York, NY, USA, 109–116. https://doi.org/10.1145/2674683.2674692 [60] Arto Vihavainen, Matti Luukkainen, and Petri Ihantola. 2014. Analysis of source code snapshot granularity levels. In Proceedings of the 15th annual conference on information technology education. 21–26. [61] Arto Vihavainen, Thomas Vikberg, Matti Luukkainen, and Martin Pärtel. 2013. Scaffolding students' learning using test my code. In Proceedings of the 18th ACM conference on Innovation and technology in computer science education. 117–122. [62] Luuk Van Waes and Peter Jan Schellens. 2003. Writing profiles: the effect of the writing mode on pausing and revision patterns of experienced writers. Journal of Pragmatics 35, 6 (June 2003), 829–853. [63] Ronald L Wasserstein and Nicole A Lazar. 2016. The ASA statement on p-values: context, process, and purpose. [64] Christopher Watson, Frederick WB Li, and Jamie L Godwin. 2013. Predicting performance in an introductory programming course by logging and analyzing student programming behavior. In 2013 IEEE 13th international conference on advanced learning technologies. IEEE, 319–323. [65] Laurie Honour Werth. 1986. Predicting student performance in a beginning computer science class. ACM SIGCSE Bulletin 18, 1 (1986), 138–143. [66] Leon E Winslow. 1996. Programming pedagogy—a psychological overview. ACM Sigcse Bulletin 28, 3 (1996), 17–22.