1 Introduction
Though educational software, including intelligent tutors and educational games, is increasingly ubiquitous, evaluating the effectiveness of such software remains a challenge, at both an individual level and a population level. Typically, school districts are most interested in long-term student outcomes, like end-of-year state or national assessments. Though such assessments are generally highly rigorous and carefully designed, their rare occurrence introduces multiple challenges: it is harder to identify students likely to fail such assessments, and it is slow for researchers, educators, product designers, and policy makers to assess the effectiveness of particular educational tools. It has long been recognized (e.g., in the broad investigation with ASSISTments [20, 24, 25, 32]) that the log data generated when students interact with educational software could themselves be used as a form of temporally integrated insight into a student’s state of knowledge. In this work, we investigate whether machine learning predictors trained on students’ logs during their first few hours of use of educational technology can provide useful predictive insight into those students’ end-of-school-year external assessments.
There are multiple reasons such predictors would be helpful. Those predictors might be used within the software itself, to help introduce different forms of pedagogical instruction and support to students who are struggling, or even new challenges to those who are thriving. While many educational tools already involve some aspects of personalization, to our knowledge such tools typically rely on proxy measures of student outcomes (such as performance on a set of internally defined skills). Though there is some prior work relating such proxies, as observed over the school year, to external assessments [30, 37], it is still unclear whether features observed over a full year remain predictive given only a limited window, which is important if these shorter-term signals are used for instructional decisions.
Equally importantly, that short-term information could be provided to teachers to help them better understand the progress of individual students and their class as a whole, potentially informing the need for additional resources or changes in strategy. For example, a teacher might assign an aide to spend more time with a struggling student, or might choose to increase the amount of time spent on math if their whole class is likely to perform poorly. In general, it is unclear whether many educational products and software carefully align short-term observations with automated instructional decisions aimed at maximizing desired long-term outcomes. Indeed, doing so has often been very challenging due to the limited time horizons involved, and to our knowledge there has been limited prior work on using such short-horizon log data to predict delayed long-term outcomes.
Rather, prior work has tackled other aspects related to the problem we consider. First, a number of papers have shown that long-horizon student data logs can be used to help predict external assessments, across multiple intelligent tutoring systems and educational software products [4, 6, 12, 19, 25, 30, 37]. For instance, Ritter et al. [30] and Zheng et al. [37] used students’ log data, demographics, and pre-assessment scores from an academic year to predict standardized test outcomes using machine learning methods, and Feng et al. [12] investigated ASSISTments data through a year to predict a high-stakes state test (MCAS) at the end of the year using log data and a pre-assessment score. Such work has typically shown that log data provides a useful signal for predicting student test scores, focusing on population measures like Pearson correlation and root mean squared error. Recently, perhaps in part motivated by the significant challenges during the COVID-19 pandemic of conducting standard educational assessments while many students were remote, there has been interest in designing much shorter assessments that have similar benefits to existing, much longer assessments. For example, Tran et al. [31] developed a reading assessment that calculates each student’s score at 10-second intervals, which was then correlated against full 3-minute standardized test scores. In such settings, the assumption is that the assessment is being done to capture static student performance, rather than extracting signal during standard usage of a product designed to support student learning.
Most related to our current work are a couple of recent papers that have similarly examined whether very short-term data from student logs can predict delayed student outcomes [2, 10, 11, 14, 21, 22]. However, in those settings the horizon was very short, focused on a single session of interaction with a student. For example, Gao et al. [14] examined how performance on the first problem related to a post-test after 5 problems, and Mao et al. [22] looked at how performance in the first minute related to the final outcome after 20 minutes. While such work has a related motivation, in our work we are interested in much longer time scales, seeking to use a limited amount of likely multi-session data to predict external assessments taken months later to evaluate student learning (as well as the impact of educational support) on a much broader scale. In addition, to our knowledge, prior research that has developed predictors of external student assessments using many-session (e.g., school-year) student tutoring logs has analyzed performance on a single educational tool and/or platform, leaving open questions about trends and similarities across educational technology systems.
We note that there has been significant work on developing surrogate measures of delayed outcomes in social science and economics [5, 13, 17, 36]. Such approaches generally build a model to predict long-term outcomes from a set of short-term outcomes, and estimate long-term treatment effects using the predicted long-term outcomes. Prior work has found that leveraging short-term observed information from humans can provide reliable estimation of long-term outcomes. For example, Athey et al. predicted employment many years after a short-term job training program, using a surrogate of 1-year employment status [5]. Zhang et al. used 14 days of users’ data to develop a surrogate index, which was highly correlated with a directly measured 63-day treatment effect [36]. Surrogate endpoints are also used in clinical settings, when the desired outcome of interest may be substantially delayed (such as 5-year survival rates) and other shorter-term measures are known to be predictive of the long-term outcome, and increasingly in other settings such as finance and recommendation systems [17, 33, 36]. Those lines of work are potentially related to our interests, and motivate us to develop models that use short-term log data to estimate long-term external outcomes in an educational system.
Another challenge when predicting students’ long-term performance is that available observations of students are commonly limited. Prior work has identified some powerful indicators (e.g., demographics, pre-test scores, knowledge components (KCs)) [1, 3, 30, 37] and expert-defined features (e.g., clicker questions, programming error/distance metrics) [9, 18, 21, 26, 27, 34] for predicting students’ outcomes at an early stage. However, those features may not always be available to tutors. Moreover, most prior work has focused on a specific platform or context; there is a lack of investigation into what features might be generally important across contexts, to help tutors in a new context quickly understand the potential future outcomes of students. In addition, while a crucial goal for developing new tools and interventions is broadly enhancing learning outcomes over student populations, it is also pivotal to understand their potential effects on student subgroups sorted by performance at an early stage, since we would like to understand whether varied performers could benefit from new interventions. However, prior work has mainly focused on developing techniques to enhance prediction for specific subgroups (e.g., [16]), whereas we are further interested in evaluating long-term outcome predictions over both the population and subgroups, with respect to the length of the short horizon and the sets of predictive features, using features that may generalize across contexts.
In this work, we investigate the prediction of long-term, external student outcomes using their short-horizon log data, across tutoring systems in three different educational contexts: the Can’t Wait to Learn reading educational technology games (using data from students in Uganda), the iReady middle school math intelligent tutor (used by seventh graders in the United States), and the MATHia middle school math intelligent tutor (used by students in grades 6-8 in the United States). We explore the potential of using short-horizon features that can be generally extracted from log data across contexts, without extensive domain knowledge of the underlying educational software tool, student demographics, or student prior performance data. We do this in part because such data may not always be available, for many reasons, including the common case of new students who transfer to a new district, particularly midway through an academic year. In addition, by focusing on broadly similar features that are likely to be present across many educational platforms, we can evaluate the similarities and differences across settings. Using such features, we compare the performance of three popular machine learning models (i.e., linear regression, support vector regression, and random forest) with respect to the length of horizon and the context.
While prior work has primarily focused on population-level metrics, part of the motivation for such work is the potential to help support students who are expected to face major challenges months later on an external assessment if their trajectory continues, or to celebrate or further challenge a student who is already showing signs of strong expected future performance. To investigate this further, we also analyze the quality of the short-horizon estimates over subgroups of students sorted by performance, for a more thorough understanding of prediction performance. Moreover, we examine the effects of pre-assessments on predictions over both the population and subgroups, with different sets of features, to understand whether pre-assessment or pre-test data, when available, is similar, different, or complementary to student log data. Specifically, we explore the following research questions:
(1) Can we consistently, across multiple educational tools, obtain a short-horizon, log-data-only predictor of student long-term outcomes that provides significant predictive power on an external, much delayed educational assessment? How does its accuracy compare to using the full-horizon log data, and does performance as a function of horizon vary by dataset/setting?
(2) Do the machine learning algorithms used to form the predictors significantly impact the resulting accuracy of the external assessment predictor?
(3) Is there a stable set of log data features across domains and datasets that is needed to form accurate short-horizon predictors of long-term outcomes? Are multiple features important for improving accuracy?
(4) What is the resulting quality of the short-horizon estimates and how does it differ across subgroups of students sorted by performance? Are we systematically better or worse at predicting higher/lower performers?
(5) When pre-assessments are available, is the predictive quality of log data equivalent to this form of information, and do we see additive gains from combining both?
3 Results
3.1 RQ1: Short-Horizon Log Data for Predicting Long-Term Outcomes across Multiple Domains
To understand the potential signal from using student log data given various lengths of usage since the start of data collection, we train and test each machine learning prediction model with cumulative log data over varied horizon lengths, i.e., 1, 2, 3, 4, 5, and 12 hours, as well as the full length of data (denoted as H). The prediction models’ RMSE and R2 are presented in Figure 1. The x-axis represents the horizon length of the cumulative log data used to train models, and the y-axis represents the prediction results averaged over 5-fold cross-validation.
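As a concrete illustration of this evaluation protocol, the following is a minimal sketch in Python with scikit-learn, assuming a per-student feature matrix X_H has already been extracted from the first H hours of each student’s log data and that y holds the external assessment scores; the variable names are hypothetical placeholders, and this is not the exact pipeline used in our experiments.

```python
# A minimal sketch of horizon-based evaluation: average RMSE and R2 over
# 5-fold cross-validation for a given model and feature matrix.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_horizon(X, y, model_factory, n_splits=5, seed=0):
    """Return RMSE and R2 averaged over k-fold cross-validation."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rmses, r2s = [], []
    for train_idx, test_idx in kf.split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
        r2s.append(r2_score(y[test_idx], pred))
    return float(np.mean(rmses)), float(np.mean(r2s))

# e.g., for each horizon H in {1, 2, 3, 4, 5, 12, full}:
#   rmse, r2 = evaluate_horizon(X_H, y, lambda: RandomForestRegressor(random_state=0))
```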
Interestingly, we observe that for all three educational products and datasets, there exists a machine learning model built using a short horizon of log data that is nearly or equally as effective as the best machine learning model built using the full-horizon log data.
The second thing to note is that the short-horizon time period with the best performance (as measured by RMSE and R2) varies slightly by dataset/setting. For example, for CWTLReading, the random forest machine learning model has notably better performance with 5 hours of interactive usage (0.13 RMSE and 0.34 R2) but worse performance using log data from 12 hours and the full log data. In MATHia, two hours of log data is the best of the early horizons (up to 5 hours) and exceeds the performance of the full-horizon log data, and for iReady the strongest early performance (measured by R2 and RMSE) among the first 5 hours is obtained at 3 hours, though the performance is quite similar over the full range.
We also note that performance does not always improve monotonically with longer-horizon log data: machine learning predictors for both CWTLReading and MATHia seem to slightly decrease in performance at the longest horizon lengths. It is known from the economics surrogate literature that using a surrogate measure, when surrogacy is satisfied, can result in a lower variance estimate [5], since additional data beyond the surrogate window can introduce additional noise. Though we would not expect surrogacy to necessarily always hold in real educational data, this may be one reason why we sometimes see higher-accuracy estimates using a shorter horizon.
3.2 RQ2: Impacts of Selected Machine Learning Algorithms on Resulting Accuracy
We next assess whether the machine learning algorithm used has a significant impact on the resulting performance, which we visualize in Figure 1. In general, all machine learning algorithms are quite similar. For CWTLReading, random forest (RF) achieves the lowest RMSE across different horizons, and achieves the highest overall R2. Support vector regression (SVR) was slightly less consistent, and linear regression (LR) performs more stably across horizon lengths and performs better than random forest at some early horizons in terms of R2. For MATHia, RMSE for all three models (LR, SVR, RF) is similar, and the difference in R2 between the three methods is minimal on short-horizon log data. For iReady, random forest consistently achieves the lowest RMSE and highest R2, followed by support vector regression, and then linear regression.
All differences are relatively minor, though random forest overall seems to have a slight but noticeable benefit on the CWTLReading and iReady datasets. The baseline of simply predicting the average score in the training data performs poorly across all horizons and all datasets.
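For concreteness, the sketch below shows one way this model comparison could be set up, reusing the hypothetical evaluate_horizon helper and X_short/y arrays from the earlier snippet; the hyperparameters shown are placeholders rather than the exact settings used in our experiments, and the mean-predicting baseline is implemented with scikit-learn’s DummyRegressor.

```python
# A sketch of comparing LR, SVR, RF, and a mean-predicting baseline on a
# short-horizon feature matrix; X_short and y are hypothetical arrays.
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor

models = {
    "LR": lambda: LinearRegression(),
    "SVR": lambda: SVR(),
    "RF": lambda: RandomForestRegressor(n_estimators=100, random_state=0),
    "Mean baseline": lambda: DummyRegressor(strategy="mean"),
}
for name, factory in models.items():
    rmse, r2 = evaluate_horizon(X_short, y, factory)
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```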
We also note the predictive models have lower accuracy on CWTLReading than on the other two datasets. A potential reason is that the records per student in CWTLReading are relatively sparser than in the other two datasets. On average, the numbers of logged events for the best short horizon and the full data are 270 (5hr) and 1245 for CWTLReading, 586 (2hr) and 8738 for MATHia, and 1162 (3hr) and 5042 for iReady. This likely reflects differences across educational technology designs and instrumentation, which may impact the density of data available for prediction.
3.3 RQ3: Generalizability of Important Log Data Features across Domains for Long-Term Outcomes Prediction
Using interpretable features from log data for our machine learning predictors has the potential to capture interactions that reflect underlying learning processes of students, and allow us to understand how such processes may be similar or different across platforms and temporal periods. Towards such insights, we first identify features that are important in our machine learning model, and then also consider if we can use such features to formulate a much simpler model which may replace our more complex machine learning models.
Important features shared across educational contexts. To quantify important features, we consider the top 5 features selected by a random forest with 100 decision trees (maximum depth of 10) trained on each dataset, using both a selected short horizon and the full data. The results are shown in Table 1. Overall, there is significant overlap across contexts, and across both horizon lengths within each context. Specifically, across all three datasets, the percentage of times the student succeeded at a problem and the average number of times a student attempted a problem are frequently selected as important features, in some cases being the most important feature for a dataset. We also note that within a dataset, the important features often overlap substantially between the short- and full-horizon settings. For example, the average number of attempts per problem and counts of when a student took a long time to complete a problem are top features in CWTLReading using both the short (i.e., 5 hours) and full horizon. Two features, the percentage of times the student succeeded at a problem and the average time to finish a successful problem, are selected in the top 5 features for MATHia using both the short (i.e., 2 hours) and full horizon. Three features, including the average number of times a student attempted a problem, the average time to finish a successful problem, and how long a student persisted unsuccessfully on a problem, are selected in the top 5 features for iReady using both the short (i.e., 3 hours) and full horizon. That the features are generally consistent across datasets and horizon settings suggests that one might build a general predictive model across educational contexts using the same feature set.
Prediction effects using a single feature vs. the entire set of features. Motivated in part by the observation in the last paragraph, we investigated the effectiveness of using a single feature compared to using the full set of features extracted from log data. We train linear regression using the percent success per problem and the average attempts per problem, respectively, using the short horizons noted in prior sections (i.e., 5 hours for CWTLReading, 2 hours for MATHia, 3 hours for iReady). Table 2 shows the comparison in terms of RMSE and R2. Our results show that while a single-feature machine learning model still performs much better than the baseline of predicting the average performance in a held-out set, our machine learning models using the full set of available log features outperform it, in terms of both RMSE and R2. This suggests more complex machine learning models can offer a useful advantage.
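This comparison could be set up along the lines of the sketch below, where pct_success is a hypothetical column name for the percent-success feature, X_df is assumed to be a pandas DataFrame of short-horizon log features, and evaluate_horizon is the hypothetical helper from the earlier snippet.

```python
# A sketch of the single-feature vs. full-feature comparison with linear
# regression; column names and data objects are hypothetical.
from sklearn.linear_model import LinearRegression

single = evaluate_horizon(X_df[["pct_success"]].values, y, LinearRegression)
full = evaluate_horizon(X_df.values, y, LinearRegression)
print("single feature (RMSE, R2):", single)
print("all features  (RMSE, R2):", full)
```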
3.4 RQ4: Quality of Short-Horizon Estimates on Performance-Based Student Subgroups
As mentioned, though population-level prediction accuracy is the most commonly reported measure, another important motivation for predicting outcomes using short-horizon data is to help understand and inform additional support and challenge for students in need. We divide students into performance subgroups using two strategies: 1) a quintile division sorted by their test outcomes (represented by Q1 to Q5 in ascending order); 2) context-specific binary clusters such as pass and fail (which we refer to as ‘on track’ and ‘not on track’, respectively), defined by the given assessment.
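The two subgrouping strategies could be implemented roughly as in the sketch below, assuming y holds post-test scores; the score-based passing threshold is only a placeholder, since for MATHia the binary split is actually defined by the assessment’s achievement categories rather than a score cutoff.

```python
# A sketch of the two subgrouping strategies; 'passing_threshold' is a
# hypothetical placeholder for the assessment-defined cutoff.
import pandas as pd

scores = pd.Series(y)
quintile = pd.qcut(scores, q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
passing_threshold = 0.6  # placeholder; the real split is assessment-defined
on_track = scores >= passing_threshold
```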
Quintile subgroups sorted by performance. Figure 2 shows heat maps of confusion matrices that represent the prediction performance of the three machine learning methods (linear regression, support vector regression, and random forest) on the three datasets using the short horizon. The x-axis shows the true performance subgroups in the training set (sorted by post-test scores in ascending order), and the y-axis shows the predicted performance groups. Overall, linear regression (LR) performs best in general, with a stronger diagonal on the confusion matrices. Across all three methods, accuracy is worst for students predicted in the middle (i.e., Q3 is frequently over- and underestimated).
The highest precision is for students at either extreme. For example, for CWTLReading, when predicting students in the lowest performance quintile (Q1), the best model (random forest) is accurate 77% of the time when it predicts someone will be in Q1, and is accurate 72% of the time when it predicts someone will be in Q5. Random forest is also best at Q1 and Q5 for iReady, with an accuracy of 57% at predicting Q1 and 88% at predicting Q5. For MATHia, support vector regression has a slight edge over the other models in predictive accuracy for its extreme predictions, reaching an accuracy of 70% for predicted Q1 and around 61% for Q5. We note that in all cases the other models do somewhat worse, but similarly so. The models are most inaccurate at predicting quintiles 2 to 5. Note, though, that the ranges of these quintiles are not evenly spaced (CWTLReading: [0.11, 0.42), [0.42, 0.53), [0.53, 0.62), [0.62, 0.74), [0.74, 1]; MATHia: [0, 0.37), [0.37, 0.48), [0.48, 0.55), [0.55, 0.63), [0.63, 1]; iReady: [0, 0.28), [0.28, 0.42), [0.42, 0.52), [0.52, 0.64), [0.64, 1]).
While we generally see correlation with the true quintile, the models often overestimate poor performance and underestimate high performance, suggesting they are not yet well suited for individual- or subgroup-level performance predictions. However, the models’ relatively high precision at the extremes suggests they might be able to highlight when students could benefit from additional challenge or support, though further investigation and validation is key before employing such predictions for any important interventions.
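As an illustration of this kind of subgroup analysis, the sketch below maps continuous predictions into quintile bins derived from the true scores and reports the confusion matrix and per-quintile precision; y_true and y_pred are hypothetical arrays, and this is a sketch of the general procedure rather than the exact analysis code.

```python
# A sketch of binning continuous predictions into quintiles and computing the
# confusion matrix and per-quintile precision.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

edges = np.quantile(y_true, [0.2, 0.4, 0.6, 0.8])  # internal quintile edges
q_true = np.digitize(y_true, edges)                # bins 0..4 correspond to Q1..Q5
q_pred = np.digitize(y_pred, edges)
cm = confusion_matrix(q_true, q_pred, labels=range(5))
precision = precision_score(q_true, q_pred, labels=range(5),
                            average=None, zero_division=0)
print(cm)
print({f"Q{i+1}": round(float(p), 2) for i, p in enumerate(precision)})
```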
Predicting subgroups on or off track. In MATHia, the end-of-year state test also has 5 achievement categories, where levels 3 to 5 correspond to a passing score. Thus, we cluster students into two subgroups: an ‘on track’ subgroup whose achievement categories are in levels 3 to 5, and a ‘not on track’ subgroup whose achievement categories are in levels 1 to 2. Prediction is conducted over the original achievement categories, and we then calculate a confusion matrix based on the two clusters. Using the short-horizon (2hr) log data, we find that the model correctly classifies 1541 students as "on track", correctly classifies 423 students as "not on track", and misclassifies 680 students. In particular, the model has low recall for the ‘not on track’ subgroup (0.44, correctly predicted ‘not on track’ divided by actual ‘not on track’), indicating it misses a large number of students who are not on track. On the plus side, similar to our results for the quintile analysis, the model has reasonable precision: if it predicts a student is not on track, 74% of those instances did have a post-assessment indicating they were not on track.
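The recall and precision figures above follow directly from the binary confusion matrix, as in the brief sketch below, assuming hypothetical boolean arrays where True denotes ‘not on track’ (achievement levels 1-2).

```python
# A sketch of recall and precision for the 'not on track' class.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(true_not_on_track, pred_not_on_track).ravel()
recall = tp / (tp + fn)     # correctly flagged / all students actually not on track
precision = tp / (tp + fp)  # correctly flagged / all students flagged not on track
```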
3.5 RQ5: The Effects of Pre-Assessment Scores
Prediction over the population using pre-assessment scores combined with a set of log data features. Tables 3 & 4 show results for: 1) using pre-test scores only; 2) combining pre-test scores with log data features over the short horizon; and 3) combining pre-test scores with full-horizon log data features. In general, the pre-assessment or pre-test score is a powerful indicator of long-term outcomes; it can stand alone and outperforms using log data features only. With pre-test scores, all models show similar performance, with RMSE around 0.14 for CWTLReading and around 0.12 for iReady. RMSE is not significantly improved compared to using log data features in CWTLReading, but R2 values are higher when the pre-test score is included, peaking at 0.44 for LR and SVR in Short+Pre-Test (CWTLReading) and 0.66 for SVR in Short+Pre-Test (iReady), which indicates a better-fitting model when pre-test scores are integrated. Overall, unsurprisingly, it is beneficial to include a pre-test score in the model if available.
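Combining the two sources of information amounts to appending the pre-test score as an additional feature column, as in the sketch below; X_short and pre_test are hypothetical arrays from the earlier snippets, and SVR is shown only as one of the models compared.

```python
# A sketch of augmenting short-horizon log features with a pre-test score
# column before fitting; inputs are hypothetical.
import numpy as np
from sklearn.svm import SVR

X_with_pre = np.hstack([X_short, np.asarray(pre_test).reshape(-1, 1)])
rmse, r2 = evaluate_horizon(X_with_pre, y, SVR)
```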
We further investigate the effects of combining a single feature with pre-test scores. Table 5 shows the prediction results on CWTLReading and iReady using LR. The pre-test can substantially enhance LR’s performance with a single feature. Even when the two features have differing effects on model performance (e.g., 0.197 RMSE using PS only vs. 0.179 RMSE using AA only), combining each with the pre-test can close the gap by bringing prediction performance to a similar level (e.g., 0.123 RMSE using PS+Pre-Test vs. 0.122 RMSE using AA+Pre-Test). This indicates the effectiveness of pre-test scores in prediction.
Prediction over subgroups using pre-assessment scores combined with a set of log data features. We calculate confusion matrices over the quintile subgroups, both with and without pre-test scores, on the selected short horizons. Figure 3 shows the heat maps of the confusion matrices for CWTLReading and iReady. For the lower performers Q1 and Q2 in CWTLReading, using log data features with the pre-test (Short+Pre-Test) concentrates the overestimation errors closer to the true quintile (with fewer cases of predicting Q1 as Q3-Q5). For the highest performers, Q5, Short+Pre-Test similarly concentrates the underestimation errors (with fewer cases of predicting Q5 as Q1-Q3). For the lowest performers, Q1, in iReady, similar to what we observe in CWTLReading, adding the pre-test to the log data features helps LR overestimate less. And for the highest performers, Q5, in iReady, the pre-test also helps LR underestimate less (i.e., fewer predictions of Q5 as Q1-Q3).
To summarize, if a pre-test score is available, one can use it alone or combine it with a single feature. If no pre-test score is available and only log data are present, there is still a gain from using more than one feature extracted from the log data.
4 Discussions
From the performance subgroup analysis, we observed that the three machine learning models are not highly accurate at classifying students’ likely quintile assessment performance or the binary "on track" measure, but the precision of the models is quite good when they suggest someone is in danger of performing very poorly, or is likely to perform very well. This suggests interesting future directions for creating systems that could alert educators or provide additional support or challenge in this setting. Another interesting direction is to see whether there are additional features or models that could further improve predictive accuracy. However, before investigating such directions, we strongly believe that additional testing and validation would be crucial, particularly to avoid accidentally causing worse outcomes for students, should they be labeled inaccurately or if some teachers or other stakeholders deprioritize such students because they think those students are expected to fail. We also note that stakeholders may wish to understand and interpret the predictions of such models. While linear regression is the most interpretable approach in this work, the other models require additional effort to become interpretable to stakeholders.
In this study, we use fixed usage time to define short horizons, i.e., we use the cumulative interaction time of each student until they reach a threshold of short-horizon usage (e.g., 1 hour). We do not consider potential usage time, i.e., the situation where students share the same potential usage time but do not have productive usage during that time, which may be another important indicator for long-term outcomes. In addition, we note that fixed usage time may result in varied course content coverage across students. For example, imagine the length of a class is 3 hours (i.e., the potential usage time is 3 hours), but some students only actively engage for 2 hours. If we set a fixed usage time of 3 hours, the data considered for those students will contain their interactions from the first hour of the next class. These issues would be an interesting direction for further work.
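For clarity, the fixed-usage-time truncation could be sketched as follows, assuming a chronologically ordered event log with hypothetical columns student_id, timestamp, and duration_sec; this is a simplified illustration of the idea rather than our exact preprocessing code.

```python
# A minimal sketch of truncating each student's log to their first H hours of
# cumulative interaction time; column names are hypothetical.
import pandas as pd

def truncate_to_horizon(logs: pd.DataFrame, horizon_hours: float) -> pd.DataFrame:
    logs = logs.sort_values(["student_id", "timestamp"])
    cum_hours = logs.groupby("student_id")["duration_sec"].cumsum() / 3600.0
    return logs[cum_hours <= horizon_hours]
```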
We also note that when additional features about student demographics are available, or more detailed information specific to the educational product (such as knowledge components), this information can potentially further enhance predictive accuracy [30, 37]. We did not consider those features in this study, because they were not available and may not always be available in the future, and because our primary interest was a general study exploring how short-term features that can be extracted across many educational contexts can be used to predict long-term external assessments.