1 Introduction
Though educational software, including intelligent tutors and educational games, is increasingly ubiquitous, evaluating the effectiveness of such software remains a challenge, at both an individual level and a population level. Typically, school districts are most interested in long-term student outcomes, like end-of-year state or national assessments. Though such assessments are generally highly rigorous and carefully designed, their rare occurrence introduces multiple challenges: it is harder to identify students likely to fail such assessments, and it is slow for researchers, educators, product designers, and policy makers to assess the effectiveness of particular educational tools. It has long been recognized (e.g., in the broad investigation with ASSISTments [20, 24, 25, 32]) that the log data generated when students interact with educational software could themselves be used as a form of temporally integrated insight into a student’s state of knowledge. In this work, we investigate whether machine learning predictors trained on students’ logs during their first few hours of use of educational technology can provide useful predictive insight into those students’ end-of-school-year external assessments.
There are multiple reasons such predictors would be helpful. Those predictors might be used within the software itself, to help introduce different forms of pedagogical instruction and support to students who are struggling, or even new challenges to those who are thriving. While many educational tools already involve some aspects of personalization, to our knowledge such tools typically rely on proxy measures of student outcomes (such as performance on a set of internally defined skills). Though there is some prior work relating such proxies, as observed over the school year, to external assessments [30, 37], it is still unclear whether features observed over a full year remain predictive given only a limited window, which is important if these shorter-term signals are used for instructional decisions.
Equally importantly, that short-term information could be provided to teachers to help them better understand the progress of individual students and their class as a whole, potentially informing the need for additional resources or changes in strategy. For example, a teacher might assign an aide to spend more time with a struggling student, or might choose to increase the amount of time spent on math if their whole class is likely to perform poorly. In general, it is unclear whether many educational products and software carefully align short-term observations with automated instructional decisions aimed at maximizing desired long-term outcomes. Indeed, doing so has often been very challenging due to the limited time horizons involved, and to our knowledge there has been limited prior work on using such short-horizon log data to predict delayed long-term outcomes.
Rather, prior work has tackled other aspects related to the problem we consider. First, a number of papers have shown that long-horizon student data logs can be used to help predict external assessments, across multiple intelligent tutoring systems and educational software products [4, 6, 12, 19, 25, 30, 37]. For instance, Ritter et al. [30] and Zheng et al. [37] used students’ log data, demographics, and pre-assessment scores from an academic year to predict standardized test outcomes using machine learning methods, and Feng et al. [12] investigated ASSISTments data through a year to predict a high-stakes state test (MCAS) at the end of the year using log data and a pre-assessment score. Such work has typically shown that log data provides a useful signal for predicting student test scores, focusing on population measures like Pearson correlation and root mean squared error. Recently, perhaps in part motivated by the significant challenges during the COVID-19 pandemic of conducting standard educational assessments while many students were remote, there has been interest in designing much shorter assessments that have similar benefits to existing, much longer assessments. For example, Tran et al. [31] developed a reading assessment that calculates each student’s score at 10-second intervals, which was then correlated against full 3-minute standardized test scores. In such settings, the assumption is that the assessment is being done to capture static student performance, rather than extracting signal during standard usage of a product designed to support student learning.
Most related to our current work are a couple of recent papers that have similarly examined whether very short-term data from student logs can predict delayed student outcomes [2, 10, 11, 14, 21, 22]. However, in those settings the horizon was very short, focused on a single session of interaction with a student. For example, Gao et al. [14] examined how performance on the first problem related to a post-test after 5 problems, and Mao et al. [22] looked at how performance in the first minute related to the final outcome after 20 minutes. While such work has a related motivation, in our work we are interested in much longer time scales, seeking to use a limited amount of likely multi-session data to predict external assessments taken months later to evaluate student learning (as well as the impact of educational support) on a much broader scale. In addition, to our knowledge, prior research that has developed predictors of external student assessments using many-session (e.g., school-year) student tutoring logs has analyzed performance on a single educational tool and/or platform, leaving open questions about trends and similarities across educational technology systems.
We note that there has been significant work on developing surrogate measures of delayed outcomes in social science and economics [5, 13, 17, 36]. Such approaches generally build a model to predict long-term outcomes from a set of short-term outcomes, and estimate long-term treatment effects using the predicted long-term outcomes. Prior work has found that leveraging short-term observed information from humans can provide reliable estimation of long-term outcomes. For example, Athey et al. predicted employment many years after a short-term job training program, using a surrogate of 1-year employment status [5]. Zhang et al. used 14 days of users’ data to develop a surrogate index, which was highly correlated with a directly measured 63-day treatment effect [36]. Surrogate endpoints are also used in clinical settings, when the desired outcome of interest may be substantially delayed (such as 5-year survival rates) and other shorter-term measures are known to be predictive of the long-term outcome, and increasingly in other settings such as finance and recommendation systems [17, 33, 36]. Those lines of work are potentially related to our interests, and motivate us to develop models that use short-term log data to estimate long-term external outcomes in an educational system.
Another challenge when predicting students’ long-term performance is that available observations of students are commonly limited. Prior work has identified some powerful indicators (e.g., demographics, pre-test scores, knowledge components (KCs)) [1, 3, 30, 37] and expert-defined features (e.g., clicker questions, programming error/distance metrics) [9, 18, 21, 26, 27, 34] for predicting students’ outcomes at an early stage. However, those features may not always be available to tutors. Moreover, most prior work has focused on a specific platform or context; there is a lack of investigation into what features might be generally important across contexts, to help tutors in a new context quickly understand the potential future outcomes of students. In addition, while a crucial goal for developing new tools and interventions is broadly enhancing learning outcomes over student populations, it is also pivotal to understand their potential effects on student subgroups sorted by performance at an early stage, since we would like to understand whether varied performers could benefit from new interventions. However, prior work has mainly focused on developing techniques to enhance prediction for specific subgroups (e.g., [16]), whereas we are further interested in evaluating long-term outcome predictions over both the population and subgroups, with respect to the length of the short horizon and the sets of predictive features, using features that may generalize across contexts.
In this work, we investigate the prediction of long-term, external student outcomes using their short-horizon log data, across tutoring systems in three different educational contexts: the Can’t Wait to Learn reading educational technology games (using data from students in Uganda), the iReady middle school math intelligent tutor (used by seventh graders in the United States), and the MATHia middle school math intelligent tutor (used by students in grades 6-8 in the United States). We explore the potential of using short-horizon features that can be generally extracted from log data across contexts, without extensive domain knowledge of the underlying educational software tool, student demographics, or student prior performance data. We do this in part because such data may not always be available, for many reasons, including the common case of new students who transfer to a new district, particularly midway through an academic year. In addition, by focusing on broadly similar features that are likely to be present across many educational platforms, we can evaluate the similarities and differences across settings. Using such features, we compare the performance of three popular machine learning models (i.e., linear regression, support vector regression, and random forest) with respect to the length of horizon and the context.
While prior work has primarily focused on population-level metrics, part of the motivation for such work is the potential to help support students who are expected to face major challenges months later on an external assessment if their trajectory continues, or to celebrate or further challenge a student who is already showing signs of strong expected future performance. To investigate this further, we also analyze the quality of the short-horizon estimates over subgroups of students sorted by performance, for a more thorough understanding of prediction performance. Moreover, we examine the effects of pre-assessments on predictions over both the population and subgroups, with different sets of features, to understand whether pre-assessment or pre-test data, when available, is similar, different, or complementary to student log data. Specifically, we explore the following research questions:
(1) Can we consistently, across multiple educational tools, obtain a short-horizon, log-data-only predictor of student long-term outcomes that provides significant predictive power on an external, much delayed educational assessment? How does its accuracy compare to using the full-horizon log data, and does performance as a function of horizon vary by dataset/setting?
(2) Do the machine learning algorithms used to form the predictors significantly impact the resulting accuracy of the external assessment predictor?
(3) Is there a stable set of log data features across domains and datasets that is needed to form accurate short-horizon predictors of long-term outcomes? Are multiple features important for improving accuracy?
(4) What is the resulting quality of the short-horizon estimates and how does it differ across subgroups of students sorted by performance? Are we systematically better or worse at predicting higher/lower performers?
(5) When pre-assessments are available, is the predictive quality of log data equivalent to this form of information, and do we see additive gains from combining both?
3 Results
3.1 RQ1: Short-Horizon Log Data for Predicting Long-Term Outcomes across Multiple Domains
To understand the potential signal from using student log data given various lengths of usage since the start of data collection, we train and test each machine learning prediction model with cumulative log data over varied horizon lengths, i.e., 1, 2, 3, 4, 5, and 12 hours, as well as the full length of data (denoted as H). The prediction models’ RMSE and R2 are presented in Figure 1. The x-axis represents the horizon length of the cumulative log data used to train models, and the y-axis represents the prediction results averaged over 5-fold cross-validation.
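As a concrete illustration of this evaluation protocol, the following is a minimal sketch in Python with scikit-learn, assuming a per-student feature matrix X_H has already been extracted from the first H hours of each student’s log data and that y holds the external assessment scores; the variable names are hypothetical placeholders, and this is not the exact pipeline used in our experiments.

```python
# A minimal sketch of horizon-based evaluation: average RMSE and R2 over
# 5-fold cross-validation for a given model and feature matrix.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_horizon(X, y, model_factory, n_splits=5, seed=0):
    """Return RMSE and R2 averaged over k-fold cross-validation."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rmses, r2s = [], []
    for train_idx, test_idx in kf.split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
        r2s.append(r2_score(y[test_idx], pred))
    return float(np.mean(rmses)), float(np.mean(r2s))

# e.g., for each horizon H in {1, 2, 3, 4, 5, 12, full}:
#   rmse, r2 = evaluate_horizon(X_H, y, lambda: RandomForestRegressor(random_state=0))
```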
Interestingly, we observe that for all three educational products and datasets, there exists a machine learning model built using a short horizon of log data that is nearly or equally as effective as the best machine learning model built using the full-horizon log data.
The second thing to note is that the short-horizon time period with the best performance (as measured by RMSE and R2) varies slightly by dataset/setting. For example, for CWTLReading, the random forest machine learning model has notably better performance with 5 hours of interactive usage (0.13 RMSE and 0.34 R2) but worse performance using log data from 12 hours and the full log data. In MATHia, two hours of log data is the best of the early horizons (up to 5 hours) and exceeds the performance of the full-horizon log data, and for iReady the strongest early performance (measured by R2 and RMSE) among the first 5 hours is obtained at 3 hours, though the performance is quite similar over the full range.
We also note that performance does not always improve monotonically with longer-horizon log data: machine learning predictors for both CWTLReading and MATHia seem to slightly decrease in performance at the longest horizon lengths. It is known from the economics surrogate literature that using a surrogate measure, when surrogacy is satisfied, can result in a lower variance estimate [5], since additional data beyond the surrogate window can introduce additional noise. Though we would not expect surrogacy to necessarily always hold in real educational data, this may be one reason why we sometimes see higher-accuracy estimates using a shorter horizon.
3.2 RQ2: Impacts of Selected Machine Learning Algorithms on Resulting Accuracy
We next assess whether the machine learning algorithm used has a significant impact on the resulting performance, which we visualize in Figure 1. In general, all machine learning algorithms are quite similar. For CWTLReading, random forest (RF) achieves the lowest RMSE across different horizons, and achieves the highest overall R2. Support vector regression (SVR) was slightly less consistent, and linear regression (LR) performs more stably across horizon lengths and performs better than random forest at some early horizons in terms of R2. For MATHia, RMSE for all three models (LR, SVR, RF) is similar, and the difference in R2 between the three methods is minimal on short-horizon log data. For iReady, random forest consistently achieves the lowest RMSE and highest R2, followed by support vector regression, and then linear regression.
All differences are relatively minor, though random forest overall seems to have a slight but noticeable benefit on the CWTLReading and iReady datasets. The baseline of simply predicting the average score in the training data performs poorly across all horizons and all datasets.
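For concreteness, the sketch below shows one way this model comparison could be set up, reusing the hypothetical evaluate_horizon helper and X_short/y arrays from the earlier snippet; the hyperparameters shown are placeholders rather than the exact settings used in our experiments, and the mean-predicting baseline is implemented with scikit-learn’s DummyRegressor.

```python
# A sketch of comparing LR, SVR, RF, and a mean-predicting baseline on a
# short-horizon feature matrix; X_short and y are hypothetical arrays.
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor

models = {
    "LR": lambda: LinearRegression(),
    "SVR": lambda: SVR(),
    "RF": lambda: RandomForestRegressor(n_estimators=100, random_state=0),
    "Mean baseline": lambda: DummyRegressor(strategy="mean"),
}
for name, factory in models.items():
    rmse, r2 = evaluate_horizon(X_short, y, factory)
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
```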
We also note the predictive models have lower accuracy on CWTLReading than on the other two datasets. A potential reason is that the records per student in CWTLReading are relatively sparser than in the other two datasets. On average, the numbers of logged events for the best short horizon and the full data are 270 (5hr) and 1245 for CWTLReading, 586 (2hr) and 8738 for MATHia, and 1162 (3hr) and 5042 for iReady. This likely reflects differences across educational technology designs and instrumentation, which may impact the density of data available for prediction.
3.3 RQ3: Generalizability of Important Log Data Features across Domains for Long-Term Outcomes Prediction
Using interpretable features from log data for our machine learning predictors has the potential to capture interactions that reflect underlying learning processes of students, and allow us to understand how such processes may be similar or different across platforms and temporal periods. Towards such insights, we first identify features that are important in our machine learning model, and then also consider if we can use such features to formulate a much simpler model which may replace our more complex machine learning models.
Important features shared across educational contexts. To quantify important features, we consider the top 5 features selected by a random forest with 100 decision trees (maximum depth of 10) trained on each dataset, using both a selected short horizon and the full data. The results are shown in Table 1. Overall, there is significant overlap across contexts, and across both horizon lengths within each context. Specifically, across all three datasets, the percentage of times the student succeeded at a problem and the average number of times a student attempted a problem are frequently selected as important features, in some cases being the most important feature for a dataset. We also note that within a dataset, the important features often overlap substantially between the short- and full-horizon settings. For example, the average number of attempts per problem and counts of when a student took a long time to complete a problem are top features in CWTLReading using both the short (i.e., 5 hours) and full horizon. Two features, the percentage of times the student succeeded at a problem and the average time to finish a successful problem, are selected in the top 5 features for MATHia using both the short (i.e., 2 hours) and full horizon. Three features, including the average number of times a student attempted a problem, the average time to finish a successful problem, and how long a student persisted unsuccessfully on a problem, are selected in the top 5 features for iReady using both the short (i.e., 3 hours) and full horizon. That the features are generally consistent across datasets and horizon settings suggests that one might build a general predictive model across educational contexts using the same feature set.
Prediction effects using a single feature vs. the entire set of features. Motivated in part by the observation in the last paragraph, we investigated the effectiveness of using a single feature compared to using the full set of features extracted from log data. We train linear regression using the percent success per problem and the average attempts per problem, respectively, using the short horizons noted in prior sections (i.e., 5 hours for CWTLReading, 2 hours for MATHia, 3 hours for iReady). Table 2 shows the comparison in terms of RMSE and R2. Our results show that while a single-feature machine learning model still performs much better than the baseline of predicting the average performance in a held-out set, our machine learning models using the full set of available log features outperform it, in terms of both RMSE and R2. This suggests more complex machine learning models can offer a useful advantage.
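This comparison could be set up along the lines of the sketch below, where pct_success is a hypothetical column name for the percent-success feature, X_df is assumed to be a pandas DataFrame of short-horizon log features, and evaluate_horizon is the hypothetical helper from the earlier snippet.

```python
# A sketch of the single-feature vs. full-feature comparison with linear
# regression; column names and data objects are hypothetical.
from sklearn.linear_model import LinearRegression

single = evaluate_horizon(X_df[["pct_success"]].values, y, LinearRegression)
full = evaluate_horizon(X_df.values, y, LinearRegression)
print("single feature (RMSE, R2):", single)
print("all features  (RMSE, R2):", full)
```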
3.4 RQ4: Quality of Short-Horizon Estimates on Performance-Based Student Subgroups
As mentioned, though population-level prediction accuracy is the most commonly reported measure, another important motivation for predicting outcomes using short-horizon data is to help understand and inform additional support and challenge for students in need. We divide students into performance subgroups using two strategies: 1) a quintile division sorted by their test outcomes (represented by Q1 to Q5 in ascending order); 2) context-specific binary clusters such as pass and fail (which we refer to as ‘on track’ and ‘not on track’, respectively), defined by the given assessment.
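The two subgrouping strategies could be implemented roughly as in the sketch below, assuming y holds post-test scores; the score-based passing threshold is only a placeholder, since for MATHia the binary split is actually defined by the assessment’s achievement categories rather than a score cutoff.

```python
# A sketch of the two subgrouping strategies; 'passing_threshold' is a
# hypothetical placeholder for the assessment-defined cutoff.
import pandas as pd

scores = pd.Series(y)
quintile = pd.qcut(scores, q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
passing_threshold = 0.6  # placeholder; the real split is assessment-defined
on_track = scores >= passing_threshold
```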
Quintile subgroups sorted by performance. Figure 2 shows heat maps of confusion matrices that represent the prediction performance of the three machine learning methods (linear regression, support vector regression, and random forest) on the three datasets using the short horizon. The x-axis shows the true performance subgroups in the training set (sorted by post-test scores in ascending order), and the y-axis shows the predicted performance groups. Overall, linear regression (LR) performs best in general, with a stronger diagonal on the confusion matrices. Across all three methods, accuracy is worst for students predicted in the middle (i.e., Q3 is frequently over- and underestimated).
The highest precision is for students at either extreme. For example, for CWTLReading, when predicting students in the lowest performance quintile (Q1), the best model (random forest) is accurate 77% of the time when it predicts someone will be in Q1, and is accurate 72% of the time when it predicts someone will be in Q5. Random forest is also best at Q1 and Q5 for iReady, with an accuracy of 57% at predicting Q1 and 88% at predicting Q5. For MATHia, support vector regression has a slight edge over the other models in predictive accuracy for its extreme predictions, reaching an accuracy of 70% for predicted Q1 and around 61% for Q5. We note that in all cases the other models do somewhat worse, but similarly so. The models are most inaccurate at predicting quintiles 2 to 5. Note, though, that the ranges of these quintiles are not evenly spaced (CWTLReading: [0.11, 0.42), [0.42, 0.53), [0.53, 0.62), [0.62, 0.74), [0.74, 1]; MATHia: [0, 0.37), [0.37, 0.48), [0.48, 0.55), [0.55, 0.63), [0.63, 1]; iReady: [0, 0.28), [0.28, 0.42), [0.42, 0.52), [0.52, 0.64), [0.64, 1]).
While we generally see correlation with the true quintile, the models often overestimate poor performance and underestimate high performance, suggesting they are not yet well suited for individual- or subgroup-level performance predictions. However, the models’ relatively high precision at the extremes suggests they might be able to highlight when students could benefit from additional challenge or support, though further investigation and validation is key before employing such predictions for any important interventions.
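As an illustration of this kind of subgroup analysis, the sketch below maps continuous predictions into quintile bins derived from the true scores and reports the confusion matrix and per-quintile precision; y_true and y_pred are hypothetical arrays, and this is a sketch of the general procedure rather than the exact analysis code.

```python
# A sketch of binning continuous predictions into quintiles and computing the
# confusion matrix and per-quintile precision.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

edges = np.quantile(y_true, [0.2, 0.4, 0.6, 0.8])  # internal quintile edges
q_true = np.digitize(y_true, edges)                # bins 0..4 correspond to Q1..Q5
q_pred = np.digitize(y_pred, edges)
cm = confusion_matrix(q_true, q_pred, labels=range(5))
precision = precision_score(q_true, q_pred, labels=range(5),
                            average=None, zero_division=0)
print(cm)
print({f"Q{i+1}": round(float(p), 2) for i, p in enumerate(precision)})
```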
Predicting subgroups on or off track. In MATHia, the end-of-year state test also has 5 achievement categories, where levels 3 to 5 correspond to a passing score. Thus, we cluster students into two subgroups: an ‘on track’ subgroup whose achievement categories are in levels 3 to 5, and a ‘not on track’ subgroup whose achievement categories are in levels 1 to 2. Prediction is conducted over the original achievement categories, and we then calculate a confusion matrix based on the two clusters. Using the short-horizon (2hr) log data, we find that the model correctly classifies 1541 students as "on track", correctly classifies 423 students as "not on track", and misclassifies 680 students. In particular, the model has low recall for the ‘not on track’ subgroup (0.44, correctly predicted ‘not on track’ divided by actual ‘not on track’), indicating it misses a large number of students who are not on track. On the plus side, similar to our results for the quintile analysis, the model has reasonable precision: if it predicts a student is not on track, 74% of those instances did have a post-assessment indicating they were not on track.
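The recall and precision figures above follow directly from the binary confusion matrix, as in the brief sketch below, assuming hypothetical boolean arrays where True denotes ‘not on track’ (achievement levels 1-2).

```python
# A sketch of recall and precision for the 'not on track' class.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(true_not_on_track, pred_not_on_track).ravel()
recall = tp / (tp + fn)     # correctly flagged / all students actually not on track
precision = tp / (tp + fp)  # correctly flagged / all students flagged not on track
```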
3.5 RQ5: The Effects of Pre-Assessment Scores
Prediction over the population using pre-assessment scores combined with a set of log data features. Tables 3 & 4 show results for: 1) using pre-test scores only; 2) combining pre-test scores with log data features over the short horizon; and 3) combining pre-test scores with full-horizon log data features. In general, the pre-assessment or pre-test score is a powerful indicator of long-term outcomes; it can stand alone and outperforms using log data features only. With pre-test scores, all models show similar performance, with RMSE around 0.14 for CWTLReading and around 0.12 for iReady. RMSE is not significantly improved compared to using log data features in CWTLReading, but R2 values are higher when the pre-test score is included, peaking at 0.44 for LR and SVR in Short+Pre-Test (CWTLReading) and 0.66 for SVR in Short+Pre-Test (iReady), which indicates a better-fitting model when pre-test scores are integrated. Overall, unsurprisingly, it is beneficial to include a pre-test score in the model if available.
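Combining the two sources of information amounts to appending the pre-test score as an additional feature column, as in the sketch below; X_short and pre_test are hypothetical arrays from the earlier snippets, and SVR is shown only as one of the models compared.

```python
# A sketch of augmenting short-horizon log features with a pre-test score
# column before fitting; inputs are hypothetical.
import numpy as np
from sklearn.svm import SVR

X_with_pre = np.hstack([X_short, np.asarray(pre_test).reshape(-1, 1)])
rmse, r2 = evaluate_horizon(X_with_pre, y, SVR)
```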
We further investigate the effects of combining a single feature with pre-test scores. Table 5 shows the prediction results on CWTLReading and iReady using LR. The pre-test can substantially enhance LR’s performance with a single feature. Even when the two features have differing effects on model performance (e.g., 0.197 RMSE using PS only vs. 0.179 RMSE using AA only), combining each with the pre-test can close the gap by bringing prediction performance to a similar level (e.g., 0.123 RMSE using PS+Pre-Test vs. 0.122 RMSE using AA+Pre-Test). This indicates the effectiveness of pre-test scores in prediction.
Prediction over subgroups using pre-assessment scores combined with a set of log data features. We calculate confusion matrices over the quintile subgroups, both with and without pre-test scores, on the selected short horizons. Figure 3 shows the heat maps of the confusion matrices for CWTLReading and iReady. For the lower performers Q1 and Q2 in CWTLReading, using log data features with the pre-test (Short+Pre-Test) concentrates the overestimation errors closer to the true quintile (with fewer cases of predicting Q1 as Q3-Q5). For the highest performers, Q5, Short+Pre-Test similarly concentrates the underestimation errors (with fewer cases of predicting Q5 as Q1-Q3). For the lowest performers, Q1, in iReady, similar to what we observe in CWTLReading, adding the pre-test to the log data features helps LR overestimate less. And for the highest performers, Q5, in iReady, the pre-test also helps LR underestimate less (i.e., fewer predictions of Q5 as Q1-Q3).
To summarize, if a pre-test score is available, one can use it alone or combine it with a single feature. If no pre-test score is available and only log data are present, there is still a gain from using more than one feature extracted from the log data.
4 Discussions
From the performance subgroup analysis, we observed that the three machine learning models are not highly accurate at classifying students’ likely quintile assessment performance or the binary "on track" measure, but the precision of the models is quite good when they suggest someone is in danger of performing very poorly, or is likely to perform very well. This suggests interesting future directions for creating systems that could alert educators or provide additional support or challenge in this setting. Another interesting direction is to see whether there are additional features or models that could further improve predictive accuracy. However, before investigating such directions, we strongly believe that additional testing and validation would be crucial, particularly to avoid accidentally causing worse outcomes for students, should they be labeled inaccurately or if some teachers or other stakeholders deprioritize such students because they think those students are expected to fail. We also note that stakeholders may wish to understand and interpret the predictions of such models. While linear regression is the most interpretable approach in this work, the other models require additional effort to become interpretable to stakeholders.
In this study, we use fixed usage time to define short horizons, i.e., we use the cumulative interaction time of each student until they reach a threshold of short-horizon usage (e.g., 1 hour). We do not consider potential usage time, i.e., the situation where students share the same potential usage time but do not have productive usage during that time, which may be another important indicator for long-term outcomes. In addition, we note that fixed usage time may result in varied course content coverage across students. For example, imagine the length of a class is 3 hours (i.e., the potential usage time is 3 hours), but some students only actively engage for 2 hours. If we set a fixed usage time of 3 hours, the data considered for those students will contain their interactions from the first hour of the next class. These issues would be an interesting direction for further work.
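For clarity, the fixed-usage-time truncation could be sketched as follows, assuming a chronologically ordered event log with hypothetical columns student_id, timestamp, and duration_sec; this is a simplified illustration of the idea rather than our exact preprocessing code.

```python
# A minimal sketch of truncating each student's log to their first H hours of
# cumulative interaction time; column names are hypothetical.
import pandas as pd

def truncate_to_horizon(logs: pd.DataFrame, horizon_hours: float) -> pd.DataFrame:
    logs = logs.sort_values(["student_id", "timestamp"])
    cum_hours = logs.groupby("student_id")["duration_sec"].cumsum() / 3600.0
    return logs[cum_hours <= horizon_hours]
```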
We also note that when additional features about student demographics are available, or more detailed information specific to the educational product (such as knowledge components), this information can potentially further enhance predictive accuracy [30, 37]. We did not consider those features in this study, because they were not available and may not always be available in the future, and because our primary interest was a general study exploring how short-term features that can be extracted across many educational contexts can be used to predict long-term external assessments.