1 Introduction

The rapid growth of educational resources on the Internet has given rise to many popular online learning platforms, such as massive open online course (MOOC) platforms [1] and intelligent tutoring systems (ITSs) [2]. Some ITSs can trace students' current mastery of knowledge from their past question-answering performance [3, 4], so that students can obtain appropriate guidance while acquiring relevant knowledge [5, 6]. The development of artificial intelligence technology has made these online learning platforms increasingly intelligent and has brought a wide range of benefits [7, 8]. For example, these platforms can automatically provide personalized feedback and learning suggestions to each student by analyzing that student's historical learning data [9, 10]. Behind these personalized tutoring services, learner performance prediction (LPP) technology plays a key role: it predicts learners' future practice performance from their proficiency in mastering skills and concepts.

In fact, the knowledge proficiency of learners changes over time as a result of acquiring and forgetting knowledge [11, 12]. Hence the LPP task should be based on learners' dynamic knowledge states, which are implicitly contained in their learning logs; this is where knowledge tracing comes into play. A key problem in analyzing learners' data is to predict their future performance given their past performance, which is referred to as the knowledge tracing problem [13, 14].

Usually, the knowledge tracing task can be described as follows: given a specific learning task, we predict the learner's next performance \({x}_{t+1}\) by observing the sequence \(x=\{{x}_{1},\dots ,{x}_{t}\}\) of the learner's historical performance on a specific problem. A common choice of feature representation is to represent \({x}_{t}\) by a tuple (\({q}_{t },{a}_{t}\)), where \({q}_{t}\) is the question the learner answers at time \(t\) and \({a}_{t}\) indicates whether the answer is correct, so the probability \(P({a}_{t+1}=1|{q}_{t+1},{x}_{t})\) that the learner answers the next question correctly at time \(t+1\) can be predicted [15, 16].
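As a concrete illustration, the short snippet below shows how a learner's interaction history is typically encoded under this formulation. The question IDs and answers are made-up values, and the variable names are ours rather than part of any specific dataset.

```python
# Toy example of the knowledge tracing formulation: each interaction x_t is a
# tuple (q_t, a_t) of a question identifier and a binary correctness flag.
history = [(12, 1), (12, 0), (37, 1), (37, 1)]   # x_1, ..., x_4 (made-up values)
next_question = 52                                # q_5
# A knowledge tracing model estimates P(a_5 = 1 | q_5 = 52, history),
# i.e. the probability that the learner answers question 52 correctly.
```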

The existing knowledge tracing models still have some problems, such as low prediction accuracy and difficulty in handling multi-skill questions [17]. Therefore, further research is necessary to improve model performance and solve the multi-skill problem. This study applies the XGBoost algorithm to knowledge tracing tasks and improves both the performance and the efficiency of the knowledge tracing model.

2 Related Work

The knowledge tracing (KT) model, first proposed by Atkinson, is a classic model for diagnosing and tracing learners' learning states [18]. Existing knowledge tracing models can be divided into two types: statistical KT models and deep learning KT models.

Among statistical KT models, the Bayesian Knowledge Tracing (BKT) model is one of the most classical [19]. BKT models a learner's latent knowledge state as a set of binary variables, each representing whether a certain knowledge component (KC) has been mastered. After each learning interaction, BKT uses a hidden Markov model (HMM) to update the probabilities of these binary variables [20, 22].

As deep learning has developed, many researchers have applied it to the field of knowledge tracing. The deep knowledge tracing (DKT) model [23] was proposed in 2015, and its basic structure is a recurrent neural network (RNN). Inspired by memory-augmented neural networks (MANNs), the dynamic key-value memory network (DKVMN) [24] augments knowledge tracing with an auxiliary memory: it explicitly maintains a KC representation matrix (key) and a knowledge state representation matrix (value). The Sequential Key-Value Memory Network (SKVMN) [25] combines the strengths of DKT's recurrent modeling capacity and DKVMN's memory capacity. Among text-aware knowledge tracing models, the Exercise-Enhanced RNN (EERNN) [26] uses a bi-directional LSTM module to extract a representation of each question from its text, and the Exercise-aware Knowledge Tracing (EKT) [27] model, which integrates the EERNN and DKVMN models, adds a dual attention module. Many attempts have been made to use attention mechanisms to enhance model interpretability. Pandey et al. [28] took the lead in using the Transformer model in the field of knowledge tracing and proposed the SAKT model. Choi et al. [29] improved self-attentive computation for knowledge tracing and proposed the Separated Self-Attentive Neural Knowledge Tracing (SAINT) model. Further work has explored other deep learning models for knowledge tracing. The Graph-Based Knowledge Tracing (GKT) [30] model incorporates a graph, in which nodes represent KCs and edges represent dependency relations between KCs, as a relational inductive bias. The joint graph convolutional network-based Deep Knowledge Tracing (JKT) [31] framework models multi-dimensional relationships as a graph. The Convolutional Knowledge Tracing (CKT) [32] model is the first to use convolutional neural networks for knowledge tracing. A neural Turing machine-based skill-aware knowledge tracing (NSKT) [33] model for conjunctive skills captures the relevance among the knowledge concepts of a question, modeling students' knowledge states more accurately and discovering more latent relevance among knowledge concepts. Although these DKT models achieve higher prediction performance than most statistical KT models, they still have the following problems: first, most DKT models use only a few features and do not try different feature combinations; second, because of the over-parameterized, black-box nature of deep learning, it is often difficult to interpret the predictions of DKT models; finally, most DKT models need more training time and only achieve good prediction results on large amounts of data.

3 Knowledge Tracing Model Based on XGBoost

The XGBoost (eXtreme Gradient Boosting) algorithm [34, 35] is a Boosting-type ensemble learning algorithm, which completes the learning task by constructing and combining multiple weak learners. The basic idea of XGBoost is to keep adding trees to the model and to let each tree grow through feature splitting. Adding a tree is equivalent to learning a new function that fits the residual of the previous prediction, and the final predicted value of a sample is the sum of the scores of all trees on that sample. Combining the characteristics of knowledge tracing tasks with XGBoost's strong generalization ability and high computational efficiency, this study constructs a knowledge tracing model based on the XGBoost algorithm [36]. The flow chart of the model is shown in Fig. 1.

Fig. 1

Flow chart of XGBoost knowledge tracing model
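To make the additive idea described above concrete, the following is a minimal sketch of residual-fitting boosting using scikit-learn regression trees on toy data. It illustrates the spirit of formulas (1) and (2) with a squared-error loss; it is not the full XGBoost algorithm, which optimizes the regularized second-order objective derived below.

```python
# Minimal residual-fitting boosting sketch (toy data, squared error for
# simplicity). Each new tree fits the residual of the current prediction,
# and the final prediction is the sum of all trees' contributions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                  # toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(float)         # toy 0/1 target

trees, eta = [], 0.3                                      # eta: shrinkage coefficient
pred = np.zeros_like(y)
for t in range(50):                                       # 50 boosting rounds
    residual = y - pred                                   # what the model still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    pred += eta * tree.predict(X)                         # formula (2): add f_t(x_i)
```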

The input data of the XGBoost model can be different feature combinations from the online learning platform, including the question, knowledge skill, attempt count, student answer, etc. Assuming that the dataset contains \(N\) samples with \(M\)-dimensional features, the prediction \({\widehat{y}}_{i}^{(t)}\) of a model containing \(t\) decision trees is calculated as shown in formula (1):

$$\hat{y}_{i} ^{{\left( t \right)}} = \mathop \sum \limits_{{k = 1}}^{t} f_{k} \left( {x_{i} } \right)$$
(1)

The prediction of the model at round \(t\) can also be written recursively, as shown in formula (2):

$${{\widehat{y}}_{i}}^{(t)}={{\widehat{y}}_{i}}^{(t-1)}+{f}_{t}\left({x}_{i}\right)$$
(2)

where \({{\widehat{y}}_{i}}^{(t-1)}\) represents the predicted value after round \(t-1\), and \({f}_{t}\left({x}_{i}\right)\) is the score that the tree added at round \(t\) assigns to sample \({x}_{i}\). The objective function of the XGBoost model is shown in formula (3):

$$Obj^{{\left( t \right)}} = \mathop \sum \limits_{{i = 1}}^{n} l\left( {y_{i} ,\hat{y}_{i} ^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right)} \right) + \Omega \left( {f_{t} } \right) + C$$
(3)

Here the constant \(C\) collects the regularization terms of the trees built in previous rounds. The objective function consists of two parts: the first is the loss measuring the difference between the real values and the predicted values, and the second is the regularization function, which prevents overfitting during the training of the XGBoost model. The regularization function itself consists of two parts, as shown in Eq. (4):

$$\Omega \left( {f_{t} } \right) = \gamma T + \frac{1}{2}\lambda \sum\limits_{{j = 1}}^{T} {w_{j}^{2} }$$
(4)

where \(T\) refers to the number of leaf nodes, \(\gamma\) and \(\lambda\) are penalty coefficients, and \({w}_{j}\) is the score of the \(j-th\) leaf node. Expanding \(Obj\) with a second-order Taylor formula gives Eq. (5):

$$Obj^{{\left( t \right)}} \approx \mathop \sum \limits_{{i = 1}}^{n} \left[ {l\left( {y_{i} ,\hat{y}_{i} ^{{\left( {t - 1} \right)}} } \right) + g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right) + C$$
(5)

Here, \({g}_{i}\) is the first-order partial derivative of the loss function for the \(i-th\) sample, and \({h}_{i}\) is the second-order partial derivative of the loss function for the \(i-th\) sample. After removing the constant terms and grouping the samples by the leaf node they fall into, the objective function used during training is shown in formula (6):

$$Obj^{{\left( t \right)}} \approx \mathop \sum \limits_{{j = 1}}^{T} \left[ {\left( {\sum g_{i} } \right)w_{j} + \frac{1}{2}\left( {\sum h_{i} + \lambda } \right)w_{j}^{2} } \right] + \gamma T$$
(6)

Here, \(Obj^{(t)}\) is a quadratic function of the single variable \({w}_{j}\) for each leaf, and it attains its minimum when \(w_{j} = - G_{j} /\left( {H_{j} + \lambda } \right)\). Substituting this value gives the objective function shown in formula (7):

$$Ob{j}^{\left(t\right)}=-\frac{1}{2}\sum_{j=1}^{T}\frac{{G}_{j}^{2}}{{H}_{j}+\lambda }+\gamma T$$
(7)
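As a brief derivation of the optimal leaf weight used above: since formula (6) is quadratic in each \({w}_{j}\), setting its partial derivative with respect to \({w}_{j}\) to zero gives

$$\frac{\partial Obj^{\left(t\right)}}{\partial {w}_{j}}={G}_{j}+\left({H}_{j}+\lambda \right){w}_{j}=0\;\Rightarrow\; {w}_{j}^{*}=-\frac{{G}_{j}}{{H}_{j}+\lambda },$$

and substituting \({w}_{j}^{*}\) back into formula (6) yields formula (7).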

Here \({G}_{j}=\sum {g}_{i}\) and \({H}_{j}=\sum {h}_{i}\) are summed over the samples falling into leaf \(j\), and \(Obj\) plays a role analogous to an impurity measure such as the Gini coefficient: it evaluates the quality of a tree structure.
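For the binary cross-entropy (Logloss) objective used later in Sect. 5.1, the derivatives \({g}_{i}\) and \({h}_{i}\) take a simple closed form (a standard result, stated here for completeness), where \(\sigma(\cdot)\) is the sigmoid function and \({p}_{i}^{(t-1)}=\sigma \left({\widehat{y}}_{i}^{(t-1)}\right)\) is the probability predicted after round \(t-1\):

$${g}_{i}={p}_{i}^{(t-1)}-{y}_{i},\qquad {h}_{i}={p}_{i}^{(t-1)}\left(1-{p}_{i}^{(t-1)}\right)$$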

When splitting a node, the XGBoost algorithm selects the feature split that yields the largest gain in this score. Based on the above analysis, the XGBoost algorithm can predict, from student data, the probability that a student answers a question correctly. Therefore, the XGBoost model can be applied to knowledge tracing tasks. The model takes relevant student features as input, such as the student ID, knowledge skill, question ID, the number of prompts, the number of attempts on a question, etc. In the next section, experiments are conducted to verify the effectiveness of the XGBoost knowledge tracing model.
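The sketch below illustrates this setup as a binary classification problem. The file name and column names (user_id, skill_id, problem_id, attempt_count, correct) are assumptions based on the features described above and in Sect. 4, not the exact schema used in the experiments, and the parameter values are illustrative rather than the tuned settings of Table 2.

```python
# Hedged sketch: training an XGBoost classifier to predict whether a student
# answers a question correctly. File and column names are illustrative.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("assist09_clean.csv")                   # preprocessed interaction log (hypothetical)
features = ["user_id", "skill_id", "problem_id", "attempt_count"]
X, y = df[features], df["correct"]                        # correct: 1 = right, 0 = wrong

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBClassifier(
    n_estimators=200,        # number of boosting rounds (trees)
    max_depth=6,             # depth of each tree
    learning_rate=0.1,       # shrinkage coefficient eta
    gamma=1.0,               # per-leaf penalty (the gamma term in Eq. (4))
    reg_lambda=1.0,          # L2 penalty on leaf weights (the lambda term in Eq. (4))
    objective="binary:logistic",
)
model.fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]                    # probability of a correct answer
print("AUC:", roc_auc_score(y_te, prob))
```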

4 Dataset

4.1 Dataset Description

The experiments use three datasets: ASSIST09, Algebra08 and ASSIST17. The numbers of students, knowledge skills and interactions in the three datasets are shown in Table 1. An interaction is one answering record of a student on a question.

Table 1 The statistics of the three datasets

ASSIST09 and ASSIST17 contain student answer data collected from the ASSISTments online teaching platform in 2009 and 2017, respectively. The ASSIST09 dataset contains a total of 525,535 interactions by 4,217 students on 124 knowledge skills. The ASSIST17 dataset contains 942,816 interactions, 686 students and 102 knowledge skills.

Algebra08 [37] contains records of interactions between students and computer-aided tutoring systems in 2008–2009. The dataset includes 8,918,054 interactions of 3,310 students on 922 knowledge skills.

4.2 Data Preprocessing

Since the datasets come from online teaching platforms, they contain a certain amount of noisy data, including missing values, duplicate values and outliers. A data-cleaning preprocessing step is therefore necessary before feeding the data into the model. Data preprocessing in our work includes outlier processing, missing value processing and data labeling. Outliers are extreme values that deviate from the other observations in the dataset; they may indicate variability in a measurement, experimental errors or a novelty. For example, the overlap_time column in the ASSISTments data contains some negative values (such as '-7,759,575'), which obviously are not valid time values and are deleted. For missing value processing, the skill_id feature is the ID of the knowledge skill to which a question belongs; the 66,326 records with an empty skill_id are deleted.
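A brief pandas sketch of these cleaning steps is shown below. The column names (overlap_time, skill_id, correct) follow the ASSISTments conventions mentioned above, and the file names are illustrative.

```python
# Hedged preprocessing sketch: drop outliers and missing values as described
# above. File and column names are illustrative.
import pandas as pd

df = pd.read_csv("assist09_raw.csv")

# Outlier processing: negative time values (e.g. -7,759,575) are not valid.
df = df[df["overlap_time"] >= 0]

# Missing value processing: drop records without a knowledge skill ID.
df = df.dropna(subset=["skill_id"])

# Data labeling: the 'correct' column (1 = right answer, 0 = wrong) is the label.
df.to_csv("assist09_clean.csv", index=False)
```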

5 Model Training

The model is trained using the scikit-learn library in Python, and the main parameters of XGBoost are shown in Table 2.

Table 2 The main parameters of XGBoost model

5.1 Model Evaluation

In the knowledge tracing task, Logloss is used as the loss function in the experiment, and its calculation is shown in Eq. (8) [38].

$$J\left( \theta \right) = - \frac{1}{m}\mathop \sum \limits_{{i = 1}}^{m} \left[ {y_{i} \log p\left( {x_{i} } \right) + \left( {1 - y_{i} } \right)\log \left( {1 - p\left( {x_{i} } \right)} \right)} \right]$$
(8)

where \({y}_{i}\) is the true label of instance \({x}_{i}\), and \(p\left({x}_{i}\right)\) is the predicted probability that instance \({x}_{i}\) belongs to the positive class (i.e., that the answer is correct).
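A small NumPy sketch of Eq. (8) is given below; the clipping step is an implementation detail added here to avoid evaluating log(0), not part of Eq. (8) itself.

```python
import numpy as np

def logloss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy as in Eq. (8); probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```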

5.2 Parameters

In this paper, knowledge tracing models based on FM, DeepFM, AutoInt and DKT are chosen as the baseline models [39]. The parameters of the FM, DeepFM, AutoInt, XGBoost and DKT models are shown in Tables 3, 4, 5, 6 and 7, respectively.

Table 3 The parameters of FM model
Table 4 The parameters of DeepFM model
Table 5 The parameters of AutoInt model
Table 6 The parameters of XGBoost model
Table 7 The parameters of DKT model

Among these models, the FM and AutoInt models apply the FM and AutoInt algorithms, respectively, to the knowledge tracing task; the DeepFM model consists of two parts: an FM model and a feedforward neural network.

6 Experimental Results and Analysis

In the experiments, the FM, DeepFM, AutoInt and XGBoost models use feature combinations for training and prediction. The DKT model, a sequence-based prediction model, uses time-series data constructed from the user_id and skill_id features to predict the results. We use the area under the receiver operating characteristic curve (AUC) as an evaluation metric to compare the prediction performance of the models; a higher AUC indicates better performance. For the knowledge tracing models based on FM, DeepFM, AutoInt and XGBoost, whether a student answers correctly is treated as the label, so the task is a classification problem and the model output is the probability of a student answering a knowledge skill correctly; it is therefore also reasonable to use AUC as the evaluation indicator. The time recorded in the experiments is the total training time. The running time of the FM model is affected by the num_iter parameter, and the running time of the XGBoost model is affected by the n_estimators parameter. The running time of the DeepFM, AutoInt and DKT models is influenced by the epochs and batch_size parameters. In addition, the DeepFM, AutoInt and DKT models run on GPU machines, while the FM and XGBoost models run on CPU machines [40].
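For reference, both metrics can be computed from a model's predicted probabilities as in the sketch below; the toy values are ours, and the 0.5 threshold used to turn probabilities into class labels for ACC is an assumption on our part.

```python
# Hedged sketch: computing AUC and ACC from predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([1, 0, 1, 1, 0])             # true labels (toy values)
y_prob = np.array([0.9, 0.3, 0.6, 0.8, 0.4])   # predicted P(correct)

auc = roc_auc_score(y_true, y_prob)            # ranking quality of the probabilities
acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))  # threshold at 0.5
print(f"AUC = {auc:.4f}, ACC = {acc:.4f}")
```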

Tables 8, 9 and 10 show the experimental results of the knowledge tracing models based on FM, DeepFM, AutoInt, XGBoost and DKT for the ASSIST09, Algebra08 and ASSIST17 datasets using different features. We also give the bar charts of the experimental results for a visual representation in Figs. 2 and 3 for the ASSIST09 dataset.

Table 8 The experimental results for ASSIST09
Table 9 The experimental results for Algebra08
Table 10 The experimental results for ASSIST17
Fig. 2

ACC for the XGBoost, AutoInt, DeepFM, and FM models

Fig. 3

AUC for XGBoost, AutoInt, DeepFM, and FM models

From Tables 8 and 9, when only the user_id and skill_id features are used, the XGBoost model does not perform better than the AutoInt model. After the problem_id feature is added, Tables 8 and 10 show that the XGBoost model outperforms the models based on FM, DeepFM and AutoInt, although on the Algebra08 dataset the XGBoost model does not perform better than the other models. When the attempt_count and extra features are added, the XGBoost model reaches 0.9855 (AUC) and 0.9442 (ACC) on the ASSIST09 dataset; compared with the basic model using only user_id and skill_id, its AUC increases by 0.234 and its ACC increases by 0.204. However, Table 10 shows that for the ASSIST17 dataset, even when all features are used, the AUC is only 0.7860.

Compared with all the other models, the running time of the XGBoost model is the shortest. When only the user_id and skill_id features are used, the AutoInt and DKT models perform better, but their training time is much longer than that of the XGBoost model, which is not conducive to model deployment and application on online education platforms. In contrast, the knowledge tracing model based on the XGBoost algorithm can greatly reduce training time if deployed to online platforms.

From Tables 8 and 9, it can be found that the attempt_count feature has a large impact on model performance; in particular, for Algebra08, adding the attempt_count feature brings an improvement of 0.2963 (AUC) and 0.1221 (ACC) to the XGBoost model. For the ASSIST09 and ASSIST17 datasets, adding the extra feature also improves the XGBoost model. Overall, the attempt_count feature has the greater impact on the prediction performance of the XGBoost-based knowledge tracing model.

In addition, adding the problem_id and skill_id features to the XGBoost model makes it possible to identify which problems or knowledge skills a student has mastered. The model can also effectively deal with multi-skill questions without any extra processing of the original dataset.

7 XGBoost Model Analysis

As shown above, using the XGBoost model for the knowledge tracing task yields good experimental results. Using the XGBoost model for knowledge tracing also has the following advantages. Adding the complexity of the tree as a regularization term to the optimization target reduces the risk of overfitting. In addition, the computation over trees can be parallelized, which makes the prediction stage faster.

Compared with other knowledge tracing models, especially the various DKT models, the XGBoost model saves more time. At the same time, after each iteration, XGBoost multiplies the weights of the leaf nodes by a shrinkage coefficient, mainly to weaken the influence of each individual tree so that there is more room for later trees to learn. Weakening the influence of previous trees can be used to represent students' forgetting behavior during the learning process. For example, the Attentive Knowledge Tracing (AKT) model uses weight decay to account for forgetting in student memory over time by reducing the attention weights of questions across a series of interactions; the XGBoost model can achieve a similar effect with a machine learning method. The XGBoost algorithm internally implements a boosted tree model that can automatically handle missing values. However, when the amount of training data in a knowledge tracing dataset is very large and a suitable deep knowledge tracing model exists, the accuracy of deep learning can be far ahead of XGBoost.

8 Conclusion

This study introduces the basic principles of the XGBoost algorithm and then describes how to apply XGBoost to the knowledge tracing model. Experimental results show that the XGBoost algorithm can be effectively applied to knowledge tracing tasks. When multiple features are added to the model, the best predicted AUC value reaches 0.9855. At the same time, compared with previous knowledge tracing models based on deep learning, the XGBoost model saves training time and can be more conveniently deployed and applied on online education platforms.