Leveraging Pedagogical Theories to Understand Student Learning Process with Graph-based Reasonable Knowledge Tracing

Jiajun Cui (ORCID 0000-0001-5900-7643, cuijj96@gmail.com), East China Normal University, Shanghai, China; Hong Qian (ORCID 0000-0003-2170-5264, hqian@cs.ecnu.edu.cn), East China Normal University, Shanghai, China; Bo Jiang (ORCID 0000-0002-7914-1978, bjiang@deit.ecnu.edu.cn), East China Normal University, Shanghai, China; and Wei Zhang (ORCID 0000-0001-6763-8146, zhangwei.thu2011@gmail.com), East China Normal University, Shanghai, China

(2024)

Abstract.

Knowledge tracing (KT) is a crucial task in intelligent education, focusing on predicting students’ performance on given questions to trace their evolving knowledge. The advancement of deep learning in this field has led to deep-learning knowledge tracing (DLKT) models that prioritize high predictive accuracy. However, many existing DLKT methods overlook the fundamental goal of tracking students’ dynamic knowledge mastery. These models either do not explicitly model the knowledge mastery tracing process or yield unreasonable results that educators find difficult to comprehend and apply in real teaching scenarios. In response, we conduct a preliminary analysis of mainstream KT approaches to highlight and explain such unreasonableness. We introduce GRKT, a graph-based reasonable knowledge tracing method, to address these issues. By leveraging graph neural networks, our approach models the mutual influences of knowledge concepts, offering a more accurate representation of how knowledge mastery evolves throughout the learning process. Additionally, we propose a fine-grained, psychologically grounded three-stage modeling process, comprising knowledge retrieval, memory strengthening, and knowledge learning/forgetting, to conduct a more reasonable knowledge tracing process. Comprehensive experiments demonstrate that GRKT outperforms eleven baselines across three datasets, not only enhancing predictive accuracy but also generating more reasonable knowledge tracing results. This makes our model a promising advancement for practical implementation in educational settings. The source code is available at https://github.com/JJCui96/GRKT.

Keywords: knowledge tracing, student behavior modeling, data mining, pedagogical theory, reasonable knowledge tracing

Published in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. Copyright: ACM licensed. ISBN: 979-8-4007-0490-1/24/08. DOI: 10.1145/XXXXXX.XXXXXX. CCS concepts: Computing methodologies → Neural networks; Applied computing → Education; Information systems → Data mining.

1. Introduction

In personalized learning, Knowledge Tracing (KT) is crucial for tracking students’ evolving knowledge mastery based on their historical question responses (Wu et al., 2024; Corbett and Anderson, 1994). Early researchers addressed this challenge by leveraging the monotonicity assumption (Embretson and Reise, 2013), linking better mastery of one knowledge concept (KC) to a higher probability of correctly answering related questions. They trained models to predict student responses on given questions, proposing typical machine learning-based KT methods (Corbett and Anderson, 1994; Pardos and Heffernan, 2011). Consequently, predicting student performance became the primary task, with prediction accuracy as the mainstream metric for evaluating KT models, promoting the emergence of deep learning knowledge tracing (DLKT) methods. However, many DLKT approaches prioritize prediction ability over the fundamental objective of knowledge tracing, sometimes forgoing tracing altogether (Choi et al., 2020; Ghosh et al., 2020). Others use internal network weights to represent knowledge mastery (Yin et al., 2023; Shen et al., 2021), facing challenges in constructing meaningful tracing results due to the low interpretability and reasonability of deep neural network structures. Hidden neurons in these networks adaptively learn from data without explicit meaning (Guidotti et al., 2018). It is worth noting that the cognitive diagnosis task also assesses knowledge mastery but usually focuses on static testing instead of dynamic learning process (Liu et al., 2021; Leighton and Gierl, 2007). Therefore, we do not delve into it within this paper.

Figure 1. Illustration of a student’s evolving knowledge mastery while answering ten questions, traced by two DLKT models, along with an assumed ideal tracing result. The student is sampled from the ASSIST12 dataset, introduced in Section 5.1.1.

Figure 1 illustrates the traced dynamic knowledge mastery of an example student by two DLKT models: DKT (Piech et al., 2015) and LPKT (Shen et al., 2021). DKT is a pioneering approach that directly applies recurrent neural networks (RNNs) to the KT task. In this case, when the student responds to the initial four questions related to the blue KC Calculations with Similar Figures, their mastery of the unrelated green KC Ordering Integers increases, which is an unreasonable outcome. Furthermore, a correct response to the sixth question decreases the mastery of its corresponding KC, demonstrating an inconsistent direction of change. LPKT, as a time-aware method, models learning and forgetting processes for more reasonable knowledge tracing. However, it struggles to capture the relation between the yellow KC Area Triangle and the blue KC Calculations with Similar Figures, as evidenced by the decreasing mastery on the blue curve following a correct response to a question on the yellow KC. Both KCs examine students’ calculations involving the base and height of triangles, which suggests an underlying relation. Beneath the figure is a tracing result from an assumed ideal model, which we design based on comprehensive pedagogical effects. As shown, the student’s mastery increases or drops according to right or wrong responses, based on the testing effect (Roediger III and Karpicke, 2006). The mastery of the yellow KC also increases following the correct response to the sixth question on the orange KC, according to the transfer of learning (Perkins et al., 1992). In addition, the mastery between responses should vary due to students’ learning and forgetting behaviors, as modeled by the learning and forgetting curves (Yelle, 1979; Ebbinghaus, 1885).

From this example, we summarize three deficiencies of current DLKT methods in dynamic knowledge tracing reasonability: (i) Mastery change of unrelated KCs: learning one KC affects the mastery of unrelated KCs; (ii) No mastery change of related KCs: learning one KC does not impact related KCs; (iii) Inconsistent mastery change direction: correct answers may decrease KC mastery, and vice versa. These stem from opaque deep neural networks, whose parameters serve the overarching objective of performance prediction. Moreover, many studies use RNNs to model knowledge application and update through the recurrent units’ output and state transition (Shen et al., 2021, 2022; Liu et al., 2019; Piech et al., 2015). This mixes the effects of students answering questions with their spontaneous behaviors, leading to confusing tracing results. For example, an incorrect response may strengthen wrong knowledge retrieval and cause a mastery drop in the related KC; but when students receive feedback and learn from their errors, they can still make net progress. Such fine-grained changes in knowledge mastery are not captured. To address these issues, we introduce GRKT, a Graph-based Reasonable Knowledge Tracing method, to enhance knowledge tracing reasonability while retaining neural networks’ representational power.

To be specific, we integrate pedagogical theories (Perkins et al., 1992; Yelle, 1979; Ebbinghaus, 1885; Roediger III and Karpicke, 2006) into the KT modeling, dividing the learning process into three distinct stages. (i) The knowledge retrieval stage analyzes how students respond to questions. This stage draws from cognitive psychology (Melton, 1963), viewing learning as encoding, storing, and retrieving memories. When students answer questions, retrieval from memory becomes crucial. We start this stage by retrieving the encoded memory related to the question’s KC and project it into a mastery value. We then compare this value with the question’s difficulty score to predict if the student could correctly answer the question. (ii) The memory strengthening stage focuses on how answering questions impacts students’ knowledge mastery. Here, students strengthen their memory retrieval routes, aligning with the Testing Effect theory (Roediger III and Karpicke, 2006; Kornell et al., 2009). Correct retrievals enhance learning, while incorrect ones reinforce errors. We encode this positive/negative memory strengthening in the knowledge memory of the relevant KC based on whether the question is correctly solved. (iii) The knowledge learning/forgetting stage explores what students do after question answering. This stage aims to model the active learning and natural forgetting behaviors based on the Learning curve (Yelle, 1979) and the Forgetting curve (Ebbinghaus, 1885). Both curves suggest a decreasing rate of learning and forgetting over time. Concretely, we first introduce a learning decider to determine whether students will continue learning the KCs just practiced or the KCs for future study. Then, we employ KC-specific time-aware kernels to model the learning/forgetting curves of all involved KCs based on these decisions. 
By applying this three-stage modeling process iteratively across students’ response sequences, we establish a coherent and reasonable knowledge tracing framework. This approach effectively captures mastery changes resulting from question answering and subsequent behaviors, addressing the issue of inconsistent mastery change direction.

To handle the two other issues of mastery change of unrelated KCs and no mastery change of related KCs, we utilize the message passing mechanism of graph neural networks (GNNs) applied to KC relation graphs. This mechanism establishes clear boundaries between related and unrelated KCs. Specifically, changes in knowledge mastery of one KC are propagated through the graph edges to its related KCs within a specific number of hops. From the pedagogical perspective, this message passing aligns with the Transfer of Learning theory (Perkins et al., 1992), which explains humans’ ability to transfer knowledge between similar fields to solve problems and acquire skills. We integrate this understanding into our three-stage learning process modeling using KC relation-based GNNs. For instance, in the first stage of GRKT, instead of solely retrieving knowledge from the target question’s KC, we utilize graph aggregation to synthesize the memory of the KC’s neighbors for solving the question. Similarly, during the second stage, the memory strengthening process involves propagating the gain and loss of knowledge mastery to the KC’s neighbors, and this process is also applied in the third stage’s knowledge learning. Additionally, we exploit the homophily of GNNs to generate similar time-aware kernels for related KCs, effectively modeling their similar learning/forgetting processes. This defines the boundaries between related and unrelated KCs based on the number of hops in GNN operations, effectively addressing challenges associated with mastery changes between different KCs. It’s worth noting that KCs have various types of relations, including prerequisite, similarity, collaboration, remedial, and hierarchy (Gao et al., 2023). In GRKT, we primarily focus on leveraging the two most commonly used relations: prerequisite and similarity.

To the best of our knowledge, this work represents the first comprehensive analysis of the reasonability issues in current DLKT methods, and integrates multiple pedagogical theories to address these concerns. The main contributions of this paper are as follows:

• Motivation. We identify the reasonability issues arising from the widespread adoption of deep learning techniques in the KT task. Many DLKT methods tend to excessively prioritize student performance prediction, often overlooking unreasonable knowledge tracing results due to the inherent interpretability challenges posed by neural networks.

• Methods. We outline three primary reasonability issues prevalent in current DLKT methods. To address these issues, we introduce GRKT, a graph-based reasonable knowledge tracing method, which establishes a three-stage learning process modeling. Additionally, we utilize the KC relation graph to mitigate mutual effects among KCs. The incorporation of multiple pedagogical theories provides sufficient support for our proposed method.

• Experiments. Comprehensive experimental results show that GRKT exhibits superior prediction performance and yields reasonable knowledge tracing results when compared to eleven baselines across three widely-used datasets.

2. Related Work

2.1. Reasonable Knowledge Tracing

Early KT methods in machine learning, such as Bayesian Knowledge Tracing (BKT) (Corbett and Anderson, 1994), initially showcased reasonable results due to their transparent and interpretable internal structure. BKT utilizes Hidden Markov Models (HMMs) to probabilistically represent the student learning process. It transitions knowledge mastery and emits probabilities of correct responses, while also considering guessing and slipping behaviors. Subsequent KT methods expanded upon BKT by incorporating additional pedagogical factors such as question difficulty (Pardos and Heffernan, 2011) or prior student information (Yudelson et al., 2013).

However, traditional methods show inferior prediction performance when compared to subsequent emerging DLKT methods (Piech et al., 2015; Liu et al., 2019; Shen et al., 2022; Pandey and Karypis, 2019; Ghosh et al., 2020; Choi et al., 2020; Cui et al., 2024), which reach high prediction performance due to the power of neural networks. Even so, these DLKT methods fail to produce reasonable knowledge tracing results due to their inherently opaque structures. Efforts have been made to tackle this challenge. Shen et al. (Shen et al., 2021) proposed Learning Process-consistent Knowledge Tracing (LPKT), which utilizes student response duration and interval time to capture learning and forgetting behaviors. However, it only focuses on knowledge learning and forgetting and does not model the interplay of knowledge mastery changes between KCs, limiting its reasonability. Similarly, Yin et al. (Yin et al., 2023) introduced the Diagnostic Transformer (DTransformer), which diagnoses student knowledge mastery from each tackled question and employs a contrastive learning framework to produce more stable knowledge tracing. While this stability enhances reasonability to some extent, its transformer-based structures do not adequately reflect the transition of knowledge mastery between continuous student responses. Therefore, while these approaches improve model reasonability from specific angles, they do not offer a comprehensive method to generate reasonable knowledge tracing results covering both KC relations and continuous learning processes.

We address this gap with our proposed GRKT, which utilizes GNNs to model KC relations and introduces a three-stage learning process to capture evolving knowledge mastery. By integrating these techniques, GRKT achieves high prediction performance while also generating more reasonable knowledge tracing results.

2.2. Graph-based Knowledge Tracing

Graph Neural Networks (GNNs) (Scarselli et al., 2008) serve as an efficient tool to capture intricate relations between instances in real-world scenarios. Their message aggregation and propagation operations on graphs yield deep representations for node features, enhancing performance in various downstream tasks across different domains. In the context of KT, researchers explore various structures to harness the power of GNNs. Nakagawa et al. (Nakagawa et al., 2019) pioneered the incorporation of GNNs into KT by reformulating it as a time-series node-level classification problem based on KC relation graphs. Gan et al. (Gan et al., 2022) leveraged this structure to enhance graph representation learning, generating more informative question and concept embeddings. Beyond KC relations, question-question and question-KC relations are also widely considered. For instance, Bi-CLKT (Song et al., 2022) applied contrastive learning to question-KC and KC-KC graphs to generate question embeddings enriched with question and KC structural information. Another work (Yang et al., 2021) leveraged question-KC relations to address question sparsity and multi-skill problems. In our paper, we specifically focus on utilizing GNNs to model the mutual effects of KCs during students’ knowledge leveraging and changing, constructing a more reasonable approach to knowledge tracing.

It is worth noting that some other GNN-based or memory-based methods (e.g., GKT (Nakagawa et al., 2019) and DKVMN (Zhang et al., 2017)) also update mastery between KCs. However, their knowledge state updates are still potentially performed by an erase-followed-by-add mechanism using GRU/LSTM cells, which cannot resolve the reasonability issues; for example, it does not guarantee a consistent direction of mastery change between KCs.

3. Preliminary

3.1. Task Formulation

Knowledge tracing aims to trace the dynamic evolution of students’ knowledge mastery throughout their learning processes, characterized by their responses to questions. Suppose there are a student set $\mathcal{U}$, a question set $\mathcal{Q}$, and a KC set $\mathcal{C}$. Each student $u\in\mathcal{U}$ has a historical response sequence $\mathcal{H}^{u}=\{r^{u}_{1},r^{u}_{2},\cdots,r^{u}_{|\mathcal{H}^{u}|}\}$, where each response $r^{u}_{t}=\left(q^{u}_{t},a^{u}_{t},c^{u}_{t},T^{u}_{t}\right)$ comprises the involved question $q^{u}_{t}\in\mathcal{Q}$, the correctness $a^{u}_{t}\in\{0,1\}$ (where $a^{u}_{t}=1$ means a correct response), the KC $c_{t}^{u}\in\mathcal{C}$ examined by the question, and the timestamp $T_{t}^{u}$ of the response. It is worth noting that there could be multiple KCs associated with one question. To be concise, we use the notations with just one KC to describe the task setting and the proposed method, but our method is easily extended to the setting of multiple KCs (e.g., averaging the KC representations as mentioned in Section 4.3). The objective is to track and monitor the evolving knowledge mastery of $u$ after each response, $\mathcal{M}^{u}=\{\textbf{m}^{u}_{1},\textbf{m}^{u}_{2},\cdots,\textbf{m}^{u}_{|\mathcal{H}^{u}|}\}$, where $\textbf{m}^{u}_{t}$ is stacked with $\{m^{u}_{c_{i},t}|c_{i}\in\mathcal{C}\}$ and $m^{u}_{c_{i},t}$ signifies the student’s knowledge mastery of the KC $c_{i}$ at time step $t$. A higher value denotes a superior level of mastery. However, the absence of annotated mastery levels requires researchers to resort to the student performance prediction task as a surrogate measure (Liu et al., 2021). In this paradigm, given $\mathcal{H}^{u}$, the objective is to predict whether student $u$ can correctly answer a new question $q^{u}_{|\mathcal{H}^{u}|+1}$, with its associated KC $c^{u}_{|\mathcal{H}^{u}|+1}$ at timestamp $T^{u}_{|\mathcal{H}^{u}|+1}$.
This hinges on the monotonicity assumption (Embretson and Reise, 2013), which posits that higher knowledge mastery leads to a higher probability of answering questions correctly. For brevity, we omit the superscript $u$ in the later method description.
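The task formulation above can be sketched as a small data structure. This is an illustrative sketch only: the class name `Response`, its field names, the helper `prediction_examples`, and the choice of seconds for timestamps are assumptions made here, not part of the paper or its released code.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Response:
    """One response r_t = (q_t, a_t, c_t, T_t) in the single-KC setting."""
    question: int     # q_t, index into the question set Q
    correct: int      # a_t in {0, 1}; 1 means a correct response
    kc: int           # c_t, index into the KC set C
    timestamp: float  # T_t; the unit (seconds here) is an assumption


def prediction_examples(
    history: List[Response],
) -> Iterator[Tuple[List[Response], Response]]:
    """Yield (observed prefix, next response) pairs for the surrogate task.

    The model sees the first t responses and must predict the correctness
    of response t+1 given its question, KC, and timestamp.
    """
    for t in range(1, len(history)):
        yield history[:t], history[t]


# Hypothetical three-step history for one student.
history = [
    Response(question=3, correct=1, kc=0, timestamp=0.0),
    Response(question=7, correct=0, kc=1, timestamp=42.0),
    Response(question=5, correct=1, kc=0, timestamp=90.0),
]
pairs = list(prediction_examples(history))
```

Each pair corresponds to one supervised prediction target; the mastery vectors $\textbf{m}_{t}$ themselves are never observed, which is why performance prediction serves as the surrogate.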

4. Methodology

As shown in Figure 2, GRKT conducts a recurrent modeling within a three-stage learning process: knowledge retrieval, memory strengthening, and knowledge learning/forgetting. The proposed KC relation-based graph neural networks capture knowledge mastery variation between KCs throughout these stages. This section introduces the KC relation-based GNNs first, then explains the three-stage learning process modeling with these GNNs. For ease of understanding GRKT, we list and explain all relevant notations in Appendix A.

4.1. KC Relation-based Graph Neural Networks

Based on the transfer of learning theory (Perkins et al., 1992), we introduce KC relation-based GNNs to transfer the knowledge leveraging and changing throughout the three-stage learning process, as shown in Figure 2. To avoid repetition, we first elaborate on a prototype of KC relation-based GNNs in this section and highlight differences when applied to different stages in the subsequent sections.

Due to the lack of KC relation annotations, we follow previous works (Nakagawa et al., 2019; Song et al., 2022) that construct KC relations from data statistics. Details are provided in Appendix B. We focus on the two most common relations, prerequisite and similarity, and derive three relation graphs $\{\mathcal{P},\mathcal{S},\mathcal{R}\}$, whose edges denote one KC being prerequisite, subsequent, or relevant (similar) to another. The prerequisite relation is split into two graphs because the forward and backward messages passed along the unidirectional prerequisite relation should be differentiated. Based on this, we design the KC relation-based GNNs with multiple layers. They receive KC node features such as knowledge memory, knowledge gain/loss, or knowledge learnt in the three stages, which are introduced later. To capture the graph information, each layer first aggregates the features of each node’s neighbors for each graph $\mathcal{G}\in\{\mathcal{P},\mathcal{S},\mathcal{R}\}$ from the previous layer as

$$\bar{\textbf{f}}^{\mathcal{G},(l)}_{c_{i}}=\frac{1}{|\mathcal{G}(c_{i})|}\sum_{c_{j}\in\mathcal{G}(c_{i})}\left(\beta^{\mathcal{G}}_{c_{i},c_{j}}\cdot\tilde{\textbf{f}}^{(l-1)}_{c_{j}}\textbf{W}^{\mathcal{G},(l)}_{proto}\right), \tag{1}$$

$$\tilde{\textbf{f}}^{\mathcal{G},(l)}_{c_{i}}=\text{ReLU}\left(\bar{\textbf{f}}_{c_{i}}^{\mathcal{G},(l)}\right)\textbf{O}_{proto}^{\mathcal{G},(l)}, \tag{2}$$

where $\textbf{W}^{\mathcal{G},(l)}_{proto}\in\mathbb{R}^{d_{l-1}\times d_{l-1}}$ and $\textbf{O}_{proto}^{\mathcal{G},(l)}\in\mathbb{R}^{d_{l-1}\times d_{l}}$ are the learnable weight matrices in this layer. $\mathcal{G}(\cdot)$ is the neighbor function of $\mathcal{G}$ . $\text{ReLU}(\cdot)$ is an activation function to introduce non-linearity to enhance model representability. $\beta^{\mathcal{G}}_{c_{i},c_{j}}$ is the correlation score of KC $c_{i}$ and $c_{j}$ on graph $\mathcal{G}$ , obtained by

$$\beta^{\mathcal{G}}_{c_{i},c_{j}}=\sigma\left({\textbf{k}}^{\text{T}}_{c_{i}}\textbf{W}^{\mathcal{G}}_{cor}\textbf{k}_{c_{j}}\right). \tag{3}$$

$\textbf{k}_{c_{i}},\textbf{k}_{c_{j}}\in\mathbb{R}^{1\times d_{e}}$ are the two KCs’ embeddings where $d_{e}$ is the number of embedding dimensions. $\textbf{W}^{\mathcal{G}}_{cor}\in\mathbb{R}^{d_{e}\times d_{e}}$ is the trainable matrix for $\mathcal{G}$ , and $\sigma(\cdot)$ denotes the sigmoid function, which regularizes the score in $(0,1)$ . We then fuse the aggregated features from the three graphs by

$$\tilde{\textbf{f}}^{(l)}_{c_{i}}=\begin{cases}\sum_{\mathcal{G}\in\{\mathcal{P},\mathcal{S},\mathcal{R}\}}\tilde{\textbf{f}}_{c_{i}}^{\mathcal{G},(l)}+\tilde{\textbf{f}}^{(l-1)}_{c_{i}},&\text{if }d_{l-1}=d_{l},\\ \sum_{\mathcal{G}\in\{\mathcal{P},\mathcal{S},\mathcal{R}\}}\tilde{\textbf{f}}_{c_{i}}^{\mathcal{G},(l)},&\text{if }d_{l-1}\neq d_{l},\end{cases} \tag{4}$$

where we apply a residual connection (Szegedy et al., 2017) when $d_{l-1}=d_{l}$ to stabilize the training process. In this prototype, we denote the input features of all KCs as $\tilde{\textbf{F}}^{(0)}\in\mathbb{R}^{|C|\times d_{0}}$ and one of them as $\tilde{\textbf{f}}^{(0)}_{c_{i}}\in\mathbb{R}^{1\times d_{0}}$ for KC $c_{i}$ , and the output features as $\tilde{\textbf{F}}^{(L)}\in\mathbb{R}^{|C|\times d_{L}}$ and $\tilde{\textbf{f}}_{c_{i}}^{(L)}\in\mathbb{R}^{1\times d_{L}}$ , where $d_{0},d_{L}$ are the numbers of input and output feature dimensions. Then this prototype GNN is formulated as:

$$\tilde{\textbf{F}}^{(L)}=\text{GNN}_{proto}(\tilde{\textbf{F}}^{(0)}|d_{0},d_{1},\cdots,d_{L}) \tag{5}$$

$$\tilde{\textbf{f}}_{c_{i}}^{(L)}=\text{GNN}_{proto}(\tilde{\textbf{f}}_{c_{i}}^{(0)}|d_{0},d_{1},\cdots,d_{L}). \tag{6}$$

This prototype is then extended for different student learning stages to construct reasonable knowledge tracing based on the transfer of learning theory. Besides, the number of layers $L$ controls the number of hops the feature propagates on the graphs, which clarifies the boundary between related and non-related KCs.
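One layer of this prototype (Equations 1 to 4) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `proto_layer`, the adjacency-list layout, the reading of each graph's neighbor function, the treatment of KCs without neighbors (they receive a zero aggregate), and all toy values are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def proto_layer(F_prev, K, neighbors, W, O, W_cor):
    """One prototype GNN layer over the relation graphs.

    F_prev:    (|C|, d_prev) KC features from the previous layer.
    K:         (|C|, d_e) KC embeddings, one row per KC.
    neighbors: dict graph name -> list of neighbor-index lists, one per KC.
    W, O, W_cor: dicts graph name -> matrices of shapes
                 (d_prev, d_prev), (d_prev, d_out), and (d_e, d_e).
    """
    n, d_prev = F_prev.shape
    d_out = next(iter(O.values())).shape[1]
    fused = np.zeros((n, d_out))
    for g, nbrs in neighbors.items():
        beta = sigmoid(K @ W_cor[g] @ K.T)       # Eq. 3: scores in (0, 1)
        msg = F_prev @ W[g]                      # transformed neighbor features
        agg = np.zeros((n, d_prev))
        for i, js in enumerate(nbrs):
            if js:                               # assumption: no neighbors -> zeros
                weights = beta[i, js][:, None]   # (len(js), 1)
                agg[i] = (weights * msg[js]).mean(axis=0)   # Eq. 1
        fused += np.maximum(agg, 0.0) @ O[g]     # Eq. 2: ReLU, then projection
    if d_prev == d_out:
        fused += F_prev                          # Eq. 4: residual connection
    return fused


# Toy setup (all values hypothetical): three KCs in a chain 0 -> 1 -> 2.
rng = np.random.default_rng(0)
n, d, d_e = 3, 4, 5
neighbors = {
    "P": [[], [0], [1]],   # assumed: prerequisites of each KC
    "S": [[1], [2], []],   # assumed: subsequent KCs of each KC
    "R": [[], [], []],     # no similarity edges in this toy graph
}
K = rng.normal(size=(n, d_e))
W = {g: rng.normal(size=(d, d)) for g in neighbors}
O = {g: rng.normal(size=(d, d)) for g in neighbors}
W_cor = {g: rng.normal(size=(d_e, d_e)) for g in neighbors}

F0 = np.zeros((n, d))
F0[0] = 1.0                # only KC 0 carries a signal
F1 = proto_layer(F0, K, neighbors, W, O, W_cor)
```

In this toy run, KC 2 is two hops away from KC 0, so its features remain zero after a single layer, which is the hop boundary that $L$ controls.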

4.2. Knowledge Memory & Knowledge Tracing

GRKT aims to model the process of students retrieving and learning knowledge with their memory. Therefore, we employ a dynamic knowledge memory bank denoted as $\textbf{H}\in\mathbb{R}^{|C|\times d_{k}}$, where each row $\textbf{h}_{c_{i}}$ encodes the current knowledge memory of KC $c_{i}$ for the student. Here, $d_{k}$ signifies the number of memory dimensions. This memory bank evolves alongside the student’s learning process, represented as $\textbf{H}_{t}$, with a learnable initial state $\textbf{H}_{0}$ representing their prior knowledge before engaging in any learning behavior. To track the knowledge mastery of a specific KC, we apply a non-negative projection vector $\textbf{w}_{h}\in\mathbb{R}^{d_{k}\times 1}_{\geq 0}$ to $\textbf{h}_{c_{i},t}$ using the equation:

$$\hat{m}_{c_{i},t}=\textbf{h}_{c_{i},t}\cdot\textbf{w}_{h}, \tag{7}$$

which yields the mastery of KC $c_{i}$ at time step $t$. The non-negative constraint on the network weights guarantees a monotonic relationship between mastery and each memory dimension. This technique has been widely adopted in numerous studies (Wang et al., 2022, 2021) to satisfy the monotonicity assumption. Moreover, we leverage this constraint to establish a foundation for reasonable knowledge tracing, which will be gradually refined in subsequent descriptions.
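The effect of the non-negative projection in Equation 7 can be illustrated with a short sketch. How the constraint is enforced is not specified in this section; taking the absolute value of an unconstrained parameter, as done below, is an assumption, and the names `mastery` and `w_raw` are hypothetical.

```python
import numpy as np


def mastery(H, w_raw):
    """Project knowledge memory H of shape (|C|, d_k) to per-KC mastery (Eq. 7).

    Assumption: non-negativity of w_h is obtained by taking the absolute
    value of an unconstrained parameter vector w_raw.
    """
    w_h = np.abs(w_raw)   # w_h >= 0 elementwise
    return H @ w_h        # shape (|C|,)


# Hypothetical memory bank with two KCs and d_k = 3.
H = np.array([[0.2, -0.1, 0.5],
              [0.0,  0.3, 0.1]])
w_raw = np.array([0.7, -1.2, 0.4])

m_before = mastery(H, w_raw)

H_more = H.copy()
H_more[0] += np.array([0.1, 0.0, 0.3])   # non-negative memory increase for KC 0
m_after = mastery(H_more, w_raw)
```

Because every entry of $\textbf{w}_{h}$ is non-negative, any elementwise non-negative change to a KC's memory cannot lower its mastery, and memory changes to one KC leave the mastery of other KCs untouched.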

4.3. Stage I: Knowledge Retrieval

In this stage, students retrieve stored knowledge from memory to solve given questions, a mechanism explained by memory theory (Melton, 1963). Additionally, the transfer of learning theory (Perkins et al., 1992) suggests that learners transfer knowledge from similar fields to tackle problems. Leveraging this insight, we employ a KC relation-based GNN to model knowledge transfer from related KCs. Specifically, we aggregate the knowledge memory of the given KC $c_{t}$ to solve its corresponding question $q_{t}$ before time step $t$ (represented as $t^{-}$ ):

$$\tilde{\textbf{h}}_{c_{t},t^{-}}^{(L)}=\text{GNN}_{rtv}(\tilde{\textbf{h}}^{(0)}_{c_{t},t^{-}}|\{d_{k}\}_{L+1})\,, \tag{8}$$

with the initialization $\tilde{\textbf{h}}^{(0)}_{c_{t},t^{-}}=\textbf{h}_{c_{t},t^{-}}$. Recognizing that different questions have different mastery requirements of KCs, we incorporate question-KC correlation scores into the aggregation process in this GNN, which are calculated by:

$$\alpha_{q_{i},c_{j}}=\sigma\left({\textbf{e}}^{\text{T}}_{q_{i}}\textbf{W}_{req}\textbf{k}_{c_{j}}\right)\,, \tag{9}$$

where $\textbf{e}_{q_{i}}\in\mathbb{R}^{d_{e}\times 1}$ and $\textbf{k}_{c_{j}}$ are the embeddings of $q_{i}$ and $c_{j}$ , and $\textbf{W}_{req}\in\mathbb{R}^{d_{e}\times d_{e}}$ is a learnable matrix. Then, the graph message aggregation process of Equation 8 is actually

$$\tilde{\textbf{h}}^{\mathcal{G},(l)}_{c_{t}}=\frac{1}{|\mathcal{G}(c_{t})|}\sum_{c_{i}\in\mathcal{G}(c_{t})}\left(\alpha_{q_{t},c_{i}}\cdot\beta^{\mathcal{G}}_{c_{t},c_{i}}\cdot\tilde{\textbf{h}}^{(l-1)}_{c_{i}}\textbf{W}^{\mathcal{G},(l)}_{rtv}\right). \tag{10}$$

We also remove the non-linear feed-forward process and restrict $\textbf{W}^{\mathcal{G},(l)}_{rtv}\in\mathbb{R}^{d_{k}\times d_{k}}_{\geq 0}$ to ensure that higher values of the related KCs’ memory bring higher knowledge mastery. After obtaining the aggregated knowledge memory from this GNN, we compute the knowledge mastery as in Equation 7 and compare it with the question difficulty $d_{q_{t}}$ to generate the predictive probability of solving the question:

$$\hat{a}_{t}=\sigma\left(\tilde{\textbf{h}}_{c_{t},t}^{(L)}\cdot\textbf{w}_{h}-d_{q_{t}}\right). \tag{11}$$

For multi-KC questions, we average the KCs’ memory. The difficulty $d_{q_{t}}$ of question $q_{t}$ is generated by a Multi-Layer Perceptron (MLP):

$$d_{q_{t}}=\text{ReLU}\left(\bar{\textbf{e}}_{q_{t}}\textbf{W}^{(1)}_{diff}+\textbf{b}^{(1)}_{diff}\right)\textbf{W}^{(2)}_{diff}+\textbf{b}^{(2)}_{diff}. \tag{12}$$

Here, $\bar{\textbf{e}}_{q_{t}}=[\textbf{k}_{c_{t}}\oplus\textbf{e}_{q_{t}}]$ is the concatenated representation of $q_{t}$ and its examined KC $c_{t}$’s embeddings. For multi-KC questions, we use the KCs’ average embedding. $\textbf{W}_{diff}^{(1)}\in\mathbb{R}^{2d_{e}\times d_{h}}$, $\textbf{W}_{diff}^{(2)}\in\mathbb{R}^{d_{h}\times 1}$, $\textbf{b}_{diff}^{(1)}\in\mathbb{R}^{1\times d_{h}}$, and $\textbf{b}_{diff}^{(2)}\in\mathbb{R}^{1\times 1}$ are learnable matrices and vectors. $d_{h}$ is the number of hidden dimensions. We denote the process of this two-layer MLP as $d_{q_{t}}=\text{MLP}_{diff}(\bar{\textbf{e}}_{q_{t}}|2d_{e},d_{h},1)$, and a similar notation is applied for brevity in subsequent descriptions. In this way, we model the process whereby students retrieve knowledge from memory to answer new questions.
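The prediction step of Equations 11 and 12 can be sketched as below, taking the aggregated memory as given. The function names `mlp_diff` and `predict_correct` and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow; the weights would be learned in practice.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mlp_diff(e_bar, W1, b1, W2, b2):
    """Two-layer MLP of Eq. 12; returns the scalar difficulty d_q."""
    hidden = np.maximum(e_bar @ W1 + b1, 0.0)   # ReLU hidden layer, shape (d_h,)
    return (hidden @ W2 + b2).item()            # size-1 array -> Python float


def predict_correct(h_agg, w_h, difficulty):
    """Eq. 11: sigmoid of (projected mastery minus question difficulty)."""
    return float(sigmoid(h_agg @ w_h - difficulty))


# Hypothetical embeddings with d_e = 4 and hidden size d_h = 6.
k_c = np.array([0.2, -0.1, 0.4, 0.0])   # KC embedding
e_q = np.array([0.3, 0.1, -0.2, 0.5])   # question embedding
e_bar = np.concatenate([k_c, e_q])      # [k_c (+) e_q], length 2 * d_e

W1, b1 = np.full((8, 6), 0.1), np.zeros(6)
W2, b2 = np.full((6, 1), 0.5), np.zeros(1)
difficulty = mlp_diff(e_bar, W1, b1, W2, b2)

w_h = np.array([0.5, 0.2, 0.8])         # non-negative projection (d_k = 3)
h_agg = np.array([0.1, 0.4, -0.2])      # aggregated memory from GNN_rtv

p = predict_correct(h_agg, w_h, difficulty)
p_more_memory = predict_correct(h_agg + 0.5, w_h, difficulty)
p_harder = predict_correct(h_agg, w_h, difficulty + 1.0)
```

With a non-negative $\textbf{w}_{h}$, raising the aggregated memory raises the predicted probability, while a harder question lowers it, which matches the monotonicity assumption.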

4.4. Stage II: Memory Strengthening

The testing effect theory (Roediger III and Karpicke, 2006) reveals that a correct retrieval strengthens the storage of knowledge in memory, while an unsuccessful retrieval can lead to incorrect strengthening. Without correction or active learning after the error, this may reduce knowledge mastery (Kornell et al., 2009). In this stage, we determine the memory strengthening process based on whether the examined KC is correctly retrieved to solve the question, resulting in either knowledge gain or loss. Additionally, these knowledge changes are propagated to related KCs based on the transfer of learning theory. To enhance memory from a correct response to question $q_{t}$ , we first combine and input the current memory $\textbf{h}_{c_{t},t^{-}}$ of KC $c_{t}$ and the question information $\bar{\textbf{e}}_{q_{t}}$ into an MLP to obtain an initial memory feature:

$$\textbf{g}_{c_{t},t}=\text{MLP}_{gain}\left([\textbf{h}_{c_{t},t^{-}}\oplus\bar{\textbf{e}}_{q_{t}}]|d_{k}+2d_{e},d_{h},d_{k}\right). \tag{13}$$

For multi-KC questions, we calculate all the associated KCs’ features. This feature serves as a spark to propagate knowledge changes via another KC relation-based GNN. Specifically, by initializing an input feature matrix $\tilde{\textbf{G}}^{(0)}_{t}$, where $\tilde{\textbf{g}}^{(0)}_{c_{i},t}=\textbf{g}_{c_{t},t}$ if $c_{i}=c_{t}$ and $\tilde{\textbf{g}}^{(0)}_{c_{i},t}=\textbf{0}$ if $c_{i}\neq c_{t}$, the knowledge gain for all KCs is obtained as follows:

$$\tilde{\textbf{G}}_{t}^{(L)}=\text{ReLU}(\text{GNN}_{gain}(\tilde{\textbf{G}}^{(0)}_{t}|\{d_{k}\}_{L+1})). \tag{14}$$

The $\text{ReLU}(\cdot)$ activation function ensures that the knowledge gain is non-negative. Moreover, because all features except that of the examined KC are initialized to zero, the knowledge gain is only propagated to KCs within $L$ hops, delineating a boundary between related and unrelated KCs. Similarly, we derive the negative knowledge loss $\tilde{\textbf{L}}_{t}^{(L)}$ when students provide incorrect responses and wrongly strengthen their memory, by using a similar network $\text{GNN}_{loss}(\cdot)$.
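The set of KCs that can receive a gain or loss under this zero initialization can be enumerated with a breadth-first search. The sketch below assumes that the three relation graphs are merged into a single undirected adjacency list, which presumes the prerequisite and subsequent graphs are mutual reverses; the function name `kcs_within_hops` and the toy graph are hypothetical.

```python
from collections import deque
from typing import Dict, List


def kcs_within_hops(neighbors: Dict[int, List[int]], source: int,
                    num_hops: int) -> Dict[int, int]:
    """Return {kc: hop distance} for KCs within `num_hops` of `source`.

    `neighbors[i]` lists the KCs adjacent to KC i in the merged relation
    graph (an assumption of this sketch). KCs absent from the result can
    only receive zero features after `num_hops` message-passing layers.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == num_hops:
            continue                  # do not expand beyond the hop limit
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Toy merged graph: chain 0 - 1 - 2 - 3 plus an isolated KC 4.
toy = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
related_one_hop = kcs_within_hops(toy, source=1, num_hops=1)
related_two_hops = kcs_within_hops(toy, source=1, num_hops=2)
```

In this toy graph, KC 4 is never reached for any number of hops, so answering a question on KC 1 cannot change its memory through this stage.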

Subsequently, we update the knowledge memory bank with respect to the response $a_{t}$ as follows:

$$\textbf{H}_{t}=\textbf{H}_{t^{-}}+a_{t}\tilde{\textbf{G}}_{t}^{(L)}+(1-a_{t})\tilde{\textbf{L}}_{t}^{(L)}. \tag{15}$$

It is worth noting that different questions also have different effects on strengthening students’ memory of KCs. Therefore, similar to Equation 10, these two GNNs also add the question-KC correlation scores during message passing. Thus, the second stage, memory strengthening, is reasonably modeled based on the testing effect and the transfer of learning.
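The update in Equation 15 and its effect on mastery can be sketched as follows, taking the propagated gain and loss matrices as given. The function name `strengthen_memory` and all values are hypothetical; the loss matrix is assumed to be elementwise non-positive, in line with the description of a negative knowledge loss.

```python
import numpy as np


def strengthen_memory(H_prev, gain, loss, a_t):
    """Eq. 15: add the gain after a correct response, the loss otherwise.

    Assumptions: `gain` is elementwise non-negative (ReLU output) and
    `loss` is elementwise non-positive.
    """
    return H_prev + a_t * gain + (1 - a_t) * loss


# Hypothetical memory bank with three KCs and d_k = 2; KC 2 is unrelated.
H_prev = np.array([[0.2, 0.1],
                   [0.4, 0.3],
                   [0.0, 0.5]])
gain = np.array([[0.3, 0.1],
                 [0.1, 0.0],
                 [0.0, 0.0]])
loss = np.array([[-0.2, -0.1],
                 [-0.05, 0.0],
                 [0.0, 0.0]])
w_h = np.array([0.6, 0.9])   # non-negative projection vector from Eq. 7

H_correct = strengthen_memory(H_prev, gain, loss, a_t=1)
H_wrong = strengthen_memory(H_prev, gain, loss, a_t=0)

m_prev = H_prev @ w_h
m_correct = H_correct @ w_h
m_wrong = H_wrong @ w_h
```

Under these sign assumptions and a non-negative $\textbf{w}_{h}$, a correct response cannot lower any KC's mastery in this stage and an incorrect one cannot raise it, which is the consistent change direction the stage is designed to provide.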

4.5. Stage III: Knowledge Learning/Forgetting

After students answer questions, their subsequent actions vary depending on the feedback received. They may review their correct answers or correct their mistakes. Besides, they might prepare for the KC of the next question they will encounter. These active learning behaviors contribute to improving their knowledge mastery, which we model as the knowledge learning process in this stage. Concretely, the KCs of the last question and the next question both influence the student’s learning target. Therefore, we use an MLP to determine whether the student actively learns them based on their current knowledge memory and the involved questions’ information. For KC $c_{i}\in\{c_{t},c_{t+1}\}$ (or more involved KCs for multi-KC questions), the two-dimensional policy distribution is calculated by:

(16)

\pi_{c_{i},t}=\text{softmax}\left(\text{MLP}_{dcs}([\textbf{h}_{c_{i},t}\oplus\bar{\textbf{e}}_{q_{t}}\oplus\bar{\textbf{e}}_{q_{t+1}}]|d_{k}+4d_{e},d_{h},2)\right).

Here, $\text{argmax }\pi_{c_{i},t}=0$ indicates that the first dimension is larger, in which case we assume there is no active learning. Conversely, $\text{argmax }\pi_{c_{i},t}=1$ indicates that the student would learn $c_{i}$ . In this case, we calculate the progress of learning $c_{i}$ in a similar way:

(17)

\textbf{p}_{c_{i},t}=\text{MLP}_{prg}([\textbf{h}_{c_{i},t}\oplus\bar{\textbf{e}}_{q_{t}}\oplus\bar{\textbf{e}}_{q_{t+1}}]|d_{k}+4d_{e},d_{h},d_{k}).

Based on the transfer of learning theory, this progress is also propagated to related KCs using another KC relation-based GNN. After initializing $\tilde{\textbf{P}}^{(0)}_{t}$ where $\tilde{\textbf{p}}^{(0)}_{c_{i},t}=\textbf{p}_{c_{i},t}$ for $c_{i}\in\{c_{t},c_{t+1}\}$ with $\text{argmax }\pi_{c_{i},t}=1$ , and $\tilde{\textbf{p}}^{(0)}_{c_{i},t}=\textbf{0}$ otherwise, we compute

(18)

\tilde{\textbf{P}}_{t}^{(L)}=\text{ReLU}(\text{GNN}_{prg}(\tilde{\textbf{P}}^{(0)}_{t}|\{d_{k}\}_{L+1})).

This active learning process continues until the student answers the next question. We therefore model each KC’s progress $\tilde{\textbf{p}}_{c_{i},t}^{(L)},c_{i}\in\mathcal{C}$ with a KC-specific time-aware kernel function and update the memory as:

(19)

\textbf{h}_{c_{i},(t+1)^{-}}=\textbf{h}_{c_{i},t}+\boldsymbol{\phi}_{c_{i}}(\tilde{\textbf{p}}^{(L)}_{c_{i},t},\Delta T_{t+1}),

where $\Delta T_{t+1}=T_{t+1}-T_{t}$ is the time duration until the next question. According to the learning curve (Yelle, 1979), the efficiency of students in learning a specific KC tends to be high initially and gradually decreases over both the learning time and frequency. Therefore, we design the kernel function in an exponential form:

(20)

\boldsymbol{\phi}_{c_{i}}(\tilde{\textbf{p}}^{(L)}_{c_{i},t},\Delta T_{t+1})=\tilde{\textbf{p}}^{(L)}_{c_{i},t}\odot(\textbf{1}-\text{exp}(-(n_{c_{i},t}+1)\Delta T_{t+1}\cdot\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(L)})),

where $\odot$ is the Hadamard product. $n_{c_{i},t}$ is the number of times that $c_{i}$ has been learned by the student. $\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(L)}$ represents the KC-specific kernel parameters of $c_{i}$ generated by another KC relation-based GNN. It leverages the property of graph homophily that makes related KCs have similar learning ratios:

(21)

\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(L)}=\text{softplus}(\text{GNN}_{lrn}(\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(0)}|d_{e},\{d_{k}\}_{L})),

initialized with $\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(0)}=\textbf{k}_{c_{i}}$ , the embedding of $c_{i}$ . Here, $\text{softplus}(\cdot)$ is an activation function that restricts the parameters to be positive. On the other hand, for KCs that students have acquired before but do not choose to learn, we introduce the knowledge forgetting process. That is, for the KCs on which students make no progress (i.e., $\tilde{\textbf{p}}^{(L)}_{c_{i},t}=\textbf{0}$ ), their previously acquired knowledge fades over time:

(22)

\textbf{h}_{c_{i},(t+1)^{-}}=\textbf{h}_{c_{i},t}-\boldsymbol{\kappa}_{c_{i}}(\Delta\textbf{h}_{c_{i},t},\Delta T_{t+1}),

where $\Delta\textbf{h}_{c_{i},t}=\textbf{h}_{c_{i},t}-\textbf{h}_{c_{i},0}$ represents the total knowledge the student has acquired so far. According to the forgetting curve (Ebbinghaus, 1885), the speed at which students forget knowledge is initially rapid and then gradually decreases with time and review frequency. Therefore, we similarly design KC-specific forgetting kernel functions in an exponential form:

(23)

\boldsymbol{\kappa}_{c_{i}}(\Delta\textbf{h}_{c_{i},t},\Delta T_{t+1})=\Delta\textbf{h}_{c_{i},t}\odot(\textbf{1}-\text{exp}(-(n_{c_{i},t}+1)\Delta T_{t+1}\cdot\tilde{\boldsymbol{\theta}}_{c_{i}}^{(L)})),

where the kernel parameters $\tilde{\boldsymbol{\theta}}_{c_{i}}^{(L)}$ are similarly generated by another KC relation-based GNN:

(24)

\tilde{\boldsymbol{\theta}}_{c_{i}}^{(L)}=\text{softplus}(\text{GNN}_{fgt}(\tilde{\boldsymbol{\theta}}_{c_{i}}^{(0)}|d_{e},\{d_{k}\}_{L})),

initialized with $\tilde{\boldsymbol{\theta}}_{c_{i}}^{(0)}=\textbf{k}_{c_{i}}$ . Consequently, based on the learning and forgetting curves, we derive the updated knowledge memory $\textbf{H}_{(t+1)^{-}}$ in this stage, which is recursively used for answering the next question.
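To make the behavior of the two kernels concrete, the following minimal sketch applies Equations 19, 20, 22 and 23 to a single KC. It assumes that the propagated progress and the positive kernel parameters are already available as plain lists; the function names (`kernel`, `next_memory`), the toy values, and the unit of $\Delta T$ are hypothetical, and the learned GNNs are not modeled.

```python
import math


def kernel(amount, delta_t, n_learned, rate):
    """Shared exponential form of Eqs. 20 and 23, applied element-wise:
    amount * (1 - exp(-(n + 1) * delta_t * rate))."""
    return [
        a * (1.0 - math.exp(-(n_learned + 1) * delta_t * r))
        for a, r in zip(amount, rate)
    ]


def next_memory(h_t, h_0, progress, delta_t, n_learned, gamma, theta):
    """Apply Eq. 19 if the KC made progress, otherwise Eq. 22."""
    if any(p != 0.0 for p in progress):
        step = kernel(progress, delta_t, n_learned, gamma)    # learning
        return [h + s for h, s in zip(h_t, step)]
    acquired = [h - h0 for h, h0 in zip(h_t, h_0)]             # Delta h in Eq. 22
    step = kernel(acquired, delta_t, n_learned, theta)         # forgetting
    return [h - s for h, s in zip(h_t, step)]


h_t, h_0 = [1.0, 0.6], [0.2, 0.2]     # toy memory with d_k = 2
gamma = theta = [0.5, 0.5]            # toy positive kernel parameters

learned = next_memory(h_t, h_0, [0.5, 0.0], 1.0, 0, gamma, theta)
forgot = next_memory(h_t, h_0, [0.0, 0.0], 1.0, 0, gamma, theta)
unchanged = next_memory(h_t, h_0, [0.0, 0.0], 0.0, 0, gamma, theta)
print(learned, forgot, unchanged)
```

Under these assumptions, the learning step is bounded by the progress vector and the forgetting step by the accumulated acquisition, so the memory saturates rather than growing or decaying without limit as $\Delta T_{t+1}$ increases.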

Table 1. Statistics of the three preprocessed datasets.

Dataset	ASSIST09	ASSIST12	Junyi
#response	0.4m	2.6m	25.4m
#sequence	7.4k	38.1k	325.4k
#question	13.5k	51.0k	2.8k
#concept	140	198	722
#concept/question	1.22	1.0	1.0

Table 2. Results of the main experiments. The best results among GRKT and the baselines are in bold. The second-best ones are in italic. * indicates statistical significance over the best baseline, measured by t-test with p-value $\leq$ 0.05. “CONS”, “GAUC” and “RPT” are short for the three reasonability metrics: Consistency, GAUCM and Repetition.

Dataset	ASSIST09					ASSIST12					Junyi
Metric	AUC	ACC	CONS	GAUC	RPT	AUC	ACC	CONS	GAUC	RPT	AUC	ACC	CONS	GAUC	RPT
DKT	0.7695	0.7246	0.6463	0.7172	0.8131	0.7303	0.7358	0.6772	0.6929	0.7955	0.8003	0.8541	0.7432	0.6415	0.8790
DKVMN	0.7680	0.7239	0.8708	0.7116	0.8061	0.7279	0.7349	0.9273	0.6729	0.7971	0.8004	0.8541	0.9455	0.6379	0.8780
DKT+	0.7707	0.7245	0.6364	0.7089	0.8395	0.7300	0.7353	0.6809	0.6766	0.8172	0.7993	0.8539	0.7624	0.6436	0.8869
SAKT	0.7634	0.7206	0.8539	0.7101	0.7749	0.7227	0.7329	0.8202	0.6866	0.7797	0.7995	0.8535	0.8600	0.6387	0.8747
GKT	0.7702	0.7252	0.6697	0.7183	0.8124	0.7339	0.7372	0.7450	0.6971	0.7986	0.8023	0.8547	0.7403	0.6398	0.8788
AKT	0.7820	0.7320	0.5870	0.7113	0.8184	0.7665	0.7514	0.5909	0.6892	0.8172	0.8161	0.8593	0.5810	0.6398	0.8734
SKT	0.7732	0.7273	0.7023	0.7098	0.8092	0.7354	0.7398	0.7813	0.6952	0.7934	0.8045	0.8552	0.7792	0.6420	0.8805
LPKT	0.7869	0.7369	0.7909	0.7124	0.8205	0.7740	0.7556	0.8174	0.6839	0.8255	0.8153	0.8585	0.7238	0.6453	0.8845
DIMKT	0.7814	0.7351	0.7899	0.7153	0.8221	0.7711	0.7550	0.8099	0.6995	0.8198	0.8163	0.8594	0.8945	0.6424	0.8850
DTrans	0.7858	0.7345	0.8928	0.7126	0.8253	0.7720	0.7542	0.9217	0.6863	0.8249	0.8149	0.8577	0.9274	0.6420	0.8893
LBKT	0.7865	0.7372	0.8054	0.7134	0.8225	0.7763	0.7562	0.8123	0.6814	0.8230	0.8140	0.8568	0.8123	0.6409	0.8871
GRKT	0.7914*	0.7398*	1.0000*	0.7209*	0.8486*	0.7794*	0.7576	1.0000*	0.7064*	0.8319*	0.8207*	0.8624*	1.0000*	0.6473*	0.8957*
improv.	0.57%	0.35%	12.01%	0.36%	1.09%	0.40%	0.19%	8.50%	0.98%	0.78%	0.54%	0.35%	7.83%	0.31%	0.72%

Table 3. Results of the ablation experiments.

Dataset	ASSIST09					ASSIST12					Junyi
Metric	AUC	ACC	CONS	GAUC	RPT	AUC	ACC	CONS	GAUC	RPT	AUC	ACC	CONS	GAUC	RPT
GRKT	0.7914	0.7398	1.0000	0.7209	0.8486	0.7794	0.7576	1.0000	0.7064	0.8319	0.8207	0.8624	1.0000	0.6473	0.8957
-LF	0.7871	0.7367	1.0000	0.7066	0.8243	0.7767	0.7558	1.0000	0.6809	0.8276	0.8170	0.8598	1.0000	0.6401	0.8815
-SIM-PRE	0.7578	0.7161	1.0000	0.6197	0.8246	0.7502	0.7424	1.0000	0.6223	0.8291	0.7921	0.8481	1.0000	0.6084	0.8781
-SIM	0.7896	0.7375	1.0000	0.7135	0.8402	0.7777	0.7564	1.0000	0.6888	0.8259	0.8186	0.8611	1.0000	0.6447	0.8862
-PRE	0.7897	0.7384	1.0000	0.7149	0.8437	0.7779	0.7563	1.0000	0.6915	0.8264	0.8191	0.8615	1.0000	0.6452	0.8897

4.6. Model Training

The three-stage modeling is recurrent along the student response sequence. After learning/forgetting knowledge in the third stage, the updated knowledge memory is prepared for the first stage to answer the next question. This makes GRKT end-to-end trainable, so we directly train the model with the binary cross-entropy loss, aligning the predictive probability $\hat{a}^{u}_{t}$ from Equation 11 with the ground-truth response correctness label $a^{u}_{t}$ :

(25)

\mathcal{L}=-\sum_{u\in\mathcal{U}}\sum_{r_{t}^{u}\in\mathcal{H}^{u}}\left[a^{u}_{t}\log\hat{a}^{u}_{t}+(1-a^{u}_{t})\log(1-\hat{a}^{u}_{t})\right].

Here, we omit the averaging notation for brevity. Besides, we apply $l_{2}$ regularization to the model parameters during training to avoid over-fitting.
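For reference, the summed binary cross-entropy of Equation 25 can be written in a few lines. This is a minimal sketch for one flat list of responses; the function name `bce_loss` and the clipping constant `eps` are our own additions for numerical safety and are not part of the formulation above.

```python
import math


def bce_loss(labels, probs, eps=1e-12):
    """Summed binary cross-entropy over (label, predicted probability) pairs."""
    total = 0.0
    for a, a_hat in zip(labels, probs):
        a_hat = min(max(a_hat, eps), 1.0 - eps)   # clip to avoid log(0)
        total += a * math.log(a_hat) + (1 - a) * math.log(1.0 - a_hat)
    return -total


uninformed = bce_loss([1, 0], [0.5, 0.5])   # equals 2 * ln 2
confident = bce_loss([1, 0], [0.9, 0.1])    # lower loss for better predictions
print(uninformed, confident)
```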

5. Experiments

In this section, we design comprehensive experiments to address the following research questions:

Q1: Does GRKT achieve competitive results in terms of both prediction performance and knowledge tracing reasonability compared to current state-of-the-art DLKT methods?

Q2: What are the roles and impacts of different components of GRKT on the overall performance and reasonability?

Q3: How reasonable is the knowledge mastery traced by GRKT from an intuitive perspective?

Additionally, we conduct other experiments such as hyper-parameter analysis. Due to space constraints, we include them in Appendix C.4.

5.1. Experimental Setup

5.1.1. Datasets

We evaluate the performance of GRKT on three widely-used public KT datasets:

• ASSIST09 (Feng et al., 2009) (https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data): ASSISTments is an online tutoring system for mathematics, which collected this dataset from 2009 to 2010. We use the combined version. Since the timestamp information is missing, we approximate it using the field order_id.

• ASSIST12 (Feng et al., 2009) (https://sites.google.com/site/assistmentsdata/home/2012-13-school-data-with-affect): Another dataset from ASSISTments, collected from 2012 to 2013.

• Junyi (Chang et al., 2015) (https://pslcdatashop.web.cmu.edu/Files?datasetId=1198): This dataset was collected from the Junyi Academy online platform in 2015. It contains partially annotated KC relationships, which suit the requirements of GRKT. We use the junyi_ProblemLog_original.csv version.

To preprocess each dataset, we partition the response sequence of every student into subsequences, each containing 100 responses. Subsequences containing fewer than 10 responses are eliminated, while those with fewer than 100 responses are padded with zeros to meet the required length. Statistics of the processed datasets can be found in Table 1.
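The partitioning rule above can be sketched as follows. The function name `split_sequence` and the returned padding mask are hypothetical conveniences and are not described in the text; only the thresholds (100 and 10) and the zero padding come from the description.

```python
def split_sequence(responses, max_len=100, min_len=10, pad=0):
    """Cut one student's responses into chunks of max_len, drop chunks
    shorter than min_len, and zero-pad the remaining short chunk."""
    chunks = []
    for start in range(0, len(responses), max_len):
        chunk = list(responses[start:start + max_len])
        if len(chunk) < min_len:
            continue                                  # too short: eliminated
        n_pad = max_len - len(chunk)
        mask = [1] * len(chunk) + [0] * n_pad         # 1 = real response (our addition)
        chunks.append((chunk + [pad] * n_pad, mask))
    return chunks


kept_115 = split_sequence(list(range(1, 116)))   # chunks of 100 and 15 (padded)
kept_205 = split_sequence(list(range(1, 206)))   # 100, 100, and 5 (dropped)
print(len(kept_115), len(kept_205))
```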

5.1.2. Evaluation

Since predicting student responses is a binary classification task, we use the area under the curve (AUC) and accuracy (ACC) as the evaluation metrics for prediction performance. For evaluating model reasonability, we introduce three metrics:

• Consistency: We propose this metric to measure the ratio of consistent variation between the mastery of KCs. When a student’s mastery of the corresponding KC declines after answering a certain question, the mastery of other KCs should either decline (for related KCs) or remain unchanged (for unrelated KCs). We calculate this percentage.

• GAUCM: This metric calculates the average AUC scores with respect to the mastery of each question’s examined KC. It reflects the monotonicity assumption: a question is more likely to be answered correctly if students have higher mastery of its KC. This metric is proposed by Zhang et al. (Zhang et al., 2023).

• Repetition: This metric is proposed by Yeung et al. (Yeung and Yeung, 2018), stating that a reasonable KT method should satisfy the following: after a student has finished a question and is given the same question again, the response result (correct or incorrect) should remain the same. We calculate the accuracy under this circumstance.
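One possible reading of the Consistency metric is sketched below. It is an assumption on our part rather than the formula used in the experiments: for every response step, each non-examined KC counts as consistent if its mastery change is zero or has the same sign as the change of the examined KC. The function name `consistency` and the toy mastery values are hypothetical.

```python
def consistency(before, after, examined):
    """Fraction of (step, other KC) pairs whose mastery change does not
    oppose the direction of the examined KC's change."""
    consistent, total = 0, 0
    for m_prev, m_next, c in zip(before, after, examined):
        direction = m_next[c] - m_prev[c]
        for i, (x, y) in enumerate(zip(m_prev, m_next)):
            if i == c:
                continue
            total += 1
            if (y - x) * direction >= 0:      # same sign or unchanged
                consistent += 1
    return consistent / total if total else 1.0


# One step, three KCs, KC 0 examined and declining.
all_agree = consistency([[0.5, 0.5, 0.5]], [[0.4, 0.45, 0.5]], [0])
one_opposes = consistency([[0.5, 0.5, 0.5]], [[0.4, 0.6, 0.5]], [0])
print(all_agree, one_opposes)
```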

The formulas of these metrics are presented in Appendix C.1. Moreover, we employ a five-fold cross-validation to assess the model’s performance. 10% of the sequences of each fold serve as the validation set for parameter tuning. We stop the training when the validation performance fails to improve for 10 consecutive epochs.

5.1.3. Baselines

To compare with mainstream DLKT methods covering different aspects, we select eleven baselines from 2015 to 2023, including DKT (Piech et al., 2015), DKVMN (Zhang et al., 2017), DKT+ (Yeung and Yeung, 2018), SAKT (Pandey and Karypis, 2019), GKT (Nakagawa et al., 2019), AKT (Ghosh et al., 2020), SKT (Tong et al., 2020), LPKT (Shen et al., 2021), DIMKT (Shen et al., 2022), DTransformer (Yin et al., 2023) and LBKT (Xu et al., 2023). Among them, GKT and SKT leverage the KC graph, and LPKT leverages the timestamp information. DKT+, LPKT and DTransformer consider some aspects of model reasonability, namely knowledge tracing stability or learning/forgetting behaviors, but do not comprehensively address the DLKT unreasonableness issue. For AKT and DIMKT, which do not provide a proxy for tracing knowledge mastery, we follow previous works (Cui et al., 2023; Liu et al., 2019) and replace the input question features with zeros to estimate the mastery. Cognitive diagnosis baselines are not considered because they usually focus on static testing environments (Leighton and Gierl, 2007), whereas we study dynamic learning scenarios.

5.1.4. Implementation Details

We employ the Adam optimizer (Kingma and Ba, 2014) for all methods to achieve their best performance. We choose their learning rates from {1e-2, 5e-3, 1e-3, 5e-4, 1e-4} and fix the embedding and hidden dimensions at 128 for fairness. We strictly follow the original papers of all methods to set their other hyper-parameters. For GRKT, the detailed hyper-parameter settings are given in Appendix C.2. Furthermore, for the non-negative constraint on the specified network weights in Equations 7 and 10, we use the softmax operation along the knowledge memory dimension, which performs best in practice. Besides, the Junyi dataset includes some labeled relations; we experiment with them and present the results in Appendix C.3.

5.2. Overall Performance (Q1)

Table 2 presents the performance comparison between GRKT and the eleven baselines. GRKT achieves the best results, surpassing the leading baselines by margins ranging from 0.19% to 12.01% across both prediction performance and reasonability metrics. For AUC and ACC, which primarily gauge predictive accuracy, the state-of-the-art DLKT techniques LPKT and DIMKT perform strongly owing to their sophisticated neural architectures. Besides, the methods that emphasize aspects of reasonability, such as enhancing knowledge tracing stability and explicitly modeling learning and forgetting behaviors (DKT+, LPKT, and DTransformer), demonstrate competitive performance on the reasonability metrics. These methods secure seven out of nine second-place positions in reasonability metrics. Remarkably, GRKT achieves a perfect score of 1.0 on the consistency metric, signifying that its network constraints effectively maintain consistent knowledge mastery changes across KCs.
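The two extreme margins quoted above are consistent with a relative improvement over the best baseline value in Table 2, which the short check below reproduces. The helper name `relative_improvement` is hypothetical; the four input values are copied from Table 2 (CONS on ASSIST09 and ACC on ASSIST12).

```python
def relative_improvement(ours, best_baseline):
    """Relative gain over the best baseline, in percent."""
    return (ours / best_baseline - 1.0) * 100.0


cons_assist09 = round(relative_improvement(1.0000, 0.8928), 2)   # GRKT vs. DTrans
acc_assist12 = round(relative_improvement(0.7576, 0.7562), 2)    # GRKT vs. LBKT
print(cons_assist09, acc_assist12)  # 12.01 0.19
```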

5.3. Ablation Study (Q2)

The ablation study aims to evaluate the impact of each component in GRKT by removing specific techniques and comparing the results with the full model. Four components are removed:

• -LF: Removal of the third stage, knowledge learning/forgetting.

• -SIM: Removal of the similarity relation.

• -PRE: Removal of the prerequisite relation.

• -SIM-PRE: Removal of both relations, i.e., the KC relation graphs are not used.

As shown in Table 3, GRKT-SIM-PRE experiences the most significant deterioration, emphasizing the crucial role of KC relations in the KT task. When only one of the two relations is utilized (-SIM or -PRE), performance improves notably over GRKT-SIM-PRE, indicating that each relation provides meaningful information for GRKT. The performance is further enhanced when both relations are used together. Additionally, the degradation of GRKT-LF underscores the importance of modeling the knowledge learning/forgetting stage.

5.4. Reasonable Knowledge Tracing (Q3)

To intuitively validate the reasonability of GRKT, we present one student’s dynamic knowledge mastery traced by GRKT in Figure 3. As depicted, the result aligns well with our hypothesis of a comprehensive and reasonable knowledge tracing model integrating various effects based on pedagogical theories. Furthermore, it addresses three key issues in the reasonableness of existing DLKT methods: mastery changes of unrelated KCs, no mastery changes of related KCs, and inconsistent mastery change direction. We also present GRKT, LPKT and DKT tracing another student’s mastery of the KC Addition and Subtraction Integers in Figure 4. As shown, GRKT yields reasonable knowledge tracing results such as the fine-grained knowledge changes from testing effects and the faded knowledge following forgetting curves. LPKT and DKT still exhibit reasonability issues such as mastery changes of unrelated KCs and no mastery changes of related KCs.

5.5. Complexity Analysis

Despite the detailed methodology description, GRKT is composed only of GNNs and MLPs, so its inference is not complicated. Suppose $t$ is the length of the response sequence, $C$ is the KC set, $E$ is the KC relation edge set, $d$ is the hidden dimension number we set as a small value of 16, and $k$ is GRKT’s memory dimension number. The time complexity of GRKT is then $O(t|E|k+|E|d+t|C|k^{2}+|C|d^{2}+td^{2})$ , consisting of $O(t|E|k+|E|d)$ for feature aggregation and $O(t|C|k^{2}+|C|d^{2})$ for the non-linear feature transformation of the GNNs, and $O(td^{2})$ for the MLPs. In contrast, comparable attention- or RNN-based methods usually have time complexity $O(td^{2}+t^{2}d)$ . In real scenarios, $t,|C|,d$ usually lie between 100 and 200 and the KC relation graphs are sparse. Therefore, we can approximately assume $t=d=|C|=k^{2}=n$ and $|E|=k\cdot|C|$ to facilitate the complexity comparison, which indicates that GRKT’s time complexity is of the same order of magnitude, $O(n^{3})$ , as other methods. We also test the inference speed of GRKT. On average, it takes 60 ms per student, which is acceptable in practice.
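The order-of-magnitude claim can be checked by tracking the exponent of $n$ in each term under the stated substitution $t=d=|C|=k^{2}=n$ and $|E|=k\cdot|C|$. The snippet below is only a bookkeeping aid; the dictionary names `power` and `terms` are hypothetical.

```python
from fractions import Fraction as F

# Exponent of n for each symbol under t = d = |C| = k^2 = n and |E| = k * |C|.
power = {"t": F(1), "d": F(1), "C": F(1), "k": F(1, 2)}
power["E"] = power["k"] + power["C"]          # |E| = k * |C| -> n^(3/2)

# Exponent of n for every term of the GRKT complexity expression.
terms = {
    "t|E|k": power["t"] + power["E"] + power["k"],
    "|E|d": power["E"] + power["d"],
    "t|C|k^2": power["t"] + power["C"] + 2 * power["k"],
    "|C|d^2": power["C"] + 2 * power["d"],
    "td^2": power["t"] + 2 * power["d"],
}
dominant = max(terms.values())
print(terms, dominant)
```

Every term has exponent at most 3, which matches the $O(n^{3})$ order of the $O(td^{2}+t^{2}d)$ baseline expression under the same substitution.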

6. Conclusion

In this paper, we point out that many existing DLKT approaches prioritize predictive accuracy over tracking students’ dynamic knowledge mastery. This often results in models that yield unreasonable outcomes, complicating their application in real teaching scenarios. To this end, our study introduces GRKT, a graph-based reasonable knowledge tracing method. It employs graph neural networks and a finer-grained three-stage modeling process based on pedagogical theories to conduct more reasonable knowledge tracing. Extensive experiments across multiple datasets demonstrate that GRKT not only enhances predictive accuracy but also generates more reasonable knowledge tracing results. In the future, we plan to address certain limitations of GRKT, such as enhancing the model’s ability to handle more fine-grained responses, including multiple-choice or essay answers. Furthermore, we plan to evaluate GRKT in real teaching scenarios.

References

Chang et al. (2015) Haw-Shiuan Chang, Hwai-Jung Hsu, and Kuan-Ta Chen. 2015. Modeling Exercise Relationships in E-Learning: A Unified Approach. In EDM. 532–535.
Choi et al. (2020) Youngduck Choi, Youngnam Lee, Junghyun Cho, Jineon Baek, Byungsoo Kim, Yeongmin Cha, Dongmin Shin, Chan Bae, and Jaewe Heo. 2020. Towards an appropriate query, key, and value computation for knowledge tracing. In Proceedings of the seventh ACM conference on learning@ scale. 341–344.
Corbett and Anderson (1994) Albert T Corbett and John R Anderson. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction 4 (1994), 253–278.
Cui et al. (2023) Jiajun Cui, Zeyuan Chen, Aimin Zhou, Jianyong Wang, and Wei Zhang. 2023. Fine-Grained Interaction Modeling with Multi-Relational Transformer for Knowledge Tracing. ACM Transactions on Information Systems 41, 4 (2023), 1–26.
Cui et al. (2024) Jiajun Cui, Minghe Yu, Bo Jiang, Aimin Zhou, Jianyong Wang, and Wei Zhang. 2024. Interpretable Knowledge Tracing via Response Influence-based Counterfactual Reasoning. In Proceedings of the 40th IEEE International Conference on Data Engineering.
Ebbinghaus (1885) Hermann Ebbinghaus. 1885. Über das gedächtnis: untersuchungen zur experimentellen psychologie. Duncker & Humblot.
Embretson and Reise (2013) Susan E Embretson and Steven P Reise. 2013. Item response theory. Psychology Press.
Feng et al. (2009) Mingyu Feng, Neil Heffernan, and Kenneth Koedinger. 2009. Addressing the assessment challenge with an online system that tutors as it assesses. User modeling and user-adapted interaction 19 (2009), 243–266.
Gan et al. (2022) Wenbin Gan, Yuan Sun, and Yi Sun. 2022. Knowledge structure enhanced graph representation learning model for attentive knowledge tracing. International Journal of Intelligent Systems 37, 3 (2022), 2012–2045.
Gao et al. (2023) Weibo Gao, Hao Wang, Qi Liu, Fei Wang, Xin Lin, Linan Yue, Zheng Zhang, Rui Lv, and Shijin Wang. 2023. Leveraging transferable knowledge concept graph embedding for cold-start cognitive diagnosis. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 983–992.
Ghosh et al. (2020) Aritra Ghosh, Neil Heffernan, and Andrew S Lan. 2020. Context-aware attentive knowledge tracing. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2330–2339.
Guidotti et al. (2018) Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51, 5 (2018), 1–42.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Kornell et al. (2009) Nate Kornell, Matthew Jensen Hays, and Robert A Bjork. 2009. Unsuccessful retrieval attempts enhance subsequent learning. Journal of Experimental Psychology: Learning, Memory, and Cognition 35, 4 (2009), 989.
Leighton and Gierl (2007) Jacqueline Leighton and Mark Gierl. 2007. Cognitive diagnostic assessment for education: Theory and applications. Cambridge University Press.
Liu et al. (2019) Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Hui Xiong, Yu Su, and Guoping Hu. 2019. Ekt: Exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering 33, 1 (2019), 100–115.
Liu et al. (2021) Qi Liu, Shuanghong Shen, Zhenya Huang, Enhong Chen, and Yonghe Zheng. 2021. A survey of knowledge tracing. arXiv preprint arXiv:2105.15106 (2021).
Melton (1963) Arthur W Melton. 1963. Implications of short-term memory for a general theory of memory. Journal of verbal Learning and verbal Behavior 2, 1 (1963), 1–21.
Nakagawa et al. (2019) Hiromi Nakagawa, Yusuke Iwasawa, and Yutaka Matsuo. 2019. Graph-based knowledge tracing: modeling student proficiency using graph neural network. In IEEE/WIC/ACM International Conference on Web Intelligence. 156–163.
Pandey and Karypis (2019) Shalini Pandey and George Karypis. 2019. A Self-Attentive Model for Knowledge Tracing. International Educational Data Mining Society (2019).
Pardos and Heffernan (2011) Zachary A Pardos and Neil T Heffernan. 2011. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaption and Personalization: 19th International Conference, UMAP 2011, Girona, Spain, July 11-15, 2011. Proceedings 19. Springer, 243–254.
Perkins et al. (1992) David N Perkins, Gavriel Salomon, et al. 1992. Transfer of learning. International encyclopedia of education 2 (1992), 6452–6457.
Piech et al. (2015) Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. 2015. Deep knowledge tracing. Advances in neural information processing systems 28 (2015).
Roediger III and Karpicke (2006) Henry L Roediger III and Jeffrey D Karpicke. 2006. Test-enhanced learning: Taking memory tests improves long-term retention. Psychological science 17, 3 (2006), 249–255.
Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks 20, 1 (2008), 61–80.
Shen et al. (2022) Shuanghong Shen, Zhenya Huang, Qi Liu, Yu Su, Shijin Wang, and Enhong Chen. 2022. Assessing Student’s Dynamic Knowledge State by Exploring the Question Difficulty Effect. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 427–437.
Shen et al. (2021) Shuanghong Shen, Qi Liu, Enhong Chen, Zhenya Huang, Wei Huang, Yu Yin, Yu Su, and Shijin Wang. 2021. Learning process-consistent knowledge tracing. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 1452–1460.
Song et al. (2022) Xiangyu Song, Jianxin Li, Qi Lei, Wei Zhao, Yunliang Chen, and Ajmal Mian. 2022. Bi-CLKT: Bi-graph contrastive learning based knowledge tracing. Knowledge-Based Systems 241 (2022), 108274.
Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31.
Tong et al. (2020) Shiwei Tong, Qi Liu, Wei Huang, Zhenya Huang, Enhong Chen, Chuanren Liu, Haiping Ma, and Shijin Wang. 2020. Structure-based knowledge tracing: An influence propagation view. In 2020 IEEE international conference on data mining (ICDM). IEEE, 541–550.
Wang et al. (2022) Fei Wang, Qi Liu, Enhong Chen, Zhenya Huang, Yu Yin, Shijin Wang, and Yu Su. 2022. NeuralCD: a general framework for cognitive diagnosis. IEEE Transactions on Knowledge and Data Engineering (2022).
Wang et al. (2021) Xinping Wang, Caidie Huang, Jinfang Cai, and Liangyu Chen. 2021. Using knowledge concept aggregation towards accurate cognitive diagnosis. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2010–2019.
Wu et al. (2024) Siyu Wu, Yang Cao, Jiajun Cui, Runze Li, Hong Qian, Bo Jiang, and Wei Zhang. 2024. A Comprehensive Exploration of Personalized Learning in Smart Education: From Student Modeling to Personalized Recommendations. arXiv:2402.01666
Xu et al. (2023) Bihan Xu, Zhenya Huang, Jiayu Liu, Shuanghong Shen, Qi Liu, Enhong Chen, Jinze Wu, and Shijin Wang. 2023. Learning behavior-oriented knowledge tracing. In Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining. 2789–2800.
Yang et al. (2021) Yang Yang, Jian Shen, Yanru Qu, Yunfei Liu, Kerong Wang, Yaoming Zhu, Weinan Zhang, and Yong Yu. 2021. GIKT: a graph-based interaction model for knowledge tracing. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part I. Springer, 299–315.
Yelle (1979) Louis E Yelle. 1979. The learning curve: Historical review and comprehensive survey. Decision sciences 10, 2 (1979), 302–328.
Yeung and Yeung (2018) Chun-Kit Yeung and Dit-Yan Yeung. 2018. Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In Proceedings of the fifth annual ACM conference on learning at scale. 1–10.
Yin et al. (2023) Yu Yin, Le Dai, Zhenya Huang, Shuanghong Shen, Fei Wang, Qi Liu, Enhong Chen, and Xin Li. 2023. Tracing Knowledge Instead of Patterns: Stable Knowledge Tracing with Diagnostic Transformer. In Proceedings of the ACM Web Conference 2023. 855–864.
Yudelson et al. (2013) Michael V Yudelson, Kenneth R Koedinger, and Geoffrey J Gordon. 2013. Individualized bayesian knowledge tracing models. In Artificial Intelligence in Education: 16th International Conference, AIED 2013, Memphis, TN, USA, July 9-13, 2013. Proceedings 16. Springer, 171–180.
Zhang et al. (2017) Jiani Zhang, Xingjian Shi, Irwin King, and Dit-Yan Yeung. 2017. Dynamic key-value memory networks for knowledge tracing. In Proceedings of the 26th international conference on World Wide Web. 765–774.
Zhang et al. (2023) Moyu Zhang, Xinning Zhu, Chunhong Zhang, Wenchen Qian, Feng Pan, and Hui Zhao. 2023. Counterfactual Monotonic Knowledge Tracing for Assessing Students’ Dynamic Mastery of Knowledge Concepts. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3236–3246.

Table 4. The notation table of GRKT. We omit the superscript of the target student $u$ whose knowledge is to be traced.

Task formulation
$\mathcal{U},\mathcal{Q},\mathcal{C}$	sets of students, questions, KCs
$c_{i},c_{j},q_{i},q_{j}$	certain KCs, questions
$u,t$	the target student, time step
$\mathcal{H}$	response history of $u$
$r_{t}$	response of $u$ at $t$
$q_{t},c_{t}$	question and examined KC of $r_{t}$
$a_{t},T_{t}$	binary correctness and timestamp of $r_{t}$
$\mathcal{M}$	evolving knowledge mastery of $u$
$\textbf{m}_{t}$	knowledge mastery of $u$ at $t$
$m_{c_{i},t}$	knowledge mastery of $c_{i}$ of $u$ at $t$
KC relation-based GNN
$\mathcal{P},\mathcal{S},\mathcal{R}$	prerequisite, subsequence, similarity graphs
$\mathcal{P}(\cdot),\mathcal{S}(\cdot),\mathcal{R}(\cdot)$	neighbor functions of $\mathcal{P},\mathcal{S},\mathcal{R}$
$\mathcal{G}$	certain graph in $\mathcal{P},\mathcal{S},\mathcal{R}$
$\mathcal{G}(\cdot)$	neighbor function of $\mathcal{G}$
$L$	number of GNN layers
$\text{GNN}_{proto}$	prototype of KC relation-based GNN
$d_{0},d_{1},...,d_{L}$	# of dimensions of prototype GNN’s layers
$\tilde{\textbf{f}}_{c_{i}}^{(0)},\tilde{\textbf{F}}^{(0)}$	prototype input of $c_{i}$ and all to $\text{GNN}_{proto}$
$\tilde{\textbf{f}}_{c_{i}}^{(L)},\tilde{\textbf{F}}^{(L)}$	prototype output of $c_{i}$ and all from $\text{GNN}_{proto}$
$\tilde{\textbf{f}}_{c_{i}}^{(l)},\tilde{\textbf{F}}^{(l)}$	prototype intermediate of $c_{i}$ and all of $l^{th}$ layer of $\text{GNN}_{proto}$
$\textbf{W}^{\mathcal{G},(l)}_{proto},\textbf{O}^{\mathcal{G},(l)}_{proto}$	weight matrices of $l^{th}$ layer of $\text{GNN}_{proto}$ for $\mathcal{G}$
$\tilde{\textbf{f}}_{c_{i}}^{\mathcal{G},(l)}$	prototype intermediate of $c_{i}$ of $l^{th}$ layer of $\text{GNN}_{proto}$ for $\mathcal{G}$
GRKT basic factors
$\textbf{e}_{q_{i}},\textbf{e}_{q_{t}},\textbf{k}_{c_{i}},\textbf{k}_{c_{t}}$	embeddings of $q_{i},q_{t},c_{i},c_{t}$
$\bar{\textbf{e}}_{q_{i}},\bar{\textbf{e}}_{q_{t}}$	concatenation of $q_{i}$ and its KC’s embeddings
$\alpha_{q_{t},c_{j}}$	requirement score of $q_{t}$ requiring $c_{j}$
$\textbf{W}_{req}$	matrix to calculate requiring scores
$\beta^{\mathcal{G}}_{c_{i},c_{j}}$	correlation score of $c_{i}$ and $c_{j}$ for $\mathcal{G}$
$\textbf{W}^{\mathcal{G}}_{cor}$	matrix to calculate correlation scores for $\mathcal{G}$
$\textbf{H}_{0}$	initial knowledge memory of $u$
$\textbf{H}_{t^{-}}$	knowledge memory of $u$ at a moment before $t$
$\textbf{H}_{t}$	knowledge memory of $u$ at $t$
$\textbf{h}_{c_{i},t}$	knowledge memory of $c_{i}$ of $u$ at $t$
$d_{e},d_{k},d_{h}$	embedding, memory, and hidden dimensions
$\textbf{w}_{h}$	vector to project knowledge memory to mastery
$\hat{m}_{c_{i},t}$	modeled knowledge mastery of $c_{i}$ at $t$
$d_{q_{t}}$	question difficulty of $q_{t}$
$\text{MLP}_{diff}$	MLP to generate question difficulty
$\textbf{W}_{diff}^{(1)},\textbf{W}_{diff}^{(2)}$	weight matrices in $\text{MLP}_{diff}$
$\textbf{b}_{diff}^{(1)},\textbf{b}_{diff}^{(2)}$	bias vectors in $\text{MLP}_{diff}$

Table 5. The continuing notation table for the three stages.

Stage I: knowledge retrieval
$\text{GNN}_{rtv}$	KC relation-based GNN for knowledge retrieval
$\tilde{\textbf{h}}_{c_{i},t^{-}}^{(0)},\tilde{\textbf{H}}_{t^{-}}^{(0)}$	memory input of $c_{i}$ and all to $\text{GNN}_{rtv}$ before $t$
$\tilde{\textbf{h}}_{c_{i},t^{-}}^{(L)},\tilde{\textbf{H}}^{(L)}_{t^{-}}$	memory output of $c_{i}$ and all from $\text{GNN}_{rtv}$ before $t$
$\hat{a}_{t}$	predictive probability of $a_{t}$
Stage II: memory strengthening
$\text{MLP}_{gain}$	MLP to get memory feature for knowledge gain
$\textbf{g}_{c_{t},t}$	memory feature of $c_{t}$ at $t$ for knowledge gain
$\text{GNN}_{gain}$	KC relation-based GNN for knowledge gain
$\tilde{\textbf{g}}_{c_{t},t}^{(0)},\tilde{\textbf{G}}_{t}^{(0)}$	memory feature input of $c_{t}$ and all to $\text{GNN}_{gain}$ at $t$
$\tilde{\textbf{g}}_{c_{t},t}^{(L)},\tilde{\textbf{G}}_{t}^{(L)}$	knowledge gain of $c_{t}$ and all from $\text{GNN}_{gain}$ at $t$
$\text{MLP}_{loss}$	MLP to get memory feature for knowledge loss
$\textbf{l}_{c_{t},t}$	memory feature of $c_{t}$ at $t$ for knowledge loss
$\text{GNN}_{loss}$	KC relation-based GNN for knowledge loss
$\tilde{\textbf{l}}_{c_{t},t}^{(0)},\tilde{\textbf{L}}_{t}^{(0)}$	memory feature input of $c_{t}$ and all to $\text{GNN}_{loss}$ at $t$
$\tilde{\textbf{l}}_{c_{t},t}^{(L)},\tilde{\textbf{L}}_{t}^{(L)}$	knowledge loss of $c_{t}$ and all from $\text{GNN}_{loss}$ at $t$
Stage III: knowledge learning/forgetting
$\text{MLP}_{dsc}$	MLP to get policy distribution for active learning
$\pi_{c_{i},t}$	policy distribution over whether $u$ decides to learn $c_{i}$ at $t$
$\text{MLP}_{prg}$	MLP to get initial knowledge progress
$\textbf{p}_{c_{i},t}$	initial knowledge progress of $c_{i}$ at $t$
$\text{GNN}_{prg}$	KC relation-based GNN for knowledge progress
$\tilde{\textbf{p}}_{c_{i},t}^{(0)},\tilde{\textbf{P}}_{t}^{(0)}$	initial progress input of $c_{i}$ and all to $\text{GNN}_{prg}$ at $t$
$\tilde{\textbf{p}}_{c_{i},t}^{(L)},\tilde{\textbf{P}}_{t}^{(L)}$	knowledge progress of $c_{i}$ and all from $\text{GNN}_{prg}$ at $t$
$\Delta T_{t+1}$	time interval between $T_{t}$ and $T_{t+1}$
$\boldsymbol{\phi}_{c_{i}}$	KC-specific time-aware kernel for learning $c_{i}$
$n_{c_{i},t}$	# of times $u$ has learnt $c_{i}$
$\text{GNN}_{lrn}$	KC relation-based GNN to get parameters of $\boldsymbol{\gamma}_{c_{i}}$
$\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(0)}$	input feature of $c_{i}$ initialized as $\textbf{k}_{c_{i}}$ to $\text{GNN}_{lrn}$
$\tilde{\boldsymbol{\gamma}}_{c_{i}}^{(L)}$	output parameters of $\boldsymbol{\gamma}_{c_{i}}$ for $c_{i}$ from $\text{GNN}_{lrn}$
$\boldsymbol{\kappa}_{c_{i}}$	KC-specific time-aware kernel for forgetting $c_{i}$
$\text{GNN}_{fgt}$	KC relation-based GNN to get parameters of $\boldsymbol{\theta}_{c_{i}}$
$\tilde{\boldsymbol{\theta}}_{c_{i}}^{(0)}$	input feature of $c_{i}$ initialized as $\textbf{k}_{c_{i}}$ to $\text{GNN}_{fgt}$
$\tilde{\boldsymbol{\theta}}_{c_{i}}^{(L)}$	output parameters of $\boldsymbol{\kappa}_{c_{i}}$ for $c_{i}$ from $\text{GNN}_{fgt}$

Appendix A Notation Table

We list and explain the notations used in our methodology in Tables 4 and 5.

Appendix B Method Details

B.1. KC Relation Graph Construction

In the absence of KC relation annotations in the datasets, we construct the KC relation graph based on data statistics. For the similarity between KCs $c_{i}$ and $c_{j}$ , we estimate their similarity score using:

sim_{c_{i},c_{j}}=\frac{\sum_{u\in\mathcal{U}}\sum_{r_{t}^{u},r_{t^{\prime}}^{u}\in\mathcal{H}^{u}}I(a^{u}_{t}=a^{u}_{t^{\prime}},c^{u}_{t}=c_{i},c^{u}_{t^{\prime}}=c_{j})}{\sum_{u\in\mathcal{U}}\sum_{r_{t}^{u},r_{t^{\prime}}^{u}\in\mathcal{H}^{u}}I(c^{u}_{t}=c_{i},c^{u}_{t^{\prime}}=c_{j})},

where $I(\cdot)$ is the indicator function that takes value 1 if the condition is satisfied and 0 otherwise. This approximates the probability that a student's responses to questions of $c_{i}$ and $c_{j}$ agree (both correct or both incorrect), indicating an underlying similarity between the two KCs.

For the prerequisite relationship between $c_{i}$ and $c_{j}$ , we assume that if $c_{i}$ is a prerequisite to $c_{j}$ , then answering questions of $c_{i}$ correctly but those of $c_{j}$ incorrectly is more likely than answering questions of $c_{i}$ incorrectly but those of $c_{j}$ correctly. Therefore, we use:

pre_{c_{i},c_{j}}=\frac{\sum_{u\in\mathcal{U}}\sum_{r_{t}^{u},r_{t^{\prime}}^{u}\in\mathcal{H}^{u}}I(a^{u}_{t}=1,a^{u}_{t^{\prime}}=0,c^{u}_{t}=c_{i},c^{u}_{t^{\prime}}=c_{j})}{\sum_{u\in\mathcal{U}}\sum_{r_{t}^{u},r_{t^{\prime}}^{u}\in\mathcal{H}^{u}}I(a^{u}_{t}\neq a^{u}_{t^{\prime}},c^{u}_{t}=c_{i},c^{u}_{t^{\prime}}=c_{j})},

to approximate the probability that $c_{i}$ is a prerequisite to $c_{j}$ .

Finally, we set a threshold $\eta$ to determine whether $c_{i}$ is similar to $c_{j}$ (when $sim_{c_{i},c_{j}}\geq\eta$ ) or a prerequisite to $c_{j}$ (when $pre_{c_{i},c_{j}}\geq\eta$ ). Additionally, KC pairs that co-occur fewer than 10 times in the dataset are not considered.
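The construction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function name `build_kc_relations`, the input layout (a dict mapping each student to a list of `(kc, correct)` pairs), the exclusion of pairs of a KC with itself, and the reading of "co-occurrence frequency" as the number of ordered response pairs are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import permutations


def build_kc_relations(histories, eta=0.6, min_cooccur=10):
    """Return (similar, prerequisite) sets of ordered KC pairs (c_i, c_j).

    histories: dict student -> list of (kc, correct), correct in {0, 1}.
    The input layout and the co-occurrence count are illustrative assumptions.
    """
    pair_count = defaultdict(int)  # denominator of sim
    same_count = defaultdict(int)  # numerator of sim: equal outcomes
    diff_count = defaultdict(int)  # denominator of pre: differing outcomes
    pre_count = defaultdict(int)   # numerator of pre: c_i correct, c_j incorrect

    for responses in histories.values():
        # All ordered pairs of distinct responses of one student.
        for (ci, ai), (cj, aj) in permutations(responses, 2):
            if ci == cj:
                continue  # assumption: a KC is not related to itself
            key = (ci, cj)
            pair_count[key] += 1
            if ai == aj:
                same_count[key] += 1
            else:
                diff_count[key] += 1
                if ai == 1 and aj == 0:
                    pre_count[key] += 1

    similar, prerequisite = set(), set()
    for key, n in pair_count.items():
        if n < min_cooccur:
            continue  # rarely co-occurring pairs are not considered
        if same_count[key] / n >= eta:
            similar.add(key)
        if diff_count[key] > 0 and pre_count[key] / diff_count[key] >= eta:
            prerequisite.add(key)
    return similar, prerequisite


# Toy usage with a lowered co-occurrence cutoff so the tiny example yields edges.
toy = {
    "u1": [("A", 1), ("B", 1)],
    "u2": [("A", 0), ("B", 0)],
    "u3": [("A", 1), ("C", 0)],
}
similar, prerequisite = build_kc_relations(toy, eta=0.6, min_cooccur=1)
```

In this toy data, A and B always share the same outcome, so they are marked similar in both directions, while A is answered correctly where C is not, so only (A, C) is marked as a prerequisite pair. The pairwise loop is quadratic in the sequence length and is meant for clarity rather than efficiency.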

Table 6. Hyperparameter settings of GRKT for the three datasets.

Parameter	ASSIST09	ASSIST12	Junyi
$lr$	5e-3	5e-3	5e-3
$L$	2	2	2
$d_{k}$	16	16	16
$\eta$	0.6	0.7	0.8
$l_{2}$	1e-6	1e-5	1e-5

Appendix C Supplements for Experiments

C.1. Metrics for Reasonability

We formulate the three metrics for model reasonability in this section:

C.1.1. Consistency

This metric measures the ratio of consistent variation between the mastery of KCs:

(26)

Consistency=\sum_{u\in\mathcal{U}}\sum_{r_{t}^{u}\in\mathcal{H}^{u}}\frac{\sum_{c_{i}\in\mathcal{C}}I(m_{c_{i},t}^{u}\geq m_{c_{i},t+1}^{u})}{\sum_{c_{i}\in\mathcal{C}}I(m_{c^{u}_{t},t}^{u}\geq m_{c^{u}_{t},t+1}^{u})}.

Here, we omit the averaging operation over the students and responses for conciseness. We only consider the situation where a student’s mastery of the learnt KC of the current question declines while other KCs do not increase, instead of the current one increasing and the others declining. This is because the latter case might be due to natural forgetting behaviors.
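One literal reading of this metric can be sketched as follows for a single student. The function name `consistency` and the input layout (a list of per-step mastery vectors plus the index of the practiced KC at each step) are assumptions made for illustration. Steps at which the practiced KC's mastery increases are skipped, since the denominator of the formula vanishes there; for the remaining steps the ratio reduces to the fraction of KCs whose mastery does not increase, which is then averaged.

```python
def consistency(mastery, practiced):
    """Average fraction of KCs whose mastery does not increase, over the steps
    at which the practiced KC's mastery does not increase either.

    mastery:   list of T+1 lists, mastery[t][c] = mastery of KC c at step t
    practiced: list of T KC indices, practiced[t] = KC of the question at step t
    Both layouts are illustrative assumptions.
    """
    ratios = []
    for t, c_t in enumerate(practiced):
        before, after = mastery[t], mastery[t + 1]
        if before[c_t] < after[c_t]:
            continue  # practiced KC increased: step is not considered
        non_increasing = sum(b >= a for b, a in zip(before, after))
        ratios.append(non_increasing / len(before))
    return sum(ratios) / len(ratios) if ratios else float("nan")


# Toy usage: three KCs, two responses, both on KC 0.
toy_mastery = [
    [0.5, 0.5, 0.5],
    [0.4, 0.5, 0.6],  # KC 0 declines, KC 1 is unchanged, KC 2 increases
    [0.6, 0.5, 0.6],  # KC 0 increases, so this step is skipped
]
toy_score = consistency(toy_mastery, practiced=[0, 0])
```

Here only the first step is counted, and two of the three KCs do not increase, so the score is 2/3. Extending the sketch to a dataset would average the per-step ratios over all students.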

C.1.2. GAUCM

This metric calculates the average AUC scores with respect to the mastery of each question’s examined KC:

(27)

GAUCM=\frac{\sum_{q_{i}\in\mathcal{Q}}N(q_{i})\cdot AUC\left[\{\hat{m}^{u}_{c^{u}_{t},t}\},\{a^{u}_{t}\}\right]^{q^{u}_{t}=q_{i}}_{u\in\mathcal{U},r_{t}^{u}\in\mathcal{H}^{u}}}{\sum_{q_{i}\in\mathcal{Q}}N(q_{i})}.

$AUC[\hat{\mathcal{Y}},\mathcal{Y}]_{A}^{B}$ indicates the AUC score of the prediction set $\hat{\mathcal{Y}}$ and the ground-truth set $\mathcal{Y}$ , given the range $A$ and the condition $B$ . $N(q_{i})$ is the number of times $q_{i}$ is answered. For evaluating GRKT, we use the aggregated mastery instead of the single KC’s mastery to calculate AUC, because the transfer of learning theory suggests that students may leverage related KCs to solve questions.
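As a rough sketch of this computation, the snippet below groups responses by question, computes an AUC per question from the modeled mastery, and averages the scores weighted by the number of responses. The names `auc` and `gaucm` and the record layout `(question, mastery, correct)` are hypothetical. Questions whose responses are all correct or all incorrect have no defined AUC; skipping them is an assumption of this sketch, not something stated in the text.

```python
from collections import defaultdict


def auc(scores, labels):
    """Pairwise AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gaucm(records):
    """Response-count-weighted mean of per-question AUC scores.

    records: iterable of (question, mastery, correct); mastery is the modeled
    mastery used for that response (the aggregated mastery in the GRKT case).
    """
    by_question = defaultdict(lambda: ([], []))
    for question, mastery, correct in records:
        by_question[question][0].append(mastery)
        by_question[question][1].append(correct)

    weighted_sum = total = 0.0
    for scores, labels in by_question.values():
        score = auc(scores, labels)
        if score is None:
            continue  # assumption: single-class questions are skipped
        weighted_sum += len(labels) * score
        total += len(labels)
    return weighted_sum / total if total else float("nan")


# Toy usage: q1 is perfectly ranked, q2 is at chance, q3 has one class only.
toy_records = [
    ("q1", 0.9, 1), ("q1", 0.2, 0),
    ("q2", 0.3, 1), ("q2", 0.7, 0), ("q2", 0.8, 1),
    ("q3", 0.5, 1),
]
toy_gaucm = gaucm(toy_records)
```

With two responses at AUC 1.0 and three at AUC 0.5, the weighted average is (2·1.0 + 3·0.5) / 5 = 0.7.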

C.1.3. Repetition

This metric supposes that a reasonable KT method should adhere to the following rule: after a student has finished a question and is given the same question again, the response result (correct or incorrect) should remain the same:

(28)

Repetition=ACC\left[\{\textbf{KT}(q^{u}_{t}|\{r_{t^{\prime}}^{u}|1\leq t^{\prime}\leq t\})\},\{a^{u}_{t}\}\right]_{u\in\mathcal{U},r_{t}^{u}\in\mathcal{H}^{u}}.

$ACC(\cdot)$ denotes the accuracy score, with notation analogous to that of $AUC(\cdot)$ in Equation 27. $\textbf{KT}(q^{u}_{t}|\{r_{t^{\prime}}^{u}|1\leq t^{\prime}\leq t\})$ denotes the predicted probability that $u$ answers $q^{u}_{t}$ correctly given his/her past $t$ responses $\{r_{t^{\prime}}^{u}|1\leq t^{\prime}\leq t\}$ , including the response to $q_{t}^{u}$ itself.
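The evaluation loop behind this metric might look like the sketch below. The callable `kt_predict` is a hypothetical stand-in for $\textbf{KT}(\cdot|\cdot)$ and would be replaced by a trained model's prediction function; the 0.5 decision threshold and the `(question, correct)` history layout are likewise assumptions. The two toy predictors exist only to make the example runnable.

```python
def repetition(histories, kt_predict, threshold=0.5):
    """Accuracy of re-predicting each question after its own response is observed.

    histories:  dict student -> list of (question, correct), correct in {0, 1}
    kt_predict: callable (question, history) -> probability of a correct answer;
                a hypothetical stand-in for a trained KT model
    """
    hits = total = 0
    for responses in histories.values():
        for t, (question, correct) in enumerate(responses):
            seen = responses[: t + 1]  # includes the response to the question itself
            predicted = int(kt_predict(question, seen) >= threshold)
            hits += predicted == correct
            total += 1
    return hits / total if total else float("nan")


def last_answer_model(question, history):
    """Toy predictor: repeat the most recent observed outcome of the question."""
    for past_question, past_correct in reversed(history):
        if past_question == question:
            return float(past_correct)
    return 0.5


def always_correct_model(question, history):
    """Toy predictor that ignores the history entirely."""
    return 1.0


toy_histories = {"u": [("q1", 1), ("q2", 0), ("q1", 0)]}
score_memorizing = repetition(toy_histories, last_answer_model)
score_constant = repetition(toy_histories, always_correct_model)
```

The memorizing predictor reaches a score of 1.0 by construction, whereas the history-agnostic predictor only matches the single correct response out of three.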

C.2. Hyper-parameter Setting

We provide the hyper-parameter settings in Table 6. The notations in the left column denote, from top to bottom, the learning rate, the number of GNN layers, the number of knowledge memory dimensions, the graph construction threshold, and the $l_{2}$ regularization coefficient.

Table 7. Comparison of GRKT applied to the Junyi dataset with labeled KC relations (GRKT-L), statistics-based relations (GRKT-S), and no relations (GRKT-0). The two values in the “sparsity” column respectively denote the constructed KC similarity and prerequisite graphs’ sparsity.

Model	Sparsity	AUC	ACC	CONS	GAUC	RPT
GRKT-S	0.171, 0.169	0.8207	0.8624	1.0000	0.6473	0.8957
GRKT-L	0.006, 0.003	0.8108	0.8562	1.0000	0.6423	0.8861
GRKT-0	0.000, 0.000	0.7921	0.8481	1.0000	0.6084	0.8781

C.3. Experimental Results Using Labeled Graph Relations

The Junyi dataset includes KC similarity and prerequisite relations annotated by experts with confidence scores ranging from 1 to 9. We select relations with average scores higher than 5 as graph edges. Table 7 presents the experimental results of GRKT leveraging expert-labeled relations compared with statistics-based relations and no relations. As shown, the graphs built on expert annotations are too sparse, with an average of only 1-2 related KCs per KC, which may not reflect real scenarios. Although the results based on expert-labeled relations are inferior to those of the statistics-based version, they still show a noticeable improvement over the version without any relations.

C.4. Hyper-parameter Analysis

We conduct experiments to analyze the effects of various hyperparameters on GRKT’s performance. The experiments are performed on the two ASSIST datasets, as shown in Figure 5. The results show that setting the number of layers in the KC relation-based GNNs to 2 achieves the best performance for GRKT, suggesting that retrieving information from more distant KCs over the graph can enhance the model. However, employing more layers may lead to overfitting. For the KC graph construction threshold, the performance peaks at around 0.6 to 0.8. In this interval, the sparsity of the two graphs ranges from 0.01 to 0.3, indicating that too many relations lead to structural redundancy, while too few result in limited information sharing between KCs.