
Improving Students’ Academic Performance with AI and Semantic Technologies

arXiv:2206.03213v1 [cs.CY] 2 May 2022

Yixin Cheng

COMP 8755 Individual Computing Project


Supervised by: Dr. Bernardo Pereira Nunes
The Australian National University

November 2021
© Yixin Cheng 2021
Except where otherwise indicated, this report is my own original work.

Yixin Cheng
19 November 2021
Acknowledgments

Firstly, I must express my sincere appreciation to my supervisor, Dr. Bernardo Pereira


Nunes, for his heuristic guidance, perceptive comments, and personal support. It
is an honor as well as a stroke of luck to be his student. His enthusiasm and meticulousness
towards academics motivate me along the way. He has monitored my progress and
offered advice and encouragement throughout. Without his constant help, I would
not have completed this project on schedule.

Thanks also to Felipe Cueto-Ramirez, for providing support and advice for my project.

Special thanks to Jinxiao Xie, who has always encouraged me to challenge myself and
made time to help and support me.

Last but not least, I owe a debt of gratitude to my parents and sisters, who provided me with the opportunity to study abroad and gave me endless love.

Abstract

Artificial intelligence and semantic technologies are evolving and have been applied in various research areas, including the education domain. Higher Education Institutions (HEIs) strive to improve students’ academic performance. Early intervention for at-risk students and a well-designed curriculum are vital for students’ success. Prior research opted for deploying traditional machine learning models to predict students’ performance; however, existing research on applying deep learning models to this prediction task is very limited. In terms of curriculum semantic analysis, a comprehensive systematic review of the use of semantic technologies in the context of the Computer Science curriculum found that the technologies used to measure similarity have limitations in terms of accuracy and ambiguity in the representation of concepts, courses, etc. To fill these gaps, three implementations were developed in this study: predicting students’ performance using marks from the previous semester, modelling a course representation in a semantic way and computing the similarity between courses, and identifying the prerequisite relation between two similar courses. Regarding performance prediction, we used a combination of a Genetic Algorithm and Long Short-Term Memory (LSTM) on a dataset from a Brazilian university containing 248,730 records. For similarity measurement between courses, we deployed Bidirectional Encoder Representations from Transformers (BERT), proposed by Devlin et al. [Devlin et al., 2018], to encode the sentences in the course descriptions from the Australian National University (ANU). We then used cosine similarity to obtain the distance between courses. With respect to prerequisite identification, TextRazor was applied to extract concepts from the course descriptions, followed by SemRefD, presented by Manrique et al. [Manrique et al., 2019a], to measure the degree of prerequisite dependency between two concepts. The outcomes of this study can be summarized as: (i) an improvement of 2.5% in accuracy over Manrique’s work [Manrique et al., 2019b] in dropout prediction; (ii) uncovering the similarity between courses based on their course descriptions; and (iii) identifying the prerequisite relations among three compulsory courses of the School of Computing at ANU. In the future, these technologies could potentially be used to identify at-risk students, analyze the curricula of university programs, aid student advisors, and create recommendation systems.

Keywords: Dropout Prediction, Curriculum Semantic Analysis, Similarity Measurement, Prerequisite Identification, Genetic Algorithm, Long Short-Term Memory, Bidirectional Encoder Representations from Transformers, SemRefD

Contents

Acknowledgments iii

Abstract v

1 Introduction 1
1.1 Problem Statement and Motivations . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Report Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background and Related Work 3


2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Tools/Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.4 Languages, Classes and Vocabulary . . . . . . . . . . . . . . . . . 7
2.2.4.1 Curriculum Design Analysis and Approaches . . . . . . 8

3 Approaches 11
3.1 Dropout Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Procedure Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.4 Training and Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Curriculum Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Procedure Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Similarity Measurement . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.3 Prerequisite Identification . . . . . . . . . . . . . . . . . . . . . . . 24

4 Results and Discussion 27


4.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Dropout Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Experimental Results and Discussion . . . . . . . . . . . . . 28
4.2.2.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.2.2 Dropout Prediction . . . . . . . . . . . . . . . . . . . . . 29

vii
viii Contents

4.3 Curriculum Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . 31


4.3.1 Similarity Measurement . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Prerequisite Identification . . . . . . . . . . . . . . . . . . . . . . . 33

5 Conclusion and Future Work 35

Bibliography 37

Appendix 1 43

Appendix 2 45

Appendix 3 49

Appendix 4 51
List of Figures

2.1 Paper Selection Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1 The workflow of dropout prediction . . . . . . . . . . . . . . . . . . . . . 12


3.2 Dataset snippet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Distribution by year . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Student distribution by degree . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Dropout distribution by year . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.6 Procedure of GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.7 Uniform Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.8 The workflow of GA+SVM feature selection . . . . . . . . . . . . . . . . 19
3.9 The internal architecture of LSTM cell [Sanjeevi, 2018] . . . . . . . . . . 20
3.10 The workflow of LSTM + FCs . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.11 The brief workflow of the Similarity Measurement . . . . . . . . . . . . . 23
3.12 Prerequisite Identification workflow . . . . . . . . . . . . . . . . . . . . . 23
3.13 BERT Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.14 The prerequisite of Concept A (cA) and Concept B (cB) . . . . . . . . . . 26

4.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


4.2 Accuracy of LSTM+FC on ADM . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Loss of LSTM+FC on ADM . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Accuracy of LSTM+FC on ARQ . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5 Loss of LSTM+FC on ARQ . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.6 Accuracy of LSTM+FC on CSI . . . . . . . . . . . . . . . . . . . . . . . . 30
4.7 Loss of LSTM+FC on CSI . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 All level courses similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.9 1000-level courses similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.10 2000-level courses similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.11 3000-level courses similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.12 4000-level courses similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.13 All level courses similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.1 Contract page 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


5.2 Contract page 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.3 README page 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


5.4 README page 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5 README page 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54


5.6 README page 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


List of Tables

3.1 Data wrangling techniques used in pre-processing. . . . . . . . . . . . . 14


3.2 Genetic Algorithm techniques, descriptions, goals, and specific method
used in this study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 The components of a LSTM cell . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 27


4.2 Results of ADM, ARQ, and CSI by SVM . . . . . . . . . . . . . . . . . . . 29
4.3 Results of ADM, ARQ, and CSI . . . . . . . . . . . . . . . . . . . . . . . 31

Chapter 1

Introduction

1.1 Problem Statement and Motivations

Students’ academic performance is important not only for themselves but also for higher education institutions (HEIs), which use it as a basis for measuring the success of their educational programs. A variety of metrics can be used to measure student performance, and dropout rates are notable among them; in other words, a reduction in dropout rates could indicate an improvement in students’ academic performance. However, according to published figures, each year one-fifth of first-year undergraduates across Australia drop out of their degrees [Shipley and Ian, 2019]. In Brazil, it is estimated that only 62.4% of university enrolments succeed in obtaining an undergraduate degree [Sales et al., 2016]. These are concerning statistics for a country’s development and can affect students’ lives financially, mentally, and professionally. As a consequence, HEIs and researchers have shown an increasing interest in prediction systems to identify students at risk of dropping out [Manrique et al., 2019a]. For example, Manrique et al. created multiple feature sets from student data to predict whether a student would drop out using machine learning algorithms [Manrique et al., 2019a]; also, Mujica et al. pointed out that path analysis was useful for predicting student dropouts based on affective and cognitive variables [Diaz-Mujica et al., 2019].
According to Vergel et al., curriculum design may affect students’ performance and retention. A careful design is required to "understand the role of curriculum design in dropout rates and unmask how the political dynamics embedded in the curriculum influence students’ retention" [Vergel et al., 2018]. Moreover, Drennan, Rohde, Johnson, and Kuennen claimed that students’ academic performance in a course is related to their performance in its prerequisite courses [Drennan and Rohde, 2002], [Johnson and Kuennen, 2006]. It can be seen that the curriculum plays a crucial role in student performance as well as in decreasing dropout rates.
The main motivation for this research is to improve students’ academic performance based on the analysis of the curriculum and of student academic performance in different courses. Identifying the prerequisites for a course is key to enhancing student experience, improving curriculum design, and increasing graduation rates.


1.2 Objectives
The main objective of this study is to apply Artificial Intelligence and semantic technologies to analyse curricula and enable changes that may improve student academic performance.
The goal of this research project is threefold:

• Predict students’ performance in university courses using grades from previous semesters;

• Model a semantic course representation and calculate the similarity among courses; and, finally,

• Identify the sequence between two similar courses.

To accomplish these goals, a systematic review was carried out to identify the technologies being used in Computer Science curricula.

1.3 Contributions
The main contributions of this thesis can be divided into:

• An approach for student dropout prediction using a Genetic Algorithm (GA) and Long Short-Term Memory (LSTM);

• A comprehensive systematic review of how Semantic Web and Natural Language Processing technologies have been applied to curriculum analysis and design in Computer Science; and, finally,

• An analysis of a Computer Science curriculum using the SemRefD distance proposed by Manrique et al. [Manrique et al., 2019a] and BERT proposed by Devlin et al. [Devlin et al., 2018].

1.4 Report Outline


Chapter 2 presents background information on dropout prediction and text analysis. Chapter 3 covers the methodology used in this work, including a systematic review and general techniques. Chapter 4 presents the experiments, results, and discussion, and, finally, Chapter 5 concludes this report with future work directions.
Chapter 2

Background and Related Work

This chapter is divided into two sections: background and related work. The following section briefly introduces the basic concepts utilized in this research. In the related work section, we present a systematic review of curriculum analysis using Artificial Intelligence and Semantic Web technologies.

2.1 Background
Artificial Intelligence refers to the development of machines or computers that simulate or emulate functions of the human brain. The functions studied differ according to the area of study. Prolog, for example, is a logic programming language that applies mathematics to create systems that can derive relevant conclusions from a set of statements. Intelligent Agents are another example: unlike traditional agents, Intelligent Agents are designed to take actions optimized towards a specific goal, making decisions based on their perception of the environment and internal rules. This study explores four growing fields: Machine Learning, Deep Learning, Natural Language Processing, and the Semantic Web.
Machine Learning is a field of study that uses algorithms to detect patterns in data and to predict or make useful decisions. Machine learning algorithms can be applied in innumerable scenarios and are optimized for specific datasets; a variety of algorithms is used to build models suited to different use cases. Using machine learning algorithms, we predict students’ performance in university courses based on their performance in previous courses.
Deep Learning refers to neural networks with multiple layers of perceptrons. Neural networks try to imitate human brain activity, albeit far from matching its capabilities, which enables them to learn from high volumes of data. Additional hidden layers help refine and optimize a neural network for accuracy. Recurrent Neural Networks (RNNs) are a special type of neural network that combines information from previous time steps to generate updated outputs.
Natural Language Processing (NLP) is a field concerned with enabling computers to analyze, interpret, and generate human language. NLP techniques link the structure of language (words, syntax, and semantics) to computational representations, allowing systems to extract meaning from text and apply it to particular tasks.
The Semantic Web extends the World Wide Web through W3C (World Wide Web Consortium) standards, which standardize the process of making information on the Web machine-readable. To accomplish this goal, a series of communication standards was created that enables developers to describe concepts and entities, as well as their classification and relationships. The Resource Description Framework (RDF) and the Web Ontology Language (OWL) have enabled developers to create systems that store and use complex knowledge bases known as knowledge graphs.

2.2 Related Work


This section discusses tools, frameworks, datasets, semantic technologies, and ap-
proaches used for curriculum design in Computer Science. Note that we present the
methodology used to carry out a systematic review, but for brevity and adequacy we
only present the most relevant works and sections. The complete systematic review
will be submitted to a conference in Computer Science in Education as an outcome
of this thesis.

2.2.1 Methodology
This systematic review was conducted following the methodology defined by Kitchen-
ham and Charters [BA and Charters, 2007]. The method is composed of three main
steps: planning, conducting and reporting.
The planning stage helps to identify existing research in the area of interest as
well as build the research question. Our research question was created based on
the PICOC method [BA and Charters, 2007] and used to identify keywords and
corresponding synonyms to build the search string to find relevant related works.
The resulting search string is given below.
("computer science" OR "computer engineering" OR "informatics") AND ("curriculum"
OR "course description" OR "learning outcomes" OR "curricula" OR “learning objects”)
AND ("semantic" OR "ontology" OR "linked data" OR "linked open data")
To include a paper in this systematic review, we defined the following inclusion
criteria: (1) papers must have been written in English; (2) papers must be exclusively
related to semantic technologies, computer science and curriculum; (3) papers must
have 4 or more pages (i.e., full research papers); (4) papers must be accessible online;
and, finally, (5) papers must be scientifically sound, present a clear methodology and
conduct a proper evaluation for the proposed method, tool or model.
Figure 2.1 shows the paper selection process in detail. Initially, 4,510 papers were retrieved in total. We used the ACM Digital Library1, IEEE Xplore Digital Library2, Springer3, Scopus4, ScienceDirect5, and Web of Science6 as digital libraries. The Springer digital library returned a total of 3,549 studies. The large number of papers returned by the Springer search mechanism led us to develop a simple crawling tool to help in the process of including/rejecting papers in our systematic review. All the information returned by Springer was collected and stored in a relational database. After that, we were able to query the Springer database and select the relevant papers for this systematic review.

Figure 2.1: Paper Selection Process

1 https://dl.acm.org/
2 https://ieeexplore.ieee.org/
We also applied the forward and backward snowballing method to this systematic
review to identify relevant papers that were not retrieved by our query string. The
backward snowballing method was used to collect relevant papers from the references
of the final list of papers in Phase 1, whereas the forward snowballing method was
used to collect the papers that cited these papers [Wohlin, 2014]. Google Scholar7
was used in the forward snowballing method. In total, 37 studies were identified as
relevant; most of the studies were published in the last few years, which shows the
increasing relevancy of this topic.
In the following sections we present the most relevant works that inspired the
proposed approaches reported in this thesis.
3 https://www.springer.com/
4 https://www.scopus.com/
5 https://www.sciencedirect.com/
6 https://www.webofscience.com/
7 https://scholar.google.com/

2.2.2 Tools/Frameworks
Protégé8 [Imran and Young, 2016] is a popular open-source editor and framework for
ontology construction, which several researchers have adopted to design curricula
in Computer Science [Tang and Hoh, 2013], [Wang et al., 2019], [Nuntawong et al.,
2017], [Karunananda et al., 2012], [Hedayati and Mart, 2016], [Saquicela et al., 2018],
[Maffei et al., 2016], and [Vaquero et al., 2009]. Despite its wide adoption, Protégé still
presents limitations in terms of the manipulation of ontological knowledge [Tang and
Hoh, 2013]. As an attempt to overcome this shortcoming, Asoke et al. [Karunananda
et al., 2012] developed a curriculum design plug-in called OntoCD. OntoCD allows
curriculum designers to customise curricula by loading a skeleton curriculum and a
benchmark domain ontology. The evaluation of OntoCD is performed by developing
a Computer Science degree program using benchmark domain ontologies developed
in accordance with the guidelines provided by IEEE and ACM. Moreover, Adelina
and Jason [Tang and Hoh, 2013] used Protégé to develop SUCO (Sunway University
Computing Ontology), an ontology-specific Application Programming Interface (API)
for a curricula management system. They claim that, in response to the shortcomings of the Protégé platform, SUCO shows a higher ability to manipulate and extract knowledge, and functions effectively if the ontology is processed as an eXtensible Markup Language (XML) document.
Other ontology-based tools for curriculum management have also been developed. CDIO [Liang and Ma, 2012] is an example of such a tool: it was created to automatically adapt a given curriculum according to teaching objectives and teaching content based on a purpose-built ontology. A similar approach was used by Maffei et al. [Maffei et al., 2016] to model the semantics behind functional alignments in order to design, synthesize, and evaluate functional alignment activities. In Mandić’s study, the author presented a software platform9 for comparing chosen curricula for information technology teachers [Mandić, 2018]. In Hedayati’s work, the authors used the curriculum Capability Maturity Model (CMM), a taxonomical model for describing an organization’s level of capability in the domain of software engineering [Paulk et al., 1993], as a reference model for analyzing the development process of the vocational ICT curriculum in the context of the culturally sensitive curriculum in Afghanistan [Hedayati and Mart, 2016].

2.2.3 Datasets
There are several datasets used in curriculum design studies in Computer Science.
The open-access CS201310 dataset is the result of the joint development of a com-
puting curriculum sponsored by the ACM and IEEE Computer Society [Piedra and
Caro, 2018]. The CS2013 dataset has been used in several studies [Piedra and Caro,
2018], [Aeiad and Meziane, 2016], [Nuntawong et al., 2016], [Nuntawong et al., 2017],
8 https://protege.stanford.edu/
9 http://www.pef.uns.ac.rs/InformaticsTeacherEducationCurriculum
10 https://cs2013.org/

[Karunananda et al., 2012], [Hedayati and Mart, 2016], and [Fiallos, 2018] to develop
ontologies or as a benchmark curriculum in similarity comparison between computer
science curricula.
Similar to CS2013, The Thailand Qualification Framework for Higher Education
(TQF: HEd) was developed by the Office of the Thailand Higher Education Commis-
sion to be used by all higher education institutions (HEIs) in Thailand as a framework
to enhance the quality of course curricula, including the Computer Science curricu-
lum. TQF: HEd was used for the guidelines in terms of ontology development in
the following studies [Nuntawong et al., 2017], [Nuntawong et al., 2016], [Hao et al.,
2008], and [Nuntawong et al., 2015].
Other studies use self-created datasets (e.g., [Wang et al., 2019], [Maffei et al., 2016], [Hedayati and Mart, 2016], and [Fiallos, 2018]). Specifically, in Wang’s work, the Ontology System for the Computer Course Architecture (OSCCA) was proposed based on a dataset created from the course catalogs of top universities in China as well as online education websites [Wang et al., 2019]. In Maffei’s study, the authors experimented with and evaluated their proposal using the Engineering program at the KTH Royal Institute of Technology in Stockholm, Sweden [Maffei et al., 2016]. In Guberovic et al.’s work, courses from the Faculty of Electrical Engineering and Computing were compared to those of universities across the United States of America [Guberovic et al., 2018]. In Fiallos’s study, the author not only adopted CS2013 for domain ontology modeling, but also collected the core courses of the Computational Sciences program at the Escuela Superior Politécnica del Litoral (ESPOL11) for semantic similarity comparison [Fiallos, 2018].

2.2.4 Languages, Classes and Vocabulary


RDF is used as the design standard for data interchange in the following studies
[Piedra and Caro, 2018], [Nuntawong et al., 2017], and [Saquicela et al., 2018]. In
particular, Saquicela et al. [Saquicela et al., 2018] generated curriculum data in the
RDF format, creating and storing data in a repository when the ontological model
has been defined and created.
OWL is an extension of RDF that adds additional vocabulary and semantics
to the classic framework [McGuinness and van Harmelen, 2004]. OWL is used in
many studies [Piedra and Caro, 2018], [Barb and Kilicay-Ergin, 2020], [Mandić, 2018],
[Wang et al., 2019], [Maffei et al., 2016], and [Vaquero et al., 2009] for representing and
sharing knowledge on the Web12 . Apart from OWL, only two studies used XML13
due to implementation requirements of research ([Tang and Hoh, 2013] and [Hao
et al., 2008]).
A Body of Knowledge (BoK), commonly modeled as an OWL class, is a complete set of concepts, terms, and activities that can represent the accepted ontology for a professional domain [Piedra and Caro, 2018]. The BoK has become a common development method in many studies [Piedra and Caro, 2018], [Nuntawong et al., 2017], [Nuntawong et al., 2016], [Hao et al., 2008], [Karunananda et al., 2012], [Tapia-Leon et al., 2018], [Chung and Kim, 2014], and [Nuntawong et al., 2015]. In Piedra’s study [Piedra and Caro, 2018], the BoK defined in the CS2013 ontology was viewed as a description of the content to be covered, with a curriculum implementing this information. Similarly, Nuntawong et al. [Nuntawong et al., 2017] applied the BoK, which is based on the ontology of TQF: HEd, to conduct the ontology mapping.
11 https://www.espol.edu.ec/
12 https://www.w3.org/OWL/
13 https://www.w3.org/standards/xml/
The Library of Congress Subject Headings14 (LCSH) is a controlled vocabulary maintained by the Library of Congress. The LCSH terminology is a BoK containing more than 240,000 topical subject headings, offering equivalence, hierarchical, and associative types of relationships between headings. In Adrian’s study, the authors create an ontology based on LCSH and the Faceted Application of Subject Terminology15 (a completely enumerative faceted subject terminology schema originating from LCSH) to assess the consistency of an academic curriculum, and apply it to an Information Science curriculum [Barb and Kilicay-Ergin, 2020].
A Knowledge Area (KA), also modeled as an OWL class, is an area of specialization such as Operating Systems or Algorithms. The relationship between BoK and KA is built in various ways across the studies. For example, in Piedra’s study [Piedra and Caro, 2018], each BoK class contains a set of KAs; in contrast, Nuntawong et al. [Nuntawong et al., 2017] considered KA the superclass of BoK. KA classification was proposed in Orellana et al.’s study [Orellana et al., 2018]. In that paper, curricula are classified into the KAs defined by UNESCO16, which comprise 9 main areas and 24 subareas of knowledge. To do this, they convert the curricula to a vector space and then process them using traditional supervised approaches such as support vector machines and k-nearest neighbors [Orellana et al., 2018]. By classifying the KAs, similarity measurement can be applied more easily.

2.2.4.1 Curriculum Design Analysis and Approaches

One development approach found in the literature is extraction and interrelationship analysis: NLTK17 is deployed to segment raw data into terms, various algorithms are used to extract and analyse the interrelationships of items, and an ontology is then constructed using a framework such as Protégé, as in [Piedra and Caro, 2018], [Wang et al., 2019], and [Tapia-Leon et al., 2018].
Text Mining, also known as text analysis, has been used to find interesting patterns in HEIs’ curricula [Orellana et al., 2018]. Using Text Mining approaches, keywords can be extracted from both retrieved documents and course materials for further comparison and analysis [Kawintiranon et al., 2016]. In the next section, the application of Text Mining approaches to the Computer Science curriculum is elaborated.
14 https://www.loc.gov/aba/cataloging/subject/
15 https://www.oclc.org/en/fast.html
16 https://whc.unesco.org/en/
17 https://www.nltk.org/

In the context of curriculum similarity measurement, Gomaa and Fahmy [H.Gomaa and A. Fahmy, 2013] define string-based, corpus-based, knowledge-based, and hybrid similarity measures. String-based measures compute the distance between strings (words), compared by characters or terms. Corpus-based approaches measure the semantic meaning of terms and phrases as provided in a corpus. Knowledge-based approaches use synset-formed word networks such as WordNet to compare cognitive meaning. Corpus-based similarity was used in [Orellana et al., 2018], [Pawar and Mago, 2018], [Aeiad and Meziane, 2016], and [Fiallos, 2018], and a knowledge-based similarity approach is proposed in [Nuntawong et al., 2015].
String-based similarity between terms was measured in many studies [Orellana et al., 2018], [Pawar and Mago, 2018], [Barb and Kilicay-Ergin, 2020], [Seidel et al., 2020], [Wang et al., 2019], and [Saquicela et al., 2018]. Orellana et al. [Orellana et al., 2018] used cosine similarity between terms to quantify the similarity between two course descriptions; Adrian and Wang used the same approach [Barb and Kilicay-Ergin, 2020], [Wang et al., 2019].
Pawar and Mago utilized Bloom’s taxonomy to measure the similarity of sentence pairs in learning outcomes [Pawar and Mago, 2018]. In another paper, Saquicela et al. used K-means, an unsupervised clustering algorithm, to calculate the similarity among course contents [Saquicela et al., 2018].
Corpus-based similarity measurement was proposed in [Orellana et al., 2018], [Pawar and Mago, 2018], [Barb and Kilicay-Ergin, 2020], and [Fiallos, 2018]. In Orellana et al.’s study, topics are extracted and processed with Latent Semantic Analysis (LSA). Through this process, terms and documents are located within their context, and the most relevant documents (Wikipedia articles and curricula) are retrieved by applying a similarity threshold.
Knowledge-based (semantic) similarity measurement was proposed in [Nuntawong et al., 2016] and [Pawar and Mago, 2018]. In Nuntawong’s paper [Nuntawong et al., 2017], the authors designed a curriculum ontology and defined ontology mapping rules through semantic relationships that can be established between curricula. After completing these steps, an ontological system was built by converting the input curriculum data to an ontology; BoKs within the KAs of TQF: HEd and course descriptions were retrieved and compared against WordNet; and, finally, semantic similarity values were calculated using the extended Wu & Palmer algorithm [Wu and Palmer, 1994]. To calculate the semantic similarity between words, Pawar and Mago used synsets from WordNet. Their method simulates supervised learning through the use of corpora. Additionally, they used the NLTK-implemented max-similarity algorithm to determine the sense of the words [Pawar and Mago, 2018].
One important component of curricula is the Learning Outcomes (LOs), which define what students are expected to learn by taking a course. Bloom’s Taxonomy has a hierarchical structure of six layers (Remembering, Understanding, Applying, Analysing, Evaluating, and Creating) [Lasley, 2013]. Semantic technologies are used along with Bloom’s Taxonomy to calculate the similarity of learning outcomes between courses. Pawar and Mago [Pawar and Mago, 2018] propose a semantic similarity metric using WordNet18 to generate a score comparing LOs based on Bloom’s taxonomy. Similarly, Mandić [Mandić, 2018] proposed a taxonomic structural similarity between curricula by applying the revised Bloom’s taxonomy, which adjusts the cognitive levels of learning.
This section partially presented some of the works found in the literature that use semantic technologies to help university stakeholders design Computer Science curricula. The following chapters present the approaches and analyses carried out using these works as inspiration.

18 https://wordnet.princeton.edu/
Chapter 3

Approaches

This chapter contains two sections. The first section introduces the dropout prediction
task and a series of techniques to predict students’ performance. The second part
presents semantic techniques that can be used to analyze the relationship between
Computer Science courses.

3.1 Dropout Prediction

This section presents an approach to predicting dropout. We illustrate the entire dropout prediction workflow, followed by a description of the dataset used and the corresponding pre-processing steps. We then present an SVM-based Genetic Algorithm (GA) for feature selection and our LSTM approach for dropout prediction.

3.1.1 Procedure Overview

As illustrated in Figure 3.1, the entire dropout prediction workflow is composed of four steps; the details of each step are explained in the following sections.
Briefly, the first step is responsible for data pre-processing, including obtaining the datasets, and uses data wrangling and machine learning (ML) techniques. The second step is responsible for feature selection and uses the pre-processed data output by the first step. Steps 3 and 4 are merged together and consist of training and testing the Long Short-Term Memory (LSTM) and Fully Connected (FC) neural network.

3.1.2 Data Pre-processing

Before introducing the data pre-processing, we introduce the data set used for this
experiment. Figure 3.2 presents a few instances of the dataset used.


Figure 3.1: The workflow of dropout prediction

Figure 3.2: Dataset snippet



(a) Student distribution by year (b) Record distribution by year

Figure 3.3: Distribution by year

Figure 3.4: Student distribution by degree Figure 3.5: Dropout distribution by year

The dataset used in this thesis has been used in previous research Manrique et al.
[2019b] and is provided by a Brazilian university. It contains 248,730 academic records
of 5,582 students enrolled in six distinct degrees from 2001 to 2009. The dataset is in
Portuguese and we will translate the main terms for better understanding.
The dataset contains 32 attributes in total: cod_curso (course code); nome_curso (course name); cod_hab (degree code); nome_hab (degree name); cod_enfase (emphasis code); nome_enfase (emphasis name); ano_curriculo (curriculum year); cod_curriculo (curriculum code); matricula (student identifier); mat_ano (student enrolment year); mat_sem (semester of the enrolment); periodo (term); ano (year); semestre (semester); grupos (group); disciplina (discipline/course); semestre_recomendado (recommended semester); semestre_do_aluno (student semester); no_creditos (number of credits); turma (class); grau (grades); sit_final (final status (pass/fail)); sit_vinculo_atual (current status); nome_professor (professor name); cep (zip code); pontos_enem (national exam marks); diff (difference between semesters and student performance); tentativas (number of attempts in a course); cant (previous course); count (count); identificador (identifier); nome_disciplina (course name).

Table 3.1: Data wrangling techniques used in pre-processing.

Technique Operation

Data Cleaning outliers/duplicates correction
Data Validation inappropriate data identification
Data Enrichment data enhancement
Data Imputation missing values filling
Data Normalisation data re-scaling


Most of the data relates to the status of a student in a given course and degree. The "semestre" attribute is the semester in which a student took a course and has previously been used to create a time series per student Manrique et al. [2019a]. The attribute "sit_vinculo_atual" has 12 enrolment statuses, three of which ("DESLIGADO", "MATRICULA EM ABANDONO", "JUBILADO") represent a dropout. Grades are scaled from 0 to 10 inclusive, where 10 is the maximum grade. The dataset is anonymous and the identifiers do not allow re-identification; students’ ids are encoded with dummy alphanumeric codes such as "aluno1010" ("student1010"). The dataset was profiled by year and degree and is presented in Figures 3.3 to 3.5.
To pre-process the data, we first split and reorganised the dataset into three parts: the CSI ("Information Systems"), ADM ("Management"), and ARQ ("Architecture") datasets. After that, we removed duplicate data, resolved data inconsistencies, and removed outliers. For example, in the attribute "grades", a few marks were greater than 10, an invalid value given the university marking scheme, and were therefore removed. Attributes such as "grupos" were also removed, as it was irrelevant and essentially a copy of another attribute ("disciplina"). Table 3.1 presents all the information on the data pre-processing step.
Another technique used was data validation, whereby inappropriate data is identified and removed. As part of the pre-processing step, we also converted categorical to numerical data using LabelEncoder, a class from the Scikit-learn library1.
1 https://scikit-learn.org/
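A minimal sketch of this encoding step is shown below, assuming a pandas DataFrame with the attribute names described earlier; the example values are hypothetical.

# A minimal sketch of encoding categorical attributes with scikit-learn's
# LabelEncoder; the course codes and status values below are hypothetical.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"disciplina": ["INF1010", "MAT1205", "INF1010"],
                   "sit_final": ["AP", "RM", "AP"]})
for col in ["disciplina", "sit_final"]:
    df[col] = LabelEncoder().fit_transform(df[col])  # strings -> integer codes
print(df)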
For the dropout status attribute ("sit_vinculo_atual"), the instances were replaced by 0 (dropout) and 1 (enrolled). With respect to data imputation, Random Forest (RF), a popular machine learning algorithm that constructs multiple decision trees during training, was employed. RF performs better at imputation than other algorithms regardless of whether the data is normal or non-normal, linear or non-linear [Pantanowitz and Marwala, 2009], [Shah et al., 2014], and [Hong and Lynn, 2020]. To be specific, imputation proceeds from the column with the fewest missing values to those with more: the missing values in the other columns are temporarily filled with 0, the RF is run, and iterations are performed until every missing value has been handled. The procedure of imputation is shown in Algorithm 1. The data is then grouped by the "course" and "semester" attributes.
The final step normalizes all the input data except "sit_vinculo_atual" and "semestre" with the z-score method, as the z-score can improve model performance more than other techniques [Imron and Prasetyo, 2020], and [Cheadle et al., 2003]. The z-score is computed from a set of values S = {x1, x2, ..., xn}, the mean of the set (µ), and the standard deviation (σ), as follows:

z = (x − µ) / σ (3.1)
The last pre-processing step applied to the dataset is data enrichment. As the dataset profile in Figures 3.3 and 3.4 suggests, the number of instances in the CSI dataset is very small and not sufficient for training/testing purposes. For this, we used the Synthetic Minority Oversampling Technique (SMOTE) to increase the number of instances of the CSI dataset. Briefly, SMOTE generates new synthetic instances based on real ones, targeting the minority class to balance the dataset.
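The normalisation and enrichment steps can be sketched as follows, with X and y standing in for the pre-processed CSI features and labels; using SMOTE from the imbalanced-learn package is an assumption about the concrete implementation.

# A sketch of z-score normalisation (formula 3.1) followed by SMOTE
# oversampling; the synthetic X, y stand in for the real CSI data.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=27, weights=[0.8],
                           random_state=42)
X = (X - X.mean(axis=0)) / X.std(axis=0)           # z-score each feature
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)  # balance classes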

Algorithm 1 Procedure of Imputation

Input: Dataset with missing values
Output: Complete dataset with predicted values
1: procedure
2:   Determine the order of imputation columns by their number of missing values
3:   do
4:     Fill NULLs with 0 in all columns except the one with the fewest missing values
5:     Train RF to predict the missing values
6:   while number of missing values > 0    ▷ perform the next iteration
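A Python sketch of Algorithm 1 could look as follows; it performs a single ordered pass over the columns (the thesis iterates until no missing values remain), and the hyper-parameters are illustrative rather than the exact settings used.

# A sketch of the iterative Random Forest imputation, assuming a numeric
# DataFrame with NaNs; a single ordered pass is shown for brevity.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_impute(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # impute columns in order, from fewest to most missing values
    for col in df.isna().sum().sort_values().index:
        mask = df[col].isna()
        if not mask.any():
            continue
        features = df.drop(columns=col).fillna(0)   # temporary zero-fill
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(features[~mask], df.loc[~mask, col])
        df.loc[mask, col] = model.predict(features[mask])
    return df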

3.1.3 Feature Selection


After the data is pre-processed, to reduce the time consumption and increase the
computational efficiency in training, it is necessary to apply feature selection to re-
move irrelevant features. Consequently, Support Vector Machine(SVM)-based Genetic
Algorithm (GA) will be introduced in this section.
In GA, optimization strategies are implemented based on simulation of evolution
by natural selection for a species. Table /reftab:ga lists popular GA operators for creat-
ing optimal solutions and solving search problems based on biologically inspired prin-
ciples such as mutation, crossover, and selection. Moreover, as its nature mentioned
previsouly, the application of GA for feature selection is popular and was proven
to improve the model performance in [Babatunde et al., 2014], [Huang et al., 2007],
and [Leardi, 2000]. The whole procedure is displayed in Figure 3.6. In this thesis, to
perform a GA-based feature selection, a set of techniques was deployed. First, a popu-
lation was randomly generated as the initial solutions whose size is 1,000, followed bi-

binary encoding, where 1 indicates that a feature is selected and 0 that it is not. The next step is to calculate the fitness of each individual in the population by running an SVM on the three datasets (ADM, CSI, and ARQ). SVM is a supervised linear machine learning technique most commonly employed for classification, and it performs well for fitness calculation in feature selection [Tao et al., 2019]. It is defined as a linear classifier with maximum margin on the feature space (when a linear kernel is used), which is essentially a margin maximization strategy that results in a convex quadratic programming problem. Given a training dataset D = {([x11, x12, ..., x1n], y1), ([x21, x22, ..., x2n], y2), ..., ([xm1, xm2, ..., xmn], ym)}, ym ∈ {−1, +1}, where m is the number of samples and n is the number of features (n = 27 in this experiment), the target is to find a decision boundary that separates the samples into different areas. The separator in two-dimensional space is a straight line and can be written as formula 3.2:

mx + c = 0 (3.2)

Once the separator is mapped to n-dimensional space, it becomes the separating hyperplane H, written as formula 3.3:

H : wT φ(x) + b = 0 (3.3)

where w = (w1, w2, ..., wn) is the vector determining the hyperplane direction and b is a displacement term; by determining the vector w and the bias b, the dividing hyperplane is denoted (w, b). In the sample space, the distance of a given point vector φ(xo) from the hyperplane is given by formula 3.4:

dH(φ(xo)) = (wT φ(xo) + b) / ‖w‖2 (3.4)

where ‖w‖2 is the 2-norm of w, defined in formula 3.5:

‖w‖2 = √(w1² + w2² + ... + wn²) (3.5)

Suppose hyperplane (w, b) can correctly classify the training dataset; then, for (xo, yo) ∈ D, the following holds:

wT φ(xo) + b ≥ +1, if yo = +1
wT φ(xo) + b ≤ −1, if yo = −1 (3.6)

The vectors closest to the hyperplane, for which equality in formula 3.6 holds, are named "support vectors". The sum of the distances to the hyperplane of two such vectors from different areas is called the "margin", defined in formula 3.7:

γ = 2 / ‖w‖2 (3.7)

Table 3.2: Genetic Algorithm techniques, descriptions, goals, and specific methods used in this study

Operator Description Goal Method deployed

Initialization initialization of the population to generate a set of solutions Random initialization
Encoding representation of the individuals to convey the necessary information Binary Encoding
Fitness the degree of health of individuals to evaluate the optimality of a solution Support Vector Machine (SVM)
Crossover parents are chosen for reproduction to determine which solutions are preserved Uniform Crossover
Selection solutions are selected for reproduction to select the individuals with better adaptability Proportional Selection
Mutation a gene is deliberately changed to maintain diversity in the population mutation rate setting
Termination a process stops the evolution to terminate and output the optimal outcomes Maximum generations

To obtain a dividing hyperplane with maximum margin, we must find parameters w and b that comply with the constraint in formula 3.6 such that the optimal hyperplane classifies all points in D correctly, namely:

w* = arg maxw (γ), s.t. minm ym [wT φ(xm) + b] = 1 (3.8)

Following the fitness computation with SVM under 5-fold cross-validation (a technique to prevent overfitting), proportional selection was conducted to choose individuals. The process is similar to a roulette wheel: individuals with higher fitness ratings have a greater chance of being selected, as shown in formula 3.9:

ϕs(xi(t)) = fγ(xi(t)) / ∑I=1..ns fγ(xI(t)) (3.9)

where ns is the total number of chromosomes in the population (initially 1,000), ϕs(xi) is the probability of xi being selected, and fγ(xi) is the fitness of xi, a float value greater than 0.
After completing selection, uniform crossover was performed to produce offspring from each pair of selected parents. Specifically, each gene is selected at random from one of the corresponding genes on either parent chromosome, as shown in Figure 3.7. Note that this method yields only one offspring, and the crossover rate is 0.75. Reproduction is followed by mutation, in which certain genes are changed randomly at a set rate (0.002 in this thesis). The final step is the termination criterion: when the maximum number of generations is reached, i.e., the population has evolved for 100 generations, an optimal feature subset is generated as the output. The entire workflow is illustrated in Figure 3.8.
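A sketch of the main GA operators under the settings above (27-feature binary masks, SVM fitness under 5-fold cross-validation, crossover rate 0.75, mutation rate 0.002) is given below; the generational loop itself is omitted for brevity.

# A sketch of the GA operators: individuals are binary masks over features,
# scored by 5-fold cross-validated SVM accuracy. X is a NumPy feature matrix
# and y the dropout labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(mask, X, y):
    if mask.sum() == 0:
        return 0.0                          # no features selected
    cols = mask.astype(bool)
    return cross_val_score(SVC(kernel="linear"), X[:, cols], y, cv=5).mean()

def uniform_crossover(p1, p2, rate=0.75):
    if np.random.rand() > rate:
        return p1.copy()                    # no crossover: clone a parent
    pick = np.random.rand(p1.size) < 0.5
    return np.where(pick, p1, p2)           # gene-wise choice, one offspring

def mutate(ind, rate=0.002):
    flip = np.random.rand(ind.size) < rate
    return np.where(flip, 1 - ind, ind)     # flip the selected bits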

3.1.4 Training and Test


After feature selection, three subsets with optimal features, corresponding to the three datasets used in this study, were generated. In this step, the model is trained with the processed data so as to predict dropout.
To begin with, Long Short-Term Memory (LSTM) is used as the training model in this study. LSTM is a special type of recurrent neural network (RNN).

Figure 3.6: Procedure of GA

Figure 3.7: Uniform Crossover



Figure 3.8: The workflow of GA+SVM feature selection



Table 3.3: The components of an LSTM cell

Component Description Denoted

Forget Gate activation vector with sigmoid f
Candidate layer activation vector with tanh C̃
Input Gate activation vector with sigmoid I
Output Gate activation vector with sigmoid O
Hidden state hidden state vector H
Memory state cell state vector C
Current Input input vector X

Figure 3.9: The internal architecture of LSTM cell [Sanjeevi, 2018]

Unlike traditional feedforward neural networks and conventional RNNs, LSTM has feedback connections and a gate schema, which enable it to learn and memorize information over time sequences. Human learning can likewise be related to LSTM’s gate schema; in this study, it simulates the influence of past exams on current performance. Furthermore, LSTM is capable of handling the vanishing gradient problem. Multiple past studies have focused on Deep Knowledge Tracing to predict student performance in quizzes [Piech et al., 2015] and on Human Activity Recognition based on the combination of LSTM and convolutional neural networks (CNN) [Xia et al., 2020].
As illustrated in Figure 3.9, an LSTM cell consists of the components listed in Table 3.3. The inputs at each time step are comprised of 3 elements: Xt, Ht−1, and Ct−1. As outputs, Ht and Ct are produced by the LSTM. Note that there are 8 weight matrices W in total, 4 associated with the hidden state and 4 with the input; moreover, 4 bias terms b are used in one unit. All W and b are initialized randomly and then adjusted by back-propagation. To prevent exploding gradients, the gradients are clipped when they reach the designed threshold of 1.01.
Consider the execution of the LSTM during one time step t, using the dataset from Section 3.1.3. The first move is to determine which information will be forgotten, as formula 3.10 illustrates. This step is executed by the forget gate: Ht−1 and Xt pass through the gate to produce an output between 0 and 1.

ft = σ(Wf · [Ht−1, xt] + bf) (3.10)


where t is between 0 and 31 inclusive, indicating that there are 32 time steps per student, representing information across 8 semesters with 4 courses each semester. For students who dropped out midway or did not finish all the semesters, the time series sequences are padded to keep the total number of time steps the same. xt is the input, and the size of the initial input (x0) equals the number of features in the input dataset. Furthermore, there are two hidden layers, and the hidden size is 50, which determines the size of H0 and C0.
The second step is to decide what new information will be stored in the cell state. First, the input gate produces the update values it; then a tanh layer creates a vector of new candidate values C̃t. These values are combined and the state is updated, as shown in formulas 3.11 and 3.12:

it = σ(Wi · [Ht−1, xt] + bi) (3.11)

C̃t = tanh(WC · [Ht−1, xt] + bC) (3.12)
The next step is to update Ct−1 and create the new cell state Ct. According to formula 3.13, the previous outputs ft, it, and C̃t are combined through a series of arithmetic operations to get the final cell state Ct, which becomes an input for the next step.

Ct = ft ∗ Ct−1 + it ∗ C̃t (3.13)
The final step outputs the hidden state. As shown in formulas 3.14 and 3.15, similar to steps 1 and 2, Ht−1 and Xt are carried as inputs through the output gate with a sigmoid, which decides which parts are output. The output is then passed through a tanh layer to obtain the hidden state Ht, which is used as the input at the next time step or in another neural network.

Ot = σ(Wo · [Ht−1, xt] + bo) (3.14)

Ht = Ot ∗ tanh(Ct) (3.15)
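The cell equations 3.10 to 3.15 can be wired together in a short NumPy sketch; the weight shapes are illustrative.

# A NumPy sketch of one LSTM cell step, combining formulas 3.10-3.15.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W[k]: (hidden, hidden + input) matrices and b[k]: (hidden,) biases,
    for k in {"f", "i", "c", "o"}."""
    z = np.concatenate([h_prev, x_t])           # [H_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate (3.10)
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate (3.11)
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # candidate values (3.12)
    c_t = f_t * c_prev + i_t * c_tilde          # new cell state (3.13)
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate (3.14)
    h_t = o_t * np.tanh(c_t)                    # hidden state (3.15)
    return h_t, c_t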
In this study, after a series of hyper-parameter tuning, the final output hidden state is fed into a 3-layer fully connected (FC) neural network to predict dropout. The hidden size in the FC network is 128; a sigmoid is deployed in the first layer, and ReLU is applied in the output layer. For loss measurement, as illustrated in formula 3.16, the Mean Squared Error (MSE) is used:

MSE = (1/n) ∑i=1..n (Yi − Ŷi)² (3.16)

where n is the number of samples, Yi is the observed value, and Ŷi is the predicted value.

Figure 3.10: The workflow of LSTM + FCs

After computing the loss, back-propagation is executed to adjust the weights using Adam as the optimizer. Furthermore, a dropout layer with dropout probability 0.7 is applied to the outputs of each LSTM layer apart from the last, to prevent overfitting. Figure 3.10 visualizes one training pass.
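A PyTorch sketch of this architecture under the stated settings (two LSTM layers with hidden size 50 and inter-layer dropout 0.7, a 3-layer FC head of hidden size 128, MSE loss, Adam, and gradient clipping at 1.01) follows; the activation of the middle FC layer is not specified in the text and is an assumption.

# A sketch of the LSTM + FC dropout predictor; dummy tensors stand in for a
# batch of student records (32 time steps each).
import torch
import torch.nn as nn

class DropoutPredictor(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 50, num_layers=2,
                            batch_first=True, dropout=0.7)
        self.fc = nn.Sequential(
            nn.Linear(50, 128), nn.Sigmoid(),          # first FC layer
            nn.Linear(128, 128), nn.Sigmoid(),         # middle layer (assumed)
            nn.Linear(128, 1), nn.ReLU())              # output layer

    def forward(self, x):                 # x: (batch, 32, n_features)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])        # last hidden state -> prediction

model = DropoutPredictor(n_features=27)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()
x, y = torch.randn(8, 32, 27), torch.rand(8, 1)        # dummy batch
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.01)  # gradient clipping
optimizer.step()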

3.2 Curriculum Semantic Analysis

Identifying dropout is an important step towards understanding why students fail and what can be done to increase graduation rates. We argue that the sequence of courses taken by students may influence student attrition. This section presents an approach to creating a sequence of courses based on their descriptions. The idea is to be able, in the future, to correlate dropout rates, the sequence of courses taken by students, and the course content/description.
We divide this section into two parts. The first part introduces a method to measure the similarity between course pairs using Bidirectional Encoder Representations from Transformers (the so-called BERT language model). The second part orders course pairs by employing the Semi-Reference Distance (SemRefD) [Manrique et al., 2019b].

3.2.1 Procedure Overview

As Figures 3.11 and 3.12 show, the first stage uses BERT to conduct the similarity measurement. It involves three steps: dataset acquisition, sentence embedding, and similarity measurement. The second stage uses the same dataset; its first process is to extract entities, followed by concept comparison, and in the end the sequence between courses is identified.

Figure 3.11: The brief workflow of the Similarity Measurement

Figure 3.12: Prerequisite Identification workflow

3.2.2 Similarity Measurement

To compare course descriptions, it is necessary to encode the contextual data into vectors so that they are comparable in a semantic way. To achieve this, we make use of BERT for sentence embedding.
BERT was proposed by Google [Devlin et al., 2018]; it is a revolutionary pre-training model that uses a multi-head-attention-based Transformer to learn contextual relations between words or sentences in context. The Transformer was also proposed by Google [Vaswani et al., 2017] and consists of an encoder and a decoder; the encoder contains a scaled dot-product attention layer and a feedforward neural network layer. With regard to self-attention, it unifies the matrix representation to calculate the scores and the final output embedding in one step, as formula 3.17 illustrates:

Attention(Q, K, V) = softmax(QKT / √dk) V (3.17)

where K and V form key-value pairs of the input, of dimension dk, and Q stands for the query. The output is a weighted sum of the values.
Compared with computing the attention a single time, the multi-head mechanism used in BERT runs the scaled dot-product attention several times in parallel and combines the separate embedding matrices into one final output embedding, as formula 3.18 shows:

MultiHead(Q, K, V) = Concat(head1, ..., headh) WO (3.18)

where headi = Attention(QWiQ, KWiK, VWiV), and WiQ, WiK, WiV, and WO are separate weight matrices.
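A minimal sketch of scaled dot-product attention (formula 3.17) follows; the tensor shapes are illustrative, not BERT’s actual configuration.

# A minimal sketch of scaled dot-product attention (formula 3.17).
import math
import torch

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V           # weighted sum of values

Q = K = V = torch.randn(2, 8, 64)   # (batch, sequence, d_k), illustrative
out = attention(Q, K, V)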

Figure 3.13: BERT Base

BERT contains 12 stacked encoders in its base version and 24 stacked encoders in its large version. Transformer encoders read each sequence of words at once instead of reading text input sequentially as directional models do. With this characteristic, the model is able to determine a word’s context from all of its surrounding information, as shown in Figure 3.13.
To start with, the datasets this study uses ANU Course & Program Website2 .
Furthermore, this study takes Computer Science courses into consideration, hence,
course code beginning with "COMP" will be considered. After using a tokenizer
to segment the full text into sentences and applying BERT on sentences as such for
encoding, the vectors of sentences are constructed. Next the vectors will be computed
by cosine similarity measurement which aims to get the distance, the opposite of
similarity as equation 3.19 specifies:

te ∑in=1 ti ei
cos(t, e) = =p n (3.19)
ktkkek ∑i=1 (ti )2 ∑in=1 (ei )2
p

where $t$ and $e$ are sentence vectors generated by BERT; the output ranges within
[-1, 1] and depicts the degree of contextual similarity.
To obtain the overall contextual similarity between two courses, the average
similarity among their sentences is calculated. After the processes above, the course
similarity is obtained.
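A minimal sketch of this pipeline is shown below, assuming a running bert-as-service instance as described in Appendix 3; the helper names and the sample sentences are illustrative assumptions, not the project's actual code.

    import numpy as np
    from bert_serving.client import BertClient  # requires a running bert-serving-start server

    def cosine_similarity(t: np.ndarray, e: np.ndarray) -> float:
        # Equation 3.19: cosine similarity of two sentence vectors
        return float(np.dot(t, e) / (np.linalg.norm(t) * np.linalg.norm(e)))

    def course_similarity(sentences_a, sentences_b, bc: BertClient) -> float:
        # Average pairwise sentence similarity between two course descriptions
        vecs_a = bc.encode(sentences_a)  # one fixed-size vector per sentence
        vecs_b = bc.encode(sentences_b)
        scores = [cosine_similarity(a, b) for a in vecs_a for b in vecs_b]
        return sum(scores) / len(scores)

    # Hypothetical usage, with a server started as in Appendix 3:
    # bc = BertClient()
    # print(course_similarity(["Functional programming and problem solving."],
    #                         ["Structured programming in Java."], bc))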

3.2.3 Prerequisite Identification


Although the similarity between courses has been computed, the order between
highly similar courses remains uncertain; hence, it becomes necessary to identify the
prerequisite dependency (PD) between two highly similar courses.


Regarding PD, it is a relation between two concepts where, in an educational context,
the prerequisite concept should be taught first. For instance, Binary Tree and Red-Black
Tree are two concepts belonging to the data structures field in Computer Science; the
latter should be introduced after the former. By measuring the prerequisite relationships
between courses, the curriculum can be analyzed as a whole.
To begin with, similar to the similarity measurement, the same dataset and pre-
processing techniques were used. Subsequently, the entities behind the text are
extracted; a tool named TextRazor was employed to complete this task. TextRazor
is an NLP-based API developed to segment text and capture conceptual terms.
Then a technique called Semi-Reference Distance (SemRefD) was applied to measure
the semantic PD between the entities of two courses in the DBpedia
(https://www.dbpedia.org/) Knowledge Graph (KG).
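As an illustration, a hedged sketch of the entity-extraction step using the TextRazor Python client is given below; the API key placeholder, the function name, and the confidence threshold are assumptions, not values from this project's entity_extractor.py.

    import textrazor

    textrazor.api_key = "YOUR_API_KEY"  # placeholder; a real key is required

    def extract_concepts(course_description: str, min_confidence: float = 2.0):
        # Extract candidate concepts (entities) from a course description
        client = textrazor.TextRazor(extractors=["entities"])
        response = client.analyze(course_description)
        concepts = set()
        for entity in response.entities():
            # Keep reasonably confident entities that resolve to a knowledge-base id
            if entity.confidence_score >= min_confidence and entity.id:
                concepts.add(entity.id)
        return sorted(concepts)

    # e.g. extract_concepts("This course introduces binary trees and red-black trees.")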
With respect to the KG, it is also known as a semantic network: it represents a network
of real-world entities (in this study, concepts) and illustrates the relationships between
them. DBpedia is one of the main KGs on the Semantic Web and provides a wide
variety of topics, which makes it possible to encompass courses from various fields of
study. Moreover, it is tolerant and inclusive of many different semantic properties,
which enables flexible connections to multiple types of concepts. Using a given concept
as a query in DBpedia, two lists of candidate concepts are built: a direct list and a
neighbor list. The direct list contains concepts sharing a category with the given
concept. The neighbor list expands the candidate set by adding concepts linked to the
target through non-hierarchical paths of up to m hops [Manrique et al., 2019a].
The path-length parameter m decides the maximum length of the path between the
target concept and the farthest candidate concept to consider; in this study, m is 1.
Soon after acquiring the candidate lists, SemRefD is performed to compute
the degree of prerequisite dependency in the next step.
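For illustration, a minimal sketch of how the direct list could be retrieved from the public DBpedia SPARQL endpoint is shown below; the function name, result limit, and use of the SPARQLWrapper library are assumptions rather than the project's actual retrieval code (RefDSimple.py).

    from SPARQLWrapper import SPARQLWrapper, JSON

    DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

    def direct_candidates(concept_uri: str, limit: int = 100):
        # Direct list: concepts sharing at least one category with the target concept
        sparql = SPARQLWrapper(DBPEDIA_ENDPOINT)
        sparql.setQuery(f"""
            PREFIX dct: <http://purl.org/dc/terms/>
            SELECT DISTINCT ?other WHERE {{
                <{concept_uri}> dct:subject ?category .
                ?other dct:subject ?category .
                FILTER (?other != <{concept_uri}>)
            }} LIMIT {limit}
        """)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["other"]["value"] for b in results["results"]["bindings"]]

    # e.g. direct_candidates("http://dbpedia.org/resource/Binary_tree")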
SemRefD was presented by Manrique et al. [Manrique et al., 2019a] based on
Reference Distance (RefD), which was proposed by Chen et al. [Chen et al., 2018] and
is defined in formula 3.20 for two input concepts denoted $c_A$ and $c_B$:

$$\mathrm{RefD}(c_A, c_B) = \frac{\sum_{j=1}^{k} i(c_j, c_B)\, s(c_j, c_A)}{\sum_{j=1}^{k} s(c_j, c_A)} - \frac{\sum_{j=1}^{k} i(c_j, c_A)\, s(c_j, c_B)}{\sum_{j=1}^{k} s(c_j, c_B)} \tag{3.20}$$
The indicator function $i(c_j, c_A)$ indicates whether there is a relationship between $c_j$
and $c_A$, and the weighting function $s(c_j, c_A)$ weights the importance of $c_j$ to $c_A$. The
values of $\mathrm{RefD}(c_A, c_B)$ range from -1 to 1. According to Figure 3.14, the closer the
value is to 1, the more likely $c_B$ is a prerequisite for $c_A$.
RefD does not take into account the semantic properties of DBpedia to determine
whether two concepts have a prerequisite dependency; SemRefD does. In the
weighting function $s(c_j, c_A)$, the concepts' common neighbors in the KG hierarchy
are considered, while in the indicator function $i(c_j, c_A)$, the property paths between
the target concept and related concepts are considered [Manrique et al., 2019a].

Figure 3.14: The prerequisite relation between concept A ($c_A$) and concept B ($c_B$)
As a result, all concept pairs from the two courses are compared and their scores
summed up to reveal the order of the courses, that is, whether A is a prerequisite of B
or B is a prerequisite of A.
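This aggregation step can be sketched as follows; sem_refd is a hypothetical stand-in for the pairwise SemRefD scorer, and the sign convention follows formula 3.20 (values close to 1 mean the second concept is likely a prerequisite of the first).

    from itertools import product

    def course_order_score(concepts_a, concepts_b, sem_refd):
        # Sum pairwise SemRefD scores between the concept sets of two courses.
        # sem_refd(a, b) is assumed to return a value in [-1, 1]; a clearly
        # positive total suggests course B's concepts tend to be prerequisites
        # of course A's concepts, and a clearly negative total suggests the reverse.
        return sum(sem_refd(a, b) for a, b in product(concepts_a, concepts_b))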
Chapter 4

Results and Discussion

In this chapter, the results of the experiments described in Chapter 3 are presented
in two sections, namely dropout prediction and curriculum semantic analysis, along
with detailed discussions. Notably, the results yielded by the methods in the
systematic review were reported in Section 2.2.1. The experimental environment
is introduced first.

4.1 Experimental Environment


The experiments are conducted with the following devices and corresponding
hardware, as shown in Table 4.1. Specifically, the paper-retrieving process in the
systematic review was completed on the Apple MacBook Pro; the rest of the
experiments were conducted on the server.

4.2 Dropout Prediction


This section contains two subsections. The evaluation criteria are presented first,
followed by the results of feature selection and dropout prediction.

4.2.1 Evaluation Metrics


Dropout prediction is evaluated by accuracy, precision, recall, and F1 score based on
the confusion matrix shown in Figure 4.1. Specifically, TP, TN, FP, and FN are
defined as below:
True Positive (TP): the model predicted a positive outcome, and the prediction is correct.
True Negative (TN): the model predicted a negative outcome, and the prediction is correct.
False Positive (FP): the model predicted a positive outcome, but the prediction is incorrect.
False Negative (FN): the model predicted a negative outcome, but the prediction is incorrect.

Figure 4.1: Confusion Matrix

Table 4.1: Experimental Environment

Model | CPU | GPU | Memory | Hard disk size | System
Server | Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz | Tesla V100-DGXS-32GB*4 | 251GB | 10TB | Ubuntu 18.04.6
Apple Macbook Pro | M1 chip | 8-core GPU | 16GB | 512GB | macOS Monterey 12.0.1
To be specific, accuracy refers to the proportion of correctly predicted observations
among all samples. Precision measures the ratio of correctly predicted positive
samples to all samples predicted as positive. Recall depicts the sensitivity of the
model by measuring the ratio of correctly predicted positive observations to all
actual positive observations. The F1 score measures the overall performance of the
model. The equations are given in formulas 4.1-4.4 below:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4.1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4.2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4.3}$$

$$F_1 = \frac{2 \cdot (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} = \frac{TP}{TP + \frac{FP + FN}{2}} \tag{4.4}$$
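As a quick reference, a small sketch computing formulas 4.1-4.4 directly from raw confusion-matrix counts is shown below; the function name and example counts are illustrative only.

    def classification_metrics(tp: int, tn: int, fp: int, fn: int):
        # Formulas 4.1-4.4, computed from confusion-matrix counts
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * recall * precision / (recall + precision)  # equals tp / (tp + (fp + fn) / 2)
        return accuracy, precision, recall, f1

    # e.g. classification_metrics(tp=90, tn=5, fp=10, fn=2)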

4.2.2 Experimental Results and Discussions


As mentioned in Section 3.1.2, the ADM, ARQ, and CSI datasets are used to evaluate
the proposed models in terms of feature selection and dropout prediction. The first
subsection concerns feature selection, which was proposed in Section 3.1.3, followed
by dropout prediction, as described in Section 3.1.4.

4.2.2.1 Feature Selection

The results of feature selection are listed in Table 4.2. After 100 generations of
evolution in a population of 1000 individuals, the final optimal individual on ARQ
achieved the best performance in terms of accuracy (90.61%) and F1 score (95.03%)
with the fewest dropped features, indicating that this individual fits the environment
better than the others. By comparison, CSI shows the weakest performance across the
measurement metrics. The lower precision than recall observed across all datasets
indicates that the model is biased towards predicting positive rather than negative
outcomes, emphasising the importance of dataset balance. Furthermore, the ratios of
negative samples in CSI (24%), ADM (13%), and ARQ (10%) support this inference.
Thus, running the model on a balanced dataset may improve its performance. As for
ADM, it has nearly the same recall as ARQ and the best recall score in this experiment.

Table 4.2: Results of ADM, ARQ, and CSI by SVM

Dataset | Accuracy(%) | Precision(%) | Recall(%) | F1 score(%) | No. dropped features
ADM | 88.42 | 88.75 | 99.52 | 93.83 | 10
ARQ | 90.61 | 91.14 | 99.26 | 95.03 | 6
CSI | 81.23 | 82.28 | 97.84 | 89.36 | 12

Overall, the results show that this model is well suited to this experiment.

4.2.2.2 Dropout Prediction

Likewise, this experiment used the same datasets for dropout prediction. As dropout
prediction is one of the main objectives of this project, the performance of the
proposed model is important. As illustrated in Figures 4.2-4.7, the training outcomes
reveal the capacity of the model; the end of the abscissa is epoch (every 10 epochs) ×
time step. Furthermore, the performance of the model on the test set further validates
its suitability for the datasets, as shown in Table 4.3.
According to the accuracy and loss during training, we can observe that the model
always converges after 10 epochs. The accuracy on the ADM and ARQ datasets starts
below 10%, initially falls even lower, and afterwards rises close to the top; for ARQ,
the curve is similar to some extent. After investigation, multiple possible reasons for
this abnormal curve were identified: for instance, a high dropout rate, which causes
the model to drop well-functioning neurons, or a large batch size. In this study, the
time step is fixed, as mentioned in Section 3.1.4, which means the batch size is not
controllable. In addition, after hyper-parameter tuning, a high dropout rate (0.7)
produced the best results, which shows that selecting the current dropout rate is a
trade-off decision. Thus, this study chose to keep the better performance obtained
with the current dropout rate.
In terms of loss, after reaching convergence, the loss curve repeats as the time step
resets. Apart from this repetition, we also observed an unstable curve within a single
epoch of training, a phenomenon named multiple descent [Chen et al., 2020]. The
reasons behind this anomaly may vary. For example, assume there are two minima
θ1 and θ2: when the distance between them, d = kθ1 − θ2 k, is very small and the
learning rate is not small enough, the optimizer may cross the local minimum θ1 and
arrive at θ2 eventually. This phenomenon can also be caused by the datasets [Chen
et al., 2020].
With respect to the performance of the proposed model on the test set, the model
performs well on ADM and ARQ, whose best accuracies reach 92.83% and 97.65%
respectively. Notably, the accuracy on ARQ improves the result in Manrique's
previous work (95.2%) by 2.45% [Manrique et al., 2019b]. Finally, we identified a
potential improvement for the model: adopting dynamic time steps instead of fixed
steps to make full use of the dataset. Overall, the model proposed by this study is
suitable for the current datasets.

Figure 4.2: Accuracy of LSTM+FC on ADM
Figure 4.3: Loss of LSTM+FC on ADM
Figure 4.4: Accuracy of LSTM+FC on ARQ
Figure 4.5: Loss of LSTM+FC on ARQ
Figure 4.6: Accuracy of LSTM+FC on CSI
Figure 4.7: Loss of LSTM+FC on CSI

Table 4.3: Results of ADM, ARQ, and CSI

Dataset | Avg. acc (10 iter.)(%) | Avg. acc (100 iter.)(%) | Avg. acc (200 iter.)(%) | Top acc(%)
ADM | 88.76 | 88.24 | 90.17 | 92.83
ARQ | 94.84 | 95.76 | 95.19 | 97.65
CSI | 74.98 | 75.84 | 76.31 | 81.52

4.3 Curriculum Semantic Analysis

This section contains two subsections: similarity measurement, as described in
Section 3.2.2, and prerequisite identification, as described in Section 3.2.3.

4.3.1 Similarity Measurement

After the sentence encodings were obtained with BERT and the average similarity
between courses computed, the results were visualized as heat maps: Figure 4.8
presents all pairwise course comparisons, with similarity values ranging from 0.8265
to 1.0, and Figures 4.9-4.12 break the results down by course level.
In the 1000-level course comparison, we can see that COMP1110 (Structured Pro-
gramming) has the closest average distance to the rest of the courses. In contrast,
COMP1600 (Foundations of Computing) has the farthest. Interpreting this from the
course content, COMP1110, like COMP1100 (Programming as Problem Solving), is one
of the most fundamental programming courses among the 1000-level courses, while
COMP1600 focuses on the mathematical perspective.
With respect to the 2000-level course comparison, COMP2100 (Software Design
Methodologies) has the closest average distance to the rest of the courses. On the
contrary, COMP2560 (Studies in Advanced Computing R & D) has the farthest. To
sum up, the overall similarity between 2000-level courses is lower than between
1000-level courses, which indicates that curriculum differentiation has appeared.
As regards 3000-level and 4000-level courses, it can be seen that this trend has
become more obvious, which also aligns with the real situation: as students'
knowledge grows, course content deepens and divides into specializations, such as
Data Science and Machine Learning. In the meantime, students develop interests to
dive into, which helps them leverage their strengths in their areas of interest and
improve their academic performance in return. Finally, we identified an obstacle in
automatic evaluation, which is a common limitation of the studies in the systematic
review [Piedra and Caro, 2018], [Pata et al., 2013], [Yu et al., 2007], [Saquicela
et al., 2018] and [Tapia-Leon et al., 2018].

Figure 4.8: All level courses similarity



Figure 4.9: 1000-level courses similarity
Figure 4.10: 2000-level courses similarity
Figure 4.11: 3000-level courses similarity
Figure 4.12: 4000-level courses similarity

4.3.2 Prerequisite Identification


As Section 4.3.1 presented, the course similarities have been computed; it is thus
necessary to identify the prerequisite relation between two similar courses.
For this study, we selected three courses, namely COMP1100, COMP1110, and COMP2100,
to conduct this experiment, as they have high similarity scores from the previous stage
and are all programming-oriented courses. The results are shown in Figure 4.13.
Between COMP1100 and COMP2100, whose similarity is 0.9479, the prerequisite
score is 13.17. Based on the rule of thumb in the experiment, this score is very high,

Figure 4.13: Prerequisite scores among COMP1100, COMP1110, and COMP2100

indicating that there is a strong prerequisite relation between these two courses.

Similarly, COMP1110 and COMP2100 also have a very high score, 12.06, indicating
that COMP1110 is likewise one of the prerequisites of COMP2100.
Regarding COMP1100 and COMP1110, the similarity is 0.9543 and the score is 4.4,
which reveals that although there is a strong bond between these two courses, their
mutual prerequisite relationship is not as strong as their respective relationships with
COMP2100.
Chapter 5

Conclusion and Future Work

This study was undertaken to improve students' academic performance by using AI
and semantic technologies. To achieve this goal, we came up with three objectives,
along with three implementations: to predict students' performance using grades
from the previous semester, to model a course representation in a semantic way and
compute the similarity, and to identify the sequence between two similar courses.
To highlight the research gaps in Computer Science curriculum semantic analysis
and contribute to this growing field of research, we conducted a systematic review of
the semantic technologies currently being used. A major finding of the study is that
the technologies used to measure similarity have limitations in terms of accuracy and
ambiguity in the representation of concepts, courses, or curricula. Our research fills
this gap. Furthermore, the review also inspired us to think further about identifying
the sequence between similar courses.
Regarding students' academic prediction, we conducted a dropout prediction
experiment on a dataset from a Brazilian university. Three clean datasets were
generated after pre-processing. Then, an LSTM was trained to predict dropouts based
on an SVM-based GA feature selection. Taken together, the results of the experiment
illustrate that the LSTM has strong adaptability in predicting dropout, with
breakthrough progress in terms of accuracy: it improves the best accuracy by 2.45%
over Manrique's work [Manrique et al., 2019b] on the ARQ dataset. Owing to the
unbalanced datasets, we also observed the limitations of this study: the model has a
bias in feature selection, an abnormal accuracy decline during training, and multiple
descents in loss, which emphasizes the importance of dataset balance.
With respect to course similarity measurement, we deployed BERT, which has strong
input-embedding power, to encode the sentences in the course descriptions from the
Australian National University. We then used cosine similarity to obtain the distance
between courses. As a result, we found that COMP1110 has the closest average
similarity to the rest of the 1000-level courses. Among 2000-level courses, we
identified that COMP2100 has the closest relationship with the rest. In terms of
courses at the 3000 and 4000 levels, since specializations have formed, comparing
them is of little value; the results also align with this viewpoint. As regards
limitations, at this stage we cannot evaluate the results automatically.
The final step in this project is measuring the sequence between two similar courses.
We employed TextRazor to extract entities from the course descriptions and then
used SemRefD, which was proposed and evaluated by [Manrique et al., 2019a], to
measure the degree of prerequisite dependency between two concepts. By deploying
the model on COMP1100, COMP1110, and COMP2100, we established the relationships
between these three courses. The results show that COMP1100 and COMP1110 both
have a strong prerequisite relationship with COMP2100, while the relationship
between the two of them is inclined to be on the same level.
In terms of future work, these technologies could potentially be used to analyse the
curricula of university programs, to aid student advisors, and to create
recommendation systems that combine semantic and deep learning technologies.
Furthermore, this study suggested that the GA+LSTM model could provide
institutions with early detection for dealing with problems and retaining students.
As for the experiment's future refinement, first, in dropout prediction, dynamic time
steps could be introduced into the LSTM instead of fixed steps to make full use of the
dataset. Moreover, a more balanced dataset could be used in the experiment. We will
continue to investigate and develop appropriate models to improve the results in the
future.
Bibliography

Aeiad, E. and Meziane, F., 2016. Validating learning outcomes of an E-Learning


system using NLP. Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9612 (2016), 292–300.
doi:10.1007/978-3-319-41754-7_27. (cited on pages 6 and 9)

Kitchenham, B. A. and Charters, S., 2007. Guidelines for performing systematic
literature reviews in software engineering. 2 (01 2007). (cited on page 4)

Babatunde, O.; Armstrong, L.; Leng, J.; and Diepeveen, D., 2014. A genetic
algorithm-based feature selection. International Journal of Electronics Communication
and Computer Engineering, 5 (07 2014), 889–905. (cited on page 15)

Barb, A. S. and Kilicay-Ergin, N., 2020. Applications of natural language tech-


niques to enhance curricular coherence. Procedia Computer Science, 168 (2020), 88–96.
doi:https://doi.org/10.1016/j.procs.2020.02.263. https://www.sciencedirect.com/science/
article/pii/S1877050920304026. “Complex Adaptive Systems”Malvern, Pennsylva-
niaNovember 13-15, 2019. (cited on pages 7, 8, and 9)

Cheadle, C.; Vawter, M. P.; Freed, W. J.; and Becker, K. G., 2003. Analysis of
microarray data using z score transformation. The Journal of Molecular Diagnostics,
5, 2 (2003), 73–81. doi:https://doi.org/10.1016/S1525-1578(10)60455-2. https://www.
sciencedirect.com/science/article/pii/S1525157810604552. (cited on page 15)

Chen, L.; Jianbo, Y.; Shuting, W.; Bart, P.; and Giles, C., 2018. Investigating active
learning for concept prerequisite learning. In 32nd AAAI Conference on Artificial In-
telligence, AAAI 2018, 32nd AAAI Conference on Artificial Intelligence, AAAI 2018,
7913–7919. AAAI press. Funding Information: We gratefully acknowledge partial
support from the Pennsylvania State University Center for Online Innovation in
Learning. Publisher Copyright: Copyright © 2018, Association for the Advance-
ment of Artificial Intelligence (www.aaai.org). All rights reserved.; 32nd AAAI
Conference on Artificial Intelligence, AAAI 2018 ; Conference date: 02-02-2018
Through 07-02-2018. (cited on page 25)

Chen, L.; Min, Y.; Belkin, M.; and Karbasi, A., 2020. Multiple descent: Design your
own generalization curve. CoRR, abs/2008.01036 (2020). https://arxiv.org/abs/2008.
01036. (cited on page 29)

Chung, H. S. and Kim, J. M., 2014. Semantic model of syllabus and learning ontology
for intelligent learning system. Lecture Notes in Computer Science (including subseries


Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8733 (2014),
175–183. doi:10.1007/978-3-319-11289-3_18. (cited on page 8)
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K., 2018. BERT: pre-training of
deep bidirectional transformers for language understanding. CoRR, abs/1810.04805
(2018). http://arxiv.org/abs/1810.04805. (cited on pages v, 2, and 23)
Diaz-Mujica, A.; Pérez, M.; Bernardo, A.; Cervero, A.; and González-Pienda,
J., 2019. Affective and cognitive variables involved in structural prediction of
university droput. Psicothema, 31 (10 2019), 429–436. doi:10.7334/psicothema2019.124.
(cited on page 1)
Drennan, L. and Rohde, F., 2002. Determinants of performance in advanced un-
dergraduate management accounting: An empirical investigation. Accounting and
Finance, 42 (02 2002), 27–40. doi:10.1111/1467-629X.00065. (cited on page 1)
Fiallos, A., 2018. Assisted curricula design based on generation of domain ontologies
and the use of NLP techniques. 2017 IEEE 2nd Ecuador Technical Chapters Meeting,
ETCM 2017, 2017-Janua (2018), 1–6. doi:10.1109/ETCM.2017.8247474. (cited on pages
7 and 9)
Guberovic, E.; Turcinovic, F.; Relja, Z.; and Bosnic, I., 2018. In search of a
syllabus: Comparing computer science courses. 2018 41st International Convention
on Information and Communication Technology, Electronics and Microelectronics, MIPRO
2018 - Proceedings, (2018), 588–592. doi:10.23919/MIPRO.2018.8400111. (cited on page
7)
Hao, X.; Meng, X.; and Cui, X., 2008. Knowledge Point Based Curriculum Developing
and Learning Object Reusing. Knowledge Creation Diffusion Utilization, (2008), 126–
137. (cited on pages 7 and 8)
Hedayati, M. H. and Mart, L., 2016. Ontology-driven modeling for the culturally-
sensitive curriculum development: A case study in the context of vocational ICT
education in Afghanistan. Proceedings of the 10th INDIACom; 2016 3rd International
Conference on Computing for Sustainable Global Development, INDIACom 2016, , Ontol-
ogy 101 (2016), 928–932. (cited on pages 6 and 7)
Gomaa, W. H. and Fahmy, A. A., 2013. A survey of text similarity approaches. Inter-
national Journal of Computer Applications, 68, 13 (2013), 13–18. doi:10.5120/11638-7118.
(cited on page 9)
Hong, S. and Lynn, H. S., 2020. Accuracy of random-forest-based imputation of
missing data in the presence of non-normality, non-linearity, and interaction. BMC
Medical Research Methodology, 20 (2020). (cited on page 14)
Huang, J.; Cai, Y.; and Xu, X., 2007. A hybrid genetic algorithm for feature selection
wrapper based on mutual information. Pattern Recognition Letters, 28, 13 (2007),
1825–1844. doi:https://doi.org/10.1016/j.patrec.2007.05.011. https://www.sciencedirect.
com/science/article/pii/S0167865507001754. (cited on page 15)

Imran, M. and Young, R. I., 2016. Reference ontologies for interoperability across
multiple assembly systems. International Journal of Production Research, 54, 18 (2016),
5381–5403. doi:10.1080/00207543.2015.1087654. (cited on page 6)
Imron, M. A. and Prasetyo, B., 2020. Improving algorithm accuracy k-nearest
neighbor using z-score normalization and particle swarm optimization to predict
customer churn. (cited on page 15)
Johnson, M. and Kuennen, E., 2006. Basic math skills and performance in an
introductory statistics course. Journal of Statistics Education, 14, 2 (2006), null. doi:
10.1080/10691898.2006.11910581. https://doi.org/10.1080/10691898.2006.11910581.
(cited on page 1)
Karunananda, A. S.; Rajakaruna, G. M.; and Jayalal, S., 2012. OntoCD - Onto-
logical solution for curriculum development. International Conference on Advances
in ICT for Emerging Regions, ICTer 2012 - Conference Proceedings, (2012), 137–144.
doi:10.1109/ICTer.2012.6421412. (cited on pages 6, 7, and 8)

Kawintiranon, K.; Vateekul, P.; Suchato, A.; and Punyabukkana, P., 2016. Under-
standing knowledge areas in curriculum through text mining from course materials.
In 2016 IEEE International Conference on Teaching, Assessment, and Learning for Engi-
neering (TALE), 161–168. doi:10.1109/TALE.2016.7851788. (cited on page 8)
Lasley, T. J., 2013. Bloom’s Taxonomy. doi:10.4135/9781412957403.n51. (cited on
page 9)
Leardi, R., 2000. Application of genetic algorithm–pls for feature selection in spectral
data sets. Journal of Chemometrics - J CHEMOMETR, 14 (09 2000), 643–655. doi:
10.1002/1099-128X(200009/12)14:5/63.0.CO;2-E. (cited on page 15)

Liang, Y. and Ma, X., 2012. Teaching reform of software engineering course. ICCSE
2012 - Proceedings of 2012 7th International Conference on Computer Science and Edu-
cation, , Iccse (2012), 1936–1939. doi:10.1109/ICCSE.2012.6295452. (cited on page
6)
Maffei, A.; Daghini, L.; Archenti, A.; and Lohse, N., 2016. CONALI Ontology. A
Framework for Design and Evaluation of Constructively Aligned Courses in Higher
Education: Putting in Focus the Educational Goal Verbs. Procedia CIRP, 50 (2016),
765–772. doi:10.1016/j.procir.2016.06.004. http://dx.doi.org/10.1016/j.procir.2016.06.004.
(cited on pages 6 and 7)
Mandić, M., 2018. Semantic web based software platform for curriculum harmoniza-
tion*. ACM International Conference Proceeding Series, (2018). doi:10.1145/3227609.
3227654. (cited on pages 6, 7, and 10)

Manrique, R.; Pereira Nunes, B.; and Marino, O., 2019a. Exploring knowledge
graphs for the identification of concept prerequisites. Smart Learning Environments,
6 (12 2019), 21. doi:10.1186/s40561-019-0104-3. (cited on pages v, 1, 2, 14, 25, 26,
29, 35, and 36)

Manrique, R.; Pereira Nunes, B.; Marino, O.; Casanova, M.; and Nurmikko-
Fuller, T., 2019b. An analysis of student representation, representative features
and classification algorithms to predict degree dropout. 401–410. doi:10.1145/
3303772.3303800. (cited on pages v, 13, and 22)

McGuinness, D. L. and van Harmelen, F., 2004. OWL Web Ontology Lan-
guage Overview. W3C recommendation, 10, February (2004). http://www.w3.org/
TR/owl-features/ . (cited on page 7)

Nuntawong, C.; Namahoot, C. S.; and Brückner, M., 2016. A web based co-
operation tool for evaluating standardized curricula using ontology mapping.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial In-
telligence and Lecture Notes in Bioinformatics), 9929 LNCS (2016), 172–180. doi:
10.1007/978-3-319-46771-9_23. (cited on pages 6, 7, 8, and 9)

Nuntawong, C.; Namahoot, C. S.; and Brückner, M., 2017. HOME: Hybrid ontol-
ogy mapping evaluation tool for computer science curricula. Journal of Telecommu-
nication, Electronic and Computer Engineering, 9, 2-3 (2017), 61–65. (cited on pages 6,
7, 8, and 9)

Nuntawong, C.; Snae, C.; and Brückner, M., 2015. A semantic similarity assessment
tool for computer science subjects using extended wu & palmer’s algorithm and
ontology. Lecture Notes in Electrical Engineering, 339 (02 2015), 989–996. doi:10.1007/
978-3-662-46578-3_118. (cited on pages 7, 8, and 9)

Orellana, G.; Orellana, M.; Saquicela, V.; Baculima, F.; and Piedra, N., 2018.
A text mining methodology to discover syllabi similarities among higher edu-
cation institutions. Proceedings - 3rd International Conference on Information Sys-
tems and Computer Science, INCISCOS 2018, 2018-Decem (2018), 261–268. doi:
10.1109/INCISCOS.2018.00045. (cited on pages 8 and 9)

Pantanowitz, A. and Marwala, T., 2009. Missing data imputation through the use
of the random forest algorithm. Advances in Computational Intelligence, 116 (01 2009).
doi:10.1007/978-3-642-03156-4_6. (cited on page 14)

Pata, K.; Tammets, K.; Laanpere, M.; and Tomberg, V., 2013. Design principles for
competence management in curriculum development. Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes
in Bioinformatics), 8095 LNCS (2013), 260–273. doi:10.1007/978-3-642-40814-4_21.
(cited on page 31)

Paulk, M.; Curtis, B.; Chrissis, M.; and Weber, C., 1993. Capability Maturity Model
for Software, Version 1.1. (cited on page 6)

Pawar, A. and Mago, V., 2018. Similarity between learning outcomes from course
objectives using semantic analysis, bloom’s taxonomy and corpus statistics. arXiv,
(2018). (cited on page 9)

Piech, C.; Spencer, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L. J.; and
Sohl-Dickstein, J., 2015. Deep knowledge tracing. CoRR, abs/1506.05908 (2015).
http://arxiv.org/abs/1506.05908. (cited on page 20)

Piedra, N. and Caro, E. T., 2018. LOD-CS2013: Multilearning through a semantic
representation of IEEE computer science curricula. IEEE Global Engineering Educa-
tion Conference, EDUCON, 2018-April (2018), 1939–1948. doi:10.1109/EDUCON.2018.
8363473. (cited on pages 6, 7, 8, and 31)

Sales, A. R. P.; Balby, L.; and Cajueiro, A., 2016. Exploiting academic records for
predicting student drop out: a case study in brazilian higher education. J. Inf. Data
Manag., 7 (2016), 166–180. (cited on page 1)

Sanjeevi, M., 2018. Chapter 10.1: Deepnlp - lstm (long short term mem-
ory) networks with math. https://medium.com/deep-math-machine-learning-ai/
chapter-10-1-deepnlp-lstm-long-short-term-memory-networks-with-math-21477f8e4235.
(cited on pages ix and 20)

Saquicela, V.; Baculima, F.; Orellana, G.; Piedra, N.; Orellana, M.; and Es-
pinoza, M., 2018. Similarity detection among academic contents through semantic
technologies and text mining. In IWSW. (cited on pages 6, 7, 9, and 31)

Seidel, N.; Rieger, M.; and Walle, T., 2020. Semantic textual similarity of course
materials at a distance-learning university. CEUR Workshop Proceedings, 2734 (2020).
(cited on page 9)

Shah, A. D.; Bartlett, J. W.; Carpenter, J. R.; Nicholas, O.; and Hemingway,
H., 2014. Comparison of random forest and parametric imputation models for
imputing missing data using mice: A caliber study. American Journal of Epidemiology,
179 (2014), 764 – 774. (cited on page 14)

Shipley, B. and Ian, W., 2019. Here comes the drop: university drop out rates and
increasing student retention through education. https://www.voced.edu.au/content/
ngv:84995. (cited on page 1)

Tang, A. and Hoh, J., 2013. Ontology-specific API for a curricula management system.
2013 2nd International Conference on E-Learning and E-Technologies in Education, ICEEE
2013, (2013), 294–297. doi:10.1109/ICeLeTE.2013.6644391. (cited on pages 6 and 7)

Tao, Z.; Huiling, L.; Wenwen, W.; and Xia, Y., 2019. Ga-svm based feature selection
and parameter optimization in hospitalization expense modeling. Applied Soft
Computing, 75 (2019), 323–332. doi:https://doi.org/10.1016/j.asoc.2018.11.001. https:
//www.sciencedirect.com/science/article/pii/S1568494618306264. (cited on page 16)

Tapia-Leon, M.; Rivera, A. C.; Chicaiza, J.; and Luján-Mora, S., 2018. Application
of ontologies in higher education: A systematic mapping study. In 2018 IEEE Global
Engineering Education Conference (EDUCON), 1344–1353. doi:10.1109/EDUCON.2018.
8363385. (cited on pages 8 and 31)

Vaquero, J.; Toro, C.; Martín, J.; and Aregita, A., 2009. Semantic enhancement of
the course curriculum design process. In Proceedings of the 13th International Con-
ference on Knowledge-Based and Intelligent Information and Engineering Systems: Part I,
KES ’09 (Santiago, Chile, 2009), 269–276. Springer-Verlag, Berlin, Heidelberg. doi:
10.1007/978-3-642-04595-0_33. https://doi.org/10.1007/978-3-642-04595-0_33. (cited
on pages 6 and 7)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.;
Kaiser, L.; and Polosukhin, I., 2017. Attention is all you need. CoRR,
abs/1706.03762 (2017). http://arxiv.org/abs/1706.03762. (cited on page 23)

Vergel, J.; Quintero, G. A.; Isaza-Restrepo, A.; Ortiz-Fonseca, M.; Latorre-


Santos, C.; and Pardo-Oviedo, J. M., 2018. The influence of different curricu-
lum designs on students’ dropout rate: a case study. Medical Education Online,
23, 1 (2018), 1432963. doi:10.1080/10872981.2018.1432963. https://doi.org/10.1080/
10872981.2018.1432963. PMID: 29392996. (cited on page 1)

Wang, Y.; Wang, Z.; Hu, X.; Bai, T.; Yang, S.; and Huang, L., 2019. A Courses On-
tology System for Computer Science Education. 2019 IEEE International Conference
on Computer Science and Educational Informatization, CSEI 2019, 3 (2019), 251–254.
doi:10.1109/CSEI47661.2019.8938930. (cited on pages 6, 7, 8, and 9)

Wohlin, C., 2014. Guidelines for snowballing in systematic literature studies and a
replication in software engineering. ACM International Conference Proceeding Series,
(2014). doi:10.1145/2601248.2601268. (cited on page 5)

Wu, Z. and Palmer, M., 1994. Verbs semantics and lexical selection. In Proceedings
of the 32nd Annual Meeting on Association for Computational Linguistics, ACL ’94 (Las
Cruces, New Mexico, 1994), 133–138. Association for Computational Linguistics,
USA. doi:10.3115/981732.981751. https://doi.org/10.3115/981732.981751. (cited on
page 9)

Xia, K.; Huang, J.; and Wang, H., 2020. Lstm-cnn architecture for human activity
recognition. IEEE Access, 8 (2020), 56855–56866. doi:10.1109/ACCESS.2020.2982225.
(cited on page 20)

Yu, X.; Tungare, M.; Fan, W.; Pérez-Quiñones, M.; Fox, E. A.; Cameron, W.;
and Cassel, L., 2007. Using automatic metadata extraction to build a structured
syllabus repository. Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4822 LNCS (2007),
337–346. doi:10.1007/978-3-540-77094-7_43. (cited on page 31)

Appendix 1

Final Project Description


This project aims to improve students' performance by predicting student dropouts
and by curriculum semantic analysis.
As for dropout prediction, this project uses an SVM-based Genetic Algorithm to
perform feature selection, followed by an LSTM to predict dropout. In detail, we
conducted a dropout prediction experiment on a dataset from a Brazilian university.
Three clean datasets were generated after pre-processing. Then, the LSTM was
trained to predict dropouts based on the SVM-based GA feature selection.
With respect to course similarity measurement, we deployed BERT, which has strong
input-embedding power, to encode the sentences in the course descriptions from the
Australian National University. We then used cosine similarity to obtain the distance
between courses.
Regarding measuring the sequence between two similar courses, we employed
TextRazor to extract entities from the course descriptions and then used SemRefD,
which was proposed and evaluated by Manrique, to measure the degree of prerequisite
dependency between two concepts. By deploying the model on COMP1100, COMP1110,
and COMP2100, we established the relationships between these three courses.
The outcomes of the project suggest that these technologies could potentially be used
to analyse the curricula of university programs, to aid student advisors, and to create
recommendation systems that combine semantic and deep learning technologies.
Furthermore, this study suggested that the GA+LSTM model could provide institutions
with early detection for dealing with problems and retaining students.


Appendix 2

Contract


INDEPENDENT STUDY CONTRACT


Note: Enrolment is subject to approval by the Honours/projects co-ordinator
SECTION A (Students and Supervisors)

UniID: u6809382

FAMILY NAME: Cheng PERSONAL NAME(S): Yixin

PROJECT SUPERVISOR (may be external): Bernardo Pereira Nunes

COURSE SUPERVISOR (a RSCS academic): Penny Kyburz

COURSE CODE, TITLE AND UNIT: COMP8755 <Individual Computing Project> 12 units

SEMESTER S1 YEAR: _2021_ S2 YEAR: _2021_


PROJECT TITLE:
Improving Students’ Academic Performance through AI and Semantic Technologies

LEARNING OBJECTIVES:
- Understanding and ability to apply/implement semantic technologies, AI / Machine Learning models
and Data Mining techniques;
- Understanding fundamental research to address a research question;
- Apply computing knowledge and implementation skills to the area of Computers in Education;
- Understanding of interdisciplinary research; and,
- Understand how to conduct a substantial research-based project Understand how to analyse data,
synthesize and report research findings.

PROJECT DESCRIPTION:
• Write a literature survey
• Development of a model to improve students' performance using AI/Semantic technologies
• Train and test the model on large datasets
• Tune parameter and inference
• Evaluation/validation of the proposed model
• Deploy the model
• Write thesis

Research School of Computer Science Form approved CDC 11-Jul-19

Figure 5.1: Contract page 1



Figure 5.2: Contract page 2



Appendix 3

Description of Software
Dropout prediction

preprocessing.py implements the pre-processing: subset_seperator(self) separates the
raw dataset by degree; organise(self) performs data cleaning, data normalization, and
data validation; impute_missing_value(self) performs data imputation.

feature_selector.py implements the feature selection using GA+SVM: get_fitness(self,
pop, path) obtains the fitness of each individual via SVM; select(self, pop, fitness),
crossover(self, parent, pop), mutate(self, child), and evolution(self) are the steps of
the GA.

xx_dataloader.py (xx = ADM, ARQ, or CSI) implements training and testing with the
LSTM.

Directly run xx_dataloader.py (xx = ADM, ARQ, or CSI) to get the results of pre-
processing, feature selection, and training and testing.

Similarity Measurement

Installation and running of BERT as a Service. The BERT service used in this project
is installed as follows.

Install BERT as a Service with pip install -U bert-serving-client (instructions are
available at https://github.com/hanxiao/bert-as-service).

Run BERT as a Service pointing to the unzipped downloaded model using the fol-
lowing command: bert-serving-start -model_dir /your_directory/wwm_uncased_L-
24_H-1024_A-16 -num_worker=4 -port XXXX -max_seq_len NONE

Run python similarity.py to get the result of comparing two sentences from two
courses. The result will be in /result/similarity_measurement/full_similarity


Run python text_similarity.py to get the result of comparing two courses, which
will be stored in /result/similarity_measurement/full_similarity/similarity_full.csv

Run python diagram.py to visualize the results.

Prerequisite Identification

Install the external libraries using conda or pip.

RefDSimple.py uses RefD to retrieve potential candidate concepts from DBpedia
(https://www.dbpedia.org/), a main knowledge graph on the Semantic Web.

entity_extractor.py uses TextRazor (https://www.textrazor.com/) to segment text and
extract entities from it.

config.cfg is the configuration file; you may need to change the proxy before using it.

Run entity_extractor.py to get the result, which will be in /result/prerequisite_identification

Systematic Review-Paper Crawling

Run paper_crawler.py to retrieve papers from Springer into a local file named "dataset-
byabstract", which includes titles, abstracts, and so on.

Appendix 4

README file


Figure 5.3: README page 1



Figure 5.4: README page 2



Figure 5.5: README page 3



Figure 5.6: README page 4
