Multinomial Problem Statement
Instructions:
Please share your answers filled in-line in the Word document. Submit code
separately wherever applicable.
Grading Guidelines:
1. An assignment submission is considered complete only when correct and executable code is
submitted along with documentation explaining the method and results. Failing to submit either
of these will be treated as an invalid submission and will not be considered for evaluation.
2. Assignments submitted after the deadline will affect your grades.
Grading:
● Grade A (>= 90): All assignments are submitted on or before the given deadline.
● Grade B (>= 80 and < 90):
o Assignments are submitted on time but less than 80% of the problems are completed, OR
o All assignments are submitted after the deadline.
Hints:
1. Business Problem
1.1. What is the business objective?
1.2. Are there any constraints?
2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:
2.1. Make a table as shown above and provide information about each feature, such as
its data type and its relevance to model building. If a feature is not relevant, provide reasons
and a description of the feature.
Using R and Python code, perform:
3. Data Pre-processing
3.1. Data Cleaning, Feature Engineering, etc.
3.2. Outlier treatment (a hedged Python sketch follows this list).
5. Model Building
5.1. Build the model on the scaled data (try multiple options).
5.2. Build a Multinomial Regression model.
5.3. Train and test the model and compare accuracies by confusion matrix, ROC & AUC
curves.
5.4. Briefly explain the model output in the documentation.
6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?
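A minimal Python sketch for hints 3.1, 3.2, and 5.1 (cleaning, IQR-based outlier capping, and scaling). The file name mdata.csv and the column names are assumptions for illustration; the assignment also asks for an R version, which is not shown here.

```python
# Hedged sketch: basic cleaning, IQR-based outlier capping, and scaling.
# File name and column names are assumptions, not given in the assignment.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("mdata.csv")                     # hypothetical input file
num_cols = ["read", "write", "math", "science"]   # assumed numeric features

# Drop exact duplicates and rows with a missing target
df = df.drop_duplicates().dropna(subset=["prog"])

# Outlier treatment: cap each numeric column at 1.5 * IQR (winsorizing)
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Scale the numeric features before model building
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```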
Problem Statement:
a. prog is a categorical variable indicating what type of program a student is in: “General” (1),
“Academic” (2), or “Vocational” (3).
b. ses is a categorical variable indicating the student’s socioeconomic status: “Low” (1),
“Middle” (2), and “High” (3).
c. read, write, math, and science are their scores on different tests.
d. honors: whether the student is on the honor roll or not.
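A hedged Python sketch of hints 5.2 and 5.3 applied to the variables described above: a multinomial logistic regression on prog, with a confusion matrix, train/test accuracy comparison, and a one-vs-rest AUC. The file name mdata.csv and the exact column encodings are assumptions.

```python
# Hedged sketch of a multinomial regression model on the described variables.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

df = pd.read_csv("mdata.csv")                                  # hypothetical file
df = pd.get_dummies(df, columns=["ses", "honors"], drop_first=True)
X = df.drop(columns=["prog"])
y = df["prog"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(multi_class="multinomial", max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(confusion_matrix(y_test, pred))
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, pred))

# One-vs-rest macro AUC across the three program classes
proba = model.predict_proba(X_test)
print("macro AUC:", roc_auc_score(y_test, proba, multi_class="ovr"))
```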
Abstract
Several challenges are associated with e-learning systems, the most significant of which is the lack of student motivation in
various course activities and for various course materials. In this study, we used machine learning (ML) algorithms to identify
low-engagement students in a social science course at the Open University (OU) to assess the effect of engagement on student
performance. The input variables of the study included highest education level, final results, score on the assessment, and the
number of clicks on virtual learning environment (VLE) activities, which included data plus, forming, glossary, collaborate,
content, resources, subpages, homepage, and URL during the first course assessment. The output variable was the student level
of engagement in the various activities. To predict low-engagement students, we applied several ML algorithms to the dataset.
Using these algorithms, trained models were first obtained; then, the accuracy and kappa values of the models were compared.
The results demonstrated that the J48, decision tree, JRIP, and gradient-boosted classifiers exhibited better performance in
terms of accuracy, kappa value, and recall compared to the other tested models. Based on these findings, we developed a
dashboard to assist instructors at the OU. These models can easily be incorporated into VLE systems to help instructors
evaluate student engagement during VLE courses with regard to different activities and materials and to provide additional
interventions for students in advance of their final exam. Furthermore, this study examined the relationship between student
engagement and the course assessment score.
Introduction
Web-based learning has become commonplace in education and can take many forms, from massive open online courses
(MOOCs) to virtual learning environments (VLEs) and learning management systems (LMSs). In MOOCs, students can study anytime
and from nearly any location [1]. MOOCs provide a new way to train students, change the traditional approach to studying, and
attract students from around the world. The best-known platforms are Coursera, edX, and Harvard. Additionally, MOOCs have
contributed to higher education. In MOOCs and other web-based systems, students often register to download videos and
materials but do not complete the entire course. As a result, the total number of activities a student engages in falls below the
recommended threshold. Therefore, teachers must understand the engagement of their students.
In the traditional approach to education, teachers take various steps to appraise students’ levels of performance, motivation,
and engagement [4], such as conducting exams, checking student attendance, and monitoring studying via security cameras.
However, in web-based platforms, there are no face-to-face meetings, and it is difficult to determine student engagement levels
in online activities such as participating in discussion forums or watching videos. Therefore, in web-based systems, student data
represent the only source through which instructors can assess student performance and engagement.
Due to the absence of face-to-face meetings, web-based systems face some challenges that need to be addressed. The first and
most important is course drop out. In web-based systems, dropping out is the principal problem that research has attempted to
solve. In web-based systems, 78% of students fail to complete their courses [5]. The main reason students drop a MOOC
course is the lack of student engagement, and the second most common reason is their inability to locate the requisite activities
and materials for the next assessment.
An important element in reducing student dropout rates in a virtual learning environment (VLE) is to understand the
engagement of students in meaningful activities. As student participation in course activities increases, the experiences become
more engaging, and the probability of a student achieving a high assessment score and completing the e-learning course
increases.
The students of the OU are generally divided into groups, and an instructor is assigned to each group. The instructor can guide
these student groups through courses, for example, by answering their questions and grading their assignments. Additionally,
the OU can use various types of intervention to support and motivate weaker students, e.g., through e-mail and face-to-face
meetings [11]. However, the sheer number of students in the OU makes it increasingly difficult for the university to engage
students in its courses via face-to-face meetings. Moreover, the number of instructors is limited, and it is not possible to contact
all students in all courses. Therefore, an intelligent data system that predicts student engagement by analyzing logged student
data is needed.
In web-based learning systems, a student’s degree of engagement in educational learning is lower than that in traditional
education systems [17]. Access to online VLE activities is used as a measurement of student engagement. Because the course
involves web-based learning, often no face-to-face interaction occurs between students and the instructor. In web-based
systems, it is difficult to measure a student’s engagement using traditional methodologies (e.g., metrics such as class
attendance, participation in discussions, and grades) [18, 19], because many of these predictors are not directly available in e-
learning systems. Therefore, investigating students’ engagement in web-based learning is a challenging task [20].
To accomplish our goals, we developed a predictive analytic model utilizing machine learning (ML) algorithms. The most
appropriate ML predictive model was selected for analyzing student interactions in VLE learning activities and determining
students’ levels of engagement in VLE courses given that a lack of student engagement results in a high dropout rate [15].
Predictive models are currently used in many educational institutions [21]. A predictive model can help instructors guide
students in succeeding in a course and be used to determine which activities and materials are more important to the course
assessment. Such models also enable instructors to engage students in different activities through the VLE, thereby encouraging
greater participation.
Our models can easily be integrated into VLE systems and can enable teachers to identify low-engagement students through
different assessments, the use of different course materials, and the number of times VLE activities (e.g., data plus, forming,
glossary, resources, URL, homepage, collaborate, and subpages) are accessed. Teachers can also spend more time on
assessments and materials that are difficult for a particular group of students, enabling them to discover why an assessment is
easy or difficult and providing supplementary intervention to students who need it.
A predictive system enables an instructor to automatically identify low-engagement students during a course based on activities
from that online course. Given such detection, the instructor can then motivate (e.g., send an e-mail reminder) or identify
difficulties during the course [22]. When a student receives an advisory e-mail from an instructor (i.e., an e-mail asking about
any difficulties) on a weekly basis, the student is more likely to work hard and increase their engagement. Such communication is
important because it helps assess student workloads and address issues at an early stage of the course [23]. Apt advice will also
improve student retention and decrease the course dropout rate.
Acquiring feedback after redesigning a course and its related materials is a challenge for instructors in an e-learning system. With
a predictive model of student progress, the instructor can redesign a course and its materials more effectively, and the findings
can be used to improve the course and materials and to increase student engagement levels. Furthermore, through e-learning
systems, teachers receive feedback on the courses they teach, and this feedback focuses on the difficulty level, workload,
and illustrative richness of the material.
Tracking student engagement in different educational learning activities encourages high-quality learning, and comprehensive
analysis of student engagement can help to minimize course dropout rates.
The main tasks of learning analytics in education are to collect data, analyze these data and provide appropriate suggestions
and feedback to students to improve their learning [25, 26]. With the help of predictive analytics, an instructor can also discover
what students are doing with the learning material and how a student’s assessment scores are related to that student’s
engagement level [27].
The cognitive ability of computers in some fields is still below that of humans, but due to ML algorithms, computer abilities are
increasing quickly in domains such as e-learning, recommendation, pattern recognition, image processing, medical diagnosis,
and many others. ML algorithms are trained using sample data as inputs and then tested with new data [28].
Instructors can use ML algorithms to obtain student-related information in real time, which helps them intervene during early
course stages [29, 30]. ML is often used to build predictive models from student data; ML techniques can address both
numerical and categorical predictor variables. Decision trees (DTs) are often used to construct trees and find predictive rules
based on available data [31].
We used six types of ML classifiers (decision trees, JRIP, J48, gradient-boosted trees (GBT), classification and regression tree
(CART), and a Naive Bayes classifier (NBC)) to build predictive (learning analytic) models that predict student engagement in
different courses. These classifiers were selected because they accept both numeric and categorical attributes as inputs. The
algorithms perform well on noisy data and are unaffected by nonlinear relationships between variables. They are white-box
models whose results are easily interpreted.
In the current study, we used behavioral features (student features related to interaction with the VLE) to predict
low-engagement students in an e-learning system. These features are readily available in almost every web-based system.
Additionally, these features predict student engagement in a manner closer to that for a real-world task (a traditional learning
environment) [33].
The classifier inputs consisted of student e-learning activity data from the logs of a VLE system. After examining these data, we
concluded the following: (1) the J48, DT, JRIP, and GBT classifiers were appropriate for predicting low-engagement students in
the VLE course; (2) the number of student logins to forming (discussion forums), content (OU course materials), and subpage
activities were strongly related to student engagement in the VLE system; and (3) highly engaged students achieved better
results on course assessments than did low-engagement students. Furthermore, the results indicated that students
who had lower engagement in courses achieved lower scores and participated in fewer course activities.
Question 1: Can we model the student engagement in different course activities by utilizing ML algorithms, and if so, which ML
classifier offers optimal performance in predicting student engagement in the VLE course?
Question 2: Is it possible to identify the activities and conditions that are most important in web-based learning for predicting
student engagement?
Question 3: How is a student’s engagement in different VLE activities associated with that student’s final score on an
assessment?
The problem is described in Section 2. Related work is discussed in Section 3. Details about the materials and methods are
presented in Section 4. Section 5 describes and discusses the experimental results. Section 6 provides conclusions and outlines
future work.
The VLE contains the study material for each course, and each student’s clicks per day are recorded in the VLE logs. The study
material in the VLE is delivered via hypertext markup language (HTML), PDFs, and lecture format. The OU records activity and
demographic data when a student uses the OU VLE system. The activity variables capture the type of communication through
which the student is engaged with the VLE, and the activity types include data plus, forming, glossary, collaborate, content,
resource, subpage, homepage, and URL. The demographic data include the student’s performance records. The instructor can
use these data to monitor the student’s engagement in different VLE activities.
In web-based systems, each group of students is supported by a specific instructor who can guide the students and provide
feedback throughout the course. However, the resources for teacher-student interactions in the VLE are limited. As the number
of students increases, it becomes more difficult for the OU staff to provide real-time support to all students.
The problem addressed in this paper involves reducing the dropout rate of students by identifying low-engagement students in
the first course assessment stage, based on where students invested their time differentially and the activities they engaged in
while completing the course assessments.
In the training set $D = \{(x_i, y_i)\}$, each $x_i$ is an $n$-dimensional input vector that contains the input features.
These features include the number of clicks on the VLE activities up to the student’s completion of the first course assessment.
$m$ represents the number of students in the first assessment ($i = 1, \dots, m$); $y$ is the vector of the target class that determines the class of the
input vector $x_i$, with $y_i \in \{0, 1\}$. The result is assumed to be an indicator of engagement. When a student’s level of engagement in the
course is high through the first assessment, $y_i$ is set to 1, and if the student’s engagement level is low through the first
assessment, then $y_i$ is set to 0 (see Materials and Methods for the definition of engagement). The proposed function to classify
student engagement is $f : X \to Y$.
Let $f$ be a classifier. We trained each classifier on the features $X$. The training set used to train each classifier was a dyad $(x_i, y_i)$, where
$x_i$ denotes the historical record of the features and $y_i$ is the class of feature vector $x_i$ [35]. After training, we tested the classifiers using the
test dataset, and the results are shown in Section 5.
Related Work
Considerable research has been conducted to investigate student engagement in web-based learning systems and traditional
educational systems. Such research has used different techniques and input features to investigate the relationship between
student data and student engagement. For example, Guo et al. [37] studied student engagement while students watched
videos. This study’s input features were based on the time spent watching the video and the number of times the student
responded to assessments. The study concluded that short videos engaged students to a greater degree than did prerecorded
lectures. Bonfanti et al. [38] used qualitative analysis and a statistical model (stepwise binomial logistic regression) to
investigate student engagement in an MOOC discussion forum and while watching videos and related this engagement to
student achievement. They used the number of posts submitted to a discussion forum, the number of videos watched, and post
content review to study student engagement. The results indicated that the number of posts submitted in a discussion forum
and the number of videos watched during a course were positively related to student achievement in the MOOC. Ramesh et al.
[39] studied the engagement of MOOC students using a probabilistic model called probabilistic soft logic based on student
behavior. Ramesh et al. [40] predicted student engagement/disengagement using student posts in a discussion forum. Beer [18]
applied statistical methods to predict student engagement in a web-based learning environment and concluded that variables
such as course design, teacher participation, class size, student gender, and student age need to be controlled for when
assessing student engagement. Manwaring et al. [41] conducted a study to understand student engagement in higher education
blended-learning classrooms. The study used a cross-lagged modeling technique and found that course design and student
perception variables greatly affected student engagement in the course. Mutahir et al. [4] conducted a study to investigate the
relationship between a student’s final score and the student’s engagement in material using a statistical technique and found
that students who had high levels of engagement for quizzes and materials earned higher grades on the final exam. Aguiar et al.
[42] developed an early-warning system using engagement-related input features and found that these variables are highly
predictive of studentretention problems. Thomas and Jayagopi [43] measured student engagement using an ML algorithm
based on students’ facial expressions, head poses, and eye gazes. Their results showed that ML algorithms performed well at
predicting student engagement in class. Atherton et al. [44] found a correlation between the use of course materials and
student scores; students who accessed course content more often achieved better results on their exams and assessments.
Bosch [45] studied the automatic detection of student cognitive engagement using a face-based approach.
Some previous studies have also investigated student engagement using log data [46, 47]. In recent years, researchers have
investigated the effects of academic self-efficacy, teaching presence, and perceived ease of use on student engagement in
MOOCs using statistical techniques [7]. Ding et al. [48] studied the effect of gamification on student engagement in online
discussion forums. Wells et al. [49] studied student engagement and performance data using the LMS platform and concluded
that student engagement increased as the exam approached. Additionally, they found a positive correlation between student
performance and student engagement. Pardo et al. [50] revealed that student interactions with online learning activities have a
significant impact on student exam scores. Other studies have found that student engagement is only weakly correlated with
student performance in online knowledge surveys [51]. Hamid et al. [52] measured student engagement using an ML approach
and concluded that the support vector machine (SVM) and the K-nearest neighbor (K-NN) classifiers are appropriate for
predicting student engagement. Bote-Lorenzo and Gomez-Sanchez [53] predicted decreases in student engagement in an MOOC
using an ML approach. Holmes [9] found that continuous assessment increased student engagement.
Several studies have shown that course outcomes are positively correlated with student engagement [54, 55]. For example,
Atherton et al. [44] showed that students who access web-based system study materials daily and undergo regular assessments
achieve high exam scores. Other research results show that high-engagement students tend to earn good grades on course
quizzes and assessments [4]. Rodgers [56] found that student interactions with an e-learning system were significantly
correlated with course outcomes.
However, most of the previous work on engagement has focused on traditional education in universities and schools but has
neglected student engagement in web-based systems. Additionally, the previous work related to student engagement has been
based on statistical analysis, survey, and qualitative methods; however, these statistical approaches cannot reveal the hidden
knowledge in student data. Moreover, statistics-based and qualitative methods are not easily generalized, nor are they scalable.
Surveys are not a good option for measuring student engagement; for example, younger children cannot understand the
questions, and completing the surveys requires a large amount of time. Another downside of these studies is that they are
based on student behaviors and emotions, as well as the course design; however, student engagement can also depend on
student participation in learning activities.
Figure 1
The DT is trained with a training set containing tuples. Finally, the DT is used to classify a dataset with unknown class labels [57].
DTs are primarily used to process information for decision-making [58].
The tree is constructed from the dataset by determining which attributes best split the input features at the child nodes. In this
case, we used the concept of information gain which is dependent on information theory. When a node has minimum entropy
(highest information gain), that node is used as a split node [59]. A DT is important when a study seeks to determine which
features are important in a student prediction model [60]. The rules for DTs are easy to understand and interpret, and we know
exactly which features lead to a decision.
J48
A J48 decision tree belongs to the DT family; it both produces rules and creates the tree from a dataset. The J48 algorithm is an
improved version of the C4.5 algorithm [61]. It is a sample-predictive ML model that predicts the target values of an unseen
database based on the different values of input features in the current dataset. The rules of this approach are easily
interpreted. Moreover, this method is an implementation of the ID3 (Iterative Dichotomiser 3) algorithm and is a supervised ML
algorithm used primarily for classification problems. The internal nodes of a J48 decision tree represent the input features
(attributes), and the branches of the tree represent the possible values of the input features in the new dataset. Finally, the
terminal nodes (leaves) display the final values of target variables [62]. The attribute-selection process is based on the
information gain method (gain ratio) [63]. The J48 decision tree works for both numeric and categorical variables; moreover, it
determines the variables that are best at splitting the dataset [30]. The attribute with the highest gain ratio reflects the best
split point.
Data Description
The present study examined data from a module (lesson) of a social science course attended by OU students working via the VLE
system that addressed a particular topic in a given course block [68]. The VLE system provides different course topics in
different course blocks. This VLE delivers various courses to students, and students can enroll in the courses from different
locations [6].
The number of students enrolled in the social science course for the July 2013 session was 384. We used only the July 2013
student records (384 students) that applied to the period through the first assessment from the social science course data.
Based on the first assessment scores, the instructor can determine the low-engagement students at an early point in the course.
We extracted three types of data: demographic (highest education level), performance (final results and score on the
assessment) and learning behavior (number of clicks on VLE activities) data. The behavioral data included the number of clicks
on activity types such as data plus, forming, glossary, collaborate, content, resources, subpage, homepage, and URL as well as
the number of times each student accessed VLE activities through the first course assessment.
One problem is that the selected attributes are stored in different tables (student info, student assessment, assessments,
student VLE, courses, and VLE) in the OU data, as shown in Figure 2. The student info table contains the students’ demographic
information and the results of each course [68]. The course table contains information about the courses in which students are
enrolled [68]. The registration table contains student record timestamps and course enrollment dates [68]. Assessment
information is recorded in the assessment table [68]. The student-assessment table contains the assessment results for
different students [68]. The interaction information of different students regarding different materials and activities is stored in
the student-VLE and VLE tables [68]. The VLE interaction data consist of the numbers of clicks students made while studying the
course material in the VLE. Each course activity is identified with a label, for example, data plus, forming, content, etc.
Figure 2
We transferred the data into an ML-compatible layout in which each row index was a student ID, and each column index was a
student’s feature. Thus, each attribute was related to the first assessment in the social science course.
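A minimal sketch of this reshaping step, assuming the raw VLE log has one row per student click with columns id_student, activity_type, and sum_click (names assumed; they are not stated in the text).

```python
# Sketch: pivot the click log into an ML-compatible layout
# (one row per student, one column per activity type, values = total clicks).
import pandas as pd

logs = pd.read_csv("student_vle_first_assessment.csv")   # hypothetical file

features = (logs.pivot_table(index="id_student",
                             columns="activity_type",
                             values="sum_click",
                             aggfunc="sum")
                .fillna(0))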
Before developing the predictive models, we established a label or definition of engagement; thus, the total number of times a
student accessed the VLE activities (total number of clicks on VLE activities) was assumed to be an indicator of engagement. The
detection of student engagement merely by observing student behavior is challenging [69] because students can sometimes
appear to be busy but fail to complete any learning tasks. Moreover, the prediction of student engagement in VLE courses by
simply counting clicks during VLE activities is difficult because students sometimes click on other, unimportant activities, such as
Facebook and Twitter. Additionally, in some cases, students spend little time in the VLE but achieve a high score on the course
assessment. Therefore, the total number of clicks on the VLE is insufficient for measuring student engagement in the VLE course
[9]. Instead, the criterion for measuring engagement in the current study was jointly based on four variables: the first
assessment score (score on the assessment), the student education degree before registering for the course (highest education
level), the final exam result after completing the course (final results), and the total number of clicks on VLE activities. To discern
the impacts of these four variables on student engagement, we conducted statistical analyses (Spearman’s correlation) using
SPSS to assess the relationship between the total number of clicks on VLE activities and the highest education level, score on the
assessment, and final results in the course for each student, based on a significance level of 0.05. The results are shown in Figure 3.
Figure 3
The correlation coefficient between the dependent variable (total number of clicks on VLE activities) and independent variables
(score on the assessment, final results, and education level) of the students. Note: score on the assessment (score); highest
education level (highest education)
Figure 3 illustrates that two variables (final results and score on the assessment) were significantly correlated with the
dependent variable (total number of clicks on VLE activities). Therefore, we define engagement through three variables (final
results, score on the assessment, and total number of clicks on VLE activities). The details of the above variables are given later
in the paper (see the section “Predictors that affect student engagement in the Web-based system”).
Figure 3 shows that student engagement is related to the final results, score on the assessment and total number of clicks on
VLE activities for each student, based on the large R-values of these variables. The most highly engaged students achieved
higher scores and better results on the exam. The highest education level variable was omitted because it had a low r value that
was statistically nonsignificant. Therefore, we define engagement as follows: the engagement level E (high or low) on the
assessment is obtained by combining three conditions with the “OR” (∨) and “AND” (∧) operators: S, the student achieved an
excellent score on the first assessment (score on the assessment ≥ 90%); Q, the student is qualified (final results = Pass); and A,
the student is active during the course (total number of clicks on VLE activities ≥ average clicks of students). After establishing
the engagement label, all the training data were labeled using the engagement rules presented above.
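A labelling sketch under stated assumptions. The exact AND/OR combination of the three conditions is not fully recoverable from the text, so the code uses one plausible reading (high when S holds, or when Q and A both hold); adjust to the intended rule. File and column names are assumptions.

```python
# Hedged engagement-labelling sketch; the exact operator combination is assumed.
import pandas as pd

def engagement_label(row, avg_clicks):
    s = row["score"] >= 90                     # excellent first-assessment score
    q = row["final_result"] == "Pass"          # qualified at the end of the course
    a = row["total_clicks"] >= avg_clicks      # active during the course
    return "high" if s or (q and a) else "low"

students = pd.read_csv("students_first_assessment.csv")   # hypothetical file
avg_clicks = students["total_clicks"].mean()
students["engagement"] = students.apply(engagement_label, axis=1,
                                        avg_clicks=avg_clicks)
```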
Student engagement can be measured using different methods, such as questionnaires, external observers, and physiological
measures; however, these methods can disturb students, and they are not scalable [43]. Furthermore, according
to prior research, measuring student engagement using total clicks during a course does not guarantee that students are highly
engaged.
Feature Selection
In this study, we predicted low-engagement students using the number of clicks on the VLE activities; therefore, we consider
only the activities-related features (i.e., number of logins to data plus, forming, glossary, collaborate, content, resources,
subpage, homepage, and URL). Learning is a process that occurs when students interact with course materials and receive
instruction [70]. These features describe how students participate while taking the VLE course.
Missing Values
Some values were missing from the dataset. We substituted zero values for these missing data and interpreted the zeros as
indicating that the student did not log in to those activities through the first assessment.
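A short sketch of the zero-substitution described above; the file and activity column names are assumptions.

```python
# Sketch: a missing click count is replaced with zero, i.e. the student never
# opened that activity before the first assessment (column names assumed).
import pandas as pd

features = pd.read_csv("student_activity_features.csv")   # hypothetical file
activity_cols = ["dataplus", "forming", "glossary", "collaborate", "content",
                 "resources", "subpage", "homepage", "url"]
features[activity_cols] = features[activity_cols].fillna(0)
```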
Predictors that Affect Student Engagement in Web-Based Systems
In web-based systems, the most important predictors are the student activities and the materials used before the course
assessment. Further details on the predictors used in the current study are given below.
Activity Type
Students engaged in a range of VLE activities, namely, forming, data plus, glossary, collaborate, content, resources, subpage,
homepage, and URL, while completing the course assessment. These activities provide important information for predictive
analysis. The number of times each student clicks on each of the activities is recorded daily in a time-stamped log file that
indicates the time the student spent on each activity. The forming variable references the discussion forum, where students can
discuss problems with each other. The forum is also a space where students can submit questions to better understand the
subject [60]. Resources consist of lecture notes, books, lecture slides, and other course materials in HTML and PDF formats [60].
The content variable contains study materials in HTML format related to the specific course studied. The subpage variable
reveals the student’s navigation path through the VLE structure [60]. The homepage variable reflects the first screen of every
course; these screens are visited by a student before accessing other course material. The glossary includes details about the OU
and higher education acronyms. The data plus variable references a module developed by the OU that allows students to see
their own records that have been stored in the database. Additionally, this module produces a SQLite3
database that is both customizable and portable for web-based systems. Furthermore, students can easily export the OU database.
Student ID
This variable is a unique identification number for a student in the OU records.
Final Results
This variable represents the student’s final exam results after completing the course, and the possible values are pass or fail.
This important variable reflects student effort in the course.
All the aforementioned variables are associated with student engagement at the OU. However, learning is a complex process
that is also affected by other factors, such as teacher participation in discussion forums, course design, class size, teacher’s
experience, teacher’s conception of learning, teaching styles, and other factors [18, 72].
We constructed an Excel file from the training data and uploaded it to Rapid Miner. Rapid Miner includes the entire visualization
module and predictive module of the decision tree. Therefore, we could easily construct the decision tree algorithms and the
NBC from these data.
In the training phase, we supplied the inputs and the corresponding data classes to the ML classifier to allow the classifiers to
discover the patterns between the input and output [73]. Finally, the trained models used these patterns to classify unseen data
[73].
We used a 10-fold cross-validation method to train and test the current student models. Cross-validation is primarily utilized to
assess model performance [74]. In k-fold cross-validation, the data are divided into k different subsets, the model is trained
using k-1 subsets, and the remaining subset is used for testing. The average performance obtained from this method provides a
good estimation of model performance [74].
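A minimal sketch of the 10-fold cross-validation described here, reporting the accuracy, recall, and kappa metrics defined in the next section. A scikit-learn decision tree stands in for the Rapid Miner classifiers, and the file and column names are assumptions.

```python
# Sketch: 10-fold cross-validation with accuracy, recall, and Cohen's kappa.
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, cohen_kappa_score

data = pd.read_csv("student_activity_features.csv")   # hypothetical labelled file
X = data.drop(columns=["engagement"])
y = (data["engagement"] == "low").astype(int)          # 1 = low engagement (positive class)

scoring = {
    "accuracy": "accuracy",
    "recall": "recall",                 # fraction of low-engagement students found
    "kappa": make_scorer(cohen_kappa_score),
}

clf = DecisionTreeClassifier(random_state=42)
cv_results = cross_validate(clf, X, y, cv=10, scoring=scoring)
for metric in scoring:
    print(metric, cv_results["test_" + metric].mean())
```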
Performance Metrics
After training the classifiers in the current study, we assessed the performance of the learning models using previously unseen
data. We obtained the prediction results for the models with the test data and counted the number of true positives, true
negatives, false positives, and false negatives that were used to evaluate performance. Through this process, we obtained the
numbers of true positives (low engagement) and true negatives (high engagement), as well as the number of false positives and
false negatives.
Our main goal in this study was to minimize the false-negative rate (i.e., the number of low-engagement students incorrectly
identified as high-engagement students). Therefore, we selected the model with the highest recall [75]. We used the following
performance metrics to measure the quality of the ML model predictions.
Accuracy
The first metric was accuracy, which is the proportion of students whose engagement level (low or high) was correctly predicted
during the course [75, 76].
Recall
Next, we calculated the recall, which indicates the fraction of all the students in the dataset who have low-engagement and who
the classifier correctly recognized as having low engagement [75, 76]. An ML model with a high recall is considered to have
satisfactory performance.
Kappa
When a model has a kappa value of zero, its performance is poor; in contrast, a value near 1 indicates that the model
achieves good performance [30].
Experiments and Results
In this part of the study, we predicted the numbers of low-engagement students from the different activities of a VLE course
using features related to student activity. To answer the research questions of the current study, we performed several
experiments. We used the ML algorithms and the Rapid Miner tool to build the learning models, as described below.
We visualized the input variables (student clicks on VLE activities) of the OU course to illustrate how important the input
variables are in predicting low-engagement students at the first assessment point of a social science course. The results were
used to better understand the student data [26].
We visualized the number of clicks per activity, for example, forming, content, homepage, etc. Figure 4 presents the number of
clicks per activity, which indicates how much time the students spent on each activity.
In the second step, we determined how much the input features of the current study were correlated with the output.
Therefore, before applying the ML algorithms, we conducted a statistical analysis (Spearman correlation) to determine the
significance between the dependent variable of the study (level of engagement) and the independent variables (score on the
assessment and the number of clicks on VLE activities, namely, data plus, forming, glossary, collaborate, content, resource,
subpage, homepage, and URL). A Spearman correlation is appropriate for both continuous and discrete features [78].
After conducting the Spearman correlation analysis, each independent variable in the current study received a correlation
coefficient (r) that reflected the strength and direction of the linear relationship between the tested pair of variables [79]. The
results are shown in Table 1.
Correlation analysis and descriptive statistics for the activities and the students’ level of engagement.
The statistical results show that student clicks on forming, content, subpage, and URL were moderately correlated with the level
of engagement in VLE activities, whereas the number of student clicks on resources and collaborate were weakly correlated.
Moreover, the number of clicks on the homepage was highly correlated with the level of engagement. Table 1 shows that the
number of clicks on glossary and data plus were unrelated to the student level of engagement in VLE activities [80].
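A sketch of how per-activity Spearman correlations such as those summarized in Table 1 could be computed with scipy; the file and column names are assumptions.

```python
# Sketch: Spearman correlation between each activity's click count and engagement.
import pandas as pd
from scipy.stats import spearmanr

data = pd.read_csv("student_activity_features.csv")     # hypothetical file
target = (data["engagement"] == "high").astype(int)
for col in ["dataplus", "forming", "glossary", "collaborate", "content",
            "resources", "subpage", "homepage", "url", "score"]:
    r, p = spearmanr(data[col], target)
    print(f"{col}: r={r:.3f}, p={p:.4f}")
```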
Although some of the selected predictor variables were not significant, we included all the predictor variables in our
experiment, following the advice of Luan and Zho [81], who note that nonsignificant variables can be important in some
records [82].
Table 1 indicates that the seven independent variables, namely, the number of clicks on forming, content, subpage, collaborate,
resources, homepage, and URL, were significant (p values < 0.05) with respect to the dependent variable. These independent
variables are meaningful and were used in subsequent experiments. However, this analytical statistic does not reveal the
hidden information in the data [76].
Because most of the input variables in the current study were significant predictors of student engagement, there was room for
further application of the ML algorithms.
Question 1. Can we model the student engagement in different course activities by utilizing ML algorithms, and if so, which ML
classifier offers optimal performance in predicting student engagement in the VLE course?
To explore this question, we determined the best ML algorithms to predict low-engagement students and performed the first
experiment. In this experiment, the input features were the students’ clicks on VLE activities (data plus, forming, glossary,
content, resource, subpage, homepage, and URL) in a VLE course, and the target variable was the students’ level of
engagement.
To estimate how the classifier could generalize the unseen data, we divided the data using a 10-fold cross-validation method.
We followed this procedure to determine student engagement as predicted by each ML model.
In the first experiment, we used the DT, J48, JRIP, GBT, CART, and NBC algorithms to predict student engagement using the
student interaction data. We used the Rapid Miner tool to build the ML models and determined the accuracy of each algorithm
using 10-fold cross-validation.
The DT is a supervised ML algorithm that is simple and easy to understand. We used the following optimum parameters to train
the DT: criterion = gain ratio, maximal depth = 20, confidence = 0.25, minimal gain = 0.1, and minimal leaf size = 2. Finally, we
obtained an accuracy of 85.91% after applying 10-fold cross-validation.
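For readers working outside Rapid Miner, a rough scikit-learn analogue of these settings is sketched below; the parameters do not map one-to-one (scikit-learn offers no gain-ratio criterion or confidence-based pruning), so this is an approximation, not the model used in the study.

```python
# Approximate scikit-learn analogue of the quoted Rapid Miner DT settings.
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    criterion="entropy",        # information-gain-based splitting (no gain ratio available)
    max_depth=20,               # ~ maximal depth = 20
    min_samples_leaf=2,         # ~ minimal leaf size = 2
    ccp_alpha=0.0,              # post-pruning strength; tune in place of confidence/minimal gain
    random_state=42,
)
```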
Table 2 shows that the DT predictive model correctly classified 283 of 303 low-engagement students; therefore, its true-positive
rate (sensitivity or recall) was 0.9340, with a false-positive rate of 0.425.
Table 3 shows that the J48 classifier correctly classified 289 of 305 low-engagement students. The true-positive rate (sensitivity
or recall) of this model was 0.947, and its false-positive rate was 0.358. Finally, the accuracy of the classifier was 88.52%. The
results are shown in Table 3.
Confusion matrix of the J48 model when predicting two classes of students’ engagement.
In the third phase of the first experiment, we built a JRIP decision tree model using Rapid Miner. The default parameters for the
JRIP models during the training stage were set as follows: F (number of folds per REP) = 3.0, N (minimal weights of instances
with a split) = 2; O (number of optimizations runs) = 2.0, and S (seed value used for data randomization) = 1.0.
Table 4 shows that the JRIP decision tree correctly classified 227 of 243 low-engagement students. The true-positive rate of the
JRIP model was 0.934, the false-positive rate was 0.342 and the accuracy was 83.27%.
Confusion matrix of the JRIP model when predicting two classes of students’ engagement.
In the fourth phase of the first experiment, we built a GBT model to predict low-engagement students in the VLE course. The
default parameters of the GBT were as follows: number of trees = 20; maximal depth = 5; min rows = 10.0; and number of
bins = 20. The GBT correctly classified 294 of 323 low-engagement students. The true-positive rate (recall or sensitivity) of the
GBT was 0.910, and the false-positive rate was 0.383. Moreover, this model achieved high accuracy (86.43%) based on the
default parameters. The results are shown in Table 5.
Confusion matrix of the gradient-boosted tree (GBT) model when predicting two classes of students’ engagement.
In the fifth phase of the first experiment, we developed a CART model using the OU student data. Table 6 shows that the CART
model correctly classified 235 of 263 low-engagement students. The default parameters used in this model were as follows: S
(random number of seeds) = 1; M (the minimum number of instances at the terminal nodes) = 2.0; and N (number of folds used
in the minimal cost-complexity pruning) = 5.0. The true-positive rate (recall or sensitivity) of the CART model was 0.893, the
false-positive rate was 0.333, and the accuracy was 82.25%.
Confusion matrix of the classification and regression tree (CART) model when predicting two classes of students’ engagement.
We applied the NB (kernel) classifier to our data to calculate the probabilities associated with high- and low-engagement
students. We implemented the NBC using kernel density estimation and obtained good performance with the following
parameters: estimation mode = greedy; minimum bandwidth = 0.1; and number of kernels = 10. The true-positive rate (recall
and sensitivity) of the NBC was 0.900, the false-positive rate was 0.50, and the accuracy was 82.93%. The results are listed in
Table 7.
Confusion matrix of the Naive Bayes (Kernel) classifier when predicting two classes of students’ engagement.
A comparison of the results of the six models (Table 8 and Figure 5) showed that J48, GBT, DT, and JRIP predicted
low-engagement students with high accuracy (88.52%, 86.43%, 85.91%, and 83.27%, respectively) based on student
clicks on different activities.
Accuracy, kappa, and recall of the ML models used in the current study.
Visualization of the accuracy, kappa, and recall of the ML models used in the current study.
When data are unbalanced (i.e., when the number of records of one class is less than that of another), accuracy alone does not
always indicate that a classifier has achieved good performance in predicting low-engagement students (the unbalanced
problem) [43]; therefore, we checked the recall (sensitivity), ROC, and kappa values of the classifiers.
To identify low-engagement students in the OU course, recall is paramount, but if we want to identify high-engagement
students (those with a larger number of clicks on activities), then accuracy is important. In this study, our goal was to identify
low-engagement students; therefore, we focused on recall.
In our model, the recall results reflect how many of the low-engagement students were correctly identified as low-engagement
out of the total number of low-engagement students in the dataset. Given such identification, teachers can give feedback and
send warning messages to those students who may need to work harder.
In the first experiment, we also compared the performances of the learning algorithms based on the ROC curves. We computed
the AUC value of each classifier, which ranged from 0.5 to 0.8. The ROC curves of our models represent the probability that
low-engagement students in the sample are correctly identified as low. An AUC close to 1 means that the classifier performs
well, whereas a low AUC indicates that the classifier performs poorly. Figure 6 shows that the
J48, JRIP, GBT, and DT classifiers achieved better AUC values than did the other algorithms and thus performed better. The ROC
curves of the other models indicate that these models achieved inferior performance for the studied dataset.
The experimental results show that J48, JRIP, GBT, and DT are appropriate algorithms for identifying low-engagement students
in an OU course. When low-engagement students can be identified, teachers can utilize these models to alert low-engagement
students in advance and learn more about the low-engagement students. The NBC and CART classifiers were less accurate than
the GBT, J48, JRIP, and DT classifiers and did not perform as well in predicting low-engagement students during a VLE course.
Figure 5 and Table 8 also show that low-engagement students can be predicted with reasonable accuracy and recall based on
the number of clicks on VLE activities prior to the first assessment.
Question 2. Is it possible to identify the activities and conditions that are most important in web-based learning for predicting
student engagement?
To understand which VLE activities are important for student engagement prediction, we explored the second question by
building a decision tree using the DT classifier because the explanations generated through decision trees are easily generalized
and understandable [83].
We applied the DT to discover more details about the students. Figure 7 shows the DTs constructed from the VLE dataset.
Several interesting observations can be drawn from building the DT classifier.
Second, from Figure 7, we conclude that forming and content are the most important types of activities for predicting low
engagement in the social science course. The number of messages posted and replied to in the discussion forum (forming) may
also be related to the engagement of students in the course. Additionally, instructor involvement in the discussion forum could
further improve student engagement [18].
Third, Figure 7 further demonstrates that highly engaged students accessed more OU course-related content (content) during
the course and had a higher level of participation in the discussion forum; therefore, they tended to interact more with other
students. In contrast, low-engagement students clicked less in discussion forums and on content.
Fourth, Figure 7 shows that the forming, content, and subpage activities have a deleterious effect when a student has low-level
engagement. Figure 7 shows that when a student’s participation in activities such as forming, subpage, and content is low, the
student’s engagement level will also be low.
The features selected by our models provide good information that instructors can use to provide confident interventions for
their students before the end of a course. According to prior research, student retention can also be improved by improving the
level of student interaction in the course [30, 84, 85]. The DT generated by the DT algorithm shows how the activities of an OU
course can be used to predict student engagement, and these findings also reflect the confidence level of each student.
These rules show that student clicks on homepage, content, and forming are significantly related to student engagement.
Moreover, the rules can be interpreted as follows.
When a student spends more time in the discussion forum (forming), engagement in course activities is high. Additionally,
high-assessment-score students spent more time creating new content and pages via which their assessments were submitted
and used subject material more frequently to clarify concepts than did low-score students.
When a student’s number of clicks on the discussion forum (forming) is ≥ 145, OR on the homepage is ≥ 68, OR on
content is ≥ 147, then the student’s engagement level in the OU course is high.
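This rule can be encoded directly as a small predicate (thresholds taken from the text; the function and argument names are illustrative):

```python
# Sketch of the extracted rule as a simple predicate.
def high_engagement(forming_clicks: int, homepage_clicks: int,
                    content_clicks: int) -> bool:
    """High engagement if forming >= 145 OR homepage >= 68 OR content >= 147."""
    return forming_clicks >= 145 or homepage_clicks >= 68 or content_clicks >= 147

print(high_engagement(150, 10, 20))   # True: heavy forum use alone satisfies the rule
```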
Finally, experiments 1 and 2 showed that JRIP, J48, DT, and GBT are appropriate ML models for predicting low-engagement
students in the VLE course. Additionally, the results showed that four features strongly affect student engagement and course
outcomes, namely, the number of clicks on forming, content, homepage, and subpage. The rule sets selected through this
analysis can help instructors better design their course content and identify low-engagement students at an early course stage.
The performances of the GBT, J48, DT, and JRIP models, which were sufficient for practical applications, were good for the
following reasons.
(1) GBT, J48, DT, and JRIP can be effectively applied for both numerical and categorical attributes [86]. Additionally, these
models are white box models, and their results are easily interpreted [87].
(2) The performance of JRIP is good in the presence of noise [66], and it produces rules that can be easily understood by
teachers.
(3) DTs can handle more predictor variables than can logistic regression and other models [82], and they can find complex
and nonlinear relations between dependent and independent variables [82].
(4) J48 can handle missing values and unknown levels, and it provides good stability and easily understandable results
[61]. J48 can also handle both discrete and continuous values [88].
(5) DTs can produce rules that are easily handled and understood by humans. Additionally, DTs perform rapid
classification of unknown data, without requiring complex computations [61].
Question 3: How is a student’s engagement in different VLE activities associated with that student’s final score on an
assessment?
To address this question, we determined how student engagement is related to the first assessment scores by conducting a third
experiment in which we analyzed student data related to student engagement in VLE activities and the first assessment score in
the VLE course. According to previous work, student engagement has a positive effect on grades, test scores, retention, and
graduation [89].
The Spearman correlation (r) between student assessment scores and student engagement was 0.351 (Table 1), which indicates
that engagement is positively correlated with assessment scores.
Figure 8 shows the relationship between student assessment scores and engagement level for the training data. Figure 8 further
indicates that students with high assessment scores in the training data show high engagement. Therefore, we applied the J48
decision tree to study these relationships in the test data. We split the dataset into two portions, training and testing, with
allocations of 75% and 25%, respectively. We trained the J48 classifier on training data and tested it on unseen data (testing
data). In other words, we compared the student’s first assessment score with the student’s engagement in that same course.
The course activities include content, forming, data plus, URL, homepage, oucollaborate, resource, subpage, and glossary.
Student assessment scores compared to student engagement levels in the training data.
As shown in Figure 9, we plotted the score on the first assessment and the predicted student engagement for the test data. The
graph shows that high-engagement students usually earn better assessment scores in OU courses; this result may be due to
their increased access of course content and increased participation in discussion forums. In contrast, low-engagement students
tend to have lower scores on the first assessment; this result may be due to poor exam preparation, poor time management,
limited access of course content, and low participation in discussion forums.
Source Data
When students interact with the VLE system to complete a course assessment, their activities are recorded in the log file, and
the student performance data are recorded in the student database.
Preprocessing
The preprocessing module extracts input-related features and engagement labels from the student log data and transforms
those data into a format acceptable for input into ML algorithms.
ML Model Selection
Based on the ML performances for the student log data, this module selects the best ML model for making student engagement
predictions.
Instructor Dashboard
Because the decision tree rules are in the form of if/then rules, it is difficult for the instructor to understand these rules.
Consequently, the instructor dashboard is a computer program that interprets these rules and displays them in the form of a
graph to provide valuable information about student engagement in VLE activities to the instructor. Moreover, the instructor
can predict individual student engagement and then send pertinent intervention advice to low-engagement students.
After the model is developed, it can be applied to real student data; subsequently, at any moment in time, the model shows a
student’s engagement level for different assessments, materials, and activities. In other words, the model shows the number of
times a student uses VLE activities, such as forming or content.
The dashboard allows course instructors without IT skills to acquire up-to-date predictions about student engagement for each
assessment and each activity and to make accurate decisions about students to reduce the student dropout rate. Additionally,
the instructor can determine the reason for a student’s low-engagement level.
Moreover, the predictive portion of the dashboard provides some statistical information about students and their course
assignments. For example, first, the predictive portion identifies the top five most popular activities during the first assessment
of the course; second, it finds the percentage of low- and high-level engagement students for the current course assessment.
Furthermore, a graphical representation of student interactions with course activities allows instructors to evaluate a student’s
behavior in a few seconds and to give feedback in real time.
This visualization allows the instructor to assess the effect of redesigning the course or material. Furthermore, the visualization
enables the teaching staff to receive feedback on their teaching practices.
(2) When a student engages less in the discussion forum or OU content or never accesses the OU content, the instructor
can e-mail (disengagement trigger) the student to determine what materials the student is having difficulty with and
to determine why the student is contributing less to the course. The instructor can also offer advice about the course.
This advice may help to increase the student’s awareness of their productive or unproductive behavior, which might
increase their level of engagement.
(3) The instructor can predict individual student difficulties for each assessment and recommend relevant materials and
activities to students for the next assessment.
(4) Instructor contributions to the discussion forum can also increase student engagement.
(5) This model can help the instructor find the most important activities for students, increase student engagement, and
help students achieve high scores on course assessments.
(6) The teacher can use the model to increase student engagement. When the student engagement level in a course is
low, the instructor can redesign the course, to improve the student interactions within the course.
This tool can help instructors to design materials such that students will remain engaged during the course assessment.
The results of the first experiment showed that DT, J48, JRIP, and GBT are the most appropriate algorithms for predicting low-engagement students during an OU assessment. Table 8 and Figure 5 reveal that the true-positive rate (recall) of the J48 model is slightly higher than that of the alternatives and that the J48 model successfully identifies the students who truly exhibit low engagement during assessment activities.
The results of the second experiment indicate that the most important variables for predicting low engagement in an OU
assessment are clicks on content, forming, subpage, and homepage.
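Under the same assumptions as the model-selection sketch above, variable importance for a trained model can be inspected with caret (shown here for the CART decision tree fit); per these results, the click counts for content, forming, subpage, and homepage would be expected to rank near the top.

library(caret)

# Rank the input variables by importance for the trained decision tree
importance <- varImp(fit_cart)
plot(importance, top = 10)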
Because the current study used only student activity data from the OU system, the analysis focused on whether these data
could be used to predict low-engagement students based on the assessment results. Student engagement is a complex problem
that also depends on factors such as teaching experience, course design, teaching style, and course concepts. These factors must
be further investigated in the context of student engagement.
In future work, we plan to evaluate the total number of students’ clicks for each assessment, course design, teaching
experience, and teaching style in an OU course and then use collaborative filtering to recommend materials and lectures for
low-engagement students. This approach will help students achieve higher grades on the final exam.
Problem statement:
You work for a consumer finance company which specializes in lending loans to urban customers.
When the company receives a loan application, the company has to make a decision for loan
approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:
• If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
• If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company
The data given below contains information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the loan amount, or lending (to risky applicants) at a higher interest rate.
In this case study, you will use EDA to understand how consumer attributes and loan attributes
influence the tendency of default.
When a person applies for a loan, there are two types of decisions that could be taken by the
company:
1. Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:
o Fully paid: Applicant has fully paid the loan (the principal and the interest).
o Current: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as 'defaulted'.
o Charged-off: Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan.
2. Loan rejected: The company rejected the loan (because the candidate did not meet their requirements, etc.). Since the loan was rejected, there is no transactional history of those applicants with the company, and so this data is not available with the company (and thus not in this dataset).
This company is the largest online loan marketplace, facilitating personal loans, business loans, and
financing of medical procedures. Borrowers can easily access lower interest rate loans through an
online interface.
Like most other lending companies, lending loans to ‘risky’ applicants is the largest source of
financial loss (called credit loss). The credit loss is the amount of money lost by the lender when the
borrower refuses to pay or runs away with the money owed. In other words, borrowers who default
cause the largest amount of loss to the lenders. In this case, the customers labelled as 'charged-off'
are the 'defaulters'.
If one is able to identify these risky loan applicants, then such loans can be reduced thereby cutting
down the amount of credit loss.
In other words, the company wants to understand the driving factors (or driver variables) behind
loan default, i.e. the variables which are strong indicators of default. The company can utilize this
knowledge for its portfolio and risk assessment.
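As an exploratory sketch (column names such as purpose and loan_status are assumptions about this dataset), the charged-off rate can be broken down by a consumer or loan attribute to surface candidate driver variables:

library(dplyr)

# Default ("Charged Off") rate by loan purpose, highest first
loan_data %>%
  group_by(purpose) %>%
  summarise(n = n(),
            default_rate = mean(loan_status == "Charged Off")) %>%
  arrange(desc(default_rate))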
Perform Multinomial regression on the dataset in which loan status is the output (Y) variable and it
has three levels in it.
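A minimal sketch of such a multinomial model in R using nnet::multinom, assuming a data frame loan_data whose loan_status factor has the three levels "Fully Paid", "Current", and "Charged Off", and illustrative predictor names (annual_inc, dti, int_rate):

library(nnet)

# Use "Fully Paid" as the reference level of the three-level outcome
loan_data$loan_status <- relevel(factor(loan_data$loan_status), ref = "Fully Paid")

multi_fit <- multinom(loan_status ~ annual_inc + dti + int_rate, data = loan_data)
summary(multi_fit)
exp(coef(multi_fit))   # odds ratios for "Current" and "Charged Off" vs. "Fully Paid"

# Compare predicted and actual classes
pred <- predict(multi_fit, newdata = loan_data, type = "class")
table(Predicted = pred, Actual = loan_data$loan_status)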
Background:-
Implementation of Logistic Regression to Predict Default Loan by Kevin Tongam A.
Predictive Modeling on Credit Default Using GLM Model in R
I am going to demonstrate the practical application of a logistic regression model to predict whether an individual defaulted on their loan or paid it back. We will use real data from Lending Club, imported into RStudio. The data consists of 9,578 observations and 14 variables in total.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
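A small sketch of loading and inspecting the data (the file name is an assumption):

# Load the Lending Club extract and inspect its structure
loan <- read.csv("loan_data.csv", stringsAsFactors = TRUE)
str(loan)                     # 9,578 observations of 14 variables
table(loan$not.fully.paid)    # distribution of the target variable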
The logistic regression model is based on the logistic distribution function. Given a linear predictor $\beta_1 + \beta_2 X_i$, the logistic function is expressed as

$$P_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}}$$

or, equivalently,

$$P_i = \frac{\exp(\beta_1 + \beta_2 X_i)}{1 + \exp(\beta_1 + \beta_2 X_i)}$$

This represents the probability that a person defaults. Conversely, the probability of the loan being paid ($Y = 0$) is

$$1 - P_i = \frac{1}{1 + \exp(\beta_1 + \beta_2 X_i)}$$

Therefore, we can write

$$\frac{P_i}{1 - P_i} = \exp(\beta_1 + \beta_2 X_i)$$

This is exactly the odds ratio in favor of a person defaulting on their loan; for example, $P_i = 0.8$ means the odds are 4 to 1 in favor of default. Taking the natural log of the equation above gives

$$\ln\!\left(\frac{P_i}{1 - P_i}\right) = \beta_1 + \beta_2 X_i$$

This is the framework of the logistic regression model. The coefficient parameters of a logistic regression are estimated by Maximum Likelihood.
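As a quick numerical illustration with made-up coefficients, the probability and odds implied by the logistic function can be computed directly in R:

# Made-up coefficients: beta1 = -1, beta2 = 0.5, evaluated at X = 3
beta1 <- -1; beta2 <- 0.5; x <- 3

p <- exp(beta1 + beta2 * x) / (1 + exp(beta1 + beta2 * x))
p                            # probability of default, about 0.62
plogis(beta1 + beta2 * x)    # same value from R's built-in logistic CDF
p / (1 - p)                  # the corresponding odds, exp(0.5), about 1.65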
A bar chart of defaulted versus paid loans (shown in the original post) indicates that the default rate relative to paid loans is approximately 22.5%.
library(caTools)

# Split the data: 80% training, 20% test, stratified on the target variable
sample <- sample.split(loan$not.fully.paid, SplitRatio = 0.8)
train_data <- subset(loan, sample == TRUE)
test_data <- subset(loan, sample == FALSE)

Building the logistic regression model in R
Building a logistic regression model in R is quite simple: we use the glm() function ("Generalized Linear Model") to perform logistic regression. Here we put all of the independent variables into the model.
#Logistic regression
log.reg <- glm(not.fully.paid ~ ., data = train_data, family = "binomial")
summary(log.reg)
#TUNED GLM
log.reg.rev <- glm(not.fully.paid ~ purpose + installment + log.annual.inc +
                     fico + revol.bal + inq.last.6mths + pub.rec,
                   data = train_data, family = "binomial")
summary(log.reg.rev)

##
## Call:
## glm(formula = not.fully.paid ~ purpose + installment + log.annual.inc +
##     fico + revol.bal + inq.last.6mths + pub.rec, family = "binomial",
##     data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0930 -0.6271 -0.5056 -0.3628 2.7274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.734e+00 7.062e-01 9.536 < 2e-16 ***
## purposecredit_card -5.016e-01 1.219e-01 -4.116 3.86e-05 ***
# Incomes at which to evaluate the predicted odds of default
u <- c(20000, 30000, 40000, 50000, 70000, 100000, 150000, 200000, 250000,
       300000, 450000, 500000, 550000, 600000, 800000, 1000000, 1200000)
u <- as.data.frame(u)
K <- u %>%
  mutate(odds = func3(u))
K
## u odds
## 1 20000 0.93069841
## 2 30000 0.89786999
## 3 40000 0.86619952
## 4 50000 0.83564617
## 5 70000 0.77773456
## 6 100000 0.69830452
## 7 150000 0.58353549
## 8 200000 0.48762920
## 9 250000 0.40748547
## 10 300000 0.34051367
## 11 450000 0.19870181
## 12 500000 0.16604441
## 13 550000 0.13875437
## 14 600000 0.11594956
## 15 800000 0.05654039
## 16 1000000 0.02757075
## 17 1200000 0.01344430
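The helper func3 used above is not defined in this excerpt; a plausible sketch of it, assuming it returns the predicted odds of default from the tuned model as annual income varies, with the other predictors held at representative values (the chosen purpose level and the medians are assumptions):

# Hypothetical reconstruction of func3: predicted odds of default as a
# function of annual income, other predictors held fixed
func3 <- function(income) {
  newdata <- data.frame(purpose        = "debt_consolidation",
                        installment    = median(train_data$installment),
                        log.annual.inc = log(income),
                        fico           = median(train_data$fico),
                        revol.bal      = median(train_data$revol.bal),
                        inq.last.6mths = median(train_data$inq.last.6mths),
                        pub.rec        = median(train_data$pub.rec))
  p <- predict(log.reg.rev, newdata = newdata, type = "response")
  p / (1 - p)   # convert probability to odds
}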
The output above visualizes the effect of increasing income on the odds of default: as income increases, the odds of default decrease roughly exponentially. This is an intuitive result, as we expect people with higher incomes to be more likely to repay their loans, and vice versa.

Prediction

Now we can test the model on our test dataset. We do not build another model on the test data; instead, we use the trained model to make predictions on the test dataset. The syntax is simple in R: using the predict() function, we can generate predictions from our model.
#MAKING PREDICTIONS
predict_loan <- predict(object = log.reg.rev,
newdata = test_data[-14], type = "response")
head(predict_loan, n = 30)
## 11 15 16 17 18 19 29
## 0.25525891 0.09156507 0.11983393 0.17076668 0.03477595 0.14392812 0.03525650
## 30 31 34 37 39 40 45
## 0.02977701 0.05549851 0.06002100 0.02675648 0.11617745 0.11455656 0.09277177
## 56 59 60 61 74 75 81
## 0.21075875 0.18184797 0.11779812 0.15398311 0.22018983 0.08012999 0.06090537
## 83 85 93 94 109 120 123
## 0.06953879 0.09387963 0.05329096 0.09921972 0.28696441 0.16196419 0.04741813
## 137 142
## 0.07439255 0.17271198
Above is a sample of the prediction results for 30 observations. Note that the results are probabilities rather than the 0/1 values we want, since the dependent variable (not.fully.paid) takes values of either 0 or 1. We can simply convert them to 0/1 by assigning any value below 0.5 to 0 and any value of 0.5 or above to 1.
# Convert predicted probabilities to class labels using a 0.5 threshold
binary_predict <- ifelse(predict_loan > 0.5, 1, 0)
head(binary_predict, n = 30)
## 11 15 16 17 18 19 29 30 31 34 37 39 40 45 56 59 60 61 74 75
##  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Model Evaluation
We have built our model and made predictions; now it is time to see how accurately the model predicts on the test dataset. One way to evaluate a model's predictions is with a confusion matrix, which shows whether the model's predictions match the actual values from the test dataset or are false predictions.
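A minimal sketch of that evaluation, using the thresholded predictions from above:

# Confusion matrix of predicted vs. actual classes, and overall accuracy
conf_mat <- table(Predicted = factor(binary_predict, levels = c(0, 1)),
                  Actual = test_data$not.fully.paid)
conf_mat
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy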
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess

pred_ROCR <- prediction(predict_loan, test_data$not.fully.paid)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
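The AUC value itself can be read from the y.values slot of the ROCR performance object:

# Extract the numeric AUC from the performance object
auc_value <- auc_ROCR@y.values[[1]]
auc_value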