
9 XI November 2021

https://doi.org/10.22214/ijraset.2021.38786
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.429
Volume 9 Issue XI Nov 2021- Available at www.ijraset.com

Student Academic Performance Prediction under Various Machine Learning Classification Algorithms
M. Nirmala¹, T. Seeni Selvi², V. Saravanan³
¹Department of Computer Applications, Hindusthan College of Engineering and Technology
²Department of Computer Science, Hindusthan College of Arts and Science
³Department of Information Technology, Hindusthan College of Arts and Science
Abstract: Data mining in the educational system has grown tremendously and continues to grow. This study takes an academic standpoint and evaluates student performance using three groups of parameters: Scholastic Features, Demographic Features and Emotional Features. Various machine learning methods are adopted to extract hidden knowledge from the educational data set, which helps identify the features with the greatest impact on student academic performance; knowing these features in turn gives deeper insight when predicting academic performance. The full machine learning workflow, from problem definition to model prediction, has been carried out in this study. A supervised learning approach is adopted, and various feature engineering methods are applied to make the data suitable for model training and evaluation. It is a prediction problem, and classification algorithms such as Logistic Regression, Random Forest, SVM, KNN, XGBOOST and Decision Tree have been fitted to the student data.
Keywords: Scholastic, Demographic, Emotional, Logistic Regression, Random Forest, SVM, KNN, XGBOOST, Decision Tree.

I. INTRODUCTION
Machine Learning [1] commonly deals with big data, where the data is massive and can be in both structured and unstructured formats. It gives computers the ability to learn from data and make sensible decisions. The main focus of this research is to perform the Machine Learning approach step by step, from problem definition to prediction. The educational sector is a domain where a large amount of data is generated every day. The existing data and the data still to come, if analysed in the right way, can bring tremendous changes in the scholastic field. Machine Learning techniques can analyze this data and help improve the scholastic performance of students. Other features, such as demographic and behavioural ones, can also affect students' academic performance.

II. LITERATURE SURVEY / RELATED WORK


Numerous data mining tasks [2] were used to create qualitative predictive models that predict students' grades from a collected training dataset. First, a survey of university students collected multiple personal, social, and academic data about them. Second, the collected data were pre-processed to make them suitable for data mining tasks. Third, the classification models were tested on the pre-processed data. On the whole, this study encouraged universities to run data mining tasks on their students' data regularly to find interesting results and patterns that can help both the university and the students in many ways. In a similar piece of research on Educational Data Mining [3], students' performance was predicted from academic records and forum participation. Data from two undergraduate courses were collected. Three classification models, Naive Bayes, Neural Networks and Decision Trees, were used to predict students' performance. The results show that the Naive Bayes model gave better results than the other two models.
Another comparative study was done by [4]. They compared six algorithms: J48, Random Forest, Naive Bayes, Naive Bayes Multinomial, K-Star and IBK. The data set contained 480 records and the Weka tool was used for implementation. The survey was based on seven attributes and found that the Random Forest algorithm provides more accuracy than the other algorithms.
In [5], a survey was conducted over 200 college students and classification algorithms were applied to the student dataset to foretell the learning behavior of students. Slow learners were identified so that corrective action could be taken to reduce the failure count and help the weaker students learn. The study compared the J48, Naive Bayes and Random Forest algorithms. Finally, the researchers found the Random Forest algorithm to be accurate when the data set is massive in size.

©IJRASET: All Rights are Reserved 221



The study of students' educational behavior in [6] proposed a framework that introduces a feature category called “Behavioral feature”, focusing on students' behavioral features and their relationship with academic success. The same framework was used to examine students' progress with ensemble techniques, which enhance the overall accuracy of the results. A classification task on a student database to predict academic performance was carried out by [7] using Bayesian Network Classifiers. Information such as previous semester marks, internal assessment marks, performance during seminars, assignments, attendance and co-curricular activities was collected to predict end-semester marks. Such a study helps students improve their performance: students who require special attention can be identified effectively, and the failure rate can be decreased considerably.
A study of student performance was done by [8]. The sample contained 300 students, of whom 225 were male and 75 were female. The performance of students in class is affected by various parameters such as student attendance, hours spent in class, family income, and the mother's age and education.
Educational Data Mining (EDM) was described by [9] as an upcoming research area that deals with computational methods for exploring educational data. The review also explains the types of educational environments, educational data and the different groups of people in the education field. It helps to explore educational phenomena better and to gain enhanced insight into them, and it covers the current state of the EDM field.

III. RESEARCH METHODOLOGY


This section describes the methods adopted during the research process. This is a descriptive research problem in which a student data set is explored. It predicts the academic performance of the students of an educational body by applying various Machine Learning methodologies.

A. Research Data
The data collected from secondary data sources are tabulated in Table 1.
Table 1 : Data Source Details
Data source              xAPI-Edu-Data.csv
Dataset characteristics  Multivariate
Number of Instances      480
Number of Attributes     17
Attribute Type           Categorical and Numerical
Dataset Owner            Ibrahim Aljarah, Professor (Assistant) at The University of Jordan; Fargo, North Dakota, United States
Link                     https://www.kaggle.com/aljarah/xAPI-Edu-Data/metadata

B. Proposed System Method Of Analysis


The proposed system predicts the academic performance of the student using the features listed in Table 2, which are classified as Demographic, Scholastic and Emotional.
Table 2 : Students Features
Demographic Features      Scholastic Features                  Emotional Features
(Related to Population)
gender                    Educational Stages                   Raised Hands
Nationality               Grade Levels                         Visited Resources
Place of Birth            Section ID                           Viewing Announcements
                          Semester                             Discussion Groups
                          Topic                                Parents Answering Survey
                          Parents responsible for student      Student Absence Days
                          Class (L, M, H), based on the        Parents School Satisfaction
                          total grade marks classified
                          into 3 classes


The Machine Learning workflow has several steps, from problem definition to model prediction. The steps to be followed before fitting the model are shown in Figure 1.

Figure 1 : Machine Learning Process Pipeline

C. Machine Learning Pipeline


Machine learning is adopted when a problem cannot be solved with traditional programming, when the system itself rather than a programmer needs to solve the problem, and when the data is very large.
Steps to be followed for the Machine Learning Process

Define Problem: Be clear about what the model is expected to do. Ensure that all the inputs are available during prediction. In this system, the academic performance of students needs to be predicted from various features.

Collect Data: The data is collected from the xAPI-Edu-Data.csv data set. It contains 480 rows and 17 columns, with both categorical and numerical data. The collected data is in the format shown in Figure 2.

Figure 2 : Data Format for Supervised Learning

Table 3 : Students Features and its Descriptions

Feature                    Datatype     Description
gender                     Categorical  Male or Female
NationalITy                Categorical  Student nationality
PlaceofBirth               Categorical  Place of birth of the student
StageID                    Categorical  Educational stage: Primary, Middle or High School
GradeID                    Categorical  Grade category, from G-01 to G-12
SectionID                  Categorical  Classroom section: A, B or C
Topic                      Categorical  Course topic, such as Math, Quran etc.
Semester                   Categorical  First semester or second semester
Relation                   Categorical  Parent responsible for the student: Father or Mum
raisedhands                Numerical    Number of times the student raised a hand in the classroom
VisiTedResources           Numerical    Number of times the student visited the course content
AnnouncementsView          Numerical    Number of times the student checked new announcements
Discussion                 Numerical    Number of times the student participated in discussion groups
ParentsAnsweringSurvey     Categorical  Whether the parent answered the survey provided by the school
ParentsschoolSatisfaction  Categorical  Degree of parent satisfaction with the school
StudentAbsenceDays         Categorical  Number of absence days: above 7 or under 7
Class                      Categorical  Level based on the total grade / marks: Low Level, Middle Level or High Level
Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of techniques (both graphical and quantitative) to better understand data. This system contains 4 numerical columns and 13 categorical columns; each feature, its datatype and its description are listed in Table 3.


D. Exploratory Data Analysis


1) Univariate Analysis – Individual Features / Variables

Analyze Data: Identify the null values present in each column; the analysis shows that the given data set contains no null values. Data visualization is the graphical representation of data in the form of charts, diagrams etc. Visualization helps to understand the data much more quickly than quantitative methods, and as part of visualization the following methods are applied to analyze the data:
UNIVARIATE ANALYSIS – Individual Features / Variables
BIVARIATE ANALYSIS – Relationship of a feature with Target Variable

Univariate analysis examines a single variable at a time and does not infer its relationship with any other variable. In general, a count plot can be used for this analysis. It portrays the data and its patterns so that the user gets better insight into the single variable, and the graphical representation helps to view maximum, minimum and mean values. The univariate analysis and the inferences from its visualization are described using the charts listed below.
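The percentage shares that a count plot visualizes can be computed directly. The following is a minimal sketch in plain Python, assuming a categorical column is available as a list of values; the function name `category_shares` and the toy `semester` list are hypothetical and are not taken from the data set. With pandas, `Series.value_counts(normalize=True)` gives the same information.

```python
from collections import Counter

def category_shares(values):
    """Return {category: percentage of rows}, rounded to one decimal place."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Hypothetical toy column (first / second semester), not the paper's data.
semester = ["F", "S", "F", "F", "S", "F", "S", "F"]
shares = category_shares(semester)
print(shares)
```

Applied to each categorical column in turn, a helper of this kind would yield the kind of percentages listed in the univariate report.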

Figure 3 : Univariate Analysis – gender
Figure 4 : Univariate Analysis – Stage ID
Figure 5 : Univariate Analysis – PlaceofBirth
Figure 6 : Univariate Analysis – Nationality
Figure 7 : Univariate Analysis – Class
Figure 8 : Univariate Analysis – Grade ID
Figure 9 : Univariate Analysis – Section ID
Figure 10 : Univariate Analysis – Topic
Figure 11 : Univariate Analysis – Semester
Figure 12 : Univariate Analysis – Relation
Figure 13 : Univariate Analysis – ParentAnsweringSurvey
Figure 14 : Univariate Analysis – ParentschoolSatisfaction
Figure 15 : Univariate Analysis – StudentAbsenceDays

2) Univariate Analysis –Report

Gender: Male is 63.5% and Female is 36.4%. The gender feature shows that most students in the data set are male.
Nationality: Under the Nationality feature, KW has 37.3%, Jordan has 35.8% and Venezuela has the lowest share at 0.2%.
PlaceofBirth: The percentage distribution of Nationality and Place of Birth is almost the same, so as per the analysis either one of the two columns could be dropped.
StageID: Of the total, 51.7% of students are studying in MiddleSchool, 41.5% are in Lowerlevel and only 6.9% are in High School.
GradeID: Of the total, G-02 is 30.6%, G-08 is 24.2%, G-07 is 21%, G-04 is 10%, G-06 is 6.7%, G-11 is 2.7%, G-12 is 2.3%, G-09 is 1.04%, G-10 is 0.83% and G-05 is 0.63%.
SectionID: Of the total, 59% are studying in section A, 34.8% in section B and 6.25% in section C.
Topic: Of the total students, the topic of interest is IT for 19.8%, French for 13.5%, Arabic for 12.3%, Science for 10.6%, English for 9.8%, Biology for 6.25%, Spanish for 5.2%, Geology and Chemistry for 5% each, Quran for 4.58%, Mathematics for 4.37% and History for 3.95%.
Semester: 51% of students are in the First Semester and 48.95% are in the Second Semester.
Relation: The parent responsible for the student can be either Father or Mum. Of the total, 58.9% is Father and 41.04% is Mother.
ParentAnsweringSurvey: Parents answering the survey towards school improvement is an important factor; 56.25% answered ‘YES’ and 43.75% answered ‘NO’.
ParentschoolSatisfaction: This is also an important factor and helps to identify whether the student will continue in the same school or not. Of the total, 61% of opinions towards the school were Good and the remaining 39% were Bad.
StudentAbsenceDays: Of the total, 60% of students are regular and 40% have taken more than 7 days of leave. Females have better attendance than males.
StudentAbsenceDays with respect to gender:
  StudentAbsenceDays / Gender   Male   Female
  Under 7                       160    129
  Above 7                       145    46
Class: Of the total, a Low Level score is acquired by 26.5%, a Medium Level score by 44% and a High Level score by 30% of students.
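As a worked check, the shares quoted in the report can be recomputed from the StudentAbsenceDays-by-gender counts given above. This is only an arithmetic sketch; the variable names are illustrative. The female share comes out as 36.46%, which the report lists as 36.4%.

```python
# Counts copied from the StudentAbsenceDays-by-gender table in the report.
counts = {
    "Under 7": {"Male": 160, "Female": 129},
    "Above 7": {"Male": 145, "Female": 46},
}

total = sum(sum(row.values()) for row in counts.values())
by_gender = {g: counts["Under 7"][g] + counts["Above 7"][g] for g in ("Male", "Female")}

# Share of each gender in the whole data set, in percent.
gender_share = {g: round(100 * n / total, 1) for g, n in by_gender.items()}
# Share of students with fewer than 7 absence days ("regular" students).
regular_share = round(100 * sum(counts["Under 7"].values()) / total, 1)
# Within each gender, the share absent for more than 7 days.
above7_rate = {g: round(100 * counts["Above 7"][g] / by_gender[g], 1) for g in by_gender}

print(total, gender_share, regular_share, above7_rate)
```

The within-gender rates (about 47.5% of males against 26.3% of females absent for more than 7 days) are what support the statement that females have better attendance.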


3) Bivariate Analysis – Relationship Of a Feature With Target Variable


Bivariate analysis is performed to find the association between each variable in the data set and the target variable (Class in this system). It checks whether an association exists and how strong it is, or whether there are differences between the two variables and how significant those differences are.
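One simple way to look at the relationship between a categorical feature and the target is a row-normalised cross tabulation. The sketch below is plain Python under stated assumptions: `crosstab_percent` is a hypothetical helper, and the toy rows are invented for illustration. In pandas, `pd.crosstab(feature, target, normalize="index")` plays the same role.

```python
from collections import Counter

def crosstab_percent(feature, target):
    """For each feature value, the percentage of its rows in each target class."""
    pair_counts = Counter(zip(feature, target))
    row_totals = Counter(feature)
    classes = sorted(set(target))
    return {
        value: {c: round(100 * pair_counts[(value, c)] / row_totals[value], 1) for c in classes}
        for value in sorted(row_totals)
    }

# Hypothetical toy rows (absence bucket, class label); not the paper's data.
absence = ["Under-7", "Under-7", "Above-7", "Above-7", "Under-7", "Above-7", "Under-7", "Above-7"]
klass = ["H", "M", "L", "L", "H", "M", "M", "L"]
table = crosstab_percent(absence, klass)
print(table)
```

Reading the table row by row shows how the class mix changes with the feature value, which is the pattern the bivariate charts display.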

Figure 16 : Bivariate Analysis – Gender & Class
Figure 17 : Bivariate Analysis – Stage ID & Class
Figure 18 : Bivariate Analysis – Section ID & Class
Figure 19 : Bivariate Analysis – Semester & Class
Figure 20 : Bivariate Analysis – Relation & Class
Figure 21 : Bivariate Analysis – ParentAnsweringSurvey & Class
Figure 22 : Bivariate Analysis – ParentSchoolSatisfaction & Class
Figure 23 : Bivariate Analysis – StudentAbsenceDays & Class
Figure 24 : Bivariate Analysis – raisedhands & Class
Figure 25 : Bivariate Analysis – Visited Resources & Class
Figure 26 : Bivariate Analysis – Discussion & Class
Figure 27 : Bivariate Analysis – Announcements View & Class
Figure 28 : Bivariate Analysis – Nationality & Class
Figure 29 : Bivariate Analysis – Place of Birth & Class
Figure 30 : Bivariate Analysis – Grade ID & Class
Figure 31 : Bivariate Analysis – Topic & Class

4) Bivariate Analysis –Report – Target Variable = Class

Gender (Table 4 : Gender & Class Score): With respect to gender compared with class, females have the highest score at High level and males have the highest score at Low level. Female academic performance is higher than male.
Nationality (Table 5 : Nationality & Class Score): With respect to Nationality compared with class, Jordan and Egypt have the highest percentage compared to other countries.
PlaceofBirth (Table 6 : PlaceofBirth & Class Score): With respect to PlaceofBirth compared with class, Jordan and Egypt have the highest count compared to other countries.
StageID (Table 7 : Stage ID & Class Score): With respect to StageID, Middle School and Lower Level have high levels of scores with respect to Class.
GradeID (Table 8 : Grade ID & Class Score): G-02, G-08 and G-09 have the highest scores compared to other grades.
SectionID (Table 9 : Section ID & Class Score): With respect to SectionID compared with class, Section A ranks high in all 3 class categories.
Topic (Table 10 : Topic & Class Score):
Semester (Table 11 : Semester & Class Score): For the second semester, the count is lower at the Low Level and higher in the other cases.
Relation (Table 12 : Relation & Class Score): With respect to Relation compared with class, the high-level learning students are greatly supported and motivated by mothers.
ParentAnsweringSurvey (Table 13 : ParentAnsweringSurvey & Class Score): With respect to ParentAnsweringSurvey compared with class, there were more ‘Yes’ answers for H and M and fewer for L.
ParentschoolSatisfaction (Table 14 : ParentSchoolSatisfaction & Class Score): With respect to ParentSchoolSatisfaction compared with class, a large majority of parents are satisfied with the education they received. The count of least satisfied parents is comparatively low.
StudentAbsenceDays (Table 15 : StudentAbsenceDays & Class Score): The biggest visual trend is how frequently the student was absent. Over 90% of the students who did poorly were absent more than seven times, while almost none of the students who did well were absent more than seven times.
Raisedhands:
AnnouncementsView: Female students participated more in viewing announcements.
VisitedResources: Female students visited the resources more often.
Discussion: Female students participated more in discussion.


5) Correlation: Correlation [10] is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. The correlation value lies between +1 and -1.
Types of correlation are:
Variable types                                Method
Numeric vs Numeric                            Pearson
Categorical (Binary Feature) vs Numerical     Pointbiserialr
Ordinal with Ordinal                          Spearman Rho
Categorical vs Categorical                    Cross Tab
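To make the first three methods concrete, here is a minimal plain-Python sketch. It assumes the usual textbook definitions: Pearson is the normalised covariance, Spearman rho is Pearson applied to ranks, and point-biserial is Pearson with one variable coded 0/1. The function names are illustrative; the paper's own implementation is not shown in the text (names such as Pointbiserialr suggest SciPy, which is an assumption).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def point_biserial(binary, numeric):
    """Point-biserial correlation: Pearson with a 0/1 coded variable."""
    return pearson(binary, numeric)

# Toy values: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

On the toy values, Spearman rho is 1 because the relationship is perfectly monotonic, while Pearson is slightly below 1 because it is not linear.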

Different types of correlation have been implemented depending on the type of variable. The correlation methods adopted for the given data set are depicted in Table 16.
Table 16 : Correlation Methods Applied for the Dataset


The following inferences have been drawn from Table 17. The correlations between features, computed with the crosstab function, Spearman rho, Pearson and point-biserial methods, show that the following features are correlated and could be included for modelling: Nationality, Place of Birth, Stage ID, Grade ID, Section ID, Topic, Semester, Relation, Class, Parent Answering Survey, Parent School Satisfaction and Student Absence Days, along with the numerical features. Other features could be included later if feature importance shows they are required.
Table 17 : Correlation Methods Tabulated Values


E. Feature Engineering Concepts [11]


Feature engineering is the process of converting data into features that act as inputs to machine learning models. Variable transformation is applied in this study because most of the columns in the given data set are categorical and need to be converted to numerical values. The conversion is done with the Label Encoding method [12]; the output of the label encoding is shown in Figure 34 and the formula applied for the label encoding is shown in Figure 32.

Figure 33 : Label Encoding Code

Figure 34 : Label Encoder: Categorical to Numeric Converted Values
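Since the label encoding code survives only as a figure caption, the idea can be sketched in plain Python. This is an assumption-laden illustration rather than the authors' code: it maps each distinct label to an integer in sorted order, which is also how scikit-learn's `LabelEncoder` assigns codes. The helper names are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

def transform(values, mapping):
    """Replace each label by its integer code."""
    return [mapping[v] for v in values]

# Toy column using the Class labels L, M, H from Table 3.
klass = ["M", "L", "H", "M"]
mapping = fit_label_encoder(klass)
codes = transform(klass, mapping)
print(mapping, codes)
```

Note that the sorted order gives H=0, L=1, M=2, so the codes do not follow the Low < Middle < High ordering; where that order matters, an explicit mapping may be the safer choice.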


The Classification Algorithms [13] used in this paper are:
1) Logistic Regression
2) Random Forest
3) K Nearest Neighbors Algorithm
4) Decision Tree
5) XG Boost
6) Support Vector Machine


IV. EXPERIMENTAL RESULTS


The transformed data set is partitioned into a training set and a test set: 70% of the whole data set is used for training and the remaining 30% is held out as the test set. The random state is set to 0. The parameters applied for the various algorithms are listed in Table 18. The experimental results before feature engineering are given in Table 19. Sample code for Logistic Regression and its classification report are shown in Table 20 and Figure 35.
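The 70/30 partition with a fixed random state can be sketched as follows. This is a plain-Python stand-in, assuming the test share is rounded up; the function `train_test_split_indices` is hypothetical, and its shuffle will not reproduce the exact rows picked by scikit-learn's `train_test_split` for the same seed.

```python
import math
import random

def train_test_split_indices(n_rows, test_percent=30, seed=0):
    """Shuffle row indices with a fixed seed and split them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_percent / 100)
    return indices[n_test:], indices[:n_test]

# 480 rows, as in the xAPI-Edu-Data set described in Table 1.
train_idx, test_idx = train_test_split_indices(480, test_percent=30, seed=0)
print(len(train_idx), len(test_idx))
```

For 480 rows this gives 336 training rows and 144 test rows, and fixing the seed makes the split repeatable between runs.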
Table 18 : Parameters For Model Fitting
Model Type                 Parameters for Fitting the Model
Logistic Regression        solver='lbfgs', multi_class='auto', max_iter=2000
Random Forest              RandomForestClassifier(n_jobs=-1, random_state=123, criterion='gini', max_depth=3)
KNN                        KNeighborsClassifier(n_neighbors=7)
SVM                        svm.SVC(kernel='rbf', gamma='auto')  # RBF kernel
XGBOOST                    xgb.XGBClassifier(max_depth=10, learning_rate=0.1, n_estimators=100, seed=10)
DECISION TREE – Gini       DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=7, min_samples_leaf=5)
DECISION TREE - Entropy    DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=7, min_samples_leaf=5)
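The two decision tree rows differ only in the split criterion. A minimal sketch of the two impurity measures, using their standard definitions and hypothetical function names, shows what each criterion computes for a node with given class counts.

```python
import math

def gini(class_counts):
    """Gini impurity of a node: 1 - sum(p_k ** 2)."""
    total = sum(class_counts)
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

def entropy(class_counts):
    """Shannon entropy of a node in bits: -sum(p_k * log2(p_k))."""
    total = sum(class_counts)
    return -sum((n / total) * math.log2(n / total) for n in class_counts if n > 0)

# Toy nodes: an evenly mixed two-class node and a pure node.
print(gini([5, 5]), entropy([5, 5]))
print(gini([10, 0]), entropy([10, 0]))
```

Both measures are zero for a pure node and largest for an evenly mixed one; a tree picks the split that reduces the chosen measure the most.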

Table 19 : Experimented Results – Before Feature Engineering

Model Type                 Training Score   Testing Score
Logistic Regression        79.16            75.0
Random Forest              82.44            75.69
KNN                        75.0             61.1
SVM                        99.70            50.0
XGBOOST                    100.0            74.30
DECISION TREE – Gini       86.90            70.83
DECISION TREE - Entropy    85.11            67.36
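One way to read Table 19 is to look at the gap between training and testing scores, since a large gap suggests overfitting. The sketch below copies the scores from Table 19; the conversion to a number of correct predictions assumes a test set of 144 rows (30% of 480), which is inferred from the text rather than stated in it.

```python
# Training and testing scores (percent) copied from Table 19.
scores = {
    "Logistic Regression": (79.16, 75.0),
    "Random Forest": (82.44, 75.69),
    "KNN": (75.0, 61.1),
    "SVM": (99.70, 50.0),
    "XGBOOST": (100.0, 74.30),
    "DECISION TREE - Gini": (86.90, 70.83),
    "DECISION TREE - Entropy": (85.11, 67.36),
}

# Gap between training and testing score, in percentage points.
gaps = {model: round(train - test, 2) for model, (train, test) in scores.items()}
worst_gap = max(gaps, key=gaps.get)
best_test = max(scores, key=lambda m: scores[m][1])

# Assumed test-set size: 30% of 480 rows.
N_TEST = 144
correct = {model: round(test / 100 * N_TEST) for model, (_, test) in scores.items()}

print(gaps)
print(worst_gap, best_test, correct)
```

SVM shows the largest gap (99.70 against 50.0), and XGBOOST fits the training data perfectly while scoring 74.30 on the test set, so both look overfitted before feature engineering.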

Table 20 : Training & Testing Code – Logistic Regression Algorithm

Training Score Code
from sklearn.linear_model import LogisticRegression
Logit_Model = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=2000)
Logit_Model.fit(X_train, Y_train)
Logit_Model.score(X_train, Y_train)

Testing Score Code
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
prediction = Logit_Model.predict(X_test)
score = accuracy_score(Y_test, prediction)
report = classification_report(Y_test, prediction)

Figure 35 : Logistic Regression – Classification Report
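A classification report summarises precision, recall and F1 score per class alongside overall accuracy. As the figure itself is not reproduced here, the sketch below shows how those quantities are derived from true and predicted labels. It is plain Python with a hypothetical helper name and invented toy labels, not the output of the paper's model.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall and F1 per class, plus overall accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return report, accuracy

# Invented toy labels using the three classes H, M, L.
y_true = ["H", "H", "M", "M", "L", "L"]
y_pred = ["H", "M", "M", "M", "L", "M"]
report, accuracy = per_class_report(y_true, y_pred)
print(report, accuracy)
```

In the toy case, class M has perfect recall but only 0.5 precision because two rows from other classes were predicted as M, a distinction that accuracy alone would hide.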


A. Feature Importance
1) Random Forest Feature Importance [14]: Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.
2) Experimented Results after Feature Engineering: The data set produced by the feature engineering process is divided into a training set and a test set, where the training data is 70% of the whole data set and the remaining 30% is used as the test set. The random state is set to 50 here, whereas in the previous phase it was set to 0.
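The mean decrease accuracy idea named in item 1 can be illustrated without a forest: shuffle one feature column at a time and measure how much the accuracy drops. The sketch below is a simplified stand-in under stated assumptions. The `predict` rule replaces a trained random forest, the data is invented, and the function name is hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def mean_decrease_accuracy(predict, rows, labels, n_repeats=10, seed=0):
    """For each feature, the average drop in accuracy after shuffling that column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Invented data: column 0 drives the label, column 1 is unrelated noise.
rows = [[5 * i, (7 * i) % 11] for i in range(20)]
labels = ["H" if r[0] >= 50 else "L" for r in rows]

def predict(row):
    # Stand-in rule playing the role of a fitted classifier.
    return "H" if row[0] >= 50 else "L"

importances = mean_decrease_accuracy(predict, rows, labels)
print(importances)
```

A feature the model does not rely on scores zero, while shuffling an informative feature lowers accuracy; features can then be ranked by the size of that drop.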

Table 21 : Experimented Results – After Feature Engineering

Model Type                 Training Score   Testing Score   Remarks
Logistic Regression        87.20            86.81           Good
Random Forest              94.05            90.97           Fair
KNN                        81.54            82.63           Good
SVM                        97.91            83.33           Needs more Testing Effort
XGBOOST                    97.02            90.27           Needs more Testing Effort
DECISION TREE – Gini       81.25            76.38           Needs more Testing Effort
DECISION TREE - Entropy    80.65            81.25           Good
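The effect of feature engineering can be summarised by comparing the testing scores of Table 19 and Table 21. The following arithmetic sketch copies those scores; because the two phases use different random states (0 and 50), the splits differ and the gains should be read as indicative only.

```python
# Testing scores (percent) from Table 19 (before) and Table 21 (after).
before = {
    "Logistic Regression": 75.0,
    "Random Forest": 75.69,
    "KNN": 61.1,
    "SVM": 50.0,
    "XGBOOST": 74.30,
    "DECISION TREE - Gini": 70.83,
    "DECISION TREE - Entropy": 67.36,
}
after = {
    "Logistic Regression": 86.81,
    "Random Forest": 90.97,
    "KNN": 82.63,
    "SVM": 83.33,
    "XGBOOST": 90.27,
    "DECISION TREE - Gini": 76.38,
    "DECISION TREE - Entropy": 81.25,
}

# Gain in testing score, in percentage points, ranked from largest to smallest.
gain = {model: round(after[model] - before[model], 2) for model in before}
ranked = sorted(gain, key=gain.get, reverse=True)
best_after = max(after, key=after.get)

print(gain)
print(ranked, best_after)
```

Every model improves on the test set, with SVM gaining the most (33.33 points) and Random Forest reaching the highest testing score after feature engineering (90.97).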

V. CONCLUSION
The use of machine learning is increasing rapidly: a machine can learn to predict the result of a system by itself, can be trained on data over a period of time, and can then be tested on a different set of data to show that the model works efficiently and effectively. In this study, Logistic Regression obtained a training score of 87.20 and a testing score of 86.81; the closeness of the two scores indicates that the model works effectively without a marked bias or variance problem. KNN and Decision Tree (Entropy) also work well, while the other algorithms implemented in this study need further feature engineering and stronger data analysis. Model deployment has been done for all algorithms, and the sample input given for evaluation was classified correctly by all of them.
VI. FUTURE SCOPE
The present study, which predicts the academic performance of students from various features, has shown considerably positive results. This research work improves the process of predicting student performance. In future, this work can be extended by using other feature(s) as the target variable.
A. Other features, such as financial factors, physical health and food habits, can also be included in upcoming research.
B. These factors can also affect the academic performance of the student, directly or indirectly.
C. Since the present study focused on predicting the academic performance [5] of the student, the additional factors can also be used to predict the performance of the student not only from an academic point of view but also from a behavioral perspective.

REFERENCES
[1] Smola, Alex, and S.V.N. Vishwanathan. Introduction to Machine Learning. Cambridge University Press, 2008.
[2] Amjad Abu Saa. (2016) “Educational Data Mining & Students’ Performance Prediction” International Journal of Advanced Computer Science and Applications, Vol. 7, No. 5, 2016.
[3] Ahmed Mueen, Bassam Zafar and Umar Manzoor. (2016) “Modeling and Predicting Students’ Academic Performance Using Data Mining Techniques” I.J. Modern Education and Computer Science, 2016, 11, 36-42.
[4] Bhrigu Kapur, Nakin Ahluwalia and Sathyaraj R, “Comparative Study on Marks Prediction using Data Mining and Classification Algorithms”, International Journal of Advanced Research in Computer Science, 8 (3), March-April 2017, 632-636.
[5] Prasada Rao, K., M. V.P. Chandra Sekhara, and B. Ramesh. "Predicting Learning Behavior of Students using Classification Techniques." International Journal of Computer Applications (0975 – 8887) Volume 139 – No.7, April 2016.


[6] Amrieh, E. A., Hamtini, T. & Aljarah, I. (2016). Mining educational data to predict Student’s academic performance using ensemble methods. International Journal of Database Theory and Application, 9(8), pp. 119–136. doi: 2016.9.8.13.
[7] Sundar PVP. A Comparative Study For Predicting Students Academic Performance using Bayesian Network Classifiers. IOSR Journal of Engineering. 2013 Feb; 3(2):37–42.
[8] S. T. Hijazi, and R. S. M. M. Naqvi, “Factors affecting student’s performance: A Case of Private Colleges”, Bangladesh e-Journal of Sociology, Vol. 3, No. 1, 2006.
[9] C. Romero, “Educational Data Mining: A Review of the State of the Art”, IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, Vol. 40, 2010.
[10] https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
[11] https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
[12] https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
[13] https://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf
[14] https://blog.datadive.net/selecting-good-features-part-iii-random-forests/
