Student Academic Performance Prediction Under Various Machine Learning Classification Algorithms
https://doi.org/10.22214/ijraset.2021.38786
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.429
Volume 9 Issue XI Nov 2021- Available at www.ijraset.com
I. INTRODUCTION
Machine Learning [1] commonly deals with big data, where the volume of data is massive and the data can be in both structured and
unstructured formats. It gives computers the ability to learn from data and make sensible decisions. The main focus of this research
is to carry out the Machine Learning approach step by step, from problem definition to prediction.
The educational sector is a domain where a large amount of data is generated every day. The existing data and the data yet to be
received, if analysed in the right format, can bring substantial changes to the scholastic field. Machine Learning techniques can
analyze this data and help improve the scholastic performance of students. Other features, such as demographic and behavioural
ones, can also have an impact on the academic performance of students.
The study of students’ educational behavior in [6] proposed a framework that introduces a category of features called “Behavioral
features”, focusing on students’ behavioral features and their relationship with academic success. The authors used the same
framework to examine students’ progress with ensemble techniques, which enhance the overall accuracy of the results. A
classification task on a student database to predict students’ academic performance was carried out in [7], using Bayesian Network
Classifiers. Information such as previous semester marks, internal assessment marks, performance during seminars, assignments,
attendance and co-curricular activities was collected to predict the end-semester marks. Such a study helps students improve their
performance: students who require special attention can be identified effectively, and the failure rate can be decreased
considerably.
A study of student performance was carried out in [8]. The sample contains 300 students, of which 225 are male and 75 are female.
The performance of students in class is affected by various parameters such as student attendance, hours spent in class, family
income, and the mother’s age and education.
Educational Data Mining (EDM) was described in [9] as an upcoming research area that deals with computational methods for
exploring educational data. The review also explains the types of educational environments, educational data and the different
groups of people in the education field. EDM helps to explore educational phenomena better and to gain enhanced insight into them.
The review also covers the current state of the EDM field.
A. Research Data
The data collected from secondary data sources are tabulated in Table 1.
Table 1 : Data Source Details
Data source               : xAPI-Edu-Data.csv
Dataset characteristics   : Multivariate
Number of Instances       : 480
Number of Attributes      : 17
Attribute Type            : Categorical and Numerical
Dataset Owner             : Ibrahim Aljarah, Professor (Assistant) at The University of Jordan
                            Fargo, North Dakota, United States
Link                      : https://www.kaggle.com/aljarah/xAPI-Edu-Data/metadata
A Machine Learning workflow has several steps, starting from problem definition and ending with model prediction. The steps to be
followed before fitting the model are shown in Figure 1.
Collect Data: Identify the null values present in each column. The analysis shows that the given data set contains no null values.
Analyze Data: Data visualization is the graphical representation of data in the form of charts, diagrams etc. Visualization helps
to understand the data much quicker than quantitative methods, and as a part of visualization various methods are performed to
analyze the data in a better format:
UNIVARIATE ANALYSIS – Individual Features / Variables
BIVARIATE ANALYSIS – Relationship of a feature with Target Variable
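The null-value check of the Collect Data step can be sketched as follows. This is a minimal illustration using only the Python standard library; the three-row sample and the helper name count_missing are hypothetical, and the study may well have inspected xAPI-Edu-Data.csv with a different tool.

```python
import csv
import io


def count_missing(rows, columns):
    """Return {column: number of empty or missing cells}."""
    missing = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing


# Hypothetical three-row sample standing in for xAPI-Edu-Data.csv.
sample_csv = io.StringIO(
    "gender,StageID,Class\n"
    "M,lowerlevel,M\n"
    "F,MiddleSchool,H\n"
    "M,HighSchool,L\n"
)
reader = csv.DictReader(sample_csv)
rows = list(reader)
missing_per_column = count_missing(rows, reader.fieldnames)
print(missing_per_column)
```

A column whose count is zero contains no null values; for the real file, the in-memory sample would be replaced by an opened file handle.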
Univariate analysis examines a single variable; it does not infer any relationship with other variables. In general, a count plot
can be used for this analysis. It portrays the data and its patterns so that the user gets a better insight into the single
variable, and the graphical representation helps to view maximum, minimum and mean values etc. The univariate analysis and the
inferences from its visualization are described below.
Gender: Male is 63.5% and Female is 36.4%. The gender feature shows that the majority of students in the data set are male.
Nationality: Under the Nationality feature, KW has 37.3%, Jordan has 35.8% and Venezuela has the lowest share at 0.2%.
PlaceofBirth: The percentage distribution of Nationality and PlaceofBirth is almost the same, so as per the analysis either one of
the two columns could be dropped.
StageID: Out of the total, 51.7% of students are studying in MiddleSchool, 41.5% are in Lowerlevel and only 6.9% are in High School.
GradeID: Out of the total, G-02 is 30.6%, G-08 is 24.2%, G-07 is 21%, G-04 is 10%, G-06 is 6.7%, G-11 is 2.7%, G-12 is 2.3%,
G-09 is 1.04%, G-10 is 0.83% and G-05 is 0.63%.
SectionID: Out of the total, 59% are studying in section A, 34.8% in section B and 6.25% in section C.
Topic: Out of the total students, the topic of interest is IT for 19.8%, French for 13.5%, Arabic for 12.3%, Science for 10.6%,
English for 9.8%, Biology for 6.25%, Spanish for 5.2%, Geology and Chemistry for 5% each, Quran for 4.58%, Mathematics for 4.37%
and History for 3.95%.
Semester: 51% of students are in the First Semester and 48.95% are in the Second Semester.
Relation: The parent responsible for the student can be either Father or Mum. Out of the total, 58.9% is Father and 41.04% is Mum.
ParentAnsweringSurvey: Whether the parent answered the survey on school improvement is an important factor; 56.25% answered
‘YES’ and 43.75% answered ‘NO’.
ParentschoolSatisfaction: This is also an important factor and helps to identify whether the student will continue in the same
school or not. Out of the total, 61% of opinions towards the school were Good and the remaining 39% were Bad.
StudentAbsenceDays: Out of the total, 60% of students are regular and 40% have taken more than 7 days of leave. Female students
have better attendance than male students.
StudentAbsenceDays with respect to gender:
StudentAbsenceDays   Male   Female
Under 7              160    129
Above 7              145    46
Class: Out of the total, a Low Level score is acquired by 26.5%, a Medium Level score by 44% and a High Level score by 30% of
students.
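The percentage shares reported in the univariate analysis can be re-derived from raw counts. The sketch below uses the four counts of the StudentAbsenceDays-by-gender table (160, 129, 145, 46); the helper names marginal and percentage_shares are illustrative assumptions, not functions from the study.

```python
# Counts taken from the StudentAbsenceDays-by-gender table of the paper.
absence_by_gender = {
    ("Under 7", "Male"): 160,
    ("Under 7", "Female"): 129,
    ("Above 7", "Male"): 145,
    ("Above 7", "Female"): 46,
}


def marginal(table, axis):
    """Sum the cross-tab over the other axis (0 = absence, 1 = gender)."""
    totals = {}
    for key, count in table.items():
        totals[key[axis]] = totals.get(key[axis], 0) + count
    return totals


def percentage_shares(counts):
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


gender_counts = marginal(absence_by_gender, axis=1)
absence_counts = marginal(absence_by_gender, axis=0)
gender_share = percentage_shares(gender_counts)
absence_share = percentage_shares(absence_counts)
print(gender_counts, gender_share)
print(absence_counts, absence_share)
```

The four counts add up to the 480 instances of Table 1. The Female share is 36.46% before rounding, which is consistent with the 36.4% reported above if that figure was truncated rather than rounded.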
Figure 16 : Bivariate Analysis – Gender & Class
Figure 17 : Bivariate Analysis – Stage ID & Class
Figure 18 : Bivariate Analysis – Section ID & Class
Figure 19 : Bivariate Analysis – Semester & Class
Figure 24 : Bivariate Analysis – raisedhands & Class
Figure 25 : Bivariate Analysis – Visited Resources & Class
Nationality (Table 5 : Nationality & Class Score): With respect to Nationality compared with Class, Jordan and Egypt have the
highest percentages compared to other countries.
PlaceofBirth
GradeID
Topic
ParentAnsweringSurvey (Table 13 : ParentAnsweringSurvey & Class Score): With respect to ParentAnsweringSurvey compared with
Class, there were more ‘Yes’ answers for H and M and fewer for L.
ParentschoolSatisfaction (Table 14 : ParentSchoolSatisfaction & Class Score): With respect to ParentschoolSatisfaction compared
with Class, a large majority of parents are satisfied with the education received. The count of least satisfied parents is
comparatively low.
StudentAbsenceDays (Table 15 : StudentAbsenceDays & Class Score): The biggest visual trend is how frequently the student was
absent. Over 90% of the students who did poorly were absent more than seven times, while almost none of the students who did
well were absent more than seven times.
Raisedhands
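A bivariate comparison of a categorical feature with the Class target amounts to a cross-tabulation. The following sketch shows the idea under stated assumptions: the eight records are hypothetical and are not rows of the data set, and the helper names crosstab and share_within_target are illustrative.

```python
from collections import Counter


def crosstab(records, feature, target):
    """Count how often each (feature value, target value) pair occurs."""
    return Counter((record[feature], record[target]) for record in records)


def share_within_target(table, target_value):
    """Percentage of each feature value among rows with the given target."""
    subset = {f: c for (f, t), c in table.items() if t == target_value}
    total = sum(subset.values())
    return {f: 100 * c / total for f, c in subset.items()}


# Hypothetical records, not rows from the actual data set.
records = [
    {"StudentAbsenceDays": "Above-7", "Class": "L"},
    {"StudentAbsenceDays": "Above-7", "Class": "L"},
    {"StudentAbsenceDays": "Above-7", "Class": "L"},
    {"StudentAbsenceDays": "Under-7", "Class": "L"},
    {"StudentAbsenceDays": "Under-7", "Class": "H"},
    {"StudentAbsenceDays": "Under-7", "Class": "H"},
    {"StudentAbsenceDays": "Above-7", "Class": "M"},
    {"StudentAbsenceDays": "Under-7", "Class": "M"},
]
table = crosstab(records, "StudentAbsenceDays", "Class")
low_share = share_within_target(table, "L")
print(low_share)
```

Normalising within one class, as share_within_target does, is the kind of calculation behind a statement such as "over 90% of the students who did poorly were absent more than seven times".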
5) Correlation: Correlation [10] is a bivariate analysis that measures the strength of association between two variables and the
direction of the relationship. The correlation value lies between -1 and +1.
The types of correlation are:
Numeric vs Numeric                          : Pearson
Categorical (Binary Feature) vs Numerical   : Pointbiserialr
Ordinal with Ordinal                        : Spearman Rho
Categorical vs Categorical                  : Cross Tab
Different types of correlation have been applied depending on the type of variable. For the given data set, the correlation methods
adopted are depicted in Table 16.
Table 16 : Correlation Methods Applied for the Dataset
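The three coefficient-based methods listed above are closely related, which a short standard-library sketch can make concrete: Spearman's rho is the Pearson coefficient computed on ranks, and the point-biserial coefficient is the Pearson coefficient with one variable coded as 0/1. The function names below are illustrative; the name "Pointbiserialr" in the paper presumably refers to a library routine such as scipy.stats.pointbiserialr, but that is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def point_biserial(binary, numeric):
    """Point-biserial correlation: Pearson with a 0/1-coded variable."""
    return pearson(binary, numeric)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
print(point_biserial([0, 0, 1, 1], [1, 2, 3, 4]))
```

All three values fall in the interval from -1 to +1. A perfectly monotonic but non-linear relationship, as in the second call, still yields a Spearman coefficient of 1.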
A. Feature Importance
1) Random Forest Feature Importance [14]: Random forests are among the most popular machine learning methods thanks to
their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection:
mean decrease impurity and mean decrease accuracy.
2) Experimental Results after Feature Engineering: After the Feature Engineering process, the data set is divided into a training
set and a test set: the training set is 70% of the whole data set and the remaining 30% is used as the test set. The random
state is set to 50 here, whereas in the previous phase it was set to 0.
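The 70/30 split with a fixed random state, and the "mean decrease accuracy" idea from the feature importance discussion, can be sketched with the Python standard library alone. Several assumptions apply: the study does not state which library performed the split, so a seeded shuffle stands in for it and will not reproduce the study's exact partition; the fixed rule-based predictor replaces a trained random forest purely for illustration; and the toy rows and all function names are hypothetical.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=50):
    """Shuffle row indices with a fixed seed and cut them 70/30."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, seed=50):
    """Mean decrease accuracy: accuracy drop when one column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + (value,) + row[j + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return drops


# Split sizes for the 480 instances of the data set.
train_idx, test_idx = split_indices(480)
print(len(train_idx), len(test_idx))

# Hypothetical toy data: column 0 decides the label, column 1 is irrelevant.
rows = [(i % 2, i % 3) for i in range(30)]
labels = ["L" if absent else "H" for absent, _ in rows]


def rule(row):
    """Stand-in predictor (not a random forest): looks at column 0 only."""
    return "L" if row[0] == 1 else "H"


_, toy_test_idx = split_indices(len(rows))
test_rows = [rows[i] for i in toy_test_idx]
test_labels = [labels[i] for i in toy_test_idx]
importances = permutation_importance(rule, test_rows, test_labels)
print(importances)
```

With 480 instances the split yields 336 training rows and 144 test rows. Shuffling the irrelevant column leaves the accuracy unchanged, so its importance is zero, while the informative column can only lose accuracy when shuffled.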
V. CONCLUSION
Machine Learning methodology is being adopted rapidly: a machine can predict the result of a system by itself, can be trained on
data over a period of time, and the trained model can be tested on a different set of data to show that it works efficiently and
effectively. In this research study, Logistic Regression obtained a training score of 87.20 and a testing score of 86.81; the
closeness of the two scores indicates that the model works effectively without a notable bias or variance problem. KNN and Decision
Tree (Entropy) also perform well, while the other algorithms implemented in this study need further feature engineering and
stronger data analysis. Model deployment was done for all algorithms and a sample input was given for evaluation, which was
classified correctly by all algorithms.
VI. FUTURE SCOPE
The present study, predicting the academic performance of students with respect to various features, has produced positive
results. This research work improves the process of predicting student performance. In future, this work can be extended by using
other feature(s) as the target variable.
A. Other features, such as financial factors, physical health factors and food habits, can also be included in upcoming research
studies.
B. These factors can also have a direct or indirect impact on the academic performance of the student.
C. Since the present study focused on predicting the academic performance [5] of the student, other factors can also be included to
predict the performance of the student not only from an academic point of view but also from a behavioral perspective.
REFERENCES
[1] Smola, Alex, and S.V.N. Vishwanathan. Introduction to Machine Learning. Cambridge University Press, 2008.
[2] Amjad Abu Saa. (2016) “Educational Data Mining & Students’ Performance Prediction” International Journal of Advanced Computer Science and
Applications, Vol. 7, No. 5, 2016.
[3] Ahmed Mueen, Bassam Zafar and Umar Manzoor. (2016) “Modeling and Predicting Students’ Academic Performance Using Data Mining Techniques” I.J.
Modern Education and Computer Science, 2016, 11, 36-42.
[4] Bhrigu Kapur, Nakin Ahluwalia and Sathyaraj R, “Comparative Study on Marks Prediction using Data Mining and Classification Algorithms”, International
Journal of Advanced Research in Computer Science, 8 (3), March-April 2017,632-636
[5] Prasada Rao, K. , M. V.P. Chandra Sekhara, and B. Ramesh. "Predicting Learning Behavior of Students using Classification Techniques." International
Journal of Computer Applications (0975 – 8887) Volume 139 – No.7, April 2016.
[6] Amrieh, E. A., Hamtini, T. & Aljarah, I. (2016). Mining educational data to predict Student’s academic performance using ensemble methods. International
Journal of Database Theory and Application, 9(8), pp. 119–136. doi: 2016.9.8.13.
[7] Sundar PVP. A Comparative Study For Predicting Students Academic Performance using Bayesian Network Classifiers. IOSR Journal of Engineering. 2013
Feb; 3(2):37–42.
[8] S. T. Hijazi, and R. S. M. M. Naqvi, “Factors affecting student’s performance: A Case of Private Colleges”, Bangladesh e-Journal of Sociology, Vol. 3, No. 1,
2006
[9] C. Romero, “Educational Data Mining: A Review of the State of the Art”, IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and
Reviews, Vol. 40, 2010.
[10] https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
[11] https://www.kdnuggets.com/2018/12/feature-engineering-explained.html
[12] https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
[13] https://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf
[14] https://blog.datadive.net/selecting-good-features-part-iii-random-forests/