Abstract
Predicting students’ academic performance has attracted considerable research interest in recent years, as many institutions seek to improve both student performance and the quality of education. Various data mining techniques can be used to analyze and predict students’ performance, and they also allow instructors to identify factors that may affect students’ final marks. To that end, this work analyzes two undergraduate datasets from two different universities and aims to predict students’ performance at two stages of course delivery (20% and 50%, respectively). This analysis guides the choice of appropriate machine learning algorithms and the optimization of their parameters. Furthermore, this work adopts a systematic multi-split approach based on the Gini index and p-value. This is done by optimizing a suitable bagging ensemble learner built from any combination of six potential base machine learning algorithms. Experimental results show that the proposed bagging ensemble models achieve high accuracy for the target group on both datasets.
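To make the phrase "any combination of six potential base machine learning algorithms" concrete, the following is a minimal sketch of an exhaustive search over learner subsets, not the authors' implementation. It assumes each base learner has already produced class predictions on a held-out set, and it scores every non-empty subset (2^6 − 1 = 63 candidates for six learners) by majority-vote accuracy. The learner names (`m1` to `m6`), the class labels, and the helper functions are hypothetical, and the sketch deliberately leaves out bootstrap resampling as well as the Gini index and p-value criteria mentioned above.

```python
from itertools import combinations


def majority_vote(predictions):
    """Combine several learners' label lists into one list by per-sample vote.

    Ties resolve to the label that was voted first in learner order.
    """
    voted = []
    for votes in zip(*predictions):
        counts = {}
        for label in votes:
            counts[label] = counts.get(label, 0) + 1
        # max() returns the first key with the highest count (insertion order).
        voted.append(max(counts, key=counts.get))
    return voted


def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def best_combination(learner_predictions, actual):
    """Score every non-empty subset of learners by majority-vote accuracy.

    Returns (best subset, its accuracy, number of subsets evaluated).
    On equal scores the earlier (smaller) subset is kept.
    """
    names = sorted(learner_predictions)
    best_combo, best_score, evaluated = None, -1.0, 0
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            evaluated += 1
            voted = majority_vote([learner_predictions[n] for n in combo])
            score = accuracy(voted, actual)
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score, evaluated


if __name__ == "__main__":
    # Hypothetical held-out labels and predictions (illustrative data only).
    actual = ["good", "fair", "weak", "good", "fair", "weak"]
    learner_predictions = {
        "m1": ["good", "fair", "good", "good", "weak", "weak"],
        "m2": ["good", "weak", "weak", "fair", "fair", "weak"],
        "m3": ["fair", "fair", "weak", "good", "fair", "good"],
        "m4": ["good", "good", "weak", "good", "fair", "fair"],
        "m5": ["weak", "fair", "fair", "good", "good", "weak"],
        "m6": ["good", "fair", "weak", "weak", "fair", "weak"],
    }
    combo, score, evaluated = best_combination(learner_predictions, actual)
    print(f"evaluated {evaluated} subsets; best {combo} with accuracy {score:.2f}")
```

Because single-learner subsets are included in the search, the selected ensemble is never less accurate on the held-out set than the best individual learner. Whether that advantage carries over to unseen data is a separate question, which a multi-split evaluation is meant to address.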
Acknowledgments
This study was funded by the Ontario Graduate Scholarship (OGS) Program.
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Informed Consent
This study does not involve any experiments on animals.
About this article
Cite this article
Injadat, M., Moubayed, A., Nassif, A.B. et al. Multi-split optimized bagging ensemble model selection for multi-class educational data mining. Appl Intell 50, 4506–4528 (2020). https://doi.org/10.1007/s10489-020-01776-3
DOI: https://doi.org/10.1007/s10489-020-01776-3