Water Quality Classification Using Machine Learning
Abstract— Water quality is crucial as it directly affects the ecosystem and human health. However, current water quality classification methods are inefficient because they do not compare prediction accuracy between machine learning methods. In this regard, the objective of this study is to classify water quality based on the proposed machine learning tools. To fulfill that, a preliminary study was conducted by collecting related information in the research domain through articles, electronic books, and online databases. The data for the prototype's dataset were obtained from an electronic book published by the Pakistan Council of Research in Water Resources in 2021. Subsequently, the data pre-processing phase was conducted using WEKA software, which includes the crucial steps to transform the data into a cleaner format and make the model more accurate. The model for each technique was developed using Python in Jupyter Notebook. The accuracy score for each model was also computed in this phase. The findings of this research show that the Decision Tree model performs best, with an accuracy of 97.37%, compared to the Support Vector Machine and K-Nearest Neighbour models, with accuracies of 95.69% and 74.72%, respectively. Consequently, implementing a multi-class classification system can help future researchers classify more accurately and reduce the misclassification of water quality.
Keywords— machine learning, classification, water quality, decision tree
I. INTRODUCTION
Water is an essential resource for life since it is required for the existence of almost all living organisms, including humans [1]. It is essential to drink clean water since it assists in the body's elimination of toxins, contributes to the preservation of health, and protects against dangerous illnesses. Diseases such as cholera, diarrhea, dysentery, hepatitis A, typhoid, and polio are connected to polluted water and inadequate sanitation, as stated by WHO [2][3]. As part of the Sustainable Development Goal (SDG) number 6, the Joint Monitoring Programme for Water Supply and Sanitation (JMP) set an objective that all people should have the right to access safe and affordable drinking water by the year 2030. This goal was designated as a target by the JMP.
In the field, hydrologists collect water samples from various water sources such as taps, tube wells, distribution networks, hand pump/dug wells, streams, springs and dams, rivers, and lakes [4]. The water samples collected are kept in a plastic or glass container to be transported to laboratories, where their parameters are analyzed. To determine water quality, water scientists analyze the water samples based on biological, physical, and chemical parameters and decide whether or not they follow the regulations and standards. Unfortunately, not all countries use the standards set by the WHO. Some countries have their own regulations and standards for water quality. Thus, having Guidelines for Drinking Water Quality (GDWQ) is not mandatory [5]. The guideline values are not meant to be used entirely as regulations and standards at the national or subnational level. It is considered impossible to define a range of conditions that will influence the water parameters for a particular country, as each includes different biological, physical, and chemical parameters that are considered essential for the country's GDWQ. Some of the factors that affect the difference in the water parameters are the environmental conditions, agricultural and industrial activities, and the sources that are readily accessible [5].
Regulations and standards are the most critical aspects of deciding the quality of the water. Water scientists face a tough and time-consuming task when they must classify the quality of a large number of water samples based on the permissible limits of biological, physical, and chemical parameters. As a result, it is necessary to conduct monitoring and analysis to ascertain whether or not the water source is consumable without risk in one's day-to-day life. Several strategies, including machine learning, the Internet of Things (IoT), and cloud computing, have been put into practice to monitor the water's quality [6][7][8][9]. However, current water quality classification methods are inefficient because they do not compare prediction accuracy between machine learning methods.
Authorized licensed use limited to: MULTIMEDIA UNIVERSITY. Downloaded on December 01,2024 at 09:16:45 UTC from IEEE Xplore. Restrictions apply.
Notebook. Subsequently, the cleaned data are utilized in the
development of a classification model using machine learning
as further explained in the following subsection.
Figure 3 shows a bar chart of the distribution of water samples by quality. The x-axis represents the safety category (safe or unsafe), and the y-axis represents the number of samples in the dataset that fall into each category.
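The counts behind a chart of this kind can be sketched in a few lines of Python. This is a minimal illustration only: the label values below are made up for the example and are not taken from the dataset, and the helper name `class_distribution` is hypothetical.

```python
from collections import Counter

# Hypothetical class labels for illustration; NOT the paper's dataset.
labels = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]


def class_distribution(labels):
    """Return the number of samples per class (the y-axis values)."""
    return Counter(labels)


counts = class_distribution(labels)

# Text rendering of the bar chart: one bar per category (x-axis),
# with the bar length equal to the sample count (y-axis).
for category in sorted(counts):
    print(f"{category:<7}{'#' * counts[category]} ({counts[category]})")
```

In a notebook, the same counts could be handed to a plotting call such as matplotlib's `pyplot.bar` to draw the bars graphically.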
variable would also increase. A correlation coefficient of 0, however, indicates that there is no correlation between the two water parameters. In the matrix, Total Dissolved Solids correlates strongly with Hardness and Chloride, with values of 0.98 and 0.97, respectively. Hardness and Escherichia coli show no correlation, with a value of 0.00. Finally, the water parameters with the strongest negative correlation are pH with Hardness and Arsenic, with a value of -0.17.
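How such a matrix is obtained can be illustrated with a small library-free sketch of the Pearson correlation coefficient, which is assumed here to be the coefficient behind the reported values (the text does not name it). The column names and numbers are invented for the example, and `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up measurements for illustration; NOT the paper's dataset.
samples = {
    "tds": [300.0, 450.0, 600.0, 750.0, 900.0],
    "hardness": [150.0, 230.0, 310.0, 370.0, 460.0],
    "ph": [7.9, 7.2, 7.6, 7.1, 7.4],
}

matrix = correlation_matrix(samples)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A value near +1 means the two parameters rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.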
B. Model Development
For the DT, KNN, and SVM techniques, the default parameter settings are presented in Table 1.
TABLE 1. DEFAULT PARAMETERS FOR DT, KNN, AND SVM
Figure 8 shows the accuracy comparison of the DT technique using the default parameters and after hyperparameter tuning.
95.69%. The lowest is the default-parameter model with the 70:30 split ratio, with an accuracy of 60.31%.
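The general shape of such a comparison, in which one model is evaluated over several train/test split ratios and hyperparameter values, can be sketched as follows. Several things here are assumptions made to keep the example short and library-free: a small K-Nearest Neighbour classifier stands in for the DT, the data are synthetic, and the 80:20 and 90:10 ratios and the candidate values of `k` are illustrative choices that are not taken from the paper.

```python
import math
import random


def split(rows, train_fraction, seed=0):
    """Shuffle rows and split them; 0.7 gives a 70:30 train/test split."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train, features, k):
    """Majority vote among the k training rows nearest to `features`.

    Ties are broken in favour of the alphabetically first label.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)


def accuracy(train, test, k):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def make_rows(n=60, seed=1):
    """Synthetic two-class data (two Gaussian blobs); NOT the paper's dataset."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = "safe" if i % 2 == 0 else "unsafe"
        centre = 0.0 if label == "safe" else 3.0
        rows.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return rows


rows = make_rows()
results = {}
for train_fraction in (0.7, 0.8, 0.9):      # assumed split ratios
    train, test = split(rows, train_fraction)
    for k in (1, 3, 5, 7):                  # assumed hyperparameter grid
        results[(train_fraction, k)] = accuracy(train, test, k)

best = max(results, key=results.get)
print("best (train fraction, k):", best, "accuracy:", round(results[best], 4))
```

Each entry of `results` corresponds to one bar in a default-versus-tuned comparison chart; a fuller experiment would typically use cross-validation rather than a single hold-out split per ratio.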
After comparing the accuracy of all techniques, the DT is the best model, as it achieves the highest accuracy in classifying water quality, at 97.37%. The result indicates that the DT performs well on datasets that consist of two class labels. Therefore, the technique is used for the primary model development of this project. Figure 11 shows the performance evaluation of the DT model using the confusion matrix.
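For a two-class problem, the confusion matrix and the accuracy derived from it can be computed as in the sketch below. The label lists are invented for the example, treating "safe" as the positive class is an assumption, and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, positive="safe"):
    """Count true/false positives and negatives for a two-class problem."""
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            cm["tp"] += 1      # positive sample correctly predicted
        elif a != positive and p == positive:
            cm["fp"] += 1      # negative sample predicted as positive
        elif a == positive and p != positive:
            cm["fn"] += 1      # positive sample predicted as negative
        else:
            cm["tn"] += 1      # negative sample correctly predicted
    return cm


def accuracy(cm):
    """Accuracy = correct predictions / all predictions."""
    total = cm["tp"] + cm["fp"] + cm["fn"] + cm["tn"]
    return (cm["tp"] + cm["tn"]) / total


# Made-up labels for illustration; NOT the paper's results.
actual = ["safe", "safe", "unsafe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
predicted = ["safe", "unsafe", "unsafe", "unsafe", "safe", "safe", "unsafe", "safe"]

cm = confusion_matrix(actual, predicted)
print(cm)            # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print(accuracy(cm))  # 0.75
```

The off-diagonal counts (`fp` and `fn`) are the misclassifications; in this setting, predicting "safe" for an unsafe sample is usually the more costly error.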
[5] WHO. 2018b. "Developing drinking water regulations and standards. General guidance with a special focus on countries with limited resources." In Routledge Handbook of Water and Health, 1–68. http://apps.who.int/iris/bitstream/handle/10665/272969/9789241513944-eng.pdf.
[6] Ragi, Nikhil M., Ravishankar Holla, and G. Manju. "Predicting water quality parameters using machine learning." In 2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT), pp. 1109-1112. IEEE, 2019.
[7] Uddin, M. G., Nash, S., Rahman, A., & Olbert, A. I. (2023). Performance analysis of the water quality index model for predicting water state using machine learning techniques. Process Safety and Environmental Protection, 169, 808-828.
[8] Sarker, Iqbal H., Asif Irshad Khan, Yoosef B. Abushark, and Fawaz Alsolami. "Internet of Things (IoT) security intelligence: a comprehensive overview, machine learning solutions and research directions." Mobile Networks and Applications (2022): 1-17.
[9] Azrour, M., Mabrouki, J., Fattah, G., Guezzaz, A., & Aziz, F. (2022). Machine learning algorithms for efficient water quality prediction. Modeling Earth Systems and Environment, 8(2), 2793-2801.
[10] WHO. 2018a. "A global overview of national regulations and standards for drinking-water quality." Verordnung Über Die Qualität von Wasser Für Den Menschlichen Gebrauch (Trinkwasserverordnung - TrinkwV
[11] Partyka, M. L., & Bond, R. F. (2022). Wastewater reuse for irrigation of produce: a review of research, regulations, and risks. Science of the Total Environment, 828, 154385.
[12] Han, X., Liu, X., Gao, D., Ma, B., Gao, X., & Cheng, M. (2022). Costs and benefits of the development methods of drinking water quality index: A systematic review. Ecological Indicators, 144, 109501.
[13] Sarker, Iqbal H. "Machine learning: Algorithms, real-world applications, and research directions." SN computer science 2, no. 3 (2021): 160.
[14] Gordan, Meisam, Saeed-Reza Sabbagh-Yazdi, Zubaidah Ismail, Khaled Ghaedi, Páraic Carroll, Daniel McCrum, and Bijan Samali. "State-of-the-art review on advancements of data mining in structural health monitoring." Measurement 193 (2022): 110939.
[15] Sharifani, K., & Amini, M. (2023). Machine Learning and Deep Learning: A Review of Methods and Applications. World Information Technology and Engineering Journal, 10(07), 3897-3904.
[16] Ghobadi, F., & Kang, D. (2023). Application of Machine Learning in Water Resources Management: A Systematic Literature Review. Water, 15(4), 620.
[17] Nasir, N., Kansal, A., Alshaltone, O., Barneih, F., Sameer, M., Shanableh, A., & Al-Shamma'a, A. (2022). Water quality classification using machine learning algorithms. Journal of Water Process Engineering, 48, 102920.
[18] Gorgan-Mohammadi, F., Rajaee, T., & Zounemat-Kermani, M. (2023). Decision tree models in predicting water quality parameters of dissolved oxygen and phosphorus in lake water. Sustainable Water Resources Management, 9(1), 1.
[19] Nababan, A. A., Khairi, M., & Harahap, B. S. (2022). Implementation of K-Nearest Neighbors (KNN) algorithm in classification of data water quality. Jurnal Mantik, 6(1), 30-35.
[20] Juna, A., Umer, M., Sadiq, S., Karamti, H., Eshmawi, A. A., Mohamed, A., & Ashraf, I. (2022). Water quality prediction using KNN imputer and multilayer perceptron. Water, 14(17), 2592.
[21] Derdour, A., Jodar-Abellan, A., Pardo, M. Á., Ghoneim, S. S., & Hussein, E. E. (2022). Designing Efficient and Sustainable Predictions of Water Quality Indexes at the Regional Scale Using Machine Learning Algorithms. Water, 14(18), 2801.
[22] Shamsuddin, I. I. S., Othman, Z., & Sani, N. S. (2022). Water quality index classification based on machine learning: A case from the Langat River Basin model. Water, 14(19), 2939.
[23] Oğuz, A., & Ertuğrul, Ö. F. (2023). A survey on applications of machine learning algorithms in water quality assessment and water supply and management. Water Supply, 23(2), 895-922.
[24] Najwa Mohd Rizal, N., Hayder, G., Mnzool, M., Elnaim, B. M., Mohammed, A. O. Y., & Khayyat, M. M. (2022). Comparison between regression models, support vector machine (SVM), and artificial neural network (ANN) in river water quality prediction. Processes, 10(8), 1652.
[25] Ilić, M., Srdjević, Z., & Srdjević, B. (2022). Water quality prediction based on Naïve Bayes algorithm. Water Science and Technology, 85(4), 1027-1039.
[26] Malek, N. H. A., Wan Yaacob, W. F., Md Nasir, S. A., & Shaadan, N. (2022). Prediction of water quality classification of the Kelantan River Basin, Malaysia, using machine learning techniques. Water, 14(7), 1067.
[27] Priyadarshini, I., Alkhayyat, A., Obaid, A. J., & Sharma, R. (2022). Water pollution reduction for sustainable urban development using machine learning techniques. Cities, 130, 103970.
[28] Cengiz, A. V. C. I., Budak, M., Yağmur, N., & Balçik, F. (2023). Comparison between random forest and support vector machine algorithms for LULC classification. International Journal of Engineering and Geosciences, 8(1), 1-10.
[29] Gakii, C., & Jepkoech, J. (2019). A classification model for water quality analysis using decision tree.
[30] Park, J., Lee, W. H., Kim, K. T., Park, C. Y., Lee, S., & Heo, T. Y. (2022). Interpretation of ensemble learning to predict water quality using explainable artificial intelligence. Science of the Total Environment, 832, 155070.
[31] Ahmarofi, A.A., Kassa, F.M., Ishak, M. K. (2021). "Predicting the Cycle Time at a Production Line Through the Development of the 3-3-1 Multilayer Perceptron Artificial Neural Networks with Formulated Momentum Rate." In Intelligent Manufacturing and Mechatronics: Proceedings of SympoSIMM 2020, pp. 165-173. Singapore: Springer Singapore, 2021.
[32] Ahmarofi, A. A., Ramli, R., Abidin, N. Z., Jamil, J. M., & Shaharanee, I. N. (2020). Variations on the number of hidden nodes through multilayer perceptron networks to predict the cycle time. Journal of Information and Communication Technology, 19(1), 1-19. https://doi.org/10.32890/jict2020.19.1.1
[33] Patil, D., Kar, S., & Gupta, R. (2023). Classification and Prediction of Developed Water Quality Indexes Using Soft Computing Tools. Water Conservation Science and Engineering, 8(1), 16.
[34] Khoi, D. N., Quan, N. T., Linh, D. Q., Nhi, P. T. T., & Thuy, N. T. D. (2022). Using machine learning models for predicting the water quality index in the La Buong River, Vietnam. Water, 14(10), 1552.
[35] Mamat, N., & Razali, S. F. M. (2023). Comparisons of Various Imputation Methods for Incomplete Water Quality Data: A Case Study of The Langat River, Malaysia. Jurnal Kejuruteraan, 35(1), 191-201.
[36] Pakistan Council of Research in Water Resources. (2021). PCRWR Annual Report 2020-2021.
[37] Rasool, U., Yin, X., Xu, Z., Rasool, M. A., Senapathi, V., Hussain, M., ... & Trabucco, J. C. (2022). Mapping of groundwater productivity potential with machine learning algorithms: A case study in the provincial capital of Baluchistan, Pakistan. Chemosphere, 303, 135265.
[38] Ahmed, M., Mumtaz, R., & Hassan Zaidi, S. M. (2021). Analysis of water quality indices and machine learning techniques for rating water pollution: A case study of Rawal Dam, Pakistan. Water Supply, 21(6), 3225-3250.
[39] Khan, M. T., Shoaib, M., Hammad, M., Salahudin, H., Ahmad, F., & Ahmad, S. (2021). Application of machine learning techniques in rainfall–runoff modelling of the soan river basin, Pakistan. Water, 13(24), 3528.
[40] Farooq, M. U., Zafar, A. M., Raheem, W., Jalees, M. I., & Aly Hassan, A. (2022). Assessment of algorithm performance on predicting total dissolved solids using artificial neural network and multiple linear regression for the groundwater data. Water, 14(13), 2002.
[41] Adnan, R. M., Mostafa, R. R., Elbeltagi, A., Yaseen, Z. M., Shahid, S., & Kisi, O. (2022). Development of new machine learning model for streamflow prediction: Case studies in Pakistan. Stochastic Environmental Research and Risk Assessment, 1-35.