Flood Prediction Using Ensemble Machine Learning Model

Tanvir Rahman, Miah Mohammad Asif Syeed, Maisha Farzana
Department of Computer and Information Sciences; Department of Computer Science and ...; Department of Computer Science and ...
dataset represents the amount of precipitation received in a particular area during a specific time frame, relative to the long-term average. This dataset also includes categorical values such as 'True' and 'False' to indicate whether a flood occurred in a given year. A range of +/- 19% of the average of 2039.6 mm (during the period of June to September) is considered a normal monsoon for the state. Any annual rainfall that exceeds this threshold is considered a potential indicator of flood occurrence. The dataset preprocessing phase comprises three main stages, which are as follows:

Normalization:
To prepare the dataset for analysis, normalization, also known as feature encoding, is utilized. In this study, two attributes of the dataset contain string-type data: 'sub-division' and 'flood'. These attributes are encoded accordingly. For 'flood', which has only two unique values, the values are mapped to a binary 0/1 label. Figure 4 shows the dataset after feature selection.

Feature Engineering:
To ensure that the dataset is unbiased and suitable for use with the models, the Standard Scaler technique was applied: each feature is centered on its mean and scaled to unit variance, so that all features are on the same scale. The dataset was then divided into training and testing sets in an 80:20 ratio, and the features were standardized with the Standard Scaler so that the models are trained on an unbiased dataset.
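For concreteness, the sketch below implements this preprocessing pipeline with scikit-learn. The file name kerala_rainfall.csv and the exact column labels are assumptions for illustration; only the attribute names 'sub-division' and 'flood' come from the text above.

```python
# A minimal preprocessing sketch, assuming a CSV with the attributes
# described above; file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("kerala_rainfall.csv")  # hypothetical file name

# Normalization step: encode the two string-type attributes.
# 'flood' has only two unique values, so it becomes a 0/1 label.
for col in ["sub-division", "flood"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns=["flood"])
y = df["flood"]

# Feature engineering step: 80:20 train/test split, then standardize
# features to zero mean and unit variance (Standard Scaler).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```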
V. DESCRIPTION OF MACHINE LEARNING MODELS USED IN THIS STUDY

Fig 5. Diagram of Stacked Generalization [15]
Support Vector Classifier:
The SVC algorithm uses regression to identify a decision function for a given sample x, which is given by:

f(x) = \sum_{i \in SV} y_i \alpha_i K(x_i, x) + b

The dual coefficients in the SVC algorithm are denoted as \alpha_i, and they are bounded by a value C. The kernel K(x_i, x) represents the similarity measure between the input vector x and a training example x_i, and the independent term b is estimated during the training process [8].
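This decision function maps directly onto scikit-learn's fitted SVC attributes: dual_coef_ stores the products y_i \alpha_i, support_vectors_ the x_i, and intercept_ the term b. The sketch below recomputes the function by hand for an RBF kernel; the gamma value is an illustrative choice, not the study's configuration.

```python
# Sketch: recompute the SVC decision function f(x) = sum y_i*a_i*K(x_i,x) + b
import numpy as np
from sklearn.svm import SVC

gamma = 0.1  # illustrative RBF kernel width
svc = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)

def decision_function(x, model):
    # dual_coef_[0] holds y_i * alpha_i (each bounded by C) for every
    # support vector x_i; K(x_i, x) = exp(-gamma * ||x_i - x||^2).
    k = np.exp(-gamma * np.sum((model.support_vectors_ - x) ** 2, axis=1))
    return np.dot(model.dual_coef_[0], k) + model.intercept_[0]

x = X_test[0]
# Agrees with the library's own decision_function up to float error.
print(decision_function(x, svc), svc.decision_function([x])[0])
```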
K-Nearest Neighbor:
The KNN method is a non-parametric algorithm that predicts values of new data points for regression predictive problems by relying on feature similarity. According to [16], the algorithm assumes that a new data point is similar to the existing data and assigns it to the category it most resembles. By relying on feature similarity, KNN can quickly classify new data points into a well-defined category by calculating the distance between the new data point and all previous data points in the training set, using distance functions such as the Euclidean distance:

d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
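A minimal sketch of this distance and of a KNN classifier follows; scikit-learn's KNeighborsClassifier uses the Euclidean metric by default, and k = 5 is an illustrative choice rather than a parameter reported in the paper.

```python
# Sketch: Euclidean distance and a default KNN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean_distance(x, y):
    # d(x, y) = sqrt(sum_{i=1}^{m} (x_i - y_i)^2)
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

# Minkowski distance with p=2 (i.e., Euclidean) is the default metric.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.predict(X_test[:5]))
```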
Decision Tree Classifier:
By using decision trees, one can generate models that predict target variables by learning simple decision rules from the features of the dataset. These trees provide a graphical representation of the decisions made by predictive models, where internal nodes correspond to tests on the features, branches indicate the outcomes, and leaf nodes represent the final decision obtained from the feature computations [9]. The key aspect of decision trees is to create a series of splits that divide the data into two groups that are as homogeneous as possible. To determine group homogeneity, decision trees calculate the entropies of these groups. The entropy of a decision tree with C classes can be defined as:

\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i

Information gain is a key statistical property used in decision trees to measure the reduction in entropy. It is determined by subtracting the entropy of the dataset after splitting on a specified feature value from the entropy of the dataset before the split. Entropy values range between 0 and 1, where 1 represents maximum group impurity and 0 indicates fully pure groups. The probability of selecting an element of class i at random is represented by p_i.
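Both quantities can be computed directly from these definitions, as in the sketch below; the toy labels are illustrative only.

```python
# Sketch: entropy and information gain for a binary flood label.
import numpy as np

def entropy(labels):
    # Entropy = -sum_{i=1}^{C} p_i * log2(p_i), where p_i is the
    # probability of selecting an element of class i at random.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    # Entropy before the split minus the size-weighted entropy after it.
    n = len(parent)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - after

labels = np.array([0, 0, 0, 1, 1, 1])                    # toy labels
print(entropy(labels))                                   # 1.0: maximally impure
print(information_gain(labels, labels[:3], labels[3:]))  # 1.0: a pure split
```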
Binary Logistic Regression:
According to research published in [11], a logistic regression is a type of generalized linear model used when the dependent variable is binary. This model estimates the likelihood of the dependent variable being 1 based on the values of the independent variables. Unlike linear regression, a logistic regression uses a logit function to establish the relationship between the independent and dependent variables. The logit function is the natural logarithm of the odds ratio, which is calculated as the probability of success divided by the probability of failure. The logistic regression equation can be expressed in a general form:

\text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n

Here, p is the probability of the dependent variable being 1, x_1, x_2, ..., x_n are the independent variables, and \beta_0, \beta_1, \beta_2, ..., \beta_n are the coefficients of the model. A comparison between linear regression and logistic regression is graphically illustrated in Fig 6.

Fig 6. Distinguishing Between Linear and Logistic Regression [11]

The base equation for the logistic regression model is derived as follows. The value of the probability p is limited to a range between 0 and 1. To determine the odds of p, it is divided by (1 - p). Taking the logarithm of this ratio gives the log odds, or logit, denoted \log\left(\frac{p}{1-p}\right). The logistic function is then applied to the logit to obtain the final equation:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}

The equation above is used to determine the probability of a binary outcome (either 0 or 1) by analyzing a set of predictor variables. Logistic regression models help estimate the relationship between the predictor variables and the probability of the binary outcome. In the given dataset, the amount of annual rainfall serves as the independent variable, while the dependent variable indicates the occurrence of a flood based on the amount of rainfall.

A meta-model is used to generate an accurate final output by combining the predictions of multiple base-models. The meta-model is trained on the predictions made by the base-models on the unused training data. After training, the meta-model uses the predictions made by the base-models on new data to produce a final prediction by combining them with its own outputs.
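A short sketch of the logit, its inverse (the logistic function), and a fitted model follows, reusing the X_train/y_train split from the preprocessing sketch above.

```python
# Sketch: logit, logistic function, and flood probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p):
    # log odds: log(p / (1 - p))
    return np.log(p / (1 - p))

def logistic(z):
    # inverse of the logit: p = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

blr = LogisticRegression().fit(X_train, y_train)
# predict_proba gives p, the modeled probability of a flood (class 1).
p_flood = blr.predict_proba(X_test)[:, 1]
print(logistic(logit(p_flood[:3])))  # round-trips back to p_flood[:3]
```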
To improve the accuracy of the final model, the predictions made by multiple base models are combined using a meta-model in a process known as stacking. First, the base models are validated on a portion of the training data. Then, the predictions made by the base models on the remaining training data are used to train the meta-model, which learns to combine the predictions in a way that generates accurate predictions on new data. Stacking can identify predictions that have low correlation between the base models and combine their strengths to improve the overall accuracy of the final model.
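One way to realize this scheme is scikit-learn's StackingClassifier, which performs the cross-validated meta-model training described above internally. The sketch below mirrors the setup used in this study (four base models with a logistic-regression meta-model), but the hyperparameters are library defaults rather than the authors' reported choices.

```python
# Sketch: stacked generalization with four base models and a
# logistic-regression meta-model.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base_models = [
    ("knn", KNeighborsClassifier()),
    ("svc", SVC()),
    ("dtc", DecisionTreeClassifier()),
    ("blr", LogisticRegression()),
]

# cv=5: the meta-model is trained on out-of-fold base-model predictions,
# so it never sees predictions on data the base models were fitted on.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```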
VI. PERFORMANCE EVALUATION

The rainfall data analyzed in this project are collected into a CSV file. Rainfall was monitored from 1901 to 2017 in Kerala, India.

Predictive Model Classification:
The prediction was made using four different types of classifiers: K-Nearest Neighbors (KNN), Support Vector Classifier (SVC), Decision Tree Classifier (DTC), and Binary Logistic Regression, as well as a stacked model. The base models used in the stacked model were the aforementioned four classifiers, with the meta-model being Binary Logistic Regression.

1. The models were initially applied to the rainfall data of the monsoon period in Kerala, which lasts for five months: June, July, August, September, and October [12]. Therefore, only the rainfall data from these months were used for the analysis. The table below presents the outcomes of the models that were applied.

Predictive Models                  Accuracy    Standard Deviation
Decision Tree Classifier (DTC)     78.3%       0.217
Binary Logistic Regression         83.6%       0.164
Stacked Generalization             84.8%       0.159

2. The subsequent table shows the results of applying the same set of models to rainfall data from all months of the year, in order to improve accuracy.

Predictive Models                  Accuracy    Standard Deviation
K-Nearest Neighbors (KNN)          74.6%       0.172
Support Vector Classifier (SVC)    90.6%       0.111
Decision Tree Classifier (DTC)     77.2%       0.772
Binary Logistic Regression         93.0%       0.103
Stacked Generalization             93.3%       0.098

Fig 7. Whisker box plot for Standalone Model Accuracies on the Monsoon Rainfall Data
Fig 8. Whisker box plot for Standalone Model Accuracies on the 12 Months Rainfall Data
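Each table reports a mean accuracy and a standard deviation per model, which is consistent with repeated cross-validation; the exact protocol is not stated in the text, so the sketch below assumes repeated stratified 10-fold cross-validation and reuses base_models and stack from the stacking sketch above.

```python
# Sketch: mean accuracy and standard deviation via repeated
# stratified 10-fold cross-validation (assumed protocol).
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for name, model in base_models + [("stacked", stack)]:
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```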
VII. CONCLUDING REMARKS AND FUTURE PROSPECTS

This study focused on predicting floods using meteorological data through the development of an ensemble machine learning model. By comparing the performance of various models including KNN, SVC, decision trees, and logistic regression, we found that the ensemble model exhibited superior accuracy and precision compared to individual models. Additionally, the ensemble model outperformed previous studies in flood prediction using machine learning models.

There are several avenues for future research to enhance the accuracy of the proposed flood prediction model. Firstly, additional data sources such as soil moisture and land use will be incorporated. Secondly, the model's performance will be evaluated under different temporal and spatial scales. Thirdly, the use of other ensemble techniques, such as bagging and boosting, will be explored to further improve the model's accuracy. Finally, the development of an online flood prediction system based on the proposed model is planned to provide real-time flood warnings to local communities and authorities.

REFERENCES
[1] Floods: Occurrence and Distribution. (n.d.). Department of Geology, Aligarh Muslim University. Retrieved May 29, 2021, from http://www.geol-amu.org/notes/be1a-3-1.htm
[2] Flood. (n.d.). National Geographic. Retrieved June 2, 2021, from https://www.nationalgeographic.org/encyclopedia/flood/
[3] Jongman, B., Ward, P. J., & Aerts, J. C. J. H. (2012). Global exposure to river and coastal flooding: Long term trends and changes. Global Environmental Change, 22(4), 823-835. https://doi.org/10.1016/j.gloenvcha.2012.07.004
[4] Flooding will affect double the number of people worldwide by 2030. (2020). The Guardian. Retrieved May 29, 2021, from https://www.theguardian.com/environment/2020/apr/23/flooding-double-number-people-worldwide-2030
[5] Bangladesh – Floods Affect Over 1 Million People in 13 Districts. (2020, June 8). FloodList. Retrieved June 2, 2021, from http://floodlist.com/asia/bangladesh-floods-update-july-2020
[6] Rainfall Index. (n.d.). USDA. Retrieved September 25, 2021, from https://www.rma.usda.gov/en/Policy-and-Procedure/Insurance-Plans/Rainfall-Index
[7] Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization. (2020, April). Analytics Vidhya. Retrieved September 30, 2021, from https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
[10] Jha, V. (2018, June 17). Decision Tree Algorithm for a Predictive Model. TechLeer. https://www.techleer.com/articles/120-decision-tree-algorithm-for-a-predictive-model/
[11] Fernandes, A. A. T., Figueiredo Filho, D. B., Rocha, E. C. D., & Nascimento, W. D. S. (2020). Read this paper if you want to learn logistic regression. Revista de Sociologia e Política, 28(74), 2-3. https://doi.org/10.1590/1678-987320287406en
[12] Chaturvedi, A. (2021, May 30). Monsoon likely to arrive in Kerala on May 31, says IMD; heavy rainfall predicted in Karnataka from June 1. Hindustan Times. Retrieved September 30, 2021, from https://www.hindustantimes.com/india-news/monsoon-likely-to-arrive-in-kerala-on-may-31-says-imd-101622366991176.html
[13] Saini, A. (2021, August 3). Conceptual Understanding of Logistic Regression for Data Science Beginners. Analytics Vidhya. Retrieved September 25, 2021, from https://www.analyticsvidhya.com/blog/2021/08/conceptual-understanding-of-logistic-regression-for-data-science-beginners/; Khatun, F. (2021, May 31). Living with floods and reducing vulnerability in Bangladesh. The Daily Star. https://www.thedailystar.net/opinion/macro-mirror/news/living-floods-and-reducing-vulnerability-bangladesh-1950277
[14] Sankaranarayanan, S., Prabhakar, M., Satish, S., Jain, P., Ramprasad, A., & Krishnan, A. (2019). Flood prediction based on weather parameters using deep learning. Journal of Water and Climate Change, 11(4), 1766-1783. https://doi.org/10.2166/wcc.2019.321
[15] Ensemble Stacking for Machine Learning and Deep Learning. (2021, August). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/08/ensemble-stacking-for-machine-learning-and-deep-learning/
[16] Zhang, Z. (2016). Introduction to machine learning: k-nearest neighbors. Annals of Translational Medicine, 4(11), 218. https://atm.amegroups.com/article/view/10170/11310