Project-1 (Data Preprocessing)

Uploaded by

Arijeet ros

GIET UNIVERSITY, GUNUPUR

SCHOOL OF ENGINEERING AND TECHNOLOGY


DEPARTMENT OF CSE (AIML)

Step 1: Import Python Libraries:-


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

Step 2: Read the Dataset:-


df= pd.read_csv('kerala.csv')
df.head(5)

Step 3: Explore the Dataset:-


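1) df.info()       # column names, dtypes and non-null counts

2) df.shape        # (number of rows, number of columns)

3) df.describe()   # summary statistics for the numeric columns

4) df.corr(numeric_only=True)   # correlation of the numeric columns only (FLOODS still holds YES/NO text here)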

Replace:- To train the model, the target output must be numeric (0 and 1).
So we replace the values in the FLOODS column (YES, NO) with (1, 0)
respectively:
# Assign the result back to the column; inplace=True on a column selection
# triggers warnings in newer pandas versions
df['FLOODS'] = df['FLOODS'].map({'YES': 1, 'NO': 0})
df.head(5)

Null values:- To find the percentage of null values in each column of the dataset:

df.isnull().mean().sort_values(ascending=False) * 100

corr:- To identify the correlation between the columns using a heat map:

NAME OF THE STUDENT: Arijeet Mishra ROLL NO – 21CSEAIML008



corr = df.corr(numeric_only=True)

sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)
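Each cell of the heat map is a Pearson correlation coefficient, which is the default method of df.corr(). As a minimal sketch of what one cell holds, the coefficient can be computed by hand; the rainfall figures below are made up for illustration and are not taken from kerala.csv.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two columns around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly rainfall values (mm), for illustration only
jun = [600.0, 700.0, 800.0, 900.0]
jul = [650.0, 720.0, 850.0, 940.0]

r = pearson(jun, jul)
print(round(r, 3))
```

A value close to 1 means the two columns rise and fall together, a value close to -1 means one rises as the other falls, and a value near 0 means no linear relationship.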

Step 3: Feature Selection:-

Start by importing SelectKBest and the chi2 scoring function:

from sklearn.feature_selection import SelectKBest


from sklearn.feature_selection import chi2

Next, define X and Y:-

X = df.iloc[:, 1:14]   # all features

Y = df.iloc[:, -1]     # target output (FLOODS)

Select the top 3 features:-

best_features= SelectKBest(score_func=chi2, k=3)


fit= best_features.fit(X,Y)

Now we create data frames for the features and the score of each
feature:

df_scores= pd.DataFrame(fit.scores_)
df_columns= pd.DataFrame(X.columns)

Finally, we’ll combine all the features and their corresponding scores in
one data frame:

features_scores= pd.concat([df_columns, df_scores], axis=1)


features_scores.columns = ['Features', 'Score']
features_scores.sort_values(by='Score', ascending=False)   # highest-scoring features first
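To see what the chi2 score measures, here is a rough from-scratch sketch. It assumes the usual formulation for non-negative features: each feature's per-class total is compared with the total expected if the feature were independent of the class. The helper name chi2_scores and the toy data are hypothetical; this is an illustration, not scikit-learn's own code.

```python
def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature column against labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        feature_total = sum(column)
        score = 0.0
        for c in classes:
            # Share of samples belonging to class c
            class_prob = sum(1 for label in y if label == c) / n_samples
            # Feature total actually observed inside class c
            observed = sum(v for v, label in zip(column, y) if label == c)
            # Feature total expected if feature and class were independent
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Toy data: column 0 differs strongly between classes, column 1 does not
X_toy = [[10, 5], [12, 5], [30, 5], [28, 5]]
y_toy = [0, 0, 1, 1]

scores = chi2_scores(X_toy, y_toy)
# Index of the single best feature (k=1), highest score first
top_1 = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:1]
print(scores, top_1)
```

A feature whose totals are the same in both classes scores 0, so SelectKBest would rank it last.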

Step 4: Build the Model:-


X = df[['SEP', 'JUN', 'JUL']]   # the top 3 features
Y = df['FLOODS']                # the target output (1-D, as scikit-learn expects)

Splitting the dataset into train and test:-


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=100)
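With test_size=0.4, 40% of the rows are held out for testing and 60% are used for training, and the fixed random_state makes the shuffle repeatable. A simplified sketch of the idea in plain Python follows; split_indices is a hypothetical helper and will not reproduce scikit-learn's exact row order.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut them into (train, test) parts."""
    indices = list(range(n_samples))
    # A seeded generator gives the same shuffle on every run
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=10, test_size=0.4, seed=100)
print(len(train_idx), len(test_idx))

# Same seed, same split
repeat_train, repeat_test = split_indices(n_samples=10, test_size=0.4, seed=100)
```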


Create a logistic regression model:-

logreg= LogisticRegression()
logreg.fit(X_train,y_train)

We predict flood occurrence (1 or 0) for the test set using the logistic
regression model we created:-
y_pred=logreg.predict(X_test)
print (X_test) #test dataset
print (y_pred) #predicted values
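Behind the prediction, logistic regression turns a weighted sum of the features into a probability with the sigmoid function and predicts class 1 when that probability is above 0.5. The sketch below shows this step; the weights, bias and rainfall values are hypothetical and are not the coefficients fitted on kerala.csv.

```python
import math

def flood_probability(features, weights, bias):
    """Sigmoid of the linear score bias + sum(w * x)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for SEP, JUN, JUL (illustration only)
weights = [0.004, 0.003, 0.002]
bias = -4.0
sample = [300.0, 800.0, 700.0]   # made-up rainfall values in mm

p = flood_probability(sample, weights, bias)
label = 1 if p > 0.5 else 0
print(round(p, 3), label)
```

After fitting, the learned values are available as logreg.coef_ and logreg.intercept_.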

Step 5: Evaluate the Model’s Performance:-

• 5.1:- Mean Absolute Error (MAE):- MAE is a straightforward metric: the
average of the absolute differences between actual and predicted values over
the whole test set.

from sklearn.metrics import mean_absolute_error

print("MAE", mean_absolute_error(y_test, y_pred))
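As a small hand-computed illustration with made-up labels: for 0/1 class labels every absolute error is either 0 or 1, so MAE is simply the fraction of misclassified samples.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels: 2 of the 5 predictions are wrong
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]

mae = mean_absolute_error_manual(actual, predicted)
print(mae)  # 0.4
```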

• 5.2:- Mean Squared Error (MSE)

MSE is a popular and straightforward metric, a slight variation on mean
absolute error: it averages the squared differences between the actual and
predicted values.

from sklearn.metrics import mean_squared_error

print("MSE", mean_squared_error(y_test, y_pred))
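A short sketch with made-up numbers shows the effect of squaring. For 0/1 labels a squared error is still 0 or 1, so MSE again equals the misclassification rate; for real-valued targets a single large error dominates the result.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of (actual - predicted)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# 0/1 labels: 2 wrong out of 5
mse_labels = mean_squared_error_manual([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])

# Real-valued targets: errors of 1, 0 and 4 give squares of 1, 0 and 16
mse_values = mean_squared_error_manual([3.0, 5.0, 2.0], [2.0, 5.0, 6.0])

print(mse_labels, round(mse_values, 3))
```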

• 5.3:- Root Mean Squared Error (RMSE)

As the name implies, RMSE is simply the square root of the mean squared
error, which brings the error back to the units of the target.
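No code accompanies this metric in the report, so here is a minimal sketch with made-up labels. With scikit-learn, the same value can be obtained by taking np.sqrt of mean_squared_error(y_test, y_pred).

```python
import math

def root_mean_squared_error_manual(y_true, y_pred):
    """Square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Made-up labels: MSE is 0.4, so RMSE is sqrt(0.4)
rmse = root_mean_squared_error_manual([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(round(rmse, 4))
```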


• 5.4:- R Squared (R2)

The R2 score, also called the coefficient of determination, is one of the
performance evaluation measures for regression-based machine learning
models. Simply put, it measures how close the target data points are to the
fitted line. MAE and MSE depend on the scale of the target, whereas the R2
score is scale-independent. R squared also compares the model against a
baseline that always predicts the mean of the target, which none of the other
metrics does.

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)

print(r2)
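The comparison with the mean baseline can be made explicit with the standard formula R2 = 1 - SS_res / SS_tot. The sketch below uses made-up labels; a negative result means the predictions are worse than always predicting the mean.

```python
def r2_manual(y_true, y_pred):
    """R squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    # Error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Error of the baseline that always predicts the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [1, 0, 1, 1, 0]

r2_imperfect = r2_manual(actual, [1, 0, 0, 1, 1])   # 2 of 5 wrong
r2_perfect = r2_manual(actual, actual)
print(round(r2_imperfect, 3), r2_perfect)
```

Because R2 is designed for regression, it is harder to interpret on 0/1 class labels than accuracy or the classification report.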

Classification Report:-
A classification report evaluates the performance of a machine learning
classifier by the following 5 criteria:

• Accuracy is the fraction of all predictions that are correct. The
higher it is, the better.
• Recall is the fraction of actual positive values that the model
correctly predicts as positive.
• Precision is the ratio of true positives to the sum of both true and
false positives.
• F-score combines precision and recall into one metric. The closer its
value is to 1, the better.
• Support is the number of actual occurrences of each class in the
dataset.
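These criteria all follow from four counts: true positives, false positives, false negatives and true negatives. A hand-computed sketch with made-up labels, treating class 1 as the positive class:

```python
# Made-up labels for illustration
actual = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # true negatives

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support_positive = tp + fn   # actual occurrences of class 1

print(accuracy, precision, recall, f1, support_positive)
```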


from sklearn import metrics

from sklearn.metrics import classification_report
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
print('Recall: ', metrics.recall_score(y_test, y_pred, zero_division=1))
print("Precision:", metrics.precision_score(y_test, y_pred, zero_division=1))
print("CL Report:", metrics.classification_report(y_test, y_pred, zero_division=1))

ROC Curve:-

The receiver operating characteristic (ROC) curve displays the sensitivity
and specificity of the logistic regression model by plotting the true
positive rate against the false positive rate across classification
thresholds.

From the ROC curve, we can calculate the area under the curve (AUC), whose
value ranges from 0 to 1. The closer the AUC is to 1, the better the model
separates the two classes.

• To plot the ROC curve, first get the predicted probability of the positive class:-

y_pred_proba = logreg.predict_proba(X_test)[:, 1]

• Then, calculate the true positive and false positive rates:-


false_positive_rate, true_positive_rate, _ = metrics.roc_curve(y_test, y_pred_proba)

• Next, calculate the AUC to see the model's performance:-

auc= metrics.roc_auc_score(y_test, y_pred_proba)

• Finally, plot the ROC curve:-


plt.plot(false_positive_rate, true_positive_rate, label="AUC=" + str(auc))
plt.title('ROC Curve')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
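To make the quantities behind the curve concrete, the sketch below computes one ROC point at a chosen threshold and the AUC by hand. It uses the pairwise interpretation of AUC: the share of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The labels and scores are made up, and the helper names are hypothetical.

```python
def roc_point(y_true, scores, threshold):
    """(false positive rate, true positive rate) when predicting 1 for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]

fpr_at_half, tpr_at_half = roc_point(labels, probabilities, threshold=0.5)
auc_manual = auc_by_pairs(labels, probabilities)
print(fpr_at_half, tpr_at_half, auc_manual)
```

Sweeping the threshold from high to low traces out the full curve, one (FPR, TPR) point per threshold.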
