Data Science Program With SONAR Data
SWAPNIL SAURAV
Email: ssaurav2@gitam.in
ROLL NO. HR21CSEN0114029
PhD Part Time – CSE Department (GITAM, Hyderabad)
2023
DATA SCIENCES Assignment 1
The Sonar dataset is a collection of 208 labeled samples of sonar signals collected by a
naval mine detection system. Each sample is composed of 60 input variables that
represent the strength of the sonar signal in various frequency bands (continuous numeric
data). The output variable is a categorical string that indicates whether the sample is
either a ROCK (R) or a MINE (M).
In this assignment, we have been asked to build a logistic regression model. The dataset can be used to train the model to classify objects as either "R" or "M" based on the 60 attributes.
Figure 1 plots the correlation matrix between all 60 input variables to show how they are correlated. Ideally, these attributes should have no or only weak correlation, but the figure shows some dark cells (moderate positive correlation) as well as whitish cells (moderate negative correlation). From each group of highly correlated attributes we should keep only one for the analysis, as sketched below.
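Since the heatmap only identifies the problem, a small helper can do the pruning. A minimal sketch, assuming the attributes are in a pandas DataFrame as in the code listing at the end; the 0.9 cutoff is an illustrative choice, not a value from this analysis:

import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one attribute from each pair whose |correlation| exceeds the threshold."""
    corr = features.corr().abs()
    # Upper triangle (k=1): inspect each pair once, skip self-correlations
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)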
The data shows that out of 208 observations, 111 have Mine as the outcome and the remaining 97 have Rock. The dataset is reasonably balanced, so we do not have to perform any additional steps to balance it.
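This balance can be verified with a one-line check (a sketch, assuming the labels sit in column 60 of the dataset as in the code listing):

print(dataset[60].value_counts())                # M: 111, R: 97
print(dataset[60].value_counts(normalize=True))  # ~0.53 vs ~0.47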
The height of each bar represents the frequency, i.e., the number of data points that fall within that range. Histogram plots are useful for seeing the overall pattern or shape of the data and are often used to compare different data sets. As we can see from Figure 3, the bulk of the attributes are positively skewed. There are also some negatively skewed attributes, such as those with indices 30 to 35.
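Rather than judging skew only from the histograms, it can be quantified per attribute; a sketch, again assuming the dataset DataFrame from the code listing:

# Positive skew values mean a right (positive) skew, negative values a left skew
skew = dataset.iloc[:, :-1].skew()
print(skew.sort_values())                 # most negatively skewed attributes first
print((skew > 0).sum(), "of 60 attributes are positively skewed")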
From the PCA explained-variance ratios we see that (a sketch follows this list):
1. Only two principal components out of 60 explain more than 54% of the variance
2. Eleven out of 60 explain more than 90% of the variance
3. Eighteen out of 60 explain more than 95% of the variance
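These counts come straight from the cumulative explained-variance ratio; a sketch assuming the fitted pca object from the code listing:

import numpy as np

cum = np.cumsum(pca.explained_variance_ratio_)
print(cum[1])                      # first two components: > 0.54
print(np.argmax(cum >= 0.90) + 1)  # number of components reaching 90% (11)
print(np.argmax(cum >= 0.95) + 1)  # number of components reaching 95% (18)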
4. Logistic Regression.
We need to run binary logistic regression. But before that, we perform an ANOVA test to see whether the attributes have any effect on the output variable.
Since there are more than two attributes, we need to perform an n-way ANOVA or, treating the attributes jointly, a MANOVA (multivariate analysis of variance) test.
MANOVA Test:
The null hypothesis (H0) of ANOVA is that there is no difference among group means. The alternative hypothesis (Ha) is that at least one group differs significantly from the overall mean of the dependent variable. If we fail to reject the null hypothesis, then there is no difference among group means, logistic regression will not give us significant results, and hence we cannot trust the regression.
Rejecting the null hypothesis means accepting the alternative hypothesis, which gives us confidence to use machine learning techniques (including logistic regression) in this case.
Because the p-value of the test on the 60 attributes is statistically significant (p < 0.05), it is likely that the attributes do have a significant effect on identifying Mine or Rock.
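A minimal sketch of such a test with statsmodels, assuming a DataFrame df whose attribute columns come first and whose last column is the class label (the Q() quoting protects column names that are not valid Python identifiers):

from statsmodels.multivariate.manova import MANOVA

# All attributes as dependent variables, the class label as the grouping factor
formula = " + ".join(f"Q('{c}')" for c in df.columns[:-1]) + f" ~ Q('{df.columns[-1]}')"
maov = MANOVA.from_formula(formula, data=df)
print(maov.mv_test())  # Wilks' lambda, Pillai's trace, etc., with p-values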
Our model has given us the coefficients for all 60 attributes and the constant value (intercept). These let us make predictions. Accuracy on the training data is 84.6% and on the testing data it is about 75%.
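A sketch of how these coefficients turn into a prediction (assuming the fitted sklearn model from the code listing): the probability of the positive class is the logistic function applied to the intercept plus the weighted sum of the attributes.

import numpy as np

def predict_proba_manual(x_row):
    # P(positive class) = sigmoid(intercept + coefficients . x)
    z = model.intercept_[0] + np.dot(model.coef_[0], x_row)
    return 1.0 / (1.0 + np.exp(-z))

# Should match predict_proba's column for model.classes_[1]
# (sklearn orders the labels alphabetically: 'M' first, then 'R')
print(predict_proba_manual(X_test[0]))
print(model.predict_proba(X_test[:1])[0, 1])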
5. Learning Curves
We can see from Figure 6 that as we increase the size of the training set, the training accuracy decreases but the test accuracy increases. A small training set gives an overfitting situation (high training accuracy but poor test accuracy); as we add data the gap narrows and the model approaches the best achievable generalization.
Error analysis is done by computing the confusion matrix and then calculating various performance metrics like precision, recall, and F1-score.
Confusion Matrix: A confusion matrix is a table that is used to evaluate the performance of a
classification model. It is a table of the predicted classes compared to the actual classes in the
test data set. It allows you to see where the model is making correct predictions, and where it
is making mistakes. It also helps you to identify potential areas for improvement in the model.
The rows of the matrix represent the actual classes and the columns represent the predicted classes. Each cell holds the count of instances of that actual class that were predicted as that class. The diagonal elements represent the number of correct predictions, while the off-diagonal elements indicate the number of incorrect predictions.
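A tiny illustration of this layout with sklearn, on toy labels rather than the assignment's data:

from sklearn.metrics import confusion_matrix

actual    = ["M", "M", "R", "R", "R"]
predicted = ["M", "R", "R", "R", "M"]
# labels fixes the ordering: rows = actual, columns = predicted
print(confusion_matrix(actual, predicted, labels=["M", "R"]))
# [[1 1]   one M correct, one M misclassified as R
#  [1 2]]  one R misclassified as M, two R correct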
The below table is the confusion matrix of the test data. It shows that 14 records were classified as Mine and were actually Mine, and 10 were correctly classified as Rock. The remaining 8 (5 + 3) were incorrectly classified.
Classification Report: Precision, recall, F1 score and support are all metrics used to evaluate
the performance of a machine learning model.
• Precision is the fraction of true positives out of the total number of predicted positives. It tells us how accurate our model is when predicting positive outcomes.
• Recall is the fraction of true positives out of the total number of actual positives. It tells us how many of the actual positive outcomes our model is able to find.
• F1 score is the harmonic mean of precision and recall. It takes both false positives and false negatives into account when calculating the score.
• Support is the number of occurrences of each class in the test set. It is used to weight the per-class metrics when averaging them. (A worked sketch follows this list.)
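These metrics follow directly from the confusion-matrix counts. A minimal sketch with illustrative numbers (the 5/3 split of the errors is not attributed to a direction here, since the report does not state which way each error falls):

# Hypothetical counts for one class
tp, fp, fn = 14, 5, 3

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall    = tp / (tp + fn)   # fraction of actual positives that are found
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")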
***
CODE
import numpy as np # linear algebra
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
loc = "D:\\datasets\\sonar_csv.csv"
# The line creating `dataset` was missing; reconstructed assuming the file's
# header row is read as data and dropped on the next line
dataset = pd.read_csv(loc, header=None)
dataset = dataset.iloc[1:, :]
dataset = dataset.astype({c: float for c in dataset.columns[:-1]})  # numeric attributes
print(dataset.shape)
print(dataset.dtypes)
print(dataset[60].value_counts())   # class balance: 111 M vs 97 R
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values
print(X)
print(Y)
#EDA
print(dataset.iloc[: , :-1].corr())
import seaborn as sb
# plotting correlation heatmap
dataplot = sb.heatmap(dataset.iloc[: , :-1].corr(), cmap="YlGnBu", annot=False)
# displaying heatmap
plt.show()
dataset.iloc[: , :].groupby(60)[60].count().plot.bar();
plt.show()
# histograms
dataset.iloc[:, :].hist(sharex=False, sharey=False, xlabelsize=1,
                        ylabelsize=1, figsize=(12, 12), fill=True, color='#F5AC03')
plt.title("Histogram of data for each attribute")
plt.show()
#MODEL
print("=======> \n\n",Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15,
                                                    random_state=1)
print(X.shape,X_train.shape,X_test.shape)
#PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=60)
principalComponents = pca.fit_transform(X_train)
principalDf = pca.transform(X_test)
var = pca.explained_variance_ratio_
print("variance ratio = ",var)
#Reading again, with named columns for the MANOVA test
df = pd.read_csv(loc, index_col=0)
print("Columns: ", df.columns)
df.head()
# The lines constructing `maov` were missing; reconstructed with statsmodels,
# assuming the attribute columns come first and the class label is last
from statsmodels.multivariate.manova import MANOVA
formula = " + ".join(f"Q('{c}')" for c in df.columns[:-1]) + f" ~ Q('{df.columns[-1]}')"
maov = MANOVA.from_formula(formula, data=df)
print(maov.mv_test())
#####################
# Fit on the full data to inspect the coefficients; the model is re-fit on the
# training split below for an honest accuracy estimate
model = LogisticRegression()
model.fit(X, Y)
print("LR Coefficient = ", model.coef_)
print("Intercept = ", model.intercept_)
#TEST: re-fit on the training split and evaluate on both splits
model = LogisticRegression()
model.fit(X_train, Y_train)
x_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, x_train_prediction)
print('Accuracy on training data :', training_data_accuracy)
x_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, x_test_prediction)
print('Accuracy on testing data :', testing_data_accuracy)
y_pred = x_test_prediction   # used for the confusion matrix below
# Learning Curves: score as a function of the training-set size
# (the head of this call was missing; reconstructed with sklearn's learning_curve)
from sklearn.model_selection import learning_curve, ShuffleSplit
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(),
    X=X,
    y=Y,
    cv=cv,
    scoring="accuracy",  # RMSE is not meaningful for the string class labels
    train_sizes=[15, 35, 66, 99, 125, 150, 166],  # size 1 dropped: a one-sample
    verbose=0,                                    # subset holds a single class
)
plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="train")
plt.plot(train_sizes, test_scores.mean(axis=1), marker="o", label="test")
plt.legend()
plt.show()
from sklearn.model_selection import LearningCurveDisplay

common_params = {
    "X": X,
    "y": Y,
    "train_sizes": np.linspace(0.1, 1.0, 5),
    "cv": ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
    "score_type": "both",
    "n_jobs": 4,
    "line_kw": {"marker": "o"},
    "std_display_style": "fill_between",
    "score_name": "Accuracy",
}
# The call consuming these parameters was missing; it draws Figure 6
LearningCurveDisplay.from_estimator(LogisticRegression(), **common_params)
plt.show()
#RESULT ANALYSIS
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test,y_pred))