
Heart Failure

Analysis Using Classification Algorithms


The dataset comes from the UCI Machine Learning Repository, an open collection where you can find many other datasets, organized by task (regression, classification), attribute type (categorical, numerical), and more:
https://archive.ics.uci.edu/ml/datasets.php
If you want to know where to find free resources for downloading datasets, see:
https://medium.com/datadriveninvestor/7-free-resources-to-download-datasets-4689a419ccf9
This dataset contains the medical records of 299 patients who had heart failure, collected during their
follow‑up period, where each patient profile has 13 clinical features.
Thirteen (13) clinical features:
age: age of the patient (years)
anaemia: decrease of red blood cells or hemoglobin (boolean)
high blood pressure: if the patient has hypertension (boolean)
creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
diabetes: if the patient has diabetes (boolean)
ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
platelets: platelets in the blood (kiloplatelets/mL)
sex: woman or man (binary)
serum creatinine: level of serum creatinine in the blood (mg/dL)
serum sodium: level of serum sodium in the blood (mEq/L)
smoking: if the patient smokes or not (boolean)
time: follow-up period (days)
[target] death event: if the patient died during the follow-up period (boolean)
In [3]: import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestRegressor
from matplotlib import rcParams

2022-11-20 09:25:59.664474: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

In [4]: df = pd.read_csv("/Users/randyasfandy/Downloads/heart_failure_clinical_records_dataset.csv")
In [255]: df.head(5)

Out[255]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets


0 75.0 0 582 0 20 1 265000.00
1 55.0 0 7861 0 38 0 263358.03
2 65.0 0 146 0 20 0 162000.00
3 50.0 1 111 0 20 0 210000.00
4 65.0 1 160 1 20 0 327000.00

In [256]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
# Column Non‑Null Count Dtype
‑‑‑ ‑‑‑‑‑‑ ‑‑‑‑‑‑‑‑‑‑‑‑‑‑ ‑‑‑‑‑
0 age 299 non‑null float64
1 anaemia 299 non‑null int64
2 creatinine_phosphokinase 299 non‑null int64
3 diabetes 299 non‑null int64
4 ejection_fraction 299 non‑null int64
5 high_blood_pressure 299 non‑null int64
6 platelets 299 non‑null float64
7 serum_creatinine 299 non‑null float64
8 serum_sodium 299 non‑null int64
9 sex 299 non‑null int64
10 smoking 299 non‑null int64
11 time 299 non‑null int64
12 DEATH_EVENT 299 non‑null int64
dtypes: float64(3), int64(10)
memory usage: 30.5 KB

In [578]: df.sample(5)

Out[578]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets


83 79.0 1 55 0 50 1 172000.
199 60.0 0 1211 1 35 0 263358.
45 50.0 0 582 1 38 0 310000.
174 65.0 0 198 1 35 1 281000.
208 60.0 1 2281 1 40 0 283000.

In [579]: df["high_blood_pressure"].value_counts()
Out[579]:
0    194
1    105
Name: high_blood_pressure, dtype: int64

In [580]: df["smoking"].value_counts()
Out[580]:
0    203
1     96
Name: smoking, dtype: int64
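
The target itself is not summarized above, and with only 299 records it is worth knowing how unbalanced DEATH_EVENT is before judging the accuracy figures later on. A quick check, assuming df is loaded as above:

df["DEATH_EVENT"].value_counts()
df["DEATH_EVENT"].value_counts(normalize=True).round(2)   # share of survivors vs. deaths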

In [61]: age = df.iloc[:, 0]   # continuous-valued columns, selected by position, for the histograms below
creat = df.iloc[:, 2]
eject = df.iloc[:, 4]
plat = df.iloc[:, 6]
serum = df.iloc[:, 7]
sod = df.iloc[:, 8]
time = df.iloc[:, 11]

In [84]: import matplotlib.pyplot as plt
import seaborn as sns

# one histogram per continuous feature
fig, axs = plt.subplots(ncols=7)
sns.histplot(age, ax=axs[0])
sns.histplot(creat, ax=axs[1])
sns.histplot(eject, ax=axs[2])
sns.histplot(plat, ax=axs[3])
sns.histplot(serum, ax=axs[4])
sns.histplot(sod, ax=axs[5])
sns.histplot(time, ax=axs[6])
fig.set_figheight(12)
fig.set_figwidth(48)

In [94]: sns.histplot(creat)
Out[94]: <AxesSubplot:xlabel='creatinine_phosphokinase', ylabel='Count'>

Train-Test Split


In [5]: x = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2, random_state=42
)
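
Note that StandardScaler is imported at the top but never applied. The distance-based models used later (KNN, SVM) are sensitive to feature scale, since columns such as platelets and creatinine_phosphokinase are orders of magnitude larger than ejection_fraction. A minimal sketch of a scaled copy of this split (not part of the original notebook):

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)   # fit the scaler on the training data only, to avoid leakage
x_test_scaled = scaler.transform(x_test)         # reuse the same means/variances on the test data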

In [258]: from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

Classification Algorithms
Random Forest Classifier
Logistic Regression
KNN
Decision Tree
SVM
Naive Bayes

Random Forest Classifier


In [259]: dfrst = RandomForestClassifier(n_estimators=3, max_depth=4, min_samples_split=6, class_w
ranfor = dfrst.fit(x_train, y_train)
y_pred = ranfor.predict(x_test)

In [260]: print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))

accuracy: 0.73
precision: 0.70
recall: 0.64
f1_score: 0.67
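
The forest above is deliberately tiny (3 trees, depth 4), so its scores can vary a lot between runs. A small grid search is one hedged way to check whether a larger forest would do better; the parameter values below are illustrative and not taken from the original notebook:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='f1')
grid.fit(x_train, y_train)
print(grid.best_params_, round(grid.best_score_, 2))   # best settings and cross-validated f1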

In [261]: rf_accuracy = accuracy_score(y_test, y_pred)
rf_precision = precision_score(y_test, y_pred)
rf_recall = recall_score(y_test, y_pred)
rf_f1 = f1_score(y_test, y_pred)

Logistic Regression
In [262]: from sklearn.linear_model import LogisticRegression

In [263]: logisticRegr = LogisticRegression()
logisticRegr.fit(x_train, y_train)
y_pred2 = logisticRegr.predict(x_test)

In [264]: print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred2)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred2)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred2)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred2)))
accuracy: 0.80
precision: 0.88
recall: 0.60
f1_score: 0.71

In [265]: lr_accuracy = accuracy_score(y_test, y_pred2)
lr_precision = precision_score(y_test, y_pred2)
lr_recall = recall_score(y_test, y_pred2)
lr_f1 = f1_score(y_test, y_pred2)

KNN
In [208]: from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

In [266]: Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1, Ks):
    # Train model and predict for each value of k
    neigh = KNeighborsClassifier(n_neighbors=n).fit(x_train, y_train)
    yhat = neigh.predict(x_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

mean_acc

Out[266]: array([0.53333333, 0.56666667, 0.53333333, 0.55      , 0.53333333,
       0.53333333, 0.53333333, 0.55      , 0.55      ])

In [267]: print("The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1)

The best accuracy was with 0.5666666666666667 with k= 2

In [268]: classifier = KNeighborsClassifier(n_neighbors=2)
classifier.fit(x_train, y_train)
y_pred3 = classifier.predict(x_test)

In [269]: print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred3)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred3)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred3)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred3)))

accuracy: 0.57
precision: 0.40
recall: 0.08
f1_score: 0.13

In [270]: knn_accuracy = accuracy_score(y_test, y_pred3)
knn_precision = precision_score(y_test, y_pred3)
knn_recall = recall_score(y_test, y_pred3)
knn_f1 = f1_score(y_test, y_pred3)
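
KNN's weak precision and recall here are at least partly a scaling artifact: on raw values, the Euclidean distance is dominated by large-magnitude features such as platelets. A hedged re-run inside a scaling pipeline (not in the original notebook) gives a fairer comparison with the other models:

from sklearn.pipeline import make_pipeline

knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
knn_scaled.fit(x_train, y_train)
print('scaled KNN accuracy: {:.2f}'.format(accuracy_score(y_test, knn_scaled.predict(x_test))))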

Decision Tree
In [271]: from sklearn.tree import DecisionTreeClassifier
In [272]: mean_acc = np.zeros((9))

for d in range(1, 10):
    drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=d).fit(x_train, y_train)
    drugTree_yhat = drugTree.predict(x_test)
    print("For depth = {} accuracy score is {} ".format(d, accuracy_score(y_test, drugTree_yhat)))
    mean_acc[d-1] = accuracy_score(y_test, drugTree_yhat)

For depth = 1 accuracy score is 0.75
For depth = 2 accuracy score is 0.75
For depth = 3 accuracy score is 0.7333333333333333
For depth = 4 accuracy score is 0.7166666666666667
For depth = 5 accuracy score is 0.75
For depth = 6 accuracy score is 0.6833333333333333
For depth = 7 accuracy score is 0.7
For depth = 8 accuracy score is 0.7
For depth = 9 accuracy score is 0.65

In [273]: print("The best accuracy was with", mean_acc.max(), "with depth=", mean_acc.argmax()+1)
The best accuracy was with 0.75 with depth= 1

In [274]: drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=1).fit(x_train, y_train)

In [275]: drugTree_yhat = drugTree.predict(x_test)

In [277]: print('accuracy: {:.2f}'.format(accuracy_score(y_test, drugTree_yhat)))
print('precision: {:.2f}'.format(precision_score(y_test, drugTree_yhat)))
print('recall: {:.2f}'.format(recall_score(y_test, drugTree_yhat)))
print('f1_score: {:.2f}'.format(f1_score(y_test, drugTree_yhat)))

accuracy: 0.75
precision: 0.81
recall: 0.52
f1_score: 0.63

In [278]: dt_accuracy = accuracy_score(y_test, drugTree_yhat)
dt_precision = precision_score(y_test, drugTree_yhat)
dt_recall = recall_score(y_test, drugTree_yhat)
dt_f1 = f1_score(y_test, drugTree_yhat)
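
Because the best depth is 1, the fitted tree is a single split (a decision stump), and it is easy to see which feature and threshold it uses. A quick sketch with scikit-learn's plot_tree:

from sklearn.tree import plot_tree

plt.figure(figsize=(8, 5))
plot_tree(drugTree, feature_names=list(x_train.columns), class_names=['survived', 'died'], filled=True)
plt.show()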

SVM
In [279]: from sklearn import svm
from sklearn.metrics import f1_score

In [280]: for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    svm_model = svm.SVC(kernel=k).fit(x_train, y_train)
    svm_yhat = svm_model.predict(x_test)
    print("For kernel: {}, the f1 score is: {}".format(k, f1_score(y_test, svm_yhat, average='weighted')))

For kernel: linear, the f1 score is: 0.7368014819388701
For kernel: poly, the f1 score is: 0.4298245614035087
For kernel: rbf, the f1 score is: 0.4298245614035087
For kernel: sigmoid, the f1 score is: 0.4298245614035087

In [281]: clf = svm.SVC(kernel='linear')

In [282]: clf.fit(x_train, y_train)
Out[282]: SVC(kernel='linear')

In [283]: yhat = clf.predict(x_test)

In [284]: print('accuracy: {:.2f}'.format(accuracy_score(y_test, yhat)))
print('precision: {:.2f}'.format(precision_score(y_test, yhat)))
print('recall: {:.2f}'.format(recall_score(y_test, yhat)))
print('f1_score: {:.2f}'.format(f1_score(y_test, yhat)))

accuracy: 0.75
precision: 0.81
recall: 0.52
f1_score: 0.63

In [285]: svm_accuracy = accuracy_score(y_test, yhat)
svm_precision = precision_score(y_test, yhat)
svm_recall = recall_score(y_test, yhat)
svm_f1 = f1_score(y_test, yhat)
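
With only 299 patients, the single 80/20 split leaves just 60 test cases, so all of these scores carry a fair amount of noise. Cross-validation over the full dataset gives a more stable estimate; a minimal sketch with the linear-kernel SVC (5 folds chosen arbitrarily):

from sklearn.model_selection import cross_val_score

cv_f1 = cross_val_score(svm.SVC(kernel='linear'), x, y, cv=5, scoring='f1')
print('5-fold f1: {:.2f} +/- {:.2f}'.format(cv_f1.mean(), cv_f1.std()))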

Naive Bayes
In [286]: from sklearn.naive_bayes import GaussianNB

In [287]: gnb = GaussianNB()

In [288]: gnb.fit(x_train, y_train)
Out[288]: GaussianNB()

In [289]: y_pred4 = gnb.predict(x_test)

In [290]: print('accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred4)))
print('precision: {:.2f}'.format(precision_score(y_test, y_pred4)))
print('recall: {:.2f}'.format(recall_score(y_test, y_pred4)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred4)))

accuracy: 0.73
precision: 0.91
recall: 0.40
f1_score: 0.56

In [292]: nb_accuracy = accuracy_score(y_test, y_pred4)
nb_precision = precision_score(y_test, y_pred4)
nb_recall = recall_score(y_test, y_pred4)
nb_f1 = f1_score(y_test, y_pred4)

Prediction Dictionary
In [295]: accuracy_list = [rf_accuracy, lr_accuracy, knn_accuracy, dt_accuracy, svm_accuracy, nb_accuracy]
precision_list = [rf_precision, lr_precision, knn_precision, dt_precision, svm_precision, nb_precision]
recall_list = [rf_recall, lr_recall, knn_recall, dt_recall, svm_recall, nb_recall]
f1_list = [rf_f1, lr_f1, knn_f1, dt_f1, svm_f1, nb_f1]

columns = ['Random Forest', 'Logistic Regression', 'KNN', 'Decision Tree', 'Support Vector Machine', 'Naive Bayes']
index = ['Accuracy', 'Precision', 'Recall', 'F1 Score']

evaluation_df = pd.DataFrame([accuracy_list, precision_list, recall_list, f1_list], index=index, columns=columns)

# transpose so each row is an algorithm; the sorting cells below refer to this frame as accuracy_df1
accuracy_df1 = evaluation_df.transpose()
accuracy_df1.index.name = 'Algorithm'
Most Accurate Model
In [306]: accuracy_df1.sort_values(by="Accuracy", ascending=False)

Out[306]: Algorithm  Accuracy  Precision  Recall  F1 Score
Logistic Regression 0.800000 0.882353 0.60 0.714286
Decision Tree 0.750000 0.812500 0.52 0.634146
Support Vector Machine 0.750000 0.812500 0.52 0.634146
Random Forest 0.733333 0.695652 0.64 0.666667
Naive Bayes 0.733333 0.909091 0.40 0.555556
KNN 0.566667 0.400000 0.08 0.133333

Highest Precision Score Model


In [307]: accuracy_df1.sort_values(by="Precision", ascending=False)

Out[307]: Algorithm  Accuracy  Precision  Recall  F1 Score
Naive Bayes 0.733333 0.909091 0.40 0.555556
Logistic Regression 0.800000 0.882353 0.60 0.714286
Decision Tree 0.750000 0.812500 0.52 0.634146
Support Vector Machine 0.750000 0.812500 0.52 0.634146
Random Forest 0.733333 0.695652 0.64 0.666667
KNN 0.566667 0.400000 0.08 0.133333

Highest Recall Score Model


In [308]: accuracy_df1.sort_values(by="Recall", ascending=False)

Out[308]: Algorithm  Accuracy  Precision  Recall  F1 Score
Random Forest 0.733333 0.695652 0.64 0.666667
Logistic Regression 0.800000 0.882353 0.60 0.714286
Decision Tree 0.750000 0.812500 0.52 0.634146
Support Vector Machine 0.750000 0.812500 0.52 0.634146
Naive Bayes 0.733333 0.909091 0.40 0.555556
KNN 0.566667 0.400000 0.08 0.133333

Highest F1 Score Model


In [309]: accuracy_df1.sort_values(by="F1 Score", ascending=False)
Out[309]: Algorithm  Accuracy  Precision  Recall  F1 Score
Logistic Regression 0.800000 0.882353 0.60 0.714286
Random Forest 0.733333 0.695652 0.64 0.666667
Decision Tree 0.750000 0.812500 0.52 0.634146
Support Vector Machine 0.750000 0.812500 0.52 0.634146
Naive Bayes 0.733333 0.909091 0.40 0.555556
KNN 0.566667 0.400000 0.08 0.133333
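
The same comparison is easier to scan as a grouped bar chart; a quick sketch using the accuracy_df1 frame built above:

accuracy_df1.plot(kind='bar', figsize=(12, 6), rot=45)
plt.ylabel('Score')
plt.title('Classifier comparison on the held-out test set')
plt.show()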

In [6]: # Generate a mask for the upper triangle
corr = x_train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Out[6]: <AxesSubplot:>
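
The heatmap's color scale is capped at vmax=.3, which suggests most pairwise correlations in this data are modest. To make the multicollinearity check concrete, the strongest pairs can be listed directly; the code below is a small sketch reusing the corr matrix from the cell above (each pair appears twice in the output):

corr_pairs = corr.abs().unstack().sort_values(ascending=False)
corr_pairs = corr_pairs[corr_pairs < 1.0]   # drop the self-correlations on the diagonal
print(corr_pairs.head(10))                  # strongest absolute pairwise correlations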
Feature Selection
In [605]: from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# candidate features were screened in the multicollinearity (correlation heatmap) check above
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=414)

In [606]: feature_names = [f"feature {i}" for i in range(X.shape[1])]

forest = RandomForestRegressor(random_state=42)
forest.fit(X_train, y_train)

feats = {}  # a dict to hold feature_name: feature_importance
for feature, importance in zip(X.columns, forest.feature_importances_):
    feats[feature] = importance  # add the name/value pair

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=90, figsize=(15, 12))
plt.show()
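
Impurity-based importances from a random forest can favor high-cardinality numeric features, and fitting a RandomForestRegressor to a 0/1 target is itself a somewhat unusual choice. Permutation importance on the held-out split is a common cross-check; the sketch below reuses the forest fitted above and is not part of the original notebook:

from sklearn.inspection import permutation_importance

perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)
perm_importances = pd.Series(perm.importances_mean, index=X.columns).sort_values()
perm_importances.plot(kind='barh', figsize=(10, 8))
plt.xlabel('Mean drop in score when the feature is shuffled')
plt.show()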
