Python Solution

This document outlines the steps taken to perform exploratory data analysis on a heart disease dataset. It includes: 1. Loading and understanding the data, which has 303 samples and 14 features including categorical and continuous variables. 2. Performing exploratory analysis including univariate analysis with count plots of categorical features and boxen plots of continuous features, as well as bivariate analysis. 3. Preprocessing the data for modeling, which includes feature engineering and making features model-ready. The goal is to predict the risk of heart attack by applying various machine learning models like linear classifiers and tree models.

Uploaded by

Mile Mile

1. Introduction
– 1.1 Data Dictionary
– 1.2 Task
2. Preparation
– 2.1 Packages
– 2.2 Data
– 2.3 Understanding Data
3. Exploratory Data Analysis
– 3.1 Univariate Analysis
– 3.2 Bivariate Analysis
4. Data Preprocessing
– 4.1 Conclusions from the EDA
– 4.2 Packages
– 4.3 Making features model ready
5. Modeling
– 5.1 Linear Classifiers
– 5.2 Tree Models

1. Introduction

1.1 Data Dictionary

age - Age of the patient

sex - Sex of the patient

cp - Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

trtbps - Resting blood pressure (in mm Hg)

chol - Cholesterol in mg/dl fetched via BMI sensor

fbs - (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False

restecg - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

thalachh - Maximum heart rate achieved

oldpeak - Previous peak

slp - Slope

caa - Number of major vessels

thall - Thallium Stress Test result ~ (0,3)

exng - Exercise induced angina ~ 1 = Yes, 0 = No

output - Target variable

1.2 Task
To perform EDA and predict if a person is prone to a heart attack or not.

2. Preparation

2.1 Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

2.2 Data
df = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

2.3 Understanding Data

2.3.1 The shape of the data
print("The shape of the dataset is : ", df.shape)

The shape of the dataset is : (303, 14)

2.3.2 Preview of the first 5 rows of the data

df.head()

   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  caa  thall  output
0   63    1   3     145   233    1        0       150     0      2.3    0    0      1       1
1   37    1   2     130   250    0        1       187     0      3.5    0    0      2       1
2   41    0   1     130   204    0        0       172     0      1.4    2    0      2       1
3   56    1   1     120   236    0        1       178     0      0.8    2    0      2       1
4   57    0   0     120   354    0        1       163     1      0.6    2    0      2       1

2.3.3 Checking the number of unique values in each column

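# named unique_counts rather than dict, which would shadow the built-in type
unique_counts = {}
for col in df.columns:
    unique_counts[col] = df[col].value_counts().shape[0]

pd.DataFrame(unique_counts, index=["unique count"]).transpose()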

unique count
age 41
sex 2
cp 4
trtbps 49
chol 152
fbs 2
restecg 3
thalachh 91
exng 2
oldpeak 40
slp 3
caa 5
thall 4
output 2
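
The same per-column count can be obtained without an explicit loop. The following is a minimal sketch, assuming a small stand-in frame named `toy` (hypothetical values, not the heart dataset) so that it runs on its own:

```python
import pandas as pd

# Toy frame standing in for the heart dataset (illustrative values only).
toy = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "age": [63, 37, 41, 56, 57],
})

# nunique() counts the distinct non-null values of every column in one call.
unique_counts = toy.nunique().to_frame(name="unique count")
print(unique_counts)
```

On the real data, `df.nunique()` should give the same numbers as the loop above, since the dataset has no missing values.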

2.3.4 Separating the columns in categorical and continuous

cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]
target_col = ["output"]
print("The categorical cols are : ", cat_cols)
print("The continuous cols are : ", con_cols)
print("The target variable is : ", target_col)

The categorical cols are : ['sex', 'exng', 'caa', 'cp', 'fbs', 'restecg', 'slp', 'thall']
The continuous cols are : ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
The target variable is : ['output']
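
The hand-written lists agree with a simple rule of thumb based on the unique counts from 2.3.3. The sketch below uses those counts and an assumed cut-off, `MAX_CATEGORIES = 10`, which is a hypothetical threshold chosen for illustration and not something the notebook defines:

```python
import pandas as pd

# Unique counts as printed in section 2.3.3.
unique_counts = pd.Series({
    "age": 41, "sex": 2, "cp": 4, "trtbps": 49, "chol": 152, "fbs": 2,
    "restecg": 3, "thalachh": 91, "exng": 2, "oldpeak": 40, "slp": 3,
    "caa": 5, "thall": 4, "output": 2,
})

MAX_CATEGORIES = 10  # assumed threshold, not part of the original notebook
target_col = ["output"]

# Drop the target, then split the remaining features by cardinality.
features = unique_counts.drop(target_col)
cat_cols = features[features <= MAX_CATEGORIES].index.tolist()
con_cols = features[features > MAX_CATEGORIES].index.tolist()

print("categorical:", cat_cols)
print("continuous:", con_cols)
```

A cardinality rule is only a heuristic; a low-cardinality column can still be numeric in meaning, so the result is worth checking against the data dictionary.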

2.3.5 Summary statistics

df[con_cols].describe().transpose()

          count        mean        std    min    25%    50%    75%    max
age       303.0   54.366337   9.082101   29.0   47.5   55.0   61.0   77.0
trtbps    303.0  131.623762  17.538143   94.0  120.0  130.0  140.0  200.0
chol      303.0  246.264026  51.830751  126.0  211.0  240.0  274.5  564.0
thalachh  303.0  149.646865  22.905161   71.0  133.5  153.0  166.0  202.0
oldpeak   303.0    1.039604   1.161075    0.0    0.0    0.8    1.6    6.2

2.3.6 Missing values

df.isnull().sum()

age 0
sex 0
cp 0
trtbps 0
chol 0
fbs 0
restecg 0
thalachh 0
exng 0
oldpeak 0
slp 0
caa 0
thall 0
output 0
dtype: int64

3. Exploratory Data Analysis

3.1 Univariate Analysis

3.1.1 Count plot of categorical features
fig = plt.figure(figsize=(18,15))
gs = fig.add_gridspec(3,3)
gs.update(wspace=0.5, hspace=0.25)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[0,2])
ax3 = fig.add_subplot(gs[1,0])
ax4 = fig.add_subplot(gs[1,1])
ax5 = fig.add_subplot(gs[1,2])
ax6 = fig.add_subplot(gs[2,0])
ax7 = fig.add_subplot(gs[2,1])
ax8 = fig.add_subplot(gs[2,2])
background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)
ax6.set_facecolor(background_color)
ax7.set_facecolor(background_color)
ax8.set_facecolor(background_color)

# Title of the plot

ax0.spines["bottom"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.tick_params(left=False, bottom=False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.text(0.5,0.5,
         'Count plot for various\n categorical features\n_________________',
         horizontalalignment='center',
         verticalalignment='center',
         fontsize=18, fontweight='bold',
         fontfamily='serif',
         color="#000000")

# Sex count
ax1.text(0.3, 220, 'Sex', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax1,data=df,x='sex',palette=color_palette)
ax1.set_xlabel("")
ax1.set_ylabel("")

# Exng count
ax2.text(0.3, 220, 'Exng', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax2.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax2,data=df,x='exng',palette=color_palette)
ax2.set_xlabel("")
ax2.set_ylabel("")

# Caa count
ax3.text(1.5, 200, 'Caa', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax3,data=df,x='caa',palette=color_palette)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Cp count
ax4.text(1.5, 162, 'Cp', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax4.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax4,data=df,x='cp',palette=color_palette)
ax4.set_xlabel("")
ax4.set_ylabel("")

# Fbs count
ax5.text(0.5, 290, 'Fbs', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax5,data=df,x='fbs',palette=color_palette)
ax5.set_xlabel("")
ax5.set_ylabel("")

# Restecg count
ax6.text(0.75, 165, 'Restecg', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax6.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax6,data=df,x='restecg',palette=color_palette)
ax6.set_xlabel("")
ax6.set_ylabel("")

# Slp count
ax7.text(0.85, 155, 'Slp', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax7,data=df,x='slp',palette=color_palette)
ax7.set_xlabel("")
ax7.set_ylabel("")

# Thall count
ax8.text(1.2, 180, 'Thall', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax8.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax8,data=df,x='thall',palette=color_palette)
ax8.set_xlabel("")
ax8.set_ylabel("")

for s in ["top","right","left"]:
    ax1.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    ax3.spines[s].set_visible(False)
    ax4.spines[s].set_visible(False)
    ax5.spines[s].set_visible(False)
    ax6.spines[s].set_visible(False)
    ax7.spines[s].set_visible(False)
    ax8.spines[s].set_visible(False)
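
The per-axis blocks above repeat the same steps for every feature, so the grid can also be built in a loop. The following is a rough sketch under stated assumptions: it uses plain Matplotlib bars instead of `sns.countplot`, a hypothetical stand-in frame `toy`, and the non-interactive `Agg` backend so that it runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame standing in for the heart dataset (illustrative values only).
toy = pd.DataFrame({
    "sex": [1, 1, 0, 1, 0, 1],
    "exng": [0, 0, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0, 2],
})
cat_cols = ["sex", "exng", "cp"]

fig, axes = plt.subplots(1, len(cat_cols), figsize=(12, 4))
for ax, col in zip(axes, cat_cols):
    # One bar per category, ordered by category value.
    counts = toy[col].value_counts().sort_index()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(col)
    ax.grid(color="#000000", linestyle=":", axis="y")
    for side in ["top", "right", "left"]:
        ax.spines[side].set_visible(False)
```

The styling here is deliberately reduced; colours, background and title panel from the original figure would need to be added back inside the loop.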

3.1.2 Boxen plot of continuous features

fig = plt.figure(figsize=(18,16))
gs = fig.add_gridspec(2,3)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[0,2])
ax3 = fig.add_subplot(gs[1,0])
ax4 = fig.add_subplot(gs[1,1])
ax5 = fig.add_subplot(gs[1,2])

# Title of the plot

ax0.spines["bottom"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.tick_params(left=False, bottom=False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.text(0.5,0.5,
         'Boxen plot for various\n continuous features\n_________________',
         horizontalalignment='center',
         verticalalignment='center',
         fontsize=18, fontweight='bold',
         fontfamily='serif',
         color="#000000")

# Age
ax1.text(-0.05, 81, 'Age', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax1,y=df['age'],palette=["#800000"],width=0.6)
ax1.set_xlabel("")
ax1.set_ylabel("")

# Trtbps
ax2.text(-0.05, 208, 'Trtbps', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax2.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax2,y=df['trtbps'],palette=["#8000ff"],width=0.6)
ax2.set_xlabel("")
ax2.set_ylabel("")
# Chol
ax3.text(-0.05, 600, 'Chol', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax3,y=df['chol'],palette=["#6aac90"],width=0.6)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Thalachh
ax4.text(-0.09, 210, 'Thalachh', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax4.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax4,y=df['thalachh'],palette=["#5833ff"],width=0.6)
ax4.set_xlabel("")
ax4.set_ylabel("")

# oldpeak
ax5.text(-0.1, 6.6, 'Oldpeak', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax5,y=df['oldpeak'],palette=["#da8829"],width=0.6)
ax5.set_xlabel("")
ax5.set_ylabel("")

for s in ["top","right","left"]:
    ax1.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    ax3.spines[s].set_visible(False)
    ax4.spines[s].set_visible(False)
    ax5.spines[s].set_visible(False)

3.1.3 Count plot of target
fig = plt.figure(figsize=(18,7))
gs = fig.add_gridspec(1,2)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])

# Title of the plot

ax0.text(0.5,0.5,"Count of the target\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')

ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)

# Target Count
ax1.text(0.35,177,"Output",fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax1, data=df, x = 'output',palette = color_palette)
ax1.set_xlabel("")
ax1.set_ylabel("")
ax1.set_xticklabels(["Low chances of attack(0)","High chances of attack(1)"])

ax0.spines["top"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["bottom"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.spines["right"].set_visible(False)

3.2 Bivariate Analysis

3.2.1 Correlation matrix of continuous features
df_corr = df[con_cols].corr().transpose()
df_corr
age trtbps chol thalachh oldpeak
age 1.000000 0.279351 0.213678 -0.398522 0.210013
trtbps 0.279351 1.000000 0.123174 -0.046698 0.193216
chol 0.213678 0.123174 1.000000 -0.009940 0.053952
thalachh -0.398522 -0.046698 -0.009940 1.000000 -0.344187
oldpeak 0.210013 0.193216 0.053952 -0.344187 1.000000

fig = plt.figure(figsize=(10,10))
gs = fig.add_gridspec(1,1)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])

color_palette = ["#5833ff","#da8829"]
mask = np.triu(np.ones_like(df_corr))
ax0.text(1.5,-0.1,"Correlation Matrix",fontsize=22, fontweight='bold',
fontfamily='serif', color="#000000")
df_corr = df[con_cols].corr().transpose()
sns.heatmap(df_corr,mask=mask,fmt=".1f",annot=True,cmap='YlGnBu')
plt.show()
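
The `mask` passed to `sns.heatmap` hides every cell where it is true, so `np.triu` removes the upper triangle together with the diagonal. A small sketch of that effect, using three of the correlations printed above (age, trtbps, chol) and a hypothetical helper array `visible`:

```python
import numpy as np

# Correlations for age, trtbps and chol, copied from the matrix above.
corr = np.array([
    [1.000000, 0.279351, 0.213678],
    [0.279351, 1.000000, 0.123174],
    [0.213678, 0.123174, 1.000000],
])

# True on and above the diagonal; these are the cells the heatmap would hide.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Replace hidden cells with NaN to see what remains visible.
visible = np.where(mask, np.nan, corr)
print(visible)
```

Only the three values below the diagonal survive, which is enough because the matrix is symmetric.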
3.2.2 Scatterplot heatmap of dataframe
fig = plt.figure(figsize=(12,12))
corr_mat = df.corr().stack().reset_index(name="correlation")
g = sns.relplot(
data=corr_mat,
x="level_0", y="level_1", hue="correlation", size="correlation",
palette="YlGnBu", hue_norm=(-1, 1), edgecolor=".7",
height=10, sizes=(50, 250), size_norm=(-.2, .8),
)
g.set(xlabel="features on X", ylabel="features on Y", aspect="equal")
g.fig.suptitle('Scatterplot heatmap',fontsize=22, fontweight='bold',
               fontfamily='serif', color="#000000")
g.despine(left=True, bottom=True)
g.ax.margins(.02)
for label in g.ax.get_xticklabels():
    label.set_rotation(90)
# newer Matplotlib releases name this attribute legend_handles
handles = getattr(g.legend, "legend_handles", None)
if handles is None:
    handles = g.legend.legendHandles
for artist in handles:
    artist.set_edgecolor(".7")
plt.show()

<Figure size 864x864 with 0 Axes>

3.2.3 Distribution of continuous features according to target variable
fig = plt.figure(figsize=(18,18))
gs = fig.add_gridspec(5,2)
gs.update(wspace=0.5, hspace=0.5)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])

background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)
ax6.set_facecolor(background_color)
ax7.set_facecolor(background_color)
ax8.set_facecolor(background_color)
ax9.set_facecolor(background_color)

# Age title
ax0.text(0.5,0.5,"Distribution of age\naccording to\n target variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax0.spines["bottom"].set_visible(False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)

# Age
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax1, data=df, x='age',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax1.set_xlabel("")
ax1.set_ylabel("")
# TrTbps title
ax2.text(0.5,0.5,"Distribution of trtbps\naccording to\n target variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax2.spines["bottom"].set_visible(False)
ax2.set_xticklabels([])
ax2.set_yticklabels([])
ax2.tick_params(left=False, bottom=False)

# TrTbps
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax3, data=df, x='trtbps',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Chol title
ax4.text(0.5,0.5,"Distribution of chol\naccording to\n target variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax4.spines["bottom"].set_visible(False)
ax4.set_xticklabels([])
ax4.set_yticklabels([])
ax4.tick_params(left=False, bottom=False)

# Chol
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax5, data=df, x='chol',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax5.set_xlabel("")
ax5.set_ylabel("")

# Thalachh title
ax6.text(0.5,0.5,"Distribution of thalachh\naccording to\n target variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax6.spines["bottom"].set_visible(False)
ax6.set_xticklabels([])
ax6.set_yticklabels([])
ax6.tick_params(left=False, bottom=False)

# Thalachh
ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax7, data=df, x='thalachh',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax7.set_xlabel("")
ax7.set_ylabel("")

# Oldpeak title
ax8.text(0.5,0.5,"Distribution of oldpeak\naccording to\n target variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax8.spines["bottom"].set_visible(False)
ax8.set_xticklabels([])
ax8.set_yticklabels([])
ax8.tick_params(left=False, bottom=False)

# Oldpeak
ax9.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax9, data=df, x='oldpeak',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax9.set_xlabel("")
ax9.set_ylabel("")

for i in ["top","left","right"]:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    ax2.spines[i].set_visible(False)
    ax3.spines[i].set_visible(False)
    ax4.spines[i].set_visible(False)
    ax5.spines[i].set_visible(False)
    ax6.spines[i].set_visible(False)
    ax7.spines[i].set_visible(False)
    ax8.spines[i].set_visible(False)
    ax9.spines[i].set_visible(False)

3.2.4 Some other relations that seemed intuitive
fig = plt.figure(figsize=(18,20))
gs = fig.add_gridspec(6,2)
gs.update(wspace=0.5, hspace=0.5)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])
ax10 = fig.add_subplot(gs[5,0])
ax11 = fig.add_subplot(gs[5,1])

# Cp title
# 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic
ax0.text(0.5,0.5,"Chest pain\ndistribution\n__________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax0.spines["bottom"].set_visible(False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)
ax0.text(1,.5,"0 - Typical Angina\n1 - Atypical Angina\n2 - Non-anginal Pain\n3 - Asymptomatic",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)

# Cp
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax1, data=df, x='cp',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax1.set_xlabel("")
ax1.set_ylabel("")
# Caa title
ax2.text(0.5,0.5,"Number of\nmajor vessels\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax2.text(1,.5,"0 vessels\n1 vessel\n2 vessels\n3 vessels\n4 vessels",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)

ax2.spines["bottom"].set_visible(False)
ax2.set_xticklabels([])
ax2.set_yticklabels([])
ax2.tick_params(left=False, bottom=False)

# Caa
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax3, data=df, x='caa',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Sex title
ax4.text(0.5,0.5,"Heart Attack\naccording to\nsex\n______",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax4.text(1,.5,"0 - Female\n1 - Male",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)
ax4.spines["bottom"].set_visible(False)
ax4.set_xticklabels([])
ax4.set_yticklabels([])
ax4.tick_params(left=False, bottom=False)

# Sex
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax5,data=df,x='sex',palette=["#8000ff","#da8829"],
hue='output')
ax5.set_xlabel("")
ax5.set_ylabel("")

# Thall title
ax6.text(0.5,0.5,"Distribution of thall\naccording to\n target variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax6.text(1,.5,"Thallium Stress\nTest Result\n0, 1, 2, 3",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)
ax6.spines["bottom"].set_visible(False)
ax6.set_xticklabels([])
ax6.set_yticklabels([])
ax6.tick_params(left=False, bottom=False)

# Thall
ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax7, data=df, x='thall',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax7.set_xlabel("")
ax7.set_ylabel("")

# Thalachh title
ax8.text(0.5,0.5,"Boxen plot of\nthalachh wrt\noutcome\n_______",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax8.text(1,.5,"Maximum heart\nrate achieved",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)

ax8.spines["bottom"].set_visible(False)
ax8.set_xticklabels([])
ax8.set_yticklabels([])
ax8.tick_params(left=False, bottom=False)

# Thalachh
ax9.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax9,
data=df,x='output',y='thalachh',palette=["#8000ff","#da8829"])
ax9.set_xlabel("")
ax9.set_ylabel("")

# Exng title
ax10.text(0.5,0.5,"Strip Plot of\nexng vs age\n______",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax10.text(1,.5,"Exercise induced\nangina\n0 - No\n1 - Yes",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)
ax10.spines["bottom"].set_visible(False)
ax10.set_xticklabels([])
ax10.set_yticklabels([])
ax10.tick_params(left=False, bottom=False)

# Exng
ax11.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.stripplot(ax=ax11,
              data=df,x='exng',y='age',hue='output',palette=["#8000ff","#da8829"])
# clear the labels of the strip plot's own axes (ax11, not ax9)
ax11.set_xlabel("")
ax11.set_ylabel("")

4. Data Preprocessing

4.1 Conclusions from the EDA

1. There are no NaN values in the data.
2. All the continuous features contain some outliers.
3. The data contains more than twice as many people with sex = 1 as with sex = 0.
4. The heatmap shows no strong linear correlation between the continuous variables; the largest in magnitude is about -0.40, between age and thalachh.
5. The scatterplot heatmap suggests that there might be some correlation between output and cp, thalachh and slp.
6. Intuitively, older people might be expected to have a higher chance of heart attack, but the distribution plot of age with respect to output does not support this.
7. According to the distribution plot of thalachh with respect to output, people with a higher maximum heart rate achieved have a higher chance of heart attack.
8. According to the distribution plot of oldpeak with respect to output, people with a lower previous peak achieved have a higher chance of heart attack.
9. The plots in 3.2.4 show the following:
– People with non-anginal chest pain, that is with cp = 2, have a higher chance of heart attack.
– People with 0 major vessels, that is with caa = 0, have a high chance of heart attack.
– People with sex = 1 have a higher chance of heart attack.
– People with thall = 2 have a much higher chance of heart attack.
– People with no exercise-induced angina, that is with exng = 0, have a higher chance of heart attack.
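
Statements such as those in point 9 are read off plots, and they can be cross-checked numerically by taking the mean of the binary target within each category. The sketch below shows the idea on a hypothetical stand-in frame `toy`; its numbers are made up for illustration and are not results from the heart dataset.

```python
import pandas as pd

# Toy frame standing in for the heart dataset (illustrative values only).
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 2, 2, 2, 1, 3],
    "output": [0, 0, 1, 1, 1, 0, 1, 1],
})

# The mean of a 0/1 target within a group is the share of rows with output = 1.
rate_by_cp = toy.groupby("cp")["output"].mean()
print(rate_by_cp)
```

Applied to the real frame, the same one-liner with `cp`, `caa`, `sex`, `thall` or `exng` as the grouping column would put a number on each bullet.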

4.2 Packages
# Scaling
from sklearn.preprocessing import RobustScaler

# Train Test Split

from sklearn.model_selection import train_test_split

# Models
import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Metrics
from sklearn.metrics import accuracy_score, classification_report, roc_curve

# Cross Validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

print('Packages imported...')

Packages imported...
4.3 Making features model ready
4.3.1 Scaling and Encoding features
# creating a copy of df (plain assignment would only alias the same object)
df1 = df.copy()

# define the columns to be encoded and scaled

cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]

# encoding the categorical columns

df1 = pd.get_dummies(df1, columns = cat_cols, drop_first = True)

# defining the features and target

X = df1.drop(['output'],axis=1)
y = df1[['output']]

# instantiating the scaler

scaler = RobustScaler()

# scaling the continuous features

X[con_cols] = scaler.fit_transform(X[con_cols])
print("The first 5 rows of X are")
X.head()

The first 5 rows of X are

        age  trtbps      chol  thalachh  oldpeak  sex_1  exng_1  caa_1  caa_2  \
0  0.592593    0.75 -0.110236 -0.092308   0.9375      1       0      0      0
1 -1.333333    0.00  0.157480  1.046154   1.6875      1       0      0      0
2 -1.037037    0.00 -0.566929  0.584615   0.3750      0       0      0      0
3  0.074074   -0.50 -0.062992  0.769231   0.0000      1       0      0      0
4  0.148148   -0.50  1.795276  0.307692  -0.1250      0       1      0      0

   caa_3  ...  cp_2  cp_3  fbs_1  restecg_1  restecg_2  slp_1  slp_2  thall_1  \
0      0  ...     0     1      1          0          0      0      0        1
1      0  ...     1     0      0          1          0      0      0        0
2      0  ...     0     0      0          0          0      0      1        0
3      0  ...     0     0      0          1          0      0      1        0
4      0  ...     0     0      0          1          0      0      1        0

   thall_2  thall_3
0        0        0
1        1        0
2        1        0
3        1        0
4        1        0

[5 rows x 22 columns]
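
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range. The scaled `age` column above can be reproduced by hand from the quartiles listed in 2.3.5 (median 55.0, 25% = 47.5, 75% = 61.0); the short check below does exactly that for the first five ages.

```python
# First five ages from df.head() and the age quartiles from section 2.3.5.
ages = [63, 37, 41, 56, 57]
median, q1, q3 = 55.0, 47.5, 61.0

# Robust scaling: (x - median) / (Q3 - Q1)
scaled = [(a - median) / (q3 - q1) for a in ages]
print([round(s, 6) for s in scaled])
# → [0.592593, -1.333333, -1.037037, 0.074074, 0.148148]
```

These match the `age` values printed for `X.head()`, which confirms how the column was transformed.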

4.3.2 Train and test split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size =
0.2, random_state = 42)
print("The shape of X_train is ", X_train.shape)
print("The shape of X_test is ",X_test.shape)
print("The shape of y_train is ",y_train.shape)
print("The shape of y_test is ",y_test.shape)

The shape of X_train is (242, 22)
The shape of X_test is (61, 22)
The shape of y_train is (242, 1)
The shape of y_test is (61, 1)
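
In the steps above the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics. One possible alternative, sketched here and not the notebook's own method, is to split first and wrap the preprocessing in a scikit-learn `Pipeline` so that it is fitted on the training rows only. The frame `toy` and its synthetic target are assumptions made purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Synthetic stand-in data (not the heart dataset).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(29, 78, size=n),
    "chol": rng.integers(126, 565, size=n),
    "cp": rng.integers(0, 4, size=n),
})
toy["output"] = (toy["cp"] > 0).astype(int)  # made-up target for illustration

X = toy.drop(columns="output")
y = toy["output"]  # 1-D Series, which scikit-learn estimators expect

# Split first; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling and encoding are learned from the training rows only.
preprocess = ColumnTransformer([
    ("scale", RobustScaler(), ["age", "chol"]),
    ("encode", OneHotEncoder(drop="first"), ["cp"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print("test accuracy on the synthetic data:", acc)
```

On a dataset of 303 rows the difference may be small, but the pipeline form also makes cross-validation and grid search apply the preprocessing correctly inside each fold.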

5. Modeling

5.1 Linear Classifiers

5.1.1 Support Vector Machines
# instantiating the object and fitting
clf = SVC(kernel='linear', C=1, random_state=42).fit(X_train,y_train)

# predicting the values

y_pred = clf.predict(X_test)

# printing the test accuracy

print("The test accuracy score of SVM is ", accuracy_score(y_test,
y_pred))

The test accuracy score of SVM is 0.8688524590163934

5.1.2 Hyperparameter tuning of SVC

# instantiating the object
svm = SVC()

# setting a grid - not so extensive

parameters = {"C":np.arange(1,10,1),'gamma':[0.00001,0.00005,
0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5]}

# instantiating the GridSearchCV object

searcher = GridSearchCV(svm, parameters)

# fitting the object

searcher.fit(X_train, y_train)

# the scores
print("The best params are :", searcher.best_params_)
print("The best score is :", searcher.best_score_)

# predicting the values

y_pred = searcher.predict(X_test)

# printing the test accuracy

print("The test accuracy score of SVM after hyper-parameter tuning is ",
      accuracy_score(y_test, y_pred))

The best params are : {'C': 3, 'gamma': 0.1}
The best score is : 0.8384353741496599
The test accuracy score of SVM after hyper-parameter tuning is 0.9016393442622951

5.1.3 Logistic Regression

# instantiating the object
logreg = LogisticRegression()

# fitting the object

logreg.fit(X_train, y_train)

# calculating the probabilities

y_pred_proba = logreg.predict_proba(X_test)

# finding the predicted values

y_pred = np.argmax(y_pred_proba,axis=1)

# printing the test accuracy

print("The test accuracy score of Logistic Regression is ",
      accuracy_score(y_test, y_pred))

The test accuracy score of Logistic Regression is 0.9016393442622951
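
Taking the `argmax` of `predict_proba` works here because the classes are 0 and 1, so the column index equals the label. A slightly more general form maps the index through `classes_`; the sketch below, on synthetic data from `make_classification` (an assumption for the sake of a runnable example), compares that with `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the heart dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
logreg = LogisticRegression().fit(X, y)

# Column i of predict_proba corresponds to the label logreg.classes_[i].
proba = logreg.predict_proba(X)
labels_from_argmax = logreg.classes_[np.argmax(proba, axis=1)]

print(np.array_equal(labels_from_argmax, logreg.predict(X)))
```

In practice `logreg.predict(X_test)` is the shorter way to get the labels; the probabilities are mainly needed for the ROC curve that follows.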

5.1.4 ROC Curve

# calculating the probabilities
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# computing the ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# plotting the curve

plt.plot([0,1],[0,1],"k--")
plt.plot(fpr,tpr,label='Logistic Regression')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.legend()
plt.show()
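
The curve can be summarised by a single number, the area under it. The sketch below uses `roc_auc_score` from scikit-learn on four hand-made labels and scores (illustrative values, not predictions from the model above).

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Four hand-made examples: true labels and predicted scores for class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)
# → AUC: 0.75
```

For the notebook's model, the equivalent call would be `roc_auc_score(y_test, y_pred_prob)`; an AUC of 0.5 corresponds to the dashed diagonal and 1.0 to a perfect ranking.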

5.2 Tree Models

5.2.1 Decision Tree
# instantiating the object
dt = DecisionTreeClassifier(random_state = 42)

# fitting the model

dt.fit(X_train, y_train)

# calculating the predictions

y_pred = dt.predict(X_test)

# printing the test accuracy

print("The test accuracy score of Decision Tree is ",
accuracy_score(y_test, y_pred))

The test accuracy score of Decision Tree is 0.7868852459016393

5.2.2 Random Forest
# instantiating the object
rf = RandomForestClassifier()

# fitting the model

rf.fit(X_train, y_train)

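# calculating the predictions (the original run called dt.predict here,
# which re-scored the Decision Tree instead of the Random Forest)

y_pred = rf.predict(X_test)

# printing the test accuracy

print("The test accuracy score of Random Forest is ",
      accuracy_score(y_test, y_pred))

The test accuracy score of Random Forest is 0.7868852459016393

Note: the score printed above was produced with dt.predict, so it repeats the Decision Tree result; the Random Forest's own test accuracy is not reported in the source.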

5.2.3 Gradient Boosting Classifier - without tuning

# instantiate the classifier
gbt = GradientBoostingClassifier(n_estimators =
300,max_depth=1,subsample=0.8,max_features=0.2,random_state=42)

# fitting the model

gbt.fit(X_train,y_train)

# predicting values
y_pred = gbt.predict(X_test)
print("The test accuracy score of Gradient Boosting Classifier is ",
accuracy_score(y_test, y_pred))

The test accuracy score of Gradient Boosting Classifier is 0.8688524590163934
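
All of the scores above come from a single 61-row test split, so differences of a few points between models may be noise. One way to get a steadier comparison, sketched here on synthetic data from `make_classification` (an assumption so the example runs on its own), is to score each model with `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the heart dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=50, max_depth=1, random_state=42),
}

# Mean accuracy over 5 folds for each model.
cv_means = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_means.items():
    print(f"{name}: {score:.3f}")
```

The hyperparameters here are placeholders; with the notebook's data, `X` and `y` would be the prepared features and the `output` column.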

Heart Disease Prediction! ❤️?
No ratings yet
Heart Disease Prediction! ❤️?
52 pages
Heart Attacks Analysis
No ratings yet
Heart Attacks Analysis
10 pages
1728086737277
No ratings yet
1728086737277
26 pages
Heart Failure Prediction