Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
17 views

Python Solution

This document outlines the steps taken to perform exploratory data analysis on a heart disease dataset. It includes: 1. Loading and understanding the data, which has 303 samples and 14 features including categorical and continuous variables. 2. Performing exploratory analysis including univariate analysis with count plots of categorical features and boxen plots of continuous features, as well as bivariate analysis. 3. Preprocessing the data for modeling, which includes feature engineering and making features model-ready. The goal is to predict the risk of heart attack by applying various machine learning models like linear classifiers and tree models.

Uploaded by

Mile Mile
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views

Python Solution

This document outlines the steps taken to perform exploratory data analysis on a heart disease dataset. It includes: 1. Loading and understanding the data, which has 303 samples and 14 features including categorical and continuous variables. 2. Performing exploratory analysis including univariate analysis with count plots of categorical features and boxen plots of continuous features, as well as bivariate analysis. 3. Preprocessing the data for modeling, which includes feature engineering and making features model-ready. The goal is to predict the risk of heart attack by applying various machine learning models like linear classifiers and tree models.

Uploaded by

Mile Mile
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 30

1.

Introduction
– 1.1 Data Dictionary
– 1.2 Task
2. Preparation
– 2.1 Packages
– 2.2 Data
– 2.3 Understanding Data
3. Exploratory Data Analysis
– 3.1 Univariate Analysis
– 3.2 Bivariate Analysis
4. Data Preprocessing
– 4.1 Conclusions from the EDA
– 4.2 Packages
– 4.3 Making features model ready
5. Modeling
– 5.1 Linear Classifiers
– 5.2 Tree Models

1. Introduction
back to top

1.1 Data Dictionary


age - Age of the patient

sex - Sex of the patient

cp - Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 =


Asymptomatic

trtbps - Resting blood pressure (in mm Hg)

chol - Cholestoral in mg/dl fetched via BMI sensor

fbs - (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False

restecg - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave normality, 2 = Left


ventricular hypertrophy

thalachh - Maximum heart rate achieved

oldpeak - Previous peak

slp - Slope

caa - Number of major vessels

thall - Thalium Stress Test result ~ (0,3)

exng - Exercise induced angina ~ 1 = Yes, 0 = No


output - Target variable

1.2 Task
To perform EDA and predict if a person is prone to a heart attack or not.

2. Preparation
back to top

2.1 Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

2.2 Data
df = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-
dataset/heart.csv")

2.3 Understanding Data


2.3.1 The shape of the data
print("The shape of the dataset is : ", df.shape)

The shape of the dataset is : (303, 14)

2.3.2 Preview of the first 5 rows of the data


df.head()

age sex cp trtbps chol fbs restecg thalachh exng oldpeak


slp \
0 63 1 3 145 233 1 0 150 0 2.3
0
1 37 1 2 130 250 0 1 187 0 3.5
0
2 41 0 1 130 204 0 0 172 0 1.4
2
3 56 1 1 120 236 0 1 178 0 0.8
2
4 57 0 0 120 354 0 1 163 1 0.6
2

caa thall output


0 0 1 1
1 0 2 1
2 0 2 1
3 0 2 1
4 0 2 1

2.3.3 Checking the number of unique values in each column


dict = {}
for i in list(df.columns):
dict[i] = df[i].value_counts().shape[0]

pd.DataFrame(dict,index=["unique count"]).transpose()

unique count
age 41
sex 2
cp 4
trtbps 49
chol 152
fbs 2
restecg 3
thalachh 91
exng 2
oldpeak 40
slp 3
caa 5
thall 4
output 2

2.3.4 Separating the columns in categorical and continuous


cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]
target_col = ["output"]
print("The categorial cols are : ", cat_cols)
print("The continuous cols are : ", con_cols)
print("The target variable is : ", target_col)

The categorial cols are : ['sex', 'exng', 'caa', 'cp', 'fbs',


'restecg', 'slp', 'thall']
The continuous cols are : ['age', 'trtbps', 'chol', 'thalachh',
'oldpeak']
The target variable is : ['output']

2.3.5 Summary statistics


df[con_cols].describe().transpose()

count mean std min 25% 50% 75%


max
age 303.0 54.366337 9.082101 29.0 47.5 55.0 61.0
77.0
trtbps 303.0 131.623762 17.538143 94.0 120.0 130.0 140.0
200.0
chol 303.0 246.264026 51.830751 126.0 211.0 240.0 274.5
564.0
thalachh 303.0 149.646865 22.905161 71.0 133.5 153.0 166.0
202.0
oldpeak 303.0 1.039604 1.161075 0.0 0.0 0.8 1.6
6.2

2.3.6 Missing values


df.isnull().sum()

age 0
sex 0
cp 0
trtbps 0
chol 0
fbs 0
restecg 0
thalachh 0
exng 0
oldpeak 0
slp 0
caa 0
thall 0
output 0
dtype: int64

3. Exploratory Data Analysis


back to top

3.1 Univariate Analysis


3.1.1 Count plot of categorical features
fig = plt.figure(figsize=(18,15))
gs = fig.add_gridspec(3,3)
gs.update(wspace=0.5, hspace=0.25)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[0,2])
ax3 = fig.add_subplot(gs[1,0])
ax4 = fig.add_subplot(gs[1,1])
ax5 = fig.add_subplot(gs[1,2])
ax6 = fig.add_subplot(gs[2,0])
ax7 = fig.add_subplot(gs[2,1])
ax8 = fig.add_subplot(gs[2,2])
background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)
ax6.set_facecolor(background_color)
ax7.set_facecolor(background_color)
ax8.set_facecolor(background_color)

# Title of the plot


ax0.spines["bottom"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.tick_params(left=False, bottom=False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.text(0.5,0.5,
'Count plot for various\n categorical features\
n_________________',
horizontalalignment='center',
verticalalignment='center',
fontsize=18, fontweight='bold',
fontfamily='serif',
color="#000000")

# Sex count
ax1.text(0.3, 220, 'Sex', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax1,data=df,x='sex',palette=color_palette)
ax1.set_xlabel("")
ax1.set_ylabel("")

# Exng count
ax2.text(0.3, 220, 'Exng', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax2.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax2,data=df,x='exng',palette=color_palette)
ax2.set_xlabel("")
ax2.set_ylabel("")

# Caa count
ax3.text(1.5, 200, 'Caa', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax3,data=df,x='caa',palette=color_palette)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Cp count
ax4.text(1.5, 162, 'Cp', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax4.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax4,data=df,x='cp',palette=color_palette)
ax4.set_xlabel("")
ax4.set_ylabel("")

# Fbs count
ax5.text(0.5, 290, 'Fbs', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax5,data=df,x='fbs',palette=color_palette)
ax5.set_xlabel("")
ax5.set_ylabel("")

# Restecg count
ax6.text(0.75, 165, 'Restecg', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax6.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax6,data=df,x='restecg',palette=color_palette)
ax6.set_xlabel("")
ax6.set_ylabel("")

# Slp count
ax7.text(0.85, 155, 'Slp', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax7,data=df,x='slp',palette=color_palette)
ax7.set_xlabel("")
ax7.set_ylabel("")

# Thall count
ax8.text(1.2, 180, 'Thall', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax8.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax8,data=df,x='thall',palette=color_palette)
ax8.set_xlabel("")
ax8.set_ylabel("")

for s in ["top","right","left"]:
ax1.spines[s].set_visible(False)
ax2.spines[s].set_visible(False)
ax3.spines[s].set_visible(False)
ax4.spines[s].set_visible(False)
ax5.spines[s].set_visible(False)
ax6.spines[s].set_visible(False)
ax7.spines[s].set_visible(False)
ax8.spines[s].set_visible(False)

3.1.2 Boxen plot of continuous features


fig = plt.figure(figsize=(18,16))
gs = fig.add_gridspec(2,3)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[0,2])
ax3 = fig.add_subplot(gs[1,0])
ax4 = fig.add_subplot(gs[1,1])
ax5 = fig.add_subplot(gs[1,2])

background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)

# Title of the plot


ax0.spines["bottom"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.tick_params(left=False, bottom=False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.text(0.5,0.5,
'Boxen plot for various\n continuous features\
n_________________',
horizontalalignment='center',
verticalalignment='center',
fontsize=18, fontweight='bold',
fontfamily='serif',
color="#000000")

# Age
ax1.text(-0.05, 81, 'Age', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax1,y=df['age'],palette=["#800000"],width=0.6)
ax1.set_xlabel("")
ax1.set_ylabel("")

# Trtbps
ax2.text(-0.05, 208, 'Trtbps', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax2.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax2,y=df['trtbps'],palette=["#8000ff"],width=0.6)
ax2.set_xlabel("")
ax2.set_ylabel("")
# Chol
ax3.text(-0.05, 600, 'Chol', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax3,y=df['chol'],palette=["#6aac90"],width=0.6)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Thalachh
ax4.text(-0.09, 210, 'Thalachh', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax4.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax4,y=df['thalachh'],palette=["#5833ff"],width=0.6)
ax4.set_xlabel("")
ax4.set_ylabel("")

# oldpeak
ax5.text(-0.1, 6.6, 'Oldpeak', fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax5,y=df['oldpeak'],palette=["#da8829"],width=0.6)
ax5.set_xlabel("")
ax5.set_ylabel("")

for s in ["top","right","left"]:
ax1.spines[s].set_visible(False)
ax2.spines[s].set_visible(False)
ax3.spines[s].set_visible(False)
ax4.spines[s].set_visible(False)
ax5.spines[s].set_visible(False)
3.1.2 Count plot of target
fig = plt.figure(figsize=(18,7))
gs = fig.add_gridspec(1,2)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])

background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)

# Title of the plot


ax0.text(0.5,0.5,"Count of the target\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')

ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)

# Target Count
ax1.text(0.35,177,"Output",fontsize=14, fontweight='bold',
fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax1, data=df, x = 'output',palette = color_palette)
ax1.set_xlabel("")
ax1.set_ylabel("")
ax1.set_xticklabels(["Low chances of attack(0)","High chances of
attack(1)"])

ax0.spines["top"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["bottom"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.spines["right"].set_visible(False)

3.2 Bivariate Analysis


3.2.1 Correlation matrix of continuous features
df_corr = df[con_cols].corr().transpose()
df_corr
age trtbps chol thalachh oldpeak
age 1.000000 0.279351 0.213678 -0.398522 0.210013
trtbps 0.279351 1.000000 0.123174 -0.046698 0.193216
chol 0.213678 0.123174 1.000000 -0.009940 0.053952
thalachh -0.398522 -0.046698 -0.009940 1.000000 -0.344187
oldpeak 0.210013 0.193216 0.053952 -0.344187 1.000000

fig = plt.figure(figsize=(10,10))
gs = fig.add_gridspec(1,1)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig.add_subplot(gs[0,0])

color_palette = ["#5833ff","#da8829"]
mask = np.triu(np.ones_like(df_corr))
ax0.text(1.5,-0.1,"Correlation Matrix",fontsize=22, fontweight='bold',
fontfamily='serif', color="#000000")
df_corr = df[con_cols].corr().transpose()
sns.heatmap(df_corr,mask=mask,fmt=".1f",annot=True,cmap='YlGnBu')
plt.show()
3.2.2 Scatterplot heatmap of dataframe
fig = plt.figure(figsize=(12,12))
corr_mat = df.corr().stack().reset_index(name="correlation")
g = sns.relplot(
data=corr_mat,
x="level_0", y="level_1", hue="correlation", size="correlation",
palette="YlGnBu", hue_norm=(-1, 1), edgecolor=".7",
height=10, sizes=(50, 250), size_norm=(-.2, .8),
)
g.set(xlabel="features on X", ylabel="featurs on Y", aspect="equal")
g.fig.suptitle('Scatterplot heatmap',fontsize=22, fontweight='bold',
fontfamily='serif', color="#000000")
g.despine(left=True, bottom=True)
g.ax.margins(.02)
for label in g.ax.get_xticklabels():
label.set_rotation(90)
for artist in g.legend.legendHandles:
artist.set_edgecolor(".7")
plt.show()

<Figure size 864x864 with 0 Axes>


3.2.3 Distribution of continuous features according to target variable
fig = plt.figure(figsize=(18,18))
gs = fig.add_gridspec(5,2)
gs.update(wspace=0.5, hspace=0.5)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])

background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)
ax6.set_facecolor(background_color)
ax7.set_facecolor(background_color)
ax8.set_facecolor(background_color)
ax9.set_facecolor(background_color)

# Age title
ax0.text(0.5,0.5,"Distribution of age\naccording to\n target variable\
n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax0.spines["bottom"].set_visible(False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)

# Age
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax1, data=df, x='age',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax1.set_xlabel("")
ax1.set_ylabel("")
# TrTbps title
ax2.text(0.5,0.5,"Distribution of trtbps\naccording to\n target
variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax2.spines["bottom"].set_visible(False)
ax2.set_xticklabels([])
ax2.set_yticklabels([])
ax2.tick_params(left=False, bottom=False)

# TrTbps
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax3, data=df, x='trtbps',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Chol title
ax4.text(0.5,0.5,"Distribution of chol\naccording to\n target
variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax4.spines["bottom"].set_visible(False)
ax4.set_xticklabels([])
ax4.set_yticklabels([])
ax4.tick_params(left=False, bottom=False)

# Chol
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax5, data=df, x='chol',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax5.set_xlabel("")
ax5.set_ylabel("")

# Thalachh title
ax6.text(0.5,0.5,"Distribution of thalachh\naccording to\n target
variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax6.spines["bottom"].set_visible(False)
ax6.set_xticklabels([])
ax6.set_yticklabels([])
ax6.tick_params(left=False, bottom=False)

# Thalachh
ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax7, data=df, x='thalachh',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax7.set_xlabel("")
ax7.set_ylabel("")

# Oldpeak title
ax8.text(0.5,0.5,"Distribution of oldpeak\naccording to\n target
variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax8.spines["bottom"].set_visible(False)
ax8.set_xticklabels([])
ax8.set_yticklabels([])
ax8.tick_params(left=False, bottom=False)

# Oldpeak
ax9.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax9, data=df, x='oldpeak',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax9.set_xlabel("")
ax9.set_ylabel("")

for i in ["top","left","right"]:
ax0.spines[i].set_visible(False)
ax1.spines[i].set_visible(False)
ax2.spines[i].set_visible(False)
ax3.spines[i].set_visible(False)
ax4.spines[i].set_visible(False)
ax5.spines[i].set_visible(False)
ax6.spines[i].set_visible(False)
ax7.spines[i].set_visible(False)
ax8.spines[i].set_visible(False)
ax9.spines[i].set_visible(False)
3.2.4 Some other relations that seemed intuitive
fig = plt.figure(figsize=(18,20))
gs = fig.add_gridspec(6,2)
gs.update(wspace=0.5, hspace=0.5)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])
ax10 = fig.add_subplot(gs[5,0])
ax11 = fig.add_subplot(gs[5,1])

background_color = "#ffe6e6"
color_palette = ["#800000","#8000ff","#6aac90","#5833ff","#da8829"]
fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)
ax4.set_facecolor(background_color)
ax5.set_facecolor(background_color)
ax6.set_facecolor(background_color)
ax7.set_facecolor(background_color)
ax8.set_facecolor(background_color)
ax9.set_facecolor(background_color)
ax10.set_facecolor(background_color)
ax11.set_facecolor(background_color)

# Cp title
# 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 =
Asymptomatic
ax0.text(0.5,0.5,"Chest pain\ndistribution\n__________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax0.spines["bottom"].set_visible(False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)
ax0.text(1,.5,"0 - Typical Angina\n1 - Atypical Angina\n2 - Non-
anginal Pain\n3 - Asymptomatic",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)

# Cp
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax1, data=df, x='cp',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax1.set_xlabel("")
ax1.set_ylabel("")
# Caa title
ax2.text(0.5,0.5,"Number of\nmajor vessels\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax2.text(1,.5,"0 vessels\n1 vessel\n2 vessels\n3 vessels\n4vessels",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)

ax2.spines["bottom"].set_visible(False)
ax2.set_xticklabels([])
ax2.set_yticklabels([])
ax2.tick_params(left=False, bottom=False)

# Caa
ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax3, data=df, x='caa',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax3.set_xlabel("")
ax3.set_ylabel("")

# Sex title
ax4.text(0.5,0.5,"Heart Attack\naccording to\nsex\n______",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax4.text(1,.5,"0 - Female\n1 - Male",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)
ax4.spines["bottom"].set_visible(False)
ax4.set_xticklabels([])
ax4.set_yticklabels([])
ax4.tick_params(left=False, bottom=False)

# Sex
ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.countplot(ax=ax5,data=df,x='sex',palette=["#8000ff","#da8829"],
hue='output')
ax5.set_xlabel("")
ax5.set_ylabel("")

# Thall title
ax6.text(0.5,0.5,"Distribution of thall\naccording to\n target
variable\n___________",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax6.text(1,.5,"Thalium Stress\nTest Result\n0, 1, 2, 3",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)
ax6.spines["bottom"].set_visible(False)
ax6.set_xticklabels([])
ax6.set_yticklabels([])
ax6.tick_params(left=False, bottom=False)

# Thall
ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.kdeplot(ax=ax7, data=df, x='thall',hue="output",
fill=True,palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax7.set_xlabel("")
ax7.set_ylabel("")

# Thalachh title
ax8.text(0.5,0.5,"Boxen plot of\nthalachh wrt\noutcome\n_______",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax8.text(1,.5,"Maximum heart\nrate achieved",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)

ax8.spines["bottom"].set_visible(False)
ax8.set_xticklabels([])
ax8.set_yticklabels([])
ax8.tick_params(left=False, bottom=False)

# Thalachh
ax9.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.boxenplot(ax=ax9,
data=df,x='output',y='thalachh',palette=["#8000ff","#da8829"])
ax9.set_xlabel("")
ax9.set_ylabel("")

# Exng title
ax10.text(0.5,0.5,"Strip Plot of\nexng vs age\n______",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 18,
fontweight='bold',
fontfamily='serif',
color='#000000')
ax10.text(1,.5,"Exercise induced\nangina\n0 - No\n1 - Yes",
horizontalalignment = 'center',
verticalalignment = 'center',
fontsize = 14
)
ax10.spines["bottom"].set_visible(False)
ax10.set_xticklabels([])
ax10.set_yticklabels([])
ax10.tick_params(left=False, bottom=False)

# Exng
ax11.grid(color='#000000', linestyle=':', axis='y', zorder=0,
dashes=(1,5))
sns.stripplot(ax=ax11,
data=df,x='exng',y='age',hue='output',palette=["#8000ff","#da8829"])
ax9.set_xlabel("")
ax9.set_ylabel("")

for i in ["top","left","right"]:
ax0.spines[i].set_visible(False)
ax1.spines[i].set_visible(False)
ax2.spines[i].set_visible(False)
ax3.spines[i].set_visible(False)
ax4.spines[i].set_visible(False)
ax5.spines[i].set_visible(False)
ax6.spines[i].set_visible(False)
ax7.spines[i].set_visible(False)
ax8.spines[i].set_visible(False)
ax9.spines[i].set_visible(False)
ax10.spines[i].set_visible(False)
ax11.spines[i].set_visible(False)
3.2.5 Pairplot according to target variable - one plot to rule them all
sns.pairplot(df,hue='output',palette = ["#8000ff","#da8829"])
plt.show()
4. Data Preprocessing
back to top

4.1 Conclusions from the EDA


1. There are no NaN values in the data.
2. There are certain outliers in all the continuous features.
3. The data consists of more than twice the number of people with sex = 1 than sex = 0.
4. There is no apparent linear correlation between continuous variable according to the
heatmap.
5. The scatterplot heatmap matrix suggests that there might be some correlation between
output and cp, thalachh and slp.
6. It is intuitive that elder people might have higher chances of heart attack but according to
the distribution plot of age wrt output, it is evident that this isn't the case.
7. According to the distribution plot of thalachh wrt output, people with higher
maximum heart rate achieved have higher chances of heart attack.
8. According to the distribution plot of oldpeak wrt output, people with lower pevious
peak achieved have higher chances of heart attack.
9. The plot 3.2.4 tells about the following -
– People with Non-Anginal chest pain, that is with cp = 2 have higher chances of
heart attack.
– People with 0 major vessels, that is with caa = 0 have high chance of heart
attack.
– People with sex = 1 have higher chance of heart attack.
– People with thall = 2 have much higher chance of heart attack.
– People with no exercise induced angina, that is with exng = 0 have higher chance
of heart attack.

4.2 Packages
# Scaling
from sklearn.preprocessing import RobustScaler

# Train Test Split


from sklearn.model_selection import train_test_split

# Models
import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Metrics
from sklearn.metrics import accuracy_score, classification_report,
roc_curve

# Cross Validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

print('Packages imported...')

Packages imported...
4.3 Making features model ready
4.3.1 Scaling and Encoding features
# creating a copy of df
df1 = df

# define the columns to be encoded and scaled


cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]

# encoding the categorical columns


df1 = pd.get_dummies(df1, columns = cat_cols, drop_first = True)

# defining the features and target


X = df1.drop(['output'],axis=1)
y = df1[['output']]

# instantiating the scaler


scaler = RobustScaler()

# scaling the continuous featuree


X[con_cols] = scaler.fit_transform(X[con_cols])
print("The first 5 rows of X are")
X.head()

The first 5 rows of X are

age trtbps chol thalachh oldpeak sex_1 exng_1 caa_1


caa_2 \
0 0.592593 0.75 -0.110236 -0.092308 0.9375 1 0 0
0
1 -1.333333 0.00 0.157480 1.046154 1.6875 1 0 0
0
2 -1.037037 0.00 -0.566929 0.584615 0.3750 0 0 0
0
3 0.074074 -0.50 -0.062992 0.769231 0.0000 1 0 0
0
4 0.148148 -0.50 1.795276 0.307692 -0.1250 0 1 0
0

caa_3 ... cp_2 cp_3 fbs_1 restecg_1 restecg_2 slp_1 slp_2


thall_1 \
0 0 ... 0 1 1 0 0 0 0
1
1 0 ... 1 0 0 1 0 0 0
0
2 0 ... 0 0 0 0 0 0 1
0
3 0 ... 0 0 0 1 0 0 1
0
4 0 ... 0 0 0 1 0 0 1
0

thall_2 thall_3
0 0 0
1 1 0
2 1 0
3 1 0
4 1 0

[5 rows x 22 columns]

4.3.2 Train and test split


X_train, X_test, y_train, y_test = train_test_split(X,y, test_size =
0.2, random_state = 42)
print("The shape of X_train is ", X_train.shape)
print("The shape of X_test is ",X_test.shape)
print("The shape of y_train is ",y_train.shape)
print("The shape of y_test is ",y_test.shape)

The shape of X_train is (242, 22)


The shape of X_test is (61, 22)
The shape of y_train is (242, 1)
The shape of y_test is (61, 1)

5. Modeling
back to top

5.1 Linear Classifiers


5.1.1 Support Vector Machines
# instantiating the object and fitting
clf = SVC(kernel='linear', C=1, random_state=42).fit(X_train,y_train)

# predicting the values


y_pred = clf.predict(X_test)

# printing the test accuracy


print("The test accuracy score of SVM is ", accuracy_score(y_test,
y_pred))

The test accuracy score of SVM is 0.8688524590163934

5.1.2 Hyperparameter tuning of SVC


# instantiating the object
svm = SVC()

# setting a grid - not so extensive


parameters = {"C":np.arange(1,10,1),'gamma':[0.00001,0.00005,
0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5]}

# instantiating the GridSearchCV object


searcher = GridSearchCV(svm, parameters)

# fitting the object


searcher.fit(X_train, y_train)

# the scores
print("The best params are :", searcher.best_params_)
print("The best score is :", searcher.best_score_)

# predicting the values


y_pred = searcher.predict(X_test)

# printing the test accuracy


print("The test accuracy score of SVM after hyper-parameter tuning is
", accuracy_score(y_test, y_pred))

The best params are : {'C': 3, 'gamma': 0.1}


The best score is : 0.8384353741496599
The test accuracy score of SVM after hyper-parameter tuning is
0.9016393442622951

5.1.3 Logistic Regression


# instantiating the object
logreg = LogisticRegression()

# fitting the object


logreg.fit(X_train, y_train)

# calculating the probabilities


y_pred_proba = logreg.predict_proba(X_test)

# finding the predicted valued


y_pred = np.argmax(y_pred_proba,axis=1)

# printing the test accuracy


print("The test accuracy score of Logistric Regression is ",
accuracy_score(y_test, y_pred))

The test accuracy score of Logistric Regression is 0.9016393442622951

5.1.4 ROC Curve


# calculating the probabilities
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# instantiating the roc_cruve


fpr,tpr,threshols=roc_curve(y_test,y_pred_prob)

# plotting the curve


plt.plot([0,1],[0,1],"k--",'r+')
plt.plot(fpr,tpr,label='Logistic Regression')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistric Regression ROC Curve")
plt.show()

5.2 Tree Models


5.2.1 Decision Tree
# instantiating the object
dt = DecisionTreeClassifier(random_state = 42)

# fitting the model


dt.fit(X_train, y_train)

# calculating the predictions


y_pred = dt.predict(X_test)

# printing the test accuracy


print("The test accuracy score of Decision Tree is ",
accuracy_score(y_test, y_pred))

The test accuracy score of Decision Tree is 0.7868852459016393


5.2.2 Random Forest
# instantiating the object
rf = RandomForestClassifier()

# fitting the model


rf.fit(X_train, y_train)

# calculating the predictions


y_pred = dt.predict(X_test)

# printing the test accuracy


print("The test accuracy score of Random Forest is ",
accuracy_score(y_test, y_pred))

The test accuracy score of Random Forest is 0.7868852459016393

5.2.3 Gradient Boosting Classifier - without tuning


# instantiate the classifier
gbt = GradientBoostingClassifier(n_estimators =
300,max_depth=1,subsample=0.8,max_features=0.2,random_state=42)

# fitting the model


gbt.fit(X_train,y_train)

# predicting values
y_pred = gbt.predict(X_test)
print("The test accuracy score of Gradient Boosting Classifier is ",
accuracy_score(y_test, y_pred))

The test accuracy score of Gradient Boosting Classifier is


0.8688524590163934

back to top

You might also like