Siahaan V. Data Science Crash Course... With Python GUI 2ed 2023
DATA SCIENCE CRASH COURSE:
THYROID DISEASE
CLASSIFICATION AND PREDICTION
USING MACHINE LEARNING AND DEEP LEARNING
WITH PYTHON GUI
Second Edition
VIVIAN SIAHAAN
RISMON HASIHOLAN SIANIPAR
Description
Exploring Dataset
#thyroid.py
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import os
import plotly.graph_objs as go
import joblib
import itertools
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split, \
    RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, \
    LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, \
    recall_score, precision_score
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# on newer versions use ConfusionMatrixDisplay instead.
from sklearn.metrics import classification_report, f1_score, \
    plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
#Reads dataset
curr_path = os.getcwd()
df = pd.read_csv(curr_path + "/hypothyroid.csv")
print(df.iloc[:, 0:7].head().to_string())
print(df.iloc[:, 7:12].head().to_string())
print(df.iloc[:, 12:21].head().to_string())
print(df.iloc[:, 21:31].head().to_string())
Output:
   age sex on thyroxine query on thyroxine on antithyroid medication sick pregnant
0   41   F            f                  f                         f    f        f
1   23   F            f                  f                         f    f        f
2   46   M            f                  f                         f    f        f
3   70   F            t                  f                         f    f        f
4   70   F            f                  f                         f    f        f

   TT4 T4U measured   T4U FTI measured  FTI TBG measured TBG referral source binaryClass
0  125            t  1.14            t  109            f   ?            SVHC           P
1  102            f     ?            f    ?            f   ?           other           P
2  109            t  0.91            t  120            f   ?           other           P
3  175            f     ?            f    ?            f   ?           other           P
4   61            t  0.87            t   70            f   ?             SVI           P
#Checks shape
print(df.shape)
Output:
(3772, 30)
By printing df.shape, you will see the output as a tuple containing the
number of rows and columns in the DataFrame, giving you an idea of
the size of the dataset.
The output (3772, 30) indicates that the DataFrame df has 3772 rows
and 30 columns.
This information helps you understand the size and structure of the
dataset. There are 3772 instances or observations in the dataset, and
each instance has 30 different features or variables associated with it.
#Reads columns
print("Data Columns --> ", df.columns)
Output:
Data Columns --> Index(['age', 'sex', 'on thyroxine',
'query on thyroxine', 'on antithyroid medication', 'sick',
'pregnant', 'thyroid surgery', 'I131 treatment', 'query
hypothyroid', 'query hyperthyroid', 'lithium', 'goitre',
'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured',
'T4U',
'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral
source',
'binaryClass'], dtype='object')
The code print("Data Columns --> ",df.columns) is used to print the
column names of the DataFrame df.
By printing df.columns, you will see the output as the column names of
the DataFrame, which provides a list of the feature names or variables
present in the dataset.
This output lists all the column names of the DataFrame df. Each
column name represents a specific feature or variable present in the
dataset.
Here is an explanation of each column in the "hypothyroid" dataset:
'age': the age of the patient.
'sex': the sex of the patient (F or M).
'on thyroxine': whether the patient is on thyroxine medication.
'query on thyroxine': whether there is a query or doubt about the patient being on thyroxine medication.
'on antithyroid medication': whether the patient is on antithyroid medication.
'sick': whether the patient is currently sick.
'pregnant': whether the patient is pregnant.
'thyroid surgery': whether the patient has undergone thyroid surgery.
'I131 treatment': whether the patient has received I131 treatment.
'query hypothyroid': whether there is a query or doubt about the patient having hypothyroidism.
'query hyperthyroid': whether there is a query or doubt about the patient having hyperthyroidism.
'lithium': whether the patient has taken lithium medication.
'goitre': whether the patient has goitre (enlargement of the thyroid gland).
'tumor': whether the patient has a thyroid tumor.
'hypopituitary': whether the patient has hypopituitarism (reduced hormone production by the pituitary gland).
'psych': whether the patient has a psychological disorder.
'TSH measured': whether the Thyroid Stimulating Hormone (TSH) level has been measured.
'TSH': the Thyroid Stimulating Hormone (TSH) level.
'T3 measured': whether the triiodothyronine (T3) level has been measured.
'T3': the triiodothyronine (T3) level.
'TT4 measured': whether the total thyroxine (TT4) level has been measured.
'TT4': the total thyroxine (TT4) level.
'T4U measured': whether the thyroxine uptake (T4U) has been measured.
'T4U': the thyroxine uptake (T4U) level.
'FTI measured': whether the Free Thyroxine Index (FTI) has been measured.
'FTI': the Free Thyroxine Index (FTI) value.
'TBG measured': whether the Thyroxine-Binding Globulin (TBG) level has been measured.
'TBG': the Thyroxine-Binding Globulin (TBG) level.
'referral source': the source or method of referral for the patient.
'binaryClass': the binary classification label, 'P' (positive for the thyroid condition) or 'N' (negative).
These columns contain various features and attributes related to
patients and thyroid-related measurements, which can be used to
analyze and predict hypothyroidism.
Information of Dataset
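The code that produced the output below is not shown in this excerpt, but it is consistent with df.info(). A minimal sketch on a toy frame (the toy data is an illustration, not the real dataset) shows why every column is initially reported as object: the '?' placeholder prevents pandas from parsing numeric columns.

```python
import pandas as pd

# Toy frame mimicking the dataset: '?' placeholders force numeric
# columns to be read as object, as in the df.info() output below.
toy = pd.DataFrame({"age": ["41", "23", "?"], "sex": ["F", "F", "M"]})
toy.info()               # both columns: dtype object
print(toy.describe().T)  # for all-object frames: count / unique / top / freq
```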
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   age                        3772 non-null   object
 1   sex                        3772 non-null   object
 2   on thyroxine               3772 non-null   object
 3   query on thyroxine         3772 non-null   object
 4   on antithyroid medication  3772 non-null   object
 5   sick                       3772 non-null   object
 6   pregnant                   3772 non-null   object
 7   thyroid surgery            3772 non-null   object
 8   I131 treatment             3772 non-null   object
 9   query hypothyroid          3772 non-null   object
 10  query hyperthyroid         3772 non-null   object
 11  lithium                    3772 non-null   object
 12  goitre                     3772 non-null   object
 13  tumor                      3772 non-null   object
 14  hypopituitary              3772 non-null   object
 15  psych                      3772 non-null   object
 16  TSH measured               3772 non-null   object
 17  TSH                        3772 non-null   object
 18  T3 measured                3772 non-null   object
 19  T3                         3772 non-null   object
 20  TT4 measured               3772 non-null   object
 21  TT4                        3772 non-null   object
 22  T4U measured               3772 non-null   object
 23  T4U                        3772 non-null   object
 24  FTI measured               3772 non-null   object
 25  FTI                        3772 non-null   object
 26  TBG measured               3772 non-null   object
 27  TBG                        3772 non-null   object
 28  referral source            3772 non-null   object
 29  binaryClass                3772 non-null   object
dtypes: object(30)
memory usage: 884.2+ KB
None
Output:
                           count unique    top  freq
age                         3772     94     59    95
sex                         3772      3      F  2480
on thyroxine                3772      2      f  3308
query on thyroxine          3772      2      f  3722
on antithyroid medication   3772      2      f  3729
sick                        3772      2      f  3625
pregnant                    3772      2      f  3719
thyroid surgery             3772      2      f  3719
I131 treatment              3772      2      f  3713
query hypothyroid           3772      2      f  3538
query hyperthyroid          3772      2      f  3535
lithium                     3772      2      f  3754
goitre                      3772      2      f  3738
tumor                       3772      2      f  3676
hypopituitary               3772      2      f  3771
psych                       3772      2      f  3588
TSH measured                3772      2      t  3403
TSH                         3772    288      ?   369
T3 measured                 3772      2      t  3003
T3                          3772     70      ?   769
TT4 measured                3772      2      t  3541
TT4                         3772    242      ?   231
T4U measured                3772      2      t  3385
T4U                         3772    147      ?   387
FTI measured                3772      2      t  3387
FTI                         3772    235      ?   385
TBG measured                3772      1      f  3772
TBG                         3772      1      ?  3772
referral source             3772      5  other  2201
binaryClass                 3772      2      P  3481
Output:
age 1
sex 150
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH measured 0
TSH 369
T3 measured 0
T3 769
TT4 measured 0
TT4 231
T4U measured 0
T4U 387
FTI measured 0
FTI 385
TBG measured 0
TBG 3772
referral source 0
binaryClass 0
dtype: int64
Total number of null values: 6064
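The cleaning step that produced the null counts above is not shown in this excerpt. A hypothetical sketch on a toy frame (column names and values are illustrative): replace the '?' placeholder with NaN, convert to numeric dtype, then count the nulls.

```python
import numpy as np
import pandas as pd

# Toy frame: '?' marks missing values, as in hypothyroid.csv.
toy = pd.DataFrame({"TSH": ["1.3", "?", "2.0"], "TBG": ["?", "?", "?"]})
toy = toy.replace("?", np.nan)   # turn placeholders into real NaNs
toy = toy.astype(float)          # numeric columns become float64
print(toy.isnull().sum())
print("Total number of null values:", toy.isnull().sum().sum())
```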
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   age                        3771 non-null   float64
 1   sex                        3622 non-null   object
 2   on thyroxine               3772 non-null   object
 3   query on thyroxine         3772 non-null   object
 4   on antithyroid medication  3772 non-null   object
 5   sick                       3772 non-null   object
 6   pregnant                   3772 non-null   object
 7   thyroid surgery            3772 non-null   object
 8   I131 treatment             3772 non-null   object
 9   query hypothyroid          3772 non-null   object
 10  query hyperthyroid         3772 non-null   object
 11  lithium                    3772 non-null   object
 12  goitre                     3772 non-null   object
 13  tumor                      3772 non-null   object
 14  hypopituitary              3772 non-null   object
 15  psych                      3772 non-null   object
 16  TSH measured               3772 non-null   object
 17  TSH                        3403 non-null   float64
 18  T3 measured                3772 non-null   object
 19  T3                         3003 non-null   float64
 20  TT4 measured               3772 non-null   object
 21  TT4                        3541 non-null   float64
 22  T4U measured               3772 non-null   object
 23  T4U                        3385 non-null   float64
 24  FTI measured               3772 non-null   object
 25  FTI                        3387 non-null   float64
 26  TBG measured               3772 non-null   object
 27  TBG                        0 non-null      float64
 28  referral source            3772 non-null   object
 29  binaryClass                3772 non-null   object
dtypes: float64(7), object(23)
memory usage: 884.2+ KB
None
Output:
age float64
sex object
on thyroxine object
query on thyroxine object
on antithyroid medication object
sick object
pregnant object
thyroid surgery object
I131 treatment object
query hypothyroid object
query hyperthyroid object
lithium object
goitre object
tumor object
hypopituitary object
psych object
TSH measured object
TSH float64
T3 measured object
T3 float64
TT4 measured object
TT4 float64
T4U measured object
T4U float64
FTI measured object
FTI float64
TBG measured object
binaryClass object
dtype: object
age 0
sex 0
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH measured 0
TSH 0
T3 measured 0
T3 0
TT4 measured 0
TT4 0
T4U measured 0
T4U 0
FTI measured 0
FTI 0
TBG measured 0
binaryClass 0
dtype: int64
Total number of null values: 0
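The imputation code itself is not shown in this excerpt, but the outputs above imply that the all-missing TBG column was dropped, numeric gaps were filled (note that T3's mean, 2.0135, reappears as its median in the describe() table below, which is what mean imputation produces), and categorical gaps such as sex were filled with the mode. A hypothetical sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: TBG is entirely missing, T3 and sex have single gaps.
toy = pd.DataFrame({"T3": [1.7, np.nan, 2.2],
                    "TBG": [np.nan] * 3,
                    "sex": ["F", None, "M"]})
toy = toy.drop(columns=["TBG"])                    # all-null column carries no information
toy["T3"] = toy["T3"].fillna(toy["T3"].mean())     # numeric: fill with column mean
toy["sex"] = toy["sex"].fillna(toy["sex"].mode()[0])  # categorical: fill with mode
print("Total number of null values:", toy.isnull().sum().sum())
```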
Output:
               age          TSH           T3          TT4          T4U          FTI
count  3772.000000  3772.000000  3772.000000  3772.000000  3772.000000  3772.000000
mean     51.735879     5.086766     2.013500   108.319345     0.995000   110.469649
std      20.082295    23.290853     0.738262    34.496511     0.185156    31.355087
min       1.000000     0.005000     0.050000     2.000000     0.250000     2.000000
25%      36.000000     0.600000     1.700000    89.000000     0.890000    94.000000
50%      54.000000     1.600000     2.013500   106.000000     0.995000   110.000000
75%      67.000000     3.800000     2.200000   123.000000     1.070000   121.250000
max     455.000000   530.000000    10.600000   430.000000     2.320000   395.000000
Limiting Age
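The code for this step is not shown in this excerpt. The describe() table reports a maximum age of 455, which is not plausible, so the step presumably drops such outliers; a minimal sketch (the threshold of 100 is an assumption for illustration):

```python
import pandas as pd

# Toy frame containing one implausible age, as in the real dataset.
toy = pd.DataFrame({"age": [41.0, 70.0, 455.0]})
toy = toy[toy["age"] <= 100]  # keep only plausible ages
print(toy.shape)
```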
Distribution of Samples
Output:
Total Number of samples : 3772
Total No.of Negative Thyroid: 291
Total No.of Positive Thyroid : 3481
plot_piechart(df,'binaryClass')
Output:
P 3481
N 291
Name: binaryClass, dtype: int64
The code defines a function named plot_piechart() that creates a pie chart and a
horizontal bar plot as subplots. Here are the steps involved in the code:
1. def plot_piechart(df, var, title=''):: This line defines
the function plot_piechart, which takes three
parameters: df (the DataFrame), var (the
variable/column to plot), and an optional title
parameter for the title of the plot.
2. plt.figure(figsize=(25, 10)): This line creates a new
figure with a specified size for the plot.
3. plt.subplot(121): This line creates the first subplot
in a 1x2 grid, selecting the first position.
4. label_list = list(df[var].value_counts().index): This
line retrieves the unique values of the var column
in the DataFrame and converts them to a list. It will
be used as labels for the pie chart.
5. df[var].value_counts().plot.pie(autopct="%1.1f%%",
colors=sns.color_palette("prism", 7),
startangle=60, labels=label_list, wedgeprops=
{"linewidth": 2, "edgecolor": "k"}, shadow=True,
textprops={'fontsize': 20}): This line plots the pie
chart using the plot.pie() function. It uses the
value_counts() method to count the occurrences of
each unique value in the var column. Other
parameters specify the autopct format for
percentage display, color palette, starting angle,
labels, wedge properties, shadow, and text font
size.
6. plt.title("Distribution of " + var + " variable " +
title, fontsize=25): This line sets the title of the pie
chart by concatenating the var and title
parameters.
7. value_counts = df[var].value_counts(): This line
calculates the count of each unique value in the var
column using value_counts() and assigns it to the
value_counts variable.
8. percentages = value_counts / len(df) * 100: This
line calculates the percentage values of each
unique value in the var column by dividing the
value_counts by the length of the DataFrame and
multiplying by 100.
9. print("Percentage values:") print(percentages):
This code prints the percentage values calculated
in the previous step.
10. plt.subplot(122): This line creates the second
subplot in the 1x2 grid, selecting the second
position.
11. ax = df[var].value_counts().plot(kind="barh"): This
line plots a horizontal bar plot using the
value_counts() method on the var column of the
DataFrame.
12. for i, j in enumerate(df[var].value_counts().values):
ax.text(.7, i, j, weight="bold", fontsize=20): This
code adds text annotations to the horizontal bar
plot, displaying the count values above each bar.
13. plt.title("Count of " + var + " cases " + title,
fontsize=25): This line sets the title of the
horizontal bar plot by concatenating the var and
title parameters.
14. print("Count values:") print(value_counts): This
code prints the count values calculated in step 7.
15. plt.show(): This line displays the plot with both
subplots.
Overall, the function plot_piechart() generates a pie chart to visualize the
distribution of a variable and a horizontal bar plot to display the count of each
category. It also prints the percentage values and count values for each category.
The function allows for customized titles and can be used to analyze the
distribution and count of different variables in the DataFrame.
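The fifteen steps above can be sketched as follows. This is a reconstruction from the step list, not the book's verbatim listing; the toy DataFrame and the return values are added here so the result can be checked.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_piechart(df, var, title=''):
    plt.figure(figsize=(25, 10))
    plt.subplot(121)
    label_list = list(df[var].value_counts().index)
    df[var].value_counts().plot.pie(
        autopct="%1.1f%%", colors=sns.color_palette("prism", 7),
        startangle=60, labels=label_list,
        wedgeprops={"linewidth": 2, "edgecolor": "k"},
        shadow=True, textprops={'fontsize': 20})
    plt.title("Distribution of " + var + " variable " + title, fontsize=25)

    value_counts = df[var].value_counts()
    percentages = value_counts / len(df) * 100
    print("Percentage values:")
    print(percentages)

    plt.subplot(122)
    ax = df[var].value_counts().plot(kind="barh")
    for i, j in enumerate(df[var].value_counts().values):
        ax.text(.7, i, j, weight="bold", fontsize=20)
    plt.title("Count of " + var + " cases " + title, fontsize=25)
    print("Count values:")
    print(value_counts)
    plt.show()
    return percentages, value_counts  # returned here for checking

toy = pd.DataFrame({"binaryClass": ["P"] * 3 + ["N"]})
pct, cnt = plot_piechart(toy, "binaryClass")
```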
    ax.annotate(format(p.get_height(), '.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10),
                weight="bold", fontsize=17, textcoords='offset points')
    plt.title(i, fontsize=30)  # Adjust title font size
plt.show()
Output:
Negative Thyroid:
age y
----- ---
5.35 7
14.05 8
22.75 20
31.45 30
40.15 38
48.85 38
57.55 48
66.25 54
74.95 37
83.65 11
Positive Thyroid:
age y
----- ---
5.95 15
15.85 168
25.75 427
35.65 489
45.55 453
55.45 644
65.35 634
75.25 503
85.15 139
95.05 9
Output:
Negative Thyroid:
TSH y
-------- ---
26.5143 243
79.5127 23
132.511 9
185.51 8
238.508 2
291.507 0
344.505 0
397.504 1
450.502 3
503.501 2
Positive Thyroid:
TSH y
--------- ----
7.25475 3453
21.7543 19
36.2538 3
50.7533 1
65.2528 1
79.7523 2
94.2518 0
108.751 1
123.251 0
137.75 1
Output:
Negative Thyroid:
T3 y
----- ---
0.395 39
0.785 29
1.175 34
1.565 51
1.955 91
2.345 36
2.735 6
3.125 1
3.515 2
3.905 2
Positive Thyroid:
T3 y
------- ----
0.5775 229
1.6325 2163
2.6875 924
3.7425 111
4.7975 35
5.8525 10
6.9075 6
7.9625 1
9.0175 1
10.0725 1
Output:
Negative Thyroid:
TT4 y
----- ---
9.3 29
23.9 15
38.5 21
53.1 21
67.7 52
82.3 62
96.9 35
111.5 30
126.1 15
140.7 11
Positive Thyroid:
TT4 y
------ ----
39.55 68
80.65 1355
121.75 1616
162.85 329
203.95 80
245.05 26
286.15 4
327.25 0
368.35 1
409.45 2
Output:
Negative Thyroid:
T4U y
------ ---
0.6145 7
0.7235 15
0.8325 43
0.9415 81
1.0505 64
1.1595 48
1.2685 20
1.3775 8
1.4865 2
1.5955 3
Positive Thyroid:
T4U y
------ ----
0.3535 5
0.5605 59
0.7675 715
0.9745 1905
1.1815 602
1.3885 108
1.5955 54
1.8025 26
2.0095 5
2.2165 2
Upon analyzing the distribution of the "T4U"
feature by the "binaryClass" feature, the
following observations can be made:
Output:
Negative Thyroid:
FTI y
------ ---
9.55 30
24.65 9
39.75 18
54.85 28
69.95 48
85.05 43
100.15 51
115.25 53
130.35 9
145.45 2
Positive Thyroid:
FTI y
----- ----
35.9 17
73.7 639
111.5 2186
149.3 481
187.1 109
224.9 32
262.7 10
300.5 3
338.3 1
376.1 3
Negative Thyroid:
The "FTI" values for the
negative thyroid group
range from 9.55 to 145.45.
The majority of individuals
have "FTI" values between
54.85 and 115.25.
The highest count of
individuals (53) falls within
the "FTI" range of 115.25.
There are relatively fewer
individuals with extreme
"FTI" values below 24.65
and above 130.35 in the
negative thyroid group.
Positive Thyroid:
The "FTI" values for the
positive thyroid group
range from 35.9 to 376.1.
The majority of individuals
have "FTI" values between
73.7 and 149.3.
The highest count of
individuals (2186) falls
within the "FTI" range of
111.5.
There are relatively fewer
individuals with extreme
"FTI" values below 73.7
and above 262.7 in the
positive thyroid group.
Based on this analysis, it appears that higher
"FTI" values are more common in the positive
thyroid group compared to the negative thyroid
group. This suggests that the "FTI" feature may
be a useful indicator for distinguishing between
individuals with positive and negative thyroid
conditions. Further analysis and modeling can
be performed to assess the predictive power of
the "FTI" feature and its contribution to thyroid
disease diagnosis.
def put_label_stacked_bar(ax, fontsize):
    #patches is everything inside of the chart
    for rect in ax.patches:
        # Find where everything is located
        height = rect.get_height()
        width = rect.get_width()
        x = rect.get_x()
        y = rect.get_y()

        # The height of the bar is the data value and can be used as the label
        label_text = f'{height:.0f}'  # f'{height:.2f}' to format decimal values

        # ax.text(x, y, text)
        label_x = x + width / 2
        label_y = y + height / 2

        # plots only when height is greater than specified value
        if height > 0:
            ax.text(label_x, label_y, label_text,
                    ha='center', va='center',
                    weight="bold", fontsize=fontsize)

def dist_count_plot(df, cat):
    fig = plt.figure(figsize=(25, 15))
    ax1 = fig.add_subplot(111)

    group_by_stat = df.groupby([cat, 'binaryClass']).size()
    stacked_plot = group_by_stat.unstack().plot(kind='bar',
        stacked=True, ax=ax1, grid=True)
    ax1.set_title('Stacked Bar Plot of ' + cat + ' (number of cases)', fontsize=14)
    ax1.set_ylabel('Number of Cases')
    put_label_stacked_bar(ax1, 17)

    # Collect the values of each stacked bar
    data = []
    headers = [''] + df['binaryClass'].unique().tolist()
    for container in stacked_plot.containers:
        bar_values = []
        for bar in container:
            bar_value = bar.get_height()
            bar_values.append(bar_value)
        data.append(bar_values)

    # Transpose the data for tabulate
    data_transposed = list(map(list, zip(*data)))

    # Insert the values of `cat` as the first column in the data
    data_with_cat = [[value] + row for value, row in
                     zip(group_by_stat.index.get_level_values(cat), data_transposed)]
    plt.show()
To use this function, you can pass your DataFrame (df) and
the categorical feature you want to analyze (cat). The
stacked bar plot will be displayed, and the counts for each
category will be printed in tabular form.
Here are the step-by-step explanations of
each function:
Output:
Percentage values:
F 69.724284
M 30.275716
Name: sex, dtype: float64
Count values:
F 2630
M 1142
Name: sex, dtype: int64
     N     P
--  ---  ----
F   226  2404
M    65  1077
Pie Chart:
The pie chart shows that
the majority of the samples
(69.7%) are labeled as "F"
(female), while the
remaining 30.3% are
labeled as "M" (male).
This indicates that the
dataset is imbalanced in
terms of gender
representation, with more
female samples compared
to male samples.
Bar Plot:
The bar plot provides a
visual representation of the
count of each category of
the "sex" feature.
It shows that the category
"F" (female) has a higher
count (2630) compared to
the category "M" (male)
with a count of 1142.
Stacked Bar Plot:
The stacked bar plot
compares the distribution
of the "sex" feature within
each class of the
"binaryClass" variable.
In the "Positive Thyroid" class (P), the majority of samples are labeled as "F" (female) with a count of 2404, while fewer samples are labeled as "M" (male) with a count of 1077.
In the "Negative Thyroid" class (N), there are fewer samples overall, with 226 labeled as "F" (female) and 65 labeled as "M" (male).
These plots provide insights into the distribution of the "sex" feature and its relationship with the "binaryClass" variable. However, it is important to note that the dataset is imbalanced, particularly in terms of gender representation. Further analysis and modeling should take this into consideration to avoid potential biases and ensure reliable predictions.
Pie Chart:
The pie chart represents
the distribution of the "on
thyroxine" feature.
It visualizes the proportion
of samples that are labeled
as "t" (on thyroxine) and "f"
(not on thyroxine).
The slices of the pie chart
show the percentage of
each category out of the
total number of samples.
Bar Plot:
The bar plot displays the
count of each category of
the "on thyroxine" feature.
It presents a bar for each
category, where the height
of each bar corresponds to
the count of samples in that
category.
The x-axis represents the
categories "t" and "f", and
the y-axis represents the
count of samples.
Stacked Bar Plot:
The stacked bar plot
compares the distribution
of the "on thyroxine"
feature within each class of
the "binaryClass" variable.
It visualizes the count of
samples for each
combination of the "on
thyroxine" feature and the
"binaryClass" variable.
The x-axis represents the
categories "t" and "f" of the
"on thyroxine" feature, and
the y-axis represents the
count of samples.
Each stacked bar
represents a specific
category of the "on
thyroxine" feature within
the "Negative Thyroid" or
"Positive Thyroid" class.
These plots provide visual representations of
the distribution and relationship between the
"on thyroxine" feature and the "binaryClass"
variable in the dataset. They help to understand
the proportions, counts, and distribution
patterns of the different categories within each
variable.
Output:
Percentage values:
f 87.698834
t 12.301166
Name: on thyroxine, dtype: float64
Count values:
f 3308
t 464
Name: on thyroxine, dtype: int64
     N     P
--  ---  ----
f   282  3026
t     9   455
Output:
Percentage values:
f 97.454931
t 2.545069
Name: tumor, dtype: float64
Count values:
f 3676
t 96
Name: tumor, dtype: int64
     N     P
--  ---  ----
f   283  3393
t     8    88
Output:
Percentage values:
t 90.217391
f 9.782609
Name: TSH measured, dtype: float64
Count values:
t 3403
f 369
Name: TSH measured, dtype: int64
     N     P
--  ---  ----
f     0   369
t   291  3112
Output:
Percentage values:
t 93.875928
f 6.124072
Name: TT4 measured, dtype: float64
Count values:
t 3541
f 231
Name: TT4 measured, dtype: int64
     N     P
--  ---  ----
f     5   226
t   286  3255
print(cat_cols)
print(num_cols)
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   age                        3772 non-null   float64
 1   sex                        3772 non-null   object
 2   on thyroxine               3772 non-null   object
 3   query on thyroxine         3772 non-null   object
 4   on antithyroid medication  3772 non-null   object
 5   sick                       3772 non-null   object
 6   pregnant                   3772 non-null   object
 7   thyroid surgery            3772 non-null   object
 8   I131 treatment             3772 non-null   object
 9   query hypothyroid          3772 non-null   object
 10  query hyperthyroid         3772 non-null   object
 11  lithium                    3772 non-null   object
 12  goitre                     3772 non-null   object
 13  tumor                      3772 non-null   object
 14  hypopituitary              3772 non-null   object
 15  psych                      3772 non-null   object
 16  TSH measured               3772 non-null   object
 17  TSH                        3772 non-null   float64
 18  T3 measured                3772 non-null   object
 19  T3                         3772 non-null   float64
 20  TT4 measured               3772 non-null   object
 21  TT4                        3772 non-null   float64
 22  T4U measured               3772 non-null   object
 23  T4U                        3772 non-null   float64
 24  FTI measured               3772 non-null   object
 25  FTI                        3772 non-null   float64
 26  TBG measured               3772 non-null   object
 27  binaryClass                3772 non-null   object
dtypes: float64(6), object(22)
memory usage: 825.2+ KB
None
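The code that built cat_cols and num_cols is not shown in this excerpt; a hypothetical sketch of how the two lists could be derived after the '?' cleanup (the toy frame is illustrative):

```python
import pandas as pd

# Split columns by dtype: object columns are categorical, the rest numeric.
toy = pd.DataFrame({"age": [41.0, 23.0], "sex": ["F", "M"], "TSH": [1.3, 2.0]})
cat_cols = toy.select_dtypes(include="object").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()
print(cat_cols)
print(num_cols)
```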
fig.suptitle('The density of numerical features', fontsize=50)
plt.tight_layout()
plt.show()

    plotnumber += 1
fig.suptitle('The distribution of categorical features', fontsize=50)
plt.tight_layout()
plt.show()
def plot_one_versus_one_cat(feat):
    categorical_features = ["FTI measured", "sex", "on thyroxine", "sick",
                            "pregnant", "goitre", "tumor", "hypopituitary", "psych"]
    num_plots = len(categorical_features)

    fig, axes = plt.subplots(3, 3, figsize=(40, 30), facecolor='#fbe7dd')
    axes = axes.flatten()
    color_palette = sns.color_palette("Spectral_r", n_colors=num_plots)

    for i in range(num_plots):
        ax = axes[i]
        ax.tick_params(axis='x', labelsize=30)
        ax.tick_params(axis='y', labelsize=30)

        g = sns.countplot(x=df[categorical_features[i]], hue=df[feat],
                          palette=color_palette[i:i+2], ax=ax)

        for p in g.patches:
            g.annotate(format(p.get_height(), '.0f'),
                       (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha='center', va='center', xytext=(0, 10),
                       weight="bold", fontsize=20, textcoords='offset points')

        g.set_xlabel(categorical_features[i], fontsize=30)
        g.set_ylabel("Number of Cases", fontsize=30)
        g.legend(fontsize=25)
    plt.tight_layout()
    plt.show()
plot_one_versus_one_cat("binaryClass")
plot_one_versus_one_cat("on antithyroid medication")
FTI measured:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"FTI measured".
The x-axis represents the
"FTI measured" categories:
"t" (FTI measured) and "f"
(FTI not measured).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Sex:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"sex".
The x-axis represents the
"sex" categories: "F"
(female) and "M" (male).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
On thyroxine:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"on thyroxine".
The x-axis represents the
"on thyroxine" categories:
"f" (not on thyroxine) and
"t" (on thyroxine).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Sick:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"sick".
The x-axis represents the
"sick" categories: "f" (not
sick) and "t" (sick).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Pregnant:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"pregnant".
The x-axis represents the
"pregnant" categories: "f"
(not pregnant) and "t"
(pregnant).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Goitre:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"goitre".
The x-axis represents the
"goitre" categories: "f" (no
goitre) and "t" (goitre
present).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Tumor:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"tumor".
The x-axis represents the
"tumor" categories: "f" (no
tumor) and "t" (tumor
present).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Hypopituitary:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"hypopituitary".
The x-axis represents the
"hypopituitary" categories:
"f" (not hypopituitary) and
"t" (hypopituitary present).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
Psych:
The plot shows the
distribution of the "on
antithyroid medication"
feature for each category of
"psych".
The x-axis represents the
"psych" categories: "f" (not
psych) and "t" (psych
issue).
The y-axis represents the
number of cases.
Each bar represents a
category of "on antithyroid
medication" and is color-
coded accordingly.
The count of cases is
displayed on top of each
bar.
By examining these plots, you can analyze the
distribution of the "on antithyroid medication"
feature across different categories of each
corresponding feature. It allows you to observe
any patterns, trends, or relationships between
the presence of antithyroid medication and
other categorical features in the dataset.
Distribution of Nine Categorical Features versus
Thyroid Surgery
plot_one_versus_one_cat("thyroid surgery")
plot_one_versus_one_cat("lithium")
plot_one_versus_one_cat("TSH measured")
plot_one_versus_one_cat("T3 measured")
plot_one_versus_one_cat("TT4 measured")
plot_one_versus_one_cat("T4U measured")
plot_one_versus_one_cat("FTI measured")
plot_one_versus_one_cat("I131 treatment")
plot_one_versus_one_cat("query hypothyroid")
def plot_feat1_feat2_vs_target_pie(df, target, col1, col2):
    gs0 = df[df[target] == 'P'][col1].value_counts()
    gs1 = df[df[target] == 'N'][col1].value_counts()
    ss0 = df[df[target] == 'P'][col2].value_counts()
    ss1 = df[df[target] == 'N'][col2].value_counts()

    col1_labels = df[col1].unique()
    col2_labels = df[col2].unique()

    _, ax = plt.subplots(2, 2, figsize=(20, 20), facecolor='#f7f7f7')

    cmap = plt.get_cmap('Pastel1')  # Define color map

    ax[0][0].pie(gs0, labels=col1_labels[:len(gs0)], shadow=True,
                 autopct='%1.1f%%', explode=[0.03] * len(gs0),
                 colors=cmap(np.arange(len(gs0))), textprops={'fontsize': 20})
    ax[0][1].pie(gs1, labels=col1_labels[:len(gs1)], shadow=True,
                 autopct='%1.1f%%', explode=[0.03] * len(gs1),
                 colors=cmap(np.arange(len(gs1))), textprops={'fontsize': 20})
    ax[1][0].pie(ss0, labels=col2_labels[:len(ss0)], shadow=True,
                 autopct='%1.1f%%', explode=[0.04] * len(ss0),
                 colors=cmap(np.arange(len(ss0))), textprops={'fontsize': 20})
    ax[1][1].pie(ss1, labels=col2_labels[:len(ss1)], shadow=True,
                 autopct='%1.1f%%', explode=[0.04] * len(ss1),
                 colors=cmap(np.arange(len(ss1))), textprops={'fontsize': 20})

    ax[0][0].set_title(f"{target.capitalize()} = 0", fontsize=30)
    ax[0][1].set_title(f"{target.capitalize()} = 1", fontsize=30)
    plt.show()

    # Print each pie chart as a table in tabular form for each class
    print(f"\n{col1.capitalize()}:")
    print(f"{target.capitalize()} = 0:")
    gs0_table = pd.DataFrame({'Categories': col1_labels[:len(gs0)],
                              'Percentage': gs0 / gs0.sum() * 100, 'Count': gs0})
    print(gs0_table)

    print(f"\n{target.capitalize()} = 1:")
    gs1_table = pd.DataFrame({'Categories': col1_labels[:len(gs1)],
                              'Percentage': gs1 / gs1.sum() * 100, 'Count': gs1})
    print(gs1_table)

    print(f"\n{col2.capitalize()}:")
    print(f"{target.capitalize()} = 0:")
    ss0_table = pd.DataFrame({'Categories': col2_labels[:len(ss0)],
                              'Percentage': ss0 / ss0.sum() * 100, 'Count': ss0})
    print(ss0_table)

    print(f"\n{target.capitalize()} = 1:")
    ss1_table = pd.DataFrame({'Categories': col2_labels[:len(ss1)],
                              'Percentage': ss1 / ss1.sum() * 100, 'Count': ss1})
    print(ss1_table)
plot_feat1_feat2_vs_target_pie(df, "binaryClass", "on thyroxine", "on antithyroid medication")
Output:
On thyroxine:
Binaryclass = 0:
Categories Percentage Count
f f 86.929043 3026
t t 13.070957 455
Binaryclass = 1:
Categories Percentage Count
f f 96.907216 282
t t 3.092784 9
On antithyroid medication:
Binaryclass = 0:
Categories Percentage Count
f f 98.79345 3439
t t 1.20655 42
Binaryclass = 1:
Categories Percentage Count
f f 99.656357 290
t t 0.343643 1
plot_feat1_feat2_vs_target_pie(df, "binaryClass", "sick", "pregnant")
Output:
Sick:
Binaryclass = 0:
Categories Percentage Count
f f 96.093077 3345
t t 3.906923 136
Binaryclass = 1:
Categories Percentage Count
f f 96.219931 280
t t 3.780069 11
Pregnant:
Binaryclass = 0:
Categories Percentage Count
f f 98.477449 3428
t t 1.522551 53
Binaryclass = 1:
Categories Percentage Count
f f 100.0 291
plot_feat1_feat2_vs_target_pie(df, "binaryClass", "lithium", "tumor")
Top-Left Chart:
Title: "BinaryClass = 0"
Feature: "lithium"
This pie chart shows the
distribution of categories in
the "lithium" feature for the
cases where the binary
target variable is 0.
Each category is
represented by a slice in
the pie chart, and the size
of each slice corresponds to
the proportion of cases
belonging to that category.
The labels inside the pie
chart indicate the category
names, and the percentage
values represent the
proportion of cases for
each category.
The colors of the slices in
the pie chart are assigned
using the "Pastel1"
colormap.
Top-Right Chart:
Title: "BinaryClass = 1"
Feature: "lithium"
This pie chart shows the
distribution of categories in
the "lithium" feature for the
cases where the binary
target variable is 1.
Similar to the top-left
chart, the slices in the pie
chart represent the
categories, and the sizes of
the slices correspond to the
proportion of cases for
each category.
The labels inside the pie
chart display the category
names and the percentage
values indicate the
proportion of cases for
each category.
The colors of the slices are
assigned using the
"Pastel1" colormap.
Bottom-Left Chart:
Title: "BinaryClass = 0"
Feature: "tumor"
This pie chart represents
the distribution of
categories in the "tumor"
feature for the cases where
the binary target variable is
0.
Each slice in the pie chart
represents a category, and
the size of the slice reflects
the proportion of cases for
that category.
The labels inside the pie
chart indicate the category
names, and the percentage
values represent the
proportion of cases for
each category.
The colors of the slices are
assigned using the
"Pastel1" colormap.
Bottom-Right Chart:
Title: "BinaryClass = 1"
Feature: "tumor"
This pie chart displays the
distribution of categories in
the "tumor" feature for the
cases where the binary
target variable is 1.
Similar to the other pie
charts, each slice
represents a category, and
the size of the slice
corresponds to the
proportion of cases for that
category.
The labels inside the pie
chart show the category
names, and the percentage
values indicate the
proportion of cases for
each category.
The colors of the slices are
assigned using the
"Pastel1" colormap.
Overall, the plot provides an intuitive
visualization of how the categories in the
"lithium" and "tumor" features are distributed
across the two target values ("binaryClass = 0"
and "binaryClass = 1"). It allows for a quick
comparison of category proportions within each
feature based on the binary target variable.
Figure 37 The percentage distribution of
lithium and tumor versus binaryClass in pie
chart
Output:
Lithium:
Binaryclass = 0:
Categories Percentage Count
f f 99.511635 3464
t t 0.488365 17
Binaryclass = 1:
Categories Percentage Count
f f 99.656357 290
t t 0.343643 1
Tumor:
Binaryclass = 0:
Categories Percentage Count
f f 97.471991 3393
t t 2.528009 88
Binaryclass = 1:
Categories Percentage Count
f f 97.250859 283
t t 2.749141 8
Lithium:
For cases where binaryClass = 0:
The category "f" (not on
lithium) accounts for 99.5%
of the cases, with a count
of 3464.
The category "t" (on
lithium) accounts for 0.5%
of the cases, with a count
of 17.
For cases where binaryClass = 1:
The category "f" (not on
lithium) accounts for 99.7%
of the cases, with a count
of 290.
The category "t" (on
lithium) accounts for 0.3%
of the cases, with a count
of 1.
Tumor:
For cases where binaryClass = 0:
The category "f" (no tumor)
accounts for 97.5% of the
cases, with a count of 3393.
The category "t" (has
tumor) accounts for 2.5%
of the cases, with a count
of 88.
For cases where binaryClass = 1:
The category "f" (no tumor)
accounts for 97.3% of the
cases, with a count of 283.
The category "t" (has
tumor) accounts for 2.7%
of the cases, with a count
of 8.
From these observations, we can conclude:
The majority of cases in
both binaryClass = 0 and
binaryClass = 1 are not on
lithium.
The majority of cases in
both binaryClass = 0 and
binaryClass = 1 do not
have a tumor.
The proportion of cases
with lithium or a tumor is
generally higher in
binaryClass = 1 compared
to binaryClass = 0,
indicating a potential
association between these
features and the binary
target variable.
These insights can help in understanding the
relationships between the "lithium" and "tumor"
features and the binary target variable,
providing valuable information for further
analysis or decision-making.
Step 1: Plot the distribution of nine categorical features versus age feature:
def feat_versus_other(feat, another, legend, ax0, label):
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)

    ax0_sns = sns.histplot(data=df, x=feat, ax=ax0, zorder=2, kde=False,
        hue=another, multiple="stack", shrink=.8, linewidth=0.3, alpha=1)

    put_label_stacked_bar(ax0_sns, 15)
    ax0_sns.set_xlabel('', fontsize=30, weight='bold')
    ax0_sns.set_ylabel('', fontsize=30, weight='bold')
    ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
    ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', fontsize=10,
        bbox_to_anchor=(1, 0.989), loc='upper right')
    ax0_sns.set_xlabel(label)
    plt.tight_layout()

label_bin = list(df["binaryClass"].value_counts().index)
label_sex = list(df["sex"].value_counts().index)
label_thyroxine = list(df["on thyroxine"].value_counts().index)
label_pregnant = list(df["pregnant"].value_counts().index)
label_lithium = list(df["lithium"].value_counts().index)
label_goitre = list(df["goitre"].value_counts().index)
label_tumor = list(df["tumor"].value_counts().index)
label_tsh = list(df["TSH measured"].value_counts().index)
label_tt4 = list(df["TT4 measured"].value_counts().index)

label_dict = {
    "binaryClass": label_bin,
    "sex": label_sex,
    "on thyroxine": label_thyroxine,
    "pregnant": label_pregnant,
    "lithium": label_lithium,
    "goitre": label_goitre,
    "tumor": label_tumor,
    "TSH measured": label_tsh,
    "TT4 measured": label_tt4
}

def hist_feat_versus_nine_cat(feat, label):
    ax_list = []
    fig, axes = plt.subplots(3, 3, figsize=(30, 20))
    axes = axes.flatten()

    for i, (cat_feature, label_var) in enumerate(label_dict.items()):
        ax = axes[i]
        feat_versus_other(feat, df[cat_feature], label_var, ax,
            f"{cat_feature} versus {label}")
        ax_list.append(ax)

    plt.tight_layout()
    plt.show()

hist_feat_versus_nine_cat(df["age"], "age")
The resulting plots show the distribution of the "age" feature in relation to
each of the nine categorical features in the dataset. Each subplot represents a
different categorical feature, and the stacked histograms illustrate the
distribution of age within each category.
The purpose of this code is to provide a visual representation of how the "age"
feature varies across different categories of the categorical features. It allows
for easy comparison and analysis of the age distribution within each category,
providing insights into potential relationships or patterns between age and the
categorical features.
By examining the resulting plots, we can gain insights into how the age
distribution is influenced by each categorical feature. We can observe the
distribution patterns and differences between categories, as well as any
notable trends or relationships that may emerge. This analysis can be useful
for understanding the impact of categorical variables on the distribution of
the "age" feature and exploring potential associations in the dataset.
Step 1: Plot the density of nine categorical features versus age feature:
def prob_feat_versus_other(feat, another, legend, ax0, label):
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)

    ax0_sns = sns.kdeplot(x=feat, ax=ax0, hue=another,
        linewidth=0.3, fill=True)

    ax0_sns.set_xlabel('', fontsize=20, weight='bold')
    ax0_sns.set_ylabel('', fontsize=20, weight='bold')
    ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
    ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', fontsize=25,
        loc='upper right')
    ax0_sns.set_xlabel(label)
    plt.tight_layout()

def prob_feat_versus_nine_cat(feat, label):
    fig, axes = plt.subplots(3, 3, figsize=(30, 20))
    axes = axes.flatten()

    for i, (cat_feature, label_var) in enumerate(label_dict.items()):
        ax = axes[i]
        prob_feat_versus_other(feat, df[cat_feature], label_var, ax,
            f"{cat_feature} versus {label}")

    plt.tight_layout()
    plt.show()

prob_feat_versus_nine_cat(df["age"], "age")
1 hist_feat_versus_nine_cat(df_dummy["TSH"],"TSH")
2 prob_feat_versus_nine_cat(df_dummy["TSH"],"TSH")
hist_feat_versus_nine_cat():
The purpose of this plot is to
compare the distribution of the
numerical feature "T3" across
different categories of each
categorical feature. By visualizing
histograms for each category, we
can gain insights into how the
distribution of "T3" varies within
each category. This plot helps us
understand the range, spread, and
frequency of "T3" values within
different groups defined by the
categorical features. It allows us to
identify any notable differences or
similarities in the distribution of
"T3" across categories, which can
be useful for identifying patterns
or potential relationships between
"T3" and the categorical variables.
prob_feat_versus_nine_cat():
The purpose of this plot is to
compare the probability density
estimation (KDE plot) of the
numerical feature "T3" across
different categories of each
categorical feature. KDE plots
provide a smooth estimate of the
underlying probability distribution,
allowing us to visualize the shape
and density of "T3" values within
each category. This plot helps us
analyze the relative likelihood of
different "T3" values occurring
within each category and observe
any differences or similarities in
the distribution patterns. It is
particularly useful for identifying
modes, peaks, or variations in the
density of "T3" values across
categories, providing insights into
the relationship between "T3" and
the categorical variables.
1 hist_feat_versus_nine_cat(df_dummy["TT4"],"TT4")
2 prob_feat_versus_nine_cat(df_dummy["TT4"],"TT4")
hist_feat_versus_nine_cat():
The purpose of this plot is to
compare the distribution of the
numerical feature "TT4" across
different categories of each
categorical feature. By visualizing
histograms for each category, we
can gain insights into how the
distribution of "TT4" varies within
each category. This plot helps us
understand the range, spread, and
frequency of "TT4" values within
different groups defined by the
categorical features. It allows us to
identify any notable differences or
similarities in the distribution of
"TT4" across categories, which can
be useful for identifying patterns or
potential relationships between
"TT4" and the categorical variables.
prob_feat_versus_nine_cat():
The purpose of this plot is to
compare the probability density
estimation (KDE plot) of the
numerical feature "TT4" across
different categories of each
categorical feature. KDE plots
provide a smooth estimate of the
underlying probability distribution,
allowing us to visualize the shape
and density of "TT4" values within
each category. This plot helps us
analyze the relative likelihood of
different "TT4" values occurring
within each category and observe
any differences or similarities in the
distribution patterns. It is
particularly useful for identifying
modes, peaks, or variations in the
density of "TT4" values across
categories, providing insights into
the relationship between "TT4" and
the categorical variables.
1 hist_feat_versus_nine_cat(df_dummy["T4U"],"T4U")
2 prob_feat_versus_nine_cat(df_dummy["T4U"],"T4U")
hist_feat_versus_nine_cat():
The purpose of this plot is to
compare the distribution of the
numerical feature "T4U" across
different categories of each
categorical feature. By visualizing
histograms for each category, we
can examine how the distribution of
"T4U" varies within each category.
This plot helps us understand the
range, spread, and frequency of
"T4U" values within different groups
defined by the categorical features.
It allows us to identify any notable
differences or similarities in the
distribution of "T4U" across
categories, which can provide
insights into potential relationships
between "T4U" and the categorical
variables.
prob_feat_versus_nine_cat():
The purpose of this plot is to
compare the probability density
estimation (KDE plot) of the
numerical feature "T4U" across
different categories of each
categorical feature. KDE plots
provide a smooth estimate of the
underlying probability distribution,
allowing us to visualize the shape
and density of "T4U" values within
each category. This plot helps us
analyze the relative likelihood of
different "T4U" values occurring
within each category and observe
any differences or similarities in the
distribution patterns. It is
particularly useful for identifying
modes, peaks, or variations in the
density of "T4U" values across
categories, providing insights into
the relationship between "T4U" and
the categorical variables.
In summary, both plots aim to provide a visual
representation of how the numerical feature "T4U" is
distributed across different categories of the categorical
features. They enable us to explore the variations, patterns,
and relationships between "T4U" and the categorical
variables, facilitating data analysis and interpretation.
1 hist_feat_versus_nine_cat(df_dummy["FTI"],"FTI")
2 prob_feat_versus_nine_cat(df_dummy["FTI"],"FTI")
hist_feat_versus_nine_cat(df_dummy["FTI"],"FTI"):
This plot displays the distribution of
FTI (Free Thyroxine Index) values
across different categories of the
nine selected features.
Each subplot represents a different
categorical feature, and the
histogram bars show the frequency
or count of FTI values within each
category.
The plot allows us to observe how
the FTI values are distributed
among different categories and
identify any patterns or variations
across the selected features.
prob_feat_versus_nine_cat(df_dummy["FTI"],"FTI"):
This plot shows the estimated
probability density of FTI values
across different categories of the
nine selected features.
Each subplot corresponds to a
categorical feature, and the KDE
(Kernel Density Estimation) curves
represent the probability
distribution of FTI values within
each category.
The plot helps us assess the
likelihood of observing specific FTI
values within each category and
compare the density patterns across
the selected features.
These plots provide visual representations of the
relationship between the FTI values and the selected
categorical features. They allow us to explore how the FTI
values are distributed across different categories and gain
insights into any variations or patterns that may exist. The
histogram plot emphasizes the frequency or count of FTI
values, while the KDE plot focuses on the probability
density. Together, they provide a comprehensive
understanding of the relationship between the FTI values
and the categorical variables.
df = df.replace({"t": 1, "f": 0})
1. map_binaryClass()
function:
This function takes a
value n as input and
maps it to either 0 or 1
based on the condition.
If the input n is equal to
"N", it returns 0.
Otherwise, it returns 1.
This function is used to
convert the values in
the "binaryClass"
column to numerical
values (0 or 1)
representing the two
classes.
2. Applying
map_binaryClass() function
to "binaryClass" column:
The apply method is
used on the
"binaryClass" column of
the DataFrame (df).
It applies the
map_binaryClass()
function to each value
in the "binaryClass"
column, effectively
converting the values to
0 or 1 based on the
mapping.
3. map_sex() function:
This function takes a
value n as input and
maps it to either 0 or 1
based on the condition.
If the input n is equal to
"F", it returns 0.
Otherwise, it returns 1.
This function is used to
convert the values in
the "sex" column to
numerical values (0 or
1) representing the two
genders.
4. Applying map_sex()
function to "sex" column:
The apply method is
used on the "sex"
column of the
DataFrame (df).
It applies the map_sex()
function to each value
in the "sex" column,
effectively converting
the values to 0 or 1
based on the mapping.
5. Replacing values in the
DataFrame:
The replace method is
used on the DataFrame
(df) to replace specific
values.
In this case, the values
"t" and "f" are replaced
with 1 and 0,
respectively, throughout
the entire DataFrame.
This step is likely
performed to convert
other categorical
variables represented as
"t" and "f" to numerical
values (1 and 0) for
further analysis or
modeling purposes.
Overall, these steps involve mapping and
replacing values in the DataFrame to convert
categorical variables into numerical
representations, enabling easier data analysis
or modeling tasks.
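The mapping steps described above can be sketched as follows. The helper bodies are hypothetical reconstructions based on the description, since only the replace() call is shown here; the demo frame is illustrative.

```python
import pandas as pd

# Hypothetical reconstruction of the helpers described above:
# "N" -> 0, anything else -> 1; "F" -> 0, anything else -> 1
def map_binaryClass(n):
    return 0 if n == "N" else 1

def map_sex(n):
    return 0 if n == "F" else 1

demo = pd.DataFrame({
    "binaryClass": ["N", "P", "N"],
    "sex": ["F", "M", "F"],
    "sick": ["t", "f", "f"],
})

# Apply the mapping functions element-wise to each column
demo["binaryClass"] = demo["binaryClass"].apply(map_binaryClass)
demo["sex"] = demo["sex"].apply(map_sex)

# Replace the remaining "t"/"f" flags across the whole frame
demo = demo.replace({"t": 1, "f": 0})
```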
Output:
Feature Importance:
Features Values
17 TSH 0.511161
25 FTI 0.121740
21 TT4 0.104991
19 T3 0.054661
0 age 0.054394
23 T4U 0.046756
2 on thyroxine 0.016755
16 TSH measured 0.011206
9 query hypothyroid 0.010994
1 sex 0.010496
7 thyroid surgery 0.008807
18 T3 measured 0.007312
5 sick 0.007273
10 query hyperthyroid 0.005463
13 tumor 0.005275
15 psych 0.004032
22 T4U measured 0.003265
24 FTI measured 0.003217
8 I131 treatment 0.003115
3 query on thyroxine 0.002658
20 TT4 measured 0.002111
4 on antithyroid medication 0.001334
11 lithium 0.001125
6 pregnant 0.001060
12 goitre 0.000706
14 hypopituitary 0.000091
26 TBG measured 0.000000
print("Feature Ranking:")
print(result_lg)
Output:
Feature Ranking:
Features Ranking
26 TBG measured 15
0 age 14
21 TT4 13
25 FTI 12
15 psych 11
14 hypopituitary 10
11 lithium 9
22 T4U measured 8
10 query hyperthyroid 7
5 sick 6
8 I131 treatment 5
24 FTI measured 4
23 T4U 3
18 T3 measured 2
17 TSH 1
20 TT4 measured 1
19 T3 1
2 on thyroxine 1
16 TSH measured 1
3 query on thyroxine 1
1 sex 1
12 goitre 1
9 query hypothyroid 1
7 thyroid surgery 1
6 pregnant 1
4 on antithyroid medication 1
13 tumor 1
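A ranking table like result_lg above is typically produced with recursive feature elimination (RFE): features with ranking 1 are the selected set, and higher ranks mark the order of elimination. A minimal sketch on synthetic data follows; the estimator, n_features_to_select, and the data are assumptions, since the original selection code is not shown here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the thyroid features
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)
feature_names = [f"feat_{i}" for i in range(6)]

# Recursively eliminate one feature at a time, keeping the best three
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_demo, y_demo)

# ranking_ == 1 marks the selected features; larger values were
# eliminated earlier
result = pd.DataFrame({"Features": feature_names,
                       "Ranking": rfe.ranking_})
result = result.sort_values("Ranking", ascending=False)
```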
Step 1: Split the dataset into train and test data with three feature scalings: raw, normalization, and standardization:
sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y.ravel())

# Splits the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=2021, stratify=y)
X_train_raw = X_train.copy()
X_test_raw = X_test.copy()
y_train_raw = y_train.copy()
y_test_raw = y_test.copy()

X_train_norm = X_train.copy()
X_test_norm = X_test.copy()
y_train_norm = y_train.copy()
y_test_norm = y_test.copy()
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train_norm)
X_test_norm = norm.transform(X_test_norm)

X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
y_train_stand = y_train.copy()
y_test_stand = y_test.copy()
scaler = StandardScaler()
X_train_stand = scaler.fit_transform(X_train_stand)
X_test_stand = scaler.transform(X_test_stand)
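The pattern used above — fit the scaler on the training set only, then transform the test set with the learned statistics — matters because the test set may contain values outside the training range. A small sketch of the same MinMaxScaler / StandardScaler pattern (the toy arrays are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_tr = np.array([[0.0], [5.0], [10.0]])
X_te = np.array([[20.0]])  # unseen value outside the training range

norm = MinMaxScaler()
X_tr_norm = norm.fit_transform(X_tr)   # learns min=0, max=10 from train
X_te_norm = norm.transform(X_te)       # reuses the training statistics

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learns mean and std from train
X_te_std = scaler.transform(X_te)
```

Note that X_te_norm comes out above 1.0: MinMaxScaler only guarantees the [0, 1] range on the data it was fitted on, which is exactly why the test set must never be used for fitting.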
Learning Curve
def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None,
                        cv=None, n_jobs=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(35, 10))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes, return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes,
                         train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std,
                         alpha=0.1, color="r")
    axes[0].fill_between(train_sizes,
                         test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std,
                         alpha=0.1, color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-',
                 color="r", label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-',
                 color="g", label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes,
                         fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_times vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean,
                         test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt
Learning Curve:
Look at the trend of the
training score and cross-
validation score as the
training set size increases.
If the training score and
cross-validation score are
both low, it suggests that
the model is underfitting
the data.
If the training score is high
and the cross-validation
score is significantly lower,
it indicates overfitting.
If both scores are high and
close to each other, it
suggests a good balance
between bias and variance.
Check the variance of the
scores by observing the
shaded areas around the
lines. A wider shaded area
indicates higher variance.
Compare the training and
cross-validation scores to
assess the model's
generalization ability.
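These checks can also be applied numerically rather than by eye. The sketch below computes the train/cross-validation gap at the largest training size from learning_curve output; the DecisionTreeClassifier and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, random_state=0)

# learning_curve returns scores for each training-set size
sizes, tr_scores, te_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    cv=3, train_sizes=np.linspace(0.1, 1.0, 5))

tr_mean = tr_scores.mean(axis=1)   # training score per train size
te_mean = te_scores.mean(axis=1)   # cross-validation score per train size

# A large gap at the full training size suggests overfitting;
# both means being low suggests underfitting.
gap = tr_mean[-1] - te_mean[-1]
```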
Here are the steps for each function and their purposes:
#Plots ROC
def plot_roc(model, X_test, y_test, title):
    Y_pred_prob = model.predict_proba(X_test)
    Y_pred_prob = Y_pred_prob[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, Y_pred_prob)
    plt.figure(figsize=(25, 15))
    plt.plot([0, 1], [0, 1], color='navy', lw=5, linestyle='--')
    plt.plot(fpr, tpr, label='ANN')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve of ' + title)
    plt.grid(True)
    plt.show()

def plot_decision_boundary(model, xtest, ytest, name):
    plt.figure(figsize=(25, 15))
    #Trains model with two features
    model.fit(xtest, ytest)

    plot_decision_regions(xtest.values, ytest.ravel(), clf=model, legend=2)
    plt.title("Decision boundary for " + name + " (Test)", fontsize=30)
    plt.xlabel("TSH", fontsize=25)
    plt.ylabel("FTI", fontsize=25)
    plt.legend(fontsize=25)
    plt.show()
feat_boundary = ['TSH', 'FTI']
X_feature = X[feat_boundary]
X_train_feat, X_test_feat, y_train_feat, y_test_feat = \
    train_test_split(X_feature, y, test_size=0.2,
                     random_state=2021, stratify=y)

plot_decision_boundary(model, X_test_feat, y_test_feat, name)
plot_learning_curve(model, name, X_train, y_train, cv=3)
plt.show()

list_scores.append({'Model Name': name,
    'Feature Scaling': fc, 'Accuracy': accuracy,
    'Recall': recall, 'Precision': precision, 'F1': f1})
feature_scaling = {
    'Raw': (X_train_raw, X_test_raw, y_train_raw, y_test_raw),
    'Normalization': (X_train_norm, X_test_norm,
                      y_train_norm, y_test_norm),
    'Standardization': (X_train_stand, X_test_stand,
                        y_train_stand, y_test_stand),
}

model_svc = SVC(random_state=2021, probability=True)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('SVC', model_svc, X_train, X_test,
              y_train, y_test, fc_name)
logreg = LogisticRegression(solver='lbfgs', max_iter=5000,
                            random_state=2021)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('Logistic Regression', logreg, X_train,
              X_test, y_train, y_test, fc_name, proba=True)
The purpose of this code is to train and evaluate a
logistic regression model using different feature scaling
techniques. Here's a step-by-step explanation:
1. logreg =
LogisticRegression(solver='lbfgs',
max_iter=5000,
random_state=2021): This line
initializes a logistic regression
model with specific solver,
maximum iterations, and random
state settings. The solver is set to
'lbfgs', which is an optimization
algorithm for logistic regression.
The maximum number of
iterations is set to 5000 to ensure
convergence, and the random
state is set to 2021 for
reproducibility.
2. for fc_name, value in
feature_scaling.items():: This line
starts a loop over the
feature_scaling dictionary, which
contains different feature scaling
techniques as keys and
corresponding training and
testing data as values.
3. X_train, X_test, y_train, y_test =
value: This line unpacks the
training and testing data for the
current feature scaling technique
from the value variable into
separate variables: X_train
(training features), X_test
(testing features), y_train
(training target), and y_test
(testing target).
4. run_model('Logistic Regression',
logreg, X_train, X_test, y_train,
y_test, fc_name, proba=True):
This line calls the run_model
function to train and evaluate the
logistic regression model. It
passes the model (logreg),
training and testing data, feature
scaling technique name
(fc_name), and the parameter
proba=True to indicate that the
model should provide probability
estimates instead of class
predictions.
5. Inside the run_model() function,
the logistic regression model is
trained using the training data
(X_train, y_train), and then
predictions are made on the
testing data (X_test). The
function computes various
evaluation metrics such as
accuracy, recall, precision, and
F1-score. It also generates and
displays plots such as the
confusion matrix, predicted
values versus true values, ROC
curve, and decision boundary.
Finally, the learning curve is
plotted to visualize the model's
performance with different
training example sizes.
By running this code, you can compare the performance
of the logistic regression model using different feature
scaling techniques and assess their impact on the
model's predictions and generalization ability.
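The run_model() function itself is defined earlier in the book. A minimal sketch of what it likely does is shown below; this is an assumption based on the description above (the full version also draws the confusion matrix, ROC curve, decision boundary, and learning curve), and the synthetic data is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

list_scores = []

# Sketch: train the model, predict on the test set, record the metrics
def run_model_sketch(name, model, X_train, X_test, y_train, y_test, fc):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    list_scores.append({
        'Model Name': name, 'Feature Scaling': fc,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred)})

X_demo, y_demo = make_classification(n_samples=200, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=1, stratify=y_demo)
run_model_sketch('Logistic Regression',
                 LogisticRegression(max_iter=5000),
                 Xtr, Xte, ytr, yte, 'Raw')
```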
Raw Scaling:
Accuracy: 0.9885
Recall: 0.9885
Precision: 0.9885
F1-score: 0.9885
Normalization Scaling:
Accuracy: 0.8342
Recall: 0.8342
Precision: 0.8348
F1-score: 0.8341
Standardization Scaling:
Accuracy: 0.9641
Recall: 0.9641
Precision: 0.9644
F1-score: 0.9641
From these results, we can make the following
observations:
Raw Scaling: The model achieved
the highest accuracy, precision,
recall, and F1-score among the
three scalings. This indicates that
using the raw feature values
without any scaling had the best
performance.
Normalization Scaling: The
model's performance decreased
compared to the raw scaling. The
accuracy, precision, recall, and
F1-score were lower, suggesting
that normalizing the features to a
specific range negatively
impacted the model's predictive
ability.
Standardization Scaling: The
model performed well with
standardization scaling, although
slightly lower than the raw
scaling. It achieved high
accuracy, precision, recall, and
F1-score. Standardization helped
by standardizing the feature
values to have zero mean and
unit variance, which might have
improved the model's stability
and performance.
In conclusion, for this particular logistic regression
model and dataset, using the raw feature values without
any scaling resulted in the best performance. However,
it's important to note that the effectiveness of feature
scaling can vary depending on the dataset and the
model used. It is recommended to experiment with
different scalings and evaluate their impact on model
performance to find the most suitable approach.
run_model(f'KNeighbors Classifier n_neighbors = {max_index}',
          knn, X_train, X_test, y_train, y_test,
          fc_name, proba=True)
run_model('DecisionTree Classifier', searcher, X_train,
          X_test, y_train, y_test, fc_name, proba=True)
The code performs the following steps:
1. Iterates over the
feature_scaling dictionary
items, which contain the
different feature scaling
methods and their
corresponding data splits.
2. Assigns the training and
testing data from the
current feature scaling
method to X_train, X_test,
y_train, and y_test
variables.
3. Initializes a
DecisionTreeClassifier as
dt, which is the classifier
used for training and
prediction.
4. Defines the
hyperparameter grid for
the DecisionTreeClassifier.
In this case, it specifies the
range of max_depth values
to consider and sets the
random_state to 2021.
These hyperparameters will
be tuned using grid search.
5. Creates a GridSearchCV
object called searcher with
the DecisionTreeClassifier
and the defined parameter
grid. This object will
perform a grid search over
the specified
hyperparameters.
6. Calls the run_model()
function to train and
evaluate the
DecisionTreeClassifier with
the grid search on the
current feature scaling
method. It passes the
classifier (searcher),
training and testing data,
feature scaling method
(fc_name), and enables
probability predictions
(proba=True).
The run_model() function will fit the
DecisionTreeClassifier with grid search on the
training data, make predictions on the testing
data, and evaluate the performance of the
model. It will also generate various plots and
metrics to analyze the results.
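The grid-search setup described in steps 3-5 can be sketched as follows. The exact max_depth range and the synthetic data are assumptions, since the original grid definition is not shown in this excerpt.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Step 3: the base classifier
dt = DecisionTreeClassifier(random_state=2021)

# Step 4: the hyperparameter grid (range of max_depth is an assumption)
parameters = {'max_depth': list(range(2, 10))}

# Step 5: grid search over the specified hyperparameters
searcher = GridSearchCV(dt, parameters)

# Synthetic data standing in for one of the scaled splits
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
searcher.fit(X_demo, y_demo)
best_depth = searcher.best_params_['max_depth']
```

After fitting, searcher behaves like the best-found classifier, which is why it can be passed to run_model() directly.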
Raw Scaling:
Accuracy: 99.8%
Recall: 99.8%
Precision: 99.8%
F1-Score: 99.8%
Normalized Scaling:
Accuracy: 99.8%
Recall: 99.8%
Precision: 99.8%
F1-Score: 99.8%
Standardized Scaling:
Accuracy: 99.8%
Recall: 99.8%
Precision: 99.8%
F1-Score: 99.8%
The results indicate that the Decision Tree
Classifier model performed consistently well
across all feature scaling methods, achieving
near-perfect scores in accuracy, recall,
precision, and F1-score. This suggests that the
decision tree model was able to effectively
capture the patterns and relationships in the
dataset, regardless of the scaling method used.
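This scale-insensitivity can be checked directly: decision trees split on per-feature thresholds, and MinMax scaling is an affine (order- and midpoint-preserving) transform of each column, so the fitted tree's predictions are unchanged. A small illustration on synthetic data (not from the book):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=5, random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2021)

# Fit on raw features
raw_acc = accuracy_score(
    y_te,
    DecisionTreeClassifier(random_state=2021).fit(X_tr, y_tr).predict(X_te))

# Fit on min-max scaled features (an affine transform of each column)
scaler = MinMaxScaler().fit(X_tr)
scaled_acc = accuracy_score(
    y_te,
    DecisionTreeClassifier(random_state=2021)
        .fit(scaler.transform(X_tr), y_tr)
        .predict(scaler.transform(X_te)))

print(raw_acc, scaled_acc)  # the two accuracies coincide
```

Scale-sensitive models such as KNN, SVC, and MLP, by contrast, can behave very differently under the three scalings.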
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=20, random_state=2021)

for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('RandomForest Classifier', rf, X_train,
              X_test, y_train, y_test, fc_name, proba=True)
gbt = GradientBoostingClassifier(n_estimators=200,
                                 max_depth=21, subsample=0.8,
                                 max_features=0.2,
                                 random_state=2021)

for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('GradientBoosting Classifier', gbt, X_train,
              X_test, y_train, y_test, fc_name, proba=True)
mlp = MLPClassifier(random_state=2021)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('MLP Classifier', mlp, X_train, X_test,
              y_train, y_test, fc_name, proba=True)
#thyroid.py
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import os
import plotly.graph_objs as go
import joblib
import itertools
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, f1_score, plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
#Reads dataset
curr_path = os.getcwd()
df = pd.read_csv(curr_path+"/hypothyroid.csv")
print(df.iloc[:,0:7].head().to_string())
print(df.iloc[:,7:12].head().to_string())
print(df.iloc[:,12:21].head().to_string())
print(df.iloc[:,21:31].head().to_string())
#Checks shape
print(df.shape)
#Reads columns
print("Data Columns --> ",df.columns)
df['age'].fillna(df['age'].mean(), inplace=True)
df['age'].describe()
df[var].value_counts().plot.pie(autopct="%1.1f%%",
    colors=sns.color_palette("prism", 7),
    startangle=60, labels=label_list,
    wedgeprops={"linewidth": 2, "edgecolor": "k"},
    shadow=True, textprops={'fontsize': 20})
plt.title("Distribution of " + var + " variable " + title, fontsize=25)
value_counts = df[var].value_counts()
# Print percentage values
percentages = value_counts / len(df) * 100
print("Percentage values:")
print(percentages)
plt.subplot(122)
ax = df[var].value_counts().plot(kind="barh")
for i, j in enumerate(df[var].value_counts().values):
    ax.text(.7, i, j, weight="bold", fontsize=20)
plot_piechart(df,'binaryClass')
# Looks at distribution of all features in the whole original dataset
columns = list(df.columns)
columns.remove('binaryClass')
plt.subplots(figsize=(35, 50))
length = len(columns)
color_palette = sns.color_palette("Set3", n_colors=length)  # Define color palette
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot((length // 2), 5, j + 1)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    ax = df[i].hist(bins=10, edgecolor='black',
                    color=color_palette[j])  # Set color for each histogram
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'),
                    (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', xytext=(0, 10),
                    weight="bold", fontsize=17, textcoords='offset points')
df[df['binaryClass'] == "N"][feat].plot(ax=ax1, kind='hist', bins=10,
    edgecolor='black', color=colors[0])
ax1.set_title('Negative Thyroid', fontsize=25)
ax1.set_xlabel(feat, fontsize=20)
ax1.set_ylabel('Count', fontsize=20)
data1 = []
for p in ax1.patches:
    x = p.get_x() + p.get_width() / 2.
    y = p.get_height()
    ax1.annotate(format(y, '.0f'), (x, y), ha='center',
                 va='center', xytext=(0, 10),
                 weight="bold", fontsize=20, textcoords='offset points')
    data1.append([x, y])
df[df['binaryClass'] == "P"][feat].plot(ax=ax2, kind='hist', bins=10,
    edgecolor='black', color=colors[1])
ax2.set_title('Positive Thyroid', fontsize=25)
ax2.set_xlabel(feat, fontsize=20)
ax2.set_ylabel('Count', fontsize=20)
data2 = []
for p in ax2.patches:
    x = p.get_x() + p.get_width() / 2.
    y = p.get_height()
    ax2.annotate(format(y, '.0f'), (x, y), ha='center',
                 va='center', xytext=(0, 10),
                 weight="bold", fontsize=20, textcoords='offset points')
    data2.append([x, y])
plt.show()
def put_label_stacked_bar(ax,fontsize):
#patches is everything inside of the chart
for rect in ax.patches:
# Find where everything is located
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
y = rect.get_y()
# ax.text(x, y, text)
label_x = x + width / 2
label_y = y + height / 2
group_by_stat = df.groupby([cat, 'binaryClass']).size()
stacked_plot = group_by_stat.unstack().plot(kind='bar',
    stacked=True, ax=ax1, grid=True)
ax1.set_title('Stacked Bar Plot of ' + cat + ' (number of cases)', fontsize=14)
ax1.set_ylabel('Number of Cases')
put_label_stacked_bar(ax1, 17)
plt.show()
print(cat_cols)
print(num_cols)
plotnumber += 1
def plot_one_versus_one_cat(feat):
    categorical_features = ["FTI measured", "sex", "on thyroxine", "sick",
                            "pregnant", "hypopituitary", "psych"]
    num_plots = len(categorical_features)
    for i in range(num_plots):
        ax = axes[i]
        ax.tick_params(axis='x', labelsize=30)
        ax.tick_params(axis='y', labelsize=30)
        g = sns.countplot(x=df[categorical_features[i]], hue=df[feat], ax=ax)
        for p in g.patches:
            g.annotate(format(p.get_height(), '.0f'),
                       (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha='center', va='center', xytext=(0, 10),
                       weight="bold", fontsize=20, textcoords='offset points')
        g.set_xlabel(categorical_features[i], fontsize=30)
        g.set_ylabel("Number of Cases", fontsize=30)
        g.legend(fontsize=25)
    plt.tight_layout()
    plt.show()
plot_one_versus_one_cat("binaryClass")
plot_one_versus_one_cat("thyroid surgery")
plot_one_versus_one_cat("lithium")
plot_one_versus_one_cat("TSH measured")
plot_one_versus_one_cat("T3 measured")
plot_one_versus_one_cat("TT4 measured")
plot_one_versus_one_cat("T4U measured")
plot_one_versus_one_cat("FTI measured")
plot_one_versus_one_cat("I131 treatment")
plot_one_versus_one_cat("query hypothyroid")
col1_labels = df[col1].unique()
col2_labels = df[col2].unique()

# Print each pie chart as a table in tabular form for each class
print(f"\n{col1.capitalize()}:")
print(f"{target.capitalize()} = 0:")
gs0_table = pd.DataFrame({'Categories': col1_labels[:len(gs0)],
                          'Percentage': ..., 'Count': gs0})
print(gs0_table)
print(f"\n{target.capitalize()} = 1:")
gs1_table = pd.DataFrame({'Categories': col1_labels[:len(gs1)],
                          'Percentage': ..., 'Count': gs1})
print(gs1_table)
print(f"\n{col2.capitalize()}:")
print(f"{target.capitalize()} = 0:")
ss0_table = pd.DataFrame({'Categories': col2_labels[:len(ss0)],
                          'Percentage': ..., 'Count': ss0})
print(ss0_table)
print(f"\n{target.capitalize()} = 1:")
ss1_table = pd.DataFrame({'Categories': col2_labels[:len(ss1)],
                          'Percentage': ..., 'Count': ss1})
print(ss1_table)
def feat_versus_other(feat, another, legend, ax0, label):
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)
    put_label_stacked_bar(ax0_sns, 15)
    ax0_sns.set_xlabel('', fontsize=30, weight='bold')
    ax0_sns.set_ylabel('', fontsize=30, weight='bold')
    ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=...)
    ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=...)
label_bin = list(df["binaryClass"].value_counts().index)
label_sex = list(df["sex"].value_counts().index)
label_thyroxine = list(df["on thyroxine"].value_counts().index)
label_pregnant = list(df["pregnant"].value_counts().index)
label_lithium = list(df["lithium"].value_counts().index)
label_goitre = list(df["goitre"].value_counts().index)
label_tumor = list(df["tumor"].value_counts().index)
label_tsh = list(df["TSH measured"].value_counts().index)
label_tt4 = list(df["TT4 measured"].value_counts().index)
label_dict = {
"binaryClass": label_bin,
"sex": label_sex,
"on thyroxine": label_thyroxine,
"pregnant": label_pregnant,
"lithium": label_lithium,
"goitre": label_goitre,
"tumor": label_tumor,
"TSH measured": label_tsh,
"TT4 measured": label_tt4
}
plt.tight_layout()
plt.show()
hist_feat_versus_nine_cat(df["age"],"age")
def prob_feat_versus_other(feat, another, legend, ax0, label):
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)
    ax0_sns = sns.kdeplot(x=feat, ax=ax0, hue=another, linewidth=0.3,
                          fill=True, cbar='g', zorder=...)
    ax0_sns.set_xlabel('', fontsize=20, weight='bold')
    ax0_sns.set_ylabel('', fontsize=20, weight='bold')
    plt.tight_layout()
    plt.show()
prob_feat_versus_nine_cat(df["age"],"age")
hist_feat_versus_nine_cat(df["TSH"],"TSH")
prob_feat_versus_nine_cat(df["TSH"],"TSH")
hist_feat_versus_nine_cat(df["T3"],"T3")
prob_feat_versus_nine_cat(df["T3"],"T3")
hist_feat_versus_nine_cat(df["TT4"],"TT4")
prob_feat_versus_nine_cat(df["TT4"],"TT4")
hist_feat_versus_nine_cat(df["T4U"],"T4U")
prob_feat_versus_nine_cat(df["T4U"],"T4U")
hist_feat_versus_nine_cat(df["FTI"],"FTI")
prob_feat_versus_nine_cat(df["FTI"],"FTI")
def map_binaryClass(n):
    if n == "N":
        return 0
    else:
        return 1
df['binaryClass'] = df['binaryClass'].apply(lambda x: map_binaryClass(x))

def map_sex(n):
    if n == "F":
        return 0
    else:
        return 1
df['sex'] = df['sex'].apply(lambda x: map_sex(x))
df = df.replace({"t": 1, "f": 0})
result_rf = pd.DataFrame()
result_rf['Features'] = X.columns
result_rf ['Values'] = rf.feature_importances_
result_rf.sort_values('Values', inplace = True, ascending = False)
plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y = 'Features', data=result_rf, color="Blue")
plt.xlabel('Feature Importance', fontsize=30)
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
result_et = pd.DataFrame()
result_et['Features'] = X.columns
result_et ['Values'] = model.feature_importances_
result_et.sort_values('Values', inplace=True, ascending =False)
plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Values',y = 'Features', data=result_et, color="red")
plt.xlabel('Feature Importance', fontsize=30)
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
result_lg = pd.DataFrame()
result_lg['Features'] = X.columns
result_lg ['Ranking'] = rfe.ranking_
result_lg.sort_values('Ranking', inplace=True , ascending = False)
plt.figure(figsize=(25,25))
sns.set_color_codes("pastel")
sns.barplot(x = 'Ranking',y = 'Features', data=result_lg, color="orange")
plt.ylabel('Feature Labels', fontsize=30)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.show()
print("Feature Ranking:")
print(result_lg)
X_train_norm = X_train.copy()
X_test_norm = X_test.copy()
y_train_norm = y_train.copy()
y_test_norm = y_test.copy()
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train_norm)
X_test_norm = norm.transform(X_test_norm)
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
y_train_stand = y_train.copy()
y_test_stand = y_test.copy()
scaler = StandardScaler()
X_train_stand = scaler.fit_transform(X_train_stand)
X_test_stand = scaler.transform(X_test_stand)
axes[0].set_title(title)
if ylim is not None:
axes[0].set_ylim(*ylim)
axes[0].set_xlabel("Training examples")
axes[0].set_ylabel("Score")
return plt
#Plots ROC
def plot_roc(model,X_test, y_test, title):
Y_pred_prob = model.predict_proba(X_test)
Y_pred_prob = Y_pred_prob[:, 1]
return y_pred
list_scores = []
feature_scaling = {
#'Raw':(X_train_raw, X_test_raw, y_train_raw, y_test_raw),
#'Normalization':(X_train_norm, X_test_norm, y_train_norm, y_test_norm),
'Standardization':(X_train_stand, X_test_stand, y_train_stand, y_test_stand)
}
model_svc = SVC(random_state=2021,probability=True)
for fc_name, value in feature_scaling.items():
X_train, X_test, y_train, y_test = value
run_model('SVC', model_svc, X_train, X_test, y_train, y_test, fc_name)
for i in range(2,50):
knn = KNeighborsClassifier(n_neighbors = i)
knn.fit(X_train, y_train)
scores_1.append(accuracy_score(y_test, knn.predict(X_test)))
max_val = max(scores_1)
max_index = np.argmax(scores_1) + 2
dt = DecisionTreeClassifier()
parameters = { 'max_depth':np.arange(1,21,1),'random_state':[2021]}
searcher = GridSearchCV(dt, parameters)
PREDICTING
THYROID
USING DEEP LEARNING
# thyroid_ann.py
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O
import os
import cv2
import seaborn as sns
sns.set_style('darkgrid')
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
#Reads dataset
curr_path = os.getcwd()
df = pd.read_csv(curr_path+"/hypothyroid.csv")

#Replaces ? with NaN
df = df.replace({"?": np.NAN})

#Converts six columns into numerical
num_cols = ['age', 'FTI', 'TSH', 'T3', 'TT4', 'T4U']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

#Deletes irrelevant columns
df.drop(['TBG', 'referral source'], axis=1, inplace=True)

#Handles missing values
#Uses mode imputation for all other categorical features
def mode_imputation(feature):
    mode = df[feature].mode()[0]
    df[feature] = df[feature].fillna(mode)

for col in ['sex', 'T4U measured']:
    mode_imputation(col)

df['age'].fillna(df['age'].mean(), inplace=True)

imputer = SimpleImputer(strategy='mean')
df['TSH'] = imputer.fit_transform(df[['TSH']])
df['T3'] = imputer.fit_transform(df[['T3']])
df['TT4'] = imputer.fit_transform(df[['TT4']])
df['T4U'] = imputer.fit_transform(df[['T4U']])
df['FTI'] = imputer.fit_transform(df[['FTI']])

#Cleans age column
for i in range(df.shape[0]):
    if df.age.iloc[i] > 100.0:
        df.age.iloc[i] = 100.0

df['age'].describe()

#Converts binaryClass feature to {0,1}
def map_binaryClass(n):
    if n == "N":
        return 0
    else:
        return 1

df['binaryClass'] = df['binaryClass'].apply(lambda x: map_binaryClass(x))

#Converts sex feature to {0,1}
def map_sex(n):
    if n == "F":
        return 0
    else:
        return 1

df['sex'] = df['sex'].apply(lambda x: map_sex(x))
#Resamples data
sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y.ravel())

#Splits the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=2021, stratify=y)

#Standard scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Step 1: Build, compile, and train the model, then save it and its training history into files.
The output represents the summary of the model structure and the number of parameters in each layer:
Input Layer (dense): It has an output shape of
(None, 100), indicating that it outputs a tensor
of shape (batch_size, 100). The layer has 2,800
parameters.
Dropout Layer (dropout): It has an output
shape of (None, 100), same as the previous
layer. Dropout layers do not have any
parameters.
Hidden Layer 1 (dense_1): It has an output
shape of (None, 20), indicating that it outputs a
tensor of shape (batch_size, 20). The layer has
2,020 parameters.
Dropout Layer 1 (dropout_1): It has an output
shape of (None, 20), same as the previous
layer. Dropout layers do not have any
parameters.
Output Layer (dense_2): It has an output shape
of (None, 1), indicating that it outputs a tensor
of shape (batch_size, 1). The layer has 21
parameters.
The total number of trainable parameters in the
model is 4,841. These parameters are learned
during the training process to make predictions
on the given task.
This summary provides a concise overview of the model architecture, the
output shapes of each layer, and the total number of parameters.
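These counts follow directly from the layer sizes: a dense layer with n_in inputs and n_out units has n_in × n_out weights plus one bias per unit. A quick check, using the layer sizes above (the helper name dense_params is illustrative):

```python
def dense_params(n_in, n_out):
    # Weights (n_in x n_out) plus one bias per output unit
    return n_in * n_out + n_out

input_layer = dense_params(27, 100)  # 27 input features -> 100 units
hidden_1 = dense_params(100, 20)     # 100 units -> 20 units
output = dense_params(20, 1)         # 20 units -> 1 sigmoid unit
total = input_layer + hidden_1 + output
print(input_layer, hidden_1, output, total)  # 2800 2020 21 4841
```

The dropout layers contribute nothing to this total, since dropout only masks activations and has no learned weights.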
Based on this analysis, we can conclude that the model is well-suited for the
task at hand, which is predicting whether an individual has hypothyroidism
or not. The achieved high accuracy on both the training and validation data
suggests that the model can make accurate predictions on unseen data as
well.
Output:
Accuracy Score : 0.9928212491026561
Classification Report :
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       697
           1       0.99      0.99      0.99       696

    accuracy                           0.99      1393
   macro avg       0.99      0.99      0.99      1393
weighted avg       0.99      0.99      0.99      1393
Accuracy Score: The accuracy_score function is used to calculate the accuracy of the model's predictions. It compares the predicted classes (y_pred) with the actual classes (y_test) and returns the accuracy score. The normalize=True parameter normalizes the score to a value between 0 and 1. The accuracy score represents the proportion of correct predictions out of the total number of samples.
Classification Report: The classification_report function generates a comprehensive report that includes several metrics such as precision, recall, F1-score, and support for each class. It compares the predicted classes (y_pred) with the actual classes (y_test) and provides a detailed analysis of the model's performance for each class. The report is printed to the console.
The output of these statements will be the accuracy score and the classification report. The accuracy score indicates how well the model performed overall, while the classification report provides insights into the performance of the model for each class, including precision (the proportion of true positive predictions out of the total predicted positives), recall (the proportion of true positive predictions out of the total actual positives), F1-score (a balanced measure of precision and recall), and support (the number of samples in each class).
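As a small, self-contained illustration of these two calls (toy labels, not the thyroid data):

```python
from sklearn.metrics import accuracy_score, classification_report

y_test = [0, 0, 1, 1, 1, 0]  # actual classes
y_pred = [0, 1, 1, 1, 1, 0]  # predicted classes; one mistake at index 1

# Proportion of correct predictions: 5 of 6
acc = accuracy_score(y_test, y_pred, normalize=True)
print(acc)

# Per-class precision, recall, F1-score, and support
print(classification_report(y_test, y_pred))
```

Here class 1 gets recall 1.00 (every actual 1 was found) but precision 0.75 (one of the four predicted 1s was wrong), showing why the report is more informative than accuracy alone.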
Confusion Matrix
#Confusion matrix:
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)

class_list = ['NEGATIVE', 'POSITIVE']
fig, ax = plt.subplots(figsize=(25, 15))
sns.heatmap(conf_mat, annot=True, ax=ax, cmap='YlOrBr', fmt='g',
            annot_kws={"size": 30})
ax.set_xlabel('Predicted labels', fontsize=30)
ax.set_ylabel('True labels', fontsize=30)
ax.set_title('Confusion Matrix', fontsize=30)
ax.xaxis.set_ticklabels(class_list)
ax.yaxis.set_ticklabels(class_list)
ax.tick_params(labelsize=30)
The result is shown in Figure 142. The plot_real_pred_val() function is defined to create a scatter plot comparing the predicted values (ypred) and the true values (Y_test) for a given model (name). Here's a step-by-step explanation of the code:
1. Create a figure for the plot: The plt.figure function is used to create a figure object with a specified size (figsize=(25,15)).
2. Calculate accuracy: The accuracy between the predicted values and the true values is calculated using the accuracy_score function and stored in the variable acc.
3. Plot the predicted values: The predicted values (ypred) are plotted as scatter points using the plt.scatter function. The color="blue" parameter sets the color of the points to blue. The lw=5 parameter sets the linewidth of the points to 5. The label="Predicted" parameter sets the label for the legend.
4. Plot the true values: The true values (Y_test) are plotted as scatter points using the plt.scatter function. The color="red" parameter sets the color of the points to red. The label="Actual" parameter sets the label for the legend.
5. Set the title: The plt.title function is used to set the title for the plot. The title includes the model name (name) and is formatted with the fontsize=10 parameter.
6. Set the x-axis label: The plt.xlabel function is used to set the x-axis label. The label includes the accuracy value (acc) formatted as a percentage using str(round((acc*100),3)) + "%". It represents the accuracy of the model in predicting the values.
7. Set the legend: The plt.legend function is used to display the legend on the plot. The fontsize=25 parameter sets the font size of the legend.
8. Display the grid: The plt.grid function is used to display a grid on the plot. The True parameter enables the grid, while alpha=0.75 sets the transparency level, lw=1 sets the linewidth of the grid lines, and ls='-.' sets the linestyle.
9. Show the plot: The plt.show function is called to display the plot.
The resulting plot shows the predicted values as blue points and the true values as red points. The accuracy of the model is displayed as the x-axis label, and the legend differentiates between the predicted and actual values. The grid lines provide additional visual guidance.
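Assembling those steps, the function looks roughly like the sketch below. The signature is assumed from the description; the book's version ends with plt.show(), while this sketch uses a non-interactive backend and returns the accuracy so it can be checked headlessly:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def plot_real_pred_val(Y_test, ypred, name):
    # Scatter plot of predicted vs. actual class labels for a named model
    plt.figure(figsize=(25, 15))
    acc = accuracy_score(Y_test, ypred)
    plt.scatter(range(len(ypred)), ypred, color="blue", lw=5, label="Predicted")
    plt.scatter(range(len(Y_test)), Y_test, color="red", label="Actual")
    plt.title("Predicted Values vs True Values of " + name, fontsize=10)
    plt.xlabel("Accuracy: " + str(round((acc * 100), 3)) + "%")
    plt.legend(fontsize=25)
    plt.grid(True, alpha=0.75, lw=1, ls='-.')
    return acc
```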
#thyroid_ann.py
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import os
import cv2
import seaborn as sns
sns.set_style('darkgrid')
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report,
accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
#Reads dataset
curr_path = os.getcwd()
df = pd.read_csv(curr_path+"/hypothyroid.csv")
df['age'].fillna(df['age'].mean(), inplace=True)
imputer = SimpleImputer(strategy='mean')
df['TSH'] = imputer.fit_transform(df[['TSH']])
df['T3'] = imputer.fit_transform(df[['T3']])
df['TT4'] = imputer.fit_transform(df[['TT4']])
df['T4U'] = imputer.fit_transform(df[['T4U']])
df['FTI'] = imputer.fit_transform(df[['FTI']])
def map_binaryClass(n):
    if n == "N":
        return 0
    else:
        return 1
df['binaryClass'] = df['binaryClass'].apply(lambda x: map_binaryClass(x))

def map_sex(n):
    if n == "F":
        return 0
    else:
        return 1
df['sex'] = df['sex'].apply(lambda x: map_sex(x))
#Resamples data
sm = SMOTE(random_state=42)
X,y = sm.fit_resample(X, y.ravel())
#Standard scaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Imports Tensorflow and creates a Sequential Model to add layers for the ANN
ann = tf.keras.models.Sequential()
#Input layer
ann.add(tf.keras.layers.Dense(units=100,
input_dim=27,
kernel_initializer='uniform',
activation='relu'))
ann.add(tf.keras.layers.Dropout(0.5))
#Hidden layer 1
ann.add(tf.keras.layers.Dense(units=20,
kernel_initializer='uniform',
activation='relu'))
ann.add(tf.keras.layers.Dropout(0.5))
#Output layer
ann.add(tf.keras.layers.Dense(units=1,
kernel_initializer='uniform',
activation='sigmoid'))
#Saves model
ann.save('thyroid_model.h5')
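The listing above builds and saves the network but omits the compile-and-fit step that produces the history object used below. A minimal sketch consistent with the single sigmoid output (the random stand-in arrays, optimizer choice, and epoch count are assumptions, not the book's exact settings):

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the book's prepared training arrays
X_train = np.random.rand(64, 27).astype("float32")
y_train = np.random.randint(0, 2, size=(64,))

ann = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, input_dim=27, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Binary cross-entropy matches the single sigmoid output unit
ann.compile(optimizer='adam', loss='binary_crossentropy',
            metrics=['accuracy'])
history = ann.fit(X_train, y_train, validation_split=0.2,
                  epochs=2, batch_size=16, verbose=0)
print(sorted(history.history.keys()))
```

The history.history dictionary then holds the per-epoch loss, accuracy, val_loss, and val_accuracy curves plotted in the next listings.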
print (history.history.keys())
#accuracy
fig, ax = plt.subplots(figsize=(25, 15))
plt.plot(epochs, acc, 'r', label='Training accuracy', lw=10)
plt.plot(epochs, val_acc, 'b--', label='Validation accuracy', lw=10)
plt.title('Training and validation accuracy', fontsize=35)
plt.legend(fontsize=25)
ax.set_xlabel("Epoch", fontsize=30)
ax.tick_params(labelsize=30)
plt.show()
#loss
fig, ax = plt.subplots(figsize=(25, 15))
plt.plot(epochs, loss, 'r', label='Training loss', lw=10)
plt.plot(epochs, val_loss, 'b--', label='Validation loss', lw=10)
plt.title('Training and validation loss', fontsize=35)
plt.legend(fontsize=25)
ax.set_xlabel("Epoch", fontsize=30)
ax.tick_params(labelsize=30)
plt.show()
#Sets the threshold for the predictions. In this case, the threshold is 0.5 (this value can be modified).
#prediction on test set
y_pred = ann.predict(X_test)
y_pred = [int(p>=0.5) for p in y_pred]
print(y_pred)
#Confusion matrix:
conf_mat = confusion_matrix(y_true=y_test, y_pred = y_pred)
class_list = ['NEGATIVE', 'POSITIVE']
fig, ax = plt.subplots(figsize=(25, 15))
sns.heatmap(conf_mat, annot=True, ax = ax, cmap='YlOrBr', fmt='g',
annot_kws={"size": 30})
ax.set_xlabel('Predicted labels', fontsize=30)
ax.set_ylabel('True labels', fontsize=30)
ax.set_title('Confusion Matrix', fontsize=30)
ax.xaxis.set_ticklabels(class_list), ax.yaxis.set_ticklabels(class_list)
ax.tick_params(labelsize=30)
plt.scatter(range(len(ypred)),ypred,color="blue",lw=5,label="Predicted")
plt.scatter(range(len(Y_test)), Y_test,color="red",label="Actual")
plt.title("Predicted Values vs True Values of " + name, fontsize=10)
plt.xlabel("Accuracy: " + str(round((acc*100),3)) + "%")
plt.legend(fontsize=25)
plt.grid(True, alpha=0.75, lw=1, ls='-.')
plt.show()
IMPLEMENTING
GUI
WITH PYQT
Designing GUI
Step 2: Put three Push Button widgets onto the form. Set their text properties to LOAD DATA, TRAIN ML MODEL, and TRAIN DL MODEL. Set their objectName properties to pbLoad, pbTrainML, and pbTrainDL.
Step 3: Put two Table Widgets onto the form. Set their objectName properties to twData1 and twData2.
Step 4: Add two Label widgets onto the form. Set their text properties to Label 1 and Label 2 and set their objectName properties to label1 and label2.
Step 5: Put three Widgets from the Containers panel onto the form and set their objectName properties to widgetPlot1, widgetPlot2, and widgetPlot3.
#plot_class.py
from PyQt5.QtWidgets import *
from matplotlib.backends.backend_qt5agg import FigureCanvas
from matplotlib.figure import Figure

class plot_class(QWidget):
    def __init__(self, parent=None):
        QWidget.__init__(self, parent)
        self.canvas = FigureCanvas(Figure())

        vertical_layout = QVBoxLayout()
        vertical_layout.addWidget(self.canvas)

        self.canvas.axis1 = self.canvas.figure.add_subplot(111)
        self.canvas.figure.subplots_adjust(
            top=0.936,
            bottom=0.104,
            left=0.047,
            right=0.981,
            hspace=0.2,
            wspace=0.2
        )
        self.canvas.figure.set_facecolor("xkcd:light mauve")
        self.setLayout(vertical_layout)
The code defines a class called plot_class that inherits
from the QWidget class in the PyQt5 library. This class is
designed to display a plot using Matplotlib within a Qt
application. Here's a step-by-step explanation of the
code:
1. Import the required modules: The code starts by importing the necessary modules from the PyQt5 and Matplotlib libraries.
2. Define the plot_class class: The plot_class class is defined, and it inherits from the QWidget class.
3. Initialize the class: The __init__ method is defined to initialize the class. It takes parent as an optional parameter.
4. Create a FigureCanvas: Inside the __init__ method, a FigureCanvas object is created with an empty Figure. This canvas will be used to display the plot.
5. Create a vertical layout: A QVBoxLayout object is created to arrange the canvas vertically.
6. Add the canvas to the layout: The canvas is added to the vertical layout using the addWidget method.
7. Add a subplot to the canvas: The canvas is assigned a subplot using the add_subplot method. The argument (111) specifies a single subplot occupying the entire canvas.
8. Adjust subplot parameters: The subplots_adjust method is called to adjust the spacing and positioning of the subplot within the canvas. The provided parameters control the top, bottom, left, and right margins and the height and width spacing ratios.
9. Set the facecolor of the figure: The set_facecolor method of the Figure object is called to set the background color of the figure. In this case, the color is set to "xkcd:light mauve".
10. Set the layout for the widget: The setLayout method is called to set the vertical layout as the layout for the widget.
The plot_class class provides a convenient way to display
a plot within a Qt application by embedding a Matplotlib
canvas in a QWidget.
Step 9: Add another Combo Box widget and set its objectName property to cbClassifier. Populate this widget with fourteen items as shown in Figure 144.
Step 10: Add three radio buttons and set their text properties to Raw, Norm, and Stand. Then, set their objectName properties to rbRaw, rbNorm, and rbStand.
Step 11: Add the last Combo Box widget and set its objectName property to cbPredictionDL. Populate this widget with two items as shown in Figure 145.
#gui_thyroid.py
from PyQt5.QtWidgets import *
from PyQt5.uic import loadUi
from matplotlib.backends.backend_qt5agg import (NavigationToolbar2QT as NavigationToolbar)
from matplotlib.colors import ListedColormap

class DemoGUI_Thyroid(QMainWindow):
    def __init__(self):
        QMainWindow.__init__(self)
        loadUi("gui_thyroid.ui", self)
        self.setWindowTitle("GUI Demo of Classifying and Predicting Thyroid Disease")
        self.addToolBar(NavigationToolbar(self.widgetPlot1.canvas, self))

if __name__ == '__main__':
    import sys
    app = QApplication(sys.argv)
    ex = DemoGUI_Thyroid()
    ex.show()
    sys.exit(app.exec_())
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import warnings
import mglearn
warnings.filterwarnings('ignore')
import os
import joblib
from numpy import save
from numpy import load
from os import path
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, f1_score, plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
import tensorflow as tf
from sklearn.base import clone
from sklearn.decomposition import PCA
from tensorflow.keras.models import Sequential, Model, load_model
from sklearn.impute import SimpleImputer
The code includes two functions for populating a QTableWidget with data from a pandas DataFrame. Here's a step-by-step explanation of the code:
1. write_df_to_qtable() method: This static method takes a DataFrame (df) and a QTableWidget (table) as input parameters and writes the DataFrame data to the QTableWidget. Here are the steps performed in this method:
a. Get the headers of the DataFrame using list(df) and assign them to the headers variable.
b. Set the number of rows and columns in the QTableWidget using setRowCount and setColumnCount, respectively, based on the shape of the DataFrame.
c. Set the horizontal header labels of the QTableWidget using setHorizontalHeaderLabels, passing the headers list.
d. Convert the DataFrame to a NumPy array using df.values and assign it to the df_array variable. This step improves computational efficiency when retrieving data from the DataFrame.
e. Iterate over each row and column in the DataFrame using nested loops, and set the QTableWidgetItem at the corresponding position of the QTableWidget using setItem. Convert the DataFrame values to strings using str before assigning them to the QTableWidgetItem.
2. populate_table() method: This method takes two parameters, data (a DataFrame) and table (a QTableWidget), and populates the QTableWidget with the data from the DataFrame. Here are the steps performed in this method:
a. Call the write_df_to_qtable() method, passing data and table as arguments, to populate the table with the DataFrame data.
b. Enable alternating row colors of the QTableWidget using setAlternatingRowColors with True as the argument.
c. Apply a custom stylesheet to the QTableWidget using setStyleSheet to set the alternating background colors.
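The non-Qt core of these two methods, deriving the headers and converting every cell value to a string, can be sketched in isolation. The helper df_to_table_cells() below is hypothetical (not part of the book's code) and stands in for the nested setItem() loops:

```python
import pandas as pd

def df_to_table_cells(df):
    # Headers exactly as write_df_to_qtable() derives them: list(df).
    headers = list(df)
    # One str() per cell via df.values, mirroring the nested setItem() loops.
    # Note: .values upcasts mixed numeric columns to a common dtype.
    cells = [[str(v) for v in row] for row in df.values]
    return headers, cells

df = pd.DataFrame({"age": [29.0, 41.0], "TSH": [1.3, 2.5]})
headers, cells = df_to_table_cells(df)
print(headers)  # -> ['age', 'TSH']
```

In the GUI version, each string in cells would become a QTableWidgetItem placed with table.setItem(row, col, item).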
Step 3: Define initial_state() method to disable some widgets when the form initially runs:
df[feature]=df[feature].fillna(mode)
def populate_cbData(self):
    self.cbData.addItems(self.df.iloc[:,1:])
    self.cbData.addItems(["Features Importance"])
    self.cbData.addItems(["Correlation Matrix", \
        "Pairwise Relationship", "Features Correlation"])
def import_dataset(self):
    curr_path = os.getcwd()
    dataset_dir = curr_path + "/hypothyroid.csv"

    #Loads csv file
    self.df = self.read_dataset(dataset_dir)

    #Populates tables with data
    self.populate_table(self.df, self.twData1)
    self.label1.setText('Thyroid Disease Data')

    self.populate_table(self.df.describe(), self.twData2)
    self.twData2.setVerticalHeaderLabels(['Count', \
        'Mean', 'Std', 'Min', '25%', '50%', '75%', 'Max'])
    self.label2.setText('Data Description')

    #Turns on pbTrainML widget
    self.pbTrainML.setEnabled(True)
    self.pbTrainDL.setEnabled(True)

    #Turns off pbLoad
    self.pbLoad.setEnabled(False)

    #Populates cbData
    self.populate_cbData()
Step 7: Connect the clicked() event of the pbLoad widget with import_dataset() inside the __init__() method, and invoke the initial_state() method:

def __init__(self):
    QMainWindow.__init__(self)
    loadUi("gui_thyroid.ui",self)
    self.setWindowTitle(\
        "GUI Demo of Classifying and Predicting Thyroid Disease")
    self.addToolBar(NavigationToolbar(\
        self.widgetPlot1.canvas, self))
    self.pbLoad.clicked.connect(self.import_dataset)
    self.initial_state(False)
Figure 147 The initial state of form
Figure 148 When LOAD DATA button is clicked, the two tables will be
populated
Step 8: Run gui_thyroid.py and you will see the other widgets are initially disabled, as shown in Figure 147. Then click the LOAD DATA button. The two tables will be populated as shown in Figure 148.
return X, y
Step 2: Define train_test() to split the dataset into train and test data with raw, normalized, and standardized feature scaling:

def train_test(self):
    X, y = self.fit_dataset(self.df)

    #Splits the data into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y,\
        test_size = 0.2, random_state = 2021, stratify=y)
    self.X_train_raw = X_train.copy()
    self.X_test_raw = X_test.copy()
    self.y_train_raw = y_train.copy()
    self.y_test_raw = y_test.copy()

    #Saves into npy files
    save('X_train_raw.npy', self.X_train_raw)
    save('y_train_raw.npy', self.y_train_raw)
    save('X_test_raw.npy', self.X_test_raw)
    save('y_test_raw.npy', self.y_test_raw)

    self.X_train_norm = X_train.copy()
    self.X_test_norm = X_test.copy()
    self.y_train_norm = y_train.copy()
    self.y_test_norm = y_test.copy()
    norm = MinMaxScaler()
    self.X_train_norm[inf_cols] = \
        norm.fit_transform(self.X_train_norm)
    self.X_test_norm[inf_cols] = \
        norm.transform(self.X_test_norm)

    #Saves into npy files
    save('X_train_norm.npy', self.X_train_norm)
    save('y_train_norm.npy', self.y_train_norm)
    save('X_test_norm.npy', self.X_test_norm)
    save('y_test_norm.npy', self.y_test_norm)

    self.X_train_stand = X_train.copy()
    self.X_test_stand = X_test.copy()
    self.y_train_stand = y_train.copy()
    self.y_test_stand = y_test.copy()
    scaler = StandardScaler()
    self.X_train_stand[inf_cols] = \
        scaler.fit_transform(self.X_train_stand)
    self.X_test_stand[inf_cols] = \
        scaler.transform(self.X_test_stand)
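The scaling pattern used above, fitting each scaler on the training split only and then reusing it on the test split, can be verified on a toy array. This sketch uses synthetic data rather than the thyroid dataset, with the same split settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(2021)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Same split settings as train_test(): 80/20, stratified on y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2021, stratify=y)

# Fit the scalers on the training data only, then reuse them on the
# test data, so no test-split information leaks into the scaling.
norm = MinMaxScaler()
X_train_norm = norm.fit_transform(X_train)
X_test_norm = norm.transform(X_test)

scaler = StandardScaler()
X_train_stand = scaler.fit_transform(X_train)
X_test_stand = scaler.transform(X_test)
```

Because the test data is transformed with training statistics, X_test_norm may fall slightly outside [0, 1]; only the training split is guaranteed to stay in range.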
def split_data_ML(self):
    if path.isfile('X_train_raw.npy'):
        #Loads npy files
        self.X_train_raw = \
            np.load('X_train_raw.npy',allow_pickle=True)
        self.y_train_raw = \
            np.load('y_train_raw.npy',allow_pickle=True)
        self.X_test_raw = \
            np.load('X_test_raw.npy',allow_pickle=True)
        self.y_test_raw = \
            np.load('y_test_raw.npy',allow_pickle=True)

        self.X_train_norm = \
            np.load('X_train_norm.npy',allow_pickle=True)
        self.y_train_norm = \
            np.load('y_train_norm.npy',allow_pickle=True)
        self.X_test_norm = \
            np.load('X_test_norm.npy',allow_pickle=True)
        self.y_test_norm = \
            np.load('y_test_norm.npy',allow_pickle=True)

        self.X_train_stand = \
            np.load('X_train_stand.npy',allow_pickle=True)
        self.y_train_stand = \
            np.load('y_train_stand.npy',allow_pickle=True)
        self.X_test_stand = \
            np.load('X_test_stand.npy',allow_pickle=True)
        self.y_test_stand = \
            np.load('y_test_stand.npy',allow_pickle=True)

    else:
        self.train_test()

    #Prints each shape
    print('X train raw shape: ', self.X_train_raw.shape)
    print('Y train raw shape: ', self.y_train_raw.shape)
    print('X test raw shape: ', self.X_test_raw.shape)
    print('Y test raw shape: ', self.y_test_raw.shape)

    #Prints each shape
    print('X train norm shape: ', self.X_train_norm.shape)
    print('Y train norm shape: ', self.y_train_norm.shape)
    print('X test norm shape: ', self.X_test_norm.shape)
    print('Y test norm shape: ', self.y_test_norm.shape)
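split_data_ML() caches the splits by round-tripping arrays through .npy files with np.save()/np.load(). That caching behavior can be checked in isolation (the file name below is only for this demo):

```python
import os
import tempfile
import numpy as np

arr = np.arange(6).reshape(3, 2)
npy_path = os.path.join(tempfile.gettempdir(), "X_demo.npy")

# First run: the file does not exist yet, so it must be created
# (in split_data_ML() this is the train_test() branch).
np.save(npy_path, arr)

# Later runs: os.path.isfile() is True, so the cached array is loaded
# instead of re-running the train/test split.
assert os.path.isfile(npy_path)
loaded = np.load(npy_path, allow_pickle=True)
```

allow_pickle=True is needed here only when the saved arrays contain Python objects (e.g. object-dtype columns); plain numeric arrays load without it.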
def train_model_ML(self):
    self.split_data_ML()

    #Turns on three widgets
    self.cbData.setEnabled(True)
    self.cbClassifier.setEnabled(True)
    self.cbPredictionML.setEnabled(True)

    #Turns off pbTrainML
    self.pbTrainML.setEnabled(False)
def __init__(self):
    QMainWindow.__init__(self)
    loadUi("gui_thyroid.ui",self)
    self.setWindowTitle(\
        "GUI Demo of Classifying and Predicting Thyroid Disease")
    self.addToolBar(NavigationToolbar(\
        self.widgetPlot1.canvas, self))
    self.pbLoad.clicked.connect(self.import_dataset)
    self.initial_state(False)
    self.pbTrainML.clicked.connect(self.train_model_ML)
Step 6: Run gui_thyroid.py and you will see the other widgets are initially disabled. Click the LOAD DATA button. The two tables are populated, the LOAD DATA button is disabled, and the TRAIN ML MODEL and TRAIN DL MODEL buttons are enabled. Then, click the TRAIN ML MODEL button. You will see that cbData, cbClassifier, and cbPredictionML are enabled and pbTrainML is disabled as shown in Figure 149. You will also find the training and test npy files for machine learning in your working directory.
    widget.canvas.axis1.set_title("Count of "+ var +" cases")
    widget.canvas.figure.tight_layout()
    widget.canvas.draw()
The bar_cat() method plots a horizontal bar chart to visualize the count
of categories in a categorical variable in the given DataFrame df. It
takes the following parameters:
df: The DataFrame containing the variable to
be plotted.
var: The name of the categorical variable.
widget: The widget that contains the
matplotlib figure for displaying the chart.
The method uses the value_counts() function to count the occurrences
of each category in the variable and plots a horizontal bar chart using
the plot(kind='barh') function. The count values are displayed on the
bars using the text() function. The title of the chart is set using the
set_title() method of the matplotlib axis object.
Both methods update the matplotlib figure in the specified widget and
redraw the canvas to display the chart.
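The same value_counts() plus barh pattern can be tried headlessly, using matplotlib's Agg backend and a toy frame instead of the GUI widget canvas (this is a sketch of the idea, not the book's bar_cat() itself):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no GUI needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"binaryClass": ["N", "N", "N", "P"]})
var = "binaryClass"

# Count the occurrences of each category, as bar_cat() does.
counts = df[var].value_counts()
ax = counts.plot(kind="barh")

# Annotate each bar with its count, like the text() calls in bar_cat().
for i, v in enumerate(counts.values):
    ax.text(v, i, str(v), va="center")

ax.set_title("Count of " + var + " cases")
plt.tight_layout()
```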
# ax.text(x, y, text)
label_x = x + width / 2
label_y = y + height / 2
def choose_plot(self):
    strCB = self.cbData.currentText()

    if strCB == "binaryClass":
        #Plots distribution of binaryClass variable in pie chart
        self.widgetPlot1.canvas.figure.clf()
        self.widgetPlot1.canvas.axis1 = \
            self.widgetPlot1.canvas.figure.add_subplot(121,\
            facecolor = '#fbe7dd')
        label_class = \
            list(self.df_dummy["binaryClass"].value_counts().index)
        self.pie_cat(self.df_dummy["binaryClass"],\
            'binaryClass', label_class, self.widgetPlot1)
        self.widgetPlot1.canvas.figure.tight_layout()
        self.widgetPlot1.canvas.draw()

        self.widgetPlot1.canvas.axis1 = \
            self.widgetPlot1.canvas.figure.add_subplot(122,\
            facecolor = '#fbe7dd')
        self.bar_cat(self.df_dummy,"binaryClass", self.widgetPlot1)
        self.widgetPlot1.canvas.figure.tight_layout()
        self.widgetPlot1.canvas.draw()

        self.widgetPlot2.canvas.figure.clf()
        self.widgetPlot2.canvas.axis1 = \
            self.widgetPlot2.canvas.figure.add_subplot(121,\
            facecolor = '#fbe7dd')
        self.widgetPlot2.canvas.axis2 = \
            self.widgetPlot2.canvas.figure.add_subplot(122,\
            facecolor = '#fbe7dd')
        self.dist_percent_plot(self.df_dummy,"sex",\
            self.widgetPlot2.canvas.axis1,\
            self.widgetPlot2.canvas.axis2)
        self.widgetPlot2.canvas.figure.tight_layout()
        self.widgetPlot2.canvas.draw()

        self.widgetPlot3.canvas.figure.clf()
        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(331,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["on thyroxine"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "on thyroxine versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(332,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["TSH measured"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "TSH measured versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(333,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["TT4 measured"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "TT4 measured versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(334,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["T4U measured"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "T4U measured versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(335,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["T3 measured"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "T3 measured versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(336,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["query on thyroxine"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "query on thyroxine versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(337,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["sick"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "sick versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(338,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["tumor"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "tumor versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.axis1 = \
            self.widgetPlot3.canvas.figure.add_subplot(339,\
            facecolor = '#fbe7dd')
        g=sns.countplot(self.df_dummy["psych"],\
            hue = self.df_dummy["binaryClass"], \
            palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
        self.put_label_stacked_bar(g,17)
        self.widgetPlot3.canvas.axis1.set_title(\
            "psych versus binaryClass",\
            fontweight ="bold",fontsize=14)

        self.widgetPlot3.canvas.figure.tight_layout()
        self.widgetPlot3.canvas.draw()
def __init__(self):
    QMainWindow.__init__(self)
    loadUi("gui_thyroid.ui",self)
    self.setWindowTitle(\
        "GUI Demo of Classifying and Predicting Thyroid Disease")
    self.addToolBar(NavigationToolbar(\
        self.widgetPlot1.canvas, self))
    self.pbLoad.clicked.connect(self.import_dataset)
    self.initial_state(False)
    self.pbTrainML.clicked.connect(self.train_model_ML)
    self.cbData.currentIndexChanged.connect(self.choose_plot)
Step 5: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the binaryClass item from the cbData widget. You will see the result as shown in Figure 150.
Figure 150 The distribution of binaryClass variable versus other
categorical features
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
The purpose of this code block is to generate and update plots based
on the selected item "TSH measured" in the combobox. It specifically
focuses on visualizing the distribution and relationships of variables
related to "TSH measured".
Step 2: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the TSH measured item from the cbData widget. You will see the result as shown in Figure 151.
Distribution of T3 Measured
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
widgetPlot1:
Subplot 1 (Pie Chart): Represents the distribution of the "T3 measured" variable in a pie chart. It shows the percentage of cases labeled as "T3 measured" and "Not T3 measured".
Subplot 2 (Bar Chart): Represents the count of "T3 measured" cases in a bar chart. It displays the number of cases labeled as "T3 measured" and "Not T3 measured".
widgetPlot2:
Subplot 1: Represents the stacked bar plot of the "T3 measured" variable. It shows the percentage distribution of "T3 measured" cases across different categories.
Subplot 2: Represents the percentage distribution of "T3 measured" across different categories.
widgetPlot3:
Subplots 1-9: Each subplot represents the count of a specific variable (e.g., "on thyroxine", "goitre", "TT4 measured") grouped by the "T3 measured" variable. The plots display the number of cases for each variable category, differentiated by the "T3 measured" label.
These plots aim to provide visual insights into the distribution and relationships of variables related to "T3 measured" in the dataset. They help in understanding the patterns and potential correlations between the "T3 measured" variable and other variables.
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
widgetPlot1:
Subplot 1 (Pie Chart): This plot shows the distribution of the "TT4 measured" variable. It helps visualize the proportion of cases labeled as "TT4 measured" and "Not TT4 measured" in the dataset.
Subplot 2 (Bar Chart): This plot provides a count of cases for each category of the "TT4 measured" variable. It gives an overview of the number of cases in each category.
widgetPlot2:
Subplot 1: This plot displays a stacked bar plot of the "TT4 measured" variable against another variable. It shows the percentage distribution of "TT4 measured" cases within each category of the selected variable, helping identify the relationship between "TT4 measured" and that variable.
Subplot 2: This plot presents the count of cases for each category of the selected variable. It provides a visual representation of the distribution of cases within each category.
widgetPlot3 (Subplots 1-9):
Each subplot represents the count of a specific variable (e.g., "on thyroxine", "goitre", "T3 measured") grouped by the "TT4 measured" variable. These plots help analyze the distribution of cases for each variable category based on the "TT4 measured" label. By comparing the counts of "TT4 measured" and "Not TT4 measured" cases within each category, these plots reveal potential relationships and patterns between the variables.
Overall, these plots aim to provide a comprehensive understanding of the "TT4 measured" variable and its associations with other variables in the dataset. They assist in identifying any significant trends, dependencies, or patterns that may exist between "TT4 measured" and other variables of interest.
Step 2: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the TT4 measured item from the cbData widget. You will see the result as shown in Figure 153.
def feat_versus_other(self,
        feat,another,legend,ax0,label='',title=''):
    background_color = "#fbe7dd"
    sns.set_palette(['#ff355d','#66b3ff'])
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)

    ax0.set_facecolor(background_color)
    ax0_sns = sns.histplot(data=self.df, \
        x=self.df[feat],ax=ax0,zorder=2,kde=False, \
        hue=another,multiple="stack", shrink=.8,\
        linewidth=0.3,alpha=1)

    self.put_label_stacked_bar(ax0_sns,12)

    ax0_sns.set_xlabel('',fontsize=10, weight='bold')
    ax0_sns.set_ylabel('',fontsize=10, weight='bold')

    ax0_sns.grid(which='major', axis='x', zorder=0, \
        color='#EEEEEE', linewidth=0.4)
    ax0_sns.grid(which='major', axis='y', zorder=0, \
        color='#EEEEEE', linewidth=0.4)

    ax0_sns.tick_params(labelsize=10, width=0.5, length=1.5)
    ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', \
        edgecolor=background_color, fontsize=14, \
        bbox_to_anchor=(1, 0.989), loc='upper right')
    ax0.set_facecolor(background_color)
    ax0_sns.set_xlabel(label,fontweight ="bold",fontsize=14)
    ax0_sns.set_title(title,fontweight ="bold",fontsize=16)
def prob_feat_versus_other(self,
        feat,another,legend,ax0,label='',title=''):
    background_color = "#fbe7dd"
    sns.set_palette(['#ff355d','#66b3ff'])
    for s in ["right", "top"]:
        ax0.spines[s].set_visible(False)

    ax0.set_facecolor(background_color)
    ax0_sns = sns.kdeplot(x=self.df[feat],ax=ax0,\
        hue=another,linewidth=0.3,fill=True,cbar='g', \
        zorder=2,alpha=1,multiple='stack')

    ax0_sns.set_xlabel('',fontsize=10, weight='bold')
    ax0_sns.set_ylabel('',fontsize=10, weight='bold')

    ax0_sns.grid(which='major', axis='x', zorder=0, \
        color='#EEEEEE', linewidth=0.4)
    ax0_sns.grid(which='major', axis='y', zorder=0, \
        color='#EEEEEE', linewidth=0.4)

    ax0_sns.tick_params(labelsize=10, width=0.5, length=1.5)
    ax0_sns.legend(legend, ncol=2, facecolor='#D8D8D8', \
        edgecolor=background_color, fontsize=14, \
        bbox_to_anchor=(1, 0.989), loc='upper right')
    ax0.set_facecolor(background_color)
    ax0_sns.set_xlabel(label,fontweight ="bold",fontsize=14)
    ax0_sns.set_title(title,fontweight ="bold",fontsize=16)
def hist_num_versus_nine_cat(self,feat):
    self.label_bin = \
        list(self.df_dummy["binaryClass"].value_counts().index)
    self.label_sex = \
        list(self.df_dummy["sex"].value_counts().index)
    self.label_thyroxine = \
        list(self.df_dummy["on thyroxine"].value_counts().index)
    self.label_pregnant = \
        list(self.df_dummy["pregnant"].value_counts().index)
    self.label_lithium = \
        list(self.df_dummy["lithium"].value_counts().index)
    self.label_goitre = \
        list(self.df_dummy["goitre"].value_counts().index)
    self.label_tumor = \
        list(self.df_dummy["tumor"].value_counts().index)
    self.label_tsh = \
        list(self.df_dummy["TSH measured"].value_counts().index)
    self.label_tt4 = \
        list(self.df_dummy["TT4 measured"].value_counts().index)

    self.widgetPlot3.canvas.figure.clf()
    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(331,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["binaryClass"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["binaryClass"],self.label_bin,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='binaryClass versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(332,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["sex"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["sex"],self.label_sex,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='sex versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(333,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["on thyroxine"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["on thyroxine"],self.label_thyroxine,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='on thyroxine versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(334,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["pregnant"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["pregnant"],self.label_pregnant,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='pregnant versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(335,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["lithium"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["lithium"],self.label_lithium,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='lithium versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(336,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["goitre"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["goitre"],self.label_goitre,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='goitre versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(337,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["tumor"].value_counts())
    self.feat_versus_other(feat,\
        self.df_dummy["tumor"],self.label_tumor,\
        self.widgetPlot3.canvas.axis1,label=feat,\
        title='tumor versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(338,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["TSH measured"].value_counts())
    self.feat_versus_other(feat,self.df_dummy["TSH measured"],\
        self.label_tsh,self.widgetPlot3.canvas.axis1,\
        label=feat,title='TSH measured versus ' + feat)

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(339,\
        facecolor = '#fbe7dd')
    print(self.df_dummy["TT4 measured"].value_counts())
    self.feat_versus_other(feat,self.df_dummy["TT4 measured"],\
        self.label_tt4,self.widgetPlot3.canvas.axis1,\
        label=feat,title='TT4 measured versus ' + feat)
    self.widgetPlot3.canvas.figure.tight_layout()
    self.widgetPlot3.canvas.draw()
The hist_num_versus_nine_cat() function generates a series of subplots
in the widgetPlot3 canvas to compare a numerical feature (feat)
against nine categorical features. Here's an explanation of the function
code:
1. Creating label lists: Create lists of labels for
each categorical feature based on their
value counts in the df_dummy DataFrame.
2. Clearing the figure: Clear the current figure
in the widgetPlot3 canvas.
3. Adding subplots: Add a subplot for each
categorical feature using the add_subplot()
method. Each new subplot is assigned to
self.widgetPlot3.canvas.axis1 in turn,
replacing the previous reference.
4. Printing value counts: Print the value
counts of each categorical feature (for
debugging or information purposes).
5. Calling feat_versus_other(): Call the
feat_versus_other function for each
categorical feature to plot the numerical
feature (feat) against the categorical
feature.
6. Setting labels and titles: Set the x-label and
title for each subplot using the categorical
feature and feat.
7. Tight layout and drawing: Adjust the layout
and draw the figure on the widgetPlot3
canvas.
The purpose of this code is to generate multiple subplots to compare
the distribution of a numerical feature with different categories of nine
categorical features. It provides insights into how the numerical
feature varies across different categories of the categorical features.
widget.canvas.figure.tight_layout()
widget.canvas.draw()
if strCB == 'age':
    self.prob_num_versus_two_cat("age","binaryClass", \
        "T4U measured" ,self.widgetPlot1)
    self.hist_num_versus_nine_cat("age")
    self.prob_num_versus_two_cat("age","FTI measured", \
        "T3 measured" ,self.widgetPlot2)

if strCB == 'TSH':
    self.prob_num_versus_two_cat("TSH","binaryClass", \
        "T4U measured" ,self.widgetPlot1)
    self.hist_num_versus_nine_cat("TSH")
    self.prob_num_versus_two_cat("TSH","FTI measured", \
        "T3 measured" ,self.widgetPlot2)

if strCB == 'T3':
    self.prob_num_versus_two_cat("T3","binaryClass", \
        "T4U measured" ,self.widgetPlot1)
    self.hist_num_versus_nine_cat("T3")
    self.prob_num_versus_two_cat("T3","FTI measured", \
        "T3 measured" ,self.widgetPlot2)

if strCB == 'TT4':
    self.prob_num_versus_two_cat("TT4","binaryClass", \
        "T4U measured" ,self.widgetPlot1)
    self.hist_num_versus_nine_cat("TT4")
    self.prob_num_versus_two_cat("TT4","FTI measured", \
        "T3 measured" ,self.widgetPlot2)

if strCB == 'T4U':
    self.prob_num_versus_two_cat("T4U","binaryClass", \
        "T4U measured" ,self.widgetPlot1)
    self.hist_num_versus_nine_cat("T4U")
    self.prob_num_versus_two_cat("T4U","FTI measured", \
        "T3 measured" ,self.widgetPlot2)

if strCB == 'FTI':
    self.prob_num_versus_two_cat("FTI","binaryClass", \
        "T4U measured" ,self.widgetPlot1)
    self.hist_num_versus_nine_cat("FTI")
    self.prob_num_versus_two_cat("FTI","FTI measured", \
        "T3 measured" ,self.widgetPlot2)
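Since the six branches above differ only in the feature name, the same dispatch can be expressed as one membership test. The stand-in functions below are hypothetical: they only record their arguments, so the control flow can be exercised without Qt or matplotlib:

```python
# Hypothetical stand-ins for the book's plotting methods; they record
# each call so the dispatch logic can be checked in isolation.
calls = []

def prob_num_versus_two_cat(feat, cat1, cat2, widget):
    calls.append(("prob", feat, cat1, cat2, widget))

def hist_num_versus_nine_cat(feat):
    calls.append(("hist", feat))

NUMERIC_FEATS = ("age", "TSH", "T3", "TT4", "T4U", "FTI")

def choose_numeric_plot(strCB):
    # One branch covers all six numeric features of the dataset.
    if strCB in NUMERIC_FEATS:
        prob_num_versus_two_cat(strCB, "binaryClass",
                                "T4U measured", "widgetPlot1")
        hist_num_versus_nine_cat(strCB)
        prob_num_versus_two_cat(strCB, "FTI measured",
                                "T3 measured", "widgetPlot2")

choose_numeric_plot("TSH")
```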
Step 7: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the TSH item from the cbData widget. You will see the result as shown in Figure 155.
Step 8: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the T3 item from the cbData widget. You will see the result as shown in Figure 156.
Figure 156 The distribution of T3 variable versus other features
Step 9: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the TT4 item from the cbData widget. You will see the result as shown in Figure 157.
Step 10: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the T4U item from the cbData widget. You will see the result as shown in Figure 158.
Step 11: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the FTI item from the cbData widget. You will see the result as shown in Figure 159.
Step 3: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the Correlation Matrix item from the cbData widget. You will see the result as shown in Figure 160.
Step 6: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Then, choose the Features Importance item from the cbData widget. You will see the result as shown in Figure 161.
plot_real_pred_val():
Calculates the accuracy score by
comparing the predicted values
(Y_pred) with the actual values
(Y_test).
Clears the current figure and creates
a new subplot with a steel blue
background color.
Plots the predicted values as yellow
points and the actual values as red
points.
Sets the title of the plot to indicate
the type of values being compared.
Sets the x-axis label to display the
accuracy of the predictions.
Adds a legend to differentiate
between the predicted and actual
values.
Adds gridlines to the plot for better
readability.
plot_cm():
Computes the confusion matrix by
comparing the predicted values
(Y_pred) with the actual values
(Y_test).
Clears the current figure and creates
a new subplot.
Defines the class labels as "Negative"
and "Positive".
Creates a DataFrame (df_cm) to
store the confusion matrix values
with appropriate row and column
labels.
Generates a heatmap plot using
sns.heatmap, where each cell
represents the count of true
positives, true negatives, false
positives, and false negatives.
Sets the title of the plot to indicate
the type of values being analyzed.
Sets the x-axis and y-axis labels to
"Predicted" and "True", respectively.
Adds annotations (count values) to
each cell of the heatmap.
Applies a color map ('plasma') and
adjusts the linewidth of the heatmap
for better visualization.
These functions provide visual representations of the
predicted values compared to the actual values and the
confusion matrix, respectively, helping to evaluate the
performance and accuracy of the model.
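The inputs that plot_real_pred_val() and plot_cm() visualize can be reproduced with a small sketch (toy labels here, not the thyroid predictions): the accuracy score, the confusion matrix, and the labeled DataFrame passed to sns.heatmap:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

Y_test = [0, 0, 1, 1, 1]
Y_pred = [0, 1, 1, 1, 0]

# Accuracy, as computed in plot_real_pred_val().
acc = accuracy_score(Y_test, Y_pred)

# Confusion matrix with the class labels used in plot_cm();
# rows are true labels, columns are predicted labels.
cm = confusion_matrix(Y_test, Y_pred)
df_cm = pd.DataFrame(cm, index=["Negative", "Positive"],
                     columns=["Negative", "Positive"])
print(df_cm)
```

In plot_cm(), df_cm is then handed to sns.heatmap with annot=True so each cell shows its count.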
def plot_learning_curve(self,estimator,
        title, X, y, widget, ylim=None, cv=None,
        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    widget.canvas.axis1.set_title(title)
    if ylim is not None:
        widget.canvas.axis1.set_ylim(*ylim)
    widget.canvas.axis1.set_xlabel("Training examples")
    widget.canvas.axis1.set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
        train_sizes=train_sizes, return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot learning curve
    widget.canvas.axis1.grid()
    widget.canvas.axis1.fill_between(train_sizes, \
        train_scores_mean - train_scores_std,\
        train_scores_mean + train_scores_std, alpha=0.1,color="r")
    widget.canvas.axis1.fill_between(train_sizes, \
        test_scores_mean - test_scores_std,\
        test_scores_mean + test_scores_std, alpha=0.1, color="g")
    widget.canvas.axis1.plot(train_sizes, train_scores_mean, \
        'o-', color="r", label="Training score")
    widget.canvas.axis1.plot(train_sizes, test_scores_mean, \
        'o-', color="g", label="Cross-validation score")
    widget.canvas.axis1.legend(loc="best")
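A minimal, GUI-free call to sklearn's learning_curve(), the workhorse behind all three plotting methods, looks like this (synthetic data, with logistic regression as a stand-in estimator):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# return_times=True adds fit/score times, which the scalability and
# performance curves plot instead of (or against) the scores.
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    LogisticRegression(max_iter=500), X, y, cv=3,
    train_sizes=np.linspace(.1, 1.0, 5), return_times=True)

# One row per training-set size, one column per CV fold.
train_scores_mean = np.mean(train_scores, axis=1)
fit_times_mean = np.mean(fit_times, axis=1)
```

The means and standard deviations along axis=1 (across folds) are exactly what the fill_between() bands in the methods above shade.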
def plot_scalability_curve(self,estimator,
        title, X, y, widget, ylim=None, cv=None,
        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    widget.canvas.axis1.set_title(title, \
        fontweight ="bold",fontsize=15)
    if ylim is not None:
        widget.canvas.axis1.set_ylim(*ylim)
    widget.canvas.axis1.set_xlabel("Training examples")
    widget.canvas.axis1.set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
        train_sizes=train_sizes, return_times=True)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot n_samples vs fit_times
    widget.canvas.axis1.grid()
    widget.canvas.axis1.plot(train_sizes, fit_times_mean, 'o-')
    widget.canvas.axis1.fill_between(train_sizes, \
        fit_times_mean - fit_times_std,\
        fit_times_mean + fit_times_std, alpha=0.1)
    widget.canvas.axis1.set_xlabel("Training examples")
    widget.canvas.axis1.set_ylabel("fit_times")

def plot_performance_curve(self,estimator, title, X, y, \
        widget, ylim=None, cv=None, n_jobs=None, \
        train_sizes=np.linspace(.1, 1.0, 5)):
    widget.canvas.axis1.set_title(title, \
        fontweight ="bold",fontsize=15)
    if ylim is not None:
        widget.canvas.axis1.set_ylim(*ylim)
    widget.canvas.axis1.set_xlabel("Training examples")
    widget.canvas.axis1.set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
        train_sizes=train_sizes, return_times=True)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    widget.canvas.axis1.fill_between(fit_times_mean, \
        test_scores_mean - test_scores_std,\
        test_scores_mean + test_scores_std, alpha=0.1)
    widget.canvas.axis1.set_xlabel("fit_times")
    widget.canvas.axis1.set_ylabel("Score")
plot_scalability_curve():
1. Set the title and axis labels: the method sets the plot title to the provided title argument; the x-axis label is set to "Training examples" and the y-axis label to "Score".
2. Set the y-axis limit (optional): if the ylim parameter is provided, the y-axis limit of the plot is set accordingly.
3. Calculate scalability curve data: the learning_curve() function is called with the estimator (estimator), input features (X), target variable (y), cross-validation strategy (cv), number of parallel jobs (n_jobs), and training set sizes (train_sizes). Because return_times=True is passed, it returns the training set sizes, training scores, test scores, fit times, and score times.
4. Calculate mean and standard deviation of fit times: the mean and standard deviation of the fit times are computed across the cross-validation folds (axis=1), giving one value per training set size. fit_times_mean stores the mean fit times, while fit_times_std stores the standard deviations.
5. Plot the scalability curve: the plot's grid is enabled; the mean fit times are plotted against the training set sizes using circle markers ('o-'); the band between the mean fit times plus/minus one standard deviation is filled with color; the x-axis is labeled "Training examples" and the y-axis "fit_times".
plot_performance_curve():
1. Set the title and axis labels: the method sets the plot title to the provided title argument; the x-axis label is set to "Training examples" and the y-axis label to "Score".
2. Set the y-axis limit (optional): if the ylim parameter is provided, the y-axis limit of the plot is set accordingly.
3. Calculate performance curve data: the learning_curve() function is called with the estimator (estimator), input features (X), target variable (y), cross-validation strategy (cv), number of parallel jobs (n_jobs), and training set sizes (train_sizes). Because return_times=True is passed, it returns the training set sizes, training scores, test scores, fit times, and score times.
4. Calculate mean and standard deviation of test scores: the mean and standard deviation of the test scores are computed across the cross-validation folds (axis=1), giving one value per training set size. test_scores_mean stores the mean test scores, while test_scores_std stores the standard deviations.
5. Calculate mean of fit times: the mean of the fit times is likewise computed across the folds.
6. Plot the performance curve: the plot's grid is enabled; the mean test scores are plotted against the mean fit times using circle markers ('o-'); the band between the mean test scores plus/minus one standard deviation is filled with color; the x-axis is labeled "fit_times" and the y-axis "Score".
These methods are used to visualize the scalability and
performance of an estimator by plotting the fit times and
scores (such as accuracy) against the number of training
examples. They can provide insights into the efficiency and
effectiveness of a model as the dataset size increases.
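The pieces these methods rely on can be reproduced outside the GUI. Below is a minimal sketch (using a synthetic dataset purely for illustration) of how learning_curve() with return_times=True yields the fit-time data that plot_scalability_curve() averages and plots:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the thyroid features/labels
X, y = make_classification(n_samples=400, random_state=0)

# return_times=True adds fit_times (and score_times) to the output
sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    LogisticRegression(max_iter=500), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5), return_times=True)

# One mean/std per training-set size, averaged over the CV folds
fit_times_mean = fit_times.mean(axis=1)
fit_times_std = fit_times.std(axis=1)
print(sizes)           # the five training-set sizes actually used
print(fit_times_mean)  # mean fit time for each size
```

Plotting sizes against fit_times_mean (with the std band) reproduces the scalability curve; plotting fit_times_mean against the mean test scores reproduces the performance curve.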
Training Model and Predicting Thyroid
        return y_pred

    self.widgetPlot3.canvas.axis1 = \
        self.widgetPlot3.canvas.figure.add_subplot(224,
            facecolor='#fbe7dd')
    self.plot_performance_curve(model,
        'Performance of ' + name + " -- " + scaling,
        X_train, y_train, self.widgetPlot3)
    self.widgetPlot3.canvas.figure.tight_layout()
    self.widgetPlot3.canvas.draw()
The run_model() function is used to run a machine learning
model and evaluate its performance. It takes several input
parameters, including the name of the model, scaling method,
the model object, the training and test datasets (X_train,
X_test, y_train, y_test), and optional parameters for training
and predicting (train and proba).
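The core of such a helper can be sketched in a few lines. The function name and signature below mirror the description but are illustrative, not the book's exact implementation; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def run_model(name, scaling, model, X_train, X_test,
              y_train, y_test, train=True):
    # Optionally fit, then predict and report test metrics
    if train:
        model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} ({scaling}): "
          f"acc={accuracy_score(y_test, y_pred):.3f} "
          f"f1={f1_score(y_test, y_pred):.3f}")
    return y_pred

X, y = make_classification(n_samples=300, random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2021)
preds = run_model('Logistic Regression', 'Raw',
                  LogisticRegression(max_iter=500),
                  X_tr, X_te, y_tr, y_te)
```

In the GUI version the same flow additionally draws the confusion matrix, ROC, learning, scalability, and performance plots on the widgets.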
def build_train_lr(self):
    if path.isfile('logregRaw.pkl'):
        # Loads model
        self.logregRaw = joblib.load('logregRaw.pkl')
        self.logregNorm = joblib.load('logregNorm.pkl')
        self.logregStand = joblib.load('logregStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Logistic Regression', 'Raw',
                self.logregRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Logistic Regression',
                'Normalization', self.logregNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Logistic Regression',
                'Standardization', self.logregStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Logistic Regression
        self.logregRaw = LogisticRegression(solver='lbfgs',
            max_iter=500, random_state=2021)
        self.logregNorm = LogisticRegression(solver='lbfgs',
            max_iter=500, random_state=2021)
        self.logregStand = LogisticRegression(solver='lbfgs',
            max_iter=500, random_state=2021)

        if self.rbRaw.isChecked():
            self.run_model('Logistic Regression', 'Raw',
                self.logregRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Logistic Regression',
                'Normalization', self.logregNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Logistic Regression',
                'Standardization', self.logregStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.logregRaw, 'logregRaw.pkl')
        joblib.dump(self.logregNorm, 'logregNorm.pkl')
        joblib.dump(self.logregStand, 'logregStand.pkl')
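Every build_train_*() method follows the same train-once, load-later pattern: if the pickle exists it is loaded with joblib, otherwise the model is fitted and dumped. A standalone sketch of that pattern (file name and synthetic data are illustrative only):

```python
import os
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=2021)
pkl_path = 'demo_logreg.pkl'

if os.path.isfile(pkl_path):
    # Second run onwards: reuse the trained model
    model = joblib.load(pkl_path)
else:
    # First run: fit, then cache the fitted estimator to disk
    model = LogisticRegression(solver='lbfgs', max_iter=500,
                               random_state=2021).fit(X, y)
    joblib.dump(model, pkl_path)

print(model.score(X, y))
```

Because the loaded object is a fitted scikit-learn estimator, run_model() can be called with train left at its default without refitting.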
def choose_ML_model(self):
    strCB = self.cbClassifier.currentText()

    if strCB == 'Logistic Regression':
        self.build_train_lr()
def __init__(self):
    QMainWindow.__init__(self)
    loadUi("gui_thyroid.ui", self)
    self.setWindowTitle(
        "GUI Demo of Classifying and Predicting Thyroid Disease")
    self.addToolBar(NavigationToolbar(
        self.widgetPlot1.canvas, self))
    self.pbLoad.clicked.connect(self.import_dataset)
    self.initial_state(False)
    self.pbTrainML.clicked.connect(self.train_model_ML)
    self.cbData.currentIndexChanged.connect(self.choose_plot)
    self.cbClassifier.currentIndexChanged.connect(
        self.choose_ML_model)
This connection ensures that whenever the selected item in the combo box changes, the choose_ML_model() method is called. In other words, when the user selects a different classifier, the corresponding machine learning model is built and trained dynamically based on that choice.
Step 4: Run gui_thyroid.py and click the LOAD DATA and TRAIN ML MODEL buttons. Click the Raw radio button, then choose the Logistic Regression item from the cbClassifier widget. You will see the result shown in Figure 162.
Figure 162 The result using LR model with raw feature scaling
def build_train_svm(self):
    if path.isfile('SVMRaw.pkl'):
        # Loads model
        self.SVMRaw = joblib.load('SVMRaw.pkl')
        self.SVMNorm = joblib.load('SVMNorm.pkl')
        self.SVMStand = joblib.load('SVMStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Support Vector Machine', 'Raw',
                self.SVMRaw, self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Support Vector Machine',
                'Normalization', self.SVMNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Support Vector Machine',
                'Standardization', self.SVMStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Support Vector Machine
        self.SVMRaw = SVC(random_state=2021, probability=True)
        self.SVMNorm = SVC(random_state=2021, probability=True)
        self.SVMStand = SVC(random_state=2021, probability=True)

        if self.rbRaw.isChecked():
            self.run_model('Support Vector Machine', 'Raw',
                self.SVMRaw, self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Support Vector Machine',
                'Normalization', self.SVMNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Support Vector Machine',
                'Standardization', self.SVMStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.SVMRaw, 'SVMRaw.pkl')
        joblib.dump(self.SVMNorm, 'SVMNorm.pkl')
        joblib.dump(self.SVMStand, 'SVMStand.pkl')
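Note that the SVC models are constructed with probability=True. This enables predict_proba(), which the plotting code needs for ROC curves; without it, SVC only exposes decision_function(). A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=2021)

# probability=True fits an extra calibration step so that
# predict_proba() becomes available on the trained SVC
clf = SVC(random_state=2021, probability=True).fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba.shape)  # one row per sample, one column per class
```

The trade-off is longer training time, since the probability estimates are obtained via internal cross-validation.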
def build_train_knn(self):
    if path.isfile('KNNRaw.pkl'):
        # Loads model
        self.KNNRaw = joblib.load('KNNRaw.pkl')
        self.KNNNorm = joblib.load('KNNNorm.pkl')
        self.KNNStand = joblib.load('KNNStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('K-Nearest Neighbor', 'Raw',
                self.KNNRaw, self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('K-Nearest Neighbor',
                'Normalization', self.KNNNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('K-Nearest Neighbor',
                'Standardization', self.KNNStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains K-Nearest Neighbor
        self.KNNRaw = KNeighborsClassifier(n_neighbors=20)
        self.KNNNorm = KNeighborsClassifier(n_neighbors=20)
        self.KNNStand = KNeighborsClassifier(n_neighbors=20)

        if self.rbRaw.isChecked():
            self.run_model('K-Nearest Neighbor', 'Raw',
                self.KNNRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('K-Nearest Neighbor',
                'Normalization', self.KNNNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('K-Nearest Neighbor',
                'Standardization', self.KNNStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.KNNRaw, 'KNNRaw.pkl')
        joblib.dump(self.KNNNorm, 'KNNNorm.pkl')
        joblib.dump(self.KNNStand, 'KNNStand.pkl')
def build_train_dt(self):
    if path.isfile('DTRaw.pkl'):
        # Loads model
        self.DTRaw = joblib.load('DTRaw.pkl')
        self.DTNorm = joblib.load('DTNorm.pkl')
        self.DTStand = joblib.load('DTStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Decision Tree',
                'Raw', self.DTRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Decision Tree',
                'Normalization', self.DTNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Decision Tree',
                'Standardization', self.DTStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Decision Tree
        dt = DecisionTreeClassifier()
        parameters = {'max_depth': np.arange(1, 20, 1),
            'random_state': [2021]}
        self.DTRaw = GridSearchCV(dt, parameters)
        self.DTNorm = GridSearchCV(dt, parameters)
        self.DTStand = GridSearchCV(dt, parameters)

        if self.rbRaw.isChecked():
            self.run_model('Decision Tree',
                'Raw', self.DTRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Decision Tree',
                'Normalization', self.DTNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Decision Tree',
                'Standardization', self.DTStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.DTRaw, 'DTRaw.pkl')
        joblib.dump(self.DTNorm, 'DTNorm.pkl')
        joblib.dump(self.DTStand, 'DTStand.pkl')
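Unlike the other classifiers, the Decision Tree models are GridSearchCV objects: fitting them searches max_depth from 1 to 19 by cross-validation and then refits the best tree. Since GridSearchCV implements fit() and predict(), run_model() can treat it like any estimator. A standalone sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=2021)

# Same grid as the book: depths 1..19, fixed random_state
parameters = {'max_depth': np.arange(1, 20, 1),
              'random_state': [2021]}
search = GridSearchCV(DecisionTreeClassifier(), parameters, cv=3)
search.fit(X, y)            # runs the CV search, then refits best tree
print(search.best_params_)  # the depth with the best CV accuracy
```

After fitting, search.best_estimator_ holds the refitted tree and search.predict() delegates to it.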
def build_train_rf(self):
    if path.isfile('RFRaw.pkl'):
        # Loads model
        self.RFRaw = joblib.load('RFRaw.pkl')
        self.RFNorm = joblib.load('RFNorm.pkl')
        self.RFStand = joblib.load('RFStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Random Forest', 'Raw',
                self.RFRaw, self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Random Forest',
                'Normalization', self.RFNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Random Forest',
                'Standardization', self.RFStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Random Forest
        self.RFRaw = RandomForestClassifier(n_estimators=200,
            max_depth=20, random_state=2021)
        self.RFNorm = RandomForestClassifier(
            n_estimators=100, max_depth=11, random_state=2021)
        self.RFStand = RandomForestClassifier(
            n_estimators=100, max_depth=11, random_state=2021)

        if self.rbRaw.isChecked():
            self.run_model('Random Forest', 'Raw',
                self.RFRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Random Forest',
                'Normalization', self.RFNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Random Forest',
                'Standardization', self.RFStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.RFRaw, 'RFRaw.pkl')
        joblib.dump(self.RFNorm, 'RFNorm.pkl')
        joblib.dump(self.RFStand, 'RFStand.pkl')
def build_train_gb(self):
    if path.isfile('GBRaw.pkl'):
        # Loads model
        self.GBRaw = joblib.load('GBRaw.pkl')
        self.GBNorm = joblib.load('GBNorm.pkl')
        self.GBStand = joblib.load('GBStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Gradient Boosting', 'Raw',
                self.GBRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Gradient Boosting',
                'Normalization', self.GBNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Gradient Boosting',
                'Standardization', self.GBStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Gradient Boosting
        self.GBRaw = GradientBoostingClassifier(
            n_estimators=100, max_depth=20, subsample=0.8,
            max_features=0.2, random_state=2021)
        self.GBNorm = GradientBoostingClassifier(
            n_estimators=100, max_depth=20, subsample=0.8,
            max_features=0.2, random_state=2021)
        self.GBStand = GradientBoostingClassifier(
            n_estimators=100, max_depth=20, subsample=0.8,
            max_features=0.2, random_state=2021)

        if self.rbRaw.isChecked():
            self.run_model('Gradient Boosting', 'Raw',
                self.GBRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Gradient Boosting',
                'Normalization', self.GBNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Gradient Boosting',
                'Standardization', self.GBStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.GBRaw, 'GBRaw.pkl')
        joblib.dump(self.GBNorm, 'GBNorm.pkl')
        joblib.dump(self.GBStand, 'GBStand.pkl')
def build_train_nb(self):
    if path.isfile('NBRaw.pkl'):
        # Loads model
        self.NBRaw = joblib.load('NBRaw.pkl')
        self.NBNorm = joblib.load('NBNorm.pkl')
        self.NBStand = joblib.load('NBStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Naive Bayes', 'Raw',
                self.NBRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Naive Bayes', 'Normalization',
                self.NBNorm, self.X_train_norm,
                self.X_test_norm, self.y_train_norm,
                self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Naive Bayes',
                'Standardization', self.NBStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Naive Bayes
        self.NBRaw = GaussianNB()
        self.NBNorm = GaussianNB()
        self.NBStand = GaussianNB()

        if self.rbRaw.isChecked():
            self.run_model('Naive Bayes', 'Raw',
                self.NBRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Naive Bayes',
                'Normalization', self.NBNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Naive Bayes',
                'Standardization', self.NBStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.NBRaw, 'NBRaw.pkl')
        joblib.dump(self.NBNorm, 'NBNorm.pkl')
        joblib.dump(self.NBStand, 'NBStand.pkl')
def build_train_ada(self):
    if path.isfile('ADARaw.pkl'):
        # Loads model
        self.ADARaw = joblib.load('ADARaw.pkl')
        self.ADANorm = joblib.load('ADANorm.pkl')
        self.ADAStand = joblib.load('ADAStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('Adaboost', 'Raw',
                self.ADARaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Adaboost', 'Normalization',
                self.ADANorm, self.X_train_norm,
                self.X_test_norm, self.y_train_norm,
                self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Adaboost',
                'Standardization', self.ADAStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains Adaboost
        self.ADARaw = AdaBoostClassifier(
            n_estimators=100, learning_rate=0.01)
        self.ADANorm = AdaBoostClassifier(
            n_estimators=100, learning_rate=0.01)
        self.ADAStand = AdaBoostClassifier(
            n_estimators=100, learning_rate=0.01)

        if self.rbRaw.isChecked():
            self.run_model('Adaboost', 'Raw', self.ADARaw,
                self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('Adaboost', 'Normalization',
                self.ADANorm, self.X_train_norm,
                self.X_test_norm, self.y_train_norm,
                self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('Adaboost', 'Standardization',
                self.ADAStand, self.X_train_stand,
                self.X_test_stand, self.y_train_stand,
                self.y_test_stand)

        # Saves model
        joblib.dump(self.ADARaw, 'ADARaw.pkl')
        joblib.dump(self.ADANorm, 'ADANorm.pkl')
        joblib.dump(self.ADAStand, 'ADAStand.pkl')
Figure 183 The result using Adaboost model
with raw feature scaling
if strCB == 'Adaboost':
    self.build_train_ada()
def build_train_xgb(self):
    if path.isfile('XGBRaw.pkl'):
        # Loads model
        self.XGBRaw = joblib.load('XGBRaw.pkl')
        self.XGBNorm = joblib.load('XGBNorm.pkl')
        self.XGBStand = joblib.load('XGBStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('XGB', 'Raw', self.XGBRaw,
                self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('XGB', 'Normalization',
                self.XGBNorm, self.X_train_norm,
                self.X_test_norm, self.y_train_norm,
                self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('XGB', 'Standardization',
                self.XGBStand, self.X_train_stand,
                self.X_test_stand, self.y_train_stand,
                self.y_test_stand)
    else:
        # Builds and trains XGB classifier
        self.XGBRaw = XGBClassifier(n_estimators=200,
            max_depth=20, random_state=2021,
            use_label_encoder=False, eval_metric='mlogloss')
        self.XGBNorm = XGBClassifier(n_estimators=200,
            max_depth=20, random_state=2021,
            use_label_encoder=False, eval_metric='mlogloss')
        self.XGBStand = XGBClassifier(n_estimators=200,
            max_depth=20, random_state=2021,
            use_label_encoder=False, eval_metric='mlogloss')

        if self.rbRaw.isChecked():
            self.run_model('XGB', 'Raw', self.XGBRaw,
                self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('XGB', 'Normalization',
                self.XGBNorm, self.X_train_norm,
                self.X_test_norm, self.y_train_norm,
                self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('XGB', 'Standardization',
                self.XGBStand, self.X_train_stand,
                self.X_test_stand, self.y_train_stand,
                self.y_test_stand)

        # Saves model
        joblib.dump(self.XGBRaw, 'XGBRaw.pkl')
        joblib.dump(self.XGBNorm, 'XGBNorm.pkl')
        joblib.dump(self.XGBStand, 'XGBStand.pkl')
def build_train_lgbm(self):
    if path.isfile('LGBMRaw.pkl'):
        # Loads model
        self.LGBMRaw = joblib.load('LGBMRaw.pkl')
        self.LGBMNorm = joblib.load('LGBMNorm.pkl')
        self.LGBMStand = joblib.load('LGBMStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('LGBM Classifier', 'Raw',
                self.LGBMRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('LGBM Classifier',
                'Normalization', self.LGBMNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('LGBM Classifier',
                'Standardization', self.LGBMStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains LGBM classifier
        self.LGBMRaw = LGBMClassifier(max_depth=20,
            n_estimators=500, subsample=0.8,
            random_state=2021)
        self.LGBMNorm = LGBMClassifier(max_depth=20,
            n_estimators=500, subsample=0.8,
            random_state=2021)
        self.LGBMStand = LGBMClassifier(max_depth=20,
            n_estimators=500, subsample=0.8,
            random_state=2021)

        if self.rbRaw.isChecked():
            self.run_model('LGBM Classifier', 'Raw',
                self.LGBMRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('LGBM Classifier',
                'Normalization', self.LGBMNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('LGBM Classifier',
                'Standardization', self.LGBMStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.LGBMRaw, 'LGBMRaw.pkl')
        joblib.dump(self.LGBMNorm, 'LGBMNorm.pkl')
        joblib.dump(self.LGBMStand, 'LGBMStand.pkl')
def build_train_mlp(self):
    if path.isfile('MLPRaw.pkl'):
        # Loads model
        self.MLPRaw = joblib.load('MLPRaw.pkl')
        self.MLPNorm = joblib.load('MLPNorm.pkl')
        self.MLPStand = joblib.load('MLPStand.pkl')

        if self.rbRaw.isChecked():
            self.run_model('MLP Classifier', 'Raw',
                self.MLPRaw, self.X_train_raw,
                self.X_test_raw, self.y_train_raw,
                self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('MLP Classifier',
                'Normalization', self.MLPNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('MLP Classifier',
                'Standardization', self.MLPStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)
    else:
        # Builds and trains MLP classifier
        self.MLPRaw = MLPClassifier(random_state=2021)
        self.MLPNorm = MLPClassifier(random_state=2021)
        self.MLPStand = MLPClassifier(random_state=2021)

        if self.rbRaw.isChecked():
            self.run_model('MLP Classifier', 'Raw',
                self.MLPRaw, self.X_train_raw, self.X_test_raw,
                self.y_train_raw, self.y_test_raw)
        if self.rbNorm.isChecked():
            self.run_model('MLP Classifier',
                'Normalization', self.MLPNorm,
                self.X_train_norm, self.X_test_norm,
                self.y_train_norm, self.y_test_norm)
        if self.rbStand.isChecked():
            self.run_model('MLP Classifier',
                'Standardization', self.MLPStand,
                self.X_train_stand, self.X_test_stand,
                self.y_train_stand, self.y_test_stand)

        # Saves model
        joblib.dump(self.MLPRaw, 'MLPRaw.pkl')
        joblib.dump(self.MLPNorm, 'MLPNorm.pkl')
        joblib.dump(self.MLPStand, 'MLPStand.pkl')
ANN Classifier
Step 1: Define train_test_DL() to split the dataset into train and test data for the ANN:
def train_test_DL(self):
    X, Y = self.fit_dataset(self.df)

    # Splits dataframe into X_train, X_test, y_train, and y_test
    X_train, X_test, self.y_train_DL, self.y_test_DL = \
        train_test_split(X, Y, test_size=0.2,
            random_state=2021)

    # Standardizes the features with StandardScaler from sklearn
    sc = StandardScaler()
    self.X_train_DL = sc.fit_transform(X_train)
    self.X_test_DL = sc.transform(X_test)

    # Saves data
    np.save('X_train_DL.npy', self.X_train_DL)
    np.save('X_test_DL.npy', self.X_test_DL)
    np.save('y_train_DL.npy', self.y_train_DL)
    np.save('y_test_DL.npy', self.y_test_DL)
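Note the asymmetry in the scaling step: fit_transform() is used on the training split but only transform() on the test split, so the test data is scaled with the mean and standard deviation learned from training data. A minimal sketch (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2021)
X_train = rng.normal(10.0, 2.0, size=(100, 3))
X_test = rng.normal(10.0, 2.0, size=(20, 3))

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learns mean/std, then scales
X_test_s = sc.transform(X_test)        # reuses the *training* mean/std

# Training features are now exactly zero-mean, unit-variance
print(X_train_s.mean(axis=0).round(6))
```

Calling fit_transform() on the test set instead would leak test statistics into the preprocessing and make the evaluation optimistic.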
    # Saves model
    ann.save('thyroid_model.h5')
def train_ANN(self):
    if path.isfile('X_train_DL.npy'):
        # Loads files
        self.X_train_DL = \
            np.load('X_train_DL.npy', allow_pickle=True)
        self.X_test_DL = \
            np.load('X_test_DL.npy', allow_pickle=True)
        self.y_train_DL = \
            np.load('y_train_DL.npy', allow_pickle=True)
        self.y_test_DL = \
            np.load('y_test_DL.npy', allow_pickle=True)
    else:
        self.train_test_DL()
        # Loads files
        self.X_train_DL = \
            np.load('X_train_DL.npy', allow_pickle=True)
        self.X_test_DL = \
            np.load('X_test_DL.npy', allow_pickle=True)
        self.y_train_DL = \
            np.load('y_train_DL.npy', allow_pickle=True)
        self.y_test_DL = \
            np.load('y_test_DL.npy', allow_pickle=True)

    if path.isfile('thyroid_model.h5') == False:
        self.build_ANN(self.X_train_DL,
            self.y_train_DL, 32, 100)

    # Turns on cbPredictionDL
    self.cbPredictionDL.setEnabled(True)
Step 4: Define plot_loss_acc() method to plot loss and accuracy of the model:
    return label

def choose_prediction_ANN(self):
    strCB = self.cbPredictionDL.currentText()

    if strCB == 'CNN 1D':
        pred_val = self.pred_ann(self.X_test_DL,
            self.y_test_DL)

        # Plots true values versus predicted values
        self.widgetPlot2.canvas.figure.clf()
        self.widgetPlot2.canvas.axis1 = \
            self.widgetPlot2.canvas.figure.add_subplot(111,
                facecolor='#fbe7dd')
        self.plot_real_pred_val(pred_val, self.y_test_DL,
            self.widgetPlot2, 'CNN 1D')
        self.widgetPlot2.canvas.figure.tight_layout()
        self.widgetPlot2.canvas.draw()

        # Plots confusion matrix
        self.widgetPlot1.canvas.figure.clf()
        self.widgetPlot1.canvas.axis1 = \
            self.widgetPlot1.canvas.figure.add_subplot(111,
                facecolor='#fbe7dd')
        self.plot_cm(pred_val, self.y_test_DL,
            self.widgetPlot1, 'CNN 1D')
        self.widgetPlot1.canvas.figure.tight_layout()
        self.widgetPlot1.canvas.draw()

        # Loads history
        history = np.load('thyroid_history.npy',
            allow_pickle=True).item()
        train_loss = history['loss']
        train_acc = history['accuracy']
        val_acc = history['val_accuracy']
        val_loss = history['val_loss']
        self.plot_loss_acc(train_loss, val_loss, train_acc,
            val_acc, self.widgetPlot3, "History of " + strCB)
The purpose of the choose_prediction_ANN() function is to
select a specific prediction method for the ANN model and
visualize the results. Here is an explanation of each step:
1. Retrieve the selected prediction method: the function obtains the selected prediction method from the combo box (self.cbPredictionDL.currentText()) and stores it in the variable strCB.
2. Handle the 'CNN 1D' prediction method: if the selected method is 'CNN 1D', the function calls the pred_ann() method with the test data (self.X_test_DL and self.y_test_DL) to obtain the predicted values (pred_val) using the ANN model.
3. Plot true values versus predicted values: the function clears the figure of self.widgetPlot2 and adds a new subplot with a light background color. It calls the plot_real_pred_val() method to plot the true values versus the predicted values (pred_val and self.y_test_DL) on the self.widgetPlot2 widget, visualizing the comparison between actual and predicted values.
4. Plot confusion matrix: the function clears the figure of self.widgetPlot1 and adds a new subplot with a light background color. It calls the plot_cm() method to plot the confusion matrix based on the predicted values (pred_val) and the actual values (self.y_test_DL) on the self.widgetPlot1 widget. The confusion matrix provides insight into the model's true positives, true negatives, false positives, and false negatives.
5. Load and plot history: the function loads the history of the ANN model from the file 'thyroid_history.npy' and retrieves the training loss, validation loss, training accuracy, and validation accuracy. It then calls the plot_loss_acc() method to plot these curves on the self.widgetPlot3 widget, visualizing the training progress and model performance over epochs.
Overall, the choose_prediction_ANN() function allows the
user to select a prediction method for the ANN model,
generates the corresponding predictions, and visualizes the
results through plots of true values versus predicted values,
confusion matrix, and the training history of the model.
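The history-loading step depends on a small NumPy idiom: np.save() stores a plain dict as a 0-d object array, and .item() unwraps it after np.load(). A self-contained sketch of that round trip (the file name and dummy values are illustrative):

```python
import numpy as np

# A dict shaped like keras History.history
history = {'loss': [0.9, 0.5], 'accuracy': [0.6, 0.8],
           'val_loss': [1.0, 0.6], 'val_accuracy': [0.55, 0.75]}

np.save('demo_history.npy', history)  # pickled inside a 0-d array

# allow_pickle=True is required for object arrays;
# .item() recovers the original dict
loaded = np.load('demo_history.npy', allow_pickle=True).item()
print(loaded['val_accuracy'])
```

In the application, this dict would come from model.fit(...).history saved after training the ANN.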
def __init__(self):
    QMainWindow.__init__(self)
    loadUi("gui_thyroid.ui", self)
    self.setWindowTitle(
        "GUI Demo of Classifying and Predicting Thyroid Disease")
    self.addToolBar(NavigationToolbar(
        self.widgetPlot1.canvas, self))
    self.pbLoad.clicked.connect(self.import_dataset)
    self.initial_state(False)
    self.pbTrainML.clicked.connect(self.train_model_ML)
    self.cbData.currentIndexChanged.connect(self.choose_plot)
    self.cbClassifier.currentIndexChanged.connect(
        self.choose_ML_model)
    self.pbTrainDL.clicked.connect(self.train_ANN)
    self.cbPredictionDL.currentIndexChanged.connect(
        self.choose_prediction_ANN)
By connecting the currentIndexChanged signal of cbPredictionDL to the choose_prediction_ANN() method, changing the selected item in the combo box will invoke choose_prediction_ANN() and update the visualizations based on the user's choice.
Figure 195
#gui_thyroid.py
from PyQt5.QtWidgets import *
from PyQt5.uic import loadUi
from matplotlib.backends.backend_qt5agg import (NavigationToolbar2QT as Navig
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import warnings
import mglearn
warnings.filterwarnings('ignore')
import os
import joblib
from numpy import save
from numpy import load
from os import path
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.model_selection import train_test_split, RandomizedSearchCV, Gri
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score
from sklearn.metrics import classification_report, f1_score, plot_confusion_matrix
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import learning_curve
from mlxtend.plotting import plot_decision_regions
import tensorflow as tf
from sklearn.base import clone
from sklearn.decomposition import PCA
from tensorflow.keras.models import Sequential, Model, load_model
from sklearn.impute import SimpleImputer
class DemoGUI_Thyroid(QMainWindow):
def __init__(self):
QMainWindow.__init__(self)
loadUi("gui_thyroid.ui",self)
self.setWindowTitle(\
"GUI Demo of Classifying and Predicting Thyroid Disease")
self.addToolBar(NavigationToolbar(\
self.widgetPlot1.canvas, self))
self.pbLoad.clicked.connect(self.import_dataset)
self.initial_state(False)
self.pbTrainML.clicked.connect(self.train_model_ML)
self.cbData.currentIndexChanged.connect(self.choose_plot)
self.cbClassifier.currentIndexChanged.connect(self.choose_ML_model)
self.pbTrainDL.clicked.connect(self.train_ANN)
self.cbPredictionDL.currentIndexChanged.connect(self.choose_prediction_ANN)
table.setAlternatingRowColors(True)
table.setStyleSheet("alternate-background-color: #ffbacd;background-c
def import_dataset(self):
curr_path = os.getcwd()
dataset_dir = curr_path + "/hypothyroid.csv"
self.populate_table(self.df.describe(), self.twData2)
self.twData2.setVerticalHeaderLabels(['Count', 'Mean', 'Std', 'Min', '25%',
self.label2.setText('Data Description')
#Populates cbData
self.populate_cbData()
def populate_cbData(self):
self.cbData.addItems(self.df)
self.cbData.addItems(["Features Importance"])
self.cbData.addItems(["Correlation Matrix", "Pairwise Relationship", "Featur
#Resamples data
sm = SMOTE(random_state=2021)
X,y = sm.fit_resample(X, y.ravel())
return X, y
def train_test(self):
X, y = self.fit_dataset(self.df)
# Splits data into train and test sets (split parameters assumed; the call is missing from this excerpt)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2021, stratify=y)
self.X_train_norm = X_train.copy()
self.X_test_norm = X_test.copy()
self.y_train_norm = y_train.copy()
self.y_test_norm = y_test.copy()
norm = MinMaxScaler()
self.X_train_norm = norm.fit_transform(self.X_train_norm)
self.X_test_norm = norm.transform(self.X_test_norm)
self.X_train_stand = X_train.copy()
self.X_test_stand = X_test.copy()
self.y_train_stand = y_train.copy()
self.y_test_stand = y_test.copy()
scaler = StandardScaler()
self.X_train_stand = scaler.fit_transform(self.X_train_stand)
self.X_test_stand = scaler.transform(self.X_test_stand)
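The two scalers used above behave differently, which is why the code keeps separate normalized and standardized copies. A minimal sketch on a tiny one-feature array (values chosen for illustration): MinMaxScaler maps the training range onto [0, 1], while StandardScaler centers on the training mean and scales by the training standard deviation. In both cases the scaler is fit on the training data only and then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

# MinMaxScaler: training min -> 0, training max -> 1;
# test values outside the training range fall outside [0, 1]
norm = MinMaxScaler()
Xn_train = norm.fit_transform(X_train)
Xn_test = norm.transform(X_test)  # (5 - 1) / (4 - 1) = 1.333...

# StandardScaler: zero mean, unit variance on the training data
stand = StandardScaler()
Xs_train = stand.fit_transform(X_train)
Xs_test = stand.transform(X_test)
```

Calling fit_transform on the test set instead would leak test statistics into the preprocessing.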
def split_data_ML(self):
if path.isfile('X_train_raw.npy'):
#Loads npy files
self.X_train_raw = np.load('X_train_raw.npy',allow_pickle=True)
self.y_train_raw = np.load('y_train_raw.npy',allow_pickle=True)
self.X_test_raw = np.load('X_test_raw.npy',allow_pickle=True)
self.y_test_raw = np.load('y_test_raw.npy',allow_pickle=True)
self.X_train_norm = np.load('X_train_norm.npy',allow_pickle=True)
self.y_train_norm = np.load('y_train_norm.npy',allow_pickle=True)
self.X_test_norm = np.load('X_test_norm.npy',allow_pickle=True)
self.y_test_norm = np.load('y_test_norm.npy',allow_pickle=True)
self.X_train_stand = np.load('X_train_stand.npy',allow_pickle=True)
self.y_train_stand = np.load('y_train_stand.npy',allow_pickle=True)
self.X_test_stand = np.load('X_test_stand.npy',allow_pickle=True)
self.y_test_stand = np.load('y_test_stand.npy',allow_pickle=True)
else:
self.train_test()
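The caching pattern in split_data_ML — reuse saved .npy files when they exist, otherwise recompute and save — can be sketched generically as follows. The helper name and demo file name are hypothetical; the book's code inlines this logic per array.

```python
import os
import numpy as np

def load_or_compute(path, compute):
    # Reuse a cached .npy file when present; otherwise compute and cache it
    if os.path.isfile(path):
        return np.load(path, allow_pickle=True)
    arr = compute()
    np.save(path, arr)
    return arr

X = load_or_compute("X_demo.npy", lambda: np.arange(6).reshape(2, 3))
X_again = load_or_compute("X_demo.npy", lambda: None)  # served from disk
os.remove("X_demo.npy")  # clean up the demo cache file
```

Caching the splits guarantees that every classifier is trained and evaluated on identical data across runs.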
def train_model_ML(self):
self.split_data_ML()
# The height of the bar is the data value and can be used as the label
label_text = f'{height:.0f}'
# ax.text(x, y, text)
label_x = x + width / 2
label_y = y + height / 2
def choose_plot(self):
strCB = self.cbData.currentText()
if strCB == "binaryClass":
#Plots distribution of binaryClass variable in pie chart
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(1
label_class = list(self.df_dummy["binaryClass"].value_counts().in
self.pie_cat(self.df_dummy["binaryClass"],'binaryClass', label_class, self.w
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(1
self.bar_cat(self.df_dummy,"binaryClass", self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = self.widgetPlot2.canvas.figure.add_subplot(1
self.widgetPlot2.canvas.axis2 = self.widgetPlot2.canvas.figure.add_subplot(1
self.dist_percent_plot(self.df_dummy,"sex",self.widgetPlot2.canvas.axis1,sel
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["on thyroxine"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus binaryClass",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["TSH measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("TSH measured versus binaryClass",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["TT4 measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("TT4 measured versus binaryClass",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T4U measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus binaryClass",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T3 measured"],hue = self.df_dummy[
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T3 measured versus binaryClass",fon
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["query on thyroxine"],hue = self.df_
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus binaryCla
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["binary
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sick versus binaryClass",fontweight
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["binar
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("tumor versus binaryClass",fontweigh
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["binar
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("psych versus binaryClass",fontweigh
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(1
self.bar_cat(self.df_dummy,"TSH measured", self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = self.widgetPlot2.canvas.figure.add_subplot(1
self.widgetPlot2.canvas.axis2 = self.widgetPlot2.canvas.figure.add_subplot(1
self.dist_percent_plot(self.df_dummy,"TSH measured",self.widgetPlot2.canvas.
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["on thyroxine"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus TSH measured",f
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["goitre"],hue = self.df_dummy["TSH
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("goitre versus TSH measured",fontwei
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["TT4 measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("TT4 measured versus TSH measured",f
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T4U measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus TSH measured",f
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T3 measured"],hue = self.df_dummy[
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T3 measured versus TSH measured",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["query on thyroxine"],hue = self.df_
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus TSH measu
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["TSH me
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sick versus TSH measured",fontweigh
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["TSH m
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("tumor versus TSH measured",fontweig
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["TSH m
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("psych versus TSH measured",fontweig
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = self.widgetPlot2.canvas.figure.add_subplot(1
self.widgetPlot2.canvas.axis2 = self.widgetPlot2.canvas.figure.add_subplot(1
self.dist_percent_plot(self.df_dummy,"T3 measured",self.widgetPlot2.canvas.a
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["on thyroxine"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus T3 measured",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["goitre"],hue = self.df_dummy["T3 m
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("goitre versus T3 measured",fontweig
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["TT4 measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("TT4 measured versus T3 measured",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T4U measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus T3 measured",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["TSH measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("TSH measured versus T3 measured",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["query on thyroxine"],hue = self.df_
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus T3 measur
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["T3 mea
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sick versus T3 measured",fontweight
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["T3 me
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("tumor versus T3 measured",fontweigh
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["T3 me
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("psych versus T3 measured",fontweigh
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(1
self.bar_cat(self.df_dummy,"TT4 measured", self.widgetPlot1)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = self.widgetPlot2.canvas.figure.add_subplot(1
self.widgetPlot2.canvas.axis2 = self.widgetPlot2.canvas.figure.add_subplot(1
self.dist_percent_plot(self.df_dummy,"TT4 measured",self.widgetPlot2.canvas.
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["on thyroxine"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("on thyroxine versus TT4 measured",f
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["goitre"],hue = self.df_dummy["TT4
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("goitre versus TT4 measured",fontwei
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T3 measured"],hue = self.df_dummy[
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T3 measured versus TT4 measured",fo
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["T4U measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("T4U measured versus TT4 measured",f
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["TSH measured"],hue = self.df_dummy
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("TSH measured versus TT4 measured",f
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["query on thyroxine"],hue = self.df_
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("query on thyroxine versus TT4 measu
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["sick"],hue = self.df_dummy["TT4 me
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("sick versus TT4 measured",fontweigh
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["tumor"],hue = self.df_dummy["TT4 m
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("tumor versus TT4 measured",fontweig
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
g=sns.countplot(self.df_dummy["psych"],hue = self.df_dummy["TT4 m
palette='Spectral_r',ax=self.widgetPlot3.canvas.axis1)
self.put_label_stacked_bar(g,17)
self.widgetPlot3.canvas.axis1.set_title("psych versus TT4 measured",fontweig
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
if strCB == 'age':
self.prob_num_versus_two_cat("age","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("age")
self.prob_num_versus_two_cat("age","FTI measured", "T3 measured" ,self.widge
if strCB == 'TSH':
self.prob_num_versus_two_cat("TSH","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("TSH")
self.prob_num_versus_two_cat("TSH","FTI measured", "T3 measured" ,self.widge
if strCB == 'T3':
self.prob_num_versus_two_cat("T3","binaryClass", "T4U measured" ,self.widget
self.hist_num_versus_nine_cat("T3")
self.prob_num_versus_two_cat("T3","FTI measured", "T3 measured" ,self.widget
if strCB == 'TT4':
self.prob_num_versus_two_cat("TT4","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("TT4")
self.prob_num_versus_two_cat("TT4","FTI measured", "T3 measured" ,self.widge
if strCB == 'T4U':
self.prob_num_versus_two_cat("T4U","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("T4U")
self.prob_num_versus_two_cat("T4U","FTI measured", "T3 measured" ,self.widge
if strCB == 'FTI':
self.prob_num_versus_two_cat("FTI","binaryClass", "T4U measured" ,self.widge
self.hist_num_versus_nine_cat("FTI")
self.prob_num_versus_two_cat("FTI","FTI measured", "T3 measured" ,self.widge
ax0.set_facecolor(background_color)
ax0_sns = sns.histplot(data=self.df, x=self.df[feat],ax=ax0,zorder=2,
shrink=.8,linewidth=0.3,alpha=1)
self.put_label_stacked_bar(ax0_sns,12)
ax0_sns.set_xlabel('',fontsize=10, weight='bold')
ax0_sns.set_ylabel('',fontsize=10, weight='bold')
def prob_feat_versus_other(self,feat,another,legend,ax0,label='',title=''):
background_color = "#fbe7dd"
sns.set_palette(['#ff355d','#66b3ff'])
for s in ["right", "top"]:
ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.kdeplot(x=self.df[feat],ax=ax0,hue=another,linewidth=0.3,fill=True,cbar='
ax0_sns.set_xlabel('',fontsize=4, weight='bold')
ax0_sns.set_ylabel('',fontsize=4, weight='bold')
def hist_num_versus_nine_cat(self,feat):
self.label_bin = list(self.df_dummy["binaryClass"].value_counts().index)
self.label_sex = list(self.df_dummy["sex"].value_counts().index)
self.label_thyroxine = list(self.df_dummy["on thyroxine"].value_counts().ind
self.label_pregnant = list(self.df_dummy["pregnant"].value_counts().index)
self.label_lithium = list(self.df_dummy["lithium"].value_counts().index)
self.label_goitre = list(self.df_dummy["goitre"].value_counts().index)
self.label_tumor = list(self.df_dummy["tumor"].value_counts().index)
self.label_tsh = list(self.df_dummy["TSH measured"].value_counts().index)
self.label_tt4 = list(self.df_dummy["TT4 measured"].value_counts().index)
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["binaryClass"].value_counts())
self.feat_versus_other(feat,self.df_dummy["binaryClass"],self.label_bin,self
s versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["sex"].value_counts())
self.feat_versus_other(feat,self.df_dummy["sex"],self.label_sex,self.widgetP
feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["on thyroxine"].value_counts())
self.feat_versus_other(feat,self.df_dummy["on thyroxine"],self.label_thyroxi
thyroxine versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["pregnant"].value_counts())
self.feat_versus_other(feat,self.df_dummy["pregnant"],self.label_pregnant,se
versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["lithium"].value_counts())
self.feat_versus_other(feat,self.df_dummy["lithium"],self.label_lithium,self
versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["goitre"].value_counts())
self.feat_versus_other(feat,self.df_dummy["goitre"],self.label_goitre,self.w
versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["tumor"].value_counts())
self.feat_versus_other(feat,self.df_dummy["tumor"],self.label_tumor,self.wid
versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["TSH measured"].value_counts())
self.feat_versus_other(feat,self.df_dummy["TSH measured"],self.label_tsh,sel
measured versus ' + feat)
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(3
print(self.df_dummy["TT4 measured"].value_counts())
self.feat_versus_other(feat,self.df_dummy["TT4 measured"],self.label_tt4,sel
measured versus ' + feat)
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
widget.canvas.figure.clf()
widget.canvas.axis1 = widget.canvas.figure.add_subplot(211,facecolor
print(self.df_dummy[feat_cat2].value_counts())
self.prob_feat_versus_other(feat,self.df_dummy[feat_cat2],self.label_feat_ca
2 + ' versus ' + feat)
widget.canvas.axis1 = widget.canvas.figure.add_subplot(212,facecolor
print(self.df_dummy[feat_cat1].value_counts())
self.prob_feat_versus_other(feat,self.df_dummy[feat_cat1],self.label_feat_ca
1 + ' versus ' + feat)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
#Output plot
widget.canvas.figure.clf()
widget.canvas.axis1 = widget.canvas.figure.add_subplot(111,facecolor=
widget.canvas.axis1.scatter(range(len(Y_pred)),Y_pred,color="yellow",
widget.canvas.axis1.scatter(range(len(Y_test)), Y_test,color="red",la
widget.canvas.axis1.set_title("Prediction Values vs Real Values of "
widget.canvas.axis1.set_xlabel("Accuracy: " + str(round((acc*100),3))
widget.canvas.axis1.legend()
widget.canvas.axis1.grid(True, alpha=0.75, lw=1, ls='-.')
widget.canvas.figure.tight_layout()
widget.canvas.draw()
plot_decision_regions(X_test_feature.values, y_test_feature.ravel(),
widget.canvas.axis1.set_title(title, fontweight ="bold",fontsize=15)
widget.canvas.axis1.set_xlabel(feat1)
widget.canvas.axis1.set_ylabel(feat2)
widget.canvas.figure.tight_layout()
widget.canvas.draw()
return y_pred
self.widgetPlot1.canvas.figure.clf()
self.widgetPlot1.canvas.axis1 = self.widgetPlot1.canvas.figure.add_subplot(1
self.plot_cm(y_pred, y_test, self.widgetPlot1, name + " -- " + scaling)
self.widgetPlot1.canvas.figure.tight_layout()
self.widgetPlot1.canvas.draw()
self.widgetPlot2.canvas.figure.clf()
self.widgetPlot2.canvas.axis1 = self.widgetPlot2.canvas.figure.add_subplot(1
self.plot_real_pred_val(y_pred, y_test, self.widgetPlot2, name + " -- " + sc
self.widgetPlot2.canvas.figure.tight_layout()
self.widgetPlot2.canvas.draw()
self.widgetPlot3.canvas.figure.clf()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(2
self.plot_decision(model, 'TSH', 'FTI', self.widgetPlot3, title="The decisio
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(2
self.plot_learning_curve(model, 'Learning Curve' + " -- " + scaling, X_train
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(2
self.plot_scalability_curve(model, 'Scalability of ' + name + " -- " + scali
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.axis1 = self.widgetPlot3.canvas.figure.add_subplot(2
self.plot_performance_curve(model, 'Performance of ' + name + " -- " + scali
self.widgetPlot3.canvas.figure.tight_layout()
self.widgetPlot3.canvas.draw()
def build_train_lr(self):
if path.isfile('logregRaw.pkl'):
#Loads model
self.logregRaw = joblib.load('logregRaw.pkl')
self.logregNorm = joblib.load('logregNorm.pkl')
self.logregStand = joblib.load('logregStand.pkl')
if self.rbRaw.isChecked():
self.run_model('Logistic Regression', 'Raw', self.logregRaw, self.X_train_raw
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('Logistic Regression', 'Normalization', self.logregNorm, self
self.y_train_norm, self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('Logistic Regression', 'Standardization', self.logregStand, s
self.y_train_stand, self.y_test_stand)
else:
#Builds and trains Logistic Regression
self.logregRaw = LogisticRegression(solver='lbfgs', max_iter=500, random_sta
self.logregNorm = LogisticRegression(solver='lbfgs', max_iter=500, random_st
self.logregStand = LogisticRegression(solver='lbfgs', max_iter=500, random_s
if self.rbRaw.isChecked():
self.run_model('Logistic Regression', 'Raw', self.logregRaw, self.X_train_raw
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('Logistic Regression', 'Normalization', self.logregNorm, self
self.y_train_norm, self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('Logistic Regression', 'Standardization', self.logregStand, s
self.y_train_stand, self.y_test_stand)
#Saves model
joblib.dump(self.logregRaw, 'logregRaw.pkl')
joblib.dump(self.logregNorm, 'logregNorm.pkl')
joblib.dump(self.logregStand, 'logregStand.pkl')
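The load-or-train pattern used by build_train_lr (and the other build_train_* methods) can be sketched on a standalone dataset. This is a minimal sketch, not the GUI code: the dataset is synthetic and the cache file name is hypothetical, but the structure — load the pickle if it exists, otherwise fit and dump it — matches the listing above.

```python
import os
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=2021)

MODEL_PATH = "logreg_demo.pkl"  # hypothetical cache file name
if os.path.isfile(MODEL_PATH):
    model = joblib.load(MODEL_PATH)  # reuse the previously trained model
else:
    model = LogisticRegression(solver='lbfgs', max_iter=500, random_state=2021)
    model.fit(X, y)
    joblib.dump(model, MODEL_PATH)  # cache the fitted model for the next run
print(model.score(X, y))
os.remove(MODEL_PATH)  # clean up the demo cache file
```

Persisting the fitted estimator means retraining happens only once; later runs go straight to evaluation and plotting.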
def choose_ML_model(self):
strCB = self.cbClassifier.currentText()
if strCB == 'Adaboost':
self.build_train_ada()
def build_train_svm(self):
if path.isfile('SVMRaw.pkl'):
#Loads model
self.SVMRaw = joblib.load('SVMRaw.pkl')
self.SVMNorm = joblib.load('SVMNorm.pkl')
self.SVMStand = joblib.load('SVMStand.pkl')
if self.rbRaw.isChecked():
self.run_model('Support Vector Machine', 'Raw', self.SVMRaw, self.X_train_raw
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('Support Vector Machine', 'Normalization', self.SVMNorm, self
self.y_train_norm, self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('Support Vector Machine', 'Standardization', self.SVMStand, s
self.y_train_stand, self.y_test_stand)
else:
#Builds and trains Support Vector Machine
self.SVMRaw = SVC(random_state=2021,probability=True)
self.SVMNorm = SVC(random_state=2021,probability=True)
self.SVMStand = SVC(random_state=2021,probability=True)
if self.rbRaw.isChecked():
self.run_model('Support Vector Machine', 'Raw', self.SVMRaw, self.X_train_raw
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('Support Vector Machine', 'Normalization', self.SVMNorm, self
self.y_train_norm, self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('Support Vector Machine', 'Standardization', self.SVMStand, s
self.y_train_stand, self.y_test_stand)
#Saves model
joblib.dump(self.SVMRaw, 'SVMRaw.pkl')
joblib.dump(self.SVMNorm, 'SVMNorm.pkl')
joblib.dump(self.SVMStand, 'SVMStand.pkl')
def build_train_knn(self):
if path.isfile('KNNRaw.pkl'):
#Loads model
self.KNNRaw = joblib.load('KNNRaw.pkl')
self.KNNNorm = joblib.load('KNNNorm.pkl')
self.KNNStand = joblib.load('KNNStand.pkl')
if self.rbRaw.isChecked():
self.run_model('K-Nearest Neighbor', 'Raw', self.KNNRaw, self.X_train_raw, s
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('K-Nearest Neighbor', 'Normalization', self.KNNNorm, self.X_t
self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('K-Nearest Neighbor', 'Standardization', self.KNNStand, self.X
self.y_train_stand, self.y_test_stand)
else:
#Builds and trains K-Nearest Neighbor
self.KNNRaw = KNeighborsClassifier(n_neighbors = 20)
self.KNNNorm = KNeighborsClassifier(n_neighbors = 20)
self.KNNStand = KNeighborsClassifier(n_neighbors = 20)
if self.rbRaw.isChecked():
self.run_model('K-Nearest Neighbor', 'Raw', self.KNNRaw, self.X_train_raw, s
self.y_test_raw)
if self.rbNorm.isChecked():
self.run_model('K-Nearest Neighbor', 'Normalization', self.KNNNorm, self.X_t
self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('K-Nearest Neighbor', 'Standardization', self.KNNStand, self.X
self.y_train_stand, self.y_test_stand)
#Saves model
joblib.dump(self.KNNRaw, 'KNNRaw.pkl')
joblib.dump(self.KNNNorm, 'KNNNorm.pkl')
joblib.dump(self.KNNStand, 'KNNStand.pkl')
def build_train_dt(self):
if path.isfile('DTRaw.pkl'):
#Loads model
self.DTRaw = joblib.load('DTRaw.pkl')
self.DTNorm = joblib.load('DTNorm.pkl')
self.DTStand = joblib.load('DTStand.pkl')
if self.rbRaw.isChecked():
self.run_model('Decision Tree', 'Raw', self.DTRaw, self.X_train_raw, self.X_
if self.rbNorm.isChecked():
self.run_model('Decision Tree', 'Normalization', self.DTNorm, self.X_train_n
self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('Decision Tree', 'Standardization', self.DTStand, self.X_trai
self.y_test_stand)
else:
#Builds and trains Decision Tree
dt = DecisionTreeClassifier()
parameters = { 'max_depth':np.arange(1,20,1),'random_state':[2021
self.DTRaw = GridSearchCV(dt, parameters)
self.DTNorm = GridSearchCV(dt, parameters)
self.DTStand = GridSearchCV(dt, parameters)
if self.rbRaw.isChecked():
self.run_model('Decision Tree', 'Raw', self.DTRaw, self.X_train_raw, self.X_
if self.rbNorm.isChecked():
self.run_model('Decision Tree', 'Normalization', self.DTNorm, self.X_train_n
self.y_test_norm)
if self.rbStand.isChecked():
self.run_model('Decision Tree', 'Standardization', self.DTStand, self.X_trai
self.y_test_stand)
#Saves model
joblib.dump(self.DTRaw, 'DTRaw.pkl')
joblib.dump(self.DTNorm, 'DTNorm.pkl')
joblib.dump(self.DTStand, 'DTStand.pkl')
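Unlike the other classifiers, the Decision Tree above is wrapped in GridSearchCV, which cross-validates every max_depth in the grid and refits on the best one. A minimal sketch on a synthetic dataset (the dataset is made up; the grid matches the listing above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=2021)

# Same grid as in build_train_dt: max_depth from 1 to 19
dt = DecisionTreeClassifier()
parameters = {'max_depth': np.arange(1, 20, 1), 'random_state': [2021]}
search = GridSearchCV(dt, parameters)  # 5-fold cross-validation by default
search.fit(X, y)
print(search.best_params_['max_depth'])  # depth chosen by cross-validation
```

After fit, the GridSearchCV object itself acts as the refitted best estimator, which is why the listing can pass self.DTRaw straight to run_model.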
    def build_train_rf(self):
        if path.isfile('RFRaw.pkl'):
            # Loads model
            self.RFRaw = joblib.load('RFRaw.pkl')
            self.RFNorm = joblib.load('RFNorm.pkl')
            self.RFStand = joblib.load('RFStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('Random Forest', 'Raw', self.RFRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Random Forest', 'Normalization', self.RFNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Random Forest', 'Standardization', self.RFStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains Random Forest
            self.RFRaw = RandomForestClassifier(n_estimators=200,
                max_depth=20, random_state=2021)
            self.RFNorm = RandomForestClassifier(n_estimators=200,
                max_depth=20, random_state=2021)
            self.RFStand = RandomForestClassifier(n_estimators=200,
                max_depth=20, random_state=2021)
            if self.rbRaw.isChecked():
                self.run_model('Random Forest', 'Raw', self.RFRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Random Forest', 'Normalization', self.RFNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Random Forest', 'Standardization', self.RFStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.RFRaw, 'RFRaw.pkl')
            joblib.dump(self.RFNorm, 'RFNorm.pkl')
            joblib.dump(self.RFStand, 'RFStand.pkl')
    def build_train_gb(self):
        if path.isfile('GBRaw.pkl'):
            # Loads model
            self.GBRaw = joblib.load('GBRaw.pkl')
            self.GBNorm = joblib.load('GBNorm.pkl')
            self.GBStand = joblib.load('GBStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('Gradient Boosting', 'Raw', self.GBRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Gradient Boosting', 'Normalization', self.GBNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Gradient Boosting', 'Standardization', self.GBStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains Gradient Boosting
            # (subsample value assumed; it is truncated in the source listing)
            self.GBRaw = GradientBoostingClassifier(n_estimators=100,
                max_depth=10, subsample=0.8, random_state=2021)
            self.GBNorm = GradientBoostingClassifier(n_estimators=100,
                max_depth=10, subsample=0.8, random_state=2021)
            self.GBStand = GradientBoostingClassifier(n_estimators=100,
                max_depth=10, subsample=0.8, random_state=2021)
            if self.rbRaw.isChecked():
                self.run_model('Gradient Boosting', 'Raw', self.GBRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Gradient Boosting', 'Normalization', self.GBNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Gradient Boosting', 'Standardization', self.GBStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.GBRaw, 'GBRaw.pkl')
            joblib.dump(self.GBNorm, 'GBNorm.pkl')
            joblib.dump(self.GBStand, 'GBStand.pkl')
    def build_train_nb(self):
        if path.isfile('NBRaw.pkl'):
            # Loads model
            self.NBRaw = joblib.load('NBRaw.pkl')
            self.NBNorm = joblib.load('NBNorm.pkl')
            self.NBStand = joblib.load('NBStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('Naive Bayes', 'Raw', self.NBRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Naive Bayes', 'Normalization', self.NBNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Naive Bayes', 'Standardization', self.NBStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains Naive Bayes
            self.NBRaw = GaussianNB()
            self.NBNorm = GaussianNB()
            self.NBStand = GaussianNB()
            if self.rbRaw.isChecked():
                self.run_model('Naive Bayes', 'Raw', self.NBRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Naive Bayes', 'Normalization', self.NBNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Naive Bayes', 'Standardization', self.NBStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.NBRaw, 'NBRaw.pkl')
            joblib.dump(self.NBNorm, 'NBNorm.pkl')
            joblib.dump(self.NBStand, 'NBStand.pkl')
    def build_train_ada(self):
        if path.isfile('ADARaw.pkl'):
            # Loads model
            self.ADARaw = joblib.load('ADARaw.pkl')
            self.ADANorm = joblib.load('ADANorm.pkl')
            self.ADAStand = joblib.load('ADAStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('Adaboost', 'Raw', self.ADARaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Adaboost', 'Normalization', self.ADANorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Adaboost', 'Standardization', self.ADAStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains Adaboost
            self.ADARaw = AdaBoostClassifier(n_estimators=100, learning_rate=0.01)
            self.ADANorm = AdaBoostClassifier(n_estimators=100, learning_rate=0.01)
            self.ADAStand = AdaBoostClassifier(n_estimators=100, learning_rate=0.01)
            if self.rbRaw.isChecked():
                self.run_model('Adaboost', 'Raw', self.ADARaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('Adaboost', 'Normalization', self.ADANorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('Adaboost', 'Standardization', self.ADAStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.ADARaw, 'ADARaw.pkl')
            joblib.dump(self.ADANorm, 'ADANorm.pkl')
            joblib.dump(self.ADAStand, 'ADAStand.pkl')
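AdaBoost's learning_rate scales each weak learner's contribution, so learning_rate=0.01 with only 100 estimators is a deliberately conservative setting. A quick sketch comparing cross-validated accuracy at two learning rates on synthetic stand-in data (the dataset and the exact scores are illustrative, not results from the thyroid data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the real features
X, y = make_classification(n_samples=300, n_features=10, random_state=2021)

scores = {}
for lr in (0.01, 1.0):
    clf = AdaBoostClassifier(n_estimators=100, learning_rate=lr,
                             random_state=2021)
    # Mean 3-fold cross-validated accuracy for each learning rate
    scores[lr] = cross_val_score(clf, X, y, cv=3).mean()

print({k: round(v, 3) for k, v in scores.items()})
```

If the small learning rate underfits on a given dataset, either raising it or increasing n_estimators compensates; the two knobs trade off against each other.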
    def build_train_xgb(self):
        if path.isfile('XGBRaw.pkl'):
            # Loads model
            self.XGBRaw = joblib.load('XGBRaw.pkl')
            self.XGBNorm = joblib.load('XGBNorm.pkl')
            self.XGBStand = joblib.load('XGBStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('XGB', 'Raw', self.XGBRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('XGB', 'Normalization', self.XGBNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('XGB', 'Standardization', self.XGBStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains XGB classifier
            self.XGBRaw = XGBClassifier(n_estimators=100, max_depth=20,
                random_state=2021, eval_metric='mlogloss')
            self.XGBNorm = XGBClassifier(n_estimators=100, max_depth=20,
                random_state=2021, eval_metric='mlogloss')
            self.XGBStand = XGBClassifier(n_estimators=100, max_depth=20,
                random_state=2021, eval_metric='mlogloss')
            if self.rbRaw.isChecked():
                self.run_model('XGB', 'Raw', self.XGBRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('XGB', 'Normalization', self.XGBNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('XGB', 'Standardization', self.XGBStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.XGBRaw, 'XGBRaw.pkl')
            joblib.dump(self.XGBNorm, 'XGBNorm.pkl')
            joblib.dump(self.XGBStand, 'XGBStand.pkl')
    def build_train_lgbm(self):
        if path.isfile('LGBMRaw.pkl'):
            # Loads model
            self.LGBMRaw = joblib.load('LGBMRaw.pkl')
            self.LGBMNorm = joblib.load('LGBMNorm.pkl')
            self.LGBMStand = joblib.load('LGBMStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('LGBM Classifier', 'Raw', self.LGBMRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('LGBM Classifier', 'Normalization', self.LGBMNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('LGBM Classifier', 'Standardization', self.LGBMStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains LGBMClassifier
            # (trailing argument values assumed; truncated in the source listing)
            self.LGBMRaw = LGBMClassifier(max_depth=20, n_estimators=100,
                subsample=0.8, random_state=2021)
            self.LGBMNorm = LGBMClassifier(max_depth=20, n_estimators=100,
                subsample=0.8, random_state=2021)
            self.LGBMStand = LGBMClassifier(max_depth=20, n_estimators=100,
                subsample=0.8, random_state=2021)
            if self.rbRaw.isChecked():
                self.run_model('LGBM Classifier', 'Raw', self.LGBMRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('LGBM Classifier', 'Normalization', self.LGBMNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('LGBM Classifier', 'Standardization', self.LGBMStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.LGBMRaw, 'LGBMRaw.pkl')
            joblib.dump(self.LGBMNorm, 'LGBMNorm.pkl')
            joblib.dump(self.LGBMStand, 'LGBMStand.pkl')
    def build_train_mlp(self):
        if path.isfile('MLPRaw.pkl'):
            # Loads model
            self.MLPRaw = joblib.load('MLPRaw.pkl')
            self.MLPNorm = joblib.load('MLPNorm.pkl')
            self.MLPStand = joblib.load('MLPStand.pkl')
            if self.rbRaw.isChecked():
                self.run_model('MLP Classifier', 'Raw', self.MLPRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('MLP Classifier', 'Normalization', self.MLPNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('MLP Classifier', 'Standardization', self.MLPStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
        else:
            # Builds and trains MLP classifier
            self.MLPRaw = MLPClassifier(random_state=2021)
            self.MLPNorm = MLPClassifier(random_state=2021)
            self.MLPStand = MLPClassifier(random_state=2021)
            if self.rbRaw.isChecked():
                self.run_model('MLP Classifier', 'Raw', self.MLPRaw,
                    self.X_train_raw, self.X_test_raw,
                    self.y_train_raw, self.y_test_raw)
            if self.rbNorm.isChecked():
                self.run_model('MLP Classifier', 'Normalization', self.MLPNorm,
                    self.X_train_norm, self.X_test_norm,
                    self.y_train_norm, self.y_test_norm)
            if self.rbStand.isChecked():
                self.run_model('MLP Classifier', 'Standardization', self.MLPStand,
                    self.X_train_stand, self.X_test_stand,
                    self.y_train_stand, self.y_test_stand)
            # Saves model
            joblib.dump(self.MLPRaw, 'MLPRaw.pkl')
            joblib.dump(self.MLPNorm, 'MLPNorm.pkl')
            joblib.dump(self.MLPStand, 'MLPStand.pkl')
    def train_test_ANN(self):
        X, y = self.fit_dataset(self.df)
        # Resamples data
        sm = SMOTE(random_state=2021)
        X, y = sm.fit_resample(X, y.ravel())
        # Splits data (this call is missing from the source listing;
        # test_size and stratification are assumed)
        X_train_DL, X_test_DL, y_train_DL, y_test_DL = train_test_split(
            X, y, test_size=0.2, random_state=2021, stratify=y)
        # Saves data
        np.save('X_train_DL.npy', X_train_DL)
        np.save('X_test_DL.npy', X_test_DL)
        np.save('y_train_DL.npy', y_train_DL)
        np.save('y_test_DL.npy', y_test_DL)
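The split-and-cache round trip that train_test_ANN performs can be sketched in isolation; random arrays stand in for the thyroid features, SMOTE is omitted to keep the sketch dependency-light, and test_size=0.2 is an assumption:

```python
import os
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 samples, 27 features (the input_dim used by the ANN)
rng = np.random.RandomState(2021)
X = rng.rand(100, 27)
y = (X[:, 0] > 0.5).astype(int)

# Stratified 80/20 split, then cache the training split as a .npy file
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2021, stratify=y)
np.save('X_train_DL.npy', X_train)

# A later run reloads the cached array instead of redoing the preprocessing
X_back = np.load('X_train_DL.npy', allow_pickle=True)
print(X_train.shape, X_test.shape)
os.remove('X_train_DL.npy')  # clean up the demo file
```

Caching the arrays this way is what lets train_ANN skip resampling and splitting on every launch of the GUI.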
    def train_ANN(self):
        if path.isfile('X_train_DL.npy'):
            # Loads files
            self.X_train_DL = np.load('X_train_DL.npy', allow_pickle=True)
            self.X_test_DL = np.load('X_test_DL.npy', allow_pickle=True)
            self.y_train_DL = np.load('y_train_DL.npy', allow_pickle=True)
            self.y_test_DL = np.load('y_test_DL.npy', allow_pickle=True)
        else:
            self.train_test_ANN()
            # Loads files
            self.X_train_DL = np.load('X_train_DL.npy', allow_pickle=True)
            self.X_test_DL = np.load('X_test_DL.npy', allow_pickle=True)
            self.y_train_DL = np.load('y_train_DL.npy', allow_pickle=True)
            self.y_test_DL = np.load('y_test_DL.npy', allow_pickle=True)
        if not path.isfile('thyroid_model.h5'):
            self.build_ANN(self.X_train_DL, self.y_train_DL, 32, 250)
        # Turns on cbPredictionDL
        self.cbPredictionDL.setEnabled(True)
    def build_ANN(self, X_train, y_train, batch_size, epochs):
        # Sequential model; the compile/fit settings below are assumed where
        # the source listing is truncated
        ann = tf.keras.models.Sequential()
        # Input layer
        ann.add(tf.keras.layers.Dense(units=100, input_dim=27,
            kernel_initializer='uniform', activation='relu'))
        ann.add(tf.keras.layers.Dropout(0.5))
        # Hidden layer 1
        ann.add(tf.keras.layers.Dense(units=20,
            kernel_initializer='uniform', activation='relu'))
        ann.add(tf.keras.layers.Dropout(0.5))
        # Output layer
        ann.add(tf.keras.layers.Dense(units=1,
            kernel_initializer='uniform', activation='sigmoid'))
        # Compiles and trains the model
        ann.compile(loss='binary_crossentropy', optimizer='adam',
            metrics=['accuracy'])
        history = ann.fit(X_train, y_train, batch_size=batch_size,
            epochs=epochs, validation_split=0.2)
        # Saves training history for choose_prediction_ANN()
        np.save('thyroid_history.npy', history.history)
        # Saves model
        ann.save('thyroid_model.h5')
    def choose_prediction_ANN(self):
        strCB = self.cbPredictionDL.currentText()
        # Loads history
        history = np.load('thyroid_history.npy', allow_pickle=True).item()
        train_loss = history['loss']
        train_acc = history['accuracy']
        val_acc = history['val_accuracy']
        val_loss = history['val_loss']
        self.plot_loss_acc(train_loss, val_loss, train_acc, val_acc,
            self.widgetPlot)
    def plot_loss_acc(self, train_loss, val_loss, train_acc, val_acc, widget):
        # (facecolor and fontsize values assumed; truncated in the source listing)
        widget.canvas.axis1 = widget.canvas.figure.add_subplot(212,
            facecolor='white')
        widget.canvas.axis1.plot(train_acc,
            label='Training Accuracy', color='blue')
        widget.canvas.axis1.plot(val_acc, 'b--', label='Validation Accuracy')
        widget.canvas.axis1.set_title('Accuracy', fontweight="bold", fontsize=14)
        widget.canvas.axis1.set_xlabel('Epoch')
        widget.canvas.axis1.grid(True, alpha=0.75, lw=1, ls='-.')
        widget.canvas.axis1.legend(fontsize=16)
        widget.canvas.figure.tight_layout()
        widget.canvas.draw()
        prediction = self.ann.predict(xtest)
        label = [int(p >= 0.5) for p in prediction]
        print("test_target:", ytest)
        print("pred_val:", label)
        return label
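The sigmoid output layer yields probabilities in [0, 1], and the prediction step above thresholds them at 0.5 to obtain hard class labels. A tiny numpy illustration with hypothetical probabilities:

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) as Keras predict() returns
prediction = np.array([[0.12], [0.87], [0.50], [0.49]])

# Threshold at 0.5: probabilities at or above 0.5 map to class 1
label = [int(p >= 0.5) for p in prediction.ravel()]
print(label)  # [0, 1, 1, 0]
```

The 0.5 cutoff treats both classes symmetrically; for an imbalanced screening task, the threshold can be tuned on a validation set to trade precision against recall.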
if __name__ == '__main__':
    import sys
    app = QApplication(sys.argv)
    ex = DemoGUI_Thyroid()
    ex.show()
    sys.exit(app.exec_())