Machine Learning Algorithm
A. Naïve Bayes
Objective: This Python script demonstrates the implementation of a Naïve Bayes classifier to predict whether a breast tumor is malignant or benign.
Name of Dataset: Breast Cancer dataset from Scikit-Learn.
Overview of Dataset: This dataset contains features computed from digitized images of fine needle aspirates of breast masses. The goal is to classify tumors as malignant or benign based on these features.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
print(breast_cancer.target_names)
# Combine features and target into a single DataFrame
df = pd.DataFrame(np.c_[breast_cancer.data, breast_cancer.target],
                  columns=list(breast_cancer.feature_names) + ['target'])
df.info()
import matplotlib.pyplot as plt

# Bar chart of class counts (target 0 = malignant, 1 = benign)
df['target'].value_counts().plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Distribution of Benign and Malignant Cases')
plt.show()
Output
Based on the report generated below:
Number of attributes/features: 30 (mean radius, mean texture, mean area, mean smoothness, etc.)
Number of patients: 569
Number of class labels: 2 ('B' and 'M', corresponding to 357 benign and 212 malignant patients)
Missing values: none; every attribute is non-null
Value inconsistency: none; all attributes share the same format (float64)
Output
Based on the classification report below, the model performed excellently with an accuracy score of 0.973 (97.3%), meaning roughly 97 out of every 100 predictions (whether benign or malignant) are correct.
Based on the confusion matrix below, the model correctly predicted 71 benign cases and 70 malignant cases. The off-diagonal cells show the number of incorrect predictions: the model predicted 3 benign cases as malignant and 0 malignant cases as benign.
accuracy score: 0.9736842105263158
classification report:
precision recall f1-score support
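For reference, a minimal sketch of the training and evaluation steps that produce output like the above; the test_size and random_state values are assumptions, not necessarily the notebook's exact settings.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split the data; test_size and random_state are assumed values
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.2, random_state=0)

# Fit a Gaussian Naïve Bayes classifier and evaluate on the held-out set
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('accuracy score:', accuracy_score(y_test, y_pred))
print('classification report:\n', classification_report(y_test, y_pred))
print('confusion matrix:\n', confusion_matrix(y_test, y_pred))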
https://colab.research.google.com/drive/1YATsBgPmBGMm9XTUufbFk0SaDywHq3h3?usp=sharing
B. SVM
Objective: This Python script demonstrates the implementation of a Support Vector Machine (SVM) with different kernels to classify iris flowers based on their features.
Name of Dataset: Iris dataset from Scikit-Learn.
Overview of Dataset: The Iris dataset contains features measured from samples of three species
of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The features include sepal length,
sepal width, petal length, and petal width. It aims to classify iris flowers into one of the three
species based on these features.
from sklearn.datasets import load_iris

iris = load_iris()
feature_names = iris.feature_names
Based on the correlation matrix and data distribution graph generated above, sepal length has a stronger positive linear relationship with petal length (0.87) and petal width (0.82) than with sepal width (-0.12). Petal length and petal width have the strongest positive linear relationship (0.96) among all the features. There is a very weak linear relationship between sepal width and sepal length (-0.12), and a weak negative one between sepal width and petal width (-0.37). Given the non-linear separability shown in the Iris data distribution graph above, SVM kernels will be employed for further analysis.
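A minimal sketch of how such a correlation matrix can be computed and visualized; the use of seaborn for the heatmap is an assumption about the notebook's plotting choice.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlations between the four features
corr = iris_df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Iris Feature Correlation Matrix')
plt.show()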
Step 4: Find The Best Hyperparameters For SVM Classifiers With Different Kernels
In the code provided, cv=5 is used, meaning the training data is divided into 5 folds; each fold serves once as the validation set while the other 4 are used for training, yielding 5 estimates of the model's performance that are averaged into the reported score. A sketch of the grid-search setup follows below.
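Since the variables svm_linear_grid, svm_poly_grid, and svm_rbf_grid are referenced in the code below but their setup is not shown, here is a minimal sketch of how they could be constructed. The parameter grids are taken from the output tables below; the train/test split ratio and random_state are assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# One 5-fold cross-validated grid search per kernel
svm_linear_grid = GridSearchCV(SVC(kernel='linear'),
                               {'C': [0.1, 1, 10, 100]}, cv=5)
svm_poly_grid = GridSearchCV(SVC(kernel='poly'),
                             {'C': [0.1, 1, 10, 100], 'degree': [2, 3, 4],
                              'gamma': [0.1, 0.01, 0.001]}, cv=5)
svm_rbf_grid = GridSearchCV(SVC(kernel='rbf'),
                            {'C': [0.1, 1, 10, 100],
                             'gamma': [0.1, 0.01, 0.001]}, cv=5)

for grid in (svm_linear_grid, svm_poly_grid, svm_rbf_grid):
    grid.fit(X_train, y_train)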
# Get the mean cross-validation accuracy for each grid-search candidate
print("Linear Kernel:")
print(pd.DataFrame(svm_linear_grid.cv_results_)[['param_C', 'mean_test_score']])
print("")
print("Polynomial Kernel:")
print(pd.DataFrame(svm_poly_grid.cv_results_)[['param_C', 'param_degree', 'param_gamma', 'mean_test_score']])
print("")
print("RBF Kernel:")
print(pd.DataFrame(svm_rbf_grid.cv_results_)[['param_C', 'param_gamma', 'mean_test_score']])
Output
Linear Kernel:
param_C mean_test_score
0 0.1 0.941667
1 1 0.958333
2 10 0.950000
3 100 0.950000
Polynomial Kernel:
param_C param_degree param_gamma mean_test_score
0 0.1 2 0.1 0.950000
1 0.1 2 0.01 0.441667
2 0.1 2 0.001 0.441667
3 0.1 3 0.1 0.958333
4 0.1 3 0.01 0.425000
5 0.1 3 0.001 0.425000
6 0.1 4 0.1 0.941667
7 0.1 4 0.01 0.441667
8 0.1 4 0.001 0.408333
9 1 2 0.1 0.958333
10 1 2 0.01 0.883333
11 1 2 0.001 0.441667
12 1 3 0.1 0.950000
13 1 3 0.01 0.841667
14 1 3 0.001 0.425000
15 1 4 0.1 0.933333
16 1 4 0.01 0.816667
17 1 4 0.001 0.408333
18 10 2 0.1 0.950000
19 10 2 0.01 0.950000
20 10 2 0.001 0.441667
21 10 3 0.1 0.933333
22 10 3 0.01 0.958333
23 10 3 0.001 0.425000
24 10 4 0.1 0.941667
25 10 4 0.01 0.925000
26 10 4 0.001 0.408333
27 100 2 0.1 0.950000
28 100 2 0.01 0.958333
29 100 2 0.001 0.883333
30 100 3 0.1 0.941667
31 100 3 0.01 0.958333
32 100 3 0.001 0.425000
33 100 4 0.1 0.941667
34 100 4 0.01 0.950000
35 100 4 0.001 0.408333
RBF Kernel:
param_C param_gamma mean_test_score
0 0.1 0.1 0.900000
1 0.1 0.01 0.466667
2 0.1 0.001 0.466667
3 1 0.1 0.950000
4 1 0.01 0.908333
5 1 0.001 0.466667
6 10 0.1 0.950000
7 10 0.01 0.950000
8 10 0.001 0.916667
9 100 0.1 0.950000
10 100 0.01 0.958333
11 100 0.001 0.950000
Best Parameters (Linear Kernel): {'C': 1}
Best Score (Linear Kernel): 0.9583333333333334
Best Parameters (Polynomial Kernel): {'C': 0.1, 'degree': 3, 'gamma': 0.1}
Best Score (Polynomial Kernel): 0.9583333333333334
Best Parameters (RBF Kernel): {'C': 100, 'gamma': 0.01}
Best Score (RBF Kernel): 0.9583333333333334
Observations:
All three kernel types (Linear, Polynomial, and RBF) achieved the same best cross-validation accuracy of 95.83%. This suggests that, for the Iris dataset, the choice of kernel may not have a significant impact on SVM classification performance.
According to the linear and RBF SVM decision boundary plots above, the boundaries are close to straight lines, indicating reasonable classification, while the polynomial kernel produces more complicated decision boundaries; a sketch of how such plots can be drawn follows below.
Given that Iris is a small dataset, the linear kernel seems like a good choice due to its simplicity: it achieves high accuracy and avoids the overfitting that can occur with more complex kernels like Polynomial and RBF.
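A sketch of how decision boundary plots like those referenced above can be produced, refitting each kernel with its best reported parameters on just the first two features so the boundary can be drawn in 2-D; the choice of the two sepal features is an assumption.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

X2 = iris.data[:, :2]  # sepal length and sepal width only
y = iris.target

# Mesh grid covering the 2-D feature space
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))

# Best parameters per kernel, taken from the grid-search output above
kernels = {'linear': SVC(kernel='linear', C=1),
           'poly': SVC(kernel='poly', C=0.1, degree=3, gamma=0.1),
           'rbf': SVC(kernel='rbf', C=100, gamma=0.01)}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, clf) in zip(axes, kernels.items()):
    clf.fit(X2, y)
    # Predict over the mesh and shade the resulting class regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor='k')
    ax.set_title(f'{name} kernel')
plt.show()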
https://colab.research.google.com/drive/1MVm5P9lPHF9ukZMZPHxrdmvWSnemBDoC?usp=sharing
C. Logistic Regression
https://colab.research.google.com/drive/1_eXsElMEFZmOAkag6LhdP-1K2sCrHhWM?usp=sharing
D. Ensemble
Load the Heart dataset from the OpenML repository into the variable data, as sketched below.
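A minimal sketch of this loading step; the exact OpenML dataset name is an assumption, since several heart-disease datasets exist on OpenML.

from sklearn.datasets import fetch_openml

# Fetch the Heart dataset from OpenML ('heart-statlog' is an assumed name)
data = fetch_openml('heart-statlog', version=1, as_frame=True)
X = data.data
y = data.target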
# Data preprocessing
# For simplicity, handle missing values by dropping those rows
X = X.dropna()
y = y.loc[X.index]  # keep the labels aligned with the remaining rows
Output
Observation
The accuracy results show that the ensemble model outperformed the standalone Support Vector Machine (SVM) classifier. The ensemble achieved higher accuracy by combining the predictions of multiple classifiers (Random Forest, SVM, and Logistic Regression) using a soft-voting strategy, as sketched below.
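A sketch of such a soft-voting ensemble; the base-classifier settings and the train/test split are assumptions. Soft voting averages the predicted class probabilities of the base models, which is why SVC needs probability=True.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumed split; the notebook's exact split is not shown here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svm', SVC(probability=True, random_state=42)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft')  # average predicted probabilities across classifiers
ensemble.fit(X_train, y_train)
print('Ensemble accuracy:', ensemble.score(X_test, y_test))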
Output
Observation
True positives (32): the model correctly predicted 32 people who have heart disease.
False positives (1): the model incorrectly predicted heart disease for 1 person who does not actually have it.
True negatives (17): the model correctly predicted 17 people who do not have heart disease.
False negatives (4): the model missed 4 actual cases of heart disease.
Overall, the model appears to be good at identifying people with heart disease, making few mistakes. The standard metrics follow directly from these counts, as shown below.
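For reference, working the metrics out from the confusion-matrix counts above: precision = TP / (TP + FP) = 32/33 ≈ 0.970, recall (sensitivity) = TP / (TP + FN) = 32/36 ≈ 0.889, and accuracy = (TP + TN) / total = 49/54 ≈ 0.907.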