Machine Learning Algorithm

A. Naïve Bayes

Objective: This Python script demonstrates the implementation of a Naïve Bayes classifier to predict breast cancer.
Name of Dataset: Breast Cancer dataset from Scikit-Learn.
Overview of Dataset: This dataset contains features computed from digitized images of fine needle aspirates of breast masses. It aims to classify tumors as malignant or benign based on these features.

Step 1 : Importing Library and load the dataset

# Importing necessary libraries


import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset


breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

Step 2 : Data Exploration

breast_cancer.target_names

df = pd.DataFrame(np.c_[breast_cancer.data, breast_cancer.target],
                  columns=list(breast_cancer.feature_names) + ['target'])
df.info()

import matplotlib.pyplot as plt

# Count the occurrences of each class in the target array


benign_count = np.sum(breast_cancer.target == 1)     # in this dataset, target 1 = benign
malignant_count = np.sum(breast_cancer.target == 0)  # and target 0 = malignant

# Create a bar plot


plt.figure(figsize=(8, 6))
bars = plt.bar(['Benign', 'Malignant'], [benign_count, malignant_count],
               color=['skyblue', 'salmon'])

# Annotate the bars with the count of each class
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval, round(yval), va='bottom')

plt.xlabel('Class')
plt.ylabel('Count')
plt.title('Distribution of Benign and Malignant Cases')
plt.show()

Output
Based on the report generated below, it shows that:
Number of attributes / features: 30 (mean radius, mean texture, mean area, mean smoothness, etc.)
Number of patients: 569
Number of class labels: 2 ('malignant' and 'benign', corresponding to 212 malignant and 357 benign patients)
Missing values: none (no attribute has missing values)
Value format consistency: all attributes have the same format, float64 (numbers)

array(['malignant', 'benign'], dtype='<U9')

RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  target                   569 non-null    float64
Step 3 : Data Pre-processing
As the report above shows, the dataset consists of numerical features and there are no missing
values, so there is no need to perform extensive preprocessing for this dataset. Pre-processing is
minimal for the breast cancer dataset because it comes clean and ready-made in the sklearn library.
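As a quick sanity check (illustrative, not part of the original script), the absence of missing values and the uniform numeric format can be confirmed directly from the DataFrame built in Step 2:

# Illustrative sanity check on the DataFrame from Step 2
print(df.isnull().sum().sum())   # expected: 0 (no missing values anywhere)
print(df.dtypes.unique())        # expected: [dtype('float64')] (all columns numeric)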

Step 4 : Splitting dataset


80% of the data is set up for training while 20% is used for testing

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.2, random_state=42)

Step 5 : Model Training


As the dataset contains numerical data, Gaussian Naïve Bayes works best. This script
demonstrates the use of Gaussian Naïve Bayes.
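For reference, Gaussian Naïve Bayes models each continuous feature with a class-conditional normal distribution, estimating a mean and variance per feature and per class from the training data:

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}} \exp\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)

where \mu_{y,i} and \sigma_{y,i}^{2} are the mean and variance of feature i over the training samples of class y.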

# Initialize the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

naive_bayes = GaussianNB()

# Train the classifier


naive_bayes.fit(X_train, y_train)

# Predict the labels for the test set


y_pred = naive_bayes.predict(X_test)

Step 6 : Evaluation - Calculate Accuracy and Confusion Matrix

# Calculate the accuracy of the classifier
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=breast_cancer.target_names,
            yticklabels=breast_cancer.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

Output

Based on the classification report below, the model performed excellently with an accuracy score
of 0.973 (97.3%), which means that roughly 97 out of every 100 predictions (whether benign or
malignant) are correct.

Based on the confusion matrix below, the model correctly predicted 71 benign cases and 40
malignant cases. The off-diagonal cells show the number of incorrect predictions: the model
predicted 3 malignant cases as benign, and 0 benign cases as malignant.
accuracy score: 0.9736842105263158
classification report:
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
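As a quick check, the reported accuracy follows directly from the confusion-matrix counts (71 correct benign plus 40 correct malignant out of 114 test samples):

\text{accuracy} = \frac{71 + 40}{114} = \frac{111}{114} \approx 0.9737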

Full Code via Google Colab

https://colab.research.google.com/drive/1YATsBgPmBGMm9XTUufbFk0SaDywHq3h3?usp=sharing
B. SVM

Objective: This Python script demonstrates the implementation of Support Vector Machine
(SVM) using different kernels to classify iris flowers based on their features.
Name of Dataset: Iris dataset from Scikit-Learn.
Overview of Dataset: The Iris dataset contains features measured from samples of three species
of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The features include sepal length,
sepal width, petal length, and petal width. It aims to classify iris flowers into one of the three
species based on these features.

Step 1 : Importing Library and load the dataset

from sklearn import datasets


from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import load_iris

# Load Iris dataset


iris = datasets.load_iris()
X = iris.data
y = iris.target

Step 2 : Data Exploration


# Convert the dataset to a DataFrame for easier visualization
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Plot the pairplot to visualize the distribution


sns.pairplot(iris_df, hue='target', palette='Set1')
plt.show()

feature_names = iris.feature_names

# Convert the data to a pandas DataFrame


iris_df = pd.DataFrame(X, columns=feature_names)
# Calculate the correlation matrix
correlation_matrix = iris_df.corr()

# Plot the correlation matrix as a heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f",
annot_kws={"size": 10})
plt.title('Correlation Between Features')
plt.show()
Output

Based on the correlation matrix and data distribution graph generated above, it can be concluded
that sepal length has a stronger positive linear relationship with petal length (0.87) and petal
width (0.82) than with sepal width (-0.12). Petal length and petal width have the strongest
positive linear relationship (0.96) among all the features. Sepal width has only weak negative
linear relationships with the other features, for example with sepal length (-0.12) and petal width (-0.37).
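As an illustrative aside (not in the original script), the most strongly correlated pair of distinct features can also be read off programmatically from the correlation matrix computed above:

# Illustrative: find the most strongly correlated pair of distinct features
mask = ~np.eye(len(correlation_matrix), dtype=bool)           # ignore self-correlations on the diagonal
row, col = correlation_matrix.abs().where(mask).stack().idxmax()
print(row, col, round(correlation_matrix.loc[row, col], 2))   # expected: petal length / petal width, 0.96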

Given the non-linear separability visible in the Iris data distribution (pairplot) above, SVM
kernels will be employed for further analysis.

Step 3 : Splitting Data


80% of the data is set up for training while 20% is used for testing

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

Step 4 : Find The Best Hyperparameters For SVM Classifiers With Different Kernels

1. Linear Kernel (param_grid_linear):
o C: takes values of [0.1, 1, 10, 100], representing different levels of regularization strength.
2. Polynomial Kernel (param_grid_poly):
o C: same as in the linear kernel, controlling the regularization strength.
o gamma: takes values of [0.1, 0.01, 0.001], representing different levels of influence of individual training samples.
o degree: takes values of [2, 3, 4], representing different degrees of the polynomial function.
3. RBF Kernel (param_grid_rbf):
o C: same as in the linear and polynomial kernels, controlling the regularization strength.
o gamma: takes values of [0.1, 0.01, 0.001], controlling the width of the kernel.

In the code provided, cv=5 is used, which means the training data is divided into 5 folds. Each
parameter combination is trained and validated 5 times, resulting in 5 estimates of the model's
performance that are averaged into mean_test_score.
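For intuition, the sketch below (not part of the original script) shows what GridSearchCV does internally for a single parameter combination: it runs 5-fold cross-validation and averages the five fold scores, which is what appears as mean_test_score.

# Illustrative only: 5-fold cross-validation for one parameter setting,
# equivalent to one row of GridSearchCV's cv_results_
from sklearn.model_selection import cross_val_score

scores = cross_val_score(SVC(kernel='linear', C=1), X_train, y_train, cv=5)
print(scores)          # the five per-fold accuracies
print(scores.mean())   # should match mean_test_score for C=1 (about 0.958)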

# Define parameter grids for each kernel


param_grid_linear = {'C': [0.1, 1, 10, 100]}
param_grid_poly = {'C': [0.1, 1, 10, 100],
'gamma': [0.1, 0.01, 0.001],
'degree': [2, 3, 4]}
param_grid_rbf = {'C': [0.1, 1, 10, 100],
'gamma': [0.1, 0.01, 0.001]}

# Perform grid search with cross-validation for each kernel


svm_linear_grid = GridSearchCV(SVC(kernel='linear'), param_grid_linear,
cv=5)
svm_poly_grid = GridSearchCV(SVC(kernel='poly'), param_grid_poly, cv=5)
svm_rbf_grid = GridSearchCV(SVC(kernel='rbf'), param_grid_rbf, cv=5)

# Fit the models


svm_linear_grid.fit(X_train, y_train)
svm_poly_grid.fit(X_train, y_train)
svm_rbf_grid.fit(X_train, y_train)

# Get the mean cross-validation accuracy for each iteration of grid search
print("Linear Kernel:")
print(pd.DataFrame(svm_linear_grid.cv_results_)[['param_C',
'mean_test_score']])
print("")

print("Polynomial Kernel:")
print(pd.DataFrame(svm_poly_grid.cv_results_)[['param_C', 'param_degree',
'param_gamma', 'mean_test_score']])
print("")

print("RBF Kernel:")
print(pd.DataFrame(svm_rbf_grid.cv_results_)[['param_C', 'param_gamma',
'mean_test_score']])
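The "Best Parameters" and "Best Score" lines in the output below were presumably produced with GridSearchCV's best_params_ and best_score_ attributes, along the following lines:

# Report the best hyperparameters and best mean CV score found by each grid search
print("Best Parameters (Linear Kernel):", svm_linear_grid.best_params_)
print("Best Score (Linear Kernel):", svm_linear_grid.best_score_)
print("Best Parameters (Polynomial Kernel):", svm_poly_grid.best_params_)
print("Best Score (Polynomial Kernel):", svm_poly_grid.best_score_)
print("Best Parameters (RBF Kernel):", svm_rbf_grid.best_params_)
print("Best Score (RBF Kernel):", svm_rbf_grid.best_score_)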
Output
Linear Kernel:
param_C mean_test_score
0 0.1 0.941667
1 1 0.958333
2 10 0.950000
3 100 0.950000

Polynomial Kernel:
param_C param_degree param_gamma mean_test_score
0 0.1 2 0.1 0.950000
1 0.1 2 0.01 0.441667
2 0.1 2 0.001 0.441667
3 0.1 3 0.1 0.958333
4 0.1 3 0.01 0.425000
5 0.1 3 0.001 0.425000
6 0.1 4 0.1 0.941667
7 0.1 4 0.01 0.441667
8 0.1 4 0.001 0.408333
9 1 2 0.1 0.958333
10 1 2 0.01 0.883333
11 1 2 0.001 0.441667
12 1 3 0.1 0.950000
13 1 3 0.01 0.841667
14 1 3 0.001 0.425000
15 1 4 0.1 0.933333
16 1 4 0.01 0.816667
17 1 4 0.001 0.408333
18 10 2 0.1 0.950000
19 10 2 0.01 0.950000
20 10 2 0.001 0.441667
21 10 3 0.1 0.933333
22 10 3 0.01 0.958333
23 10 3 0.001 0.425000
24 10 4 0.1 0.941667
25 10 4 0.01 0.925000
26 10 4 0.001 0.408333
27 100 2 0.1 0.950000
28 100 2 0.01 0.958333
29 100 2 0.001 0.883333
30 100 3 0.1 0.941667
31 100 3 0.01 0.958333
32 100 3 0.001 0.425000
33 100 4 0.1 0.941667
34 100 4 0.01 0.950000
35 100 4 0.001 0.408333

RBF Kernel:
param_C param_gamma mean_test_score
0 0.1 0.1 0.900000
1 0.1 0.01 0.466667
2 0.1 0.001 0.466667
3 1 0.1 0.950000
4 1 0.01 0.908333
5 1 0.001 0.466667
6 10 0.1 0.950000
7 10 0.01 0.950000
8 10 0.001 0.916667
9 100 0.1 0.950000
10 100 0.01 0.958333
11 100 0.001 0.950000
Best Parameters (Linear Kernel): {'C': 1}
Best Score (Linear Kernel): 0.9583333333333334
Best Parameters (Polynomial Kernel): {'C': 0.1, 'degree': 3, 'gamma':
0.1}
Best Score (Polynomial Kernel): 0.9583333333333334
Best Parameters (RBF Kernel): {'C': 100, 'gamma': 0.01}
Best Score (RBF Kernel): 0.9583333333333334

Observations:

All three kernel types (Linear, Polynomial, and RBF) achieved the same best cross-validation
accuracy of 95.83%. This suggests that, for the Iris dataset, the choice of kernel might not
have a significant impact on SVM classification performance.

According to the decision boundary plots for the linear and RBF SVMs (in the full notebook; a
sketch of how to produce them is given below), the boundaries are close to straight lines,
indicating reasonable classification. The polynomial kernel produces more complicated decision
boundaries.

Given that Iris is a small dataset, the linear kernel seems like a good choice due to its
simplicity. It achieves high accuracy and avoids the potential for overfitting that can
occur with more complex kernels like Polynomial and RBF.
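The decision boundary figures themselves are not reproduced in this document. A minimal sketch of how such plots could be generated is shown below; it trains on only the first two Iris features so the boundaries can be drawn in 2-D, and the feature choice and plotting details are illustrative assumptions rather than the exact code used in the notebook.

# Illustrative sketch: 2-D decision boundaries for the three kernels,
# trained on the first two Iris features only (sepal length, sepal width)
X2 = X[:, :2]
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 300),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 300))
kernels = {'linear': SVC(kernel='linear', C=1),
           'poly': SVC(kernel='poly', C=0.1, gamma=0.1, degree=3),
           'rbf': SVC(kernel='rbf', C=100, gamma=0.01)}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, clf) in zip(axes, kernels.items()):
    clf.fit(X2, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='Set1')                       # decision regions
    ax.scatter(X2[:, 0], X2[:, 1], c=y, cmap='Set1', edgecolor='k', s=20)
    ax.set_title(name + ' kernel')
plt.show()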

Full Code via Google Colab

https://colab.research.google.com/drive/1MVm5P9lPHF9ukZMZPHxrdmvWSnemBDoC?usp=sharing
C. Logistic Regression

Title: Python Script: Logistic Regression for Classifying Handwritten Digits


Objective: This Python script showcases the implementation of logistic regression to classify
handwritten digits using the digit dataset from Scikit-Learn.
Dataset: The digit dataset consists of images of handwritten digits (0 through 9). Each image is
represented as an array of pixel values, where each pixel represents the grayscale intensity. The
objective is to classify these images into one of the ten digits based on their pixel values.

Step 1 : Importing Library and load the dataset


The script begins by importing the required libraries. These include scikit-learn modules
for loading the dataset, splitting the data, preprocessing, and evaluating the model. The
script loads the digit dataset using the load_digits function from scikit-learn. This dataset
contains images of handwritten digits along with their corresponding labels.

# Import necessary libraries

from sklearn.datasets import load_digits


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the digits dataset


digits = load_digits()

Step 2 : Split the Dataset:


The dataset is split into training and testing sets using the train_test_split function: 80% is used
for training and 20% for testing.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(digits.data,
digits.target, test_size=0.2, random_state=42)

Step 3 : Preprocessing to Standardize the Features:


Feature scaling is performed to standardize the features, ensuring that each feature has a
mean of 0 and a standard deviation of 1. This is done using the StandardScaler class.

# Standardize the features


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
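As a quick sanity check (illustrative, not part of the original script), the scaled training features should now have a mean of approximately 0 and a standard deviation of approximately 1:

# Illustrative check: scaled training features should have mean ~0 and std ~1
print(X_train_scaled.mean(axis=0).round(3))   # values close to 0
print(X_train_scaled.std(axis=0).round(3))    # values close to 1 (0 for constant pixels)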
Step 4 : Train the model and predict on test data
The logistic regression model is trained on the training data using the fit() method, and the labels
for the test data are predicted using the predict() method.

# Initialize the logistic regression model


logistic_regression = LogisticRegression(max_iter=1000)

# Train the model


logistic_regression.fit(X_train_scaled, y_train)

# Predict on the test set


y_pred = logistic_regression.predict(X_test_scaled)

Step 5 : Evaluate Model Performance and Visualize the Confusion Matrix:


The performance of the model is evaluated using metrics such as the accuracy score and a
classification report. Finally, a visually readable confusion matrix is generated using seaborn to
assess the model's classification performance.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report


print(classification_report(y_test, y_pred))

import matplotlib.pyplot as plt


import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrix


conf_matrix = confusion_matrix(y_test, y_pred)

# Plot confusion matrix using seaborn


plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
Output
Accuracy: 0.9722222222222222
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        33
           1       0.97      1.00      0.98        28
           2       1.00      1.00      1.00        33
           3       0.97      0.97      0.97        34
           4       1.00      0.98      0.99        46
           5       0.94      0.94      0.94        47
Observation
The overall result shows 97.2% accuracy. The model performed well on classifying the digits 0, 2, 4,
and 7; there were no instances where the model predicted these digits incorrectly. The model had
slight difficulty classifying digits 1, 3, 5, 6, 8, and 9: there were a few instances where these
digits were classified incorrectly as other digits. Overall, the model seems to be performing well,
but there is room for improvement in classifying digits 1, 3, 5, 6, 8, and 9.
Full Code via Google Colab
https://colab.research.google.com/drive/1_eXsElMEFZmOAkag6LhdP-1K2sCrHhWM?usp=sharing

D. Ensemble

Title: Ensemble Method vs Single Classifier for Heart Disease Prediction


Objective: This Python script compares the performance of an ensemble method with a single
classifier (Support Vector Machine) for predicting the presence or absence of heart disease using
the UCI Statlog (Heart) dataset.
Dataset: The UCI Statlog (Heart) dataset contains various attributes related to heart health,
including age, gender, chest pain type, blood pressure, cholesterol level, and more.
Target variable: "Presence of Heart Disease", a binary variable indicating whether the
individual has heart disease (1) or not (0).
Step 1 : Importing Library and load the dataset

Loads the Heart dataset from the openml repository into the variable data.

from sklearn.datasets import fetch_openml


from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Heart dataset


data = fetch_openml(name='heart-statlog', version=1, as_frame=True)
X = data.data
y = data.target

Step 2 : Pre-Process Data


X = X.dropna(): drops rows with missing values from the feature matrix X. This is a simple way
to handle missing values (it works here because this dataset has no missing rows; otherwise the
corresponding rows of y would also need to be dropped to keep X and y aligned).
The dataset is then split into training and testing sets: 80% of the data is used for training (X_train
and y_train), and 20% is used for testing (X_test and y_test).

# Data preprocessing
# For simplicity, let's handle missing values by dropping them
X = X.dropna()

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

Step 3 : Create Ensemble Model and Single classifier. Evaluate accuracy

# Instantiate SVM classifier (single classifier)


svm_clf = SVC(kernel='linear', probability=True, random_state=42)

# Fit and evaluate SVM Classifier


svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Classifier Accuracy:", accuracy_svm)
# Instantiate classifiers
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(kernel='linear', probability=True, random_state=42)
lr_clf = LogisticRegression(max_iter=1000, random_state=42)

# Create ensemble model using VotingClassifier


ensemble_model = VotingClassifier(estimators=[('rf', rf_clf), ('svm',
svm_clf), ('lr', lr_clf)], voting='soft')

# Fit and evaluate ensemble model


ensemble_model.fit(X_train, y_train)
y_pred_ensemble = ensemble_model.predict(X_test)
accuracy_ensemble = accuracy_score(y_test, y_pred_ensemble)
print("Ensemble Model Accuracy:", accuracy_ensemble)

Output

SVM Classifier Accuracy: 0.8703703703703703


Ensemble Model Accuracy: 0.9074074074074074

Observation

The accuracy results show that the ensemble model outperformed the Support Vector Machine
(SVM) classifier. The ensemble model achieved a higher accuracy by combining the predictions
of multiple classifiers (Random Forest, SVM, Logistic Regression) using the soft voting strategy.
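For reference, "soft" voting means the ensemble averages the class probabilities predicted by each base classifier and picks the class with the highest average probability. A minimal sketch of this behaviour is shown below (illustrative only; it fits the three base classifiers separately, which the original script does not do, since VotingClassifier fits its own clones internally):

# Illustrative: what VotingClassifier(voting='soft') does internally
import numpy as np

for clf in (rf_clf, svm_clf, lr_clf):        # fit each base classifier separately
    clf.fit(X_train, y_train)

avg_proba = np.mean([clf.predict_proba(X_test) for clf in (rf_clf, svm_clf, lr_clf)],
                    axis=0)                  # average the predicted class probabilities
manual_vote = rf_clf.classes_[np.argmax(avg_proba, axis=1)]   # pick the most probable class
print((manual_vote == y_pred_ensemble).mean())   # expected to be 1.0, or very close to it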

Step 4: Visualizing Confusion Matrix
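The plotting code for this step is not included in this document. A minimal sketch of how the confusion matrix described below could be produced, following the same pattern as the earlier sections, is:

# Sketch: confusion matrix for the ensemble model's predictions
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_ensemble)

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Ensemble Model')
plt.show()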

Output
Observation

• True positives (32): The model correctly predicted 32 people who have heart disease.
• False positives (1): The model incorrectly predicted that 1 person has heart disease when they actually do not.
• True negatives (17): The model correctly predicted 17 people who do not have heart disease.
• False negatives (4): The model missed predicting 4 cases of heart disease.

Overall, the model appears to be good at identifying people with heart disease with few
mistakes.

Full Code via Google Colab


https://colab.research.google.com/drive/1VJ9vN1zqw1z0lXOw-N3YQaVgs6a77rN-?usp=sharing
