Machine Learning
(3170724)
B.E. Semester 7
(Computer Engineering)
Certificate
Place: __________________
Date: __________________
Preface
The main motto of any laboratory/practical/field work is to enhance the required skills as well
as to create the ability amongst students to solve real-time problems by developing relevant
competencies in the psychomotor domain. Keeping this in view, GTU has designed a
competency-focused, outcome-based curriculum for engineering degree programmes in which
sufficient weightage is given to practical work. This underlines the importance of enhancing
skills amongst students and of utilising every second of the time allotted for practicals, so that
students, instructors and faculty members achieve the relevant outcomes by performing
experiments rather than by merely studying them. For effective implementation of a
competency-focused, outcome-based curriculum, it is essential that every practical is carefully
designed to serve as a tool to develop and enhance the relevant industry-required competency
in every student. These psychomotor skills are very difficult to develop through the traditional
chalk-and-board method of content delivery in the classroom. Accordingly, this lab manual is
designed to focus on industry-defined, relevant outcomes, rather than the old practice of
conducting practicals merely to prove concepts and theory.
By using this lab manual, students can go through the relevant theory and procedure in
advance of the actual performance, which creates interest and gives them a basic idea prior to
the performance. This, in turn, enhances the pre-determined outcomes amongst students.
Each experiment in this manual begins with the competency, industry-relevant skills, course
outcomes and practical outcomes (objectives). Students are also made aware of the safety
measures and necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members to facilitate student-centric lab
activities through each experiment, by arranging and managing the necessary resources so
that students follow the procedures with the required safety and necessary precautions to
achieve the outcomes. It also gives an idea of how students will be assessed, by providing
rubrics.
Machine Learning is a fundamental course that deals with techniques for building models that
learn from data. It provides a platform for students to work with statistical measures of data,
dimensionality reduction, and similarity and dissimilarity measures. Students also learn to
build and evaluate regression, classification and clustering models, including linear and
logistic regression, k-NN, decision trees, neural networks and K-means clustering.
Utmost care has been taken while preparing this lab manual; however, there is always scope
for improvement. We therefore welcome constructive suggestions for improvement and for
the removal of any errors.
Machine Learning (3170724)
Index
(Progressive Assessment Sheet)
Sr. No.   Objective(s) of Experiment   Page No.   Date of performance   Date of submission   Assessment Marks   Sign. of Teacher with date   Remarks
1 Find statistical measures such as Mean,
Median and Mode of the given data
2 Find statistical measures such as Standard
Deviation and Variance of the given data
3 Implement program to perform dimension
reduction of the high dimension data
4 Implement program to understand
similarity measure and dissimilarity
measures
5 Implement Linear Regression model and
evaluate model performance
6 Implement Logistic Regression model and
evaluate model performance
7 Implement k-NN classifier to classify the
flower species from IRIS dataset
8 Implement Decision tree classifier and test
its performance
9 Implement program to demonstrate Neural
Network Classifier
10 Write a program to demonstrate within
class scatter, between class scatter and
total scatter of the dataset
11 Write a program to demonstrate clustering
using K-means algorithm
Total
EXPERIMENT NO: 0
3. To provide affordable quality professional education with moral values, equal
opportunities, accessibility and accountability.
4. To allocate competent and dedicated human resources and infrastructure to the
institutions for providing world-class professional education to become a Global
Leader ("Vishwa Guru").
Institute’s Vision:
To transform the students into good human beings, employable engineering
graduates and continuous learners by inculcating human values and imparting
excellence in technical education.
Institute’s Mission:
To impart education to rural and urban students so that they earn respect from society,
improve the living standards of their families and become assets for industry and society.
To foster a learning environment with technology integration and individual attention, so
that students imbibe quality technical knowledge, skill development and character
building.
Program Outcome:
Experiment No: 1
Find statistical measures such as Mean, Median and Mode of the given
data
Date:
Relevant CO: 1, 2
Objectives:
1. To understand basic statistical properties of data like mean, median and mode
Theory:
Statistical properties tell us a lot about a dataset; computing them is the first step of any data analysis task.
Mean: We can compute the mean for only numeric data. The average value of a group of
numbers is referred to as the mean of those numbers. A data set's mean can be
calculated by first adding up all of the values in the set and then dividing that sum by the
total number of values.
Let X = <x1, x2, …, xn> be the vector of n numbers. The following formula can be used to
get the mean:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For example, for X = <22, 44, 33, 11, 55>:

$$\mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33$$
Median: When a set of data is sorted in order from lowest to highest (or highest to
lowest), the value that falls in the middle of the set is referred to as the median. When
there are an even number of values, the median is determined by taking the average of
the two values that are in the middle of the set.
To compute the median of data set X = <22, 44, 33, 11, 55>, we shall first arrange it in
ascending or descending order:
In ascending order, X = <11, 22, 33, 44, 55>. There are 5 elements in X, so the median is the
element at index 3, which is 33.
For a data set Y with 6 elements whose two middle values are 33 and 44, the median is
(33 + 44) / 2 = 38.5.
Mode: The mode of a data set is the most frequent value within it. If two or more values
occur at the same frequency, we can say that there is more than one mode. So mode is
the element in data which occurs maximum number of time.
For X = <11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11>, element 11 appears maximum
number of times (4 times), so 11 is the mode of this dataset.
For Y = <33, 66, 55, 22, 11, 33, 44, 11, 33, 55, 11>, element 11 and 33 appears maximum
number of times (3 times each), so 11 and 33 are modes of this dataset.
Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = [1, 3, 2, 4, 5, 6, 4, 3, 2, 4, 5, 3, 1, 2, 3, 2, 3, 1, 4]

# Central tendency measures
mean = np.mean(X)
median = np.median(X)
mode = stats.mode(X, keepdims=False).mode  # most frequent value (SciPy >= 1.9)

# Histogram with vertical lines marking the mean, median and mode
plt.figure(figsize=(10, 6))
plt.hist(X, bins=range(1, 8), edgecolor='black', alpha=0.7, color='skyblue')
plt.axvline(mean, color='red', linestyle='dashed', linewidth=1.5, label=f'Mean: {mean:.2f}')
plt.axvline(median, color='yellow', linestyle='dashed', linewidth=1.5, label=f'Median: {median:.2f}')
plt.axvline(mode, color='blue', linestyle='dashed', linewidth=1.5, label=f'Mode: {mode:.2f}')
plt.title('Histogram of Data X with Mean, Median, and Mode')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Implementation:
Write a program to compute mean, median and mode for following data using your
preferred programming language:
X1 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2>
X2 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000>
X3 = <1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000>
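A minimal sketch of one way to compute the three measures for X1, X2 and X3, assuming NumPy and SciPy as in the code above (on ties, stats.mode reports the smallest tied value):

import numpy as np
from scipy import stats

datasets = {
    "X1": [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2],
    "X2": [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000],
    "X3": [1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000],
}

for name, data in datasets.items():
    # stats.mode returns the smallest value when several values tie for most frequent
    mode = stats.mode(data, keepdims=False).mode
    print(f"{name}: mean={np.mean(data):.2f}, median={np.median(data)}, mode={mode}")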
Result:
Data   Mean     Median   Mode
X1     2.56     2.0      1
X2     82.48    2.0      1
X3     555.50   65.0     1
Conclusion:
In the context of machine learning, understanding these central tendencies, outliers, and
unique characteristics is crucial for effective data preprocessing, feature engineering,
and model development. X1 has a mean of 2.56, a median of 2, and a mode of 1 (the values 1
and 2 both occur seven times; the program reports the smaller one). X2 contains a high
outlier (2000) that inflates the mean to 82.48, while the median remains 2 and the mode 1.
X3 consists of larger values with no repeats, giving a mean of 555.50, a median of 65, and no
meaningful mode.
Quiz:
2. What is the mean, median and mode of the following set of data: 4, 7, 9, 9, 11,
11, 11, 13?
Ans:
• Mean (Average): (4 + 7 + 9 + 9 + 11 + 11 + 11 + 13) / 8 = 75 / 8 = 9.375
• Median: (9 + 11) / 2 = 20 / 2 = 10
• Mode: 11 occurs most frequently (three times), so the mode is 11.
3. Which measure of central tendency is preferred when the data set has extreme
values?
Ans:
• When dealing with data sets that have extreme values or outliers, the median is the
preferred measure of central tendency. It is less influenced by extreme values, providing a
more robust representation of the typical value in the dataset.
• https://codecrucks.com/mean-median-mode-variance-discovering-statistical-
properties-of-data
• https://www.techtarget.com/searchdatacenter/definition/statistical-mean-median-
mode-and-range
• https://www.statisticshowto.com/probability-and-statistics/statistics-
definitions/mean-median-mode/
• https://www.twinkl.co.in/teaching-wiki/mean-median-mode-and-range
• https://www.thoughtco.com/definition-of-bimodal-in-statistics-3126325
• https://www.diffen.com/difference/Mean_vs_Median
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 2
Find statistical measures such as Standard Deviation and Variance of the given data
Date:
Relevant CO: 1, 2
Objectives:
Theory:
Statistical properties tell us a lot about a dataset; computing them is the first step of any data analysis task.
Standard deviation: The dispersion of data relative to its mean or average is quantified
by its standard deviation. How much each data point varies from the mean is revealed. If
your data points have a low standard deviation, they cluster closely around the mean,
but if they have a large standard deviation, they are more widely dispersed.
Let X = <x1, x2, …, xn> be the vector of n numbers. The following formula can be used to
get the standard deviation:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For X = <22, 44, 33, 11, 55>:

$$\mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33$$

$$\sigma = \sqrt{\frac{(22-33)^2 + (44-33)^2 + (33-33)^2 + (11-33)^2 + (55-33)^2}{5}} = \sqrt{\frac{1210}{5}} = \sqrt{242} \approx 15.56$$
Variance: The variance of a data collection is another indicator of its dispersion around
the mean. It is calculated as the mean of the squared deviations from the mean, and the
standard deviation is its square root.
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

For X = <22, 44, 33, 11, 55>:

$$\mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33$$

$$\sigma^2 = \frac{(22-33)^2 + (44-33)^2 + (33-33)^2 + (11-33)^2 + (55-33)^2}{5} = \frac{1210}{5} = 242$$
Set up diagram: Plot the histogram for the data X = <1, 3, 2, 4, 56, 4, 3, 2, 4, 5, 3, 1, 2, 3,
2, 3, 1, 4> and show standard deviation and variance in the histogram.
Implementation:
Write a program to compute standard deviation and variance for following data using
your preferred programming language:
X1 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2>
X2 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000>
X3 = <1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000>
Code:
import numpy as np
import matplotlib.pyplot as plt

X = [1, 3, 2, 4, 56, 4, 3, 2, 4, 5, 3, 1, 2, 3, 2, 3, 1, 4]

# Dispersion measures
std_dev_X = np.std(X)
variance_X = np.var(X)

# Histogram of X annotated with the standard deviation and variance
plt.figure(figsize=(10, 6))
plt.hist(X, bins=10, edgecolor='black', alpha=0.7, color='skyblue')
plt.title('Histogram of Data X with Standard Deviation and Variance')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.text(45, 5, f'Standard Deviation: {std_dev_X:.2f}\nVariance: {variance_X:.2f}',
         bbox=dict(facecolor='white', alpha=0.5))
plt.show()
X1 = [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2]
X2 = [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000]
X3 = [1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000]
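# A straightforward completion (assumed): print the population standard
# deviation and variance of each dataset defined above.
for name, data in [("X1", X1), ("X2", X2), ("X3", X3)]:
    print(f"{name}: std = {np.std(data):.2f}, variance = {np.var(data):.2f}")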
Results:
Data   Standard Deviation   Variance
X1     1.36                 1.85
X2     391.41               153205.29
X3     932.67               869870.92
Conclusion:
Quiz:
2. If the standard deviation of a set of data is very low, what does that tell you
about the data points?
Ans:
• A very low standard deviation in a dataset indicates that the data points are highly
consistent and closely clustered around the mean.
• This suggests little variation or spread among the data points, making them highly
predictable.
3. If a set of data has a high standard deviation, what does that tell you about the
spread of the data points?
Ans:
• A high standard deviation in a dataset signifies that the data points are spread out
widely and lie far from the mean.
• This indicates substantial variability and lower predictability in the data, which may
reflect outliers, heterogeneous groups, or noisy measurements.
Suggested Reference:
• https://codecrucks.com/mean-median-mode-variance-discovering-statistical-
properties-of-data
• https://www.investopedia.com/ask/answers/021215/what-difference-between-
standard-deviation-and-variance.asp
• https://www.mathsisfun.com/data/standard-deviation.html
• https://www.cuemath.com/data/variance-and-standard-deviation/
References used by the students:
• https://www.statology.org/what-is-a-low-standard-deviation/
• https://www.scribbr.com/statistics/standard-deviation/
Rubric wise marks obtained:
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 3
Implement program to perform dimension reduction of the high dimension data
Date:
Relevant CO: 1, 4
Objectives:
Theory:
High dimensionality:
The term "high-dimensional data" refers to data sets that contain a significant number
of distinct features. When dealing with high-dimensional data, it can be challenging to
visualize and interpret the information because the human brain is only able to process
information in a limited number of dimensions. For instance, a data collection
containing 1000 features could be interpreted and displayed as a space with 1000
dimensions, which would be impossible to visualize.
Dealing with data that has a high dimension can be difficult due to the possibility that
conventional statistical approaches will not function well in the situation. The
phenomenon known as the "curse of dimensionality" occurs as the number of
dimensions in a problem rises, causing the amount of data that must be collected to
keep a specific level of statistical accuracy to increase at an exponential rate.
In addition, methods of data visualization such as scatter plots, heat maps, and parallel
coordinates can be utilized in order to assist with the comprehension of high-
dimensional data. It is essential to select the methods that are suitable for the data that
you are working with and the research issue that you are attempting to answer.
Principal component analysis, sometimes known as PCA, is a method that can be used in
statistics to reduce the dimensions of a data set. It involves transforming a high-dimensional
data collection into a lower-dimensional space while trying to keep as much of the
information as possible.
The principle behind principal components analysis (PCA) is to determine the ways in
which the data varies the most in order to determine which patterns or features in the
data are the most significant. Principal components are another name for these different
directions. The first principal component (PC) accounts for the greatest amount of
variance in the data, and each successive PC accounts for as much of the remaining
variation as it is possible while adhering to the requirement that it is orthogonal
(perpendicular) to the PCs that came before it.
• Standardize the data: Principal component analysis (PCA) functions most well
when the data are standardized, which means that each feature has a mean of
zero and a variance of one.
• Perform the computation that will produce the covariance matrix. This matrix
measures the linear relationship that exists between each pair of features in the
data.
• Choose the k most significant eigenvectors: These are the top k principal
components, which are responsible for capturing the most variation in the data.
• Change the format of the data: In order to obtain a representation of the data
with fewer dimensions, multiply the standardized data by the matrix that
contains the top k eigenvectors.
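A minimal NumPy sketch of the four steps listed above (standardisation, covariance matrix, eigen-decomposition, projection), shown as an illustration; the experiment itself uses scikit-learn's PCA.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                      # shape (150, 4)

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; keep the k eigenvectors with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
k = 2
top_k = eigvecs[:, order[:k]]

# 4. Project the data onto the top-k principal components
X_reduced = X_std @ top_k
print(X_reduced.shape)                    # (150, 2)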
Implementation:
Load Iris flower dataset, study it. Reduce the dimension of dataset to 2 by applying PCA.
Step up plot: Draw plot of all three classes in dataset with 2 dimensions only, after
applying PCA
Code:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Reduce the four features to two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Build a DataFrame for plotting
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df_pca['target_name'] = [target_names[i] for i in y]

plt.figure(figsize=(10, 7))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='target_name', palette='Set1', s=100)
plt.show()
Conclusion:
Applying Principal Component Analysis (PCA) to the Iris dataset and visualizing it in
two dimensions reveals clear separation among the three classes of Iris flowers (Setosa,
Versicolor, and Virginica). Setosa forms a distinct cluster, while Versicolor and Virginica,
though distinguishable, show some overlap. Reducing the dimensionality to 2D
enhances data visualization and class separability.
Quiz:
1. What are some alternatives to PCA for dimensionality reduction and feature extraction?
Ans:
• There are several alternatives to Principal Component Analysis (PCA) for feature
extraction and dimensionality reduction, each with its own strengths and
weaknesses.
• Independent Component Analysis (ICA): ICA assumes that the observed data is a
mixture of statistically independent source signals and tries to recover those sources.
2. What are some real-world applications of PCA?
Ans:
• Image Compression: PCA can be used to reduce the dimensionality of images
while retaining important features.
• Face Recognition: In facial recognition systems, PCA can be applied to extract
essential facial features and reduce the dimensionality of the data, making it easier
to compare and recognize faces.
• Genomic Data Analysis: In bioinformatics, PCA can identify patterns in gene
expression data, helping researchers discover relationships between genes and
their functions.
• Environmental Data Analysis: PCA is used to reduce the dimensionality of
environmental datasets, such as climate data, to uncover trends and patterns in the
data.
• Speech Recognition: PCA can be applied to reduce the dimensionality of audio data
and improve the efficiency of speech recognition systems.
• Recommendation Systems: In recommendation engines, PCA can help identify
latent factors that influence user preferences, improving personalized
recommendations.
3. How does the choice of scaling method impact the results of PCA?
Suggested Reference:
● https://codecrucks.com/question/machine-learning-question-set-5/
● https://builtin.com/data-science/step-step-explanation-principal-component-
analysis
● https://towardsdatascience.com/a-one-stop-shop-for-principal-component-
analysis-5582fb7e0a9c
● https://www.turing.com/kb/guide-to-principal-component-analysis
● https://medium.com/all-about-ml/understanding-principal-component-analysis-
pca-556778324b0e
● https://www.simplilearn.com/tutorials/machine-learning-tutorial/principal-
component-analysis
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 4
Implement program to understand similarity measure and dissimilarity measures
Date:
Relevant CO: 2
Objectives:
Theory:
Similarity measures:
A numerical value that shows how similar or alike two things or entities are with
respect to specific features or characteristics is referred to as a similarity measure. It is
possible to calculate the similarity measure based on a variety of metrics or distances
between the qualities of the objects or entities being compared. This allows the measure
to be used to quantify the degree to which two things or entities resemble one another.
In the fields of machine learning and data analysis, similarity measures are frequently
utilized for the purpose of carrying out tasks such as clustering, classification, and the
development of recommendation systems. For instance, the similarity measure is used
in clustering to group together objects or entities that are similar to one another,
whereas in recommendation systems, the similarity measure is used to identify things
that are similar to those that a user has liked or purchased in the past.
A. Cosine Similarity:
The cosine similarity is a measure of similarity that is utilized for the purpose of
determining the degree to which two non-zero vectors of an inner product space are
similar to one another. It is usual practice in the fields of machine learning and data
analysis to utilize cosine similarity to evaluate the degree of similarity between two
documents or texts. However, cosine similarity may also be utilized to evaluate the
degree of similarity between any two vectors.
$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}$$

where A and B are two vectors, A · B is the dot product of A and B, and ||A|| and ||B|| are the
respective magnitudes of A and B.
Dissimilarity Measure:
A dissimilarity measure is a mathematical formula that expresses how far apart two
items, entities, or observations are from one another in a dataset. Algorithms for
machine learning, pattern recognition, data analysis, and clustering frequently employ
dissimilarity metrics.
A. Euclidean distance:
The Euclidean distance is a measure of distance that can be used to determine how far
apart two points are in the space defined by Euclidean geometry. In the fields of
machine learning and data analysis, the Euclidean distance is a measurement that is
frequently used to determine how similar or unlike two numerical vectors are to one
another.
The following formula is used to determine the Euclidean distance between two
locations in a space of n dimensions:
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

Where A = {a1, a2, …, an} and B = {b1, b2, …, bn} are two feature vectors of dimension n.
Regardless of the number of dimensions involved, the Euclidean distance can be utilized
to determine the separation that exists between any two numerical vectors. In the fields
of machine learning and data analysis, one of the most common applications of
Euclidean distance is in the performance of tasks such as grouping, classification, and
regression. However, the Euclidean distance might not always be the best distance
measure to use for certain kinds of data or applications, because it treats every dimension as
equally scaled and ignores correlations between features. In such scenarios, alternative
distance measures, like the Manhattan distance or the Mahalanobis distance, might be more
appropriate.
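As a brief illustration of the Manhattan distance mentioned above, a minimal NumPy sketch using the same vectors A and B as in the code below:

import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 3, 5, 7, 9, 7, 5, 3, 1, 0])

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan_distance = np.sum(np.abs(A - B))
print("Manhattan Distance between A and B:", manhattan_distance)  # 36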
Implementation:
Set up diagram:
Draw the diagrams geometrically explaining cosine similarity and Euclidian distance
Code:
import numpy as np
import matplotlib.pyplot as plt

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 3, 5, 7, 9, 7, 5, 3, 1, 0])

# Cosine similarity
dot_product = np.dot(A, B)
norm_A = np.linalg.norm(A)
norm_B = np.linalg.norm(B)
cosine_similarity = dot_product / (norm_A * norm_B)
print("Cosine Similarity between A and B:", cosine_similarity)

# Euclidean distance
euclidean_distance = np.sqrt(np.sum((A - B) ** 2))
print("Euclidean Distance between A and B:", euclidean_distance)

# Geometric illustration (only the first two components of each vector are drawn)
X, Y = A, B
plt.figure(figsize=(12, 5))

# Left plot: both vectors drawn from the origin; the angle between them illustrates cosine similarity
plt.subplot(1, 2, 1)
plt.quiver(0, 0, X[0], X[1], angles='xy', scale_units='xy', scale=1, color='b', label='X')
plt.quiver(0, 0, Y[0], Y[1], angles='xy', scale_units='xy', scale=1, color='r', label='Y')
plt.xlim(0, 11)
plt.ylim(0, 11)
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.title(f"Cosine Similarity: {cosine_similarity:.2f}")

# Right plot: the difference vector Y - X; its length illustrates the Euclidean distance
plt.subplot(1, 2, 2)
plt.quiver(0, 0, X[0], X[1], angles='xy', scale_units='xy', scale=1, color='b', label='X')
plt.quiver(X[0], X[1], Y[0] - X[0], Y[1] - X[1], angles='xy', scale_units='xy', scale=1, color='r',
           label='Y - X')
plt.xlim(0, 11)
plt.ylim(0, 11)
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.title(f"Euclidean Distance: {euclidean_distance:.2f}")

plt.tight_layout()
plt.show()
Results:
Measure     Vectors   Value
Cosine      (A, B)    0.67
Cosine      (X, Y)    0.67
Euclidean   (A, B)    14.97
Euclidean   (X, Y)    14.97
Conclusion:
Implementing a program to work with similarity and dissimilarity measures is a fundamental
step in various data analysis and machine learning tasks. It provides insights into data
relationships, helps in grouping and classification, and has broad applications across different
domains. The choice of the appropriate measure is crucial, and evaluating the program's
results is essential for its effectiveness.
Quiz:
Suggested Reference:
• https://codecrucks.com/distance-and-similarity-measures-for-machine-
learning/
• https://www.sciencedirect.com/topics/computer-science/cosine-similarity
• https://medium.datadriveninvestor.com/cosine-similarity-cosine-distance-
6571387f9bf8
• https://www.cuemath.com/euclidean-distance-formula/
• https://www.engati.com/glossary/euclidean-distance
References used by the students:
• https://online.stat.psu.edu/stat508/lesson/1b/1b.2/1b.2.1
• https://www.scaler.com/topics/measures-of-similarity-and-dissimilarity/
Rubric wise marks obtained:
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 5
Implement Linear Regression model and evaluate model performance
Date:
Relevant CO: 3, 5
Objectives:
Theory:
What is Regression?
Modelling the relationship between a dependent variable (also known as the response or
goal variable) and one or more independent variables (also known as predictors or
features) is the purpose of the statistical method known as regression. Regression can be
used to model this relationship. The objective of regression analysis is to locate the line
or curve that provides the most accurate description of the connection between the
variables being studied.
To create predictions or get an estimate of the value of the dependent variable based on
the values of the independent variables, regression analysis is a technique that is
frequently utilized in a variety of sectors, including finance, economics, the social
sciences, and engineering, amongst others.
Simple linear regression is built on the premise that there is a linear connection between X
and Y, and it attempts to determine
the equation of a straight line that most accurately depicts this connection.
The equation for a straightforward linear regression model can be written as follows:
𝑦̂𝑖 = 𝑤0 + 𝑤1 𝑥𝑖
where 𝑦̂𝑖 represents the predicted value for input independent variable x. 𝑤0 and 𝑤1
indicates Y intercept and slope of the predictor line respectively.
The objective of simple linear regression is to determine the values of w0 and w1 that
minimise the sum of squared errors between the predicted values of Y and the actual values
of Y. This is commonly done using the method of least squares.
It is necessary to make sure that the assumptions of the model are met and that the
model is appropriate for the data that is being analyzed. Simple linear regression is a
strong and extensively used technique for analyzing the relationship between two
variables. However, it is also important to keep in mind that the model's assumptions
must be met.
Implementation:
Results:
Variable        Value
w0              -1.1454545454545446
w1              0.10909090909090909
Y | x = 210     21.763636363636362
Plot: Plot the given data and fit the regression line. Also show the predicted value for X = 210.
Code:
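A minimal sketch of how the least-squares coefficients can be computed, assuming the data is supplied as two NumPy arrays x and y; the arrays below are placeholder values for illustration, not the dataset used for the results above.

import numpy as np

# Placeholder data (illustrative only)
x = np.array([100, 150, 200, 250, 300], dtype=float)
y = np.array([10.0, 15.0, 20.0, 26.0, 31.0])

# Closed-form least-squares estimates for y_hat = w0 + w1 * x
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print("w0 =", w0)
print("w1 =", w1)
print("Prediction at x = 210:", w0 + w1 * 210)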
Conclusion:
• The linear regression model showed [mention whether it was successful or not]
in explaining the relationship between the independent variables and the target
variable. The choice of this model was appropriate for this dataset, given its
simplicity and interpretability.
• The linear regression model serves as a valuable starting point for
understanding and predicting the target variable. The evaluation metrics provide
insights into the model's performance and guide us in making data-driven
decisions based on the analysis.
Quiz:
1. What are the pros and cons of using KNN for classification tasks?
➢ Pros:
1. Simplicity: KNN is easy to understand and implement. It's an ideal choice
for beginners in machine learning.
2. No Training Period: KNN is a lazy learning algorithm, which means there
is no explicit training phase. The model stores the entire dataset and
makes predictions on the fly, which can be advantageous when the data is
continuously changing.
3. Non-Parametric: KNN is a non-parametric algorithm, meaning it doesn't
make strong assumptions about the underlying data distribution. It can
work well with data that doesn't adhere to specific statistical
assumptions.
4. Versatile: KNN can be used for both binary and multiclass classification
tasks. It's also adaptable for regression tasks by averaging the values of
the K nearest neighbors.
➢ Cons:
1. Computational Cost: KNN can be computationally expensive, especially
for large datasets. Predicting a new data point requires calculating
distances to all data points in the training set.
2. Sensitivity to Distance Metric: The choice of distance metric is critical in
KNN. Different distance metrics can yield different results, and selecting
the right one is often a trial-and-error process.
3. Curse of Dimensionality: KNN's performance degrades as the
dimensionality of the data increases. In high-dimensional spaces, the
nearest neighbors may not be representative, and the algorithm can
become less effective.
Suggested Reference:
• https://codecrucks.com/question/machine-learning-question-set-12/
• https://www.scribbr.com/statistics/simple-linear-regression
• https://online.stat.psu.edu/stat462/node/91/
• https://www.jmp.com/en_in/statistics-knowledge-portal/what-is-
regression.html
References used by the students:
• https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-
metrics-for-your-regression-model/
• https://machinelearningmastery.com/regression-metrics-for-machine-
learning/
Rubric wise marks obtained:
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 6
Implement Logistic Regression model and evaluate model performance
Date:
Relevant CO: 3, 5
Objectives:
Theory:
Logistic Regression:
For the purpose of converting the linear combination of the independent variables into
a probability value in the range of 0 and 1, the logistic regression model makes use of a
logistic function, which is also known as a sigmoid function. The logistic function has a
curve that is shaped like an S and may be expressed as follows. Let $z = w_0 + w_1 x$; then

$$P = \frac{e^z}{1 + e^z}$$
This will create the sigmoid curve as shown below. The appropriate threshold will
create the binary label for the test data.
[Source: wikipedia]
After the model has been fit to the data, the coefficients may be used to forecast the
likelihood of the dependent variable being equal to 1 for new observations with known
values of the independent variables. This can be done for new observations that already
have the values for the independent variables.
Logistic regression is a strong and extensively used technique for analyzing binary data;
nonetheless, it is essential to make certain that the model's assumptions are satisfied
and that the model is suitable for the data that is being analyzed.
Implementation:
• Load Iris flower data set. Divide dataset in 70-30 ratio. Use 70% data to train
logistic regression model and use 30% data to test the model performance.
Measure various performance metric such as precision, recall, F1 score, accuracy.
Also derive confusion matrix.
Results:
Set up Plot:
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the logistic regression model and predict on the test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, F1 score, accuracy and the confusion matrix
print(classification_report(y_test, y_pred, target_names=data.target_names))
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=data.target_names).plot()
plt.show()
Output:
Conclusion:
• Implementing a Logistic Regression model and evaluating its performance is an
essential task in machine learning.
Quiz:
➢ Binary logistic regression is for binary outcomes (yes/no, 0/1), while multinomial
logistic regression is for outcomes with more than two categories that are nominal
(unordered). Choose binary when you have two categories and multinomial when
you have multiple, unordered categories in your dependent variable.
3. What is overfitting, and how can you guard against it when using logistic
regression?
➢ Overfitting in logistic regression occurs when the model fits the training data too
closely, capturing noise and performing poorly on new data. To guard against
overfitting:
1. Carefully select relevant features.
2. Apply regularization techniques (L1, L2).
3. Use cross-validation to assess generalization.
4. Consider early stopping during training.
5. Simplify the model structure.
6. Increase the dataset size.
7. Balance bias and variance.
8. Tune the regularization strength.
9. Explore ensemble methods (e.g., Random Forest).
10. Use a separate validation set for evaluation.
4. What are some common performance metrics used to evaluate the accuracy of a
logistic regression model, and how do you interpret them?
➢ Accuracy: Overall proportion of correct predictions.
➢ Precision: Proportion of true positives among positive predictions.
➢ Recall (Sensitivity): Proportion of true positives among actual positives.
➢ Specificity: Proportion of true negatives among actual negatives.
➢ F1-Score: Harmonic meaning of precision and recall.
➢ ROC Curve and AUC: Evaluates model's ability to discriminate between positive
and negative cases.
➢ Log-Loss: Considers prediction confidence for accuracy.
➢ Confusion Matrix: Breakdown of true positives, true negatives, false positives, and
false negatives.
Suggested Reference:
● https://codecrucks.com/question/machine-learning-question-set-9/
● https://www.ibm.com/in-en/topics/logistic-regression
● https://towardsdatascience.com/logistic-regression-detailed-overview-
46c4da4303bc
● https://careerfoundry.com/en/blog/data-analytics/what-is-logistic-regression/
● https://www.r-bloggers.com/2015/08/evaluating-logistic-regression-models/
● https://www.hackerearth.com/practice/machine-learning/machine-learning-
algorithms/logistic-regression-analysis-r/tutorial/
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 7
Implement k-NN classifier to classify the flower species from IRIS dataset
Date:
Relevant CO: 3, 5
Objectives:
Theory:
K-Nearest Neighbours (KNN) is a supervised machine learning technique that can be used
for both classification and regression tasks.
The functionality of the method relies on locating the K data points in the training
dataset that are closest to a specific test data point. The value of K is a user-defined
parameter that specifies how many nearest neighbours should be taken into account.
After determining which K neighbours are the closest, the algorithm then produces a
forecast by picking either the target value that corresponds to the majority class (in
classification) or the mean value (in regression) of the target values of these neighbours.
For example, for the image below, the circle is the query data point. If we consider k = 3, we
inspect the nearest three data points and choose the majority class for the circle, so the class
assigned to the circle would be triangle. For k = 5, the majority class is square, so the class
assigned to the green circle would be square.
The algorithm computes the distance between the test data point and each data point in
the training dataset by employing a distance metric such as Euclidean distance or cosine
similarity. This allows it to locate the K data points that are the closest in proximity to
the test data point. The type of data being examined and the specific nature of the issue
being solved both have an impact on the distance metric that is selected.
After computing the distances, the algorithm sorts the data points in ascending order of their
distance from the test data point and then chooses the K data points that are the closest
neighbours to the test data point.
Implementation:
• Load Iris flower dataset. Use 10-fold cross validation and find accuracy of k-nn
for k = 1, 3, 5 and 7
Results:
k    Accuracy
1    0.933
3    0.977
5    0.955
7    0.933
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

train_score = []
test_score = []
neigh = np.arange(1, 50, 1)
for n in neigh:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    test_score.append(knn.score(X_test, y_test))
    train_score.append(knn.score(X_train, y_train))

plt.plot(neigh, train_score, 'o-', label="training score")
plt.plot(neigh, test_score, 'o-', label="testing score")
plt.legend()
plt.xlabel("K")
plt.ylabel("Score")
plt.title("Training vs testing score w.r.t. k")
plt.show()
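The accuracies in the results table above come from 10-fold cross-validation; a minimal sketch of that part, assuming scikit-learn's cross_val_score:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 7]:
    # 10-fold cross-validated accuracy of k-NN for each value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(f"k={k}: mean accuracy={scores.mean():.3f}")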
Conclusion:
• The k-NN classifier is a simple and effective method for classifying iris flower
species, and it can serve as a foundation for more complex classification tasks.
This project demonstrates the importance of data preprocessing, feature
selection, hyperparameter tuning, and model evaluation in machine learning.
Quiz:
1. What are the pros and cons of using KNN for classification tasks?
➢ Pros of using k-NN for classification:
1. Simplicity: k-NN is easy to understand and implement, making it a good
choice for beginners in machine learning.
2. Non-parametric: k-NN is a non-parametric algorithm, meaning it makes
no assumptions about the underlying data distribution. This makes it
suitable for a wide range of data types.
3. No Training Period: Unlike many other machine learning algorithms, k-
NN doesn't require a lengthy training period. The model stores the entire
dataset, and predictions can be made immediately.
4. Adaptability: k-NN can be used for both binary and multi-class
classification tasks, as well as regression.
➢ Cons of using k-NN for classification:
1. Computationally Expensive: As the dataset size increases, k-NN's
computational cost grows significantly because it needs to calculate
distances between the test point and all training points. This can be a
major drawback for large datasets.
2. High Memory Usage: k-NN requires storing the entire dataset, which can
be memory-intensive, especially for large datasets.
3. Choice of 'k': Selecting the right value for 'k' (the number of nearest
neighbors to consider) can be challenging. A small 'k' may lead to
overfitting, while a large 'k' may lead to underfitting.
Suggested Reference:
• https://codecrucks
• https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4
• https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-
algorithm-b3519ed98e
References used by the students:
• https://www.geeksforgeeks.org/project-knn-classifying-iris-dataset/
• https://www.analyticsvidhya.com/blog/2022/06/iris-flowers-classification-
using-machine-learning/
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 8
Implement Decision tree classifier and test its performance
Date:
Relevant CO: 3, 5
Objectives:
Theory:
The tree continues to extend its branches in a recursive manner, with each succeeding
node reflecting a choice or outcome that is more specific than the previous one and
being determined by the values of additional input variables. The very last nodes of the
tree, which are referred to as the leaves, stand in for the very last choice or result of the
procedure.
Decision trees usually classify data using "if-then" rules. For example, for the decision tree
above,
Rule 2: If salary is > 50K and company distance > 30km, then reject job
Rule 3: If salary is > 50K and company distance < 30km and yearly increment > 20%
then accept job.
Both classification and regression problems can be solved with the help of decision
trees. When solving a classification problem, the objective is to determine the category
or class into which a new observation will fall by using the values of one or more of the
variables that were input into the problem. Predicting the value of a continuous variable
based on the values of one or more input variables is the purpose of a regression
problem. There may be one or more input variables.
The decision tree algorithm is effective because it recursively divides the input space
into subsets, with each subset being determined by the values of the variables that are
entered. The input variable that offers the best split is the one that is chosen by the
algorithm. This is the variable that either maximizes the information gain or minimizes
the impurity of the subsets that are produced as a result of the split. This process is
repeated until all of the variables that were entered have been used or until a stopping
requirement (such as a maximum tree depth or a minimum number of observations per
leaf) has been satisfied, whichever comes first.
The popularity of decision trees can be attributed to the fact that they are simple to read
and visualize, as well as the fact that they are able to process categorical and continuous
input variables. On the other hand, they are susceptible to overfitting, particularly when
the tree is excessively deep or when there are an excessive number of input variables.
Several different approaches, such as pruning and ensemble methods, are some of the
potential solutions to this problem.
The building of a decision tree requires the partitioning of the input space into subsets
in a recursive manner, with the subsets being determined by the values of the variables
that are entered, until a stopping criterion is satisfied. The following are the stages
involved in the construction of a decision tree:
1. Begin with the root node: The root node represents the complete dataset, and all
of the input variables are available to be used in making decisions regarding the
splitting of the nodes.
2. The algorithm analyses each input variable and chooses the one that offers the
best split based on a given criterion (such as information gain, Gini impurity, or
the chi-squared test). The variable that is selected as the best one to split on is
referred to as the "splitting variable." The optimal split is one that either
maximizes the amount of information gained or minimizes the amount of
impurity in the subsets that are produced as a result.
3. Once the best input variable has been chosen, the dataset will be partitioned into
two or more subsets based on the possible values of that variable. This step of
the process is known as the creation of child nodes. Each individual subset can
be thought of as a child node, which serves as a fresh Launchpad for the
subsequent level of the tree.
4. Iterate over steps two and three in a recursive manner: The algorithm iterates
over steps two and three for each child node, selecting the best input variable
and creating additional child nodes until a stopping criterion is met (for example,
the maximum tree depth, the minimum number of observations per leaf, or there
is no significant improvement in the model's performance).
5. Prune the tree: once the tree has been completely built, it may be excessively
complicated and prone to overfitting. To prevent overfitting, the tree can be pruned by
eliminating branches that do not improve the model's performance on a validation set,
or by placing a complexity penalty on the tree.
6. Use the tree for prediction: after the tree has been built and pruned, a new observation
is classified by traversing the tree from the root node to the appropriate leaf node
based on the values of its input variables.
Visualizing the process of constructing a decision tree as a tree-like structure, with the
root node at the top and the leaf nodes at the bottom, is one way to represent this
process. The nodes in a decision tree represent the points at which a decision must be
made, and the branches reflect the various alternative outcomes or values that can be
obtained from the variables that are input.
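As a small numerical illustration of the entropy-based splitting criterion described above, the sketch below computes the information gain of a three-way split; the class counts correspond to the Outlook attribute of the weather dataset used in the code further below.

import numpy as np

def entropy(counts):
    # Shannon entropy of a class-count vector, in bits
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Parent node: 9 "Yes" and 5 "No" examples
parent = entropy([9, 5])

# Splitting on Outlook gives child nodes Sunny (2 Yes, 3 No),
# Overcast (4 Yes, 0 No) and Rainy (3 Yes, 2 No)
children = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(c) for c in children)
weighted = sum(sum(c) / n * entropy(c) for c in children)

print("Information gain:", parent - weighted)   # about 0.247 bits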
Implementation:
• Consider the following data set. Train a decision tree with 10 random data points and
test it with the remaining 4. Create decision trees with different parameters and test
their performance.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
# Create a DataFrame for the dataset
data = {
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot',
'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal',
'Normal', 'Normal', 'High', 'Normal', 'High'],
'Windy': [False, True, False, False, False, True, True, False, False, False, True, True, False,
True],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)
# Encoding categorical variables
df_encoded = pd.get_dummies(df, columns=['Outlook', 'Temp', 'Humidity', 'Windy'])
# Split the data into training and testing sets (10 data points for training, 4 for testing)
X_train = df_encoded.iloc[:10, 1:]
y_train = df_encoded.iloc[:10, 0]
X_test = df_encoded.iloc[10:, 1:]
y_test = df_encoded.iloc[10:, 0]
# Initialize lists to store parameter values and corresponding accuracies
max_depth_values = [1, 2, 3, 5]
accuracies = []

# Iterate through different max_depth values
for max_depth in max_depth_values:
    # Create and train a decision tree classifier with the specified max_depth
    decision_tree = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth,
                                           random_state=42)
    decision_tree.fit(X_train, y_train)

    # Test the decision tree on the test data
    y_pred = decision_tree.predict(X_test)

    # Calculate accuracy and store it in the accuracies list
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_depth={max_depth}: accuracy={accuracy:.2f}")
Conclusion:
• Implementing a Decision Tree Classifier and evaluating its performance is an
essential task in machine learning. In conclusion, this study delved into the
fundamentals of decision trees and their operation. We successfully acquired the
knowledge and skills necessary to train a decision tree classifier, a valuable tool
in machine learning. Furthermore, we assessed the performance of the classifier
through a rigorous 10-fold cross-validation process, which helps ensure the
model's robustness and reliability. This exploration provides a strong foundation
for leveraging decision trees in various data classification tasks.
Quiz:
applications.
➢ Improved Stability: Pruned trees tend to be more stable and less sensitive to
variations in the training data. A non-pruned tree can be highly sensitive to small
changes in the data, resulting in different tree structures for similar datasets.
Suggested Reference:
• https://www.ibm.com/in-en/topics/decision-trees
• https://hbr.org/1964/07/decision-trees-for-decision-making
• https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
References used by the students:
● https://hbr.org/1964/07/decision-trees-for-decision-making
● https://www.javatpoint.com/machine-learning-decision-tree-classification-
algorithm
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 9
Implement program to demonstrate Neural Network Classifier
Date:
Relevant CO: 3, 5
Objectives:
Theory:
Biological Neuron:
Biological neurons are the primary components of the nervous system in all species.
These cells are specialized in the use of electrochemical signals for processing and
communication. An axon, dendrites, and a cell body make up a neuron in a living
organism. The axon is responsible for sending messages to other neurons or muscles,
while the dendrites are responsible for receiving them.
Artificial Neuron:
Artificial neurons, also called perceptrons, are mathematical functions that simulate the
behaviour of real neurons. Artificial neural networks, composed of artificial neurons, are
employed in many AI and machine learning systems. An artificial neuron receives data as
input, processes it using a set of weights and biases, and then produces an output signal.
There are functional parallels between biological neurons and artificial neurons, but
there are also important distinctions. To modify their weights and biases, artificial
neurons, in contrast to their biological counterparts, need training data. Biological
neurons also have the ability to create new neurons and form new connections, but
artificial neurons are limited by a predetermined architecture.
The perceptron is the most elementary form of neural network, with just one layer of
output nodes that take input from several input nodes and produce a single binary value.
It was introduced by Frank Rosenblatt in 1957.
The input values for the perceptron method are multiplied by their associated weights
and then added together. The output of the perceptron is the result of this sum being fed
into an activation function.
Most perceptrons use the step activation function, which returns 1 if the weighted total of
the inputs is larger than a threshold value and 0 otherwise. The weights are initially set to
random values, and the perceptron is trained by modifying them based on the difference
between the expected and actual outputs.
Many different types of binary classification tasks, including those in the fields of image
recognition and natural language processing, have benefited from the use of perceptron.
However, they are unable to deal with situations that are more complex and would
benefit from additional layers and non-linear activation functions.
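As a small illustration of the perceptron rule described above, the sketch below applies a step activation to a weighted sum of inputs; the weights and bias are hypothetical values chosen so the unit behaves like a logical AND gate.

import numpy as np

def perceptron_output(x, weights, bias):
    # Weighted sum of inputs followed by a step activation
    z = np.dot(x, weights) + bias
    return 1 if z > 0 else 0

# Hypothetical weights/bias implementing a logical AND of two binary inputs
weights = np.array([1.0, 1.0])
bias = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron_output(np.array(x), weights, bias))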
Neural Network:
Inspired by the form and function of biological neural networks in the human brain,
Artificial Neural Networks (ANNs) are a type of machine learning method. ANNs are
made up of a network of processing nodes, or "neurons," that communicate with one
another to discover and understand hidden correlations and patterns in data.
Each neuron in an ANN is equipped with a mathematical function that takes as its input
signals from other neurons or from the outside world and generates an output signal
based on the processed data. Each neuron's output signals are propagated to nearby
neurons, creating a distributed system of processors.
In order to increase their predictive or classifying abilities during training, ANNs tweak
the strengths of the connections between neurons. To do so, the network's predictions
are compared to the actual values of the target variable, and the resulting "cost
function" is minimized.
Image and speech recognition, NLP, predictive analytics, and robotics are just few of the
many areas where ANNs find widespread use. They excel at activities with non-linear,
intricate interactions between input and output variables.
Implementation:
• Use 70% of Iris flower dataset to train neural network model. Test it with remaining
30% data and measure the accuracy.
• Try different architectures and training functions and also note down performance
of each.
Code:
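A minimal sketch matching the implementation steps above, assuming scikit-learn's MLPClassifier; the hidden-layer sizes and solvers tried are illustrative choices, not prescribed by the manual.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try a few architectures and training functions (solvers) and record the accuracy of each
for hidden in [(5,), (10,), (10, 5)]:
    for solver in ['adam', 'lbfgs', 'sgd']:
        model = MLPClassifier(hidden_layer_sizes=hidden, solver=solver,
                              max_iter=2000, random_state=42)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"hidden={hidden}, solver={solver}: accuracy={acc:.3f}")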
Results:
Conclusion:
• In conclusion, this study aimed to explore the fundamental distinctions between
biological neural networks and artificial neural networks. We delved into the
mathematical underpinnings of artificial neural networks, gaining insights into how they
learn from data, and evaluated a neural network classifier on the Iris dataset.
Quiz:
1. What are the different types of neural networks and their applications?
➢ Feedforward Neural Networks (FNN or FFNN):
Applications: Feedforward neural networks are general-purpose and can be
applied to a wide range of tasks, including regression, classification, function
approximation, and more. They are commonly used in image and text
classification, financial forecasting, and speech recognition.
➢ Convolutional Neural Networks (CNN):
Applications: CNNs are widely used in image and video analysis tasks, including
image classification, object detection, image segmentation, facial recognition,
and medical image analysis. They can also be applied to natural language
processing for tasks like text classification.
➢ Recurrent Neural Networks (RNN):
Applications: RNNs are suitable for sequential data, such as time series
forecasting, natural language processing (NLP) tasks (e.g., language modeling,
machine translation), and speech recognition. Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) variations of RNNs are often used
for improved performance.
➢ Gated Recurrent Unit (GRU):
Applications: Like LSTMs, GRUs are used in NLP tasks and speech recognition. They
are computationally more efficient than LSTMs and are suitable for simpler sequential
tasks.
➢ Autoencoders:
Applications: Autoencoders are used for dimensionality reduction, feature
learning, and data denoising. Variational Autoencoders (VAEs), a variant of
autoencoders, are used for generating new data points in a structured and
meaningful way.
Suggested Reference:
● https://www.ibm.com/in-en/topics/neural-networks#What%20is%20a%20neural%20network?
● https://towardsdatascience.com/a-beginner-friendly-explanation-of-how-
neural-networks-work-55064db60df4
● https://aws.amazon.com/what-is/neural-network/
● https://wiki.pathmind.com/neural-network
● https://www.javatpoint.com/artificial-neural-network
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 10
Write a program to demonstrate within class scatter, between class scatter and total scatter
of the dataset
Date:
Relevant CO: 1, 2
Objectives:
2. To learn to compute and compare within class, between class and total scatter of
dataset.
Theory:
Scatter:
Scatter means dispersion in a dataset. It describes how close to, or how far from, the mean of
the dataset the data points lie. It is useful in clustering.
The variance or standard deviation of a dataset can be calculated to get a sense of its
dispersion. Both show how far the individual values deviate from the mean or average.
The standard deviation is the square root of the variance, while variance is the average
of the squared differences between each data point and the mean. If the data points are
more widely scattered, then the variance or standard deviation will be larger, and vice
versa if they are more tightly packed around the mean.
Within-class scatter:
The sum of squared distances between each data point in a cluster and the cluster
center can be used to determine the within-class scatter for that cluster. The
corresponding formula is as follows:
$$S_W = \sum_{i=1}^{C} \sum_{x \in \omega_i} (x - m_i)(x - m_i)^T$$

Where,
• C is the number of classes
• ω_i is the set of data points belonging to class i
• m_i is the mean vector of class i
Lower values of the within-class scatter indicate more tightly packed and clearly
defined clusters, making it a useful metric for judging the efficacy of various clustering
algorithms. Fisher's linear discriminant analysis (LDA), which aims to maximize the
ratio of the between-class scatter to the within-class scatter, is one example of a feature
selection and dimensionality reduction technique that makes use of this property.
To determine the dispersion between classes, we can add up the squared differences
between the centers of each cluster and the overall mean or centroid of the data. The
corresponding formula is as follows:
$$S_B = \sum_{i=1}^{C} N_i\, (m_i - m)(m_i - m)^T$$

Where,
• N_i is the number of samples in class i
• m_i is the mean vector of class i
• m is the overall mean of the data
The between-class scatter can be used in feature selection and dimensionality reduction
techniques such as Fisher's linear discriminant analysis (LDA), which seeks to maximize
the ratio of the between-class scatter to the within-class scatter. A larger value of
between-class scatter relative to within-class scatter implies better discriminative
power of the clustering algorithm. However, it should be noted that maximizing the
between-class scatter alone may lead to overfitting and poor generalization to new data.
Total Scatter:
A dataset's total scatter, sometimes called total variance or total sum of squares,
quantifies the degree to which the data points within it vary from one another. For
clustering difficulties, it can be broken down into within-class scatter and between-class
scatter, and for regression problems, it can be broken down into explained variance and
unexplained variance.
The sum of squared distances between each data point and the general mean or
centroid of all data points can be used to estimate the total scatter for a dataset. The
corresponding formula is as follows:
$$S_T = \sum_{x} (x - m)(x - m)^T = S_W + S_B$$

Where,
• m is the overall mean (centroid) of all data points
To evaluate the efficacy of different modelling strategies and to comprehend the overall
variability of the data, knowing the total scatter is crucial.
Implementation:
• Use the Iris flower data set and compute the within class scatter, between class scatter
and total scatter
Code:
import numpy as np
from sklearn.datasets import load_iris
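# A minimal completion (assumed approach) computing the three scatter matrices
# for the Iris dataset, following the formulas given above.
X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)

S_W = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
S_B = np.zeros((X.shape[1], X.shape[1]))   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)
    S_B += len(Xc) * np.outer(mc - overall_mean, mc - overall_mean)

S_T = (X - overall_mean).T @ (X - overall_mean)   # total scatter = S_W + S_B
print("trace(S_W) =", np.trace(S_W))
print("trace(S_B) =", np.trace(S_B))
print("trace(S_T) =", np.trace(S_T))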
Results:
Conclusion:
• In conclusion, this study delved into the concept of data scatter, a crucial aspect
of data analysis. We successfully learned how to compute and compare three key
measures of scatter: within-class scatter, between-class scatter, and total scatter.
These measures provide valuable insights into the distribution and separability
of data points, aiding in the assessment and optimization of various machine
learning and statistical models. Understanding these scatter metrics is essential
for making informed decisions in data analysis and pattern recognition tasks.
Quiz:
2. How can outliers affect the calculation of variance and standard deviation?
➢ Outliers can have a significant impact on the calculation of both variance and
standard deviation.
➢ Variance: Outliers can lead to an inflated variance because the variance is
calculated by squaring the differences between data points and the mean. When
there are extreme values (outliers) in the dataset, these squared differences
become very large, which in turn increases the overall variance. Outliers
effectively contribute more to the variance than other data points.
➢ Standard Deviation: Outliers can also influence the standard deviation, although
to a somewhat lesser extent than variance. While the standard deviation still
accounts for the squared differences, it mitigates the effect of outliers by taking the
square root of the variance.
Suggested Reference:
● https://www.sciencedirect.com/topics/computer-science/class-scatter-matrix
● https://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/old_IDAPILecture15.pdf
● https://multivariatestatsjl.readthedocs.io/en/latest/mclda.html
● https://www.oreilly.com/library/view/feature-engineering-
made/9781787287600/ad8e90ca-9227-4150-9bd2-6b664dd04f46.xhtml
● https://www.machinelearningplus.com/plots/python-scatter-plot/
● https://www.geeksforgeeks.org/problem-solving-on-scatter-matrix/
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions
Experiment No: 11
Write a program to demonstrate clustering using K-means algorithm
Date:
Relevant CO: 4, 5
Objectives:
Theory:
Clustering
Clustering is a data grouping method used in machine learning and data mining to find
patterns in large amounts of data. Its main purpose is to help us see connections between
data items and understand the underlying structure of the data.
2. Hierarchical clustering builds a hierarchy of clusters: agglomerative hierarchical
clustering takes data points one at a time and combines them into larger clusters,
while divisive hierarchical clustering takes all data points at once and splits them
recursively into smaller clusters.
3. Data points are clustered in the feature space using a method called density-
based clustering. It is resistant to noise and outliers and can detect clusters of
any size or shape.
K-Means clustering:
1. As a first step, pick K points at random to serve as the centers of your data.
2. To build K clusters, assign each data point to the centroid that is the closest to it.
3. Update: recompute each centroid as the mean of the data points assigned to it.
4. Repeat steps 2 and 3 until the centroids no longer change (or a maximum number of
iterations is reached).
K-means clustering's benefits include its ease of use, scalability, and productivity. The
method assumes that clusters have a spherical shape and equal variance, which may not
always be the case, and K must be specified in advance.
Implementation:
• Use 70% of the Iris dataset to train K-means clustering. Test it with the remaining 30%
data.
• Try different values of k and observe the effect. Also observe the effect of different
distance metrics.
Code:
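A minimal sketch matching the implementation steps above, assuming scikit-learn's KMeans; the values of k tried and the use of the adjusted Rand index to compare clusters with the true species are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for k in [2, 3, 4, 5]:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_train)
    labels = km.predict(X_test)
    # Compare the cluster assignments on the test split with the true species labels
    print(f"k={k}: adjusted Rand index={adjusted_rand_score(y_test, labels):.3f}")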
Results:
Conclusion:
• In conclusion, this study provided a fundamental understanding of clustering, a
powerful unsupervised learning technique. We explored the diverse applications
of clustering in various fields, including data analysis, image processing, and
customer segmentation. Additionally, we successfully implemented the k-means
clustering algorithm, a widely used clustering method, showcasing its capability
to group data points into meaningful clusters. This knowledge equips us with
valuable tools for organizing and extracting insights from complex datasets,
paving the way for improved decision-making and problem-solving in a range of
real-world scenarios.
Quiz:
Suggested Reference:
• https://towardsdatascience.com/k-means-clustering-algorithm-
applications-evaluation-methods-and-drawbacks-aa03e644b48a
• https://serokell.io/blog/k-means-clustering-in-machine-learning
• https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
References used by the students:
• https://www.geeksforgeeks.org/k-means-clustering-introduction/
• https://www.javatpoint.com/k-means-clustering-algorithm-in-
machine-learning
Rubrics 1 2 3 4 5 Total
References References
Correct answer
to all questions