
A Laboratory Manual for

Machine Learning
(3170724)

B.E. Semester 7
(Computer Engineering)

Directorate of Technical Education, Gandhinagar, Gujarat
Government Engineering College, Bhavnagar

Certificate

This is to certify that Mr./Ms. ___________________________________


Enrollment No. _______________ of B.E. Semester _____ Computer
Engineering of this Institute (GTU Code: 021) has satisfactorily
completed the Practical / Tutorial work for the subject Machine
Learning (3170724) for the academic year 2024-25.

Place: __________________
Date: __________________

Name and Sign of Faculty member

Head of the Department



Preface

The main motive of any laboratory/practical/field work is to enhance the required skills and to create in students the ability to solve real-time problems by developing the relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for its engineering degree programmes in which sufficient weightage is given to practical work. This underlines the importance of skill enhancement among students and of utilising every moment of the time allotted for practicals, so that students, instructors and faculty members achieve the relevant outcomes by actually performing the experiments rather than treating them as mere study-type exercises. For effective implementation of a competency-focused, outcome-based curriculum, every practical must be carefully designed to serve as a tool for developing and enhancing the industry-relevant competencies required of every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board method of content delivery in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove a concept or theory.

By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea before the session. This, in turn, helps students achieve the predetermined outcomes. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes and practical outcomes (objectives). Students are also made aware of the safety measures and necessary precautions to be taken while performing the practical.

This manual also provides guidelines to faculty members for facilitating student-centric lab activities in each experiment by arranging and managing the necessary resources, so that students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also gives an idea of how students will be assessed, by providing rubrics.

Machine Learning is a fundamental course which deals with building models that learn patterns from data. It provides a platform for students to explore statistical properties of data, dimensionality reduction and similarity measures, to implement and evaluate supervised learning algorithms such as linear regression, logistic regression, k-NN, decision trees and neural networks, and to work with unsupervised techniques such as K-means clustering.

Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. We therefore welcome constructive suggestions for improvement and for the removal of errors, if any.

Practical – Course Outcome matrix

Course Outcomes (COs):


CO-1: Explore the fundamental issues and challenges in Machine Learning
including data and model selection and complexity
CO-2: Appreciate the underlying mathematical relationships within and across
Machine Learning algorithms
CO-3: Evaluate the various Supervised Learning algorithms using appropriate
Dataset.
CO-4: Evaluate the various unsupervised Learning algorithms using appropriate
Dataset.
CO-5: Design and implement various machine learning algorithms in a range of
real-world applications.
Sr. No. | Objective(s) of Experiment | Mapped COs
1 | Find statistical measures such as Mean, Median and Mode of the given data | CO1, CO2
2 | Find statistical measures such as Standard Deviation and Variance of the given data | CO1, CO2
3 | Implement program to perform dimension reduction of the high dimension data | CO1, CO4
4 | Implement program to understand similarity measure and dissimilarity measures | CO2
5 | Implement Linear Regression model and evaluate model performance | CO3, CO5
6 | Implement Logistic Regression model and evaluate model performance | CO3, CO5
7 | Implement k-NN classifier to classify the flower species from the IRIS dataset | √ √
8 | Implement Decision tree classifier and test its performance | √ √
9 | Implement program to demonstrate Neural Network Classifier | √ √
10 | Write a program to demonstrate within class scatter, between class scatter and total scatter of the dataset | √ √
11 | Write a program to demonstrate clustering using K-means algorithm | √ √

Industry Relevant Skills

The following industry-relevant competencies are expected to be developed in the student by undertaking the practical work of this laboratory.
1. Apply knowledge of Machine Learning to solve real-world problems
2. Understand and analyze data, and use it wisely to build ML models

Guidelines for Faculty members


1. Teacher should provide guidelines and a demonstration of the practical to the
students, covering all its features.
2. Teacher shall explain the basic concepts/theory related to the experiment to the
students before starting each practical.
3. Involve all the students in performance of each experiment.
4. Teacher is expected to share the skills and competencies to be developed in the
students and ensure that the respective skills and competencies are developed
in the students after the completion of the experimentation.
5. Teachers should give opportunity to students for hands-on experience after the
demonstration.
6. Teacher may provide additional knowledge and skills to the students even
though not covered in the manual but are expected from the students by
concerned industry.
7. Give practical assignment and assess the performance of students based on task
assigned to check whether it is as per the instructions or not.
8. Teacher is expected to refer complete curriculum of the course and follow the
guidelines for implementation.

Instructions for Students


1. Students are expected to carefully listen to all the theory classes delivered by the
faculty members and understand the COs, content of the course, teaching and
examination scheme, skill set to be developed etc.
2. Students shall organize the work in the group and make record of all observations.
3. Students shall develop maintenance skill as expected by industries.
4. Student shall attempt to develop related hands-on skills and build confidence.
5. Student shall develop the habits of evolving more ideas, innovations, skills etc. apart
from those included in scope of manual.
6. Student shall refer technical magazines and data books.
7. Student should develop a habit of submitting the experimentation work as per the
schedule and s/he should be well prepared for the same.

Index
(Progressive Assessment Sheet)

Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Teacher with date | Remarks
1 Find statistical measures such as Mean,
Median and Mode of the given data
2 Find statistical measures such as Standard
Deviation and Variance of the given data
3 Implement program to perform dimension
reduction of the high dimension data
4 Implement program to understand
similarity measure and dissimilarity
measures
5 Implement Linear Regression model and
evaluate model performance
6 Implement Logistic Regression model and
evaluate model performance
7 Implement k-NN classifier to classify the
flower species from IRIS dataset
8 Implement Decision tree classifier and test
its performance
9 Implement program to demonstrate Neural
Network Classifier
10 Write a program to demonstrate within
class scatter, between class scatter and
total scatter of the dataset
11 Write a program to demonstrate clustering
using K-means algorithm
Total

EXPERIMENT NO: 0

Commissionerate of Technical Education’s Vision:


To facilitate quality technical and professional education having relevance for both industry
and society, with moral and ethical values, giving equal opportunity and access, aiming to
prepare globally competent technocrats.

Commissionerate of Technical Education’s Mission:

1. Quality technical and professional education with continuous improvement of all the resources and personnel

2. To promote a conducive ecosystem for Academic, Industry, Research, Innovations and Startups

3. To provide affordable quality professional education with moral values, equal opportunities, accessibility and accountability

4. To allocate competent and dedicated human resources and infrastructure to the institutions for providing world-class professional education to become a Global Leader (“Vishwa Guru”)

Institute’s Vision:
To transform the students into good human beings, employable engineering
graduates and continuous learners by inculcating human values and imparting
excellence in technical education.

Institute’s Mission:
To impart education to rural and urban students, so as to earn respect from the
society and thereby improving the living standards of their families and become asset
for the industry and society. To foster a learning environment with technology
integration and individual attention, so that the students imbibe quality technical
knowledge, skill-development and character building.

Computer Engineering Department’s Vision:

To achieve excellence for providing value-based education in computer science and


Information Technology through innovation, team work and ethical practices.

Computer Engineering Department’s Mission:


• To produce graduates according to the need of industry, government, society and
scientific community and to develop partnership with industries, government agencies
and R & D Organizations for knowledge sharing and overall development of faculties
and students.
• To motivate students/graduates to be entrepreneurs.
• To motivate students to participate in reputed conferences, workshops, symposiums,
seminars and related technical activities.
• To impart human and ethical values in our students for better serving of society.

Program Educational Objectives (PEOs) of Computer


Engineering Department:
• PEO1: have careers in industry, research & academics focusing on the application of
the information technologies and take leadership positions in the industry and also
initiate businesses offering innovative solutions.
• PEO2: capable of adapting to new technologies and constantly upgrade their skills
with an attitude towards lifelong learning.
• PEO3: act conscientiously about the potentials and the responsibilities of the
computing profession in the context of science, technology, society and humanity.

Program Specific Objective (PSOs) of Computer Engineering


Department:
• PSO1: An ability to design solutions by simulating computer engineering problems in
absence of real time environment.
• PSO2: An ability to learn and use of open-source tools in the areas like operating
systems, computer programming, computer networks and database management.
• PSO3: An ability to face competitive examinations for higher education and PSUs.

Program Outcome:

1. Engineering knowledge: Apply the knowledge of mathematics, science,


engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first
principles of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified
needs with appropriate consideration for the public health and safety, and the
cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge
and research methods including design of experiments, analysis and interpretation
of data, and synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modeling to
complex engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of
the engineering and management principles and apply these to one’s own work, as
a member and leader in a team, to manage projects and in multidisciplinary
environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of
technological change.

Experiment No: 1

Find statistical measures such as Mean, Median and Mode of the given
data

Date:

Competency and Practical Skills:

Relevant CO: 1, 2

Objectives:

1. To understand basic statistical properties of data like mean, median and mode

2. To learn about effect of outliers on these properties

Theory:

Statistical properties of data reveal a lot about the data. Computing them is the first step in any data analysis task.

Mean: We can compute the mean for only numeric data. The average value of a group of
numbers is referred to as the mean of those numbers. A data set's mean can be
calculated by first adding up all of the values in the set and then dividing that sum by the
total number of values.

Let X = <x1, x2, …, xn> be the vector of n numbers. The following formula can be used to
get the mean:
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

If X = <22, 44, 33, 11, 55>,

\mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33

Median: When a set of data is sorted in order from lowest to highest (or highest to
lowest), the value that falls in the middle of the set is referred to as the median. When
there are an even number of values, the median is determined by taking the average of
the two values that are in the middle of the set.

To compute the median of data set X = <22, 44, 33, 11, 55>, we shall first arrange it in
ascending or descending order:

X = <11, 22, 33, 44, 55>

There are 5 elements in the array X, so the median is the element at position 3, which is 33.

For Y = <11, 22, 33, 44, 55, 66>, there are 6 elements, so the median of this data set is the average of the two middle values: (33 + 44) / 2 = 38.5.

Mode: The mode of a data set is its most frequent value. If two or more values occur with the same highest frequency, the data set has more than one mode. In other words, the mode is the element that occurs the maximum number of times.

For X = <11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11>, the element 11 appears the maximum number of times (4 times), so 11 is the mode of this dataset.

For Y = <33, 66, 55, 22, 11, 33, 44, 11, 33, 55, 11>, the elements 11 and 33 appear the maximum number of times (3 times each), so 11 and 33 are both modes of this dataset.
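As a small illustration of multimodal data (a sketch using Python's collections.Counter, which is not otherwise required by this experiment), all modes of a dataset can be listed as follows:

from collections import Counter

def all_modes(data):
    # Count occurrences of each value and keep every value tied for the highest count
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

print(all_modes([11, 33, 66, 55, 22, 11, 66, 44, 11, 33, 55, 11]))  # [11]
print(all_modes([33, 66, 55, 22, 11, 33, 44, 11, 33, 55, 11]))      # [11, 33]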

Set up diagram: Plot the histogram for the data X = <1, 3, 2, 4, 5, 6, 4, 3, 2, 4, 5, 3, 1, 2, 3,


2, 3, 1, 4> and show mean, median and mode in the histogram.

Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = [1, 3, 2, 4, 5, 6, 4, 3, 2, 4, 5, 3, 1, 2, 3, 2, 3, 1, 4]
mean = np.mean(X)
median = np.median(X)
# np.ravel keeps this working whether SciPy returns the mode as an array or a scalar
mode = np.ravel(stats.mode(X).mode)[0]

plt.figure(figsize=(10, 6))
plt.hist(X, bins=range(1, 8), edgecolor='black', alpha=0.7, color='skyblue')
plt.axvline(mean, color='red', linestyle='dashed', linewidth=1.5, label=f'Mean: {mean:.2f}')
plt.axvline(median, color='yellow', linestyle='dashed', linewidth=1.5, label=f'Median: {median:.2f}')
plt.axvline(mode, color='blue', linestyle='dashed', linewidth=1.5, label=f'Mode: {mode:.2f}')
plt.title('Histogram of Data X with Mean, Median, and Mode')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Implementation:


Write a program to compute mean, median and mode for following data using your
preferred programming language:

X1 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2>

X2 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000>

X3 = <1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000>
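A minimal sketch of the computation (assuming NumPy and SciPy, as already used in the setup code above):

import numpy as np
from scipy import stats

datasets = {
    'X1': [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2],
    'X2': [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000],
    'X3': [1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000],
}
for name, data in datasets.items():
    # stats.mode reports the smallest value when several values are tied for most frequent
    mode = np.ravel(stats.mode(data).mode)[0]
    print(f"{name}: mean = {np.mean(data):.2f}, median = {np.median(data)}, mode = {mode}")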

Result:

Dataset Mean Median Mode

X1 2.56 2.0 1

X2 82.48 2.0 1

X3 555.50 65.0 1

Conclusion:

In the context of machine learning, understanding these central tendencies, outliers, and other characteristics of the data is crucial for effective data preprocessing, feature engineering, and model development. X1 has a mean of 2.56 and a median of 2.0; the values 1 and 2 each occur seven times, so the data is bimodal (SciPy reports the smaller mode, 1). X2 contains a high outlier (2000) that inflates the mean to 82.48, while the median remains 2.0 and the mode remains 1. X3 consists of larger, non-repeating values, with a mean of 555.50 and a median of 65.0; since no value repeats, there is no meaningful mode (SciPy simply reports the smallest value, 1).

Quiz:

1. What is the difference between mean and median?


Ans:
• Mean is the average of all values in a dataset, calculated by summing them up and
dividing by the total count. It considers all data points and is sensitive to outliers.
• Median is the middle value in a sorted dataset. If there's an even number of values,
it's the average of the two middle values. It's less affected by outliers and represents
the central value of the data.

2. What is the mean, median and mode of the following set of data: 4, 7, 9, 9, 11,
11, 11, 13?
Ans:
• Mean (Average): (4 + 7 + 9 + 9 + 11 + 11 + 11 + 13) / 8 = 75 / 8 = 9.375
• Median: (9 + 11) / 2 = 20 / 2 = 10
• Mode: 11 (it occurs three times, more often than any other value).


3. Which measure of central tendency is preferred when the data set has extreme
values?
Ans:
• When dealing with data sets that have extreme values or outliers, the median is the
preferred measure of central tendency. It is less influenced by extreme values,
providing a more robust representation of the typical value in the dataset.

4. If a set of data has two modes, what is it called?


Ans:
• When a set of data has two modes, it is called bimodal. A dataset is considered
bimodal when it exhibits two distinct values that occur with the highest frequency,
indicating the presence of two different modes within the data. Bimodality is a
characteristic of the data's distribution and suggests that there are two significant
peaks or clusters of values in the dataset.
Suggested Reference:

• https://codecrucks.com/mean-median-mode-variance-discovering-statistical-
properties-of-data
• https://www.techtarget.com/searchdatacenter/definition/statistical-mean-median-
mode-and-range
• https://www.statisticshowto.com/probability-and-statistics/statistics-
definitions/mean-median-mode/
• https://www.twinkl.co.in/teaching-wiki/mean-median-mode-and-range

References used by the students:

• https://www.thoughtco.com/definition-of-bimodal-in-statistics-3126325
• https://www.diffen.com/difference/Mean_vs_Median

Rubric wise marks obtained:

Rubric criteria (cumulative):
1: Complete implementation as asked
2: Complete implementation as asked + Correct Result
3: Complete implementation as asked + Correct Result + Conclusions
4: Complete implementation as asked + Correct Result + Conclusions + References
5: Complete implementation as asked + Correct Result + Conclusions + References + Correct answer to all questions

Marks: 1 ____  2 ____  3 ____  4 ____  5 ____  Total ____


Experiment No: 2

Find statistical measures such as Standard Deviation and Variance of


the given data

Date:

Competency and Practical Skills:

Relevant CO: 1, 2

Objectives:

1. To understand statistical properties of data like standard deviation and variance

2. To learn about spread and magnitude of the data

Theory:

Statistical properties of data reveal a lot about the data. Computing them is the first step in any data analysis task.

Standard deviation: The dispersion of data relative to its mean or average is quantified
by its standard deviation. How much each data point varies from the mean is revealed. If
your data points have a low standard deviation, they cluster closely around the mean,
but if they have a large standard deviation, they are more widely dispersed.

Let X = <x1, x2, …, xn> be the vector of n numbers. The following formula can be used to
get the standard deviation:

\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

If X = <22, 44, 33, 11, 55>,

\mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33

\sigma = \sqrt{\frac{(22 - 33)^2 + (44 - 33)^2 + (33 - 33)^2 + (11 - 33)^2 + (55 - 33)^2}{5}}

\sigma = \sqrt{\frac{121 + 121 + 0 + 484 + 484}{5}} = \sqrt{\frac{1210}{5}} = \sqrt{242} \approx 15.56


Variance: The variance of a data collection is another indicator of its dispersion around the mean. It is the mean of the squared deviations from the mean (i.e., the square of the standard deviation), and it likewise reveals how much each data point varies from the mean.

The formula for finding the variance of a data set is as follows:

\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}

If X = <22, 44, 33, 11, 55>,

\mu = \frac{22 + 44 + 33 + 11 + 55}{5} = \frac{165}{5} = 33

\sigma^2 = \frac{(22 - 33)^2 + (44 - 33)^2 + (33 - 33)^2 + (11 - 33)^2 + (55 - 33)^2}{5} = \frac{1210}{5} = 242
5 5

Set up diagram: Plot the histogram for the data X = <1, 3, 2, 4, 56, 4, 3, 2, 4, 5, 3, 1, 2, 3,
2, 3, 1, 4> and show standard deviation and variance in the histogram.

Implementation:

Write a program to compute standard deviation and variance for following data using
your preferred programming language:

X1 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2>
X2 = <1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000>
X3 = <1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000>
Code:

import numpy as np
import matplotlib.pyplot as plt
X = [1, 3, 2, 4, 56, 4, 3, 2, 4, 5, 3, 1, 2, 3, 2, 3, 1, 4]

plt.hist(X, bins=10, color='blue', edgecolor='black')


plt.title('Histogram for Data X')


plt.xlabel('Value')
plt.ylabel('Frequency')
std_dev_X = np.std(X)
variance_X = np.var(X)
plt.text(45, 5, f'Standard Deviation: {std_dev_X:.2f}\nVariance: {variance_X:.2f}',
bbox=dict(facecolor='white', alpha=0.5))
plt.show()

X1 = [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2]
X2 = [1, 2, 3, 4, 5, 4, 3, 2, 4, 5, 1, 2, 3, 4, 2, 5, 1, 3, 2, 1, 2, 1, 1, 1, 2000]
X3 = [1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000]

std_dev_X1, variance_X1 = np.std(X1), np.var(X1)


std_dev_X2, variance_X2 = np.std(X2), np.var(X2)
std_dev_X3, variance_X3 = np.std(X3), np.var(X3)
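To print the values reported in the result table below (a small addition that assumes the variables computed above):

print(f"X1: std = {std_dev_X1:.2f}, var = {variance_X1:.2f}")
print(f"X2: std = {std_dev_X2:.2f}, var = {variance_X2:.2f}")
print(f"X3: std = {std_dev_X3:.2f}, var = {variance_X3:.2f}")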
Result:

Dataset Standard deviation Variance

X1 1.36 1.85

X2 391.41 153205.29

X3 932.67 869870.92

Conclusion:

In summary, standard deviation and variance analysis helps us understand the


dispersion of data points. X1 has the least spread, X2 has increased spread primarily
due to an outlier, and X3 exhibits the most significant spread due to the presence of
large values.

Quiz:

1. What does the standard deviation of a set of data tell you?


Ans:
• The standard deviation of a set of data tells you how much the individual data
points deviate from the average (mean) value. It quantifies the level of variation or
spread in the data, with a higher standard deviation indicating more spread and a
lower standard deviation indicating less spread. In practical terms, it helps you
understand how consistent or variable the data is, making it a useful tool in various
fields such as finance, statistics, and scientific research.

2. If the standard deviation of a set of data is very low, what does that tell you
about the data points?
Ans:
• A very low standard deviation in a dataset indicates that the data points are highly
consistent and closely clustered around the mean.
• This suggests little variation or spread among the data points, making them

predictable and less variable.

3. If a set of data has a high standard deviation, what does that tell you about the
spread of the data points?
Ans:
• A high standard deviation indicates that the data points are widely spread out and, on average, lie far from the mean.
• This reflects greater variability in the data, so individual values are less predictable and the mean is a less representative summary of the data set.

4. Which of the following is a correct interpretation of the coefficient of


variation?
Ans:
• The coefficient of variation (CV) is a measure that expresses the standard deviation
as a percentage of the mean. It is used to gauge the relative variability in a dataset,
making it a standardized way to compare the spread of data with different means.
• A higher CV indicates greater relative variability compared to the mean, making it
useful for comparisons, risk assessment, and quality control in various fields.

Suggested Reference:

• https://codecrucks.com/mean-median-mode-variance-discovering-statistical-
properties-of-data
• https://www.investopedia.com/ask/answers/021215/what-difference-between-
standard-deviation-and-variance.asp
• https://www.mathsisfun.com/data/standard-deviation.html
• https://www.cuemath.com/data/variance-and-standard-deviation/
References used by the students:

• https://www.statology.org/what-is-a-low-standard-deviation/
• https://www.scribbr.com/statistics/standard-deviation/
Rubric wise marks obtained:

Rubric criteria (cumulative):
1: Complete implementation as asked
2: Complete implementation as asked + Correct Result
3: Complete implementation as asked + Correct Result + Conclusions
4: Complete implementation as asked + Correct Result + Conclusions + References
5: Complete implementation as asked + Correct Result + Conclusions + References + Correct answer to all questions

Marks: 1 ____  2 ____  3 ____  4 ____  5 ____  Total ____


Experiment No: 3

Implement program to perform dimension reduction of the high


dimension data

Date:

Competency and Practical Skills:

Relevant CO: 1, 4

Objectives:

1. To understand the drawback of high dimensionality in data

2. To study method to reduce dimensionality of data

Theory:

High dimensionality:

The term "high-dimensional data" refers to data sets that contain a significant number
of distinct features. When dealing with high-dimensional data, it can be challenging to
visualize and interpret the information because the human brain is only able to process
information in a limited number of dimensions. For instance, a data collection
containing 1000 features could be interpreted and displayed as a space with 1000
dimensions, which would be impossible to visualize.

Dealing with data that has a high dimension can be difficult due to the possibility that
conventional statistical approaches will not function well in the situation. The
phenomenon known as the "curse of dimensionality" occurs as the number of
dimensions in a problem rises, causing the amount of data that must be collected to
keep a specific level of statistical accuracy to increase at an exponential rate.

In addition, methods of data visualization such as scatter plots, heat maps, and parallel
coordinates can be utilized in order to assist with the comprehension of high-
dimensional data. It is essential to select the methods that are suitable for the data that
you are working with and the research issue that you are attempting to answer.

Principal Component Analysis:

Principal component analysis, sometimes known as PCA, is a method that can be used in
statistics to reduce the dimensions of a data set. It includes translating a high-
dimensional data collection into a lower-dimensional space while trying to keep as
much of the information intact.

The principle behind principal components analysis (PCA) is to determine the ways in

which the data varies the most in order to determine which patterns or features in the
data are the most significant. Principal components are another name for these different
directions. The first principal component (PC) accounts for the greatest amount of
variance in the data, and each successive PC accounts for as much of the remaining
variation as it is possible while adhering to the requirement that it is orthogonal
(perpendicular) to the PCs that came before it.

The following is a list of the steps involved in PCA:

• Standardize the data: PCA works best when the data are standardized, i.e., each feature has a mean of zero and a variance of one.

• Compute the covariance matrix: this matrix measures the linear relationship between each pair of features in the data.

• Compute the eigenvectors and eigenvalues of the covariance matrix: the eigenvectors give the directions in which the data varies the most, and the eigenvalue corresponding to each eigenvector gives the amount of variation found along that direction.

• Choose the k most significant eigenvectors: these are the top k principal components, which capture the most variation in the data.

• Project the data: to obtain a lower-dimensional representation of the data, multiply the standardized data by the matrix whose columns are the top k eigenvectors.
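The steps above can be sketched directly in NumPy. This is an illustrative sketch only (the implementation below uses scikit-learn's PCA, which handles these details internally):

import numpy as np

def pca_reduce(X, k):
    # 1. Standardize the data: zero mean and unit variance for every feature
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(Xs, rowvar=False)
    # 3. Eigen-decomposition (eigh is suitable because the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the k eigenvectors with the largest eigenvalues (the top k principal components)
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # 5. Project the standardized data onto the top k components
    return Xs @ top_k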

Implementation:

Load Iris flower dataset, study it. Reduce the dimension of dataset to 2 by applying PCA.

Step up plot: Draw plot of all three classes in dataset with 2 dimensions only, after
applying PCA

Code:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the plotting calls below

iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])


df_pca['target'] = y
df_pca['target_name'] = df_pca['target'].apply(lambda i: target_names[i])

plt.figure(figsize=(10, 7))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue='target_name', palette='Set1', s=100)

plt.title('Iris Dataset - PCA (2D)')


plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Iris Class')
plt.show()
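Optionally, the proportion of variance retained by the two components can be reported; explained_variance_ratio_ is a standard attribute of scikit-learn's fitted PCA object:

# Proportion of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())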

Conclusion:
Applying Principal Component Analysis (PCA) to the Iris dataset and visualizing it in
two dimensions reveals clear separation among the three classes of Iris flowers (Setosa,
Versicolor, and Virginica). Setosa forms a distinct cluster, while Versicolor and Virginica,
though distinguishable, show some overlap. Reducing the dimensionality to 2D
enhances data visualization and class separability.

Quiz:

1. What are some alternatives to PCA for feature extraction and


dimensionality reduction?

Ans:
• There are several alternatives to Principal Component Analysis (PCA) for feature
extraction and dimensionality reduction, each with its own strengths and
weaknesses.
• Independent Component Analysis (ICA): ICA assumes that the observed data is a

linear combination of independent sources and seeks to separate these sources.


• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear
dimensionality reduction technique that focuses on preserving the local structure of
the data points.
• Isomap: Isomap is a nonlinear method that uses geodesic distances to measure the
intrinsic geometry of the data.
• Locally Linear Embedding (LLE): LLE is another nonlinear method that preserves
the local linear relationships between data points.
• Sparse Coding: Sparse coding seeks to represent data using a sparse set of basis
functions. It is often employed in image processing and signal analysis.
• Feature Selection Methods: Instead of reducing dimensions, feature selection
methods select a subset of the most important features.

2. What are some common applications of PCA in real-world data analysis?

Ans:
• Image Compression: PCA can be used to reduce the dimensionality of images
while retaining important features.
• Face Recognition: In facial recognition systems, PCA can be applied to extract
essential facial features and reduce the dimensionality of the data, making it easier
to compare and recognize faces.
• Genomic Data Analysis: In bioinformatics, PCA can identify patterns in gene
expression data, helping researchers discover relationships between genes and
their functions.
• Environmental Data Analysis: PCA is used to reduce the dimensionality of
environmental datasets, such as climate data, to uncover trends and patterns in the
data.
• Speech Recognition: PCA can be applied to reduce the dimensionality of audio data
and improve the efficiency of speech recognition systems.
• Recommendation Systems: In recommendation engines, PCA can help identify
latent factors that influence user preferences, improving personalized
recommendations.

3. How does the choice of scaling method impact the results of PCA?

• The choice of scaling method in PCA significantly impacts the results:


• No scaling can lead to a dominance of variables with larger ranges.
• Standardization (Z-score scaling) is common and makes variables comparable.
• Normalization preserves the relative scale of variables and can be meaningful in
specific contexts.
• Non-linear transformations, like ranking, may be used to capture non-Gaussian data
or emphasize ordinal relationships. The choice should align with data
characteristics and analysis goals.


Suggested Reference:

● https://codecrucks.com/question/machine-learning-question-set-5/

● https://builtin.com/data-science/step-step-explanation-principal-component-
analysis

● https://towardsdatascience.com/a-one-stop-shop-for-principal-component-
analysis-5582fb7e0a9c

● https://www.turing.com/kb/guide-to-principal-component-analysis

References used by the students:

● https://medium.com/all-about-ml/understanding-principal-component-analysis-
pca-556778324b0e

● https://www.simplilearn.com/tutorials/machine-learning-tutorial/principal-
component-analysis

Rubric wise marks obtained:

Rubric criteria (cumulative):
1: Complete implementation as asked
2: Complete implementation as asked + Correct Result
3: Complete implementation as asked + Correct Result + Conclusions
4: Complete implementation as asked + Correct Result + Conclusions + References
5: Complete implementation as asked + Correct Result + Conclusions + References + Correct answer to all questions

Marks: 1 ____  2 ____  3 ____  4 ____  5 ____  Total ____


Experiment No: 4

Implement program to understand similarity measure and


dissimilarity measures

Date:

Competency and Practical Skills:

Relevant CO: 2

Objectives:

1. To understand similarity measures and their applications

2. To study dissimilarity measures and their applications

Theory:

Similarity measures:

A numerical value that shows how similar or alike two things or entities are with
respect to specific features or characteristics is referred to as a similarity measure. It is
possible to calculate the similarity measure based on a variety of metrics or distances
between the qualities of the objects or entities being compared. This allows the measure
to be used to quantify the degree to which two things or entities resemble one another.

In the fields of machine learning and data analysis, similarity measures are frequently
utilized for the purpose of carrying out tasks such as clustering, classification, and the
development of recommendation systems. For instance, the similarity measure is used
in clustering to group together objects or entities that are similar to one another,
whereas in recommendation systems, the similarity measure is used to identify things
that are similar to those that a user has liked or purchased in the past.

A. Cosine Similarity:

The cosine similarity is a measure of similarity that is utilized for the purpose of
determining the degree to which two non-zero vectors of an inner product space are
similar to one another. It is usual practice in the fields of machine learning and data
analysis to utilize cosine similarity to evaluate the degree of similarity between two
documents or texts. However, cosine similarity may also be utilized to evaluate the
degree of similarity between any two vectors.


Cosine similarity between vectors A and B

The cosine similarity measure is calculated as follows:

\text{Cosine Similarity} = \frac{A \cdot B}{\|A\|\,\|B\|}

where A and B are two vectors, A · B is the dot product of A and B, and ||A|| and ||B|| are their respective magnitudes.
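For example, for U = <1, 2, 3> and V = <2, 4, 6> (a small worked illustration):

\text{Cosine Similarity} = \frac{1 \cdot 2 + 2 \cdot 4 + 3 \cdot 6}{\sqrt{1^2 + 2^2 + 3^2}\,\sqrt{2^2 + 4^2 + 6^2}} = \frac{28}{\sqrt{14}\,\sqrt{56}} = \frac{28}{28} = 1

so the two vectors point in exactly the same direction.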

Dissimilarity Measure:

A dissimilarity measure is a mathematical formula that expresses how far apart two
items, entities, or observations are from one another in a dataset. Algorithms for
machine learning, pattern recognition, data analysis, and clustering frequently employ
dissimilarity metrics.

A. Euclidean distance:

The Euclidean distance is a measure of distance that can be used to determine how far
apart two points are in the space defined by Euclidean geometry. In the fields of
machine learning and data analysis, the Euclidean distance is a measurement that is
frequently used to determine how similar or unlike two numerical vectors are to one
another.

The following formula is used to determine the Euclidean distance between two
locations in a space of n dimensions:

d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}

Where, A = {a1, a2, …, an} and B = {b1, b2, …,bn} are two feature vectors of dimension n.

Regardless of the number of dimensions involved, the Euclidean distance can be utilized
to determine the separation that exists between any two numerical vectors. In the fields
of machine learning and data analysis, one of the most common applications of
Euclidean distance is in the performance of tasks such as grouping, classification, and
regression. However, the Euclidean distance might not always be the best distance

measure to use for certain kinds of data or applications. This is because the Euclidean
distance assumes that all points on a line are equal. In these kinds of scenarios, alternate
methods of measuring distance, like the Manhattan distance or the Mahalanobis
distance, might be more appropriate.
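A small sketch comparing several of these measures on the vectors used below (assuming SciPy is available; scipy.spatial.distance provides cosine, euclidean and cityblock):

import numpy as np
from scipy.spatial import distance

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 3, 5, 7, 9, 7, 5, 3, 1, 0])

# distance.cosine returns the cosine *distance*, i.e. 1 - cosine similarity
print("Cosine similarity:", 1 - distance.cosine(A, B))
print("Euclidean distance:", distance.euclidean(A, B))
print("Manhattan (city-block) distance:", distance.cityblock(A, B))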

Implementation:

Compute Cosine similarity and Euclidean distance between the following vectors:

• A = <1, 2, 3, 4, 5, 6, 7, 8, 9, 10>, B = <1, 3, 5, 7, 9, 7, 5, 3, 1, 0>

• X = <1, 2, 3, 4, 5, 6, 7, 8, 9, 10>, Y = <1, 3, 5, 7, 9, 7, 5, 3, 1, 0>

Set up diagram:

Draw the diagrams geometrically explaining cosine similarity and Euclidean distance

Code:

import numpy as np
A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
B = np.array([1, 3, 5, 7, 9, 7, 5, 3, 1, 0])
dot_product = np.dot(A, B)
norm_A = np.linalg.norm(A)
norm_B = np.linalg.norm(B)
cosine_similarity = dot_product / (norm_A * norm_B)
print("Cosine Similarity between A and B:", cosine_similarity)
euclidean_distance = np.sqrt(np.sum((A - B) ** 2))
print("Euclidean Distance between A and B:", euclidean_distance)

import matplotlib.pyplot as plt

# X and Y are the same vectors as A and B (see Implementation); only their first
# two components are used in the 2-D geometric illustration below
X, Y = A, B

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.quiver(0, 0, X[0], X[1], angles='xy', scale_units='xy', scale=1, color='b', label='X')
plt.quiver(0, 0, Y[0], Y[1], angles='xy', scale_units='xy', scale=1, color='r', label='Y')
plt.xlim(0, 11)
plt.ylim(0, 11)

plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.title(f"Cosine Similarity: {cosine_similarity:.2f}")
plt.subplot(1, 2, 2)
plt.quiver(0, 0, X[0], X[1], angles='xy', scale_units='xy', scale=1, color='b', label='X')
plt.quiver(X[0], X[1], Y[0] - X[0], Y[1] - X[1], angles='xy', scale_units='xy', scale=1, color='r',
label='Y - X')
plt.xlim(0, 11)
plt.ylim(0, 11)
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.title(f"Euclidean Distance: {euclidean_distance:.2f}")
plt.tight_layout()
plt.show()

Results:

Measure Vectors Value

Cosine (A, B) 0.66
Cosine (X, Y) 0.66

Euclidean (A, B) 14.97
Euclidean (X, Y) 14.97

Conclusion:
Implementing a program to work with similarity and dissimilarity measures is a fundamental
step in various data analysis and machine learning tasks. It provides insights into data
relationships, helps in grouping and classification, and has broad applications across different
domains. The choice of the appropriate measure is crucial, and evaluating the program's
results is essential for its effectiveness.
Quiz:

1. What is the difference between similarity and dissimilarity measures?


➢ Purpose: The primary difference is in their purpose. Similarity measures
quantify how much two objects are alike, while dissimilarity measures
quantify how different or distant, they are.
➢ Scale: Similarity measures typically yield values between 0 and 1, where 1
signifies perfect similarity. Dissimilarity measures yield non-negative values
that can vary widely based on the data and measure used.
➢ Interpretation: In similarity measures, higher values indicate greater
similarity, while in dissimilarity measures, higher values indicate greater
dissimilarity.
➢ Examples: Different mathematical formulas and approaches are used to compute similarity and dissimilarity. The choice of measure depends on the data type and the specific analysis or application.


2. What are some common applications of similarity and dissimilarity measures?
➢ Similarity and dissimilarity measures have a wide range of applications
across different domains, including recommendation systems, information
retrieval, image processing, clustering, classification, and more. The specific
choice of measure and its application depends on the problem and the nature
of the data being analyzed.

3. State 4 similarity measures


➢ Similarity measures
1. Cosine Similarity
2. Jaccard Similarity
3. Euclidean Distance
4. Pearson Correlation Coefficient

4. State 4 dissimilarity measures


➢ Dissimilarity measures
1. Euclidean Distance
2. Manhattan Distance (L1 Norm)
3. Jaccard Distance (for Sets)
4. Hamming Distance (for Binary Data or Strings)

Suggested Reference:

• https://codecrucks.com/distance-and-similarity-measures-for-machine-
learning/
• https://www.sciencedirect.com/topics/computer-science/cosine-similarity
• https://medium.datadriveninvestor.com/cosine-similarity-cosine-distance-
6571387f9bf8
• https://www.cuemath.com/euclidean-distance-formula/
• https://www.engati.com/glossary/euclidean-distance
References used by the students:

• https://online.stat.psu.edu/stat508/lesson/1b/1b.2/1b.2.1
• https://www.scaler.com/topics/measures-of-similarity-and-dissimilarity/
Rubric wise marks obtained:

Rubric criteria (cumulative):
1: Complete implementation as asked
2: Complete implementation as asked + Correct Result
3: Complete implementation as asked + Correct Result + Conclusions
4: Complete implementation as asked + Correct Result + Conclusions + References
5: Complete implementation as asked + Correct Result + Conclusions + References + Correct answer to all questions

Marks: 1 ____  2 ____  3 ____  4 ____  5 ____  Total ____


Experiment No: 5
Implement Linear Regression model and evaluate model performance

Date:

Competency and Practical Skills:

Relevant CO: 3, 5

Objectives:

1. To understand how linear regression works

2. To learn the applications of linear regression

3. To implement and test the linear regression model

Theory:

What is Regression?

Regression is a statistical method for modelling the relationship between a dependent variable (also known as the response or target variable) and one or more independent variables (also known as predictors or features). The objective of regression analysis is to find the line or curve that best describes the relationship between the variables being studied.

To create predictions or get an estimate of the value of the dependent variable based on
the values of the independent variables, regression analysis is a technique that is
frequently utilized in a variety of sectors, including finance, economics, the social
sciences, and engineering, amongst others.

It is possible for regression models to be linear or nonlinear, depending on the type of


relationship that exists between the variables being studied. When doing linear
regression, it is assumed that the relationship between the dependent variable and the
independent variables is linear. The model then attempts to determine the equation of a
straight line that most accurately captures the supposed linearity of the relationship.
Nonlinear regression models, on the other hand, make it possible to take into account
more intricate correlations between the variables. These models can take into account
relationships that are exponential, logarithmic, or polynomial.

Simple Linear Regression:

Simple linear regression is a regression model that describes the relationship between a dependent variable (Y) and a single independent variable (X). The model works under the
premise that there is a linear connection between X and Y, and it attempts to determine
the equation of a straight line that most accurately depicts this connection.

The equation for a straightforward linear regression model can be written as follows:

𝑦̂𝑖 = 𝑤0 + 𝑤1 𝑥𝑖

where \hat{y}_i represents the predicted value for the input x_i, and w_0 and w_1 are the Y-intercept and the slope of the predictor line, respectively.

Graphics representation of simple linear regression model

The objective of simple linear regression is to estimate the values of w_0 and w_1 that minimise the sum of squared errors between the predicted and actual values of Y. This is commonly done using the method of least squares.
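For reference, the least-squares estimates have a standard closed form:

w_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}

where \bar{x} and \bar{y} are the means of X and Y.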

Simple linear regression is a powerful and widely used technique for analyzing the relationship between two variables; however, it is important to verify that the model's assumptions are met and that the model is appropriate for the data being analyzed.

Implementation:

• Consider vector X as amount spend for advertisement and Y as increase in unit


sale. X = <70, 80, 90, 100, 110, 120, 130, 140, 150, 160>, Y = <7, 7, 8, 9, 12, 12, 15,
14, 13, 17>. Create simple linear regression model. Find the model parameters
w0 and w1. Predict the Y for X = 210

Results:

Variable Value


w0 -1.1454545454545446

w1 0.10909090909090909

Y | x = 210 21.763636363636362

Plot: Plot the given data and fit the regression line. Also show the predicted value for X = 210.

Code:

import matplotlib.pyplot as plt


import numpy as np
X = np.array([70, 80, 90, 100, 110, 120, 130, 140, 150, 160])
Y = np.array([7, 7, 8, 9, 12, 12, 15, 14, 13, 17])
# Estimate w0 and w1 by least squares (these match the values in the Results table)
w1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
w0 = Y.mean() - w1 * X.mean()
regression_line = w0 + w1 * X
plt.scatter(X, Y, label='Data')
plt.plot(X, regression_line, color='red', label='Regression Line')
plt.xlabel('Amount Spent for Advertisement (X)')
plt.ylabel('Increase in Unit Sale (Y)')
plt.title('Simple Linear Regression')
plt.legend()
plt.grid(True)
X_pred = 210
Y_pred = w0 + w1 * X_pred
plt.text(X_pred, Y_pred, f'Predicted Y = {Y_pred:.2f}', ha='right')
plt.show()
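As a cross-check (a sketch that assumes scikit-learn is installed, which the experiment itself does not require), the same parameters can be obtained with LinearRegression:

from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X.reshape(-1, 1), Y)
print("w0 =", reg.intercept_, ", w1 =", reg.coef_[0])
print("Prediction for X = 210:", reg.predict([[210]])[0])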

Conclusion:
• The linear regression model captured the positive linear relationship between the amount spent on advertisement and the increase in unit sales. The choice of this model was appropriate for this dataset, given its simplicity and interpretability.
• The linear regression model serves as a valuable starting point for understanding and predicting the target variable. The evaluation metrics provide insights into the model's performance and guide us in making data-driven decisions based on the analysis.

Quiz:

1. What are the pros and cons of using KNN for classification tasks?
➢ Pros:
1. Simplicity: KNN is easy to understand and implement. It's an ideal choice
for beginners in machine learning.
2. No Training Period: KNN is a lazy learning algorithm, which means there
is no explicit training phase. The model stores the entire dataset and
makes predictions on the fly, which can be advantageous when the data is
continuously changing.
3. Non-Parametric: KNN is a non-parametric algorithm, meaning it doesn't
make strong assumptions about the underlying data distribution. It can
work well with data that doesn't adhere to specific statistical
assumptions.
4. Versatile: KNN can be used for both binary and multiclass classification
tasks. It's also adaptable for regression tasks by averaging the values of
the K nearest neighbors.
➢ Cons:
1. Computational Cost: KNN can be computationally expensive, especially
for large datasets. Predicting a new data point requires calculating
distances to all data points in the training set.
2. Sensitivity to Distance Metric: The choice of distance metric is critical in
KNN. Different distance metrics can yield different results, and selecting
the right one is often a trial-and-error process.
3. Curse of Dimensionality: KNN's performance degrades as the
dimensionality of the data increases. In high-dimensional spaces, the
nearest neighbors may not be representative, and the algorithm can
become less effective.

2. How do you choose the value of K in KNN?


➢ The choice of K can influence the balance between bias and variance in your
model.
➢ Here are some methods and considerations to help select an appropriate K
value:
1. Domain Knowledge
2. Odd K for Binary Classification
3. Square Root of N
4. Cross-Validation
5. Grid Search

3. How does KNN handle data with missing values?


➢ Here are some approaches to handle data with missing values in KNN:
1. Imputation: Fill Missing Values: One of the simplest approaches is to fill
missing values with a reasonable estimate. This could be the mean,
median, or mode of the feature, or you could use more sophisticated


imputation techniques, such as k-Nearest Neighbors imputation, to
predict missing values using KNN.
2. Ignore Instances with Missing Values: In some cases, you may choose
to exclude instances with missing values from the analysis. This approach
can be effective when the proportion of instances with missing values is
relatively small and removing them doesn't significantly affect the overall
dataset.
3. Multiple Imputation: For more complex scenarios, you can use multiple
imputation techniques. Multiple imputation involves creating multiple
datasets with different imputed values and running KNN on each of them.
You then combine the results to obtain a more robust prediction.
4. KNN Interpolation: For regression tasks, you can use KNN-based
interpolation to estimate the missing values. This involves using KNN to
find the most similar instances and averaging their target values to
predict the missing value.
5. Advanced Imputation Methods: If you have domain-specific knowledge
or additional information, more advanced imputation methods, such as
matrix factorization or deep learning-based imputation, may be suitable
for handling missing values.

Suggested Reference:

• https://codecrucks.com/question/machine-learning-question-set-12/
• https://www.scribbr.com/statistics/simple-linear-regression
• https://online.stat.psu.edu/stat462/node/91/
• https://www.jmp.com/en_in/statistics-knowledge-portal/what-is-
regression.html
References used by the students:

• https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-
metrics-for-your-regression-model/
• https://machinelearningmastery.com/regression-metrics-for-machine-
learning/
Rubric wise marks obtained:

Rubric criteria (cumulative):
1: Complete implementation as asked
2: Complete implementation as asked + Correct Result
3: Complete implementation as asked + Correct Result + Conclusions
4: Complete implementation as asked + Correct Result + Conclusions + References
5: Complete implementation as asked + Correct Result + Conclusions + References + Correct answer to all questions

Marks: 1 ____  2 ____  3 ____  4 ____  5 ____  Total ____


Experiment No: 6

Implement Logistic Regression model and evaluate model


performance

Date:

Competency and Practical Skills:

Relevant CO: 3, 5

Objectives:

1. To understand how logistic regression works

2. To learn the applications of logistic regression

3. To implement and test the logistic regression model

Theory:

Logistic Regression:

Logistic regression is basically a binary classifier. It is a statistical model used to analyze the relationship between a binary dependent variable and one or more independent variables. The values of the independent variables are used to predict the probability of a certain event (for example, success or failure, win or loss).

The dependent variable in logistic regression is binary, meaning it can take only the values 0 or 1. The model estimates the relationship between the independent variables, which may be continuous or categorical, and the log odds (logit) of the dependent variable being equal to 1.

To convert the linear combination of the independent variables into a probability value between 0 and 1, the logistic regression model uses the logistic function, also known as the sigmoid function. The logistic function has an S-shaped curve and may be expressed as follows.

Let 𝑧 = 𝑤0 + 𝑤1 𝑥

So, logistic regression probability is computed as,

P = \frac{e^z}{1 + e^z}


This will create the sigmoid curve as shown below. The appropriate threshold will
create the binary label for the test data.

[Source: wikipedia]
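As a small numeric illustration of the sigmoid and thresholding (the coefficients below are illustrative values, not fitted ones, and 0.5 is the usual default threshold):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # equivalent to e^z / (1 + e^z)

w0, w1 = -1.0, 0.5
for x in [0, 2, 4, 6]:
    p = sigmoid(w0 + w1 * x)
    print(f"x = {x}: P = {p:.3f}, predicted label = {int(p >= 0.5)}")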

After the model has been fit to the data, the coefficients can be used to forecast the likelihood of the dependent variable being equal to 1 for new observations with known values of the independent variables.

Logistic regression is a strong and extensively used technique for analyzing binary data;
nonetheless, it is essential to make certain that the model's assumptions are satisfied
and that the model is suitable for the data that is being analyzed.

Implementation:

• Load Iris flower data set. Divide dataset in 70-30 ratio. Use 70% data to train
logistic regression model and use 30% data to test the model performance.
Measure various performance metric such as precision, recall, F1 score, accuracy.
Also derive confusion matrix.

Results:

Metric Precision Recall F1 Score Accuracy

Value 1.00 1.00 1.00 100%

Set up Plot:

Plot confusion matrix for the experiment

Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict the target values on the test set
y_pred = model.predict(X_test)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix (plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay is the current API)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(cmap=plt.cm.Blues, values_format='d')
plt.title("Confusion Matrix")
plt.show()

# Calculate precision, recall, F1-score, and accuracy
report = classification_report(y_test, y_pred, target_names=data.target_names)
print("Classification Report:\n", report)

Output:

Conclusion:
• Implementing a Logistic Regression model and evaluating its performance is an essential step
in assessing the model's effectiveness for a specific classification task.
• This evaluation is a critical process in machine learning: it helps assess the model's suitability
for the given classification task and provides insights for decision-making and potential
improvements.

Quiz:

1. What is the difference between a binary logistic regression and a multinomial logistic
regression? How do you choose which one to use for your data?

➢ Binary logistic regression is for binary outcomes (yes/no, 0/1), while multinomial
logistic regression is for outcomes with more than two categories that are nominal
(unordered). Choose binary when you have two categories and multinomial when
you have multiple, unordered categories in your dependent variable.

2. How do you interpret the coefficients in a logistic regression model? What do they tell you
about the relationship between your predictors and your outcome?
➢ Interpreting logistic regression coefficients involves understanding their sign
(positive or negative), magnitude, and statistical significance. The sign indicates
the direction of the relationship, the magnitude reflects the strength of the
relationship, and statistical significance assesses the reliability of the relationship.
Coefficients help explain how predictors affect the likelihood of the outcome while
considering other predictors in the model.

3. What is overfitting, and how can you guard against it when using logistic
regression?
➢ Overfitting in logistic regression occurs when the model fits the training data too
closely, capturing noise and performing poorly on new data. To guard against
overfitting:
1. Carefully select relevant features.
2. Apply regularization techniques (L1, L2).
3. Use cross-validation to assess generalization.
4. Consider early stopping during training.
5. Simplify the model structure.
6. Increase the dataset size.
7. Balance bias and variance.
8. Tune the regularization strength.
9. Explore ensemble methods (e.g., Random Forest).
10. Use a separate validation set for evaluation.

4. What are some common performance metrics used to evaluate the accuracy of a
logistic regression model, and how do you interpret them?
➢ Accuracy: Overall proportion of correct predictions.
➢ Precision: Proportion of true positives among positive predictions.
➢ Recall (Sensitivity): Proportion of true positives among actual positives.
➢ Specificity: Proportion of true negatives among actual negatives.
➢ F1-Score: Harmonic mean of precision and recall.
➢ ROC Curve and AUC: Evaluates model's ability to discriminate between positive
and negative cases.
➢ Log-Loss: Considers prediction confidence for accuracy.
➢ Confusion Matrix: Breakdown of true positives, true negatives, false positives, and false
negatives.


➢ Area Under the Precision-Recall Curve (AUC-PR): Quantifies performance in
terms of precision and recall.

Suggested Reference:

● https://codecrucks.com/question/machine-learning-question-set-9/

● https://www.ibm.com/in-en/topics/logistic-regression

● https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

● https://careerfoundry.com/en/blog/data-analytics/what-is-logistic-regression/

References used by the students:

● https://www.r-bloggers.com/2015/08/evaluating-logistic-regression-models/

● https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/logistic-regression-analysis-r/tutorial/

Rubric wise marks obtained:

Rubrics   1                2                3                4                5                Total

Marks     Complete         Complete         Complete         Complete         Complete
          implementation   implementation   implementation   implementation   implementation
          as asked         as asked         as asked         as asked         as asked

                           Correct Result   Correct Result   Correct Result   Correct Result

                                            Conclusions      Conclusions      Conclusions

                                                             References       References

                                                                              Correct answer
                                                                              to all questions


Experiment No: 7

Implement k-NN classifier to classify the flower species from the IRIS dataset

Date:

Competency and Practical Skills:

Relevant CO: 3, 5

Objectives:

1. To understand how KNN works

2. To learn to train classifier

3. To test the performance of given classifier using 10-fold cross validation

Theory:

How KNN Works?

K-Nearest Neighbours (KNN) is a supervised machine learning technique that can be used for
both classification and regression tasks.

The method works by locating the K data points in the training dataset that are closest to a
given test data point. The value of K is a user-defined parameter that specifies how many
nearest neighbours should be taken into account.

After determining which K neighbours are the closest, the algorithm produces a prediction by
taking either the majority class (in classification) or the mean (in regression) of the target values
of these neighbours.

For example, in the image below, the circle is the query data point. If we consider k = 3, we
inspect the nearest three data points and choose the majority class for the circle; so for k = 3,
the class assigned to the circle would be triangle. For k = 5, the majority class is square, so the
class assigned to the green circle would be square.

The algorithm computes the distance between the test data point and each data point in
the training dataset by employing a distance metric such as Euclidean distance or cosine
similarity. This allows it to locate the K data points that are the closest in proximity to
the test data point. The type of data being examined and the specific nature of the issue
being solved both have an impact on the distance metric that is selected.


After computing the distances, the algorithm sorts the data points in ascending order of their
distance from the test data point and then chooses the K closest data points as the neighbours.
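A minimal sketch of this distance-sort-vote procedure is given below (the tiny 2-D training set and the query points are assumed values used only to illustrate the steps):

import numpy as np
from collections import Counter

# Assumed toy training data: 2-D points with class labels
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]])
y_train = np.array(['triangle', 'triangle', 'triangle', 'square', 'square', 'square'])

def knn_predict(x_query, k):
    # 1. Compute the Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Sort by distance and keep the k closest training points
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among the labels of the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([2, 2]), k=3))   # 'triangle'
print(knn_predict(np.array([5, 5]), k=3))   # 'square'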

Implementation:

• Load Iris flower dataset. Use 10-fold cross validation and find accuracy of k-nn
for k = 1, 3, 5 and 7
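One possible way to obtain accuracies of this kind is scikit-learn's cross_val_score; the sketch below assumes 10-fold cross validation is run on the full Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross validation: average accuracy over the 10 held-out folds
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    print(f"k = {k}: mean accuracy = {scores.mean():.3f}")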

Results:

Value of k Accuracy of model

1 0.933

3 0.977

5 0.955

7 0.933

Plot: Plot the line chart for k versus accuracy

Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset and create a train/test split (70/30 split assumed, as in the other experiments)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

train_score = []
test_score = []
neigh = np.arange(1, 50, 1)
for n in neigh:
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    test_score.append(knn.score(X_test, y_test))
    train_score.append(knn.score(X_train, y_train))

plt.plot(neigh, train_score, 'o-', label="training score")
plt.plot(neigh, test_score, 'o-', label="testing score")
plt.legend()
plt.xlabel("K")
plt.ylabel("Score")
plt.title("Training vs. testing score w.r.t. k")
plt.show()


Conclusion:
• The k-NN classifier is a simple and effective method for classifying iris flower
species, and it can serve as a foundation for more complex classification tasks.
This project demonstrates the importance of data preprocessing, feature
selection, hyperparameter tuning, and model evaluation in machine learning.
Quiz:

1. What are the pros and cons of using KNN for classification tasks?
➢ Pros of using k-NN for classification:
1. Simplicity: k-NN is easy to understand and implement, making it a good
choice for beginners in machine learning.
2. Non-parametric: k-NN is a non-parametric algorithm, meaning it makes
no assumptions about the underlying data distribution. This makes it
suitable for a wide range of data types.
3. No Training Period: Unlike many other machine learning algorithms, k-
NN doesn't require a lengthy training period. The model stores the entire
dataset, and predictions can be made immediately.
4. Adaptability: k-NN can be used for both binary and multi-class
classification tasks, as well as regression.
➢ Cons of using k-NN for classification:
1. Computationally Expensive: As the dataset size increases, k-NN's
computational cost grows significantly because it needs to calculate
distances between the test point and all training points. This can be a
major drawback for large datasets.
2. High Memory Usage: k-NN requires storing the entire dataset, which can
be memory-intensive, especially for large datasets.


3. Choice of 'k': Selecting the right value for 'k' (the number of nearest
neighbors to consider) can be challenging. A small 'k' may lead to
overfitting, while a large 'k' may lead to underfitting.

2. How do you choose the value of K in KNN?


➢ The choice of k can significantly impact the model's performance. Here are some
common methods and considerations for selecting an appropriate value of k:
1. Odd vs. Even K
2. Rule of Thumb
3. Cross-Validation
4. Grid Search
5. Domain Knowledge
6. Distance Metrics
7. Visualizations
8. Regularization Techniques

3. How does KNN handle data with missing values?


➢ Here are some common approaches to deal with missing data in k-NN:
1. Remove Instances with Missing Values:
2. Imputation
3. Ignoring Missing Values
4. Weighted k-NN
5. Feature Engineering

Suggested Reference:

• https://codecrucks
• https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4
• https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-algorithm-b3519ed98e
References used by the students:

• https://www.geeksforgeeks.org/project-knn-classifying-iris-dataset/
• https://www.analyticsvidhya.com/blog/2022/06/iris-flowers-classification-using-machine-learning/

Rubric wise marks obtained:

Rubrics   1                2                3                4                5                Total

Marks     Complete         Complete         Complete         Complete         Complete
          implementation   implementation   implementation   implementation   implementation
          as asked         as asked         as asked         as asked         as asked

                           Correct Result   Correct Result   Correct Result   Correct Result

                                            Conclusions      Conclusions      Conclusions

                                                             References       References

                                                                              Correct answer
                                                                              to all questions


Experiment No: 8
Implement Decision tree classifier and test its performance

Date:

Competency and Practical Skills:

Relevant CO: 3, 5

Objectives:

1. To understand how decision tree works

2. To learn to train decision tree classifier

3. To test the performance of classifier using 10-fold cross validation

Theory:

What is Decision Tree?

A decision tree is a predictive modelling technique that uses a tree-like structure to depict the
possible outcomes and options involved in a decision-making process. The process begins at a
single node at the top of the tree, referred to as the root node. The root node then divides into
several child nodes, each of which represents a distinct decision or outcome based on the values
of one or more input variables.

The tree extends its branches recursively, with each succeeding node reflecting a more specific
choice or outcome determined by the values of additional input variables. The terminal nodes of
the tree, referred to as leaves, represent the final decision or result of the process.

Decision tree model to accept/reject job offer


Decision tree usually classify data using “if then” rules. For example, for above decision
tree,

Rule 1: if Salary < 50K, then reject job

Rule 2: If salary is > 50K and company distance > 30km, then reject job

Rule 3: If salary is > 50K and company distance < 30km and yearly increment > 20%
then accept job.

And many more rules…
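As an illustration, the first three rules above translate directly into "if then" code (a sketch; the function name and the default decision for the remaining combinations are assumptions for demonstration only):

def job_offer_decision(salary_k, distance_km, yearly_increment_pct):
    # Rule 1: if salary < 50K, reject the job
    if salary_k < 50:
        return "reject"
    # Rule 2: if salary > 50K and company distance > 30 km, reject the job
    if distance_km > 30:
        return "reject"
    # Rule 3: if salary > 50K, distance < 30 km and yearly increment > 20%, accept the job
    if yearly_increment_pct > 20:
        return "accept"
    # The remaining combinations would be handled by further rules of the full tree
    return "reject"

print(job_offer_decision(salary_k=60, distance_km=10, yearly_increment_pct=25))  # accept
print(job_offer_decision(salary_k=40, distance_km=10, yearly_increment_pct=25))  # reject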

Both classification and regression problems can be solved with decision trees. In a classification
problem, the objective is to determine the category or class into which a new observation falls,
based on the values of one or more input variables. In a regression problem, the goal is to
predict the value of a continuous variable from the values of one or more input variables.

The decision tree algorithm is effective because it recursively divides the input space
into subsets, with each subset being determined by the values of the variables that are
entered. The input variable that offers the best split is the one that is chosen by the
algorithm. This is the variable that either maximizes the information gain or minimizes
the impurity of the subsets that are produced as a result of the split. This process is
repeated until all of the variables that were entered have been used or until a stopping
requirement (such as a maximum tree depth or a minimum number of observations per
leaf) has been satisfied, whichever comes first.

The popularity of decision trees can be attributed to the fact that they are simple to read
and visualize, as well as the fact that they are able to process categorical and continuous
input variables. On the other hand, they are susceptible to overfitting, particularly when
the tree is excessively deep or when there are an excessive number of input variables.
Approaches such as pruning and ensemble methods are potential solutions to this problem.

Construction of Decision Tree:

The building of a decision tree requires the partitioning of the input space into subsets
in a recursive manner, with the subsets being determined by the values of the variables
that are entered, until a stopping criterion is satisfied. The following are the stages
involved in the construction of a decision tree:

1. Begin with the root node: The root node represents the complete dataset, and all
of the input variables are available to be used in making decisions regarding the
splitting of the nodes.

2. The algorithm analyses each input variable and chooses the one that offers the best split
based on a given criterion (such as information gain, Gini impurity, or the chi-squared
test). The variable selected to split on is referred to as the "splitting variable." The
optimal split is the one that either maximizes the information gain or minimizes the
impurity of the resulting subsets.

3. Once the best input variable has been chosen, the dataset is partitioned into two or more
subsets based on the possible values of that variable. This step is known as the creation
of child nodes. Each subset becomes a child node, which serves as the starting point for
the next level of the tree.

4. Iterate over steps two and three in a recursive manner: The algorithm iterates
over steps two and three for each child node, selecting the best input variable
and creating additional child nodes until a stopping criterion is met (for example,
the maximum tree depth, the minimum number of observations per leaf, or there
is no significant improvement in the model's performance).

5. Prune the tree: Once the tree has been completely built, it may be excessively
complicated and prone to overfitting. To prevent this, the tree can be "pruned" by
eliminating branches that do not improve the model's performance on a validation set,
or by placing a complexity penalty on the tree.

6. Use the tree for prediction: After the tree has been built and pruned, it can be used for
prediction by traversing the tree from the root node to the appropriate leaf node based
on the values of the input variables for a new observation.

Visualizing the process of constructing a decision tree as a tree-like structure, with the
root node at the top and the leaf nodes at the bottom, is one way to represent this
process. The nodes in a decision tree represent the points at which a decision must be
made, and the branches reflect the various alternative outcomes or values that can be
obtained from the variables that are input.
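To make the idea of the "best split" concrete, the short sketch below computes the entropy of the class labels and the information gain of the Outlook attribute for the weather dataset used in the implementation that follows (a sketch assuming the standard ID3-style definitions):

import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions p
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # Gain(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)) over the values v of attribute A
    gain = entropy(labels)
    n = len(labels)
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
           'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']

print(round(entropy(play), 3))                    # 0.940 for the 9 Yes / 5 No split
print(round(information_gain(play, outlook), 3))  # about 0.247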

Implementation:

• Consider following data set. Train decision tree with random 10 data points and
test with remaining 4. Create decision tree with different parameters and test it

#     Outlook    Temp.   Humidity   Windy   Play

D1    Sunny      Hot     High       False   No
D2    Sunny      Hot     High       True    No
D3    Overcast   Hot     High       False   Yes
D4    Rainy      Mild    High       False   Yes
D5    Rainy      Cool    Normal     False   Yes
D6    Rainy      Cool    Normal     True    No
D7    Overcast   Cool    Normal     True    Yes
D8    Sunny      Mild    High       False   No
D9    Sunny      Cool    Normal     False   Yes
D10   Rainy      Mild    Normal     False   Yes
D11   Sunny      Mild    Normal     True    Yes
D12   Overcast   Mild    High       True    Yes
D13   Overcast   Hot     Normal     False   Yes
D14   Rainy      Mild    High       True    No

Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Create a DataFrame for the dataset
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
                'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild',
             'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal',
                 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Windy': [False, True, False, False, False, True, True, False, False, False, True, True, False,
              True],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)

# Encode the categorical feature columns as one-hot (dummy) variables
df_encoded = pd.get_dummies(df, columns=['Outlook', 'Temp', 'Humidity', 'Windy'])

# Split the data into training and testing sets (first 10 data points for training, last 4 for testing)
X_train = df_encoded.iloc[:10, 1:]
y_train = df_encoded.iloc[:10, 0]
X_test = df_encoded.iloc[10:, 1:]
y_test = df_encoded.iloc[10:, 0]

# Try different values of the max_depth parameter
max_depth_values = [1, 2, 3, 5]
accuracies = []
for max_depth in max_depth_values:
    # Create and train a decision tree classifier with the specified max_depth
    decision_tree = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth,
                                           random_state=42)
    decision_tree.fit(X_train, y_train)
    # Test the decision tree on the test data and record the accuracy
    y_pred = decision_tree.predict(X_test)
    accuracy = (y_pred == y_test).mean()
    accuracies.append(accuracy)
    print(f"parameter: {max_depth}, Accuracy: {accuracy}")

# Plot the last trained decision tree
plt.figure(figsize=(12, 6))
tree.plot_tree(decision_tree, feature_names=X_train.columns.tolist(), class_names=['No', 'Yes'],
               filled=True, rounded=True)
plt.savefig("prfig8.pdf", format="pdf", bbox_inches="tight")
plt.show()

# Print the predictions of the last tree for the test data
print("Predictions for the test data:")
print(y_pred)
Output:

Conclusion:
• Implementing a Decision Tree Classifier and evaluating its performance is an
essential task in machine learning. In conclusion, this study delved into the
fundamentals of decision trees and their operation. We successfully acquired the
knowledge and skills necessary to train a decision tree classifier, a valuable tool
in machine learning. Furthermore, we assessed the performance of the classifier
through a rigorous 10-fold cross-validation process, which helps ensure the
model's robustness and reliability. This exploration provides a strong foundation
for leveraging decision trees in various data classification tasks.

Quiz:


1. What are the benefits of using a decision tree in decision making?


➢ Versatility: They can handle classification and regression tasks.
➢ Non-linearity: Suitable for modeling non-linear relationships in data.
➢ Feature Selection: Identifies important features and interactions.
➢ Scalability: Decision trees can handle large datasets and are relatively
computationally efficient, especially when compared to more complex models like
neural networks.
➢ Easy to Update: Decision trees can be easily updated with new data or changes in
the decision criteria, making them flexible for evolving decision-making scenarios.

2. How do you choose the best split for a decision tree?


➢ Evaluate Potential Splits: For each feature in your dataset, evaluate potential
splits (thresholds for continuous features or unique values for categorical
features).
➢ Select the Best Split: Choose the split that maximizes information gain (for
entropy) or minimizes impurity (for Gini impurity) or error (for regression
criteria). This is done by comparing the criterion values computed for each candidate split.
➢ Repeat for All Features: Repeat the evaluation and selection for every feature in your
dataset, and select the best overall split across all features.
➢ Recursive Splitting: Recursively apply the same procedure to the resulting child nodes
until a stopping criterion is met.
➢ Stop Criteria: Define stopping criteria to determine when to stop the splitting
process.

3. What are some limitations of using a decision tree?


➢ Overfitting: Decision trees are prone to overfitting, especially when they are deep
and complex. Overfitting occurs when a tree captures noise or random fluctuations
in the training data, making it perform poorly on unseen data.
➢ Instability: Small changes in the training data can lead to significantly different
tree structures. Decision trees can be highly sensitive to variations in the dataset,
which can result in unstable models.
➢ Inefficient on Large Datasets: Building a decision tree on large datasets can be
computationally expensive and time-consuming. Other algorithms like Random
Forests or Gradient Boosting may be more efficient and accurate in such cases.
➢ Greedy Algorithm: Decision trees use a greedy approach to select the best split at
each node. This can lead to suboptimal splits at higher levels in the tree, affecting
the overall tree structure.
➢ Lack of Consistency: Decision tree algorithms may produce different trees on
different runs of the same data due to the randomness involved in the splitting
process. This lack of consistency can be a problem in certain scenarios.

4. What is pruning in a decision tree, and why is it important?


➢ Pruning in the context of decision trees refers to the process of reducing the size
and complexity of a decision tree by removing certain branches or nodes. It is an
important technique used to prevent overfitting and improve the generalization
ability of the tree.
➢ Overfitting Prevention: Decision trees can be highly prone to overfitting, where
they capture noise and small fluctuations in the training data, resulting in poor
performance on unseen data.
➢ Model Simplicity: A smaller, pruned tree is simpler and easier to interpret. It
provides a more concise representation of the decision process, making it more
accessible to humans.
➢ Reduced Computational Complexity: Smaller trees are computationally less
expensive to use for making predictions. They require less processing power and
time to evaluate, which is especially valuable for real-time or resource-constrained

applications.
➢ Improved Stability: Pruned trees tend to be more stable and less sensitive to
variations in the training data. A non-pruned tree can be highly sensitive to small
changes in the data, resulting in different tree structures for similar datasets.

Suggested Reference:

• https://www.ibm.com/in-en/topics/decision-trees
• https://hbr.org/1964/07/decision-trees-for-decision-making
• https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

References used by the students:

● https://hbr.org/1964/07/decision-trees-for-decision-making

● https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm

Rubric wise marks obtained:

Rubrics   1                2                3                4                5                Total

Marks     Complete         Complete         Complete         Complete         Complete
          implementation   implementation   implementation   implementation   implementation
          as asked         as asked         as asked         as asked         as asked

                           Correct Result   Correct Result   Correct Result   Correct Result

                                            Conclusions      Conclusions      Conclusions

                                                             References       References

                                                                              Correct answer
                                                                              to all questions


Experiment No: 9
Implement program to demonstrate Neural Network Classifier

Date:

Competency and Practical Skills:

Relevant CO: 3, 5

Objectives:

1. To understand the basic difference between biological neural networks and artificial
neural networks

2. To learn the mathematical structure of artificial neural networks

3. To implement neural network as classifier

Theory:

Biological Neuron:

Biological neurons are the primary components of the nervous system in all species.
These cells are specialized in the use of electrochemical signals for processing and
communication. An axon, dendrites, and a cell body make up a neuron in a living
organism. The axon is responsible for sending messages to other neurons or muscles,
while the dendrites are responsible for receiving them.

Biological neuron [Source: wikipedia]

Artificial Neuron:

Artificial neurons, also called perceptrons, are mathematical functions that simulate the
behaviour of biological neurons. Artificial neural networks, composed of such neurons, are
employed in many AI and machine learning systems. An artificial neuron receives data as input,
processes it using a set of weights and biases, and then generates a response as output.

There are functional parallels between biological neurons and artificial neurons, but
there are also important distinctions. To modify their weights and biases, artificial
neurons, in contrast to their biological counterparts, need training data. Biological
neurons also have the ability to create new neurons and form new connections, but
artificial neurons are limited by a predetermined architecture.

The perceptron is the most elementary form of neural network, with a single layer of output
nodes that take input from several input nodes and produce a single binary value. It was first
introduced by Frank Rosenblatt in 1957.

The input values for the perceptron method are multiplied by their associated weights
and then added together. The output of the perceptron is the result of this sum being fed
into an activation function.

Most perceptrons use the step activation function, which returns 1 if the weighted total of the
inputs is larger than a threshold value and 0 otherwise. The perceptron is trained by modifying
the weights, which are initially set to random values, based on the difference between the
expected and actual outputs.

Perceptron computational model
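A minimal sketch of this computation is shown below (the weights, inputs and threshold are assumed values chosen only for illustration):

import numpy as np

def perceptron_output(x, w, threshold=0.0):
    # Weighted sum of the inputs followed by the step activation function
    weighted_sum = np.dot(w, x)
    return 1 if weighted_sum > threshold else 0

# Assumed example: two inputs with weights 0.6 and 0.4, threshold 0.5
w = np.array([0.6, 0.4])
print(perceptron_output(np.array([1, 1]), w, threshold=0.5))  # 1.0 > 0.5 -> 1
print(perceptron_output(np.array([1, 0]), w, threshold=0.5))  # 0.6 > 0.5 -> 1
print(perceptron_output(np.array([0, 1]), w, threshold=0.5))  # 0.4 <= 0.5 -> 0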

Many different types of binary classification tasks, including those in the fields of image
recognition and natural language processing, have benefited from the use of perceptron.
However, they are unable to deal with situations that are more complex and would
benefit from additional layers and non-linear activation functions.

Neural Network:

Inspired by the form and function of biological neural networks in the human brain,
Artificial Neural Networks (ANNs) are a type of machine learning method. ANNs are
made up of a network of processing nodes, or "neurons," that communicate with one
another to discover and understand hidden correlations and patterns in data.

Each neuron in an ANN is equipped with a mathematical function that takes as its input

signals from other neurons or from the outside world and generates an output signal
based on the processed data. Each neuron's output signals are propagated to nearby
neurons, creating a distributed system of processors.

Basic architecture of neural network

In order to increase their predictive or classifying abilities during training, ANNs tweak
the strengths of the connections between neurons. To do so, the network's predictions
are compared to the actual values of the target variable, and the resulting "cost
function" is minimized.

Image and speech recognition, NLP, predictive analytics, and robotics are just a few of the many
areas where ANNs find widespread use. They excel at tasks with non-linear, intricate
relationships between input and output variables.

Implementation:

• Use 70% of Iris flower dataset to train neural network model. Test it with remaining
30% data and measure the accuracy.

• Try different architectures and training functions and also note down performance
of each.

Code:

!pip install keras
!pip install tensorflow

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense

# Load the Iris dataset (only the first two features are used here)
iris = load_iris()
X = iris.data[:, :2]
Y = iris.target

# Standardizing the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into train (70%) and test (30%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

def create_model(input_dim, output_dim, hidden_neurons):
    # Build a feed-forward network with one Dense layer per entry in hidden_neurons
    units_list = hidden_neurons if isinstance(hidden_neurons, (list, tuple)) else [hidden_neurons]
    model = Sequential()
    model.add(Dense(units_list[0], input_dim=input_dim, activation='relu'))
    for units in units_list[1:]:
        model.add(Dense(units, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    return model

def train_model(model, X_train, Y_train, X_test, Y_test, epochs=20, batch_size=32):
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,
              validation_data=(X_test, Y_test))

def test_model(model, X_test, Y_test):
    scores = model.evaluate(X_test, Y_test, verbose=0)
    return scores[1] * 100

input_dim = X_train.shape[1]
output_dim = len(np.unique(Y_train))

# Try different architectures (number of hidden layers and neurons per layer)
for hidden_neurons in [5, (6, 3), (100, 50), (10, 5, 3)]:
    model = create_model(input_dim, output_dim, hidden_neurons)
    train_model(model, X_train, Y_train, X_test, Y_test)
    accuracy = test_model(model, X_test, Y_test)
    hidden_layers = len(hidden_neurons) if isinstance(hidden_neurons, (list, tuple)) else 1
    print(f'Hidden Layer({hidden_layers}) | Hidden Neurons({hidden_neurons}) | Accuracy({accuracy:.2f}%)')

Results:

Architecture of Neural network Training function Accuracy

1 Hidden Layer (5 neurons) adam 68.89%

2 Hidden Layer (6,3) adam 57.78%

2 Hidden Layer (100,50) adam 68.89%

3 Hidden Layer (10,5,3) adam 24.44%

Conclusion:
• In conclusion, this study aimed to explore the fundamental distinctions between
biological neural networks and artificial neural networks. We delved into the
mathematical underpinnings of artificial neural networks, gaining insights into

43
Machine Learning (3170724) 210210107003

their structure and operations. Additionally, we successfully applied this


knowledge to implement an artificial neural network as a classifier,
demonstrating its potential in solving a variety of machine learning tasks. This
research contributes to our understanding of the synergy between biological and
artificial neural networks and their practical applications in modern technology.
Quiz:

1. What are the different types of neural networks and their applications?
➢ Feedforward Neural Networks (FNN or FFNN):
Applications: Feedforward neural networks are general-purpose and can be
applied to a wide range of tasks, including regression, classification, function
approximation, and more. They are commonly used in image and text
classification, financial forecasting, and speech recognition.
➢ Convolutional Neural Networks (CNN):
Applications: CNNs are widely used in image and video analysis tasks, including
image classification, object detection, image segmentation, facial recognition,
and medical image analysis. They can also be applied to natural language
processing for tasks like text classification.
➢ Recurrent Neural Networks (RNN):
Applications: RNNs are suitable for sequential data, such as time series
forecasting, natural language processing (NLP) tasks (e.g., language modeling,
machine translation), and speech recognition. Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) variations of RNNs are often used
for improved performance.
➢ Gated Recurrent Unit (GRU):
Applications: Like LSTMs, GRUs are used in NLP tasks and speech recognition. They
are computationally more efficient and are suitable for simpler sequential
tasks.
➢ Autoencoders:
Applications: Autoencoders are used for dimensionality reduction, feature
learning, and data denoising. Variational Autoencoders (VAEs), a variant of
autoencoders, are used for generating new data points in a structured and
meaningful way.

2. What is regularization in neural networks and why is it important?


➢ Regularization techniques, like L1 and L2 regularization, are used to prevent
overfitting in neural networks. They add penalty terms to the loss function,
encouraging the model to have smaller weights. Regularization helps to simplify
the model, improve generalization, and reduce the risk of fitting noise in the
training data. It is crucial in preventing neural networks from memorizing the
training data while failing to generalize to unseen data.

3. What are some common activation functions used in neural networks?


➢ Activation functions are crucial components of artificial neural networks. They
introduce non-linearity to the network, allowing it to learn complex relationships
in the data.
1. Sigmoid Function (Logistic Activation)
2. Hyperbolic Tangent Function (Tanh Activation)
3. Rectified Linear Unit (ReLU)
4. Leaky ReLU
5. Parametric ReLU (PReLU)
6. Step Function
7. Softmax Function

4. What are some real-world applications of neural networks?


➢ Image and Video Analysis: Image Classification, Object Detection, Face Recognition,
Video Surveillance
➢ Natural Language Processing (NLP): Language Translation, Text Classification,
Chatbots and Virtual Assistants, Speech Recognition
➢ Healthcare: Medical Imaging, Drug Discovery, Disease Prediction
➢ Autonomous Vehicles: Self-Driving Cars
➢ Financial Services: Algorithmic Trading, Credit Scoring, Fraud Detection
➢ Recommendation Systems: E-commerce, Content Recommendations,
➢ Gaming: Game AI, Game Testing and Debugging
➢ Manufacturing and Industry: Quality Control, Predictive Maintenance
➢ Energy: Smart Grids, Energy Consumption Prediction
➢ Human Resources: They assist in resume screening, candidate selection, and
employee turnover prediction.

Suggested Reference:

● https://www.ibm.com/in-en/topics/neural-networks#What%20is%20a%20neural%20network?

● https://towardsdatascience.com/a-beginner-friendly-explanation-of-how-neural-networks-work-55064db60df4

● https://aws.amazon.com/what-is/neural-network/

References used by the students:

● https://wiki.pathmind.com/neural-network

● https://www.javatpoint.com/artificial-neural-network

Rubric wise marks obtained:

Rubrics   1                2                3                4                5                Total

Marks     Complete         Complete         Complete         Complete         Complete
          implementation   implementation   implementation   implementation   implementation
          as asked         as asked         as asked         as asked         as asked

                           Correct Result   Correct Result   Correct Result   Correct Result

                                            Conclusions      Conclusions      Conclusions

                                                             References       References

                                                                              Correct answer
                                                                              to all questions


Experiment No: 10

Write a program to demonstrate within class scatter, between class scatter and total scatter of
the dataset.

Date:

Competency and Practical Skills:

Relevant CO: 1, 2

Objectives:

1. To understand scatter of the data

2. To learn to compute and compare within class, between class and total scatter of
dataset.

Theory:

Scatter:

Scatter means dispersion in a dataset. It describes how close or how far data points are from the
mean of the dataset. It is useful in clustering.

The variance or standard deviation of a dataset can be calculated to get a sense of its
dispersion. Both show how far the individual values deviate from the mean or average.

The standard deviation is the square root of the variance, while variance is the average
of the squared differences between each data point and the mean. If the data points are
more widely scattered, then the variance or standard deviation will be larger, and vice
versa if they are more tightly packed around the mean.
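A small worked example of these two measures (the five sample values are assumed, and the population formulas with denominator N are used):

import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 6.0])

mean = data.mean()                       # 4.0
variance = ((data - mean) ** 2).mean()   # average squared deviation from the mean = 1.6
std_dev = np.sqrt(variance)              # square root of the variance, approximately 1.265

print(mean, variance, std_dev)
# The same values via NumPy's built-ins (ddof=0 gives the population variance)
print(np.var(data, ddof=0), np.std(data, ddof=0))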


Data distribution for three classes

Within Class scatter:

The variance of data points within a class or cluster is referred to as the within-class scatter,
intra-class scatter, or within-cluster variance. It is a metric for measuring how tightly the groups
in the feature space are packed together.

The sum of squared distances between each data point in a cluster and the cluster
center can be used to determine the within-class scatter for that cluster. The
corresponding formula is as follows:

S_W = Σ (i = 1 to C) Σ (x ∈ w_i) (x − m_i)(x − m_i)^T

Where,

• 𝑤𝑖 is the i-th class,

• 𝑚𝑖 is the mean of i-th class

• 𝐶 is number of classes

Lower values of the within-class scatter indicate more tightly packed and clearly
defined clusters, making it a useful metric for judging the efficacy of various clustering
algorithms. Fisher's linear discriminant analysis (LDA), which aims to maximize the
ratio of the between-class scatter to the within-class scatter, is one example of a feature
selection and dimensionality reduction technique that takes advantage of this property.

Between Class Scatter:

Distance between cluster centers in a clustering algorithm's feature space is measured by
between-class scatter, also called inter-class scatter. Distance between clusters can
be measured quantitatively and used to test the algorithm's ability to classify data into
distinct groups.

To determine the dispersion between classes, we can add up the squared differences
between the centers of each cluster and the overall mean or centroid of the data. The
corresponding formula is as follows:

S_B = Σ (i = 1 to C) n_i (m_i − m)(m_i − m)^T

Where,


• 𝑛𝑖 is the number of samples in i-th class

• 𝑚 is the mean of entire dataset

The between-class scatter can be used in feature selection and dimensionality reduction
techniques such as Fisher's linear discriminant analysis (LDA), which seeks to maximize
the ratio of the between-class scatter to the within-class scatter. A larger value of
between-class scatter relative to within-class scatter implies better discriminative
power of the clustering algorithm. However, it should be noted that maximizing the
between-class scatter alone may lead to overfitting and poor generalization to new data.

Total Scatter:

A dataset's total scatter, sometimes called total variance or total sum of squares,
quantifies the degree to which the data points within it vary from one another. For
clustering difficulties, it can be broken down into within-class scatter and between-class
scatter, and for regression problems, it can be broken down into explained variance and
unexplained variance.

The sum of squared distances between each data point and the general mean or
centroid of all data points can be used to estimate the total scatter for a dataset. The
corresponding formula is as follows:

S_T = Σ (i = 1 to N) (x_i − m)(x_i − m)^T

Where,

• x_i is the i-th sample from the dataset, and N is the total number of samples

If you're trying to compare the within-class scatter to the between-class scatter in a


clustering problem or the explained variance to the unexplained variance in a
regression problem, you can use the total scatter as a benchmark. R-squared, the
fraction of total variation in the dependent variable that is explained by the independent
variable(s) in a regression model, can also be calculated using this method.
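As a worked relation (assuming the usual decomposition of the total sum of squares into explained and residual parts), R-squared can be written as R² = SS_explained / SS_total = 1 − SS_residual / SS_total; for example, a regression whose unexplained scatter is 20 units out of a total scatter of 100 units has R² = 1 − 20/100 = 0.8.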

To evaluate the efficacy of different modelling strategies and to comprehend the overall
variability of the data, knowing the total scatter is crucial.

Implementation:

• Use Iris flower data set and compute with class scatter, between class scatter and
total scatter

Code:
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Calculate the mean for each feature
feature_means = np.mean(X, axis=0)

# Initialize scatter matrices
within_class_scatter = np.zeros((X.shape[1], X.shape[1]))
between_class_scatter = np.zeros((X.shape[1], X.shape[1]))

# Calculate the class means
class_means = np.zeros((3, X.shape[1]))
for i in range(3):
    class_instances = X[y == i]
    class_means[i, :] = np.mean(class_instances, axis=0)

# Calculate the within-class scatter and between-class scatter
for i in range(3):
    class_instances = X[y == i]
    n = class_instances.shape[0]
    diff = class_instances - class_means[i]
    within_class_scatter += np.dot(diff.T, diff)
    mean_diff = class_means[i] - feature_means
    between_class_scatter += n * np.outer(mean_diff, mean_diff)

# Calculate the total scatter
total_scatter = within_class_scatter + between_class_scatter

# Print the results
print("Within Class Scatter:")
print(within_class_scatter)
print("\nBetween Class Scatter:")
print(between_class_scatter)
print("\nTotal Scatter:")
print(total_scatter)
Results:

Value

Within class scatter [[38.9562 13.63 24.6246 5.645 ]


[13.63 16.962 8.1208 4.8084]
[24.6246 8.1208 27.2226 6.2718]
[ 5.645 4.8084 6.2718 6.1566]]

Between class scatter [[ 63.21213333 -19.95266667 165.2484 71.27933333]


[-19.95266667 11.34493333 -57.2396 -22.93266667]
[165.2484 -57.2396 437.1028 186.774 ]
[ 71.27933333 -22.93266667 186.774 80.41333333]]
Total Scatter [[102.16833333 -6.32266667 189.873 76.92433333]
[ -6.32266667 28.30693333 -49.1188 -18.12426667]
[189.873 -49.1188 464.3254 193.0458 ]
[ 76.92433333 -18.12426667 193.0458 86.56993333]]

Conclusion:


• In conclusion, this study delved into the concept of data scatter, a crucial aspect
of data analysis. We successfully learned how to compute and compare three key
measures of scatter: within-class scatter, between-class scatter, and total scatter.
These measures provide valuable insights into the distribution and separability
of data points, aiding in the assessment and optimization of various machine
learning and statistical models. Understanding these scatter metrics is essential
for making informed decisions in data analysis and pattern recognition tasks.

Quiz:

1. What is the difference between variance and standard deviation as measures of scatter?
➢ Variance and standard deviation are both statistical measures used to quantify the
spread or dispersion of a dataset, but they differ in their units and scale.
➢ Variance: Variance is the average of the squared differences between each data point and
the mean of the dataset. It measures how much individual data points deviate from the
mean. Variance is represented as σ² for a population and s² for a sample, and its units are
the square of the original data's units (e.g., square meters, square dollars, etc.).
➢ Standard Deviation: The standard deviation is the square root of the variance. It
measures the average deviation of data points from the mean and is denoted as σ for a
population and s for a sample. The standard deviation has the same units as the original
data (e.g., meters, dollars, etc.).

2. How can outliers affect the calculation of variance and standard deviation?
➢ Outliers can have a significant impact on the calculation of both variance and
standard deviation.
➢ Variance: Outliers can lead to an inflated variance because the variance is
calculated by squaring the differences between data points and the mean. When
there are extreme values (outliers) in the dataset, these squared differences
become very large, which in turn increases the overall variance. Outliers
effectively contribute more to the variance than other data points.
➢ Standard Deviation: Outliers can also influence the standard deviation, although
to a somewhat lesser extent than variance. While the standard deviation still
accounts for the squared differences, it mitigates the effect of outliers by taking the
square root of the variance.

3. What is the difference between within-class scatter and between-class scatter in clustering
problems?
➢ In clustering problems, within-class scatter (intra-cluster scatter) measures the
variability or dispersion of data points within the same cluster. Lower within-class
scatter indicates that data points within a cluster are closer to each other, implying
a more compact cluster.
➢ Between-class scatter (inter-cluster scatter) measures the variability between
different clusters. It represents how separated the clusters are from each other.
Lower between-class scatter implies better separation between clusters.

4. What is the total scatter of a dataset and how is it calculated?


➢ The total scatter, often referred to as the total sum of squares (TSS), is a measure
that quantifies the total variability or dispersion of a dataset. It provides a baseline
for understanding how much data points deviate from the dataset's overall mean.
The total scatter is an important component in statistical analysis, particularly
when assessing the contributions of other sources of variability, such as within-


class scatter and between-class scatter in classification or clustering problems.


➢ Here's how the total scatter is calculated:
1. Step 1: Calculate the Mean
2. Step 2: Calculate the Squared Differences
3. Step 3: Calculate the Total Scatter (TSS)

Suggested Reference:

● https://www.sciencedirect.com/topics/computer-science/class-scatter-matrix

● https://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/old_IDAPILecture15.pdf

● https://multivariatestatsjl.readthedocs.io/en/latest/mclda.html

● https://www.oreilly.com/library/view/feature-engineering-made/9781787287600/ad8e90ca-9227-4150-9bd2-6b664dd04f46.xhtml

References used by the students:

● https://www.machinelearningplus.com/plots/python-scatter-plot/

● https://www.geeksforgeeks.org/problem-solving-on-scatter-matrix/

Rubric wise marks obtained:

Rubrics   1                2                3                4                5                Total

Marks     Complete         Complete         Complete         Complete         Complete
          implementation   implementation   implementation   implementation   implementation
          as asked         as asked         as asked         as asked         as asked

                           Correct Result   Correct Result   Correct Result   Correct Result

                                            Conclusions      Conclusions      Conclusions

                                                             References       References

                                                                              Correct answer
                                                                              to all questions


Experiment No: 11

Write a program to demonstrate clustering using K-means algorithm

Date:

Competency and Practical Skills:

Relevant CO: 4, 5

Objectives:

1. To understand the basics of clustering

2. To learn the applications of clustering

3. To implement k-means clustering algorithm

Theory:

Clustering

Clustering is a data grouping method used in machine learning and data mining to find patterns
in large amounts of data. Its main purpose is to help us see connections between data items and
understand the underlying structure of the data.

Clustering process: Input: unlabeled data, output: 3 clusters

Different clustering algorithms have different advantages and disadvantages. The following are
examples of common clustering methods:

1. K-means clustering is a well-known approach for organizing data into K clusters


with a focus on how closely they are related. The user decides on the optimal
number of clusters in advance.

2. In hierarchical clustering, the number of clusters is not specified beforehand, and


data points are organized into a tree-like structure. Agglomerative hierarchical


clustering takes data points one at a time and combines them into larger clusters,
while divisive hierarchical clustering takes all data points at once and splits them
recursively into smaller clusters.

3. Data points are clustered in the feature space using a method called density-
based clustering. It is resistant to noise and outliers and can detect clusters of
any size or shape.

4. Spectral clustering employs graph theory to form groupings of data points with similar
spectral characteristics. It works particularly well with nonlinear or complex datasets.

Image segmentation, customer segmentation, anomaly detection, and pattern


identification are just a few of the many real-world uses for clustering.

K-Means clustering:

K-means clustering is a common unsupervised machine learning approach that partitions the
data into K groups. To begin, the algorithm generates K random starting centroids to serve as
cluster centers. Next, it forms K clusters by assigning each observation to its nearest centroid.
After that, it computes the mean of each cluster and moves the centroid accordingly. These
assignment and update steps are repeated until the centroids stop shifting or the maximum
number of iterations is reached.

K-means clustering basically consists of these steps:

1. As a first step, pick K points at random to serve as the centers of your data.

2. To build K clusters, assign each data point to the centroid that is the closest to it.

3. To update, find the average value in each group and set the new center of gravity
there.

4. Iterate until convergence is attained or the maximum number of allowed


iterations is reached; then, stop.
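The assignment and update steps above can be sketched in a few lines of NumPy (the toy 1-D data and the initial centroids are assumed values used only for illustration):

import numpy as np

# Assumed toy 1-D data and K = 2 initial centroids
data = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([1.0, 3.0])

for _ in range(10):  # a few iterations are enough for this tiny example
    # Assignment step: each point goes to its closest centroid
    distances = np.abs(data[:, None] - centroids[None, :])
    assignments = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of the points assigned to it
    new_centroids = np.array([data[assignments == k].mean() for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):  # stop when the centroids no longer move
        break
    centroids = new_centroids

print(centroids)     # [ 2. 11.]
print(assignments)   # [0 0 0 1 1 1]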

A key parameter in K-means clustering is the number of clusters, or K. Domain


expertise, trial and error, or heuristics like the elbow approach or the silhouette method
can all help with this determination.
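For instance, the elbow approach plots the within-cluster sum of squares (inertia) against K and looks for the "elbow" where the curve flattens; a sketch using scikit-learn's KMeans on the Iris features is given below:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

inertias = []
k_range = range(1, 9)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(k_range), inertias, 'o-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow method for choosing K')
plt.show()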

K-means clustering's benefits include its ease of use, scalability, and efficiency. However, the
method assumes that clusters have a spherical shape and equal variance, which may not always
be the case, and K must be specified in advance.

Implementation:

• Use 70% of the Iris dataset to train K-means clustering. Test it with the remaining 30% of
the data and measure the accuracy.

• Try different values of k and observe the effect. Also observe the effect of different distance
measures.

Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()

# Create a dataframe from the Iris dataset
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df[iris['feature_names']], df['target'],
                                                    test_size=0.3, random_state=42)

# Define the different distance measures
distance_measures = ['euclidean']

# Define the different values of k
k_values = [2, 3, 4, 5]

# Train and evaluate the K-Means clustering model
def train_and_evaluate(X_train, X_test, y_train, y_test, k, distance_measure):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X_train)
    y_pred = kmeans.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy for k={k} and distance measure={distance_measure}: {accuracy}')
    return accuracy

# Plot the accuracy of the K-Means clustering model for different values of k and distance measures
accuracies = {}
for k in k_values:
    for distance_measure in distance_measures:
        accuracy = train_and_evaluate(X_train, X_test, y_train, y_test, k, distance_measure)
        accuracies[f'k={k}_distance_measure={distance_measure}'] = accuracy

plt.figure(figsize=(8, 6))
for distance_measure in distance_measures:
    plt.plot(k_values, [accuracies[f'k={k}_distance_measure={distance_measure}'] for k in k_values],
             label=distance_measure)
plt.xlabel('Value of k')
plt.ylabel('Accuracy')
plt.title('Accuracy of K-Means Clustering on Iris Dataset')
plt.legend()
plt.show()

Results:


Value of K    Distance measure    Accuracy

K=2           Euclidean           0.7111
K=3           Euclidean           0.2222
K=4           Euclidean           0.6222
K=5           Euclidean           0.1777

Conclusion:
• In conclusion, this study provided a fundamental understanding of clustering, a
powerful unsupervised learning technique. We explored the diverse applications
of clustering in various fields, including data analysis, image processing, and
customer segmentation. Additionally, we successfully implemented the k-means
clustering algorithm, a widely used clustering method, showcasing its capability
to group data points into meaningful clusters. This knowledge equips us with
valuable tools for organizing and extracting insights from complex datasets,
paving the way for improved decision-making and problem-solving in a range of
real-world scenarios.
Quiz:

1. What are the main steps of K-means clustering?


1. Step 1. Determine the value of "K"; "K" represents the number of clusters.
2. Step 2. Randomly select K distinct centroids (initial cluster centers).
3. Step 3. Measure the distance (e.g., Euclidean distance) between each point and
each centroid.
4. Step 4. Assign each point to the nearest cluster.
5. Step 5. Calculate the mean of each cluster as the new centroid.
6. Step 6. Repeat steps 3–5 with the new cluster centers.
7. Step 7. Calculate the variance of each cluster.
8. Step 8. Repeat steps 2–7 until the lowest sum of variance is obtained.

2. What is the role of centroids in K-means clustering?


➢ Centroids play a central role in the K-means clustering algorithm. They serve as
the representatives or centers of the clusters and are essential for both the initial
setup of clusters and the iterative optimization process.
➢ centroids in K-means clustering serve as the core of the algorithm, helping define
the initial clusters, guiding the assignment of data points, and driving the iterative
process to optimize cluster assignments. They are critical for creating well-defined
clusters and capturing the central tendencies of the data within each cluster.

3. What are some advantages of K-means clustering?


➢ Simplicity and Ease of Implementation: K-means is a straightforward and easy-
to-understand algorithm, making it accessible to both beginners and experts in the
field of machine learning and data analysis.
➢ Efficiency: K-means is computationally efficient and can handle large datasets
with many data points and features. It converges relatively quickly, especially with
the right initialization techniques.
➢ Scalability: K-means is scalable and can be applied to high-dimensional data. It
can handle datasets with many features, making it versatile for a wide range of
applications.
➢ Applicability to Many Domains: K-means is versatile and has applications in


various domains, including image processing, natural language processing,


customer segmentation, and more.
➢ Good for Data Exploration: K-means can be used as a first step in data
exploration, helping to discover patterns and groupings within the data. It can
serve as a starting point for more advanced analysis.

4. What are some limitations of K-means clustering?


➢ Assumption of Spherical Clusters: K-means assumes that clusters are spherical,
equally sized, and have roughly the same density. It may not perform well when
clusters have non-spherical shapes, differing sizes, or varying densities.
➢ Lack of Robustness to Outliers: Outliers can significantly impact K-means
results, as the algorithm aims to minimize the sum of squared distances from data
points to their assigned centroids. Isolated outliers can pull centroids away from
the main cluster.
➢ Hard Assignments: K-means provides hard cluster assignments, meaning each
data point belongs to one and only one cluster. In real-world data, soft
assignments (where data points can belong to multiple clusters with varying
degrees of membership) may be more appropriate.
➢ Limited Handling of Noise: K-means does not have an explicit mechanism to
handle noisy data points or outliers. Outliers can lead to the creation of separate
clusters, even when they are not meaningful.
➢ Loss of Information: K-means reduces data to cluster assignments and centroids,
which can lead to a loss of information about the underlying structure of the data.

Suggested Reference:

• https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
• https://serokell.io/blog/k-means-clustering-in-machine-learning
• https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
References used by the students:

• https://www.geeksforgeeks.org/k-means-clustering-introduction/
• https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning

Rubric wise marks obtained:

Rubrics   1                2                3                4                5                Total

Marks     Complete         Complete         Complete         Complete         Complete
          implementation   implementation   implementation   implementation   implementation
          as asked         as asked         as asked         as asked         as asked

                           Correct Result   Correct Result   Correct Result   Correct Result

                                            Conclusions      Conclusions      Conclusions

                                                             References       References

                                                                              Correct answer
                                                                              to all questions
