Machine Learning Lab Manual
Experiment 1
Title: Statistical Description of Data
Objective:
Relevance: Statistical descriptions are foundational in understanding data and are used across various
domains such as healthcare, finance, and engineering for decision-making. Understanding these concepts is
crucial for exploring data trends, variability, and patterns, which form the basis of predictive analytics and
machine learning.
Theory:
Statistical description involves summarizing and analyzing datasets to extract meaningful patterns and
characteristics. Key concepts include:
1. Central Tendency:
o Mean: The average of data values.
o Median: The middle value in sorted data.
o Mode: The most frequent data value.
2. Dispersion:
o Range: The difference between the maximum and minimum values.
o Variance: The average of squared deviations from the mean.
o Standard Deviation: The square root of variance, indicating data spread.
3. Visualization:
o Histogram: Graphical representation of data distribution.
Significance:
Data Distribution: Histograms help visualize how data is spread over a range.
Outliers: They help identify outliers or unusual data points.
Comparison: They are useful for comparing distributions of different datasets.
Skewness: They indicate whether data is symmetric or skewed.
A box plot is a graphical representation of the distribution of a dataset. It displays key summary
statistics such as the median, quartiles, and potential outliers in a concise and visual manner. A
box plot therefore provides a summary of the distribution, helps identify potential outliers, and
allows comparison of different datasets in a compact and visual manner.
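The statistics and plots above can be reproduced with a short Python script. The following is a minimal sketch, assuming NumPy, SciPy, and Matplotlib are installed; the sample data is illustrative.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical sample data (e.g., exam scores)
data = np.array([45, 52, 52, 60, 61, 65, 70, 72, 75, 80, 85, 98])

# Central tendency
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=False).mode)  # keepdims requires SciPy >= 1.9

# Dispersion
print("Range:", data.max() - data.min())
print("Variance:", np.var(data))
print("Standard deviation:", np.std(data))

# Visualization: histogram and box plot side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(data, bins=6, edgecolor="black")
ax1.set_title("Histogram")
ax2.boxplot(data)
ax2.set_title("Box plot")
plt.show()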
Conclusion:
Statistical descriptions and visualizations such as histograms and box plots summarize central tendency, spread, and outliers, forming the basis for further analysis.
Experiment 2
Title: Implementation and Evaluation of ID3/C4.5 Algorithm
Aim: To implement and evaluate the ID3 or C4.5 decision tree algorithm on a given dataset.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across various
domains, including healthcare, finance, and technology. Understanding the implementation and
evaluation of these algorithms equips learners with the skills to build predictive models, analyze data
effectively, and innovate solutions tailored to specific challenges. Each experiment fosters critical
thinking, technical expertise, and hands-on experience, bridging the gap between theoretical
knowledge and practical application.
Theory:
Types of decision trees are based on the type of target variable we have. It can be of two types:
1. Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree.
2. Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree.
Important terminology related to decision trees:
1. Root Node: It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can say it is the opposite of splitting.
6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its child nodes.
ID3 Algorithm: The ID3 algorithm builds the tree top-down as follows:
1. It begins with the original set S as the root node.
2. On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy (or information gain) of that attribute.
3. It then selects the attribute which has the smallest entropy (or largest information gain); a small worked example of this computation is shown below.
4. The set S is then split by the selected attribute to produce subsets of the data.
5. The algorithm continues to recurse on each subset, considering only attributes never selected before.
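To make the entropy and information-gain computation concrete, the following is a small illustrative Python sketch; the split counts are hypothetical (9 "yes" / 5 "no" samples split into three subsets by some attribute).

import math

def entropy(class_counts):
    # Shannon entropy of a class distribution, e.g. [9, 5]
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # Entropy of the parent minus the weighted entropy of the children
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# Hypothetical split: 14 samples (9 yes / 5 no) divided into three subsets
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247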
Pruning: In pruning, you trim off the branches of the tree, i.e., remove the decision nodes starting from the leaf
nodes such that the overall accuracy is not disturbed. This is done by segregating the actual training set
into two sets: a training data set D and a validation data set V. Prepare the decision tree using the
training data set D, then continue trimming the tree to optimize the accuracy on the validation data set V.
C4.5 Algorithm: An extension of ID3 that handles continuous attributes and missing values.
As an enhancement to the ID3 algorithm, Ross Quinlan created the decision tree algorithm C4.5.
It is a popular approach for building decision trees in machine learning and data mining
applications. C4.5 addresses certain drawbacks of the ID3 algorithm, including its inability to
handle continuous attributes and its tendency to overfit the training set.
A modification of information gain known as the gain ratio is used to address the bias towards
attributes with many values. It is computed by dividing the information gain by the intrinsic
(split) information, which is a measure of the amount of data required to describe an attribute's
values:
Gain Ratio = Information Gain / Split Information
where Split Information represents the entropy of the feature itself. The feature with the highest
gain ratio is chosen for splitting.
When dealing with continuous attributes, C4.5 first sorts the attribute's values and then considers
the midpoint between each pair of adjacent values as a potential split point. It calculates the
information gain or gain ratio for each candidate and selects the split point with the largest value.
By turning every path from the root to a leaf into a rule, C4.5 can also produce rules from the
decision tree. Predictions on new data can then be generated using these rules.
C4.5 is an effective technique for creating decision trees: it handles both discrete and continuous
attributes and can produce rules from the tree. Its use of the gain ratio and reduced-error pruning
increases the model's accuracy and prevents overfitting.
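As an illustrative sketch, the following trains and evaluates a decision tree with scikit-learn on the Iris dataset. Note that scikit-learn implements an optimized version of CART rather than ID3/C4.5 itself, but setting criterion="entropy" makes it split on information gain in the spirit of these algorithms.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# max_depth limits tree growth, a simple form of pre-pruning
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Print the learned tree as nested if-then conditions (cf. C4.5 rule generation)
print(export_text(clf, feature_names=load_iris().feature_names))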
Conclusion:
Decision tree algorithms provide intuitive and effective models for classification tasks.
Experiment 3
Title: Implementation and Evaluation of Random Forest Algorithm
Aim: To implement and evaluate the Random Forest algorithm on a given dataset.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across various
domains, including healthcare, finance, and technology. Understanding the implementation and
evaluation of these algorithms equips learners with the skills to build predictive models, analyze data
effectively, and innovate solutions tailored to specific challenges. Each experiment fosters critical
thinking, technical expertise, and hands-on experience, bridging the gap between theoretical
knowledge and practical application.
Theory:
Random Forest is an ensemble learning method that creates multiple decision trees using bootstrapped
datasets and combines their predictions using majority voting or averaging.
The Random Forest algorithm is a powerful tree-learning technique in machine learning. It works by
creating a number of decision trees during the training phase. Each tree is constructed using a
random subset of the dataset, and a random subset of features is considered at each split. This
randomness introduces variability among individual trees, reducing the risk of overfitting and
improving overall prediction performance.
For prediction, the algorithm aggregates the results of all trees, either by voting (for classification
tasks) or by averaging (for regression tasks). This collaborative decision-making process, supported
by multiple trees and their individual insights, provides stable and precise results. Random forests
are widely used for classification and regression and are known for their ability to handle complex
data, reduce overfitting, and provide reliable predictions in different environments.
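A minimal sketch of training and evaluating a Random Forest with scikit-learn follows; the Iris dataset is used purely as an example.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# n_estimators = number of trees; each tree sees a bootstrap sample of the
# data and a random subset of features at every split
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))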
Conclusion:
Random Forest is a robust and versatile machine learning model for classification and regression
tasks.
Experiment 4
Title: Implementation and Evaluation of KNN Algorithm
Aim: To implement and evaluate the K-Nearest Neighbors (KNN) algorithm on a given dataset.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across various
domains, including healthcare, finance, and technology. Understanding the implementation and
evaluation of these algorithms equips learners with the skills to build predictive models, analyze data
effectively, and innovate solutions tailored to specific challenges. Each experiment fosters critical
thinking, technical expertise, and hands-on experience, bridging the gap between theoretical
knowledge and practical application.
Theory:
KNN assigns a class to a data point based on the majority class among its k-nearest neighbors,
calculated using distance metrics such as Euclidean distance.
Suppose there are two categories, Category A and Category B, and we have a new data point x1.
Which of these categories will this data point fall into? To solve this type of problem, we need a
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular
data point.
How does K-NN work?
The working of K-NN can be explained on the basis of the following algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new point to each training point.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
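A minimal sketch of K-NN with scikit-learn, using K = 5 and the default Euclidean distance metric; the Iris dataset is illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# K = 5 neighbors; distances are Euclidean by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))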
Conclusion:
KNN is an intuitive and effective model for classification and regression tasks.
Experiment 5
Title: Implementation and Evaluation of K-Means Clustering Algorithm
Aim: To implement and evaluate the K-Means clustering algorithm on a given dataset.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across various
domains, including healthcare, finance, and technology. Understanding the implementation and
evaluation of these algorithms equips learners with the skills to build predictive models, analyze data
effectively, and innovate solutions tailored to specific challenges. Each experiment fosters critical
thinking, technical expertise, and hands-on experience, bridging the gap between theoretical
knowledge and practical application.
Theory:
K-Means iteratively assigns data points to the nearest cluster centroid and updates centroids until
convergence.
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. Here K defines the number of pre-defined clusters that
need to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data points and their corresponding
cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in
this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near a
particular k-center create a cluster.
Hence each cluster has data points with some commonalities and is away from other clusters.
The K-means clustering algorithm works in the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
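A minimal sketch of K-Means with scikit-learn on synthetic two-dimensional data; the dataset and the choice K = 3 are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative unlabeled dataset with three natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Inertia (sum of squared distances to centroids):", kmeans.inertia_)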
Conclusion:
K-Means is a simple and efficient algorithm for partitioning unlabeled data into meaningful clusters.
Experiment 6
Title: Implementation and Evaluation of Naïve Bayes Classifier Algorithm
Aim: To implement and evaluate the Naïve Bayes classifier algorithm on a given dataset.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across various
domains, including healthcare, finance, and technology. Understanding the implementation and
evaluation of these algorithms equips learners with the skills to build predictive models, analyze data
effectively, and innovate solutions tailored to specific challenges. Each experiment fosters critical
thinking, technical expertise, and hands-on experience, bridging the gap between theoretical
knowledge and practical application.
Theory:
Naïve Bayes applies Bayes' theorem with the assumption of feature independence to classify data
points.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the
probability of a hypothesis with prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = [P(B|A) * P(A)] / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
The working of the Naïve Bayes classifier can be understood with the help of the following example.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we follow these steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability for each class and pick the class with the higher posterior.
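A minimal sketch of this weather/"Play" idea using scikit-learn's CategoricalNB; the encoded outlook values and labels below are illustrative, not taken from the manual.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Outlook encoded as 0=Sunny, 1=Overcast, 2=Rainy; target 0=No, 1=Yes (hypothetical data)
X = np.array([[0], [0], [1], [2], [2], [2], [1], [0], [0], [2], [0], [1], [1], [2]])
y = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])

model = CategoricalNB()
model.fit(X, y)

# Posterior probabilities P(Play = No) and P(Play = Yes) given Outlook = Sunny
print("P(No), P(Yes) given Sunny:", model.predict_proba([[0]])[0])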
Conclusion:
Naïve Bayes is a powerful algorithm for text classification and other applications.
Experiment 7
Title: Implementation and Evaluation of Linear/Multiple Regression Algorithm
Aim: To implement and evaluate the Linear/Multiple Regression algorithm on a given dataset.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across various
domains, including healthcare, finance, and technology. Understanding the implementation and
evaluation of these algorithms equips learners with the skills to build predictive models, analyze data
effectively, and innovate solutions tailored to specific challenges. Each experiment fosters critical
thinking, technical expertise, and hands-on experience, bridging the gap between theoretical
knowledge and practical application.
Theory:
Regression is a supervised learning technique that supports finding the correlation among
variables. A regression problem is when the output variable is a real or continuous value.
"Regression shows a line or curve that passes through all the data points on a target-predictor
graph in such a way that the vertical distance between the data points and the regression line is
minimum." It is used principally for prediction, forecasting, time-series modelling, and
determining the cause-and-effect relationship between variables.
Linear regression is a simple statistical regression method used for predictive analysis; it shows
the relationship between continuous variables. Linear regression shows the linear relationship
between the independent variable (X-axis) and the dependent variable (Y-axis). If there is a single
input variable (x), such linear regression is called simple linear regression, and if there is more
than one input variable, it is called multiple linear regression. The linear regression model gives
a sloped straight line describing the relationship between the variables.
Based on the given data points, we try to plot the best-fit straight line that models the linear
relationship between the independent and dependent variables. Mathematically, the model is:
y = a0 + a1x (equivalently, y = mx + b)
Cost Function:
The cost function optimizes the regression coefficients or weights and measures how well a linear
regression model is performing. It is used to find the accuracy of the mapping function that maps
the input variable to the output variable; this mapping function is also known as the hypothesis
function.
In linear regression, the Mean Squared Error (MSE) cost function is used, which is the average of
the squared errors between the predicted values and the actual values:
J = (1/n) * Σ (ŷi - yi)^2
For the simple linear equation y = mx + b, the MSE can be written as:
MSE = (1/n) * Σ (yi - (m*xi + b))^2
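A minimal sketch of fitting a simple linear regression with scikit-learn and computing the MSE cost; the noisy data generated around y = 2x + 1 is illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative data: y roughly follows 2x + 1 with Gaussian noise
rng = np.random.default_rng(42)
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(0, 1, X.shape[0])

model = LinearRegression()
model.fit(X, y)

print("Slope (a1):", model.coef_[0])
print("Intercept (a0):", model.intercept_)
print("MSE:", mean_squared_error(y, model.predict(X)))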
Conclusion:
Linear regression is a simple and interpretable model for predicting continuous values.
Experiment 8
Title: Implementation and Evaluation of Logistic Regression Algorithm
Aim: To implement and evaluate the Logistic Regression algorithm on a binary classification
problem.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across
various domains, including healthcare, finance, and technology. Understanding the
implementation and evaluation of these algorithms equips learners with the skills to build
predictive models, analyze data effectively, and innovate solutions tailored to specific
challenges. Each experiment fosters critical thinking, technical expertise, and hands-on
experience, bridging the gap between theoretical knowledge and practical application.
Theory:
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1,
True or False, etc.; but instead of giving exact values of 0 and 1, it gives
probabilistic values which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how they are used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
Note: Logistic regression uses the concept of predictive modeling like regression, which is why it
is called logistic regression; however, since it is used to classify samples, it falls under the
classification algorithms.
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities: sigmoid(z) = 1 / (1 + e^(-z)).
o It maps any real value into another value within the range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1: values above the threshold tend to 1, and values below
the threshold tend to 0.
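A small numeric sketch of the sigmoid function and the 0.5 threshold idea:

import math

def sigmoid(z):
    # Maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5, right at the typical threshold
print(sigmoid(4))   # ~0.982, classified as 1
print(sigmoid(-4))  # ~0.018, classified as 0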
The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps to get the logistic regression equation are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can be between 0 and 1 only, so let's divide the above
equation by (1 - y):
y / (1 - y); which is 0 for y = 0 and infinity for y = 1
o But we need a range between -infinity and +infinity, so taking the logarithm of the
equation, it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".
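A minimal sketch of binomial logistic regression with scikit-learn; the breast cancer dataset is used as an illustrative binary classification task.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling helps the solver converge; predict applies a 0.5 threshold
# to the sigmoid output, while predict_proba returns the raw probabilities
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("P(class=1) for the first test sample:", model.predict_proba(X_test[:1])[0, 1])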
Conclusion:
Logistic regression is a simple and effective algorithm for binary classification problems.
Experiment 9
Title: Mini-Project Applying Machine Learning to a Real-World Problem
Aim: To collaboratively develop a mini-project that applies machine learning to solve a real-
world problem.
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across
various domains, including healthcare, finance, and technology. Understanding the
implementation and evaluation of these algorithms equips learners with the skills to build
predictive models, analyze data effectively, and innovate solutions tailored to specific
challenges. Each experiment fosters critical thinking, technical expertise, and hands-on
experience, bridging the gap between theoretical knowledge and practical application.
Theory:
Conclusion:
Experiment 10
Title: Report Writing and Presentation of the Mini-Project
Objective:
Relevance:
Machine learning algorithms are foundational to solving complex real-world problems across
various domains, including healthcare, finance, and technology. Understanding the
implementation and evaluation of these algorithms equips learners with the skills to build
predictive models, analyze data effectively, and innovate solutions tailored to specific
challenges. Each experiment fosters critical thinking, technical expertise, and hands-on
experience, bridging the gap between theoretical knowledge and practical application.
Theory:
The report and presentation should include an introduction, methodology, results, and
conclusion sections.
Conclusion: