
Unit 1


Unit -1

Machine learning is a subset of Artificial Intelligence (AI) that gives
machines the ability to learn automatically and improve from experience
without being explicitly programmed to do so. In this sense, it is the
practice of getting machines to solve problems by learning from data.
• Can a machine think or make decisions?
If you feed a machine a good amount of data, it can learn how to
interpret, process and analyze this data by using Machine Learning
algorithms, in order to solve real-world problems.

MACHINE LEARNING DEFINITIONS

Machine learning is a branch of artificial intelligence (AI) and computer
science which focuses on the use of data and algorithms to imitate the
way that humans learn, gradually improving its accuracy.

Machine learning enables a machine to automatically learn from data,
improve performance from experience, and make predictions without
being explicitly programmed.

With the help of sample historical data, known as training data,
machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together to
create predictive models. It constructs or uses algorithms that learn
from historical data. In general, the more relevant, good-quality data
we provide, the better the model performs.

A Machine Learning algorithm is a set of rules and statistical techniques
used to learn patterns from data and draw significant information from
it. It is the logic behind a Machine Learning model.
• Model: A model is the main component of Machine Learning. A model
is trained by using a Machine Learning algorithm. The algorithm
determines how the model maps a given input to the output it is
supposed to produce.
• Predictor Variable: A feature (or set of features) of the data that can
be used to predict the output.
• Response Variable: The output variable that needs to be predicted
by using the predictor variable(s).
• Training Data: The Machine Learning model is built using the training
data. The training data helps the model identify the key trends and
patterns needed to predict the output.
• Testing Data: After the model is trained, it must be tested to evaluate
how accurately it can predict an outcome. This is done with the testing
data set.

Machine Learning Process

The Machine Learning process involves building a predictive model that
can be used to find a solution for a Problem Statement.

Step 1: Define the objective of the Problem Statement

Step 2: Data Gathering. At this stage, you must ask questions such as:
• What kind of data is needed to solve this problem?
• Is the data available?
• How can I get the data?

Step 3: Data Preparation


The data you collect is almost never in the right format. You will
encounter many inconsistencies in the data set, such as missing values,
redundant variables and duplicate values. Removing such
inconsistencies is essential because they can lead to wrong
computations and predictions. Therefore, at this stage, you scan the
data set for inconsistencies and fix them.

Step 4: Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the brainstorming stage of Machine
Learning. Data exploration involves understanding the patterns and
trends in the data. At this stage, useful insights are drawn and
correlations between the variables are understood.
• For example, in the case of predicting rainfall, we know that there is a
strong possibility of rain if the temperature has dropped. Such
correlations must be understood and mapped at this stage.

Step 5: Building a Machine Learning Model
• All the insights and patterns derived during data exploration are used
to build the Machine Learning model. This stage begins by
splitting the data set into two parts: training data and testing data. The
training data is used to build the model. The logic of
the model is based on the Machine Learning algorithm that is being
implemented.
• In the case of predicting rainfall, since the output will be in the form of
True (it will rain tomorrow) or False (no rain tomorrow), we can use a
classification algorithm such as Logistic Regression.
• Choosing the right algorithm depends on the type of problem you’re
trying to solve, the data set and the level of complexity of the problem.
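
The rainfall example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is made up, the single predictor (temperature drop in degrees compared with the previous day) is hypothetical, and plain gradient descent stands in for whatever training routine a real library would use.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit p(rain) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def predict(w, b, x):
    """True means 'rain tomorrow', False means 'no rain tomorrow'."""
    return sigmoid(w * x + b) >= 0.5

# Hypothetical training data: temperature drop (positive = colder) and
# whether it rained the next day (1) or not (0).
train_x = [-3.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 3.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(train_x, train_y)
rain_after_big_drop = predict(w, b, 2.5)
rain_after_warming = predict(w, b, -2.5)
```

In this toy setting the model learns a positive weight, so a large temperature drop is classified as rain and a temperature rise as no rain.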

Step 6: Model Evaluation & Optimization


After building a model by using the training data set, it is time to
put the model to the test. The testing data set is used to check how
accurately the model can predict the outcome.
Once the accuracy is calculated, further improvements to the model
can be made at this stage. Methods like parameter tuning and
cross-validation can be used to improve the performance of the model.
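
As a rough sketch of how cross-validation divides the data, the helper below (a hypothetical name, not taken from any library) yields the index sets for k-fold cross-validation: each fold is used once for testing while the remaining folds are used for training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

The model would be trained and scored once per fold, and the scores averaged. In practice the data is usually shuffled before the folds are cut.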

Step 7: Predictions
Once the model is evaluated and improved, it is finally used to make
predictions. The final output can be a categorical variable (e.g. True or
False) or a continuous quantity (e.g. the predicted value
of a stock).
• In our case, for predicting the occurrence of rainfall, the output will be
a categorical variable.

OR
Machine learning Life cycle

The machine learning life cycle is a cyclic process for building an efficient
machine learning project. The main purpose of the life cycle is to find a
solution to the problem or project.
The machine learning life cycle involves seven major steps, which are given
below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand the
problem and to know its purpose. Therefore, before
starting the life cycle, we need to understand the problem, because a
good result depends on a good understanding of the problem.

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The
goal of this step is to identify and obtain all the data related to the
problem.

In this step, we need to identify the different data sources, as data can
be collected from various sources such as files, databases, the internet,
or mobile devices. It is one of the most important steps of the life cycle.
The quantity and quality of the collected data determine the
quality of the output. In general, the more representative data we have,
the more accurate the prediction will be.

This step includes the following tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called
a dataset. It will be used in the further steps.

2. Data preparation

After collecting the data, we need to prepare it for the further steps. Data
preparation is the step where we put our data into a suitable place and
prepare it for use in machine learning training.

In this step, we first put all the data together and then randomize the
ordering of the data.

This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data that we have to work
with. We need to understand the characteristics, format, and
quality of the data.
A better understanding of the data leads to a more effective outcome. In
this step, we look for correlations, general trends, and outliers.
o Data pre-processing:
The next step is to preprocess the data for analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a
usable format. It involves cleaning the data, selecting the
variables to use, and transforming the data into a proper format to make it
more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to
address quality issues.

Not all of the data we have collected is necessarily useful. In
real-world applications, collected data may have various issues, including:

o Missing Values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data.

It is important to detect and remove the above issues because they can
negatively affect the quality of the outcome.
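
A minimal sketch of such cleaning, using only the Python standard library. The record layout (`temp`, `humidity`), the valid temperature range and the choice to fill missing values with the mean are all assumptions made for illustration; real projects pick these per dataset.

```python
def clean_records(records, valid_range=(-50.0, 60.0)):
    """Drop duplicate and invalid rows, then fill missing temps with the mean."""
    seen = set()
    unique = []
    for row in records:
        key = (row["temp"], row["humidity"])
        if key not in seen:              # duplicate data
            seen.add(key)
            unique.append(dict(row))     # copy so the input is not modified

    low, high = valid_range
    valid = [r for r in unique
             if r["temp"] is None or low <= r["temp"] <= high]  # invalid data

    known = [r["temp"] for r in valid if r["temp"] is not None]
    # Assumes at least one row has a known temperature.
    mean_temp = sum(known) / len(known)
    for r in valid:
        if r["temp"] is None:            # missing values
            r["temp"] = mean_temp
    return valid

raw = [
    {"temp": 21.0, "humidity": 40},
    {"temp": 21.0, "humidity": 40},    # duplicate
    {"temp": None, "humidity": 55},    # missing value
    {"temp": 999.0, "humidity": 50},   # invalid reading
    {"temp": 25.0, "humidity": 60},
]
cleaned = clean_records(raw)
```

Here the duplicate and the out-of-range row are dropped, and the missing temperature is replaced by the mean of the remaining known values.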

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step.
This step involves:

o Selection of analytical techniques
o Building models
o Reviewing the result

The aim of this step is to build a machine learning model to analyze the
data using various analytical techniques and review the outcome. It
starts with determining the type of problem, where we
select a machine learning technique such
as Classification, Regression, Cluster analysis, or Association. We then
build the model using the prepared data and evaluate it. Hence, in
this step, we take the data and use machine learning algorithms to build
the model.

5. Train Model

The next step is to train the model. In this step we train our model
to improve its performance and so obtain a better outcome for the problem.

We use datasets to train the model using various machine learning
algorithms. Training is required so that the model can learn the
various patterns, rules, and features in the data.

6. Test Model

Once our machine learning model has been trained on a given dataset,
we test the model. In this step, we check the accuracy of our
model by providing a test dataset to it.

Testing the model determines its percentage accuracy against
the requirements of the project or problem.

7. Deployment
The last step of the machine learning life cycle is deployment, where we
deploy the model in a real-world system.

If the model prepared above produces accurate results as per our
requirements, with acceptable speed, then we deploy the model in the
real system. Before deploying the project, however, we check whether it is
improving its performance using the available data. The deployment
phase is similar to making the final report for a project.

Perspectives in ML

Machine learning involves searching a very large space of possible
hypotheses to determine the one that best fits the observed data. This
perspective raises questions such as:
• Which algorithm performs best for which types of problems and
representations?
• How much training data is sufficient?
• Can prior knowledge be helpful even when it is only approximately
correct?
• What is the best strategy for choosing a useful next training experience?
• What specific function should the system attempt to learn?
• How can the learner automatically alter its representation to improve its
ability to represent and learn the target function?

Issues in ML

1. Not enough training data: most algorithms need a lot of data
to function properly. Even a simple task may need thousands of
examples, and advanced tasks like image
or speech recognition may need lakhs or even millions of examples.

2. Poor quality of data: if your training data has lots of errors, outliers,
and noise, it becomes much harder for your machine learning model to
detect the underlying pattern, so it will not perform well.
Put effort into cleaning up your training data. No matter how good you are
at selecting and tuning the model, this step plays a major role in
building an accurate machine learning model.

3. Irrelevant features: our training data should contain mostly
relevant features and few or no irrelevant
ones. Much of the credit for a successful machine learning project goes to
coming up with a good set of features to train on
(often referred to as feature engineering), which includes feature
selection, feature extraction, and creating new features.

4. Nonrepresentative training data: to make sure that our model
generalizes well, the
training data must be representative of the new cases that we want to
generalize to. If we train our model on a
nonrepresentative training set, its predictions will not be accurate and
it may be biased against one class or group.

5. Overfitting and underfitting: overfitting means that the model
performs well on the training
data but fails to generalize to new data. Overfitting happens when the
model is too complex relative to the amount and noisiness of the
training data.
• Things we can do to overcome this problem:
a) Simplify the model by selecting one with fewer parameters.
b) Reduce the number of attributes in the training data.
c) Constrain the model.
d) Gather more training data.
e) Reduce the noise in the training data.
What is underfitting?
Underfitting is the opposite of overfitting: the model is too simple to
capture the underlying structure of the data, so it performs poorly even
on the training data. Things we can do to
overcome this problem:
a) Select a more powerful model, one with more parameters.
b) Train on better and more relevant features.
c) Reduce the constraints on the model.

DESIGNING LEARNING SYSTEMS

1. Data Collection
• The quantity and quality of your data dictate how accurate your model is
• The outcome of this step is generally a representation of the data which
we will use for training
• Pre-collected data, in the form of existing datasets, can also be used

2. Data Preparation
• Wrangle data and prepare it for training
• Clean that which may require it (remove duplicates, correct errors,
deal with missing values, normalization, data type conversions, etc.)
• Randomize data, which erases the effects of the particular order in
which we collected and/or otherwise prepared our data
• Visualize data to help detect relevant relationships between variables
or class imbalances (bias alert!), or perform other exploratory analysis
• Split into training and evaluation sets
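
The randomize-then-split step above can be sketched as follows. The function name, the 80/20 default and the fixed seed are illustrative assumptions; the seed only makes the shuffle repeatable.

```python
import random

def shuffle_and_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into training and evaluation sets."""
    rng = random.Random(seed)      # private generator, so results are repeatable
    shuffled = list(rows)          # leave the caller's list untouched
    rng.shuffle(shuffled)          # erase any effect of collection order
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))             # stand-in for ten collected examples
train_rows, eval_rows = shuffle_and_split(rows)
```

With ten rows and a fraction of 0.8, eight rows go to training and two to evaluation, and no row appears in both.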

3. Choose a Model
• Different algorithms are for different tasks; choose the right one

4. Train the Model
• The goal of training is to answer a question or make a prediction
correctly as often as possible
• Each iteration of the process is a training step

5. Evaluate the Model
• Uses some metric or combination of metrics to "measure" the objective
performance of the model
• Test the model against previously unseen data
• This unseen evaluation data is meant to be somewhat representative of
model performance in the real world, but it still helps tune the model (as
opposed to test data, which does not)
• Good train/eval split? 80/20, 70/30, or similar, depending on domain,
data availability, dataset particulars, etc.

6. Parameter Tuning
• Tuning is the process of maximizing a model’s performance without
overfitting or creating too much variance. In machine learning, this is
accomplished by selecting appropriate “hyperparameters.”
• Choosing an appropriate set of hyperparameters is crucial for model
accuracy, but can be computationally challenging. Hyperparameters
differ from other model parameters in that they are not learned by the
model automatically during training. Instead, they
must be set before training, either manually or by a search procedure.
• Many methods exist for selecting appropriate hyperparameters.
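
One simple method is a grid search: try each candidate value and keep the one that scores best on the evaluation data. The sketch below tunes the hyperparameter k of a tiny k-nearest-neighbours classifier. The classifier, the one-dimensional data and all names are assumptions chosen to keep the example self-contained.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of (x, label) pairs in data that are predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

def grid_search(train, validation, candidates):
    """Score every candidate k on the validation set and return the best."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical data; the point (3.4, "b") is a deliberately noisy label.
train = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (3.4, "b"),
         (6.0, "b"), (7.0, "b"), (8.0, "b")]
validation = [(3.5, "a"), (6.5, "b"), (1.5, "a"), (7.5, "b")]

best_k, scores = grid_search(train, validation, candidates=[1, 3, 5])
```

With k = 1 the noisy training point misleads the classifier on one validation example, while k = 3 smooths it out, so the search prefers the larger value.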

7. Make Predictions
• A further test data set, which has until this point been withheld
from the model (and for which class labels are known), is used to test
the model; this gives a better approximation of how the model will
perform in the real world

OR

Designing a learning system in machine learning requires careful
consideration of several key factors, including the type of data being
used, the desired outcome, and the available resources. The key steps
involved, along with some best practices to keep in mind, are as follows.

o The first step in designing a learning system in machine learning is
to identify the type of data that will be used. This can include
structured data, such as numerical and categorical data, as well as
unstructured data, such as text and images. The type of data will
determine the type of machine learning algorithms that can be
used and the preprocessing steps required.
o Once the data has been identified, the next step is to determine
the desired outcome of the learning system. This can include
classifying data, making predictions, or identifying patterns in the
data. The desired outcome will determine the type of machine
learning algorithm that should be used, as well as the evaluation
metrics that will be used to measure the performance of the
learning system.
o Next, the resources available for the learning system must be
considered. This includes the amount of data available, the
computational power available, and the amount of time available
to train the model. These resources will determine the complexity
of the machine learning algorithm that can be used and the
amount of data that can be used for training.
o Once the data, desired outcome, and resources have been
identified, it is time to select a machine-learning algorithm and
begin the training process. Decision trees, SVMs, and neural
networks are examples of common algorithms. It is crucial to
assess the effectiveness of the learning system using the right
assessment measures, such as recall, accuracy, and precision.
o After the learning system is trained, it is important to fine-tune
the model by adjusting the parameters and hyperparameters. This
can be done using techniques such as cross-validation and grid
search. The final model should be tested on a hold-out test set to
evaluate its performance on unseen data.

TYPES OF ML

Machine learning is a subset of AI that enables a machine to
automatically learn from data, improve performance from past
experience, and make predictions. It is commonly divided into four types:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning

As its name suggests, supervised machine learning is based on
supervision. In the supervised learning technique, we train the
machine using a "labelled" dataset, and based on the training, the
machine predicts the output. Here, labelled data means that the
inputs are already mapped to the correct outputs. More precisely, we
first train the machine with inputs and their corresponding outputs,
and then we ask the machine to predict the output for the test
dataset.

Let's understand supervised learning with an example. Suppose we have
an input dataset of cat and dog images. First, we train
the machine to understand the images, using features such as the shape and
size of the tail, the shape of the eyes, colour, and height (dogs are
typically taller, cats are smaller). After training is complete, we input the
picture of a cat and ask the machine to identify the object and predict
the output. The machine is now trained, so it checks the
features of the object, such as height, shape, colour, eyes, ears, and tail,
and finds that it is a cat. So, it puts it in the Cat category. This is
how the machine identifies objects in supervised
learning.

The main goal of the supervised learning technique is to map the input
variable (x) to the output variable (y). Some real-world applications of
supervised learning are Risk Assessment, Fraud Detection, Spam
Filtering, Image Segmentation, Medical Diagnosis and Speech Recognition.

Supervised machine learning can be classified into two types of
problems, which are given below:
o Classification
o Regression

a) Classification

Classification algorithms are used to solve classification problems, in
which the output variable is categorical, such as "Yes" or "No", Male or
Female, Red or Blue, etc. The classification algorithms predict the
categories present in the dataset. Some real-world examples of
classification are Spam Detection, Email Filtering, etc.

Some popular classification algorithms are given below:


o Random Forest Algorithm
o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm

b) Regression

Regression algorithms are used to solve regression problems, in which
the output variable is continuous and depends on the input variables
(in the simplest case, through a linear relationship). These
are used to predict continuous output variables, such as market trends
or weather measurements.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression

2. Unsupervised Machine Learning

Unsupervised learning is different from the supervised learning
technique; as its name suggests, there is no need for supervision. In
unsupervised machine learning, the machine is trained using
an unlabeled dataset, and it produces its output without any
supervision.

In unsupervised learning, the models are trained with data that is
neither classified nor labelled, and the model acts on that data without
any supervision.

The main aim of an unsupervised learning algorithm is to group or
categorize the unsorted dataset according to similarities, patterns,
and differences. Machines are instructed to find the hidden patterns
in the input dataset.

Let's take an example to understand it more precisely. Suppose there is
a basket of fruit images, and we input it into the machine learning
model. The images are totally unknown to the model, and the task of the
machine is to find the patterns and categories of the objects.

The machine will discover patterns and differences on its own, such as
colour difference and shape difference, and use them to group new images
when it is tested with the test dataset.

Applications: Network Analysis, Recommendation Systems, Anomaly
Detection, Singular Value Decomposition

Categories of Unsupervised Machine Learning

Unsupervised learning can be further classified into two types, which
are given below:
o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent
groups in the data. It is a way to group objects into clusters such
that the objects with the most similarities remain in one group and have
fewer or no similarities with the objects of other groups. An example of
clustering is grouping customers by their purchasing
behaviour.

Some of the popular clustering algorithms are given below:


o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
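o Principal Component Analysis (strictly a dimensionality-reduction
technique, often listed alongside clustering)
o Independent Component Analysis (likewise a technique for separating
components rather than a clustering algorithm)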
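
To make the idea concrete, here is a minimal sketch of K-Means on one-dimensional data, using the customer-spending example. The data and the starting centroids are made up, and supplying the starting centroids by hand is a simplification; real implementations usually pick them randomly.

```python
def k_means_1d(points, centroids, iterations=10):
    """K-Means (Lloyd's algorithm) on 1-D data with given starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spend of six customers.
spend = [10, 12, 11, 50, 52, 51]
centroids, clusters = k_means_1d(spend, centroids=[10, 50])
```

The six customers fall into a low-spending group centred on 11 and a high-spending group centred on 51.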

2) Association

Association rule learning is an unsupervised learning technique that
finds interesting relations among variables within a large dataset. The
main aim of this learning algorithm is to find the dependency of one
data item on another and map those variables accordingly, for example so
that a business can increase its profit. This algorithm is mainly applied
in Market Basket analysis, Web usage mining, continuous production,
etc. Some popular algorithms for association rule learning are the Apriori
algorithm, Eclat, and the FP-growth algorithm.
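
The dependency between items is usually measured with support and confidence, the two quantities that algorithms such as Apriori work with. The notes above do not define them, so the sketch below states them in code: support is the fraction of transactions containing an itemset, and confidence of a rule A -> B is support(A and B) divided by support(A). The baskets are made-up market-basket data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in baskets that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in half of the baskets, and two out of the three baskets with bread also contain milk.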

3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that
lies between supervised and unsupervised machine learning. It
represents the intermediate ground between supervised (with labelled
training data) and unsupervised (with no labelled training data)
learning and uses a combination of labelled and unlabeled datasets
during the training period.

Although semi-supervised learning is the middle ground between
supervised and unsupervised learning and operates on data that
includes a few labels, the data is mostly unlabeled. Labels
are costly to obtain, so in practice an organization may have only a few
of them. This sets it apart from supervised and unsupervised learning,
which are defined by the presence and absence of labels respectively.

The concept of semi-supervised learning was introduced to overcome the
drawbacks of supervised and unsupervised
learning algorithms. The main aim of semi-supervised learning is to effectively
use all the available data, rather than only the labelled data as in
supervised learning. Initially, similar data is clustered using an
unsupervised learning algorithm, and the clusters then help to label the
unlabeled data. This is useful because labelled data is
comparatively more expensive to acquire than unlabeled data.

4. Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which
an AI agent (a software component) automatically explores its
surroundings by trial and error: taking actions, learning from experience,
and improving its performance. The agent is rewarded for each good
action and punished for each bad action; hence the goal of a
reinforcement learning agent is to maximize its rewards.

In reinforcement learning, there is no labelled data as in supervised
learning; agents learn from their experience only.

The reinforcement learning process is similar to how a human being
learns; for
example, a child learns various things through experience in day-to-day
life. An example of reinforcement learning is playing a game, where the
game is the environment, the moves of the agent at each step define the
states, and the goal of the agent is to get a high score. The agent
receives feedback in the form of punishments and rewards.
Due to its way of working, reinforcement learning is employed in
different fields such as Game theory, Operations Research, Information
theory, and multi-agent systems.
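
The reward-driven loop can be sketched with tabular Q-learning, one common reinforcement learning method (the notes do not name a specific algorithm, so this choice is an assumption). The environment is a made-up corridor of five cells: the agent starts in cell 0, can move left or right, and is rewarded only on reaching the last cell.

```python
import random

def q_learning_corridor(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
                        epsilon=0.2, seed=0):
    """Learn action values for a 1-D corridor whose last cell is the goal."""
    rng = random.Random(seed)
    moves = (-1, +1)                              # action 0 = left, 1 = right
    q = [[0.0, 0.0] for _ in range(n_states)]     # q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Explore with probability epsilon, or when both actions look equal.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt = min(max(state + moves[action], 0), goal)
            reward = 1.0 if nxt == goal else 0.0
            # Move the estimate towards reward + discounted best future value.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q = q_learning_corridor()
policy = ["left" if left > right else "right" for left, right in q[:-1]]
```

After training, the learned policy is to move right in every cell, because that is the only way to collect the reward.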

Concepts of hypotheses

A hypothesis is defined as a supposition or proposed explanation
based on insufficient evidence or assumptions. It is a guess based
on some known facts that has not yet been proven. A good hypothesis is
testable: it can be shown to be either true or false.

Hypothesis in Machine Learning (ML)

The hypothesis is a concept borrowed from statistics and commonly used in
Machine Learning. It is specifically used in supervised machine learning,
where an ML model learns a function that best maps the inputs to the
corresponding outputs with the help of an available dataset.

In supervised learning techniques, the main aim is to determine the
hypothesis, out of the hypothesis space, that best maps inputs to the
corresponding (correct) outputs.

There are some common methods for finding a suitable
hypothesis in the hypothesis space, where the hypothesis space is
represented by uppercase H and a hypothesis by lowercase h. These
are defined as follows:

Hypothesis space (H):

The hypothesis space is defined as the set of all possible legal hypotheses;
hence it is also known as the hypothesis set. It is used by supervised
machine learning algorithms to determine the best possible hypothesis
to describe the target function, i.e. the one that best maps inputs to
outputs.

Hypothesis (h):

It is defined as the approximate function that best describes the target in
supervised machine learning algorithms. It is primarily based on the data as
well as the bias and restrictions applied to the data.
Hence a hypothesis (h) is a single function that maps
inputs to outputs and can be evaluated as well as used to make
predictions.

For a linear model, the hypothesis (h) can be formulated as follows:

y = mx + b

Where,

y: range (the predicted output)
m: slope of the line, i.e. the change in y divided by the
change in x
x: domain (the input)
b: intercept (constant)

https://www.geeksforgeeks.org/ml-understanding-hypothesis/
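
As an illustration of choosing one hypothesis h from the space of all lines y = mx + b, the sketch below fits m and b to a few points by ordinary least squares. The data points are made up, and least squares is one standard way of picking the line, not the only one.

```python
def fit_line(xs, ys):
    """Return the slope m and intercept b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance        # change in y per unit change in x
    b = mean_y - m * mean_x          # value of y when x is 0
    return m, b

def make_hypothesis(m, b):
    """Build the hypothesis h(x) = m * x + b."""
    return lambda x: m * x + b

m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
h = make_hypothesis(m, b)
prediction = h(5)
```

For these points the fitted hypothesis is y = 2x + 1, so the prediction for x = 5 is 11.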

VERSION SPACE

• A version space is a hierarchical representation of knowledge that
enables you to keep track of all the useful information supplied by a
sequence of learning examples without remembering any of the
examples.
• The version space method is a concept learning process accomplished
by managing multiple models within a version space.

Version Space Characteristics

• Tentative heuristics are represented using version spaces.
• A version space represents all the alternative plausible descriptions of
a heuristic. A plausible description is one that is applicable to all known
positive examples and no known negative example.
• A version space description consists of two complementary trees:
1. One that contains nodes connected to overly general models, and
2. One that contains nodes connected to overly specific models.
• Node values/attributes are discrete.

Fundamental Assumptions
1. The data is correct; there are no erroneous instances.
2. A correct description is a conjunction of some of the attributes with
values.

Diagrammatical Guidelines
• There is a generalization tree and a specialization tree.
• Each node is connected to a model.
• Nodes in the generalization tree are connected to a model that
matches everything in its subtree.
• Nodes in the specialization tree are connected to a model that
matches only one thing in its subtree.
• Links between nodes and their models denote
o generalization relations in a generalization tree, and
o specialization relations in a specialization tree.

• The key idea in version space learning is that specialization of the
general models and generalization of the specific models may ultimately
lead to just one correct model that matches all observed positive
examples and does not match any negative examples.
• That is, each time a negative example is used to specialize the general
models, those specific models that match the negative example are
eliminated, and each time a positive example is used to generalize the
specific models, those general models that fail to match the positive
example are eliminated. Eventually, the positive and negative examples
may be such that only one general model and one identical specific
model survive.
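
For a very small attribute space, the version space can be computed by brute force: list every conjunctive description and keep those that match all positive examples and no negative example. This is a sketch of the definition, not the tree-based method described above; the attributes (weather and temperature), the examples and the `?` wildcard are illustrative assumptions.

```python
from itertools import product

ANY = "?"  # wildcard: the attribute may take any value

def matches(hypothesis, example):
    """A conjunctive hypothesis matches if every non-wildcard value agrees."""
    return all(h == ANY or h == value for h, value in zip(hypothesis, example))

def version_space(domains, positives, negatives):
    """All conjunctive hypotheses consistent with the labelled examples."""
    candidates = product(*[list(values) + [ANY] for values in domains])
    return [h for h in candidates
            if all(matches(h, p) for p in positives)
            and not any(matches(h, n) for n in negatives)]

domains = [("sunny", "rainy"), ("warm", "cold")]
positives = [("sunny", "warm")]
negatives = [("rainy", "cold")]

vs = version_space(domains, positives, negatives)

# One more positive example shrinks the version space to a single model.
vs_after = version_space(domains, positives + [("sunny", "cold")], negatives)
```

With one positive and one negative example, three descriptions remain plausible. After the second positive example only ("sunny", "?") survives, which mirrors the convergence to a single model described above.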

INDUCTIVE BIAS

Inductive bias in machine learning refers to the set of assumptions that
the learning algorithm uses to generalize from the training data to
unseen data. These biases influence how the algorithm selects a
particular hypothesis (model) from the space of all possible hypotheses.

Inductive bias refers to the restrictions that are imposed by the
assumptions made in the learning method.
In machine learning, the term inductive bias refers to a set of (explicit or
implicit) assumptions made by a learning algorithm in order to perform
induction, that is, to generalize a finite set of observations (training data)
into a general model of the domain.
For example, assuming that the solution to the problem of road safety
can be expressed as a conjunction of a set of eight concepts is itself an
inductive bias: it rules out any solution that cannot be written as such
a conjunction. In order to
have an unbiased learner, the version space would have to contain every
possible hypothesis that could possibly be expressed. The solution that
such a learner produced could never be more general than the complete
set of training data.

In other words, it would be able to classify data that it had previously
seen (as a rote learner could) but would be unable to generalize in
order to classify new, unseen data. The inductive bias of the candidate
elimination algorithm is that it is only able to classify a new piece of data
if all the hypotheses contained within its version space give the data the
same classification.

Hence, the inductive bias does impose a limitation on the learning
method.

performance metrics in machine learning

Evaluating the performance of a machine learning model is one of the
important steps in building an effective ML model. To evaluate the
performance or quality of the model, different metrics are used, and
these metrics are known as performance metrics or evaluation
metrics. These performance metrics help us understand how well our
model has performed on the given data. Using them, we can improve the
model's performance by tuning the hyperparameters. Each ML model
aims to generalize well to unseen/new data, and performance metrics
help determine how well the model generalizes to a new dataset.

In supervised machine learning, most tasks fall
into either classification or regression. Not all metrics can be used for all
types of problems; hence, it is important to know and understand which
metrics should be used. Different evaluation metrics are used for
regression and classification tasks. This section discusses metrics
used for classification and regression tasks.

1. Performance Metrics for Classification

In a classification problem, the category or class of each data point is
identified based on training data. The model learns from the given
dataset and then classifies new data into classes or groups based on that
training. It predicts class labels as the output, such as Yes or No, 0 or
1, Spam or Not Spam, etc. To evaluate the performance of a classification
model, different metrics are used, and some of them are as follows:

o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC

2. Performance Metrics for Regression

Regression is a supervised learning technique that aims to find the
relationships between the dependent and independent variables. A
predictive regression model predicts a continuous numeric value rather
than a class label. The metrics used for regression are therefore
different from the classification metrics. We cannot use the Accuracy
metric (explained below) to evaluate a regression model; instead, the
performance of a regression model is reported as errors in the
prediction. Following are the popular metrics that are used to evaluate
the performance of regression models.

o Mean Absolute Error
o Mean Squared Error
o R² Score
o Adjusted R²
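
As a rough illustration of the regression metrics listed above, the
sketch below computes them in plain Python using their standard
definitions. The function name `regression_metrics`, the sample values,
and the `n_features` argument (the number of predictors, needed for the
adjusted score) are assumptions made for this example and do not come
from the notes.

```python
def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, R^2 and adjusted R^2 for paired lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    mse = sum(e * e for e in errors) / n             # mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes extra predictors; requires n > n_features + 1.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "r2": r2, "adjusted_r2": adj_r2}


# Hypothetical ground-truth and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(y_true, y_pred, n_features=1)
print(metrics)
```

Lower MAE and MSE indicate smaller errors, while R² and adjusted R²
closer to 1 indicate that the model explains more of the variation in
the target.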

Performance Metrics: Accuracy

The accuracy metric is one of the simplest classification metrics to
implement. It is the ratio of the number of correct predictions to the
total number of predictions.

It can be formulated as:

Accuracy = Number of correct predictions / Total number of predictions

To implement an accuracy metric, we can compare ground truth and
predicted values in a loop, or we can use the scikit-learn module for
this. Its value ranges between 0 and 1.

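Example (table of actual and predicted values not reproduced here): what
will the accuracy be?

Since 2 predictions out of 4 are correct, accuracy = 2/4 = 0.5, or 50%.
Can we conclude that a model is adequate whenever its accuracy is good?
Not necessarily: on an imbalanced dataset, a model that always predicts
the majority class can score a high accuracy while never detecting the
minority class.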
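
The accuracy calculation can be sketched as a simple loop over ground
truth and predictions. The label lists below are hypothetical stand-ins
for the example (2 correct out of 4), and the second pair of lists is an
assumed imbalanced dataset, added to show why good accuracy alone does
not make a model adequate.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical example: 2 of the 4 predictions are correct.
actual = [1, 0, 1, 0]
predicted = [1, 1, 0, 0]
acc = accuracy(actual, predicted)
print(acc)  # 0.5

# Assumed imbalanced data: 5 positives, 95 negatives, and a "model"
# that always predicts the negative class.
actual_imb = [1] * 5 + [0] * 95
predicted_imb = [0] * 100
imbalanced_acc = accuracy(actual_imb, predicted_imb)
print(imbalanced_acc)  # 0.95, although no positive is ever detected
```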

Confusion Matrix

A confusion matrix gives us insight into how correct our predictions
were and how they hold up against the actual values.

The confusion matrix is a matrix used to determine the performance of a
classification model on a given set of test data. It can only be computed
if the true values for the test data are known. The matrix itself is easy
to understand, but the related terminology may be confusing. Since it
shows the errors in the model's performance in the form of a matrix, it
is also known as an error matrix.

The confusion matrix for a binary classification problem (figure not
reproduced here) has the following properties:
o For a classifier with 2 prediction classes, the matrix is a 2×2
table; for 3 classes, it is a 3×3 table, and so on.
o The matrix has two dimensions, predicted values and actual values,
along with the total number of predictions.
o Predicted values are the values predicted by the model, and actual
values are the true values for the given observations.
True Negative (TN): The model has predicted No, and the actual value was
also No.
True Positive (TP): The model has predicted Yes, and the actual value was
also Yes.
False Negative (FN): The model has predicted No, but the actual value was
Yes. It is also called a Type-II error.
False Positive (FP): The model has predicted Yes, but the actual value
was No. It is also called a Type-I error.
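
The four cells defined above can be counted directly from paired actual
and predicted labels. This is a minimal sketch; the function name
`confusion_counts` and the Yes/No label lists are hypothetical and only
serve to illustrate the definitions.

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count TP, FP, FN and TN for a binary classification problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual value is positive.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual value is positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical actual and predicted labels.
actual = ["Yes", "No", "Yes", "Yes", "No", "No"]
predicted = ["Yes", "Yes", "No", "Yes", "No", "No"]
cm = confusion_counts(actual, predicted)
print(cm)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```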

Precision

The precision metric addresses a limitation of accuracy. Precision
determines the proportion of positive predictions that were actually
correct. It is calculated as the ratio of True Positives to the total
number of positive predictions (True Positives plus False Positives).

Equivalently, precision is the ratio of correctly classified positive
samples (True Positive) to the total number of samples classified as
positive (either correctly or incorrectly).

Precision = TP / (TP + FP)
o TP - True Positive
o FP - False Positive
Since TP can never exceed TP + FP, precision always lies between 0 and 1:
o Precision is low when FP is large relative to TP, that is, when the
denominator TP + FP is much greater than the numerator TP.
o Precision is high when FP is small, so that TP is close to TP + FP; it
equals 1 when FP = 0.

Hence, precision helps us gauge how reliable the model is when it
classifies a sample as positive.
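
A short sketch of the precision calculation, using hypothetical counts.
Returning 0.0 when the model makes no positive predictions is a
convention assumed here, not something the notes specify.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    if tp + fp == 0:
        return 0.0  # assumed convention: no positive predictions were made
    return tp / (tp + fp)


# Hypothetical counts: few false positives give high precision...
high = precision(tp=8, fp=2)
# ...while many false positives give low precision.
low = precision(tp=2, fp=8)
print(high, low)  # 0.8 0.2
```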

Recall
Recall is calculated as the ratio of the number of Positive samples
correctly classified as Positive to the total number of Positive
samples. Recall measures the model's ability to detect positive samples.
The higher the recall, the more positive samples are detected.

Recall = True Positive / (True Positive + False Negative)
Recall = TP / (TP + FN)
o TP - True Positive
o FN - False Negative

Since TP can never exceed TP + FN, recall always lies between 0 and 1:
o Recall is low when FN is large relative to TP, that is, when the
denominator TP + FN is much greater than the numerator TP.
o Recall is high when FN is small, so that TP is close to TP + FN.

Unlike Precision, Recall is independent of how the negative samples are
classified. Further, if the model classifies all positive samples as
positive, then Recall will be 1.

Examples of calculating Recall

Example 1 - Consider four cases (A to D; figure not reproduced here) in
which each case has the same Recall of 0.667 but differs in how the
negative samples are classified. Case A and case B each have two negative
samples classified as negative; case C has only one negative sample
classified as negative, while case D does not classify any negative
sample as negative.

However, recall is independent of how the negative samples are
classified; hence, we can ignore the negative samples and consider only
the positive samples.

In each case, two positive samples are correctly classified as positive,
while one positive sample is incorrectly classified as negative.

Hence, the number of true positives is 2 and the number of false
negatives is 1. Then recall is:

Recall = TP / (TP + FN) = 2 / (2 + 1) = 2/3 = 0.667

Example 2 - Now we have another scenario where all positive samples are
classified correctly as positive. Hence, the number of true positives is
3 while the number of false negatives is 0.

Recall = TP / (TP + FN) = 3 / (3 + 0) = 3/3 = 1

If the recall is 100%, the model has detected all positive samples as
positive. Recall says nothing about how the negative samples are
classified: the model could still label many negative samples as
positive, so a high recall can coexist with a high False Positive rate.
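
The two recall examples can be reproduced with a few lines of Python.
The function name `recall` is an assumption for this sketch; note that
it takes no information about negative samples at all, which is exactly
why recall cannot reflect false positives.

```python
def recall(tp, fn):
    """Share of actual positive samples that the model detects."""
    return tp / (tp + fn)


# Example 1: 2 true positives, 1 false negative.
example_1 = recall(tp=2, fn=1)
print(round(example_1, 3))  # 0.667

# Example 2: all 3 positive samples detected.
example_2 = recall(tp=3, fn=0)
print(example_2)  # 1.0
```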
Sensitivity

Sensitivity is the proportion of actual positive records that the model
correctly predicts as positive. It evaluates a model's ability to detect
the true positives of each available category, and it is also known as
the true positive rate (TPR) or recall:

Sensitivity = TP / (TP + FN)

Sensitivity is used to evaluate model performance because it shows how
many of the positive instances the model was able to identify correctly.

Specificity

Specificity is the proportion of actual negative records that the model
correctly predicts as negative (true negatives). It evaluates a model's
ability to identify the true negatives of each available category:

Specificity = TN / (TN + FP)

For example, in disease screening, specificity is the proportion of
people not suffering from the disease who are correctly predicted as not
suffering from it. In other words, it measures how often a healthy person
is predicted as healthy.
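
To make the two definitions concrete, here is a small sketch for a
disease-screening scenario. The counts are invented for illustration:
50 people with the disease (40 detected) and 50 healthy people (45
correctly cleared).

```python
def sensitivity(tp, fn):
    """Proportion of actual positives predicted as positive (TPR, recall)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of actual negatives predicted as negative (TNR)."""
    return tn / (tn + fp)


# Hypothetical screening results.
tp, fn = 40, 10   # sick people: detected vs missed
tn, fp = 45, 5    # healthy people: cleared vs wrongly flagged
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}")  # 0.80
print(f"specificity = {spec:.2f}")  # 0.90
```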

AUC-ROC

Sometimes we need to visualize the performance of a classification model
on a chart; then, we can use the AUC-ROC curve. It is one of the popular
and important metrics for evaluating the performance of a classification
model.

False Negative Rate

The False Negative Rate (FNR) tells us what proportion of the positive
class was incorrectly classified by the classifier:
FNR = FN / (TP + FN) = 1 - TPR. A higher TPR and a lower FNR are
desirable, since we want to correctly classify the positive class.

True Negative Rate

The True Negative Rate (TNR), or specificity, tells us what proportion of
the negative class was correctly classified: TNR = TN / (TN + FP).

False Positive Rate

The False Positive Rate (FPR) tells us what proportion of the negative
class was incorrectly classified by the classifier:
FPR = FP / (TN + FP) = 1 - TNR. A higher TNR and a lower FPR are
desirable, since we want to correctly classify the negative class.

First, let's understand the ROC (Receiver Operating Characteristic)
curve. It is a graph that shows the performance of a binary
classification model at different threshold levels, plotting the true
positive rate (TPR) against the false positive rate (FPR) at each
classification threshold.

The curve is plotted between two parameters:

o True Positive Rate
o False Positive Rate

TPR, or True Positive Rate, is a synonym for Recall and can be
calculated as:

TPR = TP / (TP + FN)

FPR, or False Positive Rate, can be calculated as:

FPR = FP / (FP + TN)

To compute the points on a ROC curve, we could evaluate a model (for
example, logistic regression) many times with different classification
thresholds, but this would be inefficient. Fortunately, an efficient
sorting-based computation summarizes the whole curve in a single number,
known as AUC.
AUC: Area Under the ROC curve

AUC stands for Area Under the Curve: it is the two-dimensional area
under the entire ROC curve (figure not reproduced here), and it measures
the overall performance of a binary classification model. As both TPR
and FPR range from 0 to 1, the area always lies between 0 and 1, and a
greater value of AUC denotes better model performance. Our goal is to
maximize this area, so that the model achieves a high TPR together with a
low FPR across thresholds. AUC can also be read as the probability that
the model assigns a randomly chosen positive instance a higher predicted
score than a randomly chosen negative instance.

AUC aggregates performance across all thresholds. A model whose
predictions are 100% wrong has an AUC of 0.0, a model that ranks
instances at random scores about 0.5, and a model whose predictions are
100% correct has an AUC of 1.0.
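
The sketch below builds ROC points by sweeping the threshold over the
distinct scores, integrates the curve with the trapezoid rule, and
checks the result against the "randomly chosen positive ranks above
randomly chosen negative" reading of AUC. The labels and scores are
made up, and the function names are assumptions for this example; in
practice a library routine would normally be used.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_pairwise(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels (1 = positive) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]

points = roc_points(labels, scores)
auc_area = auc_trapezoid(points)
auc_rank = auc_pairwise(labels, scores)
print(points)
print(auc_area, auc_rank)  # both are 8/9, about 0.889
```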

TPR and FPR

This is the most common definition you will encounter when you search
for AUC-ROC. The ROC curve is a graph that shows the performance of a
classification model at all possible thresholds (a threshold is a value
beyond which a point is assigned to a particular class). The curve is
plotted between two parameters:
• TPR – True Positive Rate
• FPR – False Positive Rate

Bias Variance decomposition

The components of any prediction error are noise, bias, and variance.
The decomposition is used to measure the bias and variance of a given
model and to observe how they behave across models such as Linear
Regression, Decision Tree, Bagging, and Random Forest for a given sample
size.
• Bias: The difference between the prediction of the true model and the
prediction of the average model.
• True Model: The model built on the full population data.
• Average Model: The average of the predictions of all models built on
the n samples drawn from the population.
• Variance: The spread of the predictions of the individual models (each
built on one sample) around the prediction of the average model.
• Noise: The irreducible error that no model can predict.

Bias-variance decomposition is a fundamental concept in understanding
the performance of supervised learning algorithms, particularly in the
context of regression and classification problems. It decomposes the
expected prediction error of a model into three components: bias,
variance, and irreducible error. Here's a breakdown of each component:

1. Bias: Bias refers to the error introduced by approximating a
real-world problem with a simplified model. A high bias indicates that
the model is too simplistic and fails to capture the true underlying
patterns in the data. Models with high bias tend to underfit the
training data.

2. Variance: Variance measures the variability of the model's
predictions across different training sets. A high variance indicates
that the model is overly sensitive to fluctuations in the training data,
capturing noise along with the underlying patterns. Models with high
variance tend to overfit the training data.

3. Irreducible Error: Irreducible error represents the inherent noise or
randomness in the data that cannot be reduced by any model. It stems
from factors such as measurement errors, natural variability, or
incomplete feature representation. No matter how complex the model is,
it cannot eliminate irreducible error.

Mathematically, the expected prediction error (EPE) of a model can be
decomposed as follows:

EPE = Bias² + Variance + Irreducible Error

The bias-variance tradeoff arises from the fact that decreasing bias
typically increases variance, and vice versa. Finding the right balance
between bias and variance is essential for building models that
generalize well to unseen data.

Here's how the bias-variance tradeoff typically manifests:

High Bias, Low Variance: Models with high bias and low variance tend to
be overly simplistic and may underfit the training data. They fail to
capture the complexity of the underlying patterns, resulting in poor
performance on both the training and test datasets.

Low Bias, High Variance: Models with low bias and high variance are
more complex and have a greater capacity to capture intricate patterns
in the training data. However, they are more prone to overfitting and
may fail to generalize to new data, leading to high performance on the
training set but poor performance on the test set.

The goal of model selection and training is to strike an appropriate
balance between bias and variance, typically by tuning model complexity,
regularization parameters, or ensemble techniques. The aim is to
minimize the overall expected prediction error by effectively managing
both bias and variance.
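
The decomposition can be estimated by simulation: draw many training
sets, fit a model on each, and look at the predictions at one test
point. The sketch below is only an illustration under assumed settings
(a sine-shaped true function, Gaussian noise, 30 training points, 500
repetitions). It compares a deliberately simple predictor (the mean of
the training targets) with a flexible one (1-nearest neighbour). The
error is measured against the noise-free true value, so the irreducible
error term does not appear and the measured error equals Bias² +
Variance.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed true relationship between x and y."""
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Very simple model: ignore x0, predict the average target."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Flexible model: copy the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_variance(predictor, x0, n_repeats=500, seed=0):
    rng = random.Random(seed)
    preds = [predictor(*sample_training_set(rng), x0) for _ in range(n_repeats)]
    average_prediction = statistics.fmean(preds)
    bias_sq = (average_prediction - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    error = statistics.fmean((p - true_f(x0)) ** 2 for p in preds)
    return bias_sq, variance, error


x0 = 0.25  # test point where the assumed true function equals 1
b_mean, v_mean, e_mean = bias_variance(predict_mean, x0)
b_nn, v_nn, e_nn = bias_variance(predict_1nn, x0)
print("mean predictor: bias^2, variance, error =", b_mean, v_mean, e_mean)
print("1-NN predictor: bias^2, variance, error =", b_nn, v_nn, e_nn)
```

Under these assumptions the mean predictor shows the high-bias,
low-variance pattern, and the 1-nearest-neighbour predictor shows the
low-bias, high-variance pattern.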
Bias-Variance Trade-Off

While building a machine learning model, it is important to take care of
bias and variance in order to avoid overfitting and underfitting. If the
model is very simple, with few parameters, it tends to have low variance
and high bias. If the model has a large number of parameters, it tends
to have high variance and low bias. So a balance must be struck between
the bias error and the variance error, and this balance is known as the
Bias-Variance trade-off.

For accurate predictions, an algorithm needs both low variance and low
bias. In practice this is hard to achieve, because bias and variance are
related to each other:

o If we decrease the variance, the bias tends to increase.
o If we decrease the bias, the variance tends to increase.

The Bias-Variance trade-off is a central issue in supervised learning.
Ideally, we need a model that accurately captures the regularities in
the training data and simultaneously generalizes well to unseen data.
Unfortunately, it is typically not possible to do both perfectly. A
high-variance algorithm may perform well on the training data but
overfit its noise, whereas a high-bias algorithm produces a much simpler
model that may not even capture important regularities in the data. So
we need to find a sweet spot between bias and variance to make an
optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot that
balances bias and variance errors.

Decision Trees Learning: Basic algorithm (ID3)

A decision tree is a supervised machine learning algorithm used to build
classification and regression models in the form of a tree structure.
In a decision tree, each:
• Node - represents a feature (attribute)
• Branch - represents a decision (rule)
• Leaf - represents an outcome (categorical or continuous)
A decision tree is a structure that contains nodes (rectangular boxes)
and edges (arrows) and is built from a dataset (a table whose columns
represent features/attributes and whose rows correspond to records).
Each node is either used to make a decision (a decision node) or
represents an outcome (a leaf node).

ID3

The ID3 (Iterative Dichotomiser 3) algorithm is one of the earliest and
simplest algorithms used to construct decision trees. It was introduced
by Ross Quinlan in 1986. ID3 is a greedy algorithm that recursively
partitions the feature space based on the information gain criterion.

Here's a basic outline of the ID3 algorithm:

1. Input: Training dataset D with attributes (features) A and a target
variable Y.
2. Output: Decision tree T.
3. Algorithm:
a. Base case: If all instances in the current subset belong to the same
class y, create a leaf node with label y.
b. Base case: If the attribute set A is empty, create a leaf node with
the most common class label among the instances in the current subset.
c. Attribute selection: Choose the best attribute a from the attribute
set A using information gain, which is computed from entropy. The
attribute with the highest information gain is selected.
d. Split on attribute: Partition the dataset D into subsets based on
the values of the selected attribute a, and remove a from A.
e. Recursive call: For each subset created in step (d), repeat steps
(a) to (d) recursively until one of the base cases is met.
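
The outline above can be sketched in Python as follows. This is a
minimal illustration rather than a full implementation: it assumes
categorical attributes stored in dictionaries, represents the tree as
nested dictionaries, and uses a tiny invented dataset (the attribute
names `outlook` and `windy` and all rows are hypothetical).

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    # (a) Base case: every instance has the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # (b) Base case: no attributes left, return the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # (c) Pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # (d) Split on the chosen attribute, (e) recurse on each subset.
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return tree


# Hypothetical toy data: play outside unless it is both rainy and windy.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["yes", "yes", "yes", "no", "yes", "yes"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data the root splits on `outlook`, and only the `rainy`
branch needs a further split on `windy`.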

How ID3 Works

The ID3 algorithm is specifically designed for building decision trees
from a given dataset. Its primary objective is to construct a tree that
best explains the relationship between attributes in the data and their
corresponding class labels.

1. Selecting the Best Attribute
• ID3 employs the concepts of entropy and information gain to determine
the attribute that best separates the data. Entropy measures the
impurity or randomness in the dataset.
• The algorithm calculates the entropy that remains after splitting on
each attribute and selects the attribute that yields the largest
information gain.

2. Creating Tree Nodes
• The chosen attribute is used to split the dataset into subsets based
on its distinct values.
• For each subset, ID3 recurses to find the next best attribute to
further partition the data, forming branches and new nodes accordingly.

3. Stopping Criteria
• The recursion continues until one of the stopping criteria is met,
such as when all instances in a branch belong to the same class or when
all attributes have been used for splitting.

4. Handling Missing Values
• The basic ID3 algorithm has no built-in treatment of missing attribute
values. They are usually handled before training, with strategies such
as substituting the attribute's mean or mode, or the most common value
among instances of the same class.

5. Tree Pruning
• Pruning is a technique to prevent overfitting. It is not part of ID3
itself, but post-processing techniques and successors such as C4.5
incorporate pruning to improve the tree's generalization.
Mathematical Concepts of ID3 Algorithm

Now let's examine the formulas linked to the main theoretical ideas in
the ID3 algorithm:

1. Entropy
Entropy is a measure of disorder or uncertainty in a set of data. ID3
uses it to measure a dataset's impurity. The objective is to minimize
entropy by dividing the data into subsets that are as homogeneous as
possible.
For a set S with classes {c1, c2, …, cn}, the entropy is calculated as:

Entropy(S) = - Σ (i = 1 to n) pi log2(pi)

Where pi is the proportion of instances of class ci in the set.

2. Information Gain
Information Gain is a measure of how well a given attribute reduces
uncertainty. At each stage, ID3 splits the data on the attribute that
maximizes Information Gain, computed as the difference between the
entropy before and after the split.
Information Gain measures the effectiveness of an attribute A in
reducing uncertainty in set S:

Gain(S, A) = Entropy(S) - Σ (v in Values(A)) (|Sv| / |S|) Entropy(Sv)

Where |Sv| is the size of the subset of S for which attribute A has
value v.

3. Gain Ratio
Gain Ratio is an improvement on Information Gain that takes into account
the intrinsic information of an attribute that has many possible values.
It corrects the bias of Information Gain in favor of attributes with a
large number of distinct values.
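
As a worked example of these three measures, the sketch below computes
them from class counts. The counts are those of the widely used 14-row
"play tennis" toy dataset (9 yes / 5 no, split by outlook into sunny,
overcast and rain); the dataset is not part of these notes and is used
here only for illustration. The gain ratio follows the usual C4.5
definition, information gain divided by the split information (the
entropy of the subset sizes), which the notes describe only in words.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child)
                    for child in child_counts)
    return entropy(parent_counts) - remainder


def gain_ratio(parent_counts, child_counts):
    """Information gain normalized by split information (C4.5 definition)."""
    sizes = [sum(child) for child in child_counts]
    split_info = entropy(sizes)  # zero if the split has a single branch
    return information_gain(parent_counts, child_counts) / split_info


parent = [9, 5]                        # [yes, no] over all 14 rows
by_outlook = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rain

h = entropy(parent)
gain = information_gain(parent, by_outlook)
ratio = gain_ratio(parent, by_outlook)
print(round(h, 3), round(gain, 3), round(ratio, 3))  # 0.94 0.247 0.156
```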
ISSUES IN LEARNING DECISION TREES INCLUDE

• Determining how deeply to grow the decision tree
• Handling continuous attributes
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Improving computational efficiency

Overfitting in Machine Learning

In the real world, a dataset is never perfectly clean. It typically
contains impurities such as noisy data, outliers, missing data, or
imbalanced data. These impurities cause different problems that affect
the accuracy and the performance of the model. One such problem is
overfitting.

Noise: Noise is meaningless or irrelevant data present in the dataset.
It affects the performance of the model if it is not removed.

Bias: Bias is a prediction error that is introduced in the model by
oversimplifying the machine learning algorithm. It shows up as a
systematic difference between the predicted values and the actual
values.

Variance: Variance is the sensitivity of the model to the particular
training set. High variance shows up when the model performs well on the
training dataset but does not perform well on the test dataset.

Generalization: It shows how well a trained model predicts unseen data.

o Overfitting occurs when the model fits the training data more closely
than required and tries to capture each and every data point fed to it.
Hence it starts capturing noise and inaccurate data from the dataset,
which degrades the performance of the model.
o An overfitted model doesn't perform accurately on the test/unseen
dataset and can't generalize well.
o An overfitted model is said to have low bias and high variance.

In other words, overfitting occurs when our machine learning model tries
to cover all the data points, or more than the required data points,
present in the given dataset. Because of this, the model starts
capturing noise and inaccurate values present in the dataset, and all
these factors reduce the efficiency and accuracy of the model.

Example: Suppose there are three students, X, Y, and Z, and all three
are preparing for an exam. X has studied only three sections of the book
and skipped all other sections. Y has a good memory and has memorized
the whole book. The third student, Z, has studied and practiced all the
questions. In the exam, X will only be able to solve questions related
to those three sections. Student Y will only be able to solve questions
that appear exactly as given in the book. Student Z will be able to
solve all the exam questions properly.

The same happens in machine learning: if the algorithm learns from only
a small part of the data, it is unable to capture the required patterns
and hence underfits.

Suppose instead the model memorizes the training dataset, like student
Y. It performs very well on the seen dataset but badly on unseen data or
unknown instances. In such cases, the model is said to be overfitting.

Solutions to overfitting/ Ways to prevent the Overfitting

Although overfitting is an error that reduces the performance of the
model, we can prevent it in several ways. Using a simpler model, such as
a linear model, makes overfitting less likely; however, many real-world
problems are non-linear. Below are several ways that can be used to
prevent overfitting:

1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization

Early Stopping

In this technique, training is stopped before the model starts learning
the noise in the training data. While training the model iteratively, we
measure its performance on a validation set after each iteration and
continue only as long as new iterations improve that performance.

After that point, the model begins to overfit the training data; hence
we need to stop the process before the learner passes that point.
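
A minimal sketch of this stopping rule is shown below. It works on a
list of validation losses recorded after each iteration; the `patience`
parameter (how many non-improving iterations to tolerate before
stopping) is a common practical refinement assumed here, and the loss
values are invented.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss), stopping after `patience`
    consecutive iterations without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation performance improved: remember this iteration.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the model has started to overfit; stop training
    return best_epoch, best_loss


# Hypothetical validation losses: improving at first, then worsening.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
result = early_stopping_epoch(val_losses, patience=2)
print(result)  # (3, 0.5)
```

In a real training loop the model weights from the best iteration would
be saved and restored when training stops.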

Train with More data

Increasing the training set by including more data can enhance the
accuracy of the model, as it provides more chances to discover the
relationship between input and output variables.

It may not always prevent overfitting, but it helps the algorithm detect
the signal better and so minimize the errors.

When a model is fed more training data, it becomes harder for it to
memorize all the samples, so it is pushed to generalize.

In some cases, however, the additional data may add more noise; hence we
need to be sure that the data is clean and free from inconsistencies
before feeding it to the model.

Feature Selection

While building an ML model, we have a number of parameters or features
that are used to predict the outcome. However, some of these features
may be redundant or less important for the prediction, and this is where
the feature selection process is applied. In feature selection, we
identify the most important features within the training data and remove
the others. This simplifies the model and reduces noise in the data.
Some algorithms perform feature selection automatically; if not, we can
perform the process manually.
Cross-Validation

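Cross-validation is one of the powerful techniques to detect and prevent
overfitting.

In the general k-fold cross-validation technique, we divide the dataset
into k equal-sized subsets of data; these subsets are known as folds.
The model is then trained k times, each time holding out a different
fold for validation and training on the remaining k - 1 folds, and the k
validation scores are averaged.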
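
The fold construction can be sketched as follows. The function name
`k_fold_splits` is hypothetical, and the sketch assumes the standard
procedure in which each fold serves once as the held-out set while the
remaining folds are used for training; it does not shuffle the data,
which a real workflow usually would.

```python
def k_fold_splits(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits


splits = k_fold_splits(n_samples=10, k=5)
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```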

Data Augmentation

Data Augmentation is a data preparation technique that serves as an
alternative to collecting more data to prevent overfitting. In this
technique, instead of adding new training data, slightly modified copies
of already existing data are added to the dataset.

Data augmentation makes each data sample appear slightly different every
time it is processed by the model. Hence each sample appears unique to
the model, which makes it harder to memorize the data and helps prevent
overfitting.

Regularization

If overfitting occurs because a model is complex, we can reduce the
number of features. However, overfitting may also occur with a simpler
model, such as a linear model, and in such cases regularization
techniques are very helpful.

Regularization is one of the most popular techniques to prevent
overfitting. It is a group of methods that force the learning algorithm
to produce a simpler model. Applying regularization typically increases
the bias slightly but reduces the variance. In this technique, we modify
the objective function by adding a penalty term, which takes a higher
value for a more complex model.

The two commonly used regularization techniques are L1 Regularization
and L2 Regularization.
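
To illustrate the penalty term, the sketch below adds either an L1
penalty (sum of absolute weights) or an L2 penalty (sum of squared
weights) to the mean squared error of a linear model without an
intercept. The function name, the data, the weights, and the
regularization strength `lam` are all assumptions chosen for this
example.

```python
def penalized_loss(weights, xs, ys, lam, kind="l2"):
    """Mean squared error plus an L1 or L2 penalty on the weights."""
    preds = [sum(w * x for w, x in zip(weights, row)) for row in xs]
    mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)
    if kind == "l1":
        penalty = sum(abs(w) for w in weights)   # L1: sum of |w|
    else:
        penalty = sum(w * w for w in weights)    # L2: sum of w^2
    return mse + lam * penalty


# Hypothetical data that the weights [1, 2] fit exactly (MSE = 0),
# so the loss printed below is the penalty term alone.
xs = [[1, 0], [0, 1], [1, 1]]
ys = [1, 2, 3]
weights = [1.0, 2.0]

l1_loss = penalized_loss(weights, xs, ys, lam=0.1, kind="l1")
l2_loss = penalized_loss(weights, xs, ys, lam=0.1, kind="l2")
print(l1_loss, l2_loss)  # about 0.3 and 0.5
```

Because the penalty grows with the size of the weights, minimizing the
penalized loss favors smaller weights and therefore a simpler model.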
Underfitting

Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data. For example, if training is
stopped too early in an attempt to avoid overfitting, the model may not
learn enough from the training data. As a result, it may fail to find
the best fit of the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from
the training data, and hence its accuracy drops and its predictions
become unreliable.

An underfitted model has high bias and low variance.

Example: the output of an underfitted linear regression model (figure
not reproduced here). The fitted model is unable to capture the data
points present in the plot.

How to avoid underfitting:

o By increasing the training time of the model.
o By increasing the number of features.
