Unit 1
With the help of sample historical data, known as training data,
machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together to
create predictive models. Machine learning constructs or uses
algorithms that learn from historical data: the more information we
provide, the better the performance.
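As a minimal illustration of this idea, the sketch below (using scikit-learn) fits a model to a tiny, made-up set of historical examples and then predicts on a new input; the feature values and labels are assumptions chosen only for demonstration.

# Minimal sketch: learning from historical (training) data, then predicting.
# The tiny dataset below is hypothetical and exists only for illustration.
from sklearn.linear_model import LogisticRegression

# Historical examples: [hours_studied, classes_attended] -> passed (1) or failed (0)
X_train = [[2, 3], [4, 6], [6, 8], [1, 2], [8, 9], [3, 2]]
y_train = [0, 1, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)      # the model "learns" from the training data

print(model.predict([[5, 7]]))   # prediction for an unseen example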
OR
Machine learning Life cycle
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The
goal of this step is to identify the data sources and obtain the data
related to the problem.
In this step, we need to identify the different data sources, as data can
be collected from various sources such as files, databases, the internet,
or mobile devices. It is one of the most important steps of the life cycle.
The quantity and quality of the collected data will determine the
efficiency of the output: the more data we have, the more accurate the
prediction will be.
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data
preparation is the step where we put our data into a suitable place and
prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its
ordering.
o Data exploration:
Data exploration is used to understand the nature of the data we have
to work with. We need to understand the characteristics, format, and
quality of the data.
A better understanding of the data leads to an effective outcome. Here
we look for correlations, general trends, and outliers.
o Data pre-processing:
The next step is pre-processing the data to make it ready for analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a
usable format. It involves cleaning the data, selecting the variables to
use, and transforming the data into a proper format to make it more
suitable for analysis in the next step. It is one of the most important
steps of the complete process. Cleaning the data is required to address
quality issues such as:
o Missing Values
o Duplicate data
o Invalid data
o Noise
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step.
The aim of this step is to build a machine learning model that analyses
the data using various analytical techniques and to review the outcome.
It starts with determining the type of problem, where we select a
machine learning technique such as Classification, Regression, Cluster
analysis, Association, etc., then build the model using the prepared
data, and evaluate the model. Hence, in this step, we take the data and
use machine learning algorithms to build the model.
5. Train Model
The next step is to train the model. In this step, we train our model to
improve its performance and obtain a better outcome for the problem.
6. Test Model
Once our machine learning model has been trained on a given dataset,
then we test the model. In this step, we check for the accuracy of our
model by providing a test dataset to it.
7. Deployment
The last step of the machine learning life cycle is deployment, where we
deploy the model in a real-world system.
Perspectives in ML
Issues in ML
1. Not enough training data: it takes a lot of data for most algorithms to
function properly. Even for a simple task, an algorithm may need
thousands of examples, and for advanced tasks like image or speech
recognition it may need millions of examples.
2. Poor quality of data: if your training data has lots of errors, outliers,
and noise, it will be impossible for your machine learning model to
detect the proper underlying pattern, and it will not perform well. So
put effort into cleaning up your training data. No matter how good you
are at selecting and hyperparameter-tuning the model, this part plays a
major role in building an accurate machine learning model.
1. Data Collection
• The quantity and quality of your data dictate how accurate the model
will be
• The outcome of this step is generally a representation of the data
which we will use for training
• Pre-collected data, in the form of existing datasets, can also be used
2. Data Preparation
• Wrangle data and prepare it for training
• Clean that which may require it (remove duplicates, correct errors,
deal with missing values, normalization, data type conversions, etc.)
• Randomize data, which erases the effects of the particular order in
which we collected and/or otherwise prepared our data
• Visualize data to help detect relevant relationships between variables
or class imbalances (bias alert!), or perform other exploratory analysis
• Split into training and evaluation sets (a short sketch of these
preparation steps follows this list)
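A minimal sketch of these preparation steps, assuming a hypothetical CSV file and a label column named target (both names are chosen only for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the collected data; "data.csv" and the "target" column are assumed names.
df = pd.read_csv("data.csv")

df = df.drop_duplicates()      # remove duplicate rows
df = df.dropna()               # simplest way to deal with missing values

X = df.drop(columns=["target"])    # input features
y = df["target"]                   # labels

# Shuffle and split into training and evaluation sets (shuffling is the default).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)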
3. Choose a Model
• Different algorithms are for different tasks; choose the right one
6. Parameter Tuning
• Tuning is the process of maximizing a model’s performance without
overfitting or creating too high of a variance. In machine learning, this is
accomplished by selecting appropriate “hyperparameters.”
• Choosing an appropriate set of hyperparameters is crucial for model
accuracy, but can be computationally challenging. Hyperparameters
differ from other model parameters in that they are not learned by the
model automatically through training methods. Instead, these
parameters must be set manually.
• Many methods exist for selecting appropriate hyperparameters, for
example a grid search over candidate values (a brief sketch follows
below)
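As an illustration, the sketch below uses scikit-learn's GridSearchCV to try a few candidate hyperparameter values with cross-validation; the choice of model and the parameter grid are assumptions made only for the example, and X_train / y_train are taken from the earlier data-preparation sketch.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate hyperparameter values (chosen arbitrarily for illustration).
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 5, 10]}

# 5-fold cross-validated grid search over all combinations in param_grid.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # hyperparameters with the best cross-validation score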
7. Make Predictions
• A further test data set which has, until this point, been withheld from
the model (and for which the class labels are known) is used to test the
model; this gives a better approximation of how the model will perform
in the real world
OR
The main goal of the supervised learning technique is to map the input
variable (x) to the output variable (y). Some real-world applications of
supervised learning are risk assessment, fraud detection, spam filtering,
image segmentation, medical diagnosis, and speech recognition.
Supervised learning is divided into:
a) Classification
b) Regression
In unsupervised learning, the models are trained with data that is
neither classified nor labelled, and the model acts on that data without
any supervision.
The machine discovers patterns and differences on its own, such as
differences in colour and shape, and predicts the output when it is
tested with the test dataset.
Unsupervised learning is divided into the following (a short clustering
sketch follows this list):
1) Clustering
2) Association
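As a small illustration of the clustering idea, the sketch below groups unlabelled 2-D points with scikit-learn's KMeans; the points and the choice of two clusters are assumptions made for the example.

from sklearn.cluster import KMeans

# Unlabelled 2-D points (made up for illustration): two loose groups.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignments discovered without any labels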
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that
lies between Supervised and Unsupervised machine learning. It
represents the intermediate ground between Supervised (With Labelled
training data) and Unsupervised learning (with no labelled training data)
algorithms and uses a combination of labelled and unlabelled datasets
during the training period.
4. Reinforcement Learning
Concepts of hypotheses
There are some common methods for finding a possible hypothesis in
the hypothesis space, where the hypothesis space is represented by
uppercase H and a hypothesis by lowercase h. These are defined as
follows:
Hypothesis (h):
y = mx + b
Where,
y: range (output)
m: slope of the line that divides the data, i.e. the change in y divided by
the change in x
x: domain (input)
b: intercept (constant)
https://www.geeksforgeeks.org/ml-understanding-hypothesis/
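A minimal sketch of such a linear hypothesis in Python, with slope and intercept values chosen purely for illustration:

# A hypothesis h maps an input x (domain) to an output y (range).
def h(x, m=2.0, b=1.0):
    """Linear hypothesis y = m*x + b; m and b are illustrative values."""
    return m * x + b

print(h(3.0))   # 2.0 * 3.0 + 1.0 = 7.0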
VERSION SPACE
Diagrammatical Guidelines
There is a generalization tree and a specialization tree.
Each node is connected to a model.
Nodes in the generalization tree are connected to a model that
matches everything in its subtree.
Nodes in the specialization tree are connected to a model that
matches only one thing in its subtree.
INDUCTIVE BIAS
Hence, the inductive bias does not impose a limitation on the learning
method.
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
Confusion Matrix
Precision
Precision is calculated as the ratio between the number of positive
samples correctly classified as positive (TP) and the total number of
samples classified as positive (TP + FP):
Precision = TP / (TP + FP)
The precision of the machine learning model is high when the number
of false positives is small, i.e. when the numerator TP is close to the
denominator TP + FP.
Recall
Recall is calculated as the ratio between the number of positive samples
correctly classified as positive (TP) and the total number of positive
samples (TP + FN):
Recall = TP / (TP + FN)
Recall measures the model's ability to detect positive samples: the
higher the recall, the more positive samples are detected.
o The recall of a machine learning model will be low when the number
of false negatives (FN) is large, i.e. the denominator TP + FN is much
larger than the numerator TP.
o The recall will be high when FN is small, i.e. the numerator TP is close
to the denominator TP + FN.
Example-1
In this example (image not shown), only two positive samples are
correctly classified as positive, while one positive sample is
misclassified as negative.
Hence, the number of true positives (TP) is 2 and the number of false
negatives (FN) is 1. The recall is then:
Recall = TP / (TP + FN)
= 2 / (2 + 1)
= 2/3 ≈ 0.667
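The same arithmetic in a tiny Python sketch, using the counts from Example-1 above:

# Counts taken from Example-1.
TP = 2   # positive samples correctly classified as positive
FN = 1   # positive samples incorrectly classified as negative

recall = TP / (TP + FN)
print(round(recall, 3))   # 0.667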
Example-2
Now consider another scenario where all positive samples are classified
correctly as positive. Here the number of true positives is 3 and the
number of false negatives is 0, so the recall is 3 / (3 + 0) = 1, i.e. 100%.
A recall of 100% tells us the model has detected all positive samples as
positive, but it says nothing about how the negative samples are
classified. The model could still misclassify many negative samples as
positive; recall simply ignores them, so a model with high recall can
still have a high false positive rate.
Sensitivity
Specificity
AUC-ROC
False Negative Rate (FNR) tells us what proportion of the positive class
got incorrectly classified by the classifier.
A higher TPR and a lower FNR are desirable, since we want to correctly
classify the positive class.
True Negative Rate
ROC stands for Receiver Operating Characteristic, and the ROC curve is
the graphical representation of the effectiveness of a binary
classification model. It plots the true positive rate (TPR) vs the false
positive rate (FPR) at different classification thresholds.
TPR, or true positive rate, is a synonym for recall and can therefore be
calculated as:
TPR = TP / (TP + FN)
AUC stands for Area Under the Curve, and here it refers to the area
under the ROC curve. It measures the overall performance of a binary
classification model. As both TPR and FPR range between 0 and 1, the
area always lies between 0 and 1, and a greater value of AUC denotes
better model performance. Our main goal is to maximize this area in
order to have the highest TPR and lowest FPR at the given threshold.
The AUC measures the probability that the model will assign a
randomly chosen positive instance a higher predicted probability than
a randomly chosen negative instance.
AUC stands for Area Under the ROC Curve. As its name suggests, AUC
is the two-dimensional area under the entire ROC curve (image not
shown).
AUC calculates the performance across all thresholds and provides an
aggregate measure. The value of AUC ranges from 0 to 1: a model
whose predictions are 100% wrong has an AUC of 0.0, whereas a model
whose predictions are 100% correct has an AUC of 1.0.
This is the most common definition you will encounter when you
google AUC-ROC. Basically, the ROC curve is a graph that shows the
performance of a classification model at all possible thresholds (a
threshold is a particular value beyond which you say a point belongs to
a particular class). The curve is plotted between two parameters (a
short ROC/AUC sketch follows the list):
TPR – True Positive Rate
FPR – False Positive Rate
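A minimal sketch of plotting a ROC curve and computing AUC with scikit-learn; clf is assumed to be an already-fitted binary classifier, and X_test / y_test come from the earlier train/test split (all names are illustrative).

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predicted probabilities for the positive class (clf is a fitted classifier).
y_scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_scores)   # TPR vs FPR at each threshold
auc = roc_auc_score(y_test, y_scores)                # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")             # chance line (AUC = 0.5)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()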
The components of any predictive error are noise, bias, and variance.
The aim here is to measure the bias and variance of a given model and
to observe how bias and variance behave across various models, such
as Linear Regression, Decision Tree, Bagging, and Random Forest, for a
given number of sample sizes.
Bias: the difference between the prediction of the true model and the
average of the models (models built on n samples obtained from the
population).
True model: a model built on the population data.
Average model: the average of all the prediction results obtained from
the various samples drawn from the population.
The expected prediction error decomposes as:
EPE = Bias² + Variance + Irreducible Error
The bias-variance tradeoff arises from the fact that decreasing bias
typically increases variance, and vice versa. Finding the right balance
between bias and variance is essential for building models that
generalize well to unseen data.
High Bias, Low Variance: Models with high bias and low variance tend to
be overly simplistic and may underfit the training data. They fail to
capture the complexity of the underlying patterns, resulting in poor
performance on both the training and test datasets.
Low Bias, High Variance: Models with low bias and high variance are
more complex and have a greater capacity to capture intricate patterns
in the training data. However, they are more prone to overfitting and
may fail to generalize to new data, leading to high performance on the
training set but poor performance on the test set.
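A small sketch illustrating this tradeoff with polynomial fits of different degrees on noisy synthetic data (the data, the degrees, and the sample sizes are assumptions made for the demonstration). A degree-1 fit typically underfits (high bias: large error on both sets), while a degree-9 fit on so few points typically overfits (high variance: tiny training error, larger validation error).

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of y = sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_val, y_val = make_data(100)       # held-out validation set

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)           # fit polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")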
ID3
3. Stopping Criteria
The recursion continues until one of the stopping criteria is met,
such as when all instances in a branch belong to the same class or
when all attributes have been used for splitting.
5. Tree Pruning
Pruning is a technique to prevent overfitting. While not directly
included in ID3, post-processing techniques or variations like C4.5
incorporate pruning to improve the tree’s generalization.
Mathematical Concepts of ID3 Algorithm
Now let’s examine the formulas linked to the main theoretical ideas in
the ID3 algorithm:
1. Entropy
Entropy is a measure of the disorder or uncertainty in a set of data. In
ID3, entropy is used to measure a dataset's disorder or impurity. The
objective is to minimize entropy by dividing the data into subsets that
are as homogeneous as feasible.
For a set S with classes {c1, c2, …, cn}, the entropy is calculated as:
Entropy(S) = - Σ p(ci) · log2 p(ci)   (summing over i = 1 … n)
where p(ci) is the proportion of examples in S that belong to class ci.
2. Information Gain
Information Gain measures how well a given attribute reduces
uncertainty. At each step, ID3 splits the data on the attribute that
maximizes Information Gain. It is computed as the difference between
the entropy before the split and the weighted entropy after the split.
Information Gain measures the effectiveness of an attribute A in
reducing uncertainty in set S:
Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) · Entropy(Sv)   (summing over the values v of A)
where |Sv| is the size of the subset of S for which attribute A has
value v.
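A short sketch computing entropy and information gain for a hypothetical list of labels and an attribute column (the toy data is made up for illustration):

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy(S) minus the weighted entropy of the subsets induced by the attribute."""
    total = len(labels)
    subsets = {}
    for v, lab in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(lab)
    remainder = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy data: does the "outlook" attribute help predict "play"?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]

print(round(entropy(play), 3))                     # 1.0 (evenly split labels)
print(round(information_gain(outlook, play), 3))   # 0.667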
3. Gain Ratio
Gain Ratio is an improvement on Information Gain that takes into
account the intrinsic information of attributes with a wide range of
possible values. It corrects the bias of Information Gain in favour of
attributes with many distinct values. It is computed as
Gain Ratio(S, A) = Gain(S, A) / SplitInformation(S, A),
where SplitInformation(S, A) = - Σ (|Sv| / |S|) · log2(|Sv| / |S|).
ISSUES IN LEARNING DECISION TREES INCLUDE
In the real world, a dataset will never be clean and perfect; every
dataset contains impurities such as noisy data, outliers, missing data,
or imbalanced data. Because of these impurities, different problems
occur that affect the accuracy and performance of the model. One such
problem is overfitting, a problem that a model can exhibit.
Variance: If the machine learning model performs well with the training
dataset, but does not perform well with the test dataset, then variance
occurs.
o Overfitting occurs when the model fits more data than required,
and it tries to capture each and every datapoint fed to it. Hence it
starts capturing noise and inaccurate data from the dataset, which
degrades the performance of the model.
o An overfitted model doesn't perform accurately with the
test/unseen dataset and can’t generalize well.
o An overfitted model is said to have low bias and high variance.
Overfitting occurs when our machine learning model tries to cover all
the data points, or more than the required data points, present in the
given dataset. Because of this, the model starts capturing the noise and
inaccurate values present in the dataset, and all these factors reduce its
efficiency and accuracy.
ex:- Suppose there are three students, X, Y, and Z, all preparing for an
exam. X has studied only three sections of the book and left out all the
others. Y has a good memory and has memorized the whole book. The
third student, Z, has studied and practised all the questions. In the
exam, X will only be able to solve questions related to the sections he
has studied, Y will only be able to solve questions that appear exactly
as given in the book, and Z will be able to solve all the exam questions
properly.
The same happens with machine learning: if the algorithm learns from
only a small part of the data, it is unable to capture the required data
points and is therefore underfitted (like student X).
If the model memorizes the training dataset, like student Y, it performs
very well on the seen data but badly on unseen data or unknown
instances. In such cases, the model is said to be overfitting.
1. Early Stopping
2. Train with more data
3. Feature Selection
4. Cross-Validation
5. Data Augmentation
6. Regularization
Early Stopping
In this technique, training is paused before the model starts learning
the noise in the data. While training the model iteratively, we measure
its performance after each iteration (typically on a validation set) and
continue training only as long as a new iteration improves the
performance of the model.
Beyond that point, the model begins to overfit the training data, so we
need to stop the process before the learner passes that point. (A brief
sketch follows below.)
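A minimal sketch of an early-stopping loop; train_one_epoch and evaluate are hypothetical placeholders for whatever training framework is being used, and the patience value is an illustrative choice.

# Minimal early-stopping loop. train_one_epoch() and evaluate() are hypothetical
# placeholders standing in for the actual training and validation routines.
best_score = float("-inf")
patience, bad_epochs = 3, 0          # stop after 3 epochs without improvement

for epoch in range(100):
    train_one_epoch(model, X_train, y_train)   # one pass over the training data
    score = evaluate(model, X_val, y_val)      # performance on held-out validation data

    if score > best_score:
        best_score, bad_epochs = score, 0      # improvement: keep training
    else:
        bad_epochs += 1                        # no improvement this epoch
        if bad_epochs >= patience:
            break                              # stop before the model overfits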
Train with More Data
Increasing the training set by including more data can enhance the
accuracy of the model, as it gives the model more chances to discover
the relationship between the input and output variables.
This may not always prevent overfitting, but it helps the algorithm
detect the signal better and minimize errors.
When a model is fed more training data, it will be unable to overfit all
the samples and is forced to generalize well.
However, in some cases the additional data may add more noise to the
model, so we need to make sure the data is clean and free from
inconsistencies before feeding it to the model.
Feature Selection
Data Augmentation
Regularization
In the case of underfitting, the model is not able to learn enough from
the training data, which reduces the accuracy and produces unreliable
predictions.
In the corresponding plot (not shown), such a model is unable to
capture the data points present in the data.