Machine Learning
Classification
Classification algorithms are used to solve classification problems in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. Classification
algorithms predict which of the categories present in the dataset a new observation belongs to.
Some real-world examples of classification are spam detection and email filtering.
Regression
Regression algorithms are used to solve regression problems, in which the output variable is
continuous (a real value) and depends on the input variables. They are used to predict continuous
output values, such as market trends, temperatures for weather prediction, etc.
2. Unsupervised Learning
• How it works: The model is trained on data that isn’t labeled, so it tries to find hidden
patterns or groupings in the data.
• Example: Grouping customers based on their shopping behavior without knowing
their categories beforehand.
• Common Algorithms: Clustering (like K-means), principal component analysis (PCA),
and association rules.
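A minimal clustering sketch using scikit-learn's KMeans; the two-dimensional points below are invented purely for illustration:
import numpy as np
from sklearn.cluster import KMeans

# Toy, unlabeled 2-D points: two loose groups around x = 1 and x = 10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index assigned to each point
print(labels)                       # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)      # coordinates of the two cluster centres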
3. Semi-Supervised Learning:
• Semi-supervised learning is a mix of supervised and unsupervised learning. The
model is trained on a small amount of labeled data and a large amount of unlabeled
data. This approach is useful when labeling data is expensive or time-consuming.
• Example: In image recognition, only a few images are labeled (e.g., “cat” or “dog”),
while most of the images are unlabeled. The model uses the labeled images to guide
its learning and improves its ability to label the unlabeled images.
Common Algorithms:
• Self-training
• Co-training
• Generative models (e.g., Variational Autoencoders)
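A small self-training sketch with scikit-learn's SelfTrainingClassifier; in this API, unlabeled samples are marked with the label -1, and the data below is synthetic:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# A few labeled points (0 or 1) plus some unlabeled points marked with -1
X = np.array([[0.0], [0.3], [0.6], [4.0], [4.3], [4.6], [1.0], [3.5], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1])

base = LogisticRegression()              # base learner must provide predict_proba
model = SelfTrainingClassifier(base)     # iteratively pseudo-labels confident points
model.fit(X, y)
print(model.predict([[0.2], [4.8]]))     # expected to recover classes 0 and 1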
4. Reinforcement Learning:
• In reinforcement learning, the model learns by interacting with its environment and
receiving feedback in the form of rewards or penalties. It tries to take actions that
maximize the cumulative reward over time.
• Example: A robot learning to walk by trying different movements and getting
rewarded for successful steps.
Common Algorithms:
• Q-learning
• Deep Q Networks (DQN)
• Policy Gradients
• Actor-Critic Methods
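A bare-bones tabular Q-learning sketch on a made-up one-dimensional "walk to the goal" environment (states 0-4, goal at state 4); the reward scheme and hyperparameters are illustrative assumptions:
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # Q-table initialised to zero
alpha, gamma, epsilon = 0.1, 0.9, 0.2 # learning rate, discount factor, exploration rate

for episode in range(500):
    s = 0                             # always start at the left end
    while s != n_states - 1:          # act until the goal state is reached
        # epsilon-greedy action selection
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0    # reward only at the goal
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)   # after training, "move right" should have the higher value in every state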
Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and most crucial step when creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted
data, and before doing any operation with data, it is mandatory to clean it and put it into a
usable format. This is what the data preprocessing step is for.
2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries, each of which is used to perform a specific job. Three libraries commonly used
for data preprocessing are NumPy (numerical operations), Matplotlib (plotting), and Pandas
(importing and managing datasets).
3) Importing the Datasets
Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as the working
directory.
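A minimal sketch of these two steps with the libraries above, assuming a hypothetical file named 'Data.csv' whose last column is the dependent variable:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# os.chdir('/path/to/project')        # optional: set the working directory (hypothetical path)
dataset = pd.read_csv('Data.csv')     # 'Data.csv' is a placeholder file name
X = dataset.iloc[:, :-1].values       # independent variables (all columns except the last)
y = dataset.iloc[:, -1].values        # dependent variable (last column)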
4) Handling Missing Data
The next step of data preprocessing is to handle missing data in the dataset. If our dataset
contains missing data, it may create a huge problem for our machine learning model, so it is
necessary to handle the missing values present in the dataset.
By deleting the particular row: The first way is commonly used to deal with null values: we
simply delete the specific row or column that contains null values. However, this approach is
not very efficient, and removing data may lead to a loss of information, which will not give an
accurate output.
By calculating the mean: In this approach, we calculate the mean of the column (or row) that
contains the missing value and put that mean in the place of the missing value.
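A small sketch of the mean-imputation approach using scikit-learn's SimpleImputer; the toy array below stands in for an age/salary dataset:
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 40000.0],
              [30.0, np.nan],         # missing salary
              [np.nan, 52000.0],      # missing age
              [40.0, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imputer.fit_transform(X)  # each NaN is replaced by its column's mean
print(X_imputed)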
5) Encoding Categorical Data
Categorical data is data that has a limited set of categories, such as colors or education levels.
Since a machine learning model works entirely on mathematics and numbers, a categorical
variable in the dataset may create trouble while building the model, so it is necessary to
encode these categorical variables into numbers.
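A short encoding sketch, assuming a toy 'Color' column: LabelEncoder turns categories into integers, while one-hot encoding creates one 0/1 column per category and avoids implying any order:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Red', 'Green']})   # toy categorical column

# Integer encoding (suitable for target labels or ordinal categories)
df['Color_label'] = LabelEncoder().fit_transform(df['Color'])

# One-hot encoding (one binary column per category)
df_onehot = pd.get_dummies(df, columns=['Color'])
print(df_onehot)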
6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.
Training Set: A subset of the dataset used to train the machine learning model; here the output
is already known.
Test set: A subset of the dataset used to test the machine learning model; the model predicts
the output for the test set.
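A typical split with scikit-learn's train_test_split, using toy arrays and an 80/20 split (the proportions are a common but arbitrary choice):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)      # toy feature matrix: 10 samples, 2 features
y = np.arange(10)                     # toy target values

# 80% of the rows for training, 20% held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)    # (8, 2) (2, 2)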
7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset to a specific range. In feature scaling, we
put our variables on the same scale and within the same range so that no single variable
dominates the others.
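A standardization sketch with scikit-learn's StandardScaler; note that the scaler is fit on the training data only and then reused on the test data (the arrays are toy values):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[25.0, 40000.0], [30.0, 52000.0], [40.0, 61000.0]])
X_test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data and scale it
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std on the test data
print(X_train_scaled)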
Error: It is the difference between the prediction made by a model and the actual value or
result.
Reducible errors: These errors can be reduced to improve the model's accuracy. Such errors
can be further classified into bias and variance.
Irreducible errors: These errors will always be present in the model regardless of which
algorithm is used. They are caused by unknown variables whose effect on the output cannot be
reduced.
What is Bias?
While making predictions, a difference occurs between prediction values made by the model
and actual values/expected values, and this difference is known as bias errors or Errors due
to bias.
Low bias means the model is good at learning the patterns in the data.
High bias means the model is too simple and doesn’t learn enough from the data.
What is Variance?
Variance specifies the amount of variation in the prediction if different training data were
used. In simple words, variance tells how much a random variable differs from its expected value.
Low Variance- It means the model gives almost the same results every time, even if you
change the training data a little.
High Variance- It means the model changes a lot when you give it different data. It's too
focused on the details of the training data.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high variance and low bias. So, it is required to
make a balance between bias and variance errors, and this balance between the bias error and
variance error is known as the Bias-Variance trade-off.
For an accurate prediction, an algorithm needs both low variance and low bias. But this is not
fully possible because bias and variance are related to each other: if we decrease the variance,
the bias will increase, and if we decrease the bias, the variance will increase.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance
between bias and variance errors.
Underfitting
• Definition: Underfitting happens when a model is too simple to capture the
underlying patterns in the data. It doesn’t learn enough from the training data,
leading to poor performance on both training and new data.
• Example: Imagine trying to predict house prices using only one feature, like the size
of the house, when in reality, factors like location and condition also matter. The
model might make general, inaccurate predictions because it’s not considering all
relevant information.
Overfitting
• Definition: Overfitting occurs when a model is too complex and learns not only the
useful patterns but also the noise or irrelevant details from the training data. This
makes it perform very well on the training data but poorly on new, unseen data.
Real-World Overfitting Example
Scenario: Predicting student test scores based on their study hours and other factors.
1. Problem: Imagine a teacher wants to predict students' test scores based on how
many hours they study, their previous grades, and additional details like their favorite
study snacks or the color of their notebooks.
2. Model: The teacher uses a very complex model that takes into account not just study
hours and previous grades but also many specific details like the type of study snacks
or notebook color. The model ends up fitting the data very closely.
3. Outcome: The model predicts test scores with high accuracy for the students in the
training data. For instance, it might learn that students who ate a particular snack
scored better on the test. This is due to random variations in the training data rather
than a real pattern.
4. New Data: When the model is used to predict scores for new students, it doesn’t
perform well. For example, if a new student didn’t eat the same snack or use the
same color notebook, the predictions might be off. The model fails to generalize
because it learned specific quirks from the training data that aren’t applicable to all
students.
UNIT-2
Regression Analysis in Machine learning
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
More specifically, Regression analysis helps us to understand how the value of the
dependent variable is changing corresponding to an independent variable when other
independent variables are held fixed. It predicts continuous/real values such as temperature,
age, salary, price, etc.
Regression:
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the cause-and-effect relationship between variables.
Some examples of regression can be as:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.
Independent Variable: The factors which affect the dependent variable, or which are used to
predict the values of the dependent variable, are called independent variables (also called
predictors).
Outliers: These are unusual data points that are much higher or lower than most of the data.
Outliers can mess up predictions, so they need special attention.
Multicollinearity: This happens when two or more independent variables are very similar. It
makes it harder to figure out which factor is more important. It's best to avoid this.
Underfitting and Overfitting:
• Overfitting: When the model is too good at remembering the training data but
doesn't perform well on new, unseen data.
• Underfitting: When the model is too simple and doesn't perform well even on the
training data.
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and corresponding conditional values of y.
o In Polynomial Regression, the original features are transformed into polynomial
features of a given degree and then modeled using a linear model, which means the
data points are best fitted using a polynomial curve (see the short sketch below).
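A short sketch of this idea with scikit-learn: PolynomialFeatures expands x into polynomial terms and a plain LinearRegression is then fit on the expanded features; the data below is a made-up quadratic:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.arange(1, 8).reshape(-1, 1)            # x = 1, 2, ..., 7 as a column vector
y = 2 + 3 * X.ravel() ** 2                    # made-up non-linear (quadratic) target

poly = PolynomialFeatures(degree=2)           # transform x into [1, x, x^2]
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)     # linear model on the polynomial features
print(model.predict(poly.transform([[10]])))  # should be close to 2 + 3*100 = 302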
Simple Linear Regression in Machine Learning
Simple Linear Regression is a type of Regression algorithms that models the relationship
between a dependent variable and a single independent variable. The relationship shown by
a Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple
Linear Regression.
Simple Linear regression algorithm has mainly two objectives:
o Model the relationship between the two variables. Such as the relationship
between Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
Y = a0 + a1X + ε
where Y is the dependent variable, X is the independent variable, a0 is the intercept, a1 is the
slope (regression coefficient), and ε is the error term.
1.Linearity: The relationship between the independent and dependent variable should be a
straight line.
3.No Autocorrelation: This means the residuals should not be correlated with each other. This is
especially important in time series data.
1.Import Libraries: Use libraries like pandas (for data handling), numpy (for mathematical
operations), matplotlib (for plotting), and sklearn (for the regression model).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
2.Load Dataset: Load your data (e.g., CSV file) into a DataFrame.
data = pd.read_csv('data.csv')
3.Define Variables: Choose your independent (X) and dependent (Y) variables.
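A sketch of this step, assuming hypothetical column names 'feature1' and 'target' in the CSV:
X = data[['feature1']]    # independent variable (hypothetical column name)
Y = data['target']        # dependent variable (hypothetical column name)
4.Split Data: Split the dataset into a training set and a test set, for example an 80/20 split.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)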
5.Create and Train Model: Create a linear regression model and train it on the training data.
model = LinearRegression()
model.fit(X_train, Y_train)
6.Make Predictions: Use the trained model to predict on the test data.
Y_pred = model.predict(X_test)
7.Evaluate the Model: Check how well the model performs using metrics like R-squared or
Mean Squared Error (MSE).
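A short evaluation sketch using scikit-learn's metrics module (Y_test and Y_pred come from the previous steps):
from sklearn.metrics import mean_squared_error, r2_score
print('MSE:', mean_squared_error(Y_test, Y_pred))
print('R-squared:', r2_score(Y_test, Y_pred))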
8.Visualize Results: Plot the data points and the regression line for better understanding.
plt.scatter(X_test, Y_test)
plt.plot(X_test, Y_pred, color='red')
plt.show()
Multiple Linear Regression is an extension of simple linear regression that models the
relationship between one dependent variable and two or more independent variables. The
goal is to understand how the independent variables impact the dependent variable.
o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.
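A minimal multiple linear regression sketch with two made-up predictors (years of experience and years of education) and invented salary values:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 12], [3, 16], [5, 16], [7, 18], [10, 20]])   # two independent variables
y = np.array([30, 50, 60, 80, 110])                            # dependent variable (toy salaries)

mlr = LinearRegression().fit(X, y)
print(mlr.coef_, mlr.intercept_)    # one coefficient per independent variable, plus the intercept
print(mlr.predict([[4, 16]]))       # prediction for a new observation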
Feature Selection
Feature Selection in machine learning refers to the process of selecting the most relevant and
important features (variables) from your dataset to improve the model's performance, reduce
overfitting, and make the model easier to interpret. It helps to focus on the most informative
parts of the data while discarding irrelevant or redundant features.
1. Filter Methods:
• How it Works: These methods apply statistical techniques to select features
independently of the model. They rank features based on statistical metrics and select
the highest-ranked features.
• Techniques:
o Correlation Coefficient: Measures how features are correlated with the target
variable. Features with high correlation are selected.
o Chi-Square Test: Used for categorical features, measures the association
between the feature and the target variable.
o ANOVA (Analysis of Variance): Measures the difference between the means
of different groups. Useful for continuous target variables.
2. Wrapper Methods:
• How it Works: These methods use the machine learning model itself to evaluate the
importance of features. They involve training a model on different subsets of features
and selecting the subset that yields the best performance.
• Techniques:
o Forward Selection: Start with no features and add them one by one, evaluating
the model at each step, and keeping the ones that improve performance.
o Backward Elimination: Start with all features, remove them one by one, and
check model performance. Remove features that don’t improve or harm
performance.
o Recursive Feature Elimination (RFE): Starts with all features and recursively
removes the least important ones based on model performance.
3. Embedded Methods:
• How it Works: These methods perform feature selection during the model training
process. Some models have built-in feature selection capabilities where important
features are identified as part of the learning algorithm.
• Techniques:
o Lasso Regression (L1 Regularization): Adds a penalty term to the model for
having too many features, forcing less important feature coefficients to zero.
o Ridge Regression (L2 Regularization): Reduces the magnitude of less
important feature coefficients.
o Decision Trees and Random Forest: These models automatically rank features
by importance based on how they reduce impurity at each split.
• Filter Methods: Good when you have a large dataset and need a quick and simple
method to reduce dimensionality.
• Wrapper Methods: Best when accuracy is important and computational resources are
not a limitation.
• Embedded Methods: Suitable for when you're using models like decision trees, Lasso,
or Ridge, as they perform feature selection internally.
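A short sketch showing one technique from each family on a built-in scikit-learn dataset; the choices of k, the estimator, and alpha are illustrative only:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression, Lasso

X, y = load_breast_cancer(return_X_y=True)

# Filter: keep the 5 features with the highest ANOVA F-score
X_filtered = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: recursive feature elimination down to 5 features using a model's weights
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print(rfe.support_)                              # boolean mask of the selected features

# Embedded: Lasso (L1) drives unimportant coefficients to exactly zero
# (Lasso is a regression model; it is used here only to illustrate coefficient shrinkage)
lasso = Lasso(alpha=0.1).fit(X, y)
print((lasso.coef_ != 0).sum(), 'features kept by Lasso')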
Dimensionality Reduction
1. Improves Model Performance: Reducing the number of features can lead to better
model performance by eliminating noise and irrelevant information.
2. Reduces Overfitting: Fewer dimensions can help the model generalize better to
unseen data.
3. Decreases Computation Time: Fewer features mean faster training and testing times.
4. Enhances Visualization: It allows for easier visualization of data in lower dimensions,
aiding in understanding and interpreting the data.
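A compact dimensionality-reduction sketch with scikit-learn's PCA, reducing the built-in 4-feature Iris dataset to 2 components:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)       # 150 samples, 4 features

pca = PCA(n_components=2)               # keep the 2 directions with the most variance
X_2d = pca.fit_transform(X)
print(X_2d.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance retained by each component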
UNIT-3
What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies each new observation into one of a
number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, Cat or Dog, etc.
Classes can also be called targets, labels, or categories.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then
it is called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG,
etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then
it is called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Classification Algorithms
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple and
computationally efficient. Some of the linear classification models are as follows:
• Logistic Regression
• Single-layer Perceptron
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They can capture
more complex relationships between the input features and the target variable. Some of the
non-linear classification models are as follows:
• K-Nearest Neighbours
• Kernel SVM
• Naive Bayes
• Random Forests
• AdaBoost
• Bagging Classifier
• Voting Classifier
• Extra Trees Classifier
Before getting started with classification, it is important to understand the problem you are
trying to solve. What are the class labels you are trying to predict? What is the relationship
between the input data and the class labels?
Suppose we have to predict whether a patient has a certain disease or not on the basis of 7
independent variables, called features. This means there can be only two possible outcomes:
the patient has the disease (positive class) or does not have the disease (negative class).
Data preparation
Once you have a good understanding of the problem, the next step is to prepare your data.
This includes collecting and preprocessing the data and splitting it into training, validation,
and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that
can be used by the classification algorithm.
Feature Extraction
The relevant features or attributes are extracted from the data that can be used to differentiate
between the different classes.
Suppose our input X has 7 independent features, but only 5 of them influence the label or
target value while the remaining 2 are negligibly correlated or not correlated at all; then we
will use only these 5 features for model training.
Model Selection
There are many different models that can be used for classification, including logistic
regression, decision trees, support vector machines (SVM), or neural networks. It is
important to select a model that is appropriate for your problem, taking into account the size
and complexity of your data, and the computational resources you have available.
Model Training
Once you have selected a model, the next step is to train it on your training data. This
involves adjusting the parameters of the model to minimize the error between the predicted
class labels and the actual class labels for the training data.
Model Evaluation
Evaluating the model: After training the model, it is important to evaluate its performance on
a validation set. This will give you a good idea of how well the model is likely to perform on
new, unseen data.
If the model’s performance is not satisfactory, you can fine-tune it by adjusting the
parameters, or trying a different model.
Finally, once we are satisfied with the performance of the model, we can deploy it to make
predictions on new data and use it for real-world problems.
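A compact end-to-end sketch of these steps on a built-in dataset, using logistic regression as the chosen model; every choice here (dataset, split, scaler, model) is illustrative:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data preparation: load the data and split it into training and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Preprocessing: scale the features (fit on the training data only)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Model selection and training
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Model evaluation on held-out data
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))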
K-Nearest Neighbor (K-NN) Algorithm
Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1; in which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular dataset. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider
the below image:
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K can smooth out the effect of noise, but very large values may include
points from other categories and make the class boundaries less distinct.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o It always needs to determine the value of K, which may be complex sometimes.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the straight-line distance between two points, which we have already
studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
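A minimal K-NN sketch with scikit-learn using K = 5 and the default Euclidean distance; the toy points echo the two-category example above:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: category A near the origin, category B further away (values invented)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5, Euclidean distance by default
knn.fit(X, y)
print(knn.predict([[2, 2]]))                # expected ['A']: most of the 5 nearest neighbors are A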
Random Forest Algorithm
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm:
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm: it takes
comparatively less training time, it predicts output with high accuracy even for large datasets,
and it can maintain accuracy when a large proportion of the data is missing.
The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree, and assign the new
data point to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:
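A short Random Forest sketch with scikit-learn, using a built-in dataset in place of the fruit-image example; the number of trees is an illustrative choice:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# N = 100 trees, each trained on a bootstrap sample with a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print('Accuracy:', forest.score(X_test, y_test))   # majority vote over all the trees
print(forest.feature_importances_)                 # built-in ranking of feature importance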
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Although Random Forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.
Support Vector Machine (SVM) Algorithm
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Since this is a 2-D space, we can easily separate these two classes just by using a straight line.
But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of the
lines from both the classes. These points are called support vectors. The distance between
the vectors and the hyperplane is called as margin. And the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert
it in 2d space with z=1, then it will become as:
Hence we get a circumference of radius 1 in case of non-linear data.
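A brief sketch of both SVM variants with scikit-learn: the concentric-circles dataset mirrors the z = x² + y² example, and the RBF kernel performs a similar non-linear mapping implicitly:
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric circles: not separable by a straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)    # linear SVM struggles on circular data
rbf_svm = SVC(kernel='rbf').fit(X, y)          # implicit non-linear mapping via the kernel

print('Linear SVM accuracy:', linear_svm.score(X, y))   # roughly chance level
print('RBF SVM accuracy:', rbf_svm.score(X, y))         # close to 1.0
print('Support vectors per class:', rbf_svm.n_support_)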
Support Vector Machines (SVMs) use kernel methods to transform the input data into a
higher-dimensional feature space, which makes it simpler to distinguish between classes or
generate predictions. Kernel approaches in SVMs work on the fundamental principle of
implicitly mapping input data into a higher-dimensional feature space without directly
computing the coordinates of the data points in that space.
The kernel function in SVMs is essential in determining the decision boundary that divides
the various classes. In order to calculate the degree of similarity between any two points in
the feature space, the kernel function computes their dot product.
Kernel functions used in machine learning, including in SVMs (Support Vector Machines),
share several important characteristics; in particular, a valid kernel is symmetric and
corresponds to an inner product in some (possibly higher-dimensional) feature space.
In Support Vector Machines (SVMs), there are several types of kernel functions that can be
used to map the input data into a higher-dimensional feature space. The choice of kernel
function depends on the specific problem and the characteristics of the data.
Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function, and it
defines the dot product between the input vectors in the original feature space.
K(x, y) = x · y
where x and y are the input feature vectors. The dot product of the input vectors is a measure
of their similarity or distance in the original feature space.
Polynomial Kernel
K(x, y) = (x · y + c)^d
where x and y are the input feature vectors, c is a constant term, and d is the degree of the
polynomial. The constant term c is added to the dot product of the input vectors, and the
result is raised to the power d.
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular
kernel function used in machine learning, particularly in SVMs (Support Vector Machines). It
is a nonlinear kernel function that maps the input data into a higher-dimensional feature space
using a Gaussian function.
K(x, y) = exp(−gamma · ||x − y||²)
where x and y are the input feature vectors, gamma is a parameter that controls the width of
the Gaussian function, and ||x − y||² is the squared Euclidean distance between the input
vectors.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type of
kernel function used in machine learning, including in SVMs (Support Vector Machines). It
is a non-parametric kernel that can be used to measure the similarity or distance between two
input feature vectors.
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where you cannot classify the
nodes any further; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
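A small decision tree sketch with scikit-learn (whose DecisionTreeClassifier implements the CART algorithm mentioned above) on a built-in dataset; max_depth is used as a simple form of pruning:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion='entropy' uses information gain for splits; max_depth limits tree growth
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                       # textual view of the decision and leaf nodes
print(tree.predict([[5.0, 3.5, 1.3, 0.2]]))    # class prediction for one new sample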
The ID3 (Iterative Dichotomiser 3) Algorithm in Machine Learning is a popular decision tree
algorithm used to classify data. It works by selecting the attribute that provides the maximum
information gain for splitting the data.
The ID3 (Iterative Dichotomiser 3) algorithm is pretty easy and a powerful algorithm used to
construct decision trees. The algorithm involves several key steps:
1. Selecting the Best Attribute: Begin by selecting the best attribute that splits the data
into subsets. This is done using a metric named as information gain, which measures
how well an attribute separates the data into groups based on the target attribute.
2. Tree Construction: Use the best attribute as a decision node and branch off from it
for each possible value of the attribute. This process is mainly for partitioning the
data.
3. Recursive Splitting: Repeat the process for each branch using the remaining
attributes. Stop if all instances in a branch are the same or no more attributes are
available.
4. Pruning (Optional): Simplify the tree by removing branches that have little effect on
the decision-making process to reduce overfitting and improve the model's
generalizability.
The ID3 algorithm builds a decision tree by selecting the attribute that separates the data into
different classes in the best way possible. Here's a step-by-step overview of how the
algorithm works:
• Start with the Entire Dataset: The algorithm begins by considering the entire
dataset as a whole.
• Determine Information Gain for Each Attribute: Information Gain is the reduction
in entropy achieved by splitting the data based on an attribute. The attribute with the
highest Information Gain is selected for the split. The formula for Information Gain is:
Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
where S is the dataset, A is the attribute being evaluated, Sv is the subset of S for which
attribute A has value v, and |S| is the number of examples in S.
• Split the Dataset: The dataset is split based on the chosen attribute, and the process is
repeated for each subset until all data points are perfectly classified, or no further
splits can be made.
• Create Leaf Nodes: Once the data is fully classified, the nodes at the ends of the
branches become leaf nodes, representing the final decision or classification.
The ID3 algorithm depends mainly on two main mathematical concepts: Entropy and
Information Gain.
1. Entropy
Entropy measures the level of uncertainty in a dataset. In decision trees, it quantifies the
randomness or impurity present. Low entropy means most data points belong to one class,
while high entropy shows a mix of classes. For example, if all data points in a dataset are
classified as "Yes," the entropy will be zero due to no uncertainty. On the other hand, a
50/50 split between "Yes" and "No" indicates maximum entropy due to higher uncertainty.
2. Information Gain
Information Gain measures how well an attribute separates the data. It is calculated as the
difference between the entropy of the dataset before the split and the weighted average of
the entropy after the split.
For example, in a dataset where splitting based on the "Outlook" attribute reduces the
entropy the most, the ID3 algorithm will select "Outlook" as the root node of the decision
tree.
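A tiny sketch that computes these two quantities directly, reproducing the examples above (an all-"Yes" set gives entropy 0, and a 50/50 split gives entropy 1):
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i * log2(p_i), where p_i is the class proportion
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, subsets):
    # Entropy before the split minus the weighted average entropy of the subsets after it
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets)

print(entropy(['Yes'] * 4))                    # 0.0: no uncertainty
print(entropy(['Yes', 'Yes', 'No', 'No']))     # 1.0: maximum uncertainty for two classes
# Splitting the mixed set into two pure subsets yields the largest possible gain:
print(information_gain(['Yes', 'Yes', 'No', 'No'], [['Yes', 'Yes'], ['No', 'No']]))   # 1.0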
Ensemble Methods
1. Bagging (Bootstrap Aggregating)
Key Idea: Combine predictions from multiple models trained independently on random
subsets of the data.
• Steps:
1. Create multiple subsets of the dataset by random sampling with replacement
(bootstrap sampling).
2. Train a separate model (e.g., decision trees) on each subset.
3. Combine predictions (e.g., average for regression or majority vote for
classification).
• Advantages:
o Reduces variance (prevents overfitting).
o Works well with high-variance models like decision trees.
• Example: Random Forest
o Random Forest is an extension of Bagging where each tree also selects a
random subset of features for splitting.
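A brief bagging sketch with scikit-learn's BaggingClassifier, using shallow decision trees as the base model; the depth and number of estimators are illustrative choices:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 50 trees, each fit on a bootstrap sample of the training data; predictions combined by vote
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=4), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)
print('Bagging accuracy:', bagging.score(X_test, y_test))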
2. Boosting
Key Idea: Build models sequentially, where each model tries to correct the errors of the
previous one.
• Steps:
1. Start with a weak model (e.g., a shallow decision tree).
2. Train the next model to focus on the errors (misclassified examples) from the
previous model.
3. Combine all models’ predictions, usually with weighted voting.
• Advantages:
o Reduces bias and variance.
o Works well with weak learners, like shallow decision trees.
• Drawbacks:
o More prone to overfitting compared to Bagging.
o Can be computationally expensive.
3. AdaBoost
Key Idea: A type of Boosting where each model focuses on correcting the mistakes of the
previous one by adjusting the weights of misclassified samples.
• Steps:
1. Initialize equal weights for all samples.
2. Train a weak learner (e.g., a decision stump).
3. Increase weights for misclassified samples so the next model focuses more on
them.
4. Combine all weak learners’ predictions using a weighted sum.
• Advantages:
o Simple and effective for many tasks.
o Works well with very simple weak learners such as decision stumps.
• Drawbacks:
o Sensitive to noise and outliers because it gives higher weights to difficult
examples.
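An AdaBoost sketch with scikit-learn, using decision stumps (depth-1 trees) as the weak learners; the number of estimators is an illustrative choice:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each stump gets a weight, and misclassified samples get more weight for the next stump
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print('AdaBoost accuracy:', ada.score(X_test, y_test))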
4. XGBoost (Extreme Gradient Boosting)
Key Idea: An optimized version of Gradient Boosting that uses advanced techniques to
improve speed and performance.
• Steps:
1. Build a series of decision trees using Gradient Boosting.
2. Use regularization techniques (L1 and L2) to prevent overfitting.
3. Employ techniques like parallel processing, sparse matrix optimization, and
early stopping for faster training.
• Advantages:
o Faster and more efficient than traditional Boosting.
o Regularization helps to avoid overfitting.
o Highly customizable for complex datasets.
• Applications:
o Often used in machine learning competitions (e.g., Kaggle) because of its
performance and versatility.
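A minimal sketch with the third-party xgboost package (assumed to be installed separately, e.g. via pip install xgboost); the hyperparameter values are illustrative:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # requires the separate xgboost package

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient boosting with L1/L2 regularization terms (reg_alpha / reg_lambda)
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, reg_alpha=0.0, reg_lambda=1.0)
xgb.fit(X_train, y_train)
print('XGBoost accuracy:', xgb.score(X_test, y_test))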
When to Use Each Method
1. Bagging:
o Best for high-variance models like decision trees.
o When you want to reduce overfitting.
2. Boosting:
o Best for reducing bias.
o Useful when your model underfits the data.
3. AdaBoost:
o Best for simple models and small datasets.
o Avoid if data has significant noise or outliers.
4. XGBoost:
o Best for large and complex datasets.
o Use when you need high performance and speed.
6. Gain Curves
• Definition: Gain curves show the cumulative percentage of true positives
captured as you target an increasing fraction of the dataset, ranked by the model's
predicted scores.
• How to Interpret:
o The closer the curve is to the top-left corner (capturing most of the positives
within a small fraction of the data), the better the model is at prioritizing positives.
o The baseline (diagonal line) represents random guessing.
• Applications: Similar to lift curves, useful for evaluating ranking and targeting
models.