Unit 5 Intro To Machine Learning


Machine Learning is an application of Artificial Intelligence (AI) that enables a program (software) to learn from experience and improve itself at a task without being explicitly programmed.

5.1 Machine Learning as a Process:


1. Problem Definition:

Identify and define the problem you want to solve using machine learning. Understand the
objectives, constraints, and desired outcomes.

2. Data Collection:

Gather relevant data that will be used to train and test the machine learning model. This data should
be representative and diverse, covering different scenarios and variations of the problem.

3. Data Preprocessing:

Clean the data by handling missing values, outliers, and inconsistencies. Convert categorical variables
into numerical format, scale features, and perform normalization or standardization as needed.

4. Feature Engineering:

Select or create features from the raw data that are relevant and informative for the model. This may
involve transforming, combining, or selecting features to improve the model's performance.

5. Model Selection:

Choose the appropriate machine learning algorithm(s) based on the problem type (classification,
regression, clustering, etc.) and the characteristics of the data.

6. Model Training:

Use a portion of the collected data to train the selected model(s). The model learns patterns and
relationships between input features and the target variable.

7. Model Evaluation:

Assess the performance of the trained model using evaluation metrics appropriate for the specific
problem. This step involves testing the model on a separate portion of the data (the test set) to
measure its accuracy, precision, recall, etc.

8. Model Tuning:

Optimize the model's hyperparameters to improve its performance. Techniques like cross-validation
or grid search help in finding the best hyperparameter values.

9. Model Deployment:

Once satisfied with the model's performance, deploy it in a real-world environment. This could
involve integrating the model into an application or system where it can make predictions or
decisions based on new, unseen data. A minimal code sketch of this end-to-end workflow is given below.
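As a concrete illustration of steps 2-8, here is a minimal sketch using scikit-learn on a synthetic dataset; the dataset, the logistic-regression model, and the parameter grid are illustrative assumptions rather than part of the original notes.

```python
# Minimal sketch of the ML process (steps 2-8) using scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 2. Data collection (here: a synthetic stand-in for real data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 6./7. Split into training and test sets for training and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3./5. Preprocessing (scaling) and model selection combined in a pipeline
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 8. Hyperparameter tuning with cross-validated grid search
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# 7. Evaluation on the held-out test set
y_pred = grid.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```
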
5.2 Associated concepts in Machine Learning

Underfitting and overfitting are two common issues in machine learning that affect the performance
of a model:

Underfitting:

• Definition: Underfitting occurs when a model is too simple to capture the underlying
patterns in the data.

• Characteristics:

• High bias and low variance.

• Fails to learn the complexities and nuances present in the dataset.

• Performs poorly on both the training data and unseen data (test/validation).

• Causes:

• Model complexity is insufficient.

• Insufficient training or not enough features are used.

• Solution:

• Increase model complexity by using more features, a more sophisticated algorithm, or adding polynomial features.

• Decrease regularization to allow the model to fit the training data better.

• Use more advanced algorithms that can capture complex patterns.

Overfitting:

• Definition: Overfitting occurs when a model is too complex and learns not only the
underlying patterns but also the noise in the training data.

• Characteristics:

• Low bias and high variance.

• Performs extremely well on the training data but poorly on unseen data.

• Captures noise and irrelevant patterns present in the training set.

• Causes:

• Model is too complex relative to the amount of data available.

• Too many features are used, or the model is trained for too many iterations.

• Solution:

• Simplify the model by reducing the number of features, applying feature selection,
or feature engineering.

• Regularize the model (e.g., through techniques like Lasso, Ridge regression) to
penalize complex models.
• Use more data for training to expose the model to a wider range of scenarios and
patterns.

Balancing Underfitting and Overfitting:

• The goal is to find a balance between the two extremes. This is achieved by optimizing the
model's complexity, ensuring it can generalize well to unseen data without being either too
simple or too complex.

• Techniques like cross-validation, regularization, and using an appropriate amount of data for
training and testing can help strike this balance (see the sketch below).
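A small sketch of how model complexity drives underfitting and overfitting, assuming numpy and scikit-learn are available; the data-generating cubic and the polynomial degrees 1, 3 and 15 are arbitrary choices used only to show the two extremes.

```python
# Compare underfit (degree-1), well-fit (degree-3) and overfit (degree-15) models
# on noisy data drawn from a cubic function.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(scale=3.0, size=60)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Cross-validated R^2: a too-simple model scores poorly (high bias), and a
    # too-complex one tends to score poorly on held-out folds (high variance).
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"degree={degree:2d}  mean CV R^2={score:.3f}")
```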

Learner

In machine learning, a "learner" refers to an algorithm or a system that acquires knowledge or learns
patterns and relationships from data. Essentially, it's the entity responsible for learning from the
provided information to make predictions, classifications, or take actions without being explicitly
programmed for every scenario.

Characteristics of a Learner:

1. Algorithm or Model: A learner can be a specific algorithm or a model that implements a
particular learning approach. For instance, decision trees, neural networks, support vector
machines, etc., are different types of learners.

2. Learning from Data: Learners learn patterns, structures, or relationships present in the
provided data. They generalize from observed examples to make predictions or decisions on
new, unseen data.

3. Training and Adaptation: The learner is trained or adapted using a dataset—this involves
adjusting its internal parameters or structure based on the data to improve its performance.

4. Generalization: Ideally, a good learner generalizes well, meaning it performs accurately on
new, unseen data beyond the training set. This ability to generalize is a crucial aspect of
effective learning.

Types of Learners:

• Supervised Learners: Learn from labeled training data, where each input has a
corresponding output or target label. Examples include classification and regression
algorithms.

• Unsupervised Learners: Learn from unlabeled data to discover patterns or structures present
in the data. Examples include clustering algorithms.

• Semi-supervised Learners: Utilize a combination of labeled and unlabeled data for learning.

• Reinforcement Learners: Learn through trial-and-error interactions with an environment,
aiming to maximize a reward signal.

Examples:

• A decision tree learner constructs a tree-like structure by selecting features that best split
the data based on certain criteria.
• A neural network learner adjusts weights and biases through training iterations to
approximate complex functions.

• A k-means clustering learner divides data into clusters based on similarity measures without
needing labeled information.

5.3 Types of Machine Learning


Machine learning has been broadly categorized into three categories:

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning

What is Supervised Learning?

Let us start with an easy example, say you are teaching a kid to differentiate dogs from cats. How
would you do it?

You may show him/her a dog and say "here is a dog", and when you encounter a cat you would point
it out as a cat. When you show the kid enough dogs and cats, he may learn to differentiate between
them. If he is trained well, he may be able to recognize different breeds of dogs that he hasn't even
seen before.

Similarly, in Supervised Learning, we have two sets of variables. One is the target variable, or
label (the variable we want to predict), and the other is the features (variables that help us predict the
target variable). We show the program (model) the features and the label associated with these features,
and the program is then able to find the underlying pattern in the data. Take this example of a
dataset where we want to predict the price of a house given its size (number of rooms). The price, which
is the target variable, depends upon the size, which is a feature.

Number of rooms    Price
1                  $100
3                  $300
5                  $500

In a real dataset, we will have a lot more rows and more than one feature like size, location, number
of floors, and many more.

Thus, we can say that the supervised learning model has a set of input variables (x), and an output
variable (y). An algorithm identifies the mapping function between the input and output variables.
The relationship is y = f(x).

The learning is monitored or supervised in the sense that we already know the output, and the
algorithm is corrected each time to optimize its results. The algorithm is trained over the data set
and amended until it achieves an acceptable level of performance.
We can group the supervised learning problems as:

• Regression problems – Used to predict continuous (often future) values; the model is trained with
historical data. E.g., predicting the future price of a house (a toy code sketch is given after this list).
• Classification problems – Labeled examples train the algorithm to identify items within a specific
category. E.g., dog or cat (as mentioned in the above example), apple or orange, beer or
wine or water.
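As a toy illustration of the regression case, the rooms-vs-price table above can be fitted directly; scikit-learn is an assumed library choice, and the numbers come from the table.

```python
# Fit the rooms-vs-price table from above with a simple linear model.
from sklearn.linear_model import LinearRegression

X = [[1], [3], [5]]          # feature: number of rooms
y = [100, 300, 500]          # target: price in dollars

model = LinearRegression().fit(X, y)
print(model.predict([[4]]))  # -> [400.], the learned mapping y = f(x) = 100 * rooms
```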

What is Unsupervised Learning?

This approach is the one where we have no target variable, and we have only the input
variables (features) at hand. The algorithm learns by itself and discovers interesting structure in the
data.

The goal is to decipher the underlying distribution in the data to gain more knowledge about the
data.

We can group the unsupervised learning problems as:

• Clustering: This means bundling inputs with the same characteristics together.
E.g., grouping users based on search history
• Association: Here, we discover the rules that govern meaningful associations among the
data set. E.g., People who watch ‘X’ will also watch ‘Y’.

What is Reinforcement Learning?

In this approach, machine learning models are trained to make a series of decisions based on the
rewards and feedback they receive for their actions. The machine learns to achieve a goal in complex
and uncertain situations and is rewarded each time it achieves it during the learning period.

Reinforcement learning is different from supervised learning in the sense that there is no answer
available, so the reinforcement agent decides the steps to perform a task. The machine learns from
its own experiences when there is no training data set present.
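A minimal tabular Q-learning sketch on a made-up five-state corridor; the environment, the +1 reward at the right end, and the hyperparameters are all illustrative assumptions.

```python
# Tabular Q-learning on a toy corridor: states 0..4, actions 0=left, 1=right,
# reward +1 only when the agent reaches state 4.
import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection: mostly exploit, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update rule
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # after training, "move right" has the higher value in every state
```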

5.4 Popular Machine Learning Algorithms


1. Supervised Learning Algorithms:

• Linear Regression: Predicts continuous values based on input features by fitting a linear
equation to observed data.

• Logistic Regression: Used for binary classification problems, estimating the probability that
an instance belongs to a particular class.

• Decision Trees: Builds a tree-like structure to make decisions by splitting data based on
feature values.

• Random Forest: An ensemble method using multiple decision trees to improve predictive
accuracy and control overfitting.

• Support Vector Machines (SVM): Effective for both classification and regression tasks by
finding the best decision boundary between data points.

2. Unsupervised Learning Algorithms:


• K-Means Clustering: Divides data into clusters based on similarity, aiming to minimize the
distance between data points in the same cluster.

• Hierarchical Clustering: Builds a tree of clusters by either iteratively merging smaller
clusters or splitting larger clusters.

• Principal Component Analysis (PCA): Reduces the dimensionality of data while preserving
as much variance as possible.

• Association Rule Learning: Discovers interesting relationships between variables in large
datasets using measures like support, confidence, and lift.

3. Semi-Supervised Learning Algorithms:

• Combines elements of supervised and unsupervised learning, using both labeled and
unlabeled data to improve learning accuracy.

4. Reinforcement Learning Algorithms:

• Q-Learning: A model-free reinforcement learning algorithm that learns by interacting with
an environment, aiming to maximize a reward.

• Deep Q Networks (DQN): Utilizes neural networks to approximate Q-values, enabling more
complex learning in reinforcement tasks.

5. Neural Network Algorithms:

• Feedforward Neural Networks (FNN): Traditional neural networks where information flows
in one direction from input to output.

• Convolutional Neural Networks (CNN): Suited for image recognition tasks, using
convolutional layers to detect patterns.

• Recurrent Neural Networks (RNN): Ideal for sequential data by retaining memory of
previous inputs, used in natural language processing and time series prediction.

• Long Short-Term Memory (LSTM) Networks: A specialized type of RNN with better handling
of long-term dependencies.
5.5 Feature engineering
It can be defined as the process of selecting, manipulating, and transforming raw data into features
that can improve the efficiency of developed ML models. It is a crucial step in the Machine Learning
development lifecycle, as the quality of the features used to train an ML model can significantly
affect its performance.

What is Feature Engineering in Machine Learning?

Feature engineering is the process of leveraging domain knowledge to engineer and create new
features that are not part of the input dataset. These features can greatly enhance the performance
of developed ML models. Feature engineering can involve the creation of new features from existing
data, combining existing features, or selecting the most relevant features. In other words, feature
engineering is the process of selecting, manipulating, and transforming raw data into features that
better represent the underlying problem and thus improve the accuracy of the machine learning
model.

Feature engineering is part of the data pre-processing stage in the machine learning development
lifecycle. It typically consists of four processes.

Feature Creation

• Feature creation is one of the most common techniques in feature engineering. It is also
known as feature construction or feature synthesis. It involves creating new features
from existing data by combining or transforming the existing features. For
example, if you have a dataset with the birth date of individuals, you can create a new
feature, age, by subtracting the birth date from the current date (a pandas sketch is given after this list).

• This technique is advantageous when the existing features do not provide enough
information to train and develop a robust machine-learning model or when the existing
features are not in a usable format for an ML model development. It is a powerful way to
improve the performance of machine learning models, but it can also be time-
consuming and require a deep understanding of the data and the problem at hand.
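The birth-date example above can be written in a few lines with pandas; the column name and dates are made up.

```python
# Feature creation: derive an "age" feature from a birth-date column.
import pandas as pd

df = pd.DataFrame({"birth_date": ["1990-05-17", "2001-11-02", "1985-01-30"]})
df["birth_date"] = pd.to_datetime(df["birth_date"])
df["age"] = (pd.Timestamp.today() - df["birth_date"]).dt.days // 365
print(df)
```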

Transformations

• Transformations are a set of mathematical operations that can be applied to existing features
to create new ones.

• Some of the most common transformation techniques include data aggregation, scaling,
normalization, binning, etc.

Feature Extraction

• Feature extraction is a feature engineering technique that creates new variables by
extracting them from the raw data. The main objective of this process is to reduce the
volume and dimensionality of the input data.

• A few of the most common techniques used for this method include - Principal Component
Analysis (PCA), Cluster Analysis, etc.

Feature Selection
• The feature selection technique involves the selection of a subset of the most relevant
features from the data and dropping the remaining features. It is performed using criteria
such as the importance of the features, the correlation between the features and target
variables, etc.

• It can greatly reduce the complexity of the ML model by removing redundant and irrelevant
features from the training process.

Need for Feature Engineering in Machine Learning

Feature engineering is a crucial step in the machine learning development lifecycle, as the quality of
the features used in the development of an ML model can greatly impact its performance. Let’s
explore some of the reasons why feature engineering is required in machine learning.

1. Better data representation - Feature engineering can help transform raw data into features
that better represent the underlying problem, which can result in improving the efficiency
and accuracy of the developed ML model.

2. Improvement in model performance - Feature engineering can improve the machine
learning model's performance in various ways, such as selecting the most relevant features,
creating new important features, reducing the volume of input data, dropping irrelevant and
redundant features, etc.

3. Increased interpretability of the model - By creating new features that are more easily
interpretable, feature engineering can make it easier to interpret and explain how the ML
model is making its predictions, which can be greatly useful for debugging and improving the
model.

4. Reduced overfitting - By removing irrelevant or redundant features, feature engineering can
help prevent overfitting.

5. Reduce the complexity of the model - By dropping irrelevant features and keeping only
important features, we can reduce the volume of the input data and can select simpler ML
models to train.

Steps in Feature Engineering

1. Data Preparation - This data pre-processing step consists of collecting, manipulating, and
combining raw data from various sources into a standardized format.

2. Data Cleaning - This step requires handling missing values, removing erroneous and
inconsistent values, etc.

3. Exploratory Data Analysis (EDA) - EDA involves analyzing and investigating datasets through
various data statistical and visualization methods. It can help data scientists to understand
data better, its characteristics, how to manipulate it, choose the right features for the
models, etc.

4. Benchmark/Evaluation - This step involves evaluating the performance of the ML model
using the engineered features and comparing it to the performance of the ML model without
feature engineering.

Feature Engineering Techniques for Machine Learning


Imputation

• In feature engineering, Imputation is a technique used to handle missing, irregular, erroneous, or NULL values in a given dataset.

• The most popular imputation techniques for numerical features include imputation by the mean,
median, or mode, imputation by nearest neighbors, interpolation, etc. The most common
imputation technique for categorical features is the mode, i.e. the most frequently occurring label or
category (see the sketch below).
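A short sketch of mean and most-frequent imputation with scikit-learn's SimpleImputer; the toy numerical and categorical columns are invented.

```python
# Impute missing values: mean for a numerical column, most frequent for a categorical one.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = np.array([[1.0], [np.nan], [3.0], [5.0]])
cat = pd.DataFrame({"colour": ["red", "blue", np.nan, "red"]})

print(SimpleImputer(strategy="mean").fit_transform(num))            # NaN -> 3.0 (mean of 1, 3, 5)
print(SimpleImputer(strategy="most_frequent").fit_transform(cat))   # NaN -> "red"
```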

Read More on https://deepai.org/machine-learning-glossary-and-terms/Imputation

Handling Outliers

• Outliers are the values or data points that are significantly different from the rest of the
observations or data points in the dataset. These are also
called abnormal, aberrant, or anomalous points.

• In Data Science, handling outliers is a crucial step, as outliers can significantly impact the
accuracy of the analysis and lead to incorrect conclusions and poor decision-making. In
feature engineering, z-score, standard deviation, and log transformation are the most
popular techniques to identify outliers and treat them.

Log Transformation

• Log transformation is one of the most popular techniques used by Data Scientists. It is used
to transform skewed distributions into normal or less skewed distributions.

• Log transformation can also make outliers less extreme, improving the model's accuracy and
fit (see the sketch below).
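A brief numpy sketch of z-score-based outlier flagging followed by a log transform; the sample values (including the obvious outlier) are invented.

```python
# Flag outliers with a z-score rule, then reduce skew with a log transform.
import numpy as np

x = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 300.0])   # 300 is an obvious outlier

z = (x - x.mean()) / x.std()
print("outliers:", x[np.abs(z) > 2])         # values whose z-score exceeds a chosen threshold

x_log = np.log1p(x)                          # log(1 + x) keeps zeros well-defined
print("log-transformed:", np.round(x_log, 2))
```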

Scaling

• In real-world datasets, features have different units and a range of values. For some ML
models, such as KNN, etc., it is necessary to have all features on a common scale or the same
range of values. Scaling is the technique used to transform variables to a common scale.

• Several techniques can be used for scaling data -

o Standardization - This is the most common scaling technique. It involves subtracting
the mean of the data from each value and dividing by the standard deviation. The
resulting data has zero mean and unit standard deviation.

o Min-max scaling - This technique transforms all values in a feature to a common
range (0, 1). It is performed by subtracting the minimum value from each data point
and dividing by the range of the data. (Both techniques are sketched below.)
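A minimal sketch of both scaling techniques with scikit-learn; the toy feature values are made up.

```python
# Standardization vs. min-max scaling on the same toy feature.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

print(StandardScaler().fit_transform(X).ravel())   # zero mean, unit standard deviation
print(MinMaxScaler().fit_transform(X).ravel())     # values rescaled into [0, 1]
```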

Binning

• It is a process of converting numerical or continuous variables into multiple bins or sets of
intervals. This makes data easy to analyze and understand.

• For example, the age feature can be converted into intervals such as (0-10, 11-20, …) or
labels such as (child, young, …), as sketched below.
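The age-binning example above, written with pandas; the bin edges and labels are illustrative choices.

```python
# Bin a continuous "age" column into interval and labelled categories.
import pandas as pd

ages = pd.Series([3, 15, 27, 8, 42, 19])
print(pd.cut(ages, bins=[0, 10, 20, 50]))                                 # (0, 10], (10, 20], ...
print(pd.cut(ages, bins=[0, 10, 20, 50], labels=["child", "young", "adult"]))
```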

Read More on https://www.wallstreetmojo.com/data-binning/


One Hot Encoding

• One Hot Encoding can be defined as a process of transforming categorical variables into
numerical features that can be used as input to Machine Learning and Deep Learning
algorithms.

• It is a binary vector representation of a categorical variable where the length of the vector is
equal to the number of distinct categories in the variable. In this vector, all values would be 0
except the ith value, which will be 1, representing the ith category of the variable (see the sketch below).
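A minimal one-hot encoding sketch with pandas; the colour column is made up.

```python
# One-hot encode a categorical "colour" column into binary indicator features.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["colour"]))
# -> colour_blue, colour_green, colour_red indicator columns (0/1 or True/False values)
```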

Read more on https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/

5.6 Introduction to Classification and Clustering Techniques


Classification

Classification is a process of categorizing data or objects into predefined classes or categories based
on their features or attributes. In machine learning, classification is a type of supervised
learning technique where an algorithm is trained on a labeled dataset to predict the class or category
of new, unseen data.

The main objective of classification is to build a model that can accurately assign a label or category
to a new observation based on its features. For example, a classification model might be trained on a
dataset of images labeled as either dogs or cats and then used to predict the class of new, unseen
images of dogs or cats based on their features such as color, texture, and shape.

Types of Classification

Classification is of two types:

1. Binary Classification: In binary classification, the goal is to classify the input into one of two
classes or categories. Example – On the basis of the given health conditions of a person, we
have to determine whether the person has a certain disease or not.

2. Multiclass Classification: In multi-class classification, the goal is to classify the input into one
of several classes or categories. For example – on the basis of data about different species of
flowers, we have to determine which species our observation belongs to.
Types of classification algorithms

There are various types of classifiers. Some of them are :

• Linear Classifiers: Linear models create a linear decision boundary between classes. They are
simple and computationally efficient. Some of the linear classification models are as follows:

• Logistic Regression

• Support Vector Machines having kernel = ‘linear’

• Single-layer Perceptron

• Stochastic Gradient Descent (SGD) Classifier

• Non-linear Classifiers: Non-linear models create a non-linear decision boundary between
classes. They can capture more complex relationships between the input features and the
target variable. Some of the non-linear classification models are as follows:

• K-Nearest Neighbours

• Kernel SVM

• Naive Bayes

• Decision Tree Classification

• Ensemble learning classifiers:

• Random Forests,

• AdaBoost,

• Bagging Classifier,

• Voting Classifier,

• ExtraTrees Classifier

• Multi-layer Artificial Neural Networks

Type of Learners in Classifications Algorithm

In machine learning, classification learners can also be classified as either “lazy” or “eager” learners.

• Lazy Learners: Also known as instance-based learners, lazy learners do not
learn a model during the training phase. Instead, they simply store the training data and use
it to classify new instances at prediction time. Training is therefore very fast, since almost no
computation happens before predictions are requested, but lazy learners are less effective in
high-dimensional spaces or when the number of training instances is large. Examples of lazy
learners include k-nearest neighbors and case-based reasoning.

• Eager Learners: Also known as model-based learners, eager learners learn
a model from the training data during the training phase and use this model to classify new
instances at prediction time. They are more effective in high-dimensional spaces and with large
training datasets. Examples of eager learners include decision trees, random forests, and
support vector machines.
Classification model Evaluations

Evaluating a classification model is an important step in machine learning, as it helps to assess the
performance and generalization ability of the model on new, unseen data. There are several metrics
and techniques that can be used to evaluate a classification model, depending on the specific
problem and requirements. Here are some commonly used evaluation metrics:

Classification Accuracy

Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of
the number of correct predictions to the total number of input samples:

Accuracy = Number of correct predictions / Total number of predictions

It works well only if there are an equal number of samples belonging to each class.

For example, consider that there are 98% samples of class A and 2% samples of class B in our training
set. Then our model can easily get 98% training accuracy by simply predicting every training sample
belonging to class A.

When the same model is tested on a test set with 60% samples of class A and 40% samples of class B,
the test accuracy would drop to 60%. Classification Accuracy is easy to interpret, but it can give us a
false sense of achieving high performance.

The real problem arises when the cost of misclassifying the minority class samples is very high. If
we deal with a rare but fatal disease, the cost of failing to diagnose a sick person is much
higher than the cost of sending a healthy person for more tests.

Logarithmic Loss

Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for multi-class
classification. When working with Log Loss, the classifier must assign a probability to each class for all
the samples. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as
below:

Log Loss = -(1/N) Σ_i Σ_j y_ij · log(p_ij)

where

y_ij indicates whether sample i belongs to class j or not

p_ij indicates the probability of sample i belonging to class j

Log Loss has no upper bound, and it lies in the range [0, ∞). A Log Loss nearer to 0 indicates higher
accuracy, whereas a Log Loss further from 0 indicates lower accuracy.

In general, minimising Log Loss gives greater accuracy for the classifier.

Confusion Matrix

Confusion Matrix as the name suggests gives us a matrix as output and describes the complete
performance of the model.
Let's assume we have a binary classification problem. We have some samples belonging to two classes:
YES or NO. Also, we have our own classifier which predicts a class for a given input sample. On testing
our model on 165 samples, we get the following result.

Confusion Matrix

There are 4 important terms:

• True Positives: The cases in which we predicted YES and the actual output was also YES.

• True Negatives: The cases in which we predicted NO and the actual output was NO.

• False Positives: The cases in which we predicted YES and the actual output was NO.

• False Negatives: The cases in which we predicted NO and the actual output was YES.

Accuracy for the matrix can be calculated by dividing the sum of the values lying on the "main
diagonal" by the total number of samples, i.e.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Confusion Matrix forms the basis for the other types of metrics.

Area Under Curve

Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for binary
classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a
randomly chosen positive example higher than a randomly chosen negative example. Before
defining AUC, let us understand a few basic terms:

• True Positive Rate (Sensitivity): True Positive Rate is defined as TP / (FN + TP). It corresponds
to the proportion of positive data points that are correctly considered as positive, with respect
to all positive data points.

• True Negative Rate (Specificity): True Negative Rate is defined as TN / (FP + TN). It corresponds
to the proportion of negative data points that are correctly considered as negative, with respect
to all negative data points.

• False Positive Rate: False Positive Rate is defined as FP / (FP + TN). It corresponds to the
proportion of negative data points that are mistakenly considered as positive, with respect to
all negative data points.

False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both
computed at varying threshold values such as (0.00, 0.02, 0.04, …, 1.00) and a graph is drawn. AUC is
the area under the resulting ROC curve, obtained by plotting the True Positive Rate against the False
Positive Rate at these thresholds.

As evident, AUC has a range of [0, 1]. The greater the value, the better is the performance of our
model.
F1 Score

F1 Score is used to measure a test's accuracy.

F1 Score is the Harmonic Mean between precision and recall. The range for F1 Score is [0, 1]. It tells
you how precise your classifier is (how many of the instances it labels as positive really are positive),
as well as how robust it is (whether it misses a significant number of positive instances).

High precision but low recall gives you a classifier that is extremely accurate on the instances it does
label as positive, but misses a large number of instances that are difficult to classify. The greater the
F1 Score, the better the performance of our model. Mathematically, it can be expressed as:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

F1 Score tries to find the balance between precision and recall.

• Precision: It is the number of correct positive results divided by the number of positive
results predicted by the classifier.

• Recall: It is the number of correct positive results divided by the number of all relevant
samples (all samples that should have been identified as positive).

Mean Absolute Error

Mean Absolute Error is the average of the difference between the Original Values and the Predicted
Values. It gives us the measure of how far the predictions were from the actual output. However,
they don’t gives us any idea of the direction of the error i.e. whether we are under predicting the
data or over predicting the data. Mathematically, it is represented as :

Mean Squared Error

Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE
takes the average of the squares of the differences between the original values and the predicted
values. The advantage of MSE is that it is easier to compute the gradient, whereas Mean
Absolute Error requires complicated linear programming tools to compute the gradient. As we take
the square of the error, the effect of larger errors becomes more pronounced than that of smaller
errors, so the model can focus more on the larger errors.

MSE = (1/N) Σ (y_i - ŷ_i)²

It is important to choose the appropriate evaluation metric(s) based on the specific problem and
requirements, and to avoid overfitting by evaluating the model on independent test data.
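Most of the metrics discussed above are available directly in scikit-learn; here is a compact sketch on made-up labels and scores.

```python
# Compute the evaluation metrics discussed above on a small made-up example.
from sklearn.metrics import (accuracy_score, log_loss, confusion_matrix,
                             precision_score, recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]    # predicted probability of class 1

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:\n", confusion_matrix(y_true, y_pred))   # rows = actual, cols = predicted
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("log loss :", log_loss(y_true, y_score))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```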

How does classification work?

The basic idea behind classification is to train a model on a labeled dataset, where the input data is
associated with their corresponding output labels, to learn the patterns and relationships between
the input data and output labels. Once the model is trained, it can be used to predict the output
labels for new unseen data.

The classification process typically involves the following steps:

1. Understanding the problem: Before getting started with classification, it is important to
understand the problem you are trying to solve. What are the class labels you are trying to
predict? What is the relationship between the input data and the class labels?

• Suppose we have to predict whether a patient has a certain disease or not, on the
basis of 7 independent variables, called features. This means, there can be only two
possible outcomes:

• The patient has the disease, which means “True”.

• The patient does not have the disease, which means "False".

• This is a binary classification problem.

2. Data preparation: Once you have a good understanding of the problem, the next step is to
prepare your data. This includes collecting and preprocessing the data and splitting it into
training, validation, and test sets. In this step, the data is cleaned, preprocessed, and
transformed into a format that can be used by the classification algorithm.

• X: The independent features, in the form of an N*M matrix, where N is the number of
observations and M is the number of features.

• y: An N-dimensional vector containing the target class for each of the N observations.

3. Feature Extraction: The relevant features or attributes are extracted from the data that can
be used to differentiate between the different classes.

• Suppose our input X has 7 independent features, of which only 5 influence the label or
target value while the remaining 2 are negligibly correlated or uncorrelated with it; then we
will use only those 5 features for model training.

4. Model Selection: There are many different models that can be used for classification,
including logistic regression, decision trees, support vector machines (SVM), or neural
networks. It is important to select a model that is appropriate for your problem, taking into
account the size and complexity of your data, and the computational resources you have
available.

5. Model Training: Once you have selected a model, the next step is to train it on your training
data. This involves adjusting the parameters of the model to minimize the error between the
predicted class labels and the actual class labels for the training data.
6. Model Evaluation: After training the model, it is important to evaluate
its performance on a validation set. This will give you a good idea of how well the model is
likely to perform on new, unseen data.

• Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-ROC
curve are the quality metrics used for measuring the performance of the model.

7. Fine-tuning the model: If the model’s performance is not satisfactory, you can fine-tune it
by adjusting the parameters, or trying a different model.

8. Deploying the model: Finally, once we are satisfied with the performance of the model, we
can deploy it to make predictions on new data and use it for the real-world problem.

Classification Lifecycle

Applications of Classification Algorithm

Classification algorithms are widely used in many real-world applications across various domains,
including:

• Email spam filtering

• Credit risk assessment

• Medical diagnosis

• Image classification
• Sentiment analysis.

• Fraud detection

• Quality control

• Recommendation systems

Clustering

It is basically a type of unsupervised learning method. An unsupervised learning method is a method
in which we draw inferences from datasets consisting of input data without labeled responses.
Generally, it is used as a process to find meaningful structure, explanatory underlying processes,
generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same group are more similar to each other than to data points in other groups. It is
basically a grouping of objects on the basis of the similarity and dissimilarity between them.

For example, data points that lie close together in a scatter plot can be classified into one single
group; in such a plot we may be able to visually distinguish, say, 3 separate clusters.

It is not necessary for clusters to be spherical; for example, DBSCAN (Density-Based Spatial Clustering
of Applications with Noise) can find clusters of arbitrary shape. In the simplest, centroid-based view,
data points are clustered using the basic idea that a data point lies within a given distance of its
cluster center, and various distance measures and techniques are used to identify outliers.

Why Clustering?

Clustering is important because it determines the intrinsic grouping among the unlabelled data
present. There are no universal criteria for a good clustering; it depends on the user and on what
criteria satisfy their need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), finding "natural clusters" and describing their unknown
properties ("natural" data types), finding useful and suitable groupings ("useful" data classes), or
finding unusual data objects (outlier detection). A clustering algorithm must make some assumptions
about what constitutes the similarity of points, and each assumption makes for different and equally
valid clusters.

Clustering Methods:

• Density-Based Methods: These methods consider clusters as dense regions of the data space,
separated from regions of lower density. They have good accuracy and the ability to merge
two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise),
OPTICS (Ordering Points to Identify Clustering Structure), etc.

• Hierarchical Based Methods: The clusters formed in this method form a tree-type structure
based on the hierarchy. New clusters are formed using the previously formed ones. It is
divided into two categories:

• Agglomerative (bottom-up approach)

• Divisive (top-down approach)

Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies), etc.

• Partitioning Methods: These methods partition the objects into k clusters, with each partition
forming one cluster. They optimize an objective criterion (a similarity function), typically one
in which distance is a major parameter. Examples: K-means, CLARANS (Clustering Large
Applications based upon Randomized Search), etc.

• Grid-based Methods: In this method, the data space is divided into a finite number of
cells that form a grid-like structure. All the clustering operations performed on these grids are
fast and independent of the number of data objects. Examples: STING (Statistical Information
Grid), WaveCluster, CLIQUE (CLustering In Quest), etc.

Clustering Algorithms: K-means clustering algorithm – It is the simplest unsupervised learning
algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k
clusters, where each observation belongs to the cluster with the nearest mean, which serves as a
prototype of the cluster.
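A minimal K-means sketch on synthetic blob data; the data and the choice of k = 3 are illustrative.

```python
# Cluster synthetic 2-D data into k = 3 groups with K-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.cluster_centers_)      # the three learned cluster means (prototypes)
print(km.labels_[:10])          # cluster assignment of the first ten points
```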

Applications of Clustering in different fields:

1. Marketing: It can be used to characterize & discover customer segments for marketing
purposes.

2. Biology: It can be used for classification among different species of plants and animals.

3. Libraries: It is used in clustering different books on the basis of topics and information.

4. Insurance: It is used to understand customers and their policies and to identify fraud.
5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.

6. Earthquake studies: By learning the earthquake-affected areas, we can determine the
dangerous zones.

7. Image Processing: Clustering can be used to group similar images together, classify images
based on content, and identify patterns in image data.

8. Genetics: Clustering is used to group genes that have similar expression patterns and identify
gene networks that work together in biological processes.

9. Finance: Clustering is used to identify market segments based on customer behavior, identify
patterns in stock market data, and analyze risk in investment portfolios.

10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.

11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.

12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.

13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial
transactions, which can help in detecting fraud or other financial crimes.

14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak
hours, routes, and speeds, which can help in improving transportation planning and
infrastructure.

15. Social network analysis: Clustering is used to identify communities or groups within social
networks, which can help in understanding social behavior, influence, and trends.

16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.

17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and
its impact on the environment.

18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.

19. Crime analysis: Clustering is used to group similar patterns of crime data, such as location,
time, and type, which can help in identifying crime hotspots, predicting future crime trends,
and improving crime prevention strategies.
Linear Regression Model Representation

The representation is a linear equation that combines a specific set of input values (x), the solution to
which is the predicted output for that set of input values (y). As such, both the input values (x) and
the output value are numeric.

The linear equation assigns one scale factor to each input value or column, called a coefficient and
represented by the Greek letter beta (β). One additional coefficient is added, giving the line
an additional degree of freedom (e.g., moving up and down on a two-dimensional plot); this is often
called the intercept or the bias coefficient.

For example, in a simple regression problem (a single x and a single y), the form of the model would
be:

Y = β0 + β1x

In higher dimensions, the line is called a plane or a hyper-plane when we have more than one input
(x). The representation, therefore, is in the form of the equation and the specific values used for the
coefficients (e.g., β0 and β1 in the above example).
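A short sketch of estimating β0 and β1 from data with scikit-learn; the synthetic data and the true coefficients used to generate it are assumptions for illustration.

```python
# Estimate the intercept (B0) and coefficient (B1) of a simple linear model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 2.5 * x[:, 0] + rng.normal(scale=1.0, size=100)   # true B0 = 4.0, B1 = 2.5

model = LinearRegression().fit(x, y)
print("B0 (intercept)  :", model.intercept_)    # close to 4.0
print("B1 (coefficient):", model.coef_[0])      # close to 2.5
```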

Performance of Regression

The regression model’s performance can be evaluated using various metrics like MAE, MAPE, RMSE,
R-squared, etc.

Mean Absolute Error (MAE)

By using MAE, we calculate the average absolute difference between the actual values and the
predicted values.

Mean Absolute Percentage Error (MAPE)

MAPE is defined as the average of the absolute deviation of the predicted value from the actual
value. It is the average of the ratio of the absolute difference between actual & predicted values and
actual values.

Root Mean Square Error (RMSE)

RMSE calculates the square root average of the sum of the squared difference between the actual
and the predicted values.

R-squared values

R-square value depicts the percentage of the variation in the dependent variable explained by the
independent variable in the model.

RSS = Residual sum of squares: It measures the difference between the predicted and the actual
output. A small RSS indicates a tight fit of the model to the data. It is defined as
RSS = Σ (y_i - ŷ_i)², where y_i is the actual value and ŷ_i the predicted value.

TSS = Total sum of squares: It is the sum of the squared deviations of the data points from the
response variable's mean, TSS = Σ (y_i - ȳ)². R-squared is then computed as R2 = 1 - RSS/TSS.

The R2 value ranges from 0 to 1; the higher the R-square value, the better the model. However, the
value of R2 increases if we add more variables to the model, irrespective of whether the variable
contributes to the model or not. This is the disadvantage of using R2.

Adjusted R-squared values


The Adjusted R2 value fixes the disadvantage of R2. The adjusted R2 value will improve only if the
added variable contributes significantly to the model, because it adds a penalty for every additional
variable:

Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)]

where R2 is the R-square value, n = the total number of observations, and k = the total number of
variables used in the model. If we increase the number of variables, the denominator (n - k - 1)
becomes smaller, the overall ratio becomes higher, and subtracting it from 1 reduces the Adjusted R2.
So, to increase the Adjusted R2, the contribution of the added features to the model should be
significantly high.
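The regression metrics above can be computed in a few lines; this numpy/scikit-learn sketch uses made-up actual and predicted values and assumes a single predictor (k = 1) for the adjusted R2.

```python
# Compute MAE, MAPE, RMSE, R^2 and adjusted R^2 for a toy set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 11.5])

mae  = mean_absolute_error(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100      # percentage error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2   = r2_score(y_true, y_pred)

n, k = len(y_true), 1                     # n observations, k predictors (assumed 1 here)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mae, mape, rmse, r2, adj_r2)
```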

Simple Linear Regression Example

For the given equation for Linear Regression,

Y = β0 + β1X + ε

if there is only 1 predictor available, then it is known as Simple Linear Regression.

While executing the prediction, there is an error term (ε) that is associated with the equation.

The SLR model aims to find the estimated values of β1 & β0 by keeping the error term (ε) minimum.

Multiple Linear Regression Example

For the given equation of Linear Regression,

if there is more than 1 predictor available, then it is known as Multiple Linear Regression.

The equation for MLR will be:

Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + ε

where

β1 = coefficient for the X1 variable

β2 = coefficient for the X2 variable

β3 = coefficient for the X3 variable, and so on…

β0 is the intercept (constant term). While making the prediction, there is an error term (ε) that is
associated with the equation.

The goal of the MLR model is to find the estimated values of β0, β1, β2, β3… by keeping the error term
(ε) minimum.

Broadly speaking, supervised machine learning algorithms are classified into two types-

1. Regression: Used to predict a continuous variable

2. Classification: Used to predict a discrete variable


Linear regression is one of the statistical methods of predictive analytics to predict the target variable
(dependent variable). When we have one independent variable, we call it Simple Linear Regression.
If the number of independent variables is more than one, we call it Multiple Linear Regression.

A mathematical formulation of Multiple Linear Regression

In Linear Regression, we try to find a linear relationship between independent and dependent
variables by using a linear equation on the data.

The equation of a straight line is-

Y = mx + c

where m is the slope and c is the intercept.

In Linear Regression, we are actually trying to find the best m and c values for the dependent variable
Y and independent variable x. We fit many lines and take the best one, the line that gives the least
possible error. We then use the corresponding m and c values to predict the y value.

The same concept can be used in multiple Linear Regression where we have multiple independent
variables, x1, x2, x3…xn.

Now the equation changes to-

Y = M1X1 + M2X2 + M3X3 + … + MnXn + C

The above equation is not a line but a plane of multi-dimensions.

Model Evaluation:

A model can be evaluated by using the below methods-

1. Mean absolute error: It is the mean of the absolute values of the errors, MAE = (1/N) Σ |y_i - ŷ_i|.

2. Mean squared error: It is the mean of the squared errors, MSE = (1/N) Σ (y_i - ŷ_i)².

3. Root mean squared error: It is just the square root of MSE, RMSE = √MSE.

Applications

1. The effect of the independent variable on the dependent variable can be calculated.

2. Used to predict trends.

3. Used to find how much change can be expected in a dependent variable with change in an
independent variable.

Polynomial Regression

Polynomial regression is a non-linear regression. In Polynomial regression, the relationship of the
dependent variable is fitted to the nth degree of the independent variable.

Equation of polynomial regression:

Y = β0 + β1x + β2x² + … + βnxⁿ + ε
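A brief polynomial-regression sketch using scikit-learn's PolynomialFeatures; the degree-2 choice and the synthetic data are assumptions.

```python
# Fit a degree-2 polynomial regression to data generated from a quadratic curve.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(80, 1))
y = 1.0 + 0.5 * x[:, 0] + 2.0 * x[:, 0] ** 2 + rng.normal(scale=0.5, size=80)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # close to 1 + 0.5*2 + 2*4 = 10
```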


Underfitting and Overfitting

When we fit a model, we try to find the optimized, best-fit line, which can describe the impact of the
change in the independent variable on the change in the dependent variable by keeping the error
term minimum. While fitting the model, there can be 2 events that will lead to the bad performance
of the model. These events are

1. Underfitting

2. Overfitting

Underfitting

Underfitting is the condition where the model cannot fit the data well enough, which leads to low
accuracy. The model is unable to capture the relationship, trend, or pattern in the training data.
Underfitting of the model can be avoided by using more data or by optimizing the parameters of the
model.

Overfitting

Overfitting is the opposite case of underfitting, i.e., when the model predicts very well on training
data but is not able to predict well on test or validation data. The main reason for overfitting
could be that the model is memorizing the training data and is unable to generalize to a
test/unseen dataset. Overfitting can be reduced by performing feature selection or by using
regularisation techniques.
