
What are the issues with machine learning?

Machine learning (ML) is powerful, but it comes with several challenges and
issues that researchers and practitioners need to address. Here are some key
ones:
1. Data-Related Issues
Quality of Data: ML models rely heavily on high-quality data. Incomplete,
noisy, or biased data can lead to poor performance.
Data Scarcity: Some domains lack sufficient labeled data for training robust
models.
2. Overfitting and Underfitting
Overfitting: The model performs well on training data but poorly on unseen
data because it has "memorized" rather than "learned" patterns.
Underfitting: The model is too simplistic to capture the underlying patterns in
the data, leading to poor performance on both training and test data.
3. Computational Challenges
Resource-Intensive Training: Training complex models like deep learning
requires significant computational power, time, and energy.
Scalability: Managing and processing large datasets efficiently can be
challenging.
4. Interpretability and Explainability
Complex ML models (like deep neural networks) are often considered "black boxes," making it hard to understand how they make decisions. Lack of interpretability can hinder trust and adoption in critical fields like healthcare or finance.
5. Ethical and Privacy Concerns
Data Privacy: Collecting and using personal data for training models raises
privacy concerns.
Ethical Use: Misuse of ML, such as for surveillance or spreading
misinformation, poses ethical dilemmas.
6. Generalization
 ML models may struggle to generalize to new, unseen data, especially if
the training data does not represent the real-world distribution.
7. Model Deployment and Maintenance
Adaptation to Change: Real-world data can change over time (data drift),
requiring frequent model updates.
Integration: Embedding ML models into existing systems and workflows can be
complex.
8. Reproducibility
Variations in implementation, data preprocessing, or random seeds can make
reproducing ML results difficult.
9. Cost
Developing and deploying ML systems can be expensive due to the need for
skilled personnel, computational resources, and data preparation.
10. Ethical AI Development
Ensuring that models are designed, trained, and used in a way that aligns with
societal values and fairness can be challenging but crucial.
UNIT-2

Model of Supervised learning.


Supervised machine learning is a fundamental approach in machine learning and artificial intelligence. It involves training a model using labeled data, where each input comes with a corresponding correct output. The process is like a teacher guiding a student, hence the term "supervised" learning. Supervised learning can be applied in various forms, including supervised classification and supervised regression, making it a crucial technique in the field of artificial intelligence and supervised data mining.
Supervised learning can be applied to two main types of problems:
1. Regression
Regression algorithms are used when there is a relationship between the input variable and the output variable and the value to be predicted is continuous, such as in weather forecasting, market trends, etc. Popular regression algorithms that come under supervised learning include:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, for example two classes such as Yes-No, Male-Female, True-False, etc. A typical example is spam filtering. Popular classification algorithms that come under supervised learning include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support Vector Machines

Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training set, a test set, and a validation set.
o Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as support vector machine, decision tree, etc.
o Execute the algorithm on the training set. Sometimes a validation set is needed to tune the control parameters; it is a subset of the training data.
o Evaluate the accuracy of the model on the test set. If the model predicts the correct outputs, the model is accurate. A minimal end-to-end sketch of these steps is given below.
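As an illustration of these steps, here is a minimal sketch using scikit-learn; the Iris dataset, the 30% test split, and the decision-tree classifier are only assumptions chosen for the example, not prescriptions.

```python
# Minimal supervised-learning workflow sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Gather labelled data: inputs X, labels y
X, y = load_iris(return_X_y=True)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Choose an algorithm and run it on the training set
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```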

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the
output on the basis of prior experiences.

o In supervised learning, we can have an exact idea about the


classes of objects.

o Supervised learning model helps us to solve various real-world


problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling the


complex tasks.

o Supervised learning cannot predict the correct output if the test


data is different from the training dataset.

o Training required lots of computation times.

o In supervised learning, we need enough knowledge about the


classes of object.

Evaluating performance of a model.


To evaluate the performance or quality of the model, different metrics are
used, and these metrics are known as performance metrics or evaluation
metrics. These performance metrics help us understand how well our model
has performed for the given data.
To evaluate the performance of a classification model, different metrics
are used, and some of them are as follows:

o Accuracy

o Confusion Matrix

o Precision

o Recall

o F-Score
o AUC(Area Under the Curve)-ROC

Accuracy
o The accuracy metric is one of the simplest classification metrics to implement. It is the ratio of the number of correct predictions to the total number of predictions.
o It can be formulated as:
Accuracy = (Number of correct predictions) / (Total number of predictions) = (TP + TN) / (TP + TN + FP + FN)

Confusion Matrix
A confusion matrix is a tabular representation of prediction
outcomes of any binary classifier, which is used to describe the
performance of the classification model on a set of test data
when true values are known.

Precision
The precision metric is used to overcome a limitation of accuracy. Precision is the proportion of positive predictions that were actually correct. It is calculated as the number of true positives divided by the total number of positive predictions (true positives plus false positives):
Precision = TP / (TP + FP)

Recall or Sensitivity
Recall is similar to the precision metric, but it measures the proportion of actual positives that were correctly identified. It is calculated as the number of true positives divided by the total number of actual positives, i.e. those correctly predicted as positive plus those incorrectly predicted as negative:
Recall = TP / (TP + FN)
F-Score
The F-score or F1 score is a metric to evaluate a binary classification model on the basis of the predictions made for the positive class. It is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
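A short sketch of how these metrics can be computed with scikit-learn; the hard-coded labels are made up for illustration and would normally come from a trained model.

```python
# Computing accuracy, confusion matrix, precision, recall and F1 for a binary classifier.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual class labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by some model

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))  # rows = actual, cols = predicted
print("Precision:", precision_score(y_true, y_pred))            # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))               # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))                   # harmonic mean of precision and recall
```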

 Improving a machine learning model.

How to Improve the Performance of ML Models
1. Choosing the Right Algorithm
The algorithm is the key factor used to train an ML model; it is what the model uses to learn from the data and make accurate predictions. Hence, choosing the right algorithm is important to ensure the performance of your machine learning model.
Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN, K-Means, Random Forest, Dimensionality Reduction algorithms and Gradient Boosting are leading ML algorithms you can choose from, depending on your problem.
2. Use the Right Quantity of Data
The next important factor to consider while developing a machine learning model is the quantity of data. Deep learning-based models in particular require very large datasets.
Depending on the complexity of the problem and of the learning algorithm, model skill, evaluation of data size and the use of statistical heuristic rules are the leading factors that determine the quantity and type of training data needed to improve the performance of the model.
3. Quality of Training Data Sets
Just like quantity, the quality of the training data is another key factor to keep in mind while developing an ML model. If the quality of the training data is poor or inaccurate, the model will never give accurate results, and its overall performance will not be suitable for real-life use.
4. Supervised or Unsupervised ML
Besides the algorithms discussed above, the performance of an AI-based model is also affected by the machine learning method or process used: supervised, unsupervised or reinforcement learning. In supervised learning, a target/outcome variable (dependent variable) is predicted from a given set of predictors (independent variables).

5. Model Validation and Testing
Building a machine learning model is not enough to get the right predictions; you have to check its accuracy and validate it to ensure you get precise results. Validating the model improves the performance of the ML model. A cross-validation sketch is given below.
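One common way to validate a model is k-fold cross-validation. A minimal sketch follows; the breast-cancer dataset, the logistic-regression model and the choice of 5 folds are assumptions made only for this example.

```python
# k-fold cross-validation sketch: the score averaged over folds gives a more
# reliable performance estimate than a single train/test split.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```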

What are the basic features of Feature Engineering?


Feature engineering is a vital step in the machine learning (ML) pipeline. It
involves transforming raw data into a format that is more suitable for model
building. This process typically includes feature transformation and feature
subset selection. These techniques can improve the predictive performance of
models, reduce computational costs, and provide insights into the underlying
data. Let’s break down these concepts.
Feature Transformation: Feature transformation involves converting raw features into a new format or scale that improves the learning algorithm's ability to model the data. It includes scaling, encoding, normalization, and generating new features based on domain knowledge or mathematical transformations. Common methods are:
1. Scaling and Normalization
 Scaling changes the range of feature values. This is crucial when features have different units or magnitudes. For example:
o Standardization: converts data to have a mean of 0 and a standard deviation of 1 using z = (x − μ) / σ.
o Min-Max Scaling: rescales data to a fixed range, typically [0, 1], using x′ = (x − x_min) / (x_max − x_min).
 Normalization adjusts the feature values to a specific scale, such as scaling rows to unit norm (useful for distance-based algorithms like k-Nearest Neighbors). A sketch of scaling and encoding is given after this list.
2. Encoding Categorical Variables
Machine learning models often require numeric input.
 One-Hot Encoding: Creates binary columns for each category, assigning a
1 or 0.
 Label Encoding: Assigns a unique integer to each category (less suited for
non-ordinal data as it implies an order).
3. Binning
Binning groups continuous values into discrete intervals. For example, age can
be grouped into bins like "child," "adult," and "senior." It can reduce noise and
simplify models but may lose information.
4. Polynomial Features
Creating polynomial combinations of features (e.g., x², xy) can capture nonlinear relationships in the data.
5. Logarithmic and Exponential Transformations
 Logarithmic transformations are used to compress skewed data.
 Exponential transformations can expand compressed data.
6. Principal Component Analysis (PCA)
PCA reduces the dimensionality of the data while preserving as much variance
as possible. It transforms the data into a set of linearly uncorrelated
components.
7. Feature Interactions
Feature interaction involves creating new features by combining existing ones. For instance, the product of two features x₁ and x₂ can uncover complex relationships.
8. Handling Missing Data
 Mean, median, or mode substitution for numeric or categorical data.
 Advanced techniques like K-Nearest Neighbors Imputation or
Multivariate Imputation.
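A minimal sketch of two of the transformations above, scaling and one-hot encoding, using scikit-learn and pandas; the toy numeric matrix and the colour column are made up purely for illustration.

```python
# Feature transformation sketch: standardization, min-max scaling and one-hot encoding.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # numeric features on different scales
print(StandardScaler().fit_transform(X))   # each column -> mean 0, std 1 (z = (x - mu) / sigma)
print(MinMaxScaler().fit_transform(X))     # each column rescaled to [0, 1]

colours = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})
print(pd.get_dummies(colours, columns=["colour"]))   # one binary column per category
```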
Feature Subset Selection
Feature subset selection involves choosing a subset of the most relevant features for training the model. Its benefits include:
 Reducing overfitting by eliminating redundant or irrelevant features.
 Improving computational efficiency.
 Enhancing model interpretability.
There are three main approaches to feature subset selection:
1. Filter Methods
Filter methods use statistical techniques to rank features based on their
relevance to the target variable. They are independent of the machine learning
model and are computationally efficient.
Examples: Correlation: Measures the relationship between features and the
target (e.g., Pearson correlation for linear relationships).
Mutual Information: Captures the dependency between features and the
target.
Chi-Square Test: Evaluates categorical features by assessing the independence
of feature-target pairs.
Advantages:
 Simple and fast.
 Useful for preprocessing large datasets.
Disadvantages:
 Ignores interactions between features.
2. Wrapper Methods
Wrapper methods evaluate subsets of features by training a model and
assessing its performance. They are computationally intensive but consider
feature interactions.
Techniques:
 Forward Selection: Starts with no features and adds them iteratively
based on model performance.
 Backward Elimination: Starts with all features and removes the least
significant one iteratively.
 Recursive Feature Elimination (RFE): Repeatedly trains the model and
removes the least important features.
Advantages:
 More accurate than filter methods as they consider feature interactions.
Disadvantages:
 Computationally expensive.
 Prone to overfitting with small datasets.
3. Embedded Methods
Embedded methods perform feature selection during the model training process. These are model-based methods that balance the efficiency of filter methods with the accuracy of wrapper methods. A Lasso-based sketch is given after this list.
Examples:
 Lasso Regression (L1 Regularization): Penalizes the absolute magnitude
of feature coefficients, driving some to zero.
 Ridge Regression (L2 Regularization): Penalizes the squared magnitude
of coefficients but retains all features.
 Tree-Based Methods: Feature importance is derived from tree-based
models like Random Forest or Gradient Boosting.
Advantages:
 Integrated with model training, making them efficient.
 Capture feature interactions to some extent.
Disadvantages:
 Dependent on the chosen model.
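A sketch of embedded selection using L1 regularization (Lasso) with scikit-learn's SelectFromModel; the diabetes dataset and the alpha value are assumptions chosen only to make the example runnable.

```python
# Embedded feature selection: Lasso drives some coefficients to exactly zero,
# and SelectFromModel keeps only the features with non-zero coefficients.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

X, y = load_diabetes(return_X_y=True)

selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print("Selected feature mask:", selector.get_support())
print("Reduced matrix shape :", selector.transform(X).shape)
```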
Best Practices for Feature Engineering
 Understand the Data: analyze the data distribution, relationships, and missing values before applying transformations.
 Experiment with Techniques: try multiple feature transformation and selection methods to find what works best for your problem.
 Avoid Overfitting: use techniques like cross-validation to ensure feature engineering choices generalize to unseen data.
 Automate with Libraries: tools like Scikit-learn, Featuretools, and PyCaret can help streamline the process.
 Domain Knowledge: incorporate domain expertise to create meaningful features.

UNIT – 3
Explain Bayes theorem.
In ML, Bayes' theorem enhances classification and decision-making by
providing accurate predictions based on learned data. It helps ML systems
establish relationships between data and output, enabling revised predictions
that result in more accurate decisions and actions, even with uncertain or
incomplete data. Bayes' theorem can be derived using product rule
and conditional probability of event X with known event Y:

o According to the product rule, the joint probability of events X and Y can be expressed as:
P(X ∩ Y) = P(X | Y) P(Y)   {equation 1}
o Similarly, the joint probability can also be written using the probability of Y with known event X:
P(X ∩ Y) = P(Y | X) P(X)   {equation 2}
Mathematically, Bayes' theorem is obtained by equating the right-hand sides of the two equations:
P(X | Y) = P(Y | X) P(X) / P(Y)

In ML, Bayes' theorem underpins algorithms that help models form relationships between input data and predictive output. This leads to more accurate models that can better adapt to new and changing data.
The Bayesian approach in ML assigns a probability distribution to all elements,
including model parameters and variables. It is often used in probabilistic
models and provides a foundation for multiple ML algorithms and techniques,
including the following:
Naïve Bayes classifier. This common ML algorithm is used for classification
tasks. It relies on Bayes' theorem to make classifications based on given
information and assumes that different features are conditionally independent
given the class.
Bayes optimal classifier. This is a type of theoretical model that finds the most
optimal, or probable, prediction by averaging over all possible models weighted
by their posterior probabilities based on training data.
Bayesian optimization. This sequential design strategy searches for optimal
outcomes based on prior knowledge. It is particularly useful for objective
functions that are complex or noisy.
Bayesian networks. Sometimes referred to as Bayesian belief networks,
Bayesian networks are probabilistic graphical models that depict relationships
among variables via conditional dependencies.
Bayesian linear regression. This conditional modeling technique finds posterior
probability through a linear regression model, where the mean of one variable
is described by the linear combination of other variables.
Bayesian neural networks. An extension of traditional neural networks, these
models help control overfitting by incorporating uncertainty in weights through
posterior distributions, informing a model's output with predictions based on
historical data.
Bayesian model averaging. This approach averages predictions from different
models to make predictions about new observations, with each considered
model weighted by its model probability.
Bayesian ML's ability to improve prediction accuracy using data makes it useful
for many ML tasks, such as fraud detection, spam filtering, medical diagnostics,
weather predictions, forensic analysis, robot decision-making and more.
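As a concrete example of the Naive Bayes classifier mentioned above, here is a minimal sketch with scikit-learn; the Iris dataset is only an illustrative choice.

```python
# Gaussian Naive Bayes: applies Bayes' theorem under the assumption that
# features are conditionally independent given the class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
print("Class posteriors for one sample:", nb.predict_proba(X_test[:1]))
```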
Advantages of Bayes' theorem include the following:
 Combines information in an accessible and interpretable way.
 Improves the accuracy of predictions and hypotheses.
 Accounts for unknowns and uncertainties in the data.
 Produces more realistic and reliable predictions.
 Allows for data adjustments, increasing flexibility.
Disadvantages or challenges associated with Bayes' theorem include the
following:
 Requires a prior probability, which can sometimes be subjective or
difficult to determine.
 Focuses narrowly on finding posterior probability given prior probability.
 Is computationally complex, which can result in high compute costs,
especially in ML use cases with large volumes of data and numerous
parameters.
SUPERVISED LEARNING : CLASSIFICATION, CLASSIFICATION MODEL,
CLASSIFICATION LEARNING STEPS
In supervised learning, the labelled training data provides the basis for learning. According to the definition of machine learning, this labelled training data is the experience or prior knowledge.
Some more examples of supervised learning are as follows:
Prediction of results of a game based on the past analysis of results
Predicting whether a tumor is malignant or benign on the basis of the analysis
of data
Price prediction in domains such as real estate, stocks, etc.

 When we are trying to predict a categorical or nominal variable, the problem is known as a classification problem. A classification problem is one where the output variable is a category such as ‘red’ or ‘blue’ or ‘malignant tumor’ or ‘benign tumor’, etc.
 Whereas when we are trying to predict a numerical variable such as ‘price’, ‘weight’, etc., the problem falls under the category of regression.
Supervised machine learning is only as good as the data used to train it. If the training data is poor in quality, the prediction will also be far from precise. The figure depicts the typical process of classification, where a classification model is obtained from the labelled training data by a classifier algorithm. On the basis of the model, a class label is assigned to the test data.

 Classification is a type of supervised learning where a target feature, which is of categorical type, is predicted for test data on the basis of the information imparted by the training data. The target categorical feature is known as the class.
Some typical classification problems include the following:
 Image classification
 Disease prediction
 Win–loss prediction of games
 Prediction of natural calamities such as earthquakes, floods, etc.
 Handwriting recognition

Problem Identification: The problem needs to be a well-formed problem, i.e. a problem with well-defined goals and benefits, which has a long-term impact.
Identification of Required Data: The data set that represents the problem needs to be identified/evaluated. For example, if the problem is to predict whether a tumor is malignant or benign, then the corresponding patient data sets related to tumors are to be identified.
Data Pre-processing: The gathered data is in raw format and is not ready for immediate analysis. All unnecessary/irrelevant data elements are removed. This ensures the data is ready to be fed into the ML algorithm.
Definition of Training Data Set: The user should decide what kind of data set is to be used as the training set. A set of data inputs (X) and corresponding outputs (Y) is gathered either from human experts or from experiments. For example, in signature analysis the training data set might be a single handwritten alphabet, a handwritten word or an entire line.
Algorithm Selection: This involves determining the structure of the learning function and the corresponding learning algorithm. On the basis of various parameters, the best algorithm for the given problem is chosen.
Training: The learning algorithm identified in the previous step is run on the training set for further fine-tuning.
Evaluation with the Test Data Set: The trained model is run on the test data, and its performance is measured here. If a suitable result is not obtained, further tuning or training of parameters may be required.

K Nearest Neighbor
The unknown and unlabelled data which comes for a prediction problem is
judged on the basis of the training data set elements which are similar to the
unknown element. So, the class label of the unknown element is assigned on
the basis of the class labels of the similar training data set elements ( can be
considered as neighbours of the unknown element).
Input: training data set, test data set (or data points), value of ‘k’ (i.e. the number of nearest neighbours to be considered).
Steps:
Do for all test data points:
 Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
 Find the closest ‘k’ training data points, i.e. the training data points whose distances from the test data point are least.
 If k = 1: assign the class label of that training data point to the test data point.
 Else: assign the class label that is predominantly present among the ‘k’ training data points to the test data point.
In the kNN algorithm, the class label of a test data element is decided by the class labels of the neighbouring training data elements. The most common approach adopted by kNN to measure similarity between two data elements is Euclidean distance.

In the k-NN algorithm, the value of ‘k’ indicates the number of neighbours that need to be considered. For example, if the value of k is 3, only the three nearest neighbours, i.e. the three training data elements closest to the test data element, are considered. Out of the three, the class which gets the majority vote is assigned as the class label of the test data. In case the value of k is 1, only the closest training data element is considered, and its class label is directly assigned to the test data element. A short k-NN sketch follows.
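A minimal k-NN sketch with scikit-learn using k = 3; the Iris dataset is an illustrative choice.

```python
# k-Nearest Neighbours: a test point is labelled by majority vote among the
# k training points closest to it (Euclidean distance by default).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3
knn.fit(X_train, y_train)                   # "training" just stores the data
print("Test accuracy:", knn.score(X_test, y_test))
```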
Strengths of the k-NN algorithm
 Extremely simple algorithm and easy to understand.
 Very effective in certain situations.
 Very fast, with almost no time required for the training phase.
Weaknesses of the k-NN algorithm
 Does not learn anything in the real sense; classification is done completely on the basis of the training data, so it has a heavy reliance on the training data.
 Because there is no model trained in the real sense, the classification process is very slow.
 A large amount of computational space is required to load the training data for classification.
Decision Tree
Decision tree learning is one of the most widely adopted algorithms for classification. As the name indicates, it builds a model in the form of a tree structure. A decision tree is used for multi-dimensional analysis with multiple classes. It is characterized by fast execution time and ease in the interpretation of the rules. The goal of decision tree learning is to create a model (based on past data, called the past vector) that predicts the value of the output variable based on the input variables in the feature vector. Each node (or decision node) of a decision tree corresponds to one of the features in the feature vector. From every node, there are edges to children, with one edge for each of the possible values (or range of values) of the feature associated with the node. The tree terminates at leaf nodes (or terminal nodes), where each leaf node represents a possible value of the output variable. The output variable is determined by following a path that starts at the root and is guided by the values of the input variables. A decision tree is usually represented in the format depicted in the figure: each internal node (represented by a box) tests an attribute (represented as ‘A’/‘B’ within the boxes), each branch corresponds to an attribute value (T/F in this case), and each leaf node assigns a classification. The first node is called the ‘Root’ node, and branches from the root node are called ‘Branch’ nodes. Here ‘A’ is the root node, ‘B’ is a branch node, and ‘T’ and ‘F’ are leaf nodes. Thus, a decision tree consists of three types of nodes: root node, branch node and leaf node. A short decision-tree sketch follows.
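A short decision-tree sketch; export_text prints the learned rules, which illustrates the ease of interpretation mentioned above. The dataset and the maximum depth are assumptions made only for the example.

```python
# Decision tree: internal nodes test a feature, branches correspond to feature
# values, and leaves assign a class label.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(data.data, data.target)

# Print the tree as human-readable if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))
```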

Random Forest Algorithm

Random forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. As the name suggests, "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, it predicts the final output. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
The following steps explain the working of the Random Forest algorithm:
Step 1: Select random samples from the given data or training set.
Step 2: Construct a decision tree for each sample.
Step 3: Each decision tree produces a prediction; the results are combined by voting (for classification) or averaging (for regression).
Step 4: Select the most voted prediction result as the final prediction.
This combination of multiple models is called an ensemble.
An ensemble uses two methods: bagging and boosting.
Bagging, also known as bootstrap aggregation, is the ensemble technique used by random forest. Bagging chooses a random sample/random subset from the entire data set; this random sample is called a bootstrap sample.
Boosting is an ensemble technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model using weak models in series.
Applications of Random Forest: Banking, Medicine, Land Use, Marketing.
Advantages of Random Forest: It is capable of performing both classification and regression tasks. It is capable of handling large datasets with high dimensionality. It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest: Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks. A short random-forest sketch follows.
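A minimal random-forest sketch; n_estimators controls the number of bagged trees, and feature_importances_ shows which features the ensemble relies on. The dataset and parameter values are chosen only for illustration.

```python
# Random forest: an ensemble of decision trees trained on bootstrap samples,
# whose votes are combined to produce the final prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Test accuracy       :", rf.score(X_test, y_test))
print("Largest feature weight:", rf.feature_importances_.max())
```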

Support Vector Machine (SVM)
SVM is a model which can do linear classification as well as regression. Primarily, it is used for classification problems in machine learning. In SVM, a model is built to discriminate the data instances belonging to different classes. Let us assume, for the sake of simplicity, that the data instances are linearly separable. In this case, when mapped in a two-dimensional space, the data instances belonging to different classes fall on different sides of a straight line drawn in the two-dimensional space. If the same concept is extended to a multi-dimensional feature space, the straight line dividing data instances belonging to different classes becomes a hyperplane, as depicted in the figure. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate the n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM. The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane. We always create the hyperplane that has the maximum margin, which means the maximum distance between the data points.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood using an example. Suppose we have a dataset with two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Since it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both the classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane. A short SVM sketch follows.
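A minimal linear SVM sketch; the n_support_ and coef_ attributes expose the support vectors and the fitted hyperplane. The synthetic blob data and the value of C are assumptions made only for the example.

```python
# Linear SVM: fits the maximum-margin hyperplane; the points lying on the
# margin are the support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs of points (binary, roughly linearly separable)
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("Number of support vectors per class:", svm.n_support_)
print("Hyperplane coefficients:", svm.coef_, "intercept:", svm.intercept_)
```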
UNIT-4
Simple Linear Regression
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, i.e. a sloped straight line, hence it is called Simple Linear Regression.
y = a0 + a1x + ε
a0 = the intercept of the regression line.
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = the error term.

The Simple Linear Regression algorithm has mainly two objectives:
Model the relationship between the two variables, such as the relationship between income and expenditure, experience and salary, etc.
Forecast new observations, such as weather forecasting according to temperature, revenue of a company according to the investments in a year, etc.
Slope of the simple linear regression model
The slope of a straight line represents how much the line changes in the vertical direction (Y-axis) over a change in the horizontal direction (X-axis), as shown in the figure.
Slope = Change in Y / Change in X
Rise is the change in the Y-axis (Y2 − Y1) and Run is the change in the X-axis (X2 − X1).

There can be two types of slope in a linear regression model: positive slope and negative slope.
Different types of regression lines based on the type of slope include:
 Linear positive slope
 Curve with linear positive slope
 Curve with linear negative slope
 Linear negative slope
Advantages of Simple Linear Regression
Simplicity, Interpretability, Computationally Efficient
Disadvantages of Simple Linear Regression
Assumes Linearity, Sensitive to Outliers, Overfitting and Underfitting
Applications of Simple Linear Regression
 Predicting Sales
 Housing Prices
 Salary Prediction
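A small sketch estimating a0 and a1 with the least-squares formulas in NumPy; the sample data points (a hypothetical experience-vs-salary example) are made up for illustration.

```python
# Simple linear regression via least squares: slope a1 = cov(x, y) / var(x),
# intercept a0 = mean(y) - a1 * mean(x).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # e.g. years of experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])  # e.g. salary in thousands

a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(f"Fitted line: y = {a0:.2f} + {a1:.2f} x")
print("Prediction for x = 6:", a0 + a1 * 6)
```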

Multi Linear Regression


Multiple Linear Regression is a statistical method that models the relationship
between two or more features (independent variables) and a continuous target
variable (dependent variable). It is an extension of simple linear regression,
where the goal is to predict the target variable based on multiple predictors.
In a multiple regression model, two or more independent variables, i.e.
predictors are involved in the model. It is an extension of linear regression
The general equation for multiple linear regression is:
Y=β0+β1X1+β2X2+⋯+βnXn+ϵ
Where:
 Y is the dependent variable (target).
 X1,X2,…,Xn are the independent variables (features).
 β0 is the intercept term.
 β1,β2,…,βn are the coefficients (weights) of the independent variables.
 ϵ is the error term.
Steps to Perform Multiple Linear Regression
Data Collection
Data Preprocessing
Train the Model
Model Evaluation
Prediction
Assumptions in Regression Analysis
1. The dependent variable (Y) can be calculated / predicated based on
independent variables (X’s) plus an error term (ε).
2. The number of observations (n) is greater than the number of
parameters (k) to be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of
association based on the data set.
4. The regression line is valid only over a limited range of data. If the line is extended (i.e. extrapolated outside that range), it may lead to wrong predictions.
5. The error term (ε) is normally distributed. This also means that the mean
of the error (ε) has an expected value of 0.
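A minimal multiple-regression sketch with scikit-learn using two illustrative predictors; the tiny dataset is made up only to show how the intercept and coefficients are obtained.

```python
# Multiple linear regression: Y = b0 + b1*X1 + b2*X2 + error, fitted by
# ordinary least squares on a tiny made-up dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 10], [2, 12], [3, 15], [4, 18], [5, 21]])  # two predictors X1, X2
y = np.array([20.0, 25.0, 31.0, 37.0, 43.0])                 # target variable Y

model = LinearRegression().fit(X, y)
print("Intercept b0        :", model.intercept_)
print("Coefficients b1, b2 :", model.coef_)
print("Prediction for X1=6, X2=24:", model.predict([[6, 24]]))
```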

Polynomial Regression Model

Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x + b2x² + b3x³ + … + bnxⁿ
It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression. It is a linear model with some modification in order to increase the accuracy. The dataset used for training in Polynomial Regression is of a non-linear nature. It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
Steps for Polynomial Regression:
The main steps involved in Polynomial Regression are given below (a short sketch follows the list):
 Data pre-processing
 Build a Linear Regression model and fit it to the dataset
 Build a Polynomial Regression model and fit it to the dataset
 Visualize the results of the Linear Regression and Polynomial Regression models
 Predict the output
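A sketch of these steps using PolynomialFeatures together with LinearRegression; the quadratic toy data is made up for illustration.

```python
# Polynomial regression: expand x into [1, x, x^2] and then fit an ordinary
# linear regression on the expanded features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.arange(1, 8).reshape(-1, 1)            # single input feature
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2  # non-linear (quadratic) target

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)                # columns: 1, x, x^2
model = LinearRegression().fit(X_poly, y)

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("Prediction at x = 10:", model.predict(poly.transform([[10]])))
```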
Advantages of Polynomial Regression
1. Flexibility for Non-Linear Datasets
2. Applicability Across Various Fields
3. Better Performance in Specific Scenarios
Disadvantages of Polynomial Regression
1. Overfitting Risk with High-Degree Polynomials
2. Computational Complexity
3. Selecting the Optimal Polynomial Degree
Applications of Polynomial Regression in Machine Learning
Predicting Tissue Growth Rates
Estimating Mortality Rates
Speed Control in Automated Systems

Logistic Regression
Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, we have a dependent variable in a binary or discrete format, such as 0 or 1. The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc. It is a predictive analysis algorithm which works on the concept of probability. Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is used. Logistic regression uses the sigmoid or logistic function, which gives rise to a more complex cost function than linear regression. This sigmoid function is used to model the data in logistic regression and can be represented as:
f(x) = 1 / (1 + e^(−x))
o f(x) = output between the 0 and 1 value.
o x = input to the function.
o e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives the S-curve as
follows:
It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
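A small sketch of the sigmoid function and the threshold rule; the 0.5 threshold is the usual default and is used here as an assumption.

```python
# Sigmoid (logistic) function: squashes any real number into (0, 1); outputs
# above the threshold are classified as 1, otherwise as 0.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5

print("Probabilities:", probs)
print("Class labels :", labels)
```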
There are three types of logistic
regression:
o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
Assumptions in logistic regression
The following assumptions must hold when building a logistic regression model:
 There exists a linear relationship between the logit function and the independent variables.
 The dependent variable Y must be categorical (1/0) and take a binary value, e.g. if pass then Y = 1, else Y = 0.
 The data meets the ‘iid’ criterion, i.e. the error terms ε are independent of one another and identically distributed.
 The error term follows a binomial distribution [n, p], where
n = number of records in the data
p = probability of success (pass, responder)
Maximum Likelihood
Maximum Likelihood Estimation is a method of determining the parameters (mean, standard deviation, etc.) of normally distributed random sample data, or a method of finding the best-fitting probability density function over the random sample data. The term maximum likelihood means that we are maximizing the likelihood function, called the maximization of the likelihood function. Maximum likelihood estimation is the basis of some machine learning and deep learning approaches used for classification problems.
 The likelihood is a function that describes how well the data points fit the model.
 Maximum likelihood differs from probabilistic methods: probabilistic methods work on the principle of calculating probabilities, whereas the likelihood method tries to maximize the likelihood of the data observations according to the data distribution.
 Maximum likelihood is an approach used for solving problems like density estimation and is the basis for some algorithms like logistic regression.
 The approach is very similar to, and in deep learning methods is predominantly known as, the perceptron trick.
 We calculate the likelihood based on conditional probabilities.

See the equation given below:

L = F(X1 = x1, X2 = x2, …, Xn = xn | P) = Π_{i=1}^{n} P^{xi} (1 − P)^{1−xi}

where,
L -> likelihood value
F -> probability distribution function
P -> probability
X1, X2, …, Xn -> random sample of size n taken from the whole population
x1, x2, …, xn -> values that these random samples (Xi) take when determining the PDF
Π -> product from i = 1 to n
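For the Bernoulli likelihood above, the value of P that maximizes L can be found numerically; a small sketch follows, with the sample of 0/1 observations made up for illustration.

```python
# Maximum likelihood for a Bernoulli parameter P: evaluate the likelihood
# L(P) = prod P^xi * (1 - P)^(1 - xi) over a grid and pick the P that maximizes it.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # random sample of 0/1 outcomes

p_grid = np.linspace(0.01, 0.99, 99)
likelihood = [np.prod(p ** x * (1 - p) ** (1 - x)) for p in p_grid]

p_hat = p_grid[int(np.argmax(likelihood))]
print("MLE of P   :", p_hat)        # coincides with the sample mean
print("Sample mean:", x.mean())
```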
Unit – 5
What is Clustering and different types of clustering?
The task of grouping data points based on their similarity with each other is called clustering or cluster analysis. Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity using a metric like Euclidean distance, cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.

Partitioning Clustering: It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm. In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centre is created in such a way that the distance between the data points of one cluster is minimum as compared to another cluster centroid.

Density-Based Clustering: The density-based clustering method connects the highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. This algorithm identifies different clusters in the dataset by connecting the areas of high density into clusters. The dense areas in data space are separated from each other by sparser areas.

Hierarchical Clustering: Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. The observations or any number of clusters can be selected by cutting the tree at the correct level. The most common example of this method is the agglomerative hierarchical algorithm.

1. K-Means Clustering
K-Means is one of the most popular clustering algorithms due to its simplicity
and efficiency. It works by partitioning data points into a predefined number of
clusters (denoted by K). The algorithm proceeds as follows:
 Choose K initial centroids (either randomly or by other methods like K-
Means++).
 Assign each data point to the nearest centroid.
 Recalculate the centroids based on the assigned data points.
 Repeat the process until convergence (i.e., the centroids no longer change significantly).
K-Means is fast and works well when the clusters are spherical, but it may struggle with clusters of different shapes or sizes. A minimal K-Means sketch follows.
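A minimal K-Means sketch with scikit-learn, using K = 3 on synthetic blob data; both choices are illustrative assumptions.

```python
# K-Means: partitions the points into K clusters by alternating between
# assigning points to the nearest centroid and recomputing the centroids.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster centres:\n", kmeans.cluster_centers_)
print("Labels of first 10 points:", kmeans.labels_[:10])
```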
2. Hierarchical Clustering
Hierarchical clustering builds a tree-like structure called a dendrogram, which shows how clusters are nested within each other. There are two main types:
 Agglomerative (bottom-up): each data point starts as its own cluster, and the closest clusters are merged iteratively until all points belong to a single cluster.
 Divisive (top-down): all points are initially in one cluster, which is then recursively split into smaller clusters.
Agglomerative hierarchical clustering is more commonly used, and the method can be visualized by cutting the dendrogram at a certain level to define the desired number of clusters. It does not require the number of clusters to be predefined, but it can be computationally expensive for large datasets.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups points based on
their density in the feature space. It works by identifying regions of high point
density, which are considered as clusters. Points in low-density areas are
labeled as noise or outliers. The main parameters in DBSCAN are:
 Epsilon (ε): The maximum distance between two points to be considered
neighbors.
 MinPts: The minimum number of points required to form a dense region (a cluster).
DBSCAN is robust to outliers and can find clusters of arbitrary shape, making it more flexible than K-Means. However, it is sensitive to the choice of parameters. A short DBSCAN sketch follows.
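A short DBSCAN sketch; eps and min_samples correspond to ε and MinPts described above, and points labelled -1 are treated as noise. The moon-shaped toy data and parameter values are assumptions for the example.

```python
# DBSCAN: clusters are dense regions of points; sparse points get label -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("Cluster labels found:", np.unique(db.labels_))   # -1 (if present) marks noise points
```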
4. Mean Shift Clustering
Mean Shift is a non-parametric clustering algorithm that works by shifting the
centroid of data points iteratively towards regions of higher density. It does not
require the number of clusters to be specified in advance. The algorithm works
by computing the mean of the points within a window (called the kernel), and
the centroid is moved to this mean. This process continues until the centroids
converge. It can detect clusters of arbitrary shapes and is less sensitive to
outliers than K-Means, but it can be computationally expensive.
5. Gaussian Mixture Models (GMM)
Gaussian Mixture Models are a probabilistic clustering technique based on the
assumption that the data is generated from a mixture of several Gaussian
distributions. The model estimates the parameters of the distributions (mean,
covariance, and weight) using the Expectation-Maximization (EM) algorithm.
GMM is more flexible than K-Means because it allows for clusters with different
shapes and densities. It is particularly useful when the data exhibits a
probabilistic distribution.
6. Affinity Propagation
Affinity Propagation is a clustering algorithm that does not require the number
of clusters to be specified. It works by exchanging messages between data
points to find "exemplars" (representative points) that best describe the
clusters. The algorithm uses similarity between points to define clusters, and
each point sends and receives messages iteratively. It tends to be more
computationally expensive than K-Means but can produce better results for
certain types of data.
Applications of Clustering
o In Identification of Cancer Cells
o In Search Engines
o Customer Segmentation
o In Biology
o In Land Use

Applications of Unsupervised Learning


1. Market Segmentation
Unsupervised learning techniques like clustering are widely used in market
segmentation to identify distinct groups of customers based on their
purchasing behavior, demographics, or other characteristics. This information
helps businesses tailor their marketing strategies and offerings to specific
customer segments.
2. Anomaly Detection
Unsupervised learning algorithms are used for anomaly detection in various
domains, including cybersecurity, fraud detection, and equipment
maintenance. These algorithms can identify unusual patterns or outliers in data
that deviate significantly from normal behavior, helping to detect fraudulent
transactions, security breaches, or equipment failures.
3. Recommendation Systems
Unsupervised learning techniques, particularly collaborative filtering and
matrix factorization, are used in recommendation systems to provide
personalized recommendations to users. These systems analyze user behavior
and preferences to identify similar users or items and make recommendations
based on past interactions.
4. Image and Document Clustering
Unsupervised learning algorithms like K-means and hierarchical clustering are
used for image and document clustering. In image clustering, these algorithms
can group similar images based on visual features, enabling tasks like image
organization and search. In document clustering, they can group similar
documents based on their content, facilitating tasks like document
categorization and topic modeling.
5. Genomics and Bioinformatics
Unsupervised learning techniques are widely used in genomics and
bioinformatics for gene expression analysis, protein sequence clustering, and
functional annotation. These techniques help researchers uncover patterns and
relationships in biological data, leading to insights into disease mechanisms,
drug discovery, and personalized medicine.
6. Neuroscience
Unsupervised learning algorithms are used in neuroscience to analyze neural activity data such as fMRI scans and EEG recordings. These algorithms can identify patterns and structures in brain activity, helping researchers understand brain function, map neural circuits, and diagnose neurological disorders.
7. Natural Language Processing (NLP)
Unsupervised learning techniques like word embeddings and topic modeling
are used in NLP for tasks such as document clustering, word similarity analysis,
and semantic understanding. These techniques help extract meaningful
representations from text data and uncover latent structures in language.

Supervised Learning vs Unsupervised Learning

 Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
 A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
 A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
 In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
 The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in the unknown dataset.
 Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
 Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
 Supervised learning is used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning is used for cases where we have only input data and no corresponding output data.
 A supervised learning model produces an accurate result; an unsupervised learning model may give less accurate results in comparison.
 Supervised learning is not close to true Artificial Intelligence, as we first train the model on each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns in a way similar to how a child learns daily routine things from experience.
 Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.; unsupervised learning includes algorithms such as Clustering and the Apriori algorithm.

Apriori Algorithm
To improve the efficiency of level-wise generation of frequent itemsets, an
important property is used called Apriori property which helps by reducing the
search space.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
 All subsets of a frequent itemset must be frequent (Apriori property).
 If an itemset is infrequent, all its supersets will be infrequent.
Steps for the Apriori Algorithm
Below are the steps for the Apriori algorithm (a small frequent-itemset sketch follows):

Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.

Step-2: Take all itemsets in the transactions with a support value higher than the minimum (selected) support value.

Step-3: Find all the rules of these subsets that have a confidence value higher than the threshold (minimum) confidence.

Step-4: Sort the rules in decreasing order of lift.
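A tiny pure-Python sketch of the frequent-itemset part of Apriori; the transactions and the minimum support of 0.5 are made up for illustration (rule generation and lift are omitted).

```python
# Level-wise frequent itemset generation using the Apriori property: a size-k
# candidate is kept only if all of its (k-1)-subsets were already frequent.
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"milk", "bread", "diaper"}, {"bread", "diaper"}]
min_support = 0.5                      # itemset must occur in at least 50% of transactions

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent individual items
prev_level = {frozenset([item]) for item in set().union(*transactions)
              if support(frozenset([item])) >= min_support}
all_frequent = set(prev_level)
k = 2

while prev_level:
    # Join: combine frequent (k-1)-itemsets into candidate k-itemsets
    candidates = {a | b for a in prev_level for b in prev_level if len(a | b) == k}
    # Prune (Apriori property): every (k-1)-subset must already be frequent,
    # then keep only candidates that meet the minimum support.
    prev_level = {c for c in candidates
                  if all(frozenset(sub) in all_frequent for sub in combinations(c, k - 1))
                  and support(c) >= min_support}
    all_frequent |= prev_level
    k += 1

for itemset in sorted(all_frequent, key=len):
    print(set(itemset), support(itemset))
```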

Advantages of the Apriori Algorithm are as follows:
1. Simplicity and ease of implementation
2. The rules are human-readable and interpretable
3. Works well on unlabelled data
4. Flexibility
5. Extensions for multiple use cases can be created easily
6. The algorithm is widely used and studied
Disadvantages of Apriori Algorithm are as follows:
1. Time & space overhead
2. Higher memory usage
3. Bias of minimum support threshold
4. Inability to handle numeric data
