Regression

1. Regression Analysis

• In the discipline of ML, regression analysis (RA) is a key concept. It is classified as supervised learning because the algorithm is trained on both input features and output labels. By estimating how one variable influences another, it helps establish the relationship between the variables.
• In ML, regression refers to mathematical techniques that allow data scientists to forecast a continuous outcome (y) based on the values of one or more predictor variables (x). Because of its ease of application in prediction and forecasting, linear regression is perhaps the most popular type of RA.

RA is the primary technique for solving regression problems in ML using data modelling. It involves determining the best-fit line: a line drawn through the data points such that the overall distance between the line and the data points is minimized (the line need not pass through every point).
Terminologies related to RA:

How to obtain best fit line:


Types of Regression Analysis Techniques
Types of linear regression: simple linear regression (SLR) and multiple linear regression (MLR)
Linear Regression Model
Using simple linear regression, predict the incentive received on the basis of sales done by a salesman. Calculate the regression coefficients, write the regression equation, and predict the incentive received by an employee whose sales are 20,000 rupees.

In simple linear regression, we model the relationship between the


independent variable (sales) and the dependent variable (incentive) as a
straight line. The regression equation for simple linear regression is:
Y = a + bX
Where:

- Y is the dependent variable (incentive).


- X is the independent variable (sales).
- a is the intercept (the point where the line intersects the Y-axis).
- b is the slope (the change in Y for a unit change in X).
To calculate the regression coefficients a and b, we need a dataset of sales and corresponding incentive values. Let's assume we have the following dataset:
Sales (X) (in rupees): [15000, 18000, 21000, 24000, 27000, 30000]

Incentive (Y) (in rupees): [500, 600, 700, 800, 900, 1000]

We can calculate b and a using the least-squares formulas:

b = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²)
a = Ȳ − b·X̄

Here n = 6, ΣX = 135,000, ΣY = 4,500, ΣXY = 106,500,000, and ΣX² = 3,195,000,000, which gives b = 1/30 ≈ 0.0333 and a = 0. The regression equation is therefore Y = 0.0333X, and for sales of X = 20,000 rupees the predicted incentive is Y ≈ 666.67 rupees.
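A short Python sketch, using the dataset above, computes the least-squares coefficients and the prediction at X = 20,000:

```python
# Compute the least-squares slope b and intercept a for the
# sales/incentive dataset given above, then predict the incentive
# for sales of 20,000 rupees.
sales = [15000, 18000, 21000, 24000, 27000, 30000]   # X
incentive = [500, 600, 700, 800, 900, 1000]          # Y

n = len(sales)
sum_x, sum_y = sum(sales), sum(incentive)
sum_xy = sum(x * y for x, y in zip(sales, incentive))
sum_x2 = sum(x * x for x in sales)

# b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = mean(Y) - b * mean(X)
a = sum_y / n - b * sum_x / n

print(f"b (slope)     = {b:.4f}")              # 0.0333
print(f"a (intercept) = {a:.4f}")              # 0.0000
print(f"Y at X=20000  = {a + b * 20000:.2f}")  # 666.67
```

Because this toy dataset is perfectly linear, the fitted line passes through every point exactly; on real data the residuals would be nonzero.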


Objective of Regression Modeling
The main objective is to analyze and quantify the
relationship between one or more independent
variables (predictors or features) and a dependent
variable (the target or outcome).

RA is a statistical technique used in various fields,


including economics, finance, biology, social sciences,
and ML, to achieve several key objectives:
Objectives of Regression Modelling:
- Prediction: Making predictions about the dependent variable based on independent variables.
- Understanding Relationships: Analyzing the strength, direction, and nature of relationships between variables.
- Variable Selection: Identifying significant independent variables and simplifying models by excluding irrelevant ones.
- Hypothesis Testing: Assessing the significance of relationships and determining if they are statistically meaningful.
- Model Evaluation: Evaluating the model's goodness of fit and predictive accuracy using various metrics.
- Inference: Drawing conclusions about the population based on sample data and testing hypotheses.
- Control and Optimization: Using regression to optimize independent variables for desired outcomes in various applications.
2. Training Data and Testing Data
Training Data:
- Used to teach an ML model to recognize patterns, relationships, and make predictions.
- It's the data the model learns from.

Testing Data:
- Used to evaluate the model's performance and assess how well it generalizes to new, unseen data.
- It's like a final exam for the model.

Comparison of training data and testing data:
Purpose
- Training data: used for teaching the model.
- Testing data: used for evaluating the model.

Composition
- Training data: typically a substantial portion of the dataset, including input features and target values.
- Testing data: a separate portion of the dataset, usually only input features (sometimes includes withheld target values for evaluation).

Usage
- Training data: the model learns by analyzing patterns and minimizing errors between predictions and actual values.
- Testing data: the model's performance is assessed by making predictions on the testing data and comparing them to the actual (target) values.

Example
- Training data: in spam email filtering, thousands of labeled emails are used for the model to learn what spam looks like.
- Testing data: a separate set of emails, never seen by the model during training, is used to test how well the model can identify spam in new emails.
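The split-then-evaluate workflow can be sketched in a few lines of plain Python. The "hours studied → passed" dataset and the threshold "model" below are invented for illustration; real projects typically use a library helper such as scikit-learn's `train_test_split`.

```python
# Sketch: a minimal train/test split and held-out evaluation on a toy
# "hours studied -> passed" dataset (invented for this illustration).
import random

random.seed(0)

# 20 students: label 1 (pass) if they studied more than 5 hours.
data = [(hours, 1 if hours > 5 else 0) for hours in range(1, 21)]

random.shuffle(data)
split = int(0.8 * len(data))          # 80% train / 20% test
train, test = data[:split], data[split:]

# "Training": learn a decision threshold halfway between the highest
# failing and lowest passing hour counts seen in the training data.
fail_hours = [h for h, y in train if y == 0]
pass_hours = [h for h, y in train if y == 1]
threshold = (max(fail_hours) + min(pass_hours)) / 2

def predict(hours):
    return 1 if hours > threshold else 0

# Evaluation happens only on the held-out test set the model never saw.
accuracy = sum(predict(h) == y for h, y in test) / len(test)
print(f"train size: {len(train)}, test size: {len(test)}, "
      f"test accuracy: {accuracy:.2f}")
```

The key point is in the last two lines: accuracy is measured only on examples the model never touched during training, which is what makes it an estimate of generalization.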
In summary, training data is used to teach the model, while testing data is used to assess how well the model has learned and can generalize to new, unseen examples.

The clear separation of training and testing data helps ensure that ML models are not only accurate on the data they were trained on but also perform well in real-world scenarios.
3. Overfitting and Underfitting
Overfitting:

Occurs when a statistical model, such as an ML algorithm, is too complex relative to the data it's trying to fit.
In other words, the model is excessively flexible and tries to capture all the noise and random fluctuations in
the training data, rather than just the underlying patterns or relationships.

This can lead to the following characteristics:

High Training Accuracy: An overfit model often performs exceptionally well on the training data because it
has essentially memorized it, including the noise.

Poor Generalization: The problem with overfitting is that it doesn't generalize well to new, unseen data.
When you apply an overfit model to new data, it tends to perform poorly because it's making predictions
based on the noise it learned during training.

Complexity: Overfit models are usually complex, with many parameters or features, and they may exhibit
intricate patterns that don't reflect the true underlying relationships in the data.
Overfitting Example:

Imagine you have a group of students, and you want to predict their test scores based on the
number of hours they spent studying. You collect data from 50 students and build a complex
model that considers not only the number of hours they studied but also factors like the
exact minute they started studying, the type of music they listened to, and even the color of
their pens.

This model fits the training data incredibly well. It predicts each student's test score precisely
based on all these variables. However, when you use this model to predict the scores of new
students who weren't part of your original dataset, it performs poorly. It's almost like your
model has memorized the previous students' scores and study habits but can't generalize to
new students.

In simple terms, overfitting is like having an overly complicated recipe that only works for
specific ingredients and measurements. It doesn't adapt well to new ingredients.
Underfitting:

Occurs when a model is too simplistic to capture the underlying patterns in the
data. Essentially, the model is too rigid and fails to represent the complexities in
the dataset. This leads to the following characteristics:

Low Training and Test Accuracy: An underfit model performs poorly on both the
training data and new data. It doesn't fit the training data well and, as a result,
can't make accurate predictions.

Oversimplified: Underfit models are often overly simplistic, such as using a linear
model to represent a nonlinear relationship in the data.

Bias: Underfit models are biased because they make strong assumptions about
the data that may not hold true.
Underfitting Example:

Now, consider the opposite scenario. You build a very simplistic model that predicts test scores
solely based on the average number of hours all students in your dataset spent studying. This
model doesn't pay attention to individual study habits, time of day, or other factors. It's just a
straight line, representing the average.

This simple model doesn't fit your training data very well. It consistently underestimates or
overestimates the actual test scores. When you use this model to predict scores for the same
students it was trained on, it doesn't do a great job either because it's too basic.

In simple terms, underfitting is like trying to guess the doneness of a steak by only looking at
the color of the outside. You're missing all the important details inside.

The key is to find a balance between these extremes. We want a model that captures the essential patterns in the
data without being too simple or too complex. It's like having a recipe that's just right, adaptable to different
ingredients but still able to make a delicious dish. This balanced model will perform well on both the data it was
trained on and new, unseen data.
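This balance can be illustrated numerically by fitting polynomials of different degrees to synthetic data (the data below is invented for this sketch): a constant model underfits, a straight line matches the truly linear trend, and a high-degree polynomial chases the noise.

```python
# Sketch: under- vs overfitting on synthetic data. The true relation is
# linear; we compare a constant (underfit), a line (balanced), and a
# degree-10 polynomial (overfit) by their train/test mean squared errors.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = 3 * x + 5 + rng.normal(0, 4, size=x.size)   # linear trend + noise

# Alternate points into train and test sets.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def train_test_mse(degree):
    """Fit a polynomial of the given degree on the training points and
    return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (0, 1, 10):
    tr, te = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE {tr:8.2f}   test MSE {te:8.2f}")
```

Typically the degree-0 model has high error everywhere (underfitting), the degree-1 model does well on both sets, and the degree-10 model drives its training error down while its test error stays worse than the straight line's, which is the overfitting signature described above.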
