Lecture 9-10
Regression
Regression is a statistical approach used to analyze the relationship between a dependent variable (target variable) and one or more independent variables (predictor variables). The objective is to determine the most suitable function that characterizes the connection between these variables.

It is a supervised machine learning technique used to predict the value of the dependent variable for new, unseen data. It models the relationship between the input features and the target variable, allowing for the estimation or prediction of numerical values.
Terminologies Related to Regression Analysis

Response Variable: The primary factor to predict or understand in regression, also known as the dependent variable or target variable.

Predictor Variable: Factors influencing the response variable, used to predict its values; also called independent variables.

Outliers: Observations with significantly low or high values compared to the others, which can distort results and are best handled carefully.

Multicollinearity: High correlation among independent variables, which can complicate the ranking of influential variables.

Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on the training data but poorly on the test data, while underfitting indicates poor performance on both datasets.
Types
Depending on the number of input variables, the regression problem is classified into:
1) Simple linear regression
2) Multiple linear regression
Simple Linear Regression
Used to predict a continuous dependent variable based on a single independent variable.
Simple linear regression should be used when there is only a single independent variable.
Multiple Regression
Used to predict a continuous dependent variable based on multiple independent variables.
Multiple linear regression should be used when there are multiple independent variables.
Nonlinear Regression
Relationship between the dependent variable and independent variable(s) follows a nonlinear
pattern.
Provides flexibility in modeling a wide range of functional forms.
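As one illustration of this flexibility, the sketch below fits a degree-2 polynomial using scikit-learn's PolynomialFeatures; the data values are invented purely for illustration and are not from the lecture.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
# Made-up data following a roughly quadratic (nonlinear) trend
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2, 35.9])
# Degree-2 polynomial regression: nonlinear in x, but still linear in the coefficients
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("Prediction for x = 7:", model.predict(np.array([[7]]))[0])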
Linear regression
Linear regression is one of the simplest and most widely used statistical
models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the
dependent variable is proportional to the change in the independent variables.
The equation for simple linear regression is:

    \hat{y} = \beta_0 + \beta_1 X

where:
\hat{y} is the predicted value of the dependent variable
X is the independent variable
\beta_0 is the intercept
\beta_1 is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:
    \hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n

where:
\hat{y} is the predicted value of the dependent variable
X_1, X_2, \ldots, X_n are the independent variables
\beta_0 is the intercept
\beta_1, \beta_2, \ldots, \beta_n are the slopes (coefficients)
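A minimal sketch of fitting a multiple linear regression with scikit-learn is shown below; the two-feature dataset is made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
# Made-up data: two independent variables (columns of X) and one dependent variable y
X = np.array([[1, 4], [2, 3], [3, 5], [4, 7], [5, 6]])
y = np.array([10, 12, 17, 23, 25])
model = LinearRegression()
model.fit(X, y)
print("Intercept (beta_0):", model.intercept_)
print("Slopes (beta_1, beta_2):", model.coef_)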


Best fit line
The best-fit line equation provides a straight line that represents the relationship between the dependent and independent variables. The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable(s).
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name linear regression. In the figure, X (input) is the work experience and Y (output) is the salary of a person, and the regression line is the best-fit line for the model.
Formulas for Simple Regression
The equation of a simple linear regression line is:

    \hat{y} = \beta_0 + \beta_1 X

The slope and intercept are estimated from the data as:

    \beta_1 = \frac{\sum (X_i - \bar{X})(y_i - \bar{y})}{\sum (X_i - \bar{X})^2}

    \beta_0 = \bar{y} - \beta_1 \bar{X}
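As a quick sketch of these formulas in code, the snippet below computes the slope and intercept directly with NumPy, using the same small dataset as the scikit-learn example later in this lecture.

import numpy as np
# Same small dataset used in the scikit-learn example below
x = np.array([1, 2, 4, 3, 5], dtype=float)
y = np.array([1, 3, 3, 2, 5], dtype=float)
x_mean, y_mean = x.mean(), y.mean()
# Slope: sum of cross-deviations divided by sum of squared deviations of x
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the fitted line passes through the point of means
beta_0 = y_mean - beta_1 * x_mean
print("Slope (beta_1):", beta_1)       # 0.8 for this data
print("Intercept (beta_0):", beta_0)   # 0.4 for this data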
RMSE
One way to assess how well a regression model fits a dataset is to
calculate the root mean square error, which is a metric that tells us the
average distance between the predicted values from the model and the
actual values in the dataset.
The lower the RMSE, the better a given model is able to “fit” a dataset.
The formula for the root mean square error, often abbreviated RMSE, is:

    \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}
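A minimal sketch of this formula in NumPy, with made-up actual and predicted values for illustration:

import numpy as np
# Made-up observed values and model predictions, for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.4])
# Square the residuals, average them, then take the square root
rmse = np.sqrt(np.mean((y_actual - y_predicted) ** 2))
print("RMSE:", rmse)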
Why Use RMSE
Measures Model Accuracy:
RMSE tells us how well the regression line fits the data by calculating the
average error.
Smaller RMSE values indicate a better fit.
Sensitive to Large Errors:
Squaring the errors gives more weight to large differences, making RMSE
sensitive to outliers.
Easy Interpretation:
RMSE is in the same units as the dependent variable (e.g., if predicting price in
USD, RMSE will also be in USD).
Applications of Regression

Predicting prices: For example, a regression model could be used to predict the price of a house based on its size, location, and other features.

Forecasting trends: For example, a regression model could be used to forecast the sales of a product based on historical sales data and economic indicators.

Identifying risk factors: For example, a regression model could be used to identify risk factors for heart disease based on patient data.

Making decisions: For example, a regression model could be used to recommend which investment to buy based on market data.
Advantages of Regression

Easy to understand and interpret

Robust to outliers

Can handle both linear and nonlinear relationships.

Disadvantages of Regression

Assumes linearity

Sensitive to multicollinearity

May not be suitable for highly complex relationships


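The following Python example fits a simple linear regression on a small dataset with scikit-learn, reports the slope, intercept, and RMSE, and plots the best-fit line.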
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# Given data
x = np.array([1, 2, 4, 3, 5]).reshape(-1, 1) # Reshape for sklearn
y = np.array([1, 3, 3, 2, 5])
# Create and fit the linear regression model
model = LinearRegression()
model.fit(x, y)
# Predict using the model
y_pred = model.predict(x)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
# Display results
print("Regression Coefficient (Slope):", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predicted Values:", y_pred)
print("RMSE:", rmse)
# Plotting the data
plt.scatter(x, y, color='blue', label='Actual Data') # Original points
plt.plot(x, y_pred, color='red', label='Best Fit Line') # Regression line
plt.title("Linear Regression with Best Fit Line")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.legend()
plt.show()
Example
Consider a company that wants to improve sales of its product. The company spends money on different advertising media such as TV, radio, and newspaper to increase sales. It records the money spent on each advertising medium (in thousands of dollars) and the number of units of product sold (in thousands of units).
We have to help the company find the most effective way to spend money on advertising media to improve sales for the next year with a smaller advertising budget.
TV Advertising ($1000)    Sales (1000 units)
10                        9
20                        18
30                        24
40                        28
50                        35
60                        40
70                        50
80                        55
90                        62
100                       70
The equation of a simple linear regression line is:

    \hat{y} = \beta_0 + \beta_1 X

    \beta_1 = \frac{\sum (X_i - \bar{X})(y_i - \bar{y})}{\sum (X_i - \bar{X})^2}

    \beta_0 = \bar{y} - \beta_1 \bar{X}
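Substituting the data from the table above gives a worked check of the arithmetic (with \bar{X} = 55 and \bar{y} = 39.1):

    \beta_1 = \frac{\sum (X_i - \bar{X})(y_i - \bar{y})}{\sum (X_i - \bar{X})^2} = \frac{5415}{8250} \approx 0.66

    \beta_0 = \bar{y} - \beta_1 \bar{X} = 39.1 - \frac{5415}{8250} \times 55 = 39.1 - 36.1 = 3.0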
Thus, the regression line is:

    \hat{y} = 0.66x + 3

For x = 85:

    \hat{y} = 0.66 \times 85 + 3 = 59.1

So, the predicted sales for an $85,000 TV advertising budget are approximately 59,100 units.
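The same analysis in Python with scikit-learn, including a prediction for the $85,000 budget: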
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Data: Advertising budget (X) and sales (Y)
x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).reshape(-1, 1) # Independent variable
y = np.array([9, 18, 24, 28, 35, 40, 50, 55, 62, 70]) # Dependent variable
# Create and fit the linear regression model
model = LinearRegression()
model.fit(x, y)
# Predict using the model
y_pred = model.predict(x)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
# Display results
print("Regression Coefficient (Slope):", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predicted Values:", y_pred)
print("RMSE:", rmse)
# Plotting the results
plt.scatter(x, y, color="blue", label="Actual Sales Data")
plt.plot(x, y_pred, color="red", label="Best Fit Line")
plt.title("Advertising Budget vs Sales")
plt.xlabel("Advertising Budget (in $1000s)")
plt.ylabel("Sales (in $1000s)")
plt.legend()
plt.show()
# Predict sales for a reduced budget (e.g., $85,000)
reduced_budget = 85 # $85,000
predicted_sales = model.predict(np.array([[reduced_budget]]))
print("Predicted Sales for $85,000 Advertising Budget:",
predicted_sales[0])

Output:
Regression Coefficient (Slope): 0.6563636363636365
Intercept: 2.999999999999993
Predicted Values: [ 9.56363636 16.12727273 22.69090909 29.25454545 35.81818182
 42.38181818 48.94545455 55.50909091 62.07272727 68.63636364]
RMSE: 1.2919330126174908
