lecture 9-10
Regression
Regression is a statistical approach used to analyze the relationship
between a dependent variable (target variable) and one or more
independent variables (predictor variables). The objective is to
determine the most suitable function that characterizes the connection
between these variables.
In simple linear regression the model is $\hat{y} = \beta_0 + \beta_1 X$, where:
$\beta_0$ is the intercept
$\beta_1$ is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:
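$$\hat{y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$
where:
$\hat{y}$ is the predicted value of the dependent variable $y$
$\beta_0$ is the intercept
$\beta_1, \dots, \beta_n$ are the coefficients of the independent variables $X_1, \dots, X_n$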
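As an illustration of the multiple-regression equation, the sketch below estimates the coefficients by ordinary least squares with NumPy. The data are synthetic and noise-free, generated from an assumed relationship y = 1 + 2·X1 + 3·X2 purely for demonstration; the names `X`, `design`, and `coeffs` are illustrative and do not come from the lecture.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration only):
# y = 1 + 2*X1 + 3*X2
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept beta_0.
design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimise ||design @ coeffs - y||^2.
coeffs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
beta_0, beta_1, beta_2 = coeffs

print("Intercept beta_0:", beta_0)
print("Coefficients beta_1, beta_2:", beta_1, beta_2)

# Prediction for a new observation with X1 = 6 and X2 = 5.
y_hat = beta_0 + beta_1 * 6.0 + beta_2 * 5.0
print("Prediction:", y_hat)
```

Because the data contain no noise, the recovered coefficients should match the generating values up to floating-point error; with real data they would only approximate the underlying relationship.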
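For simple linear regression (a single independent variable $X$), the fitted line is:
$$\hat{y} = \beta_0 + \beta_1 X$$
with the least-squares estimates of the slope and intercept:
$$\beta_1 = \frac{\sum_i (X_i - \bar{X})(y_i - \bar{y})}{\sum_i (X_i - \bar{X})^2}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{X}$$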
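A brief sketch of where the least-squares estimates come from, assuming the standard ordinary least squares criterion (the lecture states the formulas without a derivation). Here $\beta_0$ is the intercept, $\beta_1$ is the slope, and $S$ is a symbol introduced for the sum of squared residuals.

```latex
\begin{align*}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 X_i \right)^2 \\[4pt]
\frac{\partial S}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 X_i \right) = 0
  \;\Longrightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{X} \\[4pt]
\frac{\partial S}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i \left( y_i - \beta_0 - \beta_1 X_i \right) = 0 \\[4pt]
\text{substituting } \beta_0:\quad
0 &= \sum_{i=1}^{n} X_i \left( (y_i - \bar{y}) - \beta_1 (X_i - \bar{X}) \right) \\[4pt]
\beta_1 &= \frac{\sum_{i=1}^{n} X_i (y_i - \bar{y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}
         = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(y_i - \bar{y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align*}
```

The last equality holds because $\sum_i \bar{X}(y_i - \bar{y}) = 0$ and $\sum_i \bar{X}(X_i - \bar{X}) = 0$, so subtracting $\bar{X}$ from each $X_i$ changes neither sum.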
RMSE
One way to assess how well a regression model fits a dataset is to
calculate the root mean square error, which is a metric that tells us the
average distance between the predicted values from the model and the
actual values in the dataset.
The lower the RMSE, the better a given model is able to “fit” a dataset.
The formula to find the root mean square error, often
abbreviated RMSE, is as follows:
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$
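The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `rmse` and the three sample values are illustrative assumptions, not data from the lecture.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Illustrative values only: the residuals are 1, 0 and -2.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
example_rmse = rmse(actual, predicted)  # sqrt((1 + 0 + 4) / 3)
print("RMSE:", example_rmse)
```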
Why Use RMSE
Measures Model Accuracy:
RMSE tells us how well the regression line fits the data by calculating the
average error.
Smaller RMSE values indicate a better fit.
Sensitive to Large Errors:
Squaring the errors gives more weight to large differences, making RMSE
sensitive to outliers.
Easy Interpretation:
RMSE is in the same units as the dependent variable (e.g., if predicting price in
USD, RMSE will also be in USD).
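To make the sensitivity to large errors concrete, the sketch below compares two hypothetical sets of residuals that have the same total absolute error. The residual values and the helper name `rmse_from_errors` are assumptions chosen for illustration.

```python
import math


def rmse_from_errors(errors):
    """RMSE computed directly from a list of residuals (actual - predicted)."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


# Both lists have a total absolute error of 4.
evenly_spread = [1.0, 1.0, 1.0, 1.0]    # four small errors
one_large_error = [0.0, 0.0, 0.0, 4.0]  # a single outlier

rmse_even = rmse_from_errors(evenly_spread)
rmse_outlier = rmse_from_errors(one_large_error)

print("Evenly spread errors:", rmse_even)
print("One large error:", rmse_outlier)
```

Although the total absolute error is identical, the RMSE doubles when the error is concentrated in one observation, because squaring weights that single large residual much more heavily.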
Applications of Regression
Predicting prices: For example, a regression model could be used to predict
the price of a house based on its size, location, and other features.
Robust to outliers
Disadvantages of Regression
Assumes linearity
Sensitive to multicollinearity
Advertising budget X (thousands of USD)    Sales Y (thousands)
10                                         9
20                                         18
30                                         24
40                                         28
50                                         35
60                                         40
70                                         50
80                                         55
90                                         62
100                                        70
The equation of a simple linear regression line is:
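$$\hat{y} = \beta_0 + \beta_1 X$$
where the slope $\beta_1$ and the intercept $\beta_0$ are estimated by:
$$\beta_1 = \frac{\sum_i (X_i - \bar{X})(y_i - \bar{y})}{\sum_i (X_i - \bar{X})^2}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{X}$$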
Thus, the regression line is:
$$\hat{y} = 0.66x + 3$$
For $x = 85$:
$$\hat{y} = 0.66 \times 85 + 3 = 59.1$$
So, the predicted sales for an $85,000 TV advertising budget are
approximately 59100 units.
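The coefficients in this worked example can be reproduced step by step from the closed-form slope and intercept formulas. The sketch below uses the ten (budget, sales) pairs exactly as they appear in the scikit-learn listing for this example; variable names such as `s_xy` and `s_xx` are illustrative.

```python
# Budget (X) and sales (Y) values as given in the example's code listing.
x = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
y = [9, 18, 24, 28, 35, 40, 50, 55, 62, 70]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Numerator and denominator of the slope formula.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)

slope = s_xy / s_xx                  # beta_1
intercept = y_mean - slope * x_mean  # beta_0

# Prediction at x = 85 with unrounded coefficients, and with the
# rounded line y = 0.66x + 3 used in the text.
prediction_exact = intercept + slope * 85
prediction_rounded = 0.66 * 85 + 3

print("Means:", x_mean, y_mean)
print("S_xy, S_xx:", s_xy, s_xx)
print("Slope:", slope, "Intercept:", intercept)
print("Prediction (unrounded):", prediction_exact)
print("Prediction (rounded coefficients):", prediction_rounded)
```

With these values the slope is about 0.656, which rounds to 0.66, and the intercept is 3. The unrounded coefficients give a prediction of about 58.79 for x = 85; the 59.1 in the text comes from rounding the slope before multiplying.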
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Data: Advertising budget (X) and sales (Y)
x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]).reshape(-1, 1) # Independent variable
y = np.array([9, 18, 24, 28, 35, 40, 50, 55, 62, 70]) # Dependent variable
# Create and fit the linear regression model
model = LinearRegression()
model.fit(x, y)
# Predict using the model
y_pred = model.predict(x)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
# Display results
print("Regression Coefficient (Slope):", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predicted Values:", y_pred)
print("RMSE:", rmse)
# Plotting the results
plt.scatter(x, y, color="blue", label="Actual Sales Data")
plt.plot(x, y_pred, color="red", label="Best Fit Line")
plt.title("Advertising Budget vs Sales")
plt.xlabel("Advertising Budget (in $1000s)")
plt.ylabel("Sales (in $1000s)")
plt.legend()
plt.show()
# Predict sales for a reduced budget (e.g., $85,000)
reduced_budget = 85 # $85,000
predicted_sales = model.predict(np.array([[reduced_budget]]))
print("Predicted Sales for $85,000 Advertising Budget:", predicted_sales[0])