Ch2 Linear Regression
Predictive Analytics
Assoc. Prof. Onur Doğan
onur.dogan@bakircay.edu.tr
Content
Ch1 Introduction to Predictive Analytics
Ch2 Linear Regression
Ch3 Logistic Regression
Practice
Ch4 K-Means Clustering
Ch5 K-Nearest Neighbour
Midterm
Midterm Break
Practice
Ch6 Decision Tree
Ch7 Artificial Neural Network
Practice
Ch8 Association Analysis
Presentations*
Presentations*
Presentations*
Final Exam
* Student presentations
Basic Topics:
• Regression Model Types
• Finding Regression Equations and Making Predictions
• Least-Squares Method
• Correlation
• Explaining Variability
𝑌 = 𝛼 + 𝛽𝑥 + 𝜀
• Y: the dependent (outcome) variable; assumed to contain a certain amount of error.
• X: the independent (cause) variable; assumed to be measured without error.
• 𝛼: the constant (intercept); the value of Y when X = 0.
• 𝛽: the regression coefficient; the change in Y, in its own units, produced by a one-unit change in X, in its own units.
• 𝜀: the random error term, assumed to be normally distributed. This assumption is not needed for parameter estimation, only for significance tests of the coefficients.
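The roles of these components can be illustrated with a short simulation (a sketch, not from the slides): we choose 𝛼 and 𝛽 ourselves and add normally distributed noise 𝜀 to generate observations of Y.

```python
import random

# Hypothetical illustration of the model Y = alpha + beta*x + eps.
# alpha and beta are assumed "true" values chosen for the demo.
random.seed(42)

alpha, beta = 2.0, 0.5
x_values = [float(x) for x in range(1, 11)]
# Each Y is the deterministic line plus a normally distributed error term.
y_values = [alpha + beta * x + random.gauss(0, 1) for x in x_values]

for x, y in zip(x_values, y_values):
    print(f"x={x:4.1f}  y={y:6.2f}")
```

Because 𝜀 is random, the generated points scatter around the line Y = 2 + 0.5x rather than lying exactly on it.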
Parameter Estimation (Least-Squares Method- LSM)
• The regression model is estimated from observation values sampled from the problem of interest. The coefficients in our model are therefore estimates.
• The least-squares method chooses the estimates that minimize the sum of squared differences between the observed and predicted values of Y (i.e., the total squared error).
𝑌 = 𝛼 + 𝛽𝑥 + 𝜀
Parameter Estimation (Least-Squares Method, LSM)

𝑌 = 𝛼 + 𝛽𝑥 + 𝜀
𝜀 = 𝑌 − 𝛼 − 𝛽𝑥

Σᵢ₌₁ⁿ 𝜀ᵢ² = Σᵢ₌₁ⁿ (𝑌ᵢ − 𝛼 − 𝛽𝑋ᵢ)²

To find the smallest value of the total squared error, take the derivatives with respect to 𝛼 and 𝛽 and set them equal to zero. (Recall that a point where the derivative equals zero is the minimum here.) This yields the normal equations:

Σ𝑌ᵢ = 𝑛𝛼 + 𝛽 Σ𝑋ᵢ
Σ𝑋ᵢ𝑌ᵢ = 𝛼 Σ𝑋ᵢ + 𝛽 Σ𝑋ᵢ²

Solving them gives:

𝛽 = Σ(𝑋ᵢ − 𝑋̄)(𝑌ᵢ − 𝑌̄) / Σ(𝑋ᵢ − 𝑋̄)²
𝛼 = 𝑌̄ − 𝛽𝑋̄

and the fitted line 𝑌̂ = 𝛼 + 𝛽𝑥ᵢ.
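The closed-form solution above can be sketched directly in Python (a minimal illustration, not from the slides; the function name is my own):

```python
def fit_least_squares(xs, ys):
    """Estimate alpha and beta from the normal equations:
    beta = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),
    alpha = ybar - beta * xbar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
           sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Points lying exactly on y = 1 + 2x recover alpha = 1, beta = 2.
alpha, beta = fit_least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # 1.0 2.0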
Example
• To investigate the dependence between
store size and annual sales revenue, data
for seven stores are given below. Find
the regression equation.
Scatter plot
(Figure: scatter plot of store size vs. annual sales revenue for the seven stores)

Example

𝑏 = Σ(𝑋ᵢ − 𝑋̄)(𝑌ᵢ − 𝑌̄) / Σ(𝑋ᵢ − 𝑋̄)²
A. Regression
B. Correlation
Unless otherwise specified, correlation is examined before regression.
Correct Answers (x)   Grade (y)      xy      x²      y²
17                    94           1598     289    8836
13                    73            949     169    5329
12                    59            708     144    3481
15                    80           1200     225    6400
16                    93           1488     256    8649
14                    85           1190     196    7225
16                    66           1056     256    4356
16                    79           1264     256    6241
18                    77           1386     324    5929
19                    91           1729     361    8281
Total   156           797         12568    2476   64727
Average 15.6          79.7       1256.8   247.6  6472.7
𝑟 = (Σ𝑥𝑦 − Σ𝑥Σ𝑦/𝑛) / √[(Σ𝑥² − (Σ𝑥)²/𝑛)(Σ𝑦² − (Σ𝑦)²/𝑛)]
  = (12568 − 156 × 797/10) / √[(2476 − 156²/10)(64727 − 797²/10)]
  = 0.5961
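The correlation calculation can be checked with a short script using the table's data (a sketch, not part of the slides):

```python
from math import sqrt

# Data from the example: x = correct answers, y = grade.
x = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
y = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))   # 12568
sum_x2 = sum(a * a for a in x)              # 2476
sum_y2 = sum(b * b for b in y)              # 64727

# Pearson correlation from the computational formula used in the slides.
r = (sum_xy - sum_x * sum_y / n) / sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)
print(round(r, 4))  # 0.5961
```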
Regression Line
(Figure: scatter plot of correct answers, x-axis 10–20, vs. grade, y-axis 0–100)
Explaining Variability
• Total variability = Explained variability + Unexplained variability

SST (Total Sum of Squares) = Σᵢ₌₁ⁿ (𝑦ᵢ − 𝑦̄)²
SSE (Explained Sum of Squares) = Σᵢ₌₁ⁿ (𝑦̂ᵢ − 𝑦̄)²
SSR (Residual Sum of Squares) = Σᵢ₌₁ⁿ (𝑦ᵢ − 𝑦̂ᵢ)²
Explaining Variability
• Explained Variability (or Regression Variability): This is the portion of the
total variability that can be explained by the regression model. It
represents the variability in the dependent variable that is accounted for by
the relationship between the independent variable(s) and the dependent
variable as captured by the regression equation. In other words, it's the
variability that the regression model can "explain" or "predict."
• Unexplained Variability (or Residual Variability): This component
represents the part of total variability that cannot be explained by
the regression model. It accounts for the
differences between the observed values of the dependent variable and
the predicted values based on the regression equation. Residuals (the
differences between observed and predicted values) are used to quantify
this unexplained variability.
Explaining Variability

Unexplained variability: Σᵢ₌₁ⁿ (𝑦̂ᵢ − 𝑦ᵢ)²
Explained variability: Σᵢ₌₁ⁿ (𝑦̂ᵢ − 𝑦̄)²
Explaining Variability
• Determination Coefficient R²

𝑅² = SSE/SST = 1 − SSR/SST

• The determination coefficient is the square of the correlation:

𝑅² = 𝑟² = (Σ𝑥𝑦 − Σ𝑥Σ𝑦/𝑛)² / [(Σ𝑥² − (Σ𝑥)²/𝑛)(Σ𝑦² − (Σ𝑦)²/𝑛)]
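Both routes to R² can be verified on the example data (a sketch, not part of the slides): computing it as explained variability over total variability, and as one minus the residual share.

```python
# R^2 computed two ways for the example data (x = correct answers, y = grade).
x = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
y = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares coefficients from the normal-equation solution.
beta = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
       sum((a - x_bar) ** 2 for a in x)
alpha = y_bar - beta * x_bar
y_hat = [alpha + beta * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)              # total variability
sse = sum((h - y_bar) ** 2 for h in y_hat)          # explained variability
ssr = sum((b - h) ** 2 for b, h in zip(y, y_hat))   # unexplained (residual)

r2_explained = sse / sst
r2_residual = 1 - ssr / sst
print(round(r2_explained, 4), round(r2_residual, 4))  # 0.3553 0.3553
```

Both expressions agree because SST = SSE + SSR, so the two definitions of R² are algebraically identical.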
Example
• Calculate the coefficient of determination for the previous example, using the data from the earlier table. (What proportion of the variation is explained by the regression equation?)

𝑌̂ = 30.10 + 3.18𝑥ᵢ
𝑟² = 0.5961² = 0.3553

• Only about 35% of the variation can be explained by the equation; the regression equation cannot explain the remaining 65%.
Regression Line
(Figure: the data with the fitted line 𝑌̂ = 30.10 + 3.18𝑥ᵢ and the mean line 𝑦̄ = 79.7; x-axis 10–20, y-axis 0–100)
Assumption (Important)
𝑌 = 𝑎 + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + ⋯ + 𝑏ₖ𝑥ₖ
Logistic Regression