
MIS335

Predictive Analytics
Assoc. Prof. Onur Doğan
Chapter 2 Linear Regression
onur.dogan@bakircay.edu.tr
Content
Ch1 Introduction to Predictive Analytics
Ch2 Linear Regression
Ch3 Logistic Regression
Practice
Ch4 K-Means Clustering
Ch5 K-Nearest Neighbour
Midterm
Midterm Break
Practice
Ch6 Decision Tree
Ch7 Artificial Neural Network
Practice
Ch8 Association Analysis
Presentations*
Presentations*
Presentations*
Final Exam

* Student presentations
Basic Topics:
• Regression Model Types
• Finding Regression Equations and Making Predictions
• Least-Squares Method
• Correlation
• Explaining Variability

Ch2 Linear Regression (Simple – Multiple)
------------------------------------------

Simple Linear Regression


Intended Use
Regression analysis is primarily used to model causality and provide
prediction.
• Predicting the values of the dependent (response) variable based on
the values of at least one independent (explanatory) variable
• Explains the effect of independent variables on the dependent
variable

Newspaper sales → Weather, agenda, promotion, etc.


Why is it needed?
Statistical hypothesis tests (z-test, t-test, chi-square test, etc.) do not
provide information about the relationship between variables. A scatter
diagram may suggest that a relationship exists between two variables, but
such a relationship cannot be quantified with those tests alone. Therefore,
new methods are needed to determine the direction (regression) and
strength (correlation) of the relationship between variables. These
methods are generally called regression (curve fitting) and correlation
analysis.
Regression Model Types
Simple Linear Regression
Simple linear regression allows us to test whether there is a linear relationship between
two normally distributed, continuous variables. One variable is the predictor variable and
the other is the outcome variable.

𝑌 = 𝛼 + 𝛽𝑥 + 𝜀
• Y: the dependent (outcome) variable; assumed to contain random error.
• X: the independent (cause) variable; assumed to be measured without error.
• 𝛼: the constant; the value of Y when X = 0.
• 𝛽: the regression coefficient; the amount of change in Y (in its own units) in response
to a one-unit change in X (in its own units).
• 𝜀: the random error term; assumed to be normally distributed. This assumption is not
needed for parameter estimation, but it is needed for significance tests of the coefficients.
Parameter Estimation (Least-Squares Method – LSM)
• The regression model is estimated from observation values sampled from
the problem of interest. Therefore, the coefficients in our model are
estimated values.
• The least-squares method chooses the coefficients that minimize the sum
of squared differences between the observed and predicted values of Y
(i.e., the total squared error).

𝑌 = 𝛼 + 𝛽𝑥 + 𝜀
𝑌 = 𝛼 + 𝛽𝑥 + 𝜀  ⟹  𝜀 = 𝑌 − 𝛼 − 𝛽𝑥

Σ 𝜀ᵢ² = Σ (𝑌ᵢ − 𝛼 − 𝛽𝑋ᵢ)²   (sums over i = 1, …, n)

To minimize the total squared error, take the partial derivatives with respect to 𝛼 and 𝛽
and set them equal to zero. (The sum of squares is convex, so the point where the
derivatives are zero is its minimum.) This gives the normal equations:

Σ𝑌ᵢ = 𝑛𝛼 + 𝛽 Σ𝑋ᵢ
Σ𝑋ᵢ𝑌ᵢ = 𝛼 Σ𝑋ᵢ + 𝛽 Σ𝑋ᵢ²

If 𝛼 and 𝛽 are extracted from here,


𝛽 = Σ(𝑋ᵢ − X̄)(𝑌ᵢ − Ȳ) / Σ(𝑋ᵢ − X̄)²
𝛼 = Ȳ − 𝛽X̄

Ŷ = 𝛼 + 𝛽𝑥ᵢ
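The closed-form estimates above are straightforward to compute directly. A minimal Python sketch (the function and variable names are illustrative, not from the slides):

```python
# Least-squares estimates for simple linear regression,
# following beta = S_xy / S_xx and alpha = Ȳ - beta * X̄.

def fit_simple_ols(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # beta = sum((Xi - X̄)(Yi - Ȳ)) / sum((Xi - X̄)^2)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
           sum((x - x_bar) ** 2 for x in xs)
    # alpha = Ȳ - beta * X̄
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Perfectly linear data recovers the line exactly: y = 2 + 3x
alpha, beta = fit_simple_ols([1, 2, 3, 4], [5, 8, 11, 14])
print(alpha, beta)  # 2.0 3.0
```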
Example
• To investigate the dependence between
store size and annual sales revenue, data
for seven stores are given below. Find
the regression equation.

X: Store size (square feet)


Y: Annual sales volume
Example
A scatter plot of the data suggests a linear relationship. Applying the
least-squares formulas

𝑏 = Σ(𝑋ᵢ − X̄)(𝑌ᵢ − Ȳ) / Σ(𝑋ᵢ − X̄)²,   𝑎 = Ȳ − 𝑏X̄

gives the regression equation

Ŷ = 𝑎 + 𝑏𝑥ᵢ = 1636.415 + 1.487𝑥ᵢ

We predict that a one-unit increase in X (store size, in square feet) would
increase the average Y (annual sales) by a predicted 1.487 units.
Example
• The number of questions answered correctly and the grades obtained are
given in the table. Using regression analysis, predict the grade of a
student with 20 correct answers. (Assume that questions have different
points.)

Correct answers   Grade
17                94
13                73
12                59
15                80
16                93
14                85
16                66
16                79
18                77
19                91
Example

𝛽 = Σ(𝑋ᵢ − X̄)(𝑌ᵢ − Ȳ) / Σ(𝑋ᵢ − X̄)²
𝛼 = Ȳ − 𝛽X̄

Correct answer  Grade   Xᵢ−X̄   Yᵢ−Ȳ   (Xᵢ−X̄)(Yᵢ−Ȳ)   (Xᵢ−X̄)²
17              94      1.4    14.3    20.02           1.96
13              73     -2.6    -6.7    17.42           6.76
12              59     -3.6   -20.7    74.52          12.96
15              80     -0.6     0.3    -0.18           0.36
16              93      0.4    13.3     5.32           0.16
14              85     -1.6     5.3    -8.48           2.56
16              66      0.4   -13.7    -5.48           0.16
16              79      0.4    -0.7    -0.28           0.16
18              77      2.4    -2.7    -6.48           5.76
19              91      3.4    11.3    38.42          11.56
Total                                 134.8           42.4
Average         15.6    79.7

𝛽 = 134.8 / 42.4 ≈ 3.18,   𝛼 = 79.7 − 3.18 × 15.6 ≈ 30.10

Ŷ = 30.10 + 3.18𝑥ᵢ = 30.10 + 3.18 × 20 = 93.69
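The hand computation above can be checked in a few lines of Python, using the same ten (correct answers, grade) pairs from the table:

```python
# Reproduce the worked example: estimate beta and alpha from the ten
# (correct answers, grade) pairs, then predict the grade for x = 20.
xs = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
ys = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n          # 15.6, 79.7
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
       sum((x - x_bar) ** 2 for x in xs)          # 134.8 / 42.4 ≈ 3.179
alpha = y_bar - beta * x_bar                      # ≈ 30.10
print(round(alpha, 2), round(beta, 2))            # 30.1 3.18
print(round(alpha + beta * 20, 2))                # 93.69
```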


Correlation
• Correlation shows whether there is a linear relationship between two
variables; in other words, whether the variables tend to change together.

𝑟 = (Σ𝑥𝑦 − Σ𝑥 Σ𝑦 / 𝑛) / √[(Σ𝑥² − (Σ𝑥)²/𝑛) (Σ𝑦² − (Σ𝑦)²/𝑛)]

• −1 ≤ 𝑟 ≤ +1
• Negative values indicate a negative relationship; positive values
indicate a positive relationship.
• When r is 1 or −1, there is a perfect linear relationship. As r
approaches 0, the relationship between the variables weakens. If there is
no linear relationship, r = 0.
Which one first?

A. Regression
B. Correlation

Unless otherwise specified, correlation is examined before regression.
Correct Answer  Grade   xy      x²      y²
17              94      1598    289     8836
13              73      949     169     5329
12              59      708     144     3481
15              80      1200    225     6400
16              93      1488    256     8649
14              85      1190    196     7225
16              66      1056    256     4356
16              79      1264    256     6241
18              77      1386    324     5929
19              91      1729    361     8281
Total           156     797     12568   2476    64727
Average         15.6    79.7    1256.8  247.6   6472.7

𝑟 = (12568 − 156 × 797 / 10) / √[(2476 − 156²/10)(64727 − 797²/10)] = 0.5961
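The same raw-sum formula can be evaluated directly from the table's column totals; a short Python check (variable names are illustrative):

```python
import math

# Pearson correlation from the raw-sum formula on the grades data.
xs = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
ys = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]
n = len(xs)

sx, sy = sum(xs), sum(ys)                      # 156, 797
sxy = sum(x * y for x, y in zip(xs, ys))       # 12568
sxx = sum(x * x for x in xs)                   # 2476
syy = sum(y * y for y in ys)                   # 64727

r = (sxy - sx * sy / n) / math.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
print(round(r, 4))  # 0.5961
```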
Regression Line
[Scatter plot of grade (0–100) against correct answers (10–20) with the
fitted regression line.]
Explaining Variability
• Total variability = Explained variability + Unexplained variability

Total variability (SST: Total Sum of Squares) = Σᵢ (𝑦ᵢ − ȳ)²
Explained variability (SSE: Explained Sum of Squares) = Σᵢ (ŷᵢ − ȳ)²
Unexplained variability (SSR: Residual Sum of Squares) = Σᵢ (ŷᵢ − 𝑦ᵢ)²

(Note that in these slides SSE denotes the explained sum of squares and SSR
the residual sum of squares.)
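Under the least-squares fit these three quantities satisfy SST = SSE + SSR exactly. A quick numerical check, a sketch reusing the grades data from the earlier example:

```python
# Verify SST = SSE + SSR for the least-squares fit on the grades data.
xs = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
ys = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
       sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar
y_hat = [alpha + beta * x for x in xs]           # fitted values ŷᵢ

sst = sum((y - y_bar) ** 2 for y in ys)              # total
sse = sum((yh - y_bar) ** 2 for yh in y_hat)         # explained
ssr = sum((yh - y) ** 2 for yh, y in zip(y_hat, ys)) # residual
print(round(sst, 2), round(sse + ssr, 2))  # equal up to rounding
```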
Explaining Variability
• Explained Variability (or Regression Variability): This is the portion of the
total variability that can be explained by the regression model. It
represents the variability in the dependent variable that is accounted for by
the relationship between the independent variable(s) and the dependent
variable as captured by the regression equation. In other words, it's the
variability that the regression model can "explain" or "predict."
• Unexplained Variability (or Residual Variability): Also known as residual
variability, this component represents the part of total variability that
cannot be explained by the regression model. It accounts for the
differences between the observed values of the dependent variable and
the predicted values based on the regression equation. Residuals (the
differences between observed and predicted values) are used to quantify
this unexplained variability.
Explaining Variability
[Figure: each deviation 𝑦ᵢ − ȳ is decomposed into the residual part
ŷᵢ − 𝑦ᵢ and the explained part ŷᵢ − ȳ.]
Explaining Variability
• Determination Coefficient R²

𝑅² = 𝑆𝑆𝐸 / 𝑆𝑆𝑇 = 1 − 𝑆𝑆𝑅 / 𝑆𝑆𝑇

• In simple linear regression, the determination coefficient is the square
of the correlation:

𝑅² = 𝑟² = [(Σ𝑥𝑦 − Σ𝑥 Σ𝑦 / 𝑛) / √((Σ𝑥² − (Σ𝑥)²/𝑛)(Σ𝑦² − (Σ𝑦)²/𝑛))]²
Example
• Calculate the coefficient of determination for the previous example.
(What proportion of the variation is explained by the regression
equation?)

Ŷ = 30.10 + 3.18𝑥ᵢ
𝑟² = 0.5961² = 0.3553

• Only about 35% of the variation can be explained by the equation; the
regression equation cannot explain the remaining 65%.
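A short check of this value from the table totals: Sxy = 134.8 and Sxx = 42.4 appear in the worked regression table, while Syy = 1206.1 is derived here (not printed on the slides) as Σy² − (Σy)²/n = 64727 − 797²/10.

```python
# R² for the grades example, computed two ways from the table totals.
s_xy, s_xx, s_yy = 134.8, 42.4, 1206.1

r = s_xy / (s_xx * s_yy) ** 0.5    # correlation, ≈ 0.5961
print(round(r ** 2, 4))            # 0.3553

# Equivalent route: R² = SSE / SST, with SSE = Sxy² / Sxx
print(round((s_xy ** 2 / s_xx) / s_yy, 4))  # 0.3553
```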
Regression Line
[Scatter plot with the fitted line Ŷ = 30.10 + 3.18𝑥ᵢ and the mean line
ȳ = 79.7; x: 10–20, y: 0–100.]
Assumptions (Important)

• Linearity (Doğrusallık): The relationship between X and the mean of Y is
linear (check with a scatterplot).
• Homoscedasticity (Eş varyanslık): The variance of the error terms
(variance of the residuals) is constant across the values of the
independent variables (check by plotting the standardized residuals
against the standardized predicted values).
• Independence (Bağımsızlık, no autocorrelation): There is no relationship
between the residuals and Y; in other words, Y is independent of the
errors (check with the Durbin-Watson statistic).
• Normality (Normallik): For any fixed value of X, Y is normally
distributed (check with the Kolmogorov-Smirnov test).
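As an illustration of the independence check, the Durbin-Watson statistic is simple enough to compute by hand; this sketch applies it to the residuals of the grades fit (the function name and the use of rounded coefficients are illustrative, not from the slides). Values near 2 suggest no first-order autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation.

```python
# Durbin-Watson statistic: ratio of summed squared successive
# residual differences to the summed squared residuals.
def durbin_watson(residuals):
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Residuals of the grades fit Ŷ ≈ 30.10 + 3.18x
xs = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
ys = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]
resid = [y - (30.1038 + 3.1792 * x) for x, y in zip(xs, ys)]
print(round(durbin_watson(resid), 2))  # always falls in (0, 4)
```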
[Figures illustrating the linearity (Doğrusallık) and homoscedasticity
(Eş varyans) checks.]
• Outliers (Uç Değerler): not an assumption, but a usual check that needs
to be done.
------------------------------------------

Multiple Linear Regression


Multiple Linear Regression
• Simple Linear Regression → single dependent, single independent variable
• Multiple Linear Regression → single dependent, multiple independent variables

• In the correct-answers-and-grades example:
  • X: Correct answers
  • Y: Grade
• If we examine study time together with the number of correct answers:
  • X1: Correct answers, X2: Study time
  • Y: Grade
Multiple Linear Regression
Definition:
• A technique used to investigate the relationship between more than one
independent variable and a dependent variable.

Regression coefficients:
Ŷ = 𝑎 + 𝑏𝑥  ⟹  Ŷ = 𝑎 + 𝑏ᵢ𝑥ᵢ

Ŷ = 𝑎 + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + ⋯ + 𝑏ₖ𝑥ₖ

Interpretation of a coefficient (𝑏₁): when the other variables are held
constant, the effect of a one-unit change in 𝑥₁ on the dependent variable.
Multiple Linear Regression
𝑏 = (𝑋ᵀ𝑋)⁻¹ 𝑋ᵀ𝑌
• Define the matrices 𝑋, 𝑋ᵀ and 𝑌.
• Calculate the matrix product 𝑋ᵀ𝑋 and find its inverse (using 𝐴⁻¹𝐴 = 𝐼).
• Carry out the remaining multiplications.
• The result is a column vector with k+1 rows (k: number of independent
variables). The first element is the constant 𝑎 in the previously defined
Ŷ = 𝑎 + 𝑏₁𝑥₁ + ⋯; the others are the coefficients of the independent
variables.
Example
For 5 students, the grade obtained in an exam, the student's IQ level and
the time spent studying for the exam are shown.
a. What are the independent variable(s)?
b. Evaluate the relationship between the (dependent and independent)
variables.

Student  Grade  IQ   Study Time
1        100    110  40
2        90     120  30
3        80     100  20
4        70     90   0
5        60     80   10
Solution

      1  110  40            100
      1  120  30             90
X =   1  100  20      Y =    80
      1   90   0             70
      1   80  10             60

       1    1    1    1    1
Xᵀ =  110  120  100   90   80
       40   30   20    0   10

         5    500    100
XᵀX =  500  51000  10800
       100  10800   3000

            101/5   −7/30    1/6
(XᵀX)⁻¹ =  −7/30   1/360  −1/450
             1/6  −1/450   1/360

b = (XᵀX)⁻¹ XᵀY = (b₀, b₁, b₂) = (20, 0.5, 0.5)   (regression coefficients)

Ŷ = 20 + 0.5𝑥₁ + 0.5𝑥₂

Comment: Study time is as important as IQ.
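The matrix computation can be reproduced with NumPy, following exactly the slide's formula b = (XᵀX)⁻¹XᵀY (in practice, numpy.linalg.lstsq is numerically preferable to forming an explicit inverse):

```python
import numpy as np

# Design matrix: a leading column of ones for the constant term,
# then IQ (x1) and study time (x2), as in the slides.
X = np.array([[1, 110, 40],
              [1, 120, 30],
              [1, 100, 20],
              [1,  90,  0],
              [1,  80, 10]], dtype=float)
Y = np.array([100, 90, 80, 70, 60], dtype=float)

# b = (X^T X)^{-1} X^T Y
b = np.linalg.inv(X.T @ X) @ (X.T @ Y)
print(np.round(b, 4))  # constant, coefficient of x1, coefficient of x2
```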
What's Next Week?

Logistic Regression
