
Multiple Linear Regression

• Most regression problems involve more than one independent
  variable (predictor).
• If each independent variable varies in a linear manner with Y, the
  estimated regression function is:

  ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + ⋯ + bₖXₖᵢ

• The optimal values for the bᵢ can again be found by minimizing the
  sum of squared errors (SSE), as the sketch below illustrates.
• The resulting function fits a hyperplane to our sample data.
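To make the SSE minimization concrete, here is a minimal Python sketch (not part of the original slides); the data are simulated purely for illustration:

```python
# Minimal sketch: estimating b0..bk by minimizing SSE with NumPy.
# The data here are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 3))                     # three predictors
y = 5 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=38)

A = np.column_stack([np.ones(len(X)), X])        # add intercept column
b, sse, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(b)    # least-squares estimates [b0, b1, b2, b3]
print(sse)  # sum of squared errors at the optimum
```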

Example Dataset: IQ scores
• Dependent variable (y): Performance IQ scores (PIQ) from the revised
Wechsler Adult Intelligence Scale. This variable served as the investigator’s
measure of the individual's intelligence.
• Potential independent variable (x1): Brain size based on the count
obtained from MRI scans (given as count/10,000)
• Potential independent variable (x2): Height in inches
• Potential independent variable (x3): Weight in pounds
• Potential independent variable (x4): Gender (categorical variable):
  0 = male, 1 = female

Let's start with some descriptive statistics
• Analyze → Descriptive Statistics→ Descriptives
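For readers working outside SPSS, a rough pandas equivalent of this menu path; the file name iq_brain.csv and the column names PIQ, Brain, Height, Weight, Gender are assumptions for illustration:

```python
# Rough pandas equivalent of Analyze → Descriptive Statistics → Descriptives.
# File name and column names are assumptions.
import pandas as pd

df = pd.read_csv("iq_brain.csv")  # columns: PIQ, Brain, Height, Weight, Gender
print(df[["PIQ", "Brain", "Height", "Weight"]].describe())
```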

Descriptive statistics SPSS output

Descriptive Statistics

                       N    Minimum   Maximum   Mean      Std. Deviation
Performance IQ score   38   72        150       111.34    22.598
Brain (MRI)            38   79.06     107.95    90.6758   7.25628
Height                 38   62.0      77.0      68.421    3.9938
Weight                 38   106       192       151.05    23.479
Valid N (listwise)     38
For the categorical variable
• Analyze → Descriptive Statistics→ Frequencies
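A rough pandas equivalent of the Frequencies procedure; file and column names are the same assumptions as above:

```python
# Rough pandas equivalent of Analyze → Descriptive Statistics → Frequencies.
# File name and column name are assumptions.
import pandas as pd

df = pd.read_csv("iq_brain.csv")
counts = df["Gender"].value_counts()
print(pd.DataFrame({"Frequency": counts,
                    "Percent": 100 * counts / counts.sum()}))
```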

Frequencies SPSS output
Gender

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Male    19          50.0      50.0            50.0
       Female  19          50.0      50.0            100.0
       Total   38          100.0     100.0
Correlation analysis
• Analyze → Correlate→ Bivariate
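A sketch of the same bivariate analysis in Python; scipy's pearsonr returns the correlation with its two-tailed p-value, mirroring the SPSS output. File and column names are assumptions:

```python
# Pearson correlations with two-tailed p-values, mirroring SPSS's
# Bivariate output. File and column names are assumptions.
import pandas as pd
from scipy import stats

df = pd.read_csv("iq_brain.csv")
r, p = stats.pearsonr(df["PIQ"], df["Brain"])
print(f"r = {r:.3f}, p = {p:.3f}")   # the slide reports r = .378, p = .019
print(df[["PIQ", "Brain", "Height", "Weight"]].corr())  # full r matrix
```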

Correlation analysis SPSS output
Correlations

                                   Performance   Brain (MRI)   Height
                                   IQ score
Brain (MRI)   Pearson Correlation  .378*
              Sig. (2-tailed)      .019
              N                    38
Height        Pearson Correlation  -.093         .588**
              Sig. (2-tailed)      .578          .000
              N                    38            38
Weight        Pearson Correlation  .003          .513**        .700**
              Sig. (2-tailed)      .988          .001          .000
              N                    38            38            38

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Regression Model Building
Three tables should be included in the regression analysis:
1- Table of Coefficients (Estimates).
2- Fit Measure (adjusted R squared).
3- ANOVA Table.

Let’s start with including all independent
variables (Full Regression Model)
• Analyze → Regression→ linear
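A statsmodels sketch of the same full model, with all four predictors entered at once; file and column names are assumptions:

```python
# Fitting the full model, mirroring Analyze → Regression → Linear with
# all four predictors entered. File and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("iq_brain.csv")
model = smf.ols("PIQ ~ Brain + Height + Weight + Gender", data=df).fit()
print(model.summary())   # coefficients, t-tests, R-squared, ANOVA F-test
```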

Table of Coefficients
Coefficientsᵃ

                   Unstandardized          Standardized
                   Coefficients            Coefficients
Model              B          Std. Error   Beta           t        Sig.
1  (Constant)      107.217    62.052                      1.728    .093
   Brain (MRI)     2.199      .563         .706           3.907    <.001
   Height          -2.644     1.212        -.467          -2.182   .036
   Weight          -.064      .199         -.066          -.321    .750
   Gender          -9.460     6.544        -.212          -1.446   .158
a. Dependent Variable: Performance IQ score
Table of Coefficients
The estimated regression equation can be written as:

ŷ = 107.2 + 2.2(Brain) − 2.64(Height) − 0.06(Weight) − 9.46(Gender)

The coefficients may be interpreted as follows:
• b₀ (Constant) = 107.2 : The predicted performance IQ score (PIQ) is
  107.2 when brain size, height, and weight are all zero and gender is
  male. (This has no practical meaning.)
• b₁ (Brain) = 2.2 : When brain size increases by one unit, PIQ
  increases by 2.2 units, holding the other variables constant.
• b₂ (Height) = −2.64 : When height increases by one inch, PIQ
  decreases by 2.64 units, holding the other variables constant.
Table of Coefficients
The estimated regression equation can be written as:

ŷ = 107.2 + 2.2(Brain) − 2.64(Height) − 0.06(Weight) − 9.46(Gender)

The coefficients may be interpreted as follows:
• b₃ (Weight) = −0.06 : When weight increases by one pound, PIQ
  decreases by 0.06 units, holding the other variables constant.
• b₄ (Gender) = −9.46 : Females (coded as 1) on average score 9.46
  points lower than males (coded as 0), holding the other variables
  constant.
• However, the p-values for both weight and gender are higher than
  the 5% significance level. Thus, they should be removed from the
  final regression model and the regression equation re-estimated. (A
  worked prediction from the full equation is sketched below.)
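As a worked example of using the estimated equation, the predictor values below (Brain = 90, Height = 68 in, Weight = 150 lb, female) are hypothetical, chosen only to illustrate the arithmetic:

```python
# Worked prediction from the estimated full-model equation for a
# hypothetical female; the predictor values are illustrative only.
b0, b1, b2, b3, b4 = 107.2, 2.2, -2.64, -0.06, -9.46
piq_hat = b0 + b1 * 90 + b2 * 68 + b3 * 150 + b4 * 1
print(round(piq_hat, 2))   # 107.22
```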

Multicollinearity
• Multicollinearity is a high correlation between two independent
  variables such that the two variables contribute redundant
  information to the model. When highly correlated independent
  variables are included in the regression model, they can adversely
  affect the regression results.
Multicollinearity
• One method of measuring and detecting multicollinearity is known as
the variance inflation factor (VIF).
• A VIF equal to 1.0 for a given independent variable indicates that this
independent variable is not correlated with the remaining
independent variables in the model.
• The greater the multicollinearity, the larger the VIF.

Multicollinearity
• Generally, if VIF <5 for a particular independent variable, then we do
not consider multicollinearity a problem for that variable.
• VIF ≥ 5 implies that the correlation among the independent variables
  is too strong and should be dealt with by dropping variables from the
  model, as in the sketch below.
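A statsmodels sketch for computing the VIFs of the full model; file and column names are assumptions:

```python
# Computing VIFs for the full model with statsmodels. The constant must
# be included in the design matrix but is skipped when reporting.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("iq_brain.csv")   # file and column names assumed
X = sm.add_constant(df[["Brain", "Height", "Weight", "Gender"]])
for i, name in enumerate(X.columns[1:], start=1):
    print(name, round(variance_inflation_factor(X.values, i), 3))
```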

SPSS VIF
• Analyze → Regression→ linear → Statistics → Collinearity diagnostics

SPSS VIF output

Collinearity Statistics

Model            Tolerance   VIF
1  (Constant)
   Brain (MRI)   .615        1.626
   Height        .438        2.282
   Weight        .470        2.129
   Gender        .933        1.072
Selecting the Model
• We want to identify the simplest model that adequately accounts for
the systematic variation in the Y variable.
• Arbitrarily using all the independent variables may result in
overfitting. We want to avoid overfitting the data.
• As additional independent variables are added to a model:
- The R2 statistic can only increase.
- The Adjusted-R2 statistic can increase or decrease.
• The R2 statistic can be artificially inflated by adding any independent
  variable to the model. We can compare adjusted R2 values as a
  heuristic to tell whether adding an additional independent variable
  really helps (see the check below).
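The adjusted R2 penalizes R2 for the number of predictors k. A quick arithmetic check against the values reported later in the Model Summary (n = 38, k = 2, R2 = .295 for the final stepwise model):

```python
# Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - k - 1).
# n, k, and R2 are taken from the final stepwise model on the slides.
n, k, r2 = 38, 2, 0.295
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))   # 0.255, matching the SPSS Model Summary
```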
Stepwise Regression
• One option in regression analysis is to bring all possible independent
variables into the model in one step (Full regression). This is what we
have done in the previous sections.
• Another option for developing a regression model is called stepwise
regression.
• Stepwise regression is the step-by-step iterative construction of a
regression model that involves the selection of independent
variables to be used in a final model.
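A minimal forward-selection sketch of the idea; statsmodels has no built-in stepwise procedure, so this loop is an illustration under a 0.05 entry criterion, not SPSS's exact algorithm (which can also remove previously entered variables). File and column names are assumptions:

```python
# Forward selection with a 0.05 entry p-value criterion (illustrative).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("iq_brain.csv")
candidates = ["Brain", "Height", "Weight", "Gender"]
chosen = []
while candidates:
    # p-value of each candidate when added to the current model
    pvals = {v: smf.ols("PIQ ~ " + " + ".join(chosen + [v]),
                        data=df).fit().pvalues[v]
             for v in candidates}
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:
        break                      # no remaining variable enters
    chosen.append(best)
    candidates.remove(best)
print(chosen)   # per the slides, this should select Brain then Height
```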

Stepwise Regression in SPSS
• Analyze → Regression→ linear → Chose Method: Stepwise

Stepwise Regression in SPSS Output
Coefficientsᵃ

                 Unstandardized        Standardized                   Collinearity
                 Coefficients          Coefficients                   Statistics
Model            B         Std. Error  Beta          t        Sig.    Tolerance  VIF
1  (Constant)    4.652     43.712                    .106     .916
   Brain (MRI)   1.177     .481        .378          2.448    .019    1.000      1.000
2  (Constant)    111.276   55.867                    1.992    .054
   Brain (MRI)   2.061     .547        .662          3.770    .001    .654       1.529
   Height        -2.730    .993        -.482         -2.749   .009    .654       1.529
a. Dependent Variable: Performance IQ score
Stepwise Regression in SPSS Output
• The final estimated regression equation can be written as:
ŷ = 111.3 + 2.06(Brain) − 2.73(Height)
The coefficients may be interpreted as follows:
• 𝑏1 (𝐵𝑟𝑎𝑖𝑛) = 2.06 : When the brain size increases by one unit, PIQ
increases by 2.06 units holding other variables constant.
• 𝑏2 𝐻𝑒𝑖𝑔ℎ𝑡 = −2.73 : When the height increases by one inch, PIQ
decreases by 2.73 units holding other variables constant.

Testing the parameters (regression coefficients)
We can test the significance of each parameter β0, β1 and β2 using a
t-test, as follows:
H0: β0 = 0    H1: β0 ≠ 0
P-value = 0.054
Since the P-value for the t-test in the table is slightly above 0.05, strictly we
fail to reject H0 at the 5% level; the constant is only borderline significant. In
practice, the constant is usually retained in the model regardless.
H0: β1 = 0    H1: β1 ≠ 0
P-value < 0.001
Since the P-value for the t-test in the table is less than 0.05, we reject H0.
This suggests that the slope parameter for Brain (β1) is significant, which
means that β1 is significantly different from 0.
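These t statistics can be recovered directly from the table: t = b / SE(b), compared against a t distribution with n − k − 1 degrees of freedom. A check for Brain, using the values reported on the slides:

```python
# Recovering the t statistic and p-value for Brain in the final model
# from the reported estimate and standard error (slide values).
from scipy import stats

b1, se1 = 2.061, 0.547
df_resid = 38 - 2 - 1              # n - k - 1 = 35
t = b1 / se1
p = 2 * stats.t.sf(abs(t), df_resid)
print(round(t, 2), round(p, 4))    # ≈ 3.77, p ≈ 0.0006 (reported as .001)
```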

Testing the parameters (regression coefficients)
We can test the significance of each parameter β0, β1 and β2 using a
t-test, as follows:
H0: β2 = 0    H1: β2 ≠ 0
P-value = 0.009
Since the P-value for the t-test is less than 0.05, we reject H0. This
suggests that the slope parameter for Height (β2) is significant, which
means that β2 is significantly different from 0.
Model Fit
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .378ᵃ   .143       .119                21.212
2       .543ᵇ   .295       .255                19.510
a. Predictors: (Constant), Brain (MRI)
b. Predictors: (Constant), Brain (MRI), Height

• The adjusted R-square suggests that 25.5% of the variation in PIQ is
  explained by brain size and height. The remaining 74.5% is left
  unexplained by the regression model.
Testing the overall model (ANOVA)
ANOVAᵃ

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   2697.094         1    2697.094      5.994   .019ᵇ
   Residual     16197.459        36   449.929
   Total        18894.553        37
2  Regression   5572.741         2    2786.371      7.321   .002ᶜ
   Residual     13321.811        35   380.623
   Total        18894.553        37
a. Dependent Variable: Performance IQ score
b. Predictors: (Constant), Brain (MRI)
c. Predictors: (Constant), Brain (MRI), Height
Testing the overall model (ANOVA)
H0: β1 = β2 = 0 (model is not significant)
H1: At least one βᵢ ≠ 0 (model is significant)
• The P-value for the ANOVA F-test of the final model is 0.002, which is
  less than the 0.05 significance level. So, we reject H0.
• This means that the overall regression model is significant and can be
  used for prediction.
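The F statistic itself is the ratio of the mean squares from the ANOVA table, with df = (k, n − k − 1). A check using the final-model values reported above:

```python
# Reproducing the overall F test for the final model from the ANOVA
# table: F = MS(Regression) / MS(Residual).
from scipy import stats

ms_reg, ms_res = 2786.371, 380.623
F = ms_reg / ms_res
p = stats.f.sf(F, 2, 35)          # df = (k, n - k - 1) = (2, 35)
print(round(F, 3), round(p, 3))   # ≈ 7.321, p ≈ 0.002
```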

Linear Regression Assumptions
It is essential to check the assumptions of the linear regression model.
If the assumptions are valid, the regression results will be reliable.
There are several assumptions; the most important are:
1- Linearity: The relationship between the X’s and Y should be
linear.
2- Multicollinearity: There should be no (or little) multicollinearity.
3- Homoscedasticity: The variance of residual (error term) is the
same for any value of X.
4- Normality: The residuals (error terms) should be normally
distributed.

1- Linearity
• Linearity can be checked by plotting the residuals (on the vertical
  axis) against each independent variable (on the horizontal axis): the
  points should scatter randomly, with no systematic curvature.
• Alternatively, linearity can be checked by plotting the outcome
  variable against each independent (predictor) variable: the pattern
  should be approximately linear.
• A curved pattern suggests that a linear model may not be the best fit
  and that a more complex model (for example, one with a quadratic
  term) may be needed.
Linearity Checking SPSS
• Analyze → Regression→ linear → Save → Check unstandardized
residuals
Note: SPSS will save the residuals of the model and add a new variable
(RES_1) to the data set, which contains the calculated residuals.
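A Python sketch of the same step: save the residuals and plot them against each predictor. File and column names are assumptions:

```python
# Save residuals and plot them against each predictor to check linearity.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("iq_brain.csv")   # file and column names assumed
fit = smf.ols("PIQ ~ Brain + Height", data=df).fit()
df["RES_1"] = fit.resid            # mirrors SPSS's saved RES_1 variable

for var in ["Brain", "Height"]:
    plt.scatter(df[var], df["RES_1"])
    plt.xlabel(var)
    plt.ylabel("Residual")
    plt.show()                     # look for a random, patternless cloud
```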

Scatter plot of the residuals against the independent variables

Since the residuals show random patterns against each independent
variable, the linearity assumption is satisfied.
2- Multicollinearity
• Multicollinearity can be checked by computing the Variance Inflation
Factor (VIF) discussed earlier.
• Generally, if VIF <5 for a particular independent variable, then we do
not consider multicollinearity a problem for that variable.

3- Homoscedasticity
• Residual plots also can be used to determine whether the residuals
have a constant variance.
• When we have developed a multiple regression model, we can
  analyze the equal variance assumption by plotting the residuals
  against the fitted ŷ values.
• Analyze → Regression→ linear → Save → Check unstandardized
residuals AND Check unstandardized Predicted values
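A Python sketch of this plot; a roughly constant vertical spread of points suggests homoscedasticity. File and column names are assumptions:

```python
# Residuals against fitted values for the equal-variance check.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("iq_brain.csv")   # file and column names assumed
fit = smf.ols("PIQ ~ Brain + Height", data=df).fit()
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")     # reference line at zero residual
plt.xlabel("Fitted PIQ")
plt.ylabel("Residual")
plt.show()
```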

Scatter plot of the residuals against the fitted values

The variance of the residuals stays approximately constant over the
range of the fitted values, so the equal-variance assumption is
satisfied.
4- Normality
• The need for normally distributed model errors occurs when we want
to test a hypothesis about the regression model.
• Small departures from normality do not cause serious problems.
• However, if the model errors depart dramatically from a normal
distribution, there is cause for concern.
• Examining the residuals will allow us to detect such dramatic
departures.
• One method for graphically analyzing the residuals is to form a
  frequency histogram of the residuals to determine whether their
  general shape is normal, or to use a Q-Q plot, as sketched below.
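A Python sketch of the Q-Q plot using statsmodels; points near the reference line support the normality assumption. File and column names are assumptions:

```python
# Q-Q plot of the residuals against a normal distribution.
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("iq_brain.csv")   # file and column names assumed
fit = smf.ols("PIQ ~ Brain + Height", data=df).fit()
sm.qqplot(fit.resid, line="45", fit=True)   # standardize, then 45° line
plt.show()
```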

SPSS Q-Q plot
• Analyze → Descriptive Statistics → Q-Q Plots
Normal Q-Q plot of unstandardized residuals
If the points lie close to the 45-degree line, normality is achieved.
Otherwise, the normality assumption is violated.