Multicollinearity
NURA A 02/13/2024
Multicollinearity
Another common regression problem: Multicollinearity
Definition: collinear = highly correlated
Multicollinearity = inclusion of highly correlated independent
variables in a single regression model
Recall: High correlation among the X variables causes problems for
estimating the slopes (b's).
Recall: the denominators of the slope and variance formulas approach zero, so coefficients may be
wrong or too large.
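To make the "denominators approach zero" point concrete, here is the standard two-regressor case written out. The notation is an assumption on my part (lowercase letters for deviations from sample means, r_{23} for the correlation between X_2 and X_3); the slides themselves do not show these formulas.

```latex
% Model: Y_i = b_1 + b_2 X_{2i} + b_3 X_{3i} + e_i
% Lowercase y, x_2, x_3 denote deviations from sample means.
\hat{b}_2 =
  \frac{\left(\sum y x_2\right)\left(\sum x_3^2\right)
        - \left(\sum y x_3\right)\left(\sum x_2 x_3\right)}
       {\left(\sum x_2^2\right)\left(\sum x_3^2\right)
        - \left(\sum x_2 x_3\right)^2}

% The denominator can be rewritten using r_{23}:
\left(\sum x_2^2\right)\left(\sum x_3^2\right) - \left(\sum x_2 x_3\right)^2
  = \left(\sum x_2^2\right)\left(\sum x_3^2\right)\left(1 - r_{23}^2\right)

% Variance of the slope estimator:
\operatorname{Var}(\hat{b}_2) = \frac{\sigma^2}{\sum x_2^2 \,\left(1 - r_{23}^2\right)}
```

As r_{23} approaches +1 or -1, the factor (1 - r_{23}^2) approaches zero, so the slope becomes unstable and its variance grows without bound.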
OLS Assumption 7: No independent variable is a perfect linear function of other explanatory variables (no
perfect multicollinearity)
Perfect correlation occurs when two variables have a Pearson’s correlation coefficient of +1 or -1.
When one of the variables changes, the other variable also changes by a completely fixed proportion.
The two variables move in unison.
Perfect correlation suggests that two variables are different forms of the same variable.
For example, over a fixed number of games, games won and games lost have a perfect negative correlation (-1).
Temperatures in Fahrenheit and Celsius have a perfect positive correlation (+1).
Ordinary least squares cannot distinguish one variable from the other when they are perfectly correlated.
If you specify a model that contains independent variables with perfect correlation, your statistical
software can’t fit the model, and it will display an error message.
You must remove one of the variables from the model to proceed.
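The following is a minimal sketch of why the fit fails, using the Fahrenheit/Celsius example from above. The data values and the helper names (`pearson_r`, `gram_matrix`, `det3`) are illustrative assumptions, not part of the original slides. It shows that the correlation is +1 and that the matrix X'X of the normal equations has determinant zero, so it cannot be inverted.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gram_matrix(columns):
    """X'X for a design matrix given as a list of columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Illustrative data: Celsius in multiples of 5 so Fahrenheit stays an integer.
celsius = [0, 5, 10, 15, 20, 25]
fahrenheit = [9 * c // 5 + 32 for c in celsius]
ones = [1] * len(celsius)  # intercept column

r = pearson_r(celsius, fahrenheit)
det = det3(gram_matrix([ones, celsius, fahrenheit]))

print("correlation:", r)
print("det(X'X):", det)  # integer arithmetic, so the zero is exact
```

Because Fahrenheit is an exact linear function of the intercept column and Celsius, the three columns are linearly dependent, which is what makes X'X singular.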
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.986699903
R Square            0.973576698
Adjusted R Square   0.845273785
Standard Error      5.220060351
Observations        10

ANOVA
             df   SS           MS         F             Significance F
Regression   2    8032.00776   4016.004   294.7630701   1.75E-07
Residual     8    217.992241   27.24903
Total        10   8250

               Coefficients   Standard Error   t Stat     P-value       Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept      -31.4985451    5.30169743       -5.94122   0.000345362   -43.7243    -19.2728    -43.7242813   -19.27280891
X Variable 1   0              0                65535      #NUM!         0           0           0             0
X Variable 2   0.882638215    0.05140983       17.16867   #NUM!         0.764087    1.00119     0.764086928   1.001189502
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.990719
R Square            0.981525
Adjusted R Square   0.976246
Standard Error      4.666306
Observations        10

ANOVA
             df   SS         MS         F         Significance F
Regression   2    8097.579   4048.79    185.943   8.5717E-07
Residual     7    152.4209   21.77442
Total        9    8250

               Coefficients   Standard Error   t Stat     P-value   Lower 95%      Upper 95%   Lower 95.0%   Upper 95.0%
Intercept      -33.3083       4.852675         -6.86391   0.00024   -44.783087     -21.8336    -44.783087    -21.833581
X Variable 1   -2.92888       4.237158         -0.69124   0.51168   -12.9481704    7.090402    -12.94817     7.090402207
X Variable 2   1.494507       0.86122          1.735337   0.12626   -0.54195537    3.53097     -0.5419554    3.530969994
Sources of Multicollinearity
Practical consequences of Multicollinearity
Large variances and standard errors of coefficients: when the correlation between pairs of
explanatory variables is high, the precision of the estimators falls.
Wider confidence intervals: a consequence of the larger standard errors.
Insignificant t-ratios: a consequence of the larger standard errors (H0 is not rejected).
High R² but low t-ratios: seemingly contradictory results.
Estimators and standard errors are very sensitive to small changes in the data: they are unstable.
Wrong signs of coefficients: signs may contradict economic and finance theory.
Difficulty in assessing the individual contribution of regressors to the explained
variation: due to the correlated regressors.
CONTD
Although there is no formal test for what value of the VIF should cause concern about
multicollinearity, the commonly given rule of thumb is that VIFs of 10 and higher (tolerances
of 0.10 or less) indicate that multicollinearity is a problem. Variance
Inflation Factors (VIF) are also used in this study.
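As a rough illustration of how a VIF is obtained, the sketch below regresses each explanatory variable on the others (an auxiliary regression) and applies VIF_j = 1 / (1 - R_j^2), with tolerance as its reciprocal. The data are synthetic and the function names (`solve`, `r_squared`, `vif`) are my own; statistical packages provide this directly, so treat this only as a way to see the mechanics.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(y, regressors):
    """R^2 from an OLS regression of y on the regressors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in c] for c in regressors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """VIF for each column: 1 / (1 - R_j^2) from its auxiliary regression."""
    result = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        result.append(1.0 / (1.0 - r_squared(col, others)))
    return result

# Synthetic data (illustrative only): x2 is roughly 2 * x1, x3 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5, 3, 8, 1, 7, 2, 6, 4]

vifs = vif([x1, x2, x3])
for name, v in zip(["x1", "x2", "x3"], vifs):
    flag = "concern" if v >= 10 else "ok"
    print(f"{name}: VIF = {v:.2f}, tolerance = {1 / v:.4f} ({flag})")
```

With these values the two nearly proportional variables should land far above the rule-of-thumb threshold of 10, while the unrelated one stays close to 1.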
In addition to VIF, multicollinearity among qualitative (discrete)
variables is detected using the contingency coefficient (correlation matrix). A contingency
coefficient value that exceeds 0.75 is associated with a multicollinearity problem for
discrete variables (Gujarati, 2004). To detect this problem, the coefficient of contingency is
computed from the survey data. The contingency coefficients are calculated as follows.
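The formula itself did not survive on this slide. The sketch below assumes the intended measure is Pearson's contingency coefficient, C = sqrt(chi2 / (n + chi2)), where chi2 is the chi-square statistic of the cross-tabulation and n is the total count. That is an assumption, and the function name and example tables are hypothetical.

```python
import math

def contingency_coefficient(table):
    """Pearson's contingency coefficient C = sqrt(chi2 / (n + chi2)).

    `table` is a cross-tabulation of two discrete variables,
    given as a list of rows of observed counts.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n + chi2))

# Hypothetical 2x2 tables of two dummy variables.
independent = [[25, 25], [25, 25]]   # no association
identical = [[50, 0], [0, 50]]       # the two dummies always agree

print(contingency_coefficient(independent))
print(contingency_coefficient(identical))
```

One caveat on the 0.75 cut-off under this assumed formula: for a 2x2 table C cannot exceed sqrt(1/2), about 0.707, so the threshold can only be reached when at least one variable has more than two categories.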
Multicollinearity: Remedial Measures
1. Do nothing
If multicollinearity is mild or if the purpose is only forecasting.
If data deficiency is the problem, we have no choice over the data.
It is better to try to increase the data by extending the sample if possible.
Multicollinearity is not a problem if theory permits us to estimate the missing
coefficient, e.g. in the Cobb-Douglas production function, if we assume constant
returns to scale, then either of alpha and beta can be derived once the other is
estimated by regression.
2. Drop one of the variables
Drop the one that is less significant or the one with the larger VIF. But this
may lead to wrong model specification or may go against theoretical
considerations (e.g. dropping the price of a substitute in a demand function). An
example of dropping variables is imports and exports in a GDP equation.
CONTD
3. Transform or combine the variables
Combine the variables (e.g. add exports and imports to get a new variable labeled openness).
Another option is to convert the variables: for import = f(GNP, CPI), we can divide
by CPI to get real imports = f(real GNP), but the error term may become heteroskedastic.
Another way is to use the first-difference form (at the loss of one observation):
$Y_t - Y_{t-1} = \beta_2 (X_{2t} - X_{2,t-1}) + \beta_3 (X_{3t} - X_{3,t-1}) + v_t$
(this may not be appropriate for cross-sectional data, where differencing has no meaning).
4. Get additional data and increase the sample size or
5. Combine cross section and time series (pool) or
6. Use of panel data or
7. Use ridge regression, factor analysis etc
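The first-difference transformation mentioned under remedy 3 can be sketched as follows. The two series are made-up trending regressors and the helper names are hypothetical. The point is only to show the mechanics: one observation is lost, and when the collinearity comes from a shared trend the differenced series may be much less correlated than the levels. This is not guaranteed for every data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_difference(series):
    """Return X_t - X_{t-1} for t = 2..n; the first observation is lost."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Synthetic time series that both trend upward (illustrative values only).
x2 = [1, 2, 4, 5, 7, 8, 10, 11]
x3 = [2, 5, 8, 9, 10, 13, 16, 17]

dx2 = first_difference(x2)
dx3 = first_difference(x3)

r_levels = pearson_r(x2, x3)
r_diffs = pearson_r(dx2, dx3)

print("observations in levels:", len(x2), "in differences:", len(dx2))
print("correlation in levels:      ", round(r_levels, 3))
print("correlation in differences: ", round(r_diffs, 3))
```

The regression would then be run on the differenced Y and the differenced regressors, without an intercept if the equation above is taken literally.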
Multicollinearity
Examples from Business and Economics
Consider the data given in MC.
It contains data on quantity demanded, prices, monthly income in
thousands, and the prices of two different substitutes.
The first example has 50 observations in total. But first let us consider a small
example.
The data are as follows:
Auxiliary regressions are produced using Microsoft Excel:
Let us use the file MC.xlsx and fit different regressions. First, look at the correlation matrix.
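For readers working outside Excel, a correlation matrix can be built and screened along these lines. The variable names and values below are invented for illustration and are not the contents of MC.xlsx. The 0.8 screening threshold is a common rule of thumb that I am assuming here, not something stated on the slide.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(data):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(data)
    return {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

def high_pairs(data, threshold=0.8):
    """Pairs of variables whose absolute correlation reaches the threshold."""
    names = list(data)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(data[a], data[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Invented regressors (NOT the MC.xlsx data).
data = {
    "price": [10, 12, 14, 16, 18, 20],
    "income": [30, 34, 41, 44, 52, 55],
    "sub_price": [9, 4, 7, 5, 8, 6],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name.ljust(10), "  ".join(f"{row[other]:6.3f}" for other in data))

flagged = high_pairs(data)
print("highly correlated pairs:", flagged)
```

Pairwise correlations are only a first screen: a variable can be nearly a linear combination of several others without any single pairwise correlation being high, which is why the auxiliary regressions and VIFs are checked as well.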
Auto data 02 excel
Let us consider the price of a used car and the different characteristics of the car.
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       rep78 |   910.9859   304.5274     2.99   0.004     302.6226    1519.349
         mpg |  -106.7122   81.15836    -1.31   0.193    -268.8446    55.42027
      weight |   4.959534   1.119624     4.43   0.000     2.722827    7.196241
      length |  -115.0177   38.56456    -2.98   0.004    -192.0592   -37.97612
       _cons |   11934.51   5774.178     2.07   0.043     399.2604    23469.75
------------------------------------------------------------------------------
In the table above, the coefficient for length is negative, which is not expected, because in most cases
the price increases as car length increases (see figure below).
We also expect a correlation between the weight and length of a car. In most cases, as the
length of a car increases, its weight also increases. Therefore, we expect a strong positive correlation between
these two variables.
The correlation coefficient (0.9478) indicates that there is a high correlation between length and
weight.
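As a quick worked check, the reported correlation can be translated into a lower bound on the variance inflation for weight and length. This arithmetic is mine, not from the slides, and it relies on the fact that the R² of an auxiliary regression on all other regressors can never be smaller than the squared pairwise correlation with one of them.

```latex
r = 0.9478 \quad\Rightarrow\quad r^2 \approx 0.8983

\frac{1}{1 - r^2} \approx \frac{1}{0.1017} \approx 9.84

% Since R_j^2 \ge r^2 once mpg and rep78 are also included:
\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \;\ge\; \frac{1}{1 - r^2} \approx 9.84
```

So the pairwise correlation alone already puts the VIF of both variables at roughly 9.8 or more, right at the edge of the rule-of-thumb threshold of 10.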
We can also test for multicollinearity using the VIF:
. vif
The VIF is high (VIF > 10) for weight and length. Therefore, both checks indicate that there is
high multicollinearity in the model.
One solution to this problem is to drop one of the collinear variables. Let us drop
weight.
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       rep78 |   698.2603   341.0836     2.05   0.045     17.06941    1379.451
         mpg |  -178.1566   90.21653    -1.97   0.053    -358.3315    2.018202
      length |   30.68091   22.83613     1.34   0.184      -14.926    76.28781
       _cons |   1783.936   6011.455     0.30   0.768    -10221.77    13789.64
------------------------------------------------------------------------------
. vif
We can see that there is no serious multicollinearity (VIF < 10) after dropping the weight variable.