Assignment For Viva
Definition
Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables in a regression model are highly correlated with each other. In other words,
multicollinearity indicates a strong linear relationship among the predictor variables. This can
create challenges in the regression analysis because it becomes difficult to determine the
individual effects of each independent variable on the dependent variable accurately.
Example:
1. Determining the electricity consumption of a household from the household income
and the number of electrical appliances. Here, we know that the number of electrical
appliances in a household tends to increase with household income, so the two
predictors are correlated; however, neither can simply be removed from the dataset.
2. Creating a variable for BMI from the height and weight variables would include
redundant information in the model, and the new variable would be highly correlated
with the variables it was derived from.
3. Including variables for temperature in Fahrenheit and temperature in Celsius.
4. In a dataset containing a marital status variable with two unique values,
'married' and 'single', creating dummy variables for both of them would include
redundant information. We can make do with only one variable containing 0/1 for
'married'/'single' status.
What Causes Multicollinearity?
Multicollinearity could occur due to the following problems:
1. Multicollinearity could exist because of the problems in the dataset at the time of
creation. These problems could be because of poorly designed experiments, highly
observational data, or the inability to manipulate the data.
2. Multicollinearity could also occur when new variables are created which are
dependent on other variables.
3. Including identical variables in the dataset.
4. Inaccurate use of dummy variables can also cause a multicollinearity problem. This is
called the Dummy variable trap.
5. Insufficient data, in some cases, can also cause multicollinearity problems.
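A small R sketch (with hypothetical, simulated data, not taken from any of the problems below) illustrates cause 2: a variable derived from other regressors, such as BMI computed from height and weight, is strongly related to them, which inflates the variance inflation factors (VIF).
set.seed(1)
height <- rnorm(100, mean = 170, sd = 10)        # cm
weight <- 0.6 * height + rnorm(100, sd = 8)      # kg, itself correlated with height
bmi    <- weight / (height / 100)^2              # derived from the other two variables
# Manual VIF: regress each predictor on the remaining predictors
vif_manual <- function(x, others) 1 / (1 - summary(lm(x ~ ., data = others))$r.squared)
d <- data.frame(height, weight, bmi)
sapply(names(d), function(v) vif_manual(d[[v]], d[setdiff(names(d), v)]))
# VIFs for weight and bmi come out large, flagging the redundancy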
Heteroscedasticity
Definition
Heteroscedasticity means that the variance of the disturbances u_i is not constant for all i, conditional on
X, because it depends on one or several variables. The presence of heteroscedasticity is quite
frequent when working with cross-section data.
Let us assume a regression model that determines consumption as a function of income. The
variance of the error term might be expected to increase as income increases.
Examples:
1. The range in family income between the poorest and richest family in town is the
classical example of heteroscedasticity.
2. The range in annual sales between a corner drug store and general store.
3. When the prices of a product are studied at the launch of a new model, heteroscedasticity
is predictable; but for rainfall or income comparisons, the nature of the dispersion cannot
be predicted.
• Among other reasons, the nature of the variable can be a major cause.
• It majorly occurs in cross-sectional studies.
Detection Of Heteroscedasticity
Informal methods to identify the problem of heteroscedasticity
1. Checking Nature of the problem
2. Graphical inspection of residuals
Formal methods to identify the problem of heteroscedasticity
1. Park Test
2. Glejser test
3. White's test
4. Spearman's rank correlation test
5. Goldfeld-Quandt test
6. Breusch- Pagan test
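As a quick illustration, here is a hedged R sketch (using simulated data and the lmtest package, both of which are assumptions and not part of the assignment) applying two of the formal tests listed above, the Breusch-Pagan and Goldfeld-Quandt tests, together with an informal residual plot.
library(lmtest)
set.seed(42)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)   # error variance grows with x: heteroscedastic
fit <- lm(y ~ x)
bptest(fit)                    # Breusch-Pagan test; H0: constant error variance
gqtest(fit)                    # Goldfeld-Quandt test: compares residual variance in two subsamples
plot(fitted(fit), resid(fit))  # informal method: graphical inspection of the residuals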
Autocorrelation
Definition
Autocorrelation means that the disturbance terms of different observations are correlated, that is,
E(u_i u_j) ≠ 0  for  i ≠ j
For example, the disruption caused by a strike this quarter may very well affect output next quarter,
and an increase in the consumption expenditure of one family may very well prompt another
family to increase its consumption expenditure.
Example:
Ex1: Athletes competing against exceptionally good or bad teams. This is especially evident
in baseball because teams play each other 3-4 times in a row.
Ex2: Certain pests which inhibit plant growth may be more
prevalent in some areas
Ex3: A technical analyst can learn how the stock price of a particular day is affected by those
of previous days through autocorrelation. Thus, he can estimate how the price will move in
the future.
Reasons for Autocorrelation:
i) Inertia
Inertia or sluggishness in economic time-series is a great reason for autocorrelation. For
example, GNP, production, price index, employment, and unemployment exhibit business
cycles. Starting at the bottom of the recession, when the economic recovery starts, most of
these series start moving upward. In this upswing, the value of a series at one point in time is
greater than its previous values. These successive periods (observations) are likely to be
interdependent.
ii) Omitted Variables Specification Bias
The residuals (which are proxies for u_i) may suggest that some variables that were originally
candidates but were not included in the model (for a variety of reasons) should be included.
This is the case of excluded-variable specification bias. Often the inclusion of such variables
may remove the correlation pattern observed among the residuals.
iii) Model Specification: Incorrect Functional Form
Autocorrelation can also occur due to mis-specification of the model.
iv) Effect of Cobweb Phenomenon
For many agricultural commodities, the quantity supplied in period t depends on their price
in period t−1. This is called the Cobweb phenomenon, because the decision to plant a crop
in period t is influenced by the price of the commodity in the previous period.
v) Effect of Lagged Relationship
Many times in business and economic research the lagged values of the dependent variable
are used as explanatory variables. For example, to study the effect of tastes and habits on
consumption in period t, consumption in period t−1 is used as an explanatory variable, since
consumers do not change their consumption habits readily for psychological, technological, or
institutional reasons.
Consequences of Autocorrelation:
1. OLS is no longer efficient, since it treats all observations as being of equal importance,
while there is much more information about the line to be had from some observations
than from others; when estimating the line, more attention should be paid to the
observations having small variances than to those with larger variances.
2. The standard errors are biased, and thus hypothesis tests and confidence intervals
based on the t-distribution are likely to be invalid.
3. When the disturbance terms are serially correlated, the OLS estimators of the β̂s
are still unbiased and consistent, but the optimality property (minimum variance)
is not satisfied.
4. The OLS estimators will be inefficient and therefore no longer BLUE.
5. The estimated variances of the regression coefficients will be biased and inconsistent
and will be greater than the variances estimated by other methods; therefore,
hypothesis testing is no longer valid. In most cases, R² will be overestimated
(indicating a better fit than the one that truly exists), and the t- and F-statistics
will tend to be higher.
Detection Method of Autocorrelation:
1. Durbin-Watson Test
2. Ljung-Box Q Test
3. ACF plots
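A brief R sketch of these detection tools, assuming simulated AR(1) errors and the lmtest package (both are illustrative assumptions):
library(lmtest)
set.seed(7)
t   <- 1:50
e   <- as.numeric(arima.sim(model = list(ar = 0.7), n = 50))  # errors with first-order autocorrelation
y   <- 5 + 0.3 * t + e
fit <- lm(y ~ t)
dwtest(fit)                                         # 1. Durbin-Watson test
Box.test(resid(fit), lag = 5, type = "Ljung-Box")   # 2. Ljung-Box Q test on the residuals
acf(resid(fit))                                     # 3. ACF plot of the residuals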
Remedial Measure of Autocorrelation:
1. Cochrane-Orcutt Procedure
2. Hildreth-Lu Procedure
3. First Differences Procedure
4. Forecasting Issues
Independent Observations
The fifth assumption about regression and correlation analysis is that successive residuals
should be independent. This means that there is no pattern to the residuals, the residuals are
not highly correlated, and there are no long runs of positive or negative residuals. When
successive residuals are correlated, we refer to this condition as autocorrelation.
Autocorrelation frequently occurs when the data are collected over a period of time. For
example, we wish to predict yearly sales of Ages Software Inc. based on time and the amount
spent on advertising. The dependent variable is yearly sales, and the independent variables are time
and the amount spent on advertising. It is likely that for a period of time the actual points will be
above the regression plane (remember there are two independent variables) and then for a
period of time the points will be below the regression plane. The graph below shows the
residuals plotted on the vertical axis and the fitted values Ŷ on the horizontal axis. Note the run
of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot
such as this would indicate possible autocorrelation.
There is a test for autocorrelation, called Durbin-Watson. We present the details of this test in
Chapter 16, Section 16.10.
Successive residuals are correlated in time series data because an event in one time period often
influences the event in the next period. To explain, suppose the owner of a furniture store decides to
have a sale this month and spends a large amount of money advertising the event. We would
expect a correlation between sales and advertising expense, but not all the results of the increase
in advertising are experienced this month. It is likely that some of the effect of the
advertising carries over into next month. Therefore, we expect correlation among the residuals.
The regression relationship in a time series is written
Y_t = α + βX_t + e_t
where the subscript t is used in place of i to suggest the data were collected over time. If the
residuals are correlated, problems occur when we try to conduct tests of hypotheses about the
regression coefficients. Also, a confidence interval or a prediction interval, where the multiple
standard error of estimate is used, may not yield the correct results. The autocorrelation,
reported as r, is the strength of the association among the residuals. The r has the same meaning
as the coefficient of correlation: values close to −1.00 or 1.00 indicate a strong
association, and values near 0 indicate no association. Instead of directly conducting a
hypothesis test on r, we use the Durbin-Watson statistic.
The Durbin-Watson statistic, identified by the letter d, is computed by first determining the
residuals for each observation, that is, e_t = (Y_t − Ŷ_t). Next, we compute d using the following
relationship.
DURBIN-WATSON STATISTIC    $d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$    [16-4]
To determine the numerator of formula (16-4), we lag each of the residuals one period and then
square the difference between consecutive residuals. This may also be called finding the
differences. It accounts for summing the observations from 2, rather than from 1, up to n. In
the denominator, we square the residuals and sum over all n observations.
The value of the Durbin-Watson statistic can range from 0 to 4. The value of d is 2.00 when
there is no autocorrelation among the residuals. When the value of d gets close to zero, this
indicates positive autocorrelation. Values beyond 2 indicate negative autocorrelation. Negative
autocorrelation seldom exists in practice. For it to occur, successive residuals would tend to be large
but would have opposite signs.
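Formula (16-4) is easy to compute directly; a minimal R sketch, assuming only a vector of residuals e ordered in time (the numbers below are made up for illustration):
durbin_watson <- function(e) {
  sum(diff(e)^2) / sum(e^2)    # squared first differences over the sum of squared residuals
}
e <- c(-0.92, 2.08, -1.30, 0.50, 1.10, -0.40)   # hypothetical residuals
durbin_watson(e)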
To conduct a test for autocorrelation, the null and alternative hypotheses are:
𝐻0 : 𝑁𝑜 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑐𝑜𝑟𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛 (𝜌 = 0)
𝐻1 : 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑐𝑜𝑟𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛 (𝜌 > 0)
Recall from the previous chapter that r refers to the sample correlation and that ρ is the
correlation coefficient in the population. The critical values for d are reported in Appendix
B.10. To determine the critical value, we need α (the significance level), n (the sample size),
and k (the number of independent variables). The decision rule for the Durbin-Watson test is
altered from what we are used to. As usual, there is a range of values where the null hypothesis
is rejected and a range where it is not rejected; however, there is also a range of
values where d is inconclusive. That is, in the inconclusive range the null hypothesis is neither
rejected nor not rejected. To state this more formally:
• Values less than dL cause the rejection of the null hypothesis.
• Values greater than dU will result in the null hypothesis not being rejected.
• Values of d between dL and dU yield inconclusive results.
The subscript L refers to the lower limit of d and the subscript U to the upper limit. How do we
interpret the various decisions of the test for residual correlation? If the null hypothesis is not
rejected, we conclude that autocorrelation is not present: the residuals are not correlated, and the
regression assumption has been met. There will not be any problem with the estimated value of
the standard error of estimate. If the null hypothesis is rejected, then we conclude that
autocorrelation is present.
The usual remedy for autocorrelation is to include another predictor variable that captures the
time order. Another approach is to transform the dependent variable; for example, we might use the
square root of Y instead of Y, which changes the distribution of the residuals. If the result falls
in the inconclusive range, more sophisticated tests are needed, or, conservatively, we treat the
conclusion as rejecting the null hypothesis.
An example will show the details of the Durbin-Watson test and how the results are interpreted.
Example:
Banner Rocker Company manufactures and markets rocking chairs. The company developed a
special rocker for senior citizens, which it advertises extensively on TV. Banner's market for
the special chair is the Carolinas, Florida, and Arizona, where there are many senior citizens
and retired people. The president of Banner Rocker is studying the association between his
advertising expense (X) and the number of rockers sold over the last 20 months (Y). He
collected the following data. He would like to create a model to forecast sales based on the
amount spent on advertising, but is concerned that, because he gathered these data over
consecutive months, there might be problems with autocorrelation.
Month Sales(000) Advertising($millions)
1 153 5.5
2 156 5.5
3 153 5.3
4 147 5.5
5 159 5.4
6 160 5.3
7 147 5.5
8 147 5.7
9 152 5.9
10 160 6.2
11 169 6.3
12 176 5.9
13 176 6.1
14 179 6.2
15 184 6.2
16 181 6.5
17 192 6.7
18 205 6.9
19 215 6.5
20 209 6.4
Determine the regression equation. Is advertising a good predictor of sales ? If the owner were
to increase the amount spent on advertising by $1,000,000, how many additional chairs can he
expect to sell? Investigate the possibility of autocorrelation.
Solution:
The first step is to determine the regression equation.
Regression Equation
Sales = -43.8 + 35.95 Advertising

Coefficients
Term          Coef    SE Coef   T-Value   P-Value   VIF
Constant      -43.8   34.4      -1.27     0.220
Advertising   35.95    5.75      6.26     0.000     1.00

Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
12.3474   68.50%   66.75%      61.86%

Analysis of Variance
Source          DF   Adj SS   Adj MS   F-Value   P-Value
Regression       1   5968     5967.7   39.14     0.000
  Advertising    1   5968     5967.7   39.14     0.000
Error           18   2744      152.5
  Lack-of-Fit   10   1472      147.2    0.93     0.554
  Pure Error     8   1272      159.0
Total           19   8712
The coefficient of determination is 68.5 percent, so we know there is a strong
positive association between the variables. We conclude that, as we increase the
amount spent on advertising, we can expect to sell more chairs. Of course, this is
what we had hoped.
Ŷ = −43.80 + 35.950X
This equation indicates that an increase of 1 in X will result in an increase of 35.95 in Y. So
an increase of $1,000,000 in advertising will increase sales by 35,950 chairs. To put it another
way, it will cost $27.82 in additional advertising expense per chair sold, found by
$1,000,000/35,950.
What about the potential problem with autocorrelation? Many software packages, such as
Minitab, will calculate the value of the Durbin-Watson statistic. To see the details of formula
(16-4), we use an Excel spreadsheet.
To investigate the possible autocorrelation, we need to determine the residuals for each
observation. We find the fitted values, that is, the Ŷ, for each of the 20 months. This
information is shown in the fourth column, column D. Next we find the residual, which is the
difference between the actual value and the fitted value. So for the first month:
Ŷ = −43.80 + 35.950X = −43.80 + 35.950(5.5) = 153.925
The residual, reported in column E, is slightly different due to rounding in the software. Notice
in particular the string of five negative residuals in rows 8 through 12. In column F, we lag the
residuals one period. In column G, we find the difference between the current residual and the
residual in the previous period and square this difference. Using the values from the software:
(e_t − e_{t−1})² = (e_2 − e_1)² = [2.0763 − (−0.9237)]² = (3.0000)² = 9.0000
The other values in column G are found the same way. The values in column H are the squares
of those in column E:
(e_1)² = (−0.9237)² = 0.8531
To find the value of d, we need the sums of columns G and H. These sums are noted in yellow
in the spreadsheet.
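The same calculation can be reproduced in R; a sketch using the Banner Rocker data keyed in from the table above (the resulting d should agree with the value of about 0.85 reported below, apart from rounding):
sales <- c(153, 156, 153, 147, 159, 160, 147, 147, 152, 160,
           169, 176, 176, 179, 184, 181, 192, 205, 215, 209)
adv   <- c(5.5, 5.5, 5.3, 5.5, 5.4, 5.3, 5.5, 5.7, 5.9, 6.2,
           6.3, 5.9, 6.1, 6.2, 6.2, 6.5, 6.7, 6.9, 6.5, 6.4)
fit <- lm(sales ~ adv)
e   <- resid(fit)                   # residuals in time order (column E of the spreadsheet)
d   <- sum(diff(e)^2) / sum(e^2)    # Durbin-Watson statistic, formula (16-4)
d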
Now to answer the question as to whether there is significant autocorrelation. The null and the
alternative hypotheses are stated as follows.
𝐻0 : 𝑁𝑜 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑐𝑜𝑟𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛
𝐻1 : 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑎𝑢𝑡𝑜𝑐𝑜𝑟𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛
The critical value of d is found in Appendix B.10, a portion of which is shown below. There is
one independent variable, so k = 1, the level of significance is 0.05, and the sample size is 20.
We move to the 0.05 table, the column where k = 1, and the row for n = 20. The reported values are
dL = 1.20 and dU = 1.41. The null hypothesis is rejected if d < 1.20 and not rejected if d > 1.41. No
conclusion is reached if d is between 1.20 and 1.41.
Critical values of d, α = 0.05
        k = 1           k = 2
n       dL     dU       dL     dU
15      1.08   1.36     0.95   1.54
16      1.10   1.37     0.98   1.54
17      1.13   1.38     1.02   1.54
18      1.16   1.39     1.05   1.53
19      1.18   1.40     1.08   1.53
20      1.20   1.41     1.10   1.54
21      1.22   1.42     1.13   1.54
22      1.24   1.43     1.15   1.54
23      1.26   1.44     1.17   1.54
24      1.27   1.45     1.19   1.55
25      1.29   1.45     1.21   1.55
(Figure: the d scale from 0 to 4, with dL and dU marking the rejection, inconclusive, and non-rejection regions.)
Because the computed value of d is 0.8522, which is less than dL, we reject the null
hypothesis and accept the alternative hypothesis. We conclude that the residuals are autocorrelated.
We have violated one of the regression assumptions. What do we do? The presence of
autocorrelation usually means that the regression model has not been correctly specified. It is
likely we need to add one or more independent variables that have some time-ordered effects
on the dependent variable. The simplest independent variable to add is one that represents the
time periods.
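A sketch of that remedy in R, reusing the sales and adv vectors keyed in earlier and simply adding the month number as a time-order variable (an illustrative fix, not a full respecification):
month <- 1:20
fit2  <- lm(sales ~ adv + month)    # add a predictor that captures the time order
summary(fit2)
library(lmtest)
dwtest(fit2)                        # check whether the residual autocorrelation is reduced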
CHAPTER 5
TIME SERIES
5.1 Time Series Models
A time series model accounts for patterns in the past movements of a variable and uses that
information to predict its future movements; i.e., it is a sophisticated method of extrapolating
data. At times it is desirable to smooth a time series and thus eliminate some of the more volatile
short-term fluctuations.
5.2 Modeling Trend By Using Polynomial Functions
We begin with simple models that can be used to forecast a time series on the basis of its past
behavior. Most of the series we encounter are not continuous in time; instead they consist of
discrete observations made at regular intervals of time. We denote the values of a time series
by {y_t}, t = 1, 2, …, T.
5.3 Autocorrelation
The assumption that errors corresponding to different observations are uncorrelated often
breaks down in time series data. When the error terms from different time periods are
correlated, we say that the error term is autocorrelated. For example, if we are predicting the
growth of stock dividends, an overestimate in one year is likely to lead to overestimates in the
succeeding years.
Figure 5.2: Positive and Negative Autocorrelation
In this section we mainly deal with the problem of first-order autocorrelation, in which errors
in one time period are correlated directly with errors in the ensuing period. Autocorrelation can
be positive as well as negative.
Autocorrelation will not affect the unbiasedness or consistency of the OLS estimators, but it does
affect their efficiency.
We assume that each of the error terms in a linear regression model is drawn from a normal
population with zero expected value and constant variance, but that the errors are not independent
over time. Here the model is
𝑌𝑡 = 𝛽0 + 𝛽1 𝑋1𝑡 + 𝛽2 𝑋2𝑡 +. . . +𝛽𝑘 𝑋𝑘𝑡 + 𝜀𝑡 , t=1,2,…,T
With
ε_t = ρε_{t−1} + v_t ,   0 ≤ |ρ| < 1    (5.5)
where v_t is distributed N(0, σ_v²) and is independent of other errors over time, and ε_t is
distributed N(0, σ_ε²) but is not independent of other errors over time.
Cov(ε_t, ε_{t−1}) = E(ε_t ε_{t−1}) = E[(ρε_{t−1} + v_t) ε_{t−1}] = ρE(ε_{t−1}²) = ρσ_ε²    (5.7)
Likewise,
Cov(ε_t, ε_{t−2}) = E(ε_t ε_{t−2}) = ρ²σ_ε² ,  Cov(ε_t, ε_{t−3}) = E(ε_t ε_{t−3}) = ρ³σ_ε² , … ,  Cov(ε_t, ε_{t−r}) = E(ε_t ε_{t−r}) = ρ^r σ_ε²    (5.8)
A useful formula for the first-order autocorrelation coefficient ρ is
ρ = Cov(ε_t, ε_{t−1}) / σ_ε²    (5.9)
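A small derivation, implicit in the text, shows where the constant variance σ_ε² in (5.7)-(5.9) comes from: taking variances in the AR(1) scheme (5.5) and assuming stationarity,
$\sigma_\varepsilon^2 = \operatorname{Var}(\varepsilon_t) = \rho^2 \operatorname{Var}(\varepsilon_{t-1}) + \sigma_v^2 = \rho^2 \sigma_\varepsilon^2 + \sigma_v^2 \;\Longrightarrow\; \sigma_\varepsilon^2 = \frac{\sigma_v^2}{1-\rho^2}, \qquad \operatorname{Corr}(\varepsilon_t, \varepsilon_{t-r}) = \frac{\rho^r \sigma_\varepsilon^2}{\sigma_\varepsilon^2} = \rho^r .$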
5.3.1 Tests for autocorrelation
Durbin-Watson Test
We shall now consider a test of the null hypothesis that no autocorrelation is present (ρ = 0).
By far the most popular test for autocorrelation is the Durbin-Watson test. This test involves the
calculation of a test statistic based on the OLS residuals. The statistic is defined as
$DW = \dfrac{\sum_{t=2}^{T} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2}$    (5.10)
When successive values of 𝜀̂𝑡 are close to each other, the DW statistic will be low, indicating
the presence of positive autocorrelation. By taking several approximations, it is possible to
show that
𝐷𝑊 = 2(1 − 𝜌) (5.11)
The DW statistic will lie in the range of 0 to 4, with a value near 2 indicating no first order
autocorrelation. Positive autocorrelation is associated with DW values below 2, and negative
autocorrelation is associated with DW values above 2.
Table 5.1: Durbin-Watson Table
Value of DW              Result
0 < DW < dL              Reject null hypothesis; positive autocorrelation
dL < DW < dU             Indeterminate
dU < DW < 2              Accept null hypothesis
2 < DW < 4 − dU          Accept null hypothesis
4 − dU < DW < 4 − dL     Indeterminate
4 − dL < DW < 4          Reject null hypothesis; negative autocorrelation
Exact interpretation of the DW statistic is difficult because the sequence of error terms depends
not only on the sequence of the ε's, but also on the sequence of all the X values. For this reason,
most tables include test statistics that vary with the number of independent variables and the
number of observations. Two limits are given by Durbin and Watson (1950 and 1951), usually
labeled dL and dU. These limits help to summarize the DW test as shown in the above table.
Example 5.1:
Here we consider a time series data set given by Chatterjee and Hadi.
Table 5.2: Consumer Expenditure Data
Consumer Expenditure Money Stock
214.6 159.3
217.7 161.2
219.6 162.8
227.2 164.6
230.9 165.9
233.3 167.9
234.1 168.3
232.3 169.7
233.7 170.5
236.5 171.6
Figure 5.3:Time Series Plot of Residuals for the Consumer Expenditure Data
The time series plot of the residuals indicates that positive autocorrelation is present in the data. For
these data we obtain DW = 0.328.
At the 5% level of significance, the critical values corresponding to n = 20 are dL = 1.20 and
dU = 1.41. Since the observed value of d is less than dL, we reject the null hypothesis and
conclude that positive autocorrelation is present in our data.
5.3.2 Corrections for Autocorrelation
Generalized Difference Method
If ρ were known, it would be easy to adjust the OLS regression method to obtain efficient
estimates of the parameters. This procedure involves the use of generalized differencing to transform
the linear model into one in which the errors are independent.
Let us assume the model
Y_t = β0 + β1 X_1t + β2 X_2t + … + βk X_kt + ε_t ,  t = 1, 2, …, T    (5.12)
Lagging Eq. (5.12) one period gives Eq. (5.13). Multiplying Eq. (5.13) by ρ and subtracting it from Eq. (5.12), we obtain the transformed model
Y*_t = β0(1 − ρ) + β1 X*_1t + β2 X*_2t + … + βk X*_kt + v_t
where
Y*_t = Y_t − ρY_{t−1} ,  X*_1t = X_1t − ρX_{1,t−1} , … ,  X*_kt = X_kt − ρX_{k,t−1} ,  and  v_t = ε_t − ρε_{t−1}    (5.14)
Now it is easy to show that
V(v_t) = E(v_t²) = E[(ε_t − ρε_{t−1})²] = E(ε_t²) − 2ρE(ε_t ε_{t−1}) + ρ²E(ε_{t−1}²) = σ_ε² − 2ρ²σ_ε² + ρ²σ_ε² = (1 − ρ²)σ_ε² = σ_v²    (5.15)
And so on.
The iterative procedure can be carried on for as many steps as desired. Standard procedure is
to stop the iterations when the new estimate of 𝜌 differs from the previous one by less than
0.001 or 0.0005.
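A minimal R sketch of this iterative scheme for a single regressor (hypothetical function and variable names; the orcutt package's cochrane.orcutt() used later in the assignment automates the same idea):
cochrane_orcutt_sketch <- function(y, x, tol = 0.001, max_iter = 50) {
  n <- length(y)
  b <- coef(lm(y ~ x))                      # start from the ordinary OLS fit
  rho_old <- Inf
  for (i in seq_len(max_iter)) {
    e   <- y - b[1] - b[2] * x              # residuals from the original equation
    rho <- sum(e[-1] * e[-n]) / sum(e^2)    # first-order autocorrelation of the residuals
    if (abs(rho - rho_old) < tol) break     # stop when the change in rho is small enough
    y_star <- y[-1] - rho * y[-n]           # generalized differences, Eq. (5.14)
    x_star <- x[-1] - rho * x[-n]
    gd_fit <- lm(y_star ~ x_star)
    b <- c(coef(gd_fit)[1] / (1 - rho),     # recover beta0 from the estimate of beta0*(1 - rho)
           coef(gd_fit)[2])
    rho_old <- rho
  }
  list(rho = unname(rho), beta0 = unname(b[1]), beta1 = unname(b[2]))
}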
Example 5.2: Example 5.1 continued. For the Consumer Expenditure Data, we have ρ̂ =
0.751. We corrected both Y and X as described in step 2, fit corrected Y on corrected X,
computed the residuals, and also computed the revised Durbin-Watson statistic, which is 1.43.
From the table given by Durbin and Watson (for n = 19 and k = 1), we obtain dL = 1.18 and dU = 1.40.
Since the D-W statistic lies between dU and 2, we may accept the null hypothesis and conclude
that there is no evidence of autocorrelation in these data.
5.4 Comparisons of Different Methods
Minitab computes three measures of accuracy of the fitted model: MAPE, MAD, and MSD for
each of the forecasting and smoothing methods. For all three measures, the smaller the value, the
better the fit of the model. Use these statistics to compare the fits of the different methods.
MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values.
It expresses accuracy as a percentage.
$MAPE = \dfrac{\sum \left| (y_t - \hat{y}_t) / y_t \right|}{T} \times 100$    (5.19)
Where 𝑦𝑡 equals the actual value, 𝑦̂𝑡 equals the fitted value, and T equals the number of
observations.
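A short R sketch of these three accuracy measures, using hypothetical actual and fitted values (the MAD and MSD definitions follow Minitab's mean absolute deviation and mean squared deviation):
accuracy_measures <- function(y, yhat) {
  e <- y - yhat
  c(MAPE = mean(abs(e / y)) * 100,   # mean absolute percentage error, Eq. (5.19)
    MAD  = mean(abs(e)),             # mean absolute deviation
    MSD  = mean(e^2))                # mean squared deviation
}
accuracy_measures(y = c(100, 110, 120), yhat = c(98, 113, 118))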
Assigment-03
Problem 1: Consider the data for the period 1981-2005 given in Table 1.1, which relate to the
Indian economy. The Y variable in the table is the aggregate (for the economy as a whole)
private final consumption expenditure (PFCE) and the X variable is gross domestic product
(GDP), a measure of aggregate income, both measured in rupee crore at 1999-2000 prices.
Therefore, the data are in "real" terms; that is, they are measured in constant (1999-2000) prices.
Table 1.1: Data on Y (Personal final consumption expenditure) and X (Gross Domestic
Product), both in 1999-00 prices measured in rupee crore.
Year Y X
1981 566866 678033
1982 572536 697861
1983 616974 752669
1984 634757 782484
1985 661249 815049
1986 682116 850217
1987 705495 880267
1988 749530 969702
1989 786725 1029178
1990 821863 1083572
1991 839593 1099072
1992 861245 1158025
1993 898682 1223816
1994 942359 1302076
1995 999729 1396974
1996 1077445 1508378
1997 1109656 1573263
1998 1181797 1678410
1999 1253643 1786526
2000 1292986 1864773
2001 1367758 1972912
2002 1397069 2047733
2003 1493871 2222591
2004 1579255 2389660
2005 1689861 2604532
Questions
1. Fit a simple linear regression model for the above data & make comments.
2. Construct the ANOVA table & test for the significance of regression.
3. Calculate t-statistics to assess the contribution of each regressor to the model. Use α = 0.05.
d=read.csv(file.choose(),header=T);d
attach(d)
# Fitting regression model
model<- lm(Y~X, data=d);model
summary(model)
# ANOVA
anova(model)
# t-statistics for the regression coefficients (note: t.test(X, Y) would be a
# two-sample comparison of means, not a test of the regressor's contribution)
summary(model)$coefficients
The ANOVA table shows that the F-value is 20452 and the p-value is 0.0021, which is less than the
5% significance level (α = 0.05). Since the p-value (0.0021) < α (0.05), we can conclude that GDP
has a statistically significant effect on PFCE and that the regression model is statistically significant.
From the R output, the t-value is 0.275 and the corresponding value on 2 degrees of freedom is
0.0241, which is less than the calculated value, so we conclude the test is statistically significant.
Problem 2: A soft drink bottler is analyzing the vending machine service routes in his
distribution system. He is interested in predicting the amount of time required by the
route driver to service the vending machines in an outlet. This service activity includes
stocking the machine with beverage products and minor maintenance or housekeeping.
The industrial engineer responsible for the study has suggested that the two most
important variables affecting the delivery time (y) are the number of cases of product
stocked (x1), and the distance walked by the route driver (x2). The engineer has collected
25 observations on delivery time, which are shown in Table 3.2.
d1=read.csv(file.choose(),header=T);d1
attach(d1)
# Fitting the regression model with both regressors (assuming the columns are named Y, X1, X2)
model1<- lm(Y~X1+X2, data=d1);model1
# ANOVA
anova(model1)
# Calculate R-square and Adj-R-square
summary(model1)
# t-statistics for each coefficient (from the model summary, rather than a two-sample t.test)
t1<-summary(model1)$coefficients;t1
Answer to the question number (1)
Regressing "Delivery time" (Y) on the two independent variables "Number of cases of product
stocked" (X1) and "Distance walked by the route driver" (X2), we get the multiple regression equation
Delivery time = 0.000143 + 0.722 × (Number of cases stocked) + 0.0012 × (Distance walked)
The coefficient on the "Number of cases stocked" (X1) is 0.722. This indicates
that, holding the distance walked constant, a one-unit increase in the number of cases
stocked is associated with an expected increase in delivery time of 0.722 units. This
suggests that as the number of cases of product stocked increases, the delivery time is expected
to increase by a certain amount.
Overall, this multiple regression equation suggests that both the number of cases of product
stocked and the distance walked by the route driver have an impact on the delivery time. The intercept
term and the coefficients of the independent variables provide insights into how changes in
these variables are associated with changes in the dependent variable.
Source      DF   Sum of Squares   Mean Square   F-value    P-value
X1           1   10157.5          10157.5       11527.35   0.000000022
X2           1   0                0              0.0003    0.98
Residual    11   9.7              0.9
From the R output, the Multiple R-squared value is 0.999 and the Adjusted R-squared value is
0.9989, which indicates that the model fits the data extremely well and that the included
independent variables are highly effective in explaining the variation in the dependent variable.
In conclusion, these high values are indicative of a well-fitting model.
Answer to the question number (4)
From the R output, the t-value is 0.286 and the corresponding value on 11 degrees of freedom is
0.87, which is greater than the calculated value, so we conclude the test is statistically insignificant.
Answer to the question number (4)
The calculated 95% confidence interval for the mean delivery time at an outlet requiring x1 = 8
cases and a distance of x2 = 275 feet is
95% Confidence Interval: (5.453894, 6.758392)
# Given coefficients
se_intercept <-0.292692
se_x1 <-0.006765
se_x2 <-0.069909
sample_size <-13
Problem 4: Child Mortality in relation to per Capita GNP and Female Literacy Rate
Consider the behavior of child mortality (CM) in relation to per capita GNP (PGNP). Consider
the data given in Table 6.4. These are cross-sectional data for 64 countries on child mortality
and a few other variables. For now, concentrate on the variables child mortality (CM) and per
capita GNP (PGNP). Keep in mind that CM is the number of deaths of children under five per
1000 live births; PGNP is per capita GNP in 1980; and the female literacy rate (FLR) is measured
in percent.
Questions
a. Estimate the model and interpret the coefficients. Does the regression result make sense?
b. Obtain the ML estimators of the parameters using the likelihood ratio (LR) test. Test the validity and make comments.
c. Compute the Wald statistic and the Lagrange Multiplier (LM) statistic.
d. Are all three tests asymptotically equivalent? What is the relationship between them?
Obs CM FLFP PGNP TFR
1 128 37 1870 6.66
2 204 22 130 6.15
3 202 16 310 7
4 197 65 570 6.25
5 96 76 2050 3.81
6 209 26 200 6.44
7 170 45 670 6.19
8 240 29 300 5.89
9 241 11 120 5.89
10 55 55 290 2.36
11 75 87 1180 3.93
12 129 55 900 5.99
13 24 93 1730 3.5
14 165 31 1150 7.41
15 94 77 1160 4.21
16 96 80 1270 5
17 148 30 580 5.27
18 98 69 660 5.21
19 161 43 420 6.5
20 118 47 1080 6.12
21 269 17 290 6.19
22 189 35 270 5.05
23 126 58 560 6.16
24 12 81 4240 1.8
25 167 29 240 4.75
26 135 65 430 4.1
27 107 87 3020 6.66
28 72 63 1420 7.28
29 128 49 420 8.12
d2<-read.csv(file.choose(),header =T );d2
attach(d2)
model<-lm(CM~FLFP+PGNP+TFR, data=d2);model
The multiple regression equation:
Child Mortality (CM) = 168.30 - 1.76 * FLFP - 0.0055 * PGNP + 12.86 * TFR
In this equation, FLFP (Female Labor Force Participation) is one of the independent variables.
For every unit increase in FLFP, the Child Mortality is expected to decrease by 1.76 units,
holding the other variables constant.
Again, PGNP (Per Capita Gross National Product) is another independent variable. For every
unit increase in PGNP, the Child Mortality is expected to decrease by 0.0055 units, holding the
other variables constant.
On the other hand, TFR (Total Fertility Rate) is the third independent variable. For every unit
increase in TFR, the Child Mortality is expected to increase by 12.86 units, holding the other
variables constant.
The constant term (168.30) represents the intercept or the Child Mortality value when all the
independent variables (FLFP, PGNP, TFR) are zero.
It's important to note that these interpretations assume a linear relationship between the
variables and that the assumptions of multiple regression (such as linearity, independence of
errors, homoscedasticity, and absence of multicollinearity) are met.
# Simulated data
set.seed(123)
From the R output, the likelihood ratio test statistic is 17.1786, which indicates that the more
complex model significantly improves the likelihood of the data compared with the simpler model.
Answer to the question no (3)
# Simulated data
set.seed(123)
# Display results
From the R output, the Wald test statistic is 50.81 and the Lagrange Multiplier test statistic is 3.92.
The larger Wald statistic indicates that the coefficient is further from zero, i.e., a stronger effect.
Problem 5: Consumption Expenditure on Income and Wealth
Consider the Consumption Expenditure about Income and Wealth. Consider the data
given in Table 10.5 For now concentrate on the variables Consumption (Y), Income
(X2), and Wealth (X3)
Questions
1. Estimate the model and interpret the coefficients. Does the regression result make sense?
2. Test the significance of regression and interpret.
3. Would you expect multicollinearity?
4. If collinearity is expected, how would you resolve the problem?
Y X2 X3
70 80 810
65 100 1009
90 120 1273
95 140 1425
110 160 1633
115 180 1876
120 200 2052
140 220 2201
155 240 2435
150 260 2686
**********Problem 5*************
d3<-read.csv(file.choose(),header=T);d3
attach(d3)
head(d3)
model3<-lm(Y~X2+X3, data=d3);model3
summary(model3)
anova(model3)
# Correlation Matrix
my_cor<-cor(d3);my_cor
round(my_cor,2)
eigen(my_cor)$values
# Condition Matrix
k<-max(eigen(my_cor)$values)/min(eigen(my_cor)$values);k
CI<-sqrt(k);CI
From the output in R, we can see the Condition Index is 56.22, which implies that severe
multicollinearity exists because the condition index is greater than 30.
Multicollinearity occurs when two or more independent variables in a regression model are highly
correlated with each other, which can lead to issues in interpreting the individual coefficients and
affect the stability of the model. Here are several strategies to address multicollinearity in R:
1. Feature Selection: Consider removing one or more of the correlated variables from the
model. This reduces the redundancy and can help improve model stability. You can use
techniques like domain knowledge, variable importance measures, or stepwise regression
for feature selection.
2. Combine Variables: If it makes sense in your domain, you could create composite variables
that represent a combination of correlated variables. This could help in capturing the
shared information without directly including the correlated variables.
3. Principal Component Analysis (PCA): PCA transforms correlated variables into a new set
of uncorrelated variables (principal components) that can be used in the regression.
However, the interpretation of these components might be challenging in terms of their
original meaning.
4. VIF (Variance Inflation Factor) Analysis: Calculate the VIF for each variable. High VIF
values (typically above 5 or 10) indicate high multicollinearity. If you find variables
with high VIF, consider removing one or addressing the correlation issue (see the sketch
after this list).
5. Partial Regression Plots: Visualize relationships between each predictor and the response
while controlling for other variables. High slopes in these plots might suggest
multicollinearity.
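A sketch of the VIF check from point 4, assuming the car package is available and using the model3 object fitted above (dropping one collinear regressor, as in point 1, is shown for comparison):
library(car)
vif(model3)                      # VIF values well above 10 confirm severe multicollinearity
summary(lm(Y ~ X2, data = d3))   # possible remedy: keep only one of the collinear regressors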
**********Problem 6*************
d4<-read.csv(file.choose(),header=T);d4
attach(d4)
head(d4)
model4<-lm(C~Yd+W+I, data=d4);model4
summary(model4)
anova(model4)
From the output in R, we can see the p-value is 0.00000024, which is less than 0.05. So we
can conclude that the model is statistically significant and that the predictors (real disposable
personal income, wealth, and the real interest rate) collectively have a meaningful impact on
real consumption expenditure.
Answer to the question no( 4)
From the output in R, we have a multiple R-squared value of 0.9994. So we can conclude that
the predictors (real disposable personal income, wealth, and the real interest rate) explain
about 99.9% of the variation in real consumption expenditure.
d5<-read.csv(file.choose(),header=T);d5
attach(d5)
model5<-lm(Y~X1+X2+X3+X4+X5, data=d5);model5
summary(model5)
anova(model5)
t2<-t.test(X1,Y);t2
# Correlation Matrix
my_cor<-cor(d5);my_cor
round(my_cor,2)
eigen(my_cor)$values
# Condition Matrix
k<-max(eigen(my_cor)$values)/min(eigen(my_cor)$values);k
CI<-sqrt(k);CI
2. (-2.83X1): This term is associated with the GNP implicit price deflator (X1). The
coefficient is negative (-2.83), which means that as the GNP implicit price deflator
increases, the number of people employed (Y) is expected to decrease. The larger the
increase in X1, the larger the decrease in Y, all else being equal.
3. (0.809X2): This term is associated with GNP (X2) in millions of dollars. The coefficient
is positive (0.809), indicating that as the GNP increases, the number of people employed
(Y) is expected to increase. A larger GNP (X2) would lead to a larger Y, holding other
variables constant.
4. (-2.60X3): This term is associated with the number of people unemployed (X3) in
thousands. The coefficient is negative (-2.60), meaning that as the number of
unemployed people increases, the number of people employed (Y) is expected to
decrease. This is an intuitive relationship, as more unemployment typically corresponds
to fewer people being employed.
5. (-1.97X4): This term is associated with the number of people in the armed forces (X4).
The coefficient is negative (-1.97), suggesting that an increase in the number of people
in the armed forces would lead to a decrease in the number of people employed (Y),
assuming other factors remain constant.
From the output in R, we can see the p-value is 0.00043, which is less than 0.05. So we can
conclude that the model is statistically significant and that the predictors (X1, X2, X3, X4, X5)
collectively have a meaningful impact on the number of people employed.
Answer to the question no(4)
From the output in R, we can see the Condition Index is 3954.68, which implies that severe
multicollinearity exists because the condition index is greater than 30.
Problem 8:
Table 10.13 gives data on imports, GDP, and the wholesale price index (WPI) for
India over the period 1980-81 to 2008-09. Consider the following model:
ln Imports_t = β1 + β2 ln GDP_t + β3 ln CPI_t + u_t
Table 10.13 Imports, GDP at market price and WPI, 1980-81 to 2008-09
#Read Data
library(readxl)
data=read_excel(file.choose())
data
attach(data)
head(data)
data=data.frame(data)
data
#Taken Log
ln_Imports=log(Imports)
ln_GDP=log(GDP)
Ln_CPI=log(WPI)
#regress
model1=lm(ln_Imports~ln_GDP)
model2=lm(ln_Imports~Ln_CPI)
model3=lm(ln_GDP~Ln_CPI)
summary(model1)
summary(model2)
summary(model3)
# The best solution here is to express imports and GDP in real terms by
# dividing each by the WPI
ln_impCpi=log(Imports/WPI)
ln_impCpi
ln_GDPcPI=log(GDP/WPI)
ln_GDPcPI
Model4=lm(ln_impCpi~ln_GDPcPI)
Model4
summary(Model4)
Answer to the question no (a)
Regressing Imports on the two independent variables, GDP and CPI (all in logs), we get the
regression equation
ln Imports_t = 0.91206 − 0.00101 ln GDP_t + 2.31284 ln CPI_t
Coefficients:
i)
Coefficients:
Interpretation:
The regression of lnImports on lnGDP shows the two variables are highly correlated, perhaps
suggesting that the data suffer from collinearity problem.
ii)
Coefficients:
Interpretation:
The regression of lnImports on lnCPI shows the two variables are moderately correlated,
perhaps suggesting that the data suffer from a moderate collinearity problem.
iii)
Coefficients:
Interpretation:
The auxiliary regression of lnGDP on lnCPI shows the two variables are not highly
correlated, perhaps suggesting that the data do not suffer from a collinearity problem.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Interpretation:
The regression of ln(Imports/WPI) on ln(GDP/WPI) shows the two variables are not
highly correlated, perhaps suggesting that the data do not suffer from a collinearity problem.
Problem 9:
Example 11.1 & 11.2: Park Test & Glejser Test: Relationship between Compensation and
Productivity
Table 11.1 gives data on average wages per employee and labour productivity in 11
manufacturing industry groups for the year 1998-99. The data are averaged across three states
of India, namely Andhra Pradesh, Bihar, and Gujarat.
Table 11.1: Wages per employee (Rs.) and Productivity (Rs.) in Manufacturing Industries in
India, 1998-99
Questions
a. Estimate the parameters of this model and interpret the results. Do they make economic sense?
b. Would you expect the error variance in the preceding model to be heteroscedastic? Why?
c. Illustrate the Park approach to detect heteroscedasticity.
d. Illustrate the Glejser test to detect heteroscedasticity.
Wages per employee    Productivity    Std
31660.94    561506.15     2527.91
39654.76    1027032.58   10082.87
16394.52    455223.97     5763.08
31139.27    687717.5     17663.02
56247.38    929562.01     8756.23
21316.48    538554.9     10311.67
21566.75    549645.86     5052.37
39175.69    749620.67    20937.34
47845.94    935242.07    38567.69
53601.11    785937.34    22479.52
24711.36    306195.84    19807.13
# Read data (following the pattern used in the earlier problems)
library(readxl)
data=read_excel(file.choose())
data
attach(data)
head(data)
data=data.frame(data)
data
#Linear Model
model=lm(W~P)
model
summary(model)
ui=resid(model)
ui
absui=abs(ui)
absui
model=lm(absui~P) ### Glejser test: regress |residuals| on the regressor
summary(model)
###park test
model=lm(W~P)
model
summary(model)
#Calculate Residual
res=resid(model)
res
#Squared Residual
u2=res^2
u2
#Taken Log
lnu2=log(u2)
lnP=log(P)
# Park test: regress log squared residuals on log of the independent variable
model=lm(lnu2~lnP)
model
summary(model)
Coefficients:
Interpretation:
The slope coefficient gives the change in wages for a one-rupee change in productivity: if
productivity rises by Rs. 1, wages rise by about Rs. 0.0486. If you are familiar with derivatives,
you can read it as the rate of change of wages with respect to productivity.
Park test:
The squared residuals obtained from the regression are then regressed on productivity as suggested,
giving the following results:
ln û²_i = 29.25 − 0.92 ln P_i
Coefficients:
Glejser Test:
The absolute values of the residuals obtained from the regression were regressed on average
productivity, giving the following results:
|û_i| = 5093.0 + 0.00219 P_i
Coefficients:
Problem 11:
#Read Data
library(readxl)
data=read_excel(file.choose())
data
attach(data)
head(data)
data=data.frame(data)
data
H
class(H)
H=as.integer(H)
H
class(H)
#Taken Log
lnCt=log(C)
lnCt
lnIt=log(I)
lnIt
lnLt=log(L)
lnLt
lnHt=log(H)
lnHt
lnAt=log(A)
lnAt
#Linear Model
m=lm(lnCt~lnIt+lnLt+lnHt+lnAt)
m
summary(m)
#Calculate Residuals
res=resid(m)
res
#Standard Residuals
standard_res=rstandard(m)
standard_res
data.frame(res,standard_res)
plot(res,standard_res)
#Install Packages
install.packages('lmtest')
library('lmtest')
# Perform the Durbin-Watson test
dwtest(m)
# Carry out the Durbin-Watson test using the car package
library(car)
durbinWatsonTest(m)
#perform Breusch-Godfrey test
bgtest(m,order = 1,data=data)
bgtest(m,order = 2,data=data)
bgtest(m,order = 3,data=data)
Coefficients:
Interpretation: As we can see, the coefficients of I, L, and A are statistically significant and have
an economically meaningful impact on C.
1 0.023221931 0.2103062
3 -0.244654772 -2.1208100
4 -0.105724521 -0.9115771
5 0.133004568 1.2856635
6 0.217517839 1.8832038
7 0.065299848 0.5822464
8 -0.069953503 -0.6055174
9 0.031972362 0.2748056
10 0.054325844 0.4706137
11 0.033478166 0.2930961
12 0.005092719 0.1067780
13 0.043591106 0.3952097
14 -0.067973587 -0.5688233
15 -0.130943477 -1.1051998
16 -0.181224048 -1.5785523
17 -0.061124395 -0.5291117
18 -0.081443504 -0.6947837
19 -0.050786470 -0.4360024
20 0.149156767 1.2589808
21 0.104284912 0.8876354
22 0.097347239 0.8847658
23 0.076340437 0.7012328
24 0.160041001 1.4119118
25 0.087774845 0.7538771
26 -0.031058589 -0.2660261
27 -0.150567847 -1.3214143
28 -0.189168137 -1.6936295
29 0.048761866 0.4471027
30 0.033411400 0.3281660
Interpretation: The plot of the residuals against the standardized residuals shows a pattern in
the points, suggesting a relationship among successive residuals, which probably indicates
autocorrelation.
Answer to the question no (c)
Durbin-Watson test (data: m)
Breusch-Godfrey tests of orders 1, 2, and 3 (data: m)
Problem 12:
The data are given in table 12.8.
data
attach(data)
head(data)
data=data.frame(data)
data
m=lm(Y~X)
m
summary(m)
library(lmtest)
dwtest(m)
plot(m)
# Cochrane-Orcutt procedure
# estimate rho (first-order correlation coefficient of the residuals)
e=residuals(m)
e
n=length(e)
et= e[2:n]
et
et1 <- e[1:(n-1)]
rhohat <- sum(et*et1) / sum(e^2)
rhohat
# generalized differences (base R's lag() does not shift a plain numeric
# vector, so the lagged values are formed by indexing)
y <- Y[2:n] - rhohat*Y[1:(n-1)]
x <- X[2:n] - rhohat*X[1:(n-1)]
x
y
Model12<- lm(y ~ x)
Model12
# Cochrane-Orcutt correction using the orcutt package
library(orcutt)
ccl = cochrane.orcutt(m)
ccl
𝑌̂ = 193719.5 + 9975.8𝑋
Coefficients:
Durbin-Watson test (data: m)