Assignment For Viva

Multicollinearity

Definition
Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables in a regression model are highly correlated with each other. In other words,
multicollinearity indicates a strong linear relationship among the predictor variables. This can
create challenges in the regression analysis because it becomes difficult to determine the
individual effects of each independent variable on the dependent variable accurately.
Example:
1. Determining the electricity consumption of a household from the household income
and the number of electrical appliances. Here, we know that the number of electrical
appliances in a household will increase with household income. However, this cannot
be removed from the dataset.
2. Creating a variable for BMI from the height and weight variables would include redundant
information in the model, and the new variable will be highly correlated with height and
weight.
3. Including variables for temperature in Fahrenheit and temperature in Celsius.
4. In a dataset containing a marital status variable with two unique values, 'married' and
'single', creating dummy variables for both of them would include redundant information;
a single 0/1 variable for 'married'/'single' status is sufficient.
What Causes Multicollinearity?
Multicollinearity could occur due to the following problems:

1. Multicollinearity could exist because of problems in the dataset at the time of its
creation. These problems could arise from poorly designed experiments, highly
observational data, or the inability to manipulate the data.
2. Multicollinearity could also occur when new variables are created which are
dependent on other variables.
3. Including identical variables in the dataset.
4. Inaccurate use of dummy variables can also cause a multicollinearity problem. This is
called the Dummy variable trap.
5. Insufficient data, in some cases, can also cause multicollinearity problems.

The Consequences of Multicollinearity


1. Imperfect multicollinearity does not violate Assumption 6 (no perfect collinearity among
the regressors). Therefore the Gauss-Markov Theorem tells us that the OLS estimators are
BLUE. So why, then, do we care about multicollinearity?
2. The variances and the standard errors of the regression coefficient estimates will increase.
This means lower t-statistics.
3. The overall fit of the regression equation will be largely unaffected by multicollinearity.
This also means that forecasting and prediction will be largely unaffected.
4. Regression coefficients will be sensitive to specification: they can change substantially
when variables are added or dropped.
The Detection of Multicollinearity
1. High Correlation Coefficients
2. High R2 with low t-Statistic Values
3. High Variance Inflation Factors (VIFs)
4. Pairwise Scatterplot and Correlation Coefficients
5. Eigenvalue Method
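To make the VIF check in the list above concrete, here is a minimal R sketch on a small simulated data set; the names d, x1, x2, x3 and y are purely illustrative and not part of the assignment data. Each VIF is computed as 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors.

# Hypothetical illustration: x1 and x2 are deliberately near-collinear
set.seed(1)
d <- data.frame(x1 = rnorm(100))
d$x2 <- 2 * d$x1 + rnorm(100, sd = 0.1)   # almost a linear function of x1
d$x3 <- rnorm(100)
d$y  <- 1 + d$x1 + d$x3 + rnorm(100)

# VIF_j = 1 / (1 - R_j^2) from regressing predictor j on the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}
vif_manual(d, c("x1", "x2", "x3"))   # x1 and x2 should show very large VIFs; x3 should be near 1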
Remedies for Multicollinearity
No single solution exists that will eliminate multicollinearity. Certain approaches may be
useful:
1. Do Nothing
Live with what you have.
2. Drop a Redundant Variable
If a variable is redundant, it should have never been included in the model in the first place.
So dropping it actually is just correcting for a specification error. Use economic theory to
guide your choice of which variable to drop.
3. Transform the Multicollinear Variables
Sometimes you can reduce multicollinearity by re-specifying the model, for instance by creating
a combination of the multicollinear variables. As an example, rather than including the
variables GDP and population in the model, include GDP/population (GDP per capita)
instead.
4. Increase the Sample Size
Increasing the sample size improves the precision of an estimator and reduces the adverse
effects of multicollinearity. Usually adding data though is not feasible.
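As a hedged illustration of remedy 3, the R sketch below simulates a GDP/population example (the data and names are hypothetical, not part of the assignment): the two raw regressors are highly correlated, while the re-specified model with GDP per capita avoids the collinearity.

set.seed(7)
population <- runif(50, 1, 100)                    # hypothetical country-level data
gdp <- population * runif(50, 20, 40)              # GDP rises with population, so the two move together
y <- 3 + 0.05 * (gdp / population) + rnorm(50)

cor(gdp, population)                               # high correlation between the raw regressors
summary(lm(y ~ gdp + population))$coefficients     # original specification
summary(lm(y ~ I(gdp / population)))$coefficients  # re-specified model using GDP per capita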

Heteroscedasticity
Definition
Heteroscedasticity means that the variance of the disturbances ui is not constant for all i,
conditional on X, because it depends on one or several variables. The presence of
heteroskedasticity is quite frequent when working with cross-section data.
Let's assume a regression model that determines consumption as a function of income. The
variance of the error term might be expected to increase as income increases.

Examples:
1. The range in family income between the poorest and richest families in town is the
classical example of heteroscedasticity.
2. The range in annual sales between a corner drug store and a general store.
3. When the prices of a product are studied at the launch of a new model, heteroskedasticity
is predictable. But for rainfall or income comparisons, the nature of dispersion cannot
be predicted.

Possible Causes of Heteroskedasticity:


• The values of the variables in the regression equation vary substantially across
observations. The variation in the omitted variables and the measurement errors, which
are jointly responsible for the disturbance term, will be relatively small when y and x are
small and relatively large when they are large (economic variables tend to move in size
together).
• In time series analysis, if y and x are growing over time, then it may well happen that
the variance of the disturbance term is also growing over time.
• It occurs in data sets with large ranges, oscillating between the largest and smallest
values.

• It occurs due to a change in factor proportionality.

• Among other reasons, the nature of the variable can be a major cause.
• It majorly occurs in cross-sectional studies.

• Some regression models are prone to heteroskedastic dispersion.

• An improper selection of regression models can cause it.


• It can also be caused by data set formations and inefficiency of calculations as well.
Consequences of Heteroskedasticity:
1. OLS is no longer efficient, since it treats all observations as being of equal importance,
while there is much more information about the line to be had from some observations
than from others; when estimating the line, more attention should be paid to the
observations having small variances than to those with larger variances.
2. The standard errors are biased, and thus hypothesis tests and confidence intervals based
on the t-distribution are likely to be invalid.
3. First, note that we do not need the homoskedasticity assumption to show the unbiasedness
of OLS. Thus, OLS is still unbiased.
4. However, the homoskedasticity assumption is needed to show the efficiency of OLS. Hence,
OLS is no longer BLUE.
5. The variances of the OLS estimators are biased in this case. Thus, the usual OLS t statistics
and confidence intervals are no longer valid for inference.
6. We can still use the OLS estimators by finding heteroskedasticity-robust estimators of the
variances.
7. Alternatively, we can devise an efficient estimator by re-weighting the data appropriately to
take heteroskedasticity into account.

Detection Of Heteroscedasticity
Informal methods to identify the problem of heteroscedasticity
1. Checking Nature of the problem
2. Graphical inspection of residuals
Formal methods to identify the problem of heteroscedasticity
1. Park Test
2. Glejser test
3. White's test
4. Spearman's rank correlation test
5. Goldfeld-Quandt test
6. Breusch- Pagan test
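As a minimal sketch of the Breusch-Pagan idea (regress the squared OLS residuals on the regressor and use LM = n·R²), using simulated consumption/income data whose names are purely illustrative:

set.seed(2)
income <- runif(200, 10, 100)
consumption <- 5 + 0.8 * income + rnorm(200, sd = 0.1 * income)  # error spread grows with income

fit <- lm(consumption ~ income)
aux <- lm(residuals(fit)^2 ~ income)                  # auxiliary regression of squared residuals
bp_stat <- length(income) * summary(aux)$r.squared    # LM statistic = n * R^2
p_value <- 1 - pchisq(bp_stat, df = 1)                # one restriction tested
c(BP = bp_stat, p.value = p_value)                    # a small p-value points to heteroscedasticity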

Remedial Measures Of Heteroscedasticity


Remedial measures when true error variance is known
1. Generalized Least Squares
2. Weighted Least Squares (WLS) Estimator
Remedial measures when true error variance is unknown
1. Feasible Generalized Least Squares (FGLS) Estimator
2. White’s Correction
3. Plausible Assumptions about Heteroscedasticity Pattern
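The following sketch illustrates a WLS fit under the plausible assumption that Var(u_i) is proportional to income_i²; the data are simulated and the variable names are illustrative only.

set.seed(3)
income <- runif(200, 10, 100)
consumption <- 5 + 0.8 * income + rnorm(200, sd = 0.1 * income)

ols_fit <- lm(consumption ~ income)
wls_fit <- lm(consumption ~ income, weights = 1 / income^2)  # weight = 1 / assumed error variance
summary(ols_fit)$coefficients
summary(wls_fit)$coefficients   # WLS down-weights the noisier high-income observations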
Autocorrelation
Meaning Nature of Autocorrelation:
Autocorrelation may be defined as "correlation between members of series of observations
ordered in time [as in time series data] or space [as in cross-sectional data]." The CLRM
assumes that there is no such correlation in the disturbances:

E(u_i u_j) = 0 for i ≠ j

The disruption caused by a strike this quarter may very well affect output next quarter.
The increase in the consumption expenditure of one family may very well prompt another
family to increase its consumption expenditure.
Example:
Ex1: Athletes competing against exceptionally good or bad teams. This is especially evident
in baseball because teams play each other 3-4 times in a row.
Ex2: Certain pests which inhibit plant growth may be more
prevalent in some areas
Ex3: A technical analyst can learn how the stock price of a particular day is affected by those
of previous days through autocorrelation. Thus, he can estimate how the price will move in
the future.
Reasons for Autocorrelation:
i) Inertia
Inertia or sluggishness in economic time-series is a great reason for autocorrelation. For
example, GNP, production, price index, employment, and unemployment exhibit business
cycles. Starting at the bottom of the recession, when the economic recovery starts, most of
these series start moving upward. In this upswing, the value of a series at one point in time is
greater than its previous values. These successive periods (observations) are likely to be
interdependent.
ii) Omitted Variables Specification Bias
The residuals (which are proxies for u_i) may suggest that some variables that were originally
candidates but were not included in the model (for a variety of reasons) should be included.
This is the case of excluded-variable specification bias. Often the inclusion of such variables
may remove the correlation pattern observed among the residuals.
iii) Model Specification: Incorrect Functional Form
Autocorrelation can also occur due to mis-specification of the model.
iv) Effect of Cobweb Phenomenon
The quantity supplied in period t of many agricultural commodities depends on their price
in period t − 1. This is called the Cobweb phenomenon, because the decision to plant a
crop in period t is influenced by the price of the commodity in period t − 1.
v) Effect of Lagged Relationship
Many times in business and economic research the lagged values of the dependent variable
are used as explanatory variables. For example, to study the effect of tastes and habits on
consumption in period t, consumption in period t − 1 is used as an explanatory variable,
since consumers do not change their consumption habits readily for psychological,
technological, or institutional reasons.

Consequences of Autocorrelation:
1. OLS is no longer efficient, since it treats all observations as being of equal importance,
while there is much more information about the line to be had from some observations
than from others; when estimating the line, more attention should be paid to the
observations having small variances than to those with larger variances.
2. The standard errors are biased, and thus hypothesis tests and confidence intervals based
on the t-distribution are likely to be invalid.
3. When the disturbance terms are serially correlated, the OLS estimators of the β̂s are
still unbiased and consistent, but the optimum property (minimum variance) is not
satisfied.
4. The OLS estimators will be inefficient and therefore no longer BLUE.
5. The estimated variances of the regression coefficients will be biased and inconsistent,
and will be greater than the variances estimated by other methods; therefore, hypothesis
testing is no longer valid. In most cases, R² will be overestimated (indicating a better
fit than the one that truly exists), and the t- and F-statistics will tend to be higher.
Detection Method of Autocorrelation:
1. Durbin-Watson Test
2. Ljung-Box Q Test
3. ACF plots
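A brief R sketch of the detection tools listed above, applied to simulated residuals with AR(1) structure (the Durbin-Watson calculation is worked out in full later in the document); the data here are illustrative only:

set.seed(4)
tt <- 1:50
u <- as.numeric(arima.sim(model = list(ar = 0.7), n = 50))   # AR(1) disturbances
y <- 2 + 0.5 * tt + u
e <- residuals(lm(y ~ tt))

acf(e, main = "ACF of residuals")            # spikes at low lags suggest autocorrelation
Box.test(e, lag = 10, type = "Ljung-Box")    # Ljung-Box Q test on the residuals
sum(diff(e)^2) / sum(e^2)                    # Durbin-Watson statistic; values well below 2 indicate positive autocorrelation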
Remedial Measure of Autocorrelation:
1. Cochrane-Orcutt Procedure
2. Hildreth-Lu Procedure
3. First Differences Procedure
4. Forecasting Issues

Independent Observations
The fifth assumption about regression and correlation analysis is that successive residuals
should be independent. This means that there is not a pattern to the residuals, the residuals are
not highly correlated, and there are not long runs of positive or negative residuals. When
successive residuals are correlated, we refer to this condition as autocorrelation.
Autocorrelation frequently occurs when the data are collected over a period of time. For
example, suppose we wish to predict yearly sales of Ages Software Inc. based on time and the
amount spent on advertising. The dependent variable is yearly sales, and the independent
variables are time and the amount spent on advertising. It is likely that for a period of time the
actual points will be above the regression plane (remember there are two independent variables)
and then for a period of time the points will be below the regression plane. The graph below
shows the residuals plotted on the vertical axis and the fitted values Ŷ on the horizontal axis.
Note the run of residuals above the mean of the residuals, followed by a run below the mean.
A scatter plot such as this would indicate possible autocorrelation.

There is a test for autocorrelation, called Durbin-Watson. We present the details of this test in
Chapter 16, Section 16.10.

16.10 The Durbin-Watson Statistic


Time series data, or observations collected successively over a period of time, present a
particular difficulty when you use the technique of regression. One of the assumptions
traditionally used in regression is that the successive residuals are independent. This means that
there is not a pattern to the residuals, the residuals are not highly correlated, and there are not
long runs of positive or negative residuals. In Chart 16-10, the residuals are scaled on the
vertical axis and the Y values along the horizontal axis. Notice there are "runs" of residuals
above and below the 0 line. If we computed the correlation between successive residuals, it is
likely the correlation would be strong.

CHART 16-10 Correlated Residuals


This condition is called autocorrelation or serial correlation.

AUTOCORRELATION: Successive residuals are correlated.

Successive residuals are correlated in time series data because an event in one time period often
influences the event in the next period. To explain, suppose the owner of a furniture store decides
to have a sale this month and spends a large amount of money advertising the event. We would
expect a correlation between sales and advertising expense, but not all the results of the increase
in advertising are experienced this month. It is likely that some of the effect of the advertising
carries over into next month. Therefore, we expect correlation among the residuals.
The regression relationship in a time series is written
Y_t = α + βX_t + e_t
where the subscript t is used in place of i to suggest the data were collected over time. If the
residuals are correlated, problems occur when we try to conduct tests of hypotheses about the
regression coefficients. Also, a confidence interval or a prediction interval, where the multiple
standard error of estimate is used, may not yield the correct results. The autocorrelation,
reported as r, is the strength of the association among the residuals. The r has the same meaning
as the coefficient of correlation; that is, values close to -1.00 or 1.00 indicate a strong
association, and values near 0 indicate no association. Instead of directly conducting a
hypothesis test on r, we use the Durbin-Watson statistic.
The Durbin-Watson statistic, identified by the letter d, is computed by first determining the
residuals for each observation, that is, e_t = (Y_t − Ŷ_t). Next, we compute d using the following
relationship.

DURBIN-WATSON STATISTIC   d = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t²   [16-4]

To determine the numerator of formula (16-4), we lag each of the residuals one period and then
square the difference between consecutive residuals. This may also be called finding the
differences. This accounts for summing the observations from 2, rather than from 1, up to n. In
the denominator, we square the residuals and sum over all n observations.
The value of the Durbin-Watson statistic can range from 0 to 4. The value of d is 2.00 when
there is no autocorrelation among the residuals. When the value of d gets close to zero, this
indicates positive autocorrelation. Values beyond 2 indicate negative autocorrelation. Negative
autocorrelation seldom exists in practice. To occur, successive residuals would tend to be large
but would have opposite signs.
To conduct a test for autocorrelation, the null and alternative hypotheses are:

H0: No residual correlation (ρ = 0)
H1: Positive residual correlation (ρ > 0)

Recall from the previous chapter that r refers to the sample correlation and that ρ is the
correlation coefficient in the population. The critical values for d are reported in Appendix
B.10. To determine the critical value, we need α (the significance level), n (the sample size),
and k (the number of independent variables). The decision rule for the Durbin-Watson test is
altered from what we are used to. As usual, there is a range of values where the null hypothesis
is rejected and a range where it is not rejected; however, there is also a range of values where
d is inconclusive. That is, in the inconclusive range the null hypothesis is neither rejected nor
not rejected. To state this more formally:
• Values less than dL cause the rejection of the null hypothesis.
• Values greater than dU will result in the null hypothesis not being rejected.
• Values of d between dL and dU yield inconclusive results.
The subscript L refers to the lower limit of d and the subscript U to the upper limit. How do we
interpret the various decisions for the test for residual correlation? If the null hypothesis is not
rejected, we conclude that autocorrelation is not present: the residuals are not correlated, and
the regression assumption has been met. There will not be any problem with the estimated value
of the standard error of estimate. If the null hypothesis is rejected, then we conclude that
autocorrelation is present.
The usual remedy for autocorrelation is to include another predictor variable that captures time
order. Alternatively, we might transform the dependent variable, for example using the square
root of Y instead of Y; this transformation will result in a change in the distribution of the
residuals. If the result falls in the inconclusive range, more sophisticated tests are needed, or,
conservatively, we treat the conclusion as rejecting the null hypothesis.
An example will show the details of the Durbin-Watson test and how the results are interpreted.
Example:
Banner Rocker Company manufactures and markets rocking chairs. The company developed a
special rocker for senior citizens, which it advertises extensively on TV. Banner's market for
the special chair is the Carolinas, Florida, and Arizona, where there are many senior citizens
and retired people. The president of Banner Rocker is studying the association between his
advertising expense (X) and the number of rockers sold over the last 20 months (Y). He
collected the following data. He would like to create a model to forecast sales, based on the
amount spent on advertising, but is concerned that, because he gathered these data over
consecutive months, there might be problems with autocorrelation.
Month Sales(000) Advertising($millions)
1 153 5.5
2 156 5.5
3 153 5.3
4 147 5.5
5 159 5.4
6 160 5.3
7 147 5.5
8 147 5.7
9 152 5.9
10 160 6.2
11 169 6.3
12 176 5.9
13 176 6.1
14 179 6.2
15 184 6.2
16 181 6.5
17 192 6.7
18 205 6.9
19 215 6.5
20 209 6.4

Determine the regression equation. Is advertising a good predictor of sales ? If the owner were
to increase the amount spent on advertising by $1,000,000, how many additional chairs can he
expect to sell? Investigate the possibility of autocorrelation.
Solution:
The first step is to determine the regression equation.
Regression Equation
Sales = -43.8 + 35.95 Advertising

Coefficients
Term          Coef    SE Coef   T-Value   P-Value   VIF
Constant      -43.8   34.4      -1.27     0.220
Advertising   35.95   5.75      6.26      0.000     1.00

Model Summary
S         R-sq     R-sq(adj)   R-sq(pred)
12.3474   68.50%   66.75%      61.86%

Analysis of Variance
Source          DF   Adj SS   Adj MS   F-Value   P-Value
Regression      1    5968     5967.7   39.14     0.000
  Advertising   1    5968     5967.7   39.14     0.000
Error           18   2744     152.5
  Lack-of-Fit   10   1472     147.2    0.93      0.554
  Pure Error    8    1272     159.0
Total           19   8712
The coefficient of determination is 68.5 percent, so we know there is a strong positive
association between the variables. We conclude that, as we increase the amount spent on
advertising, we can expect to sell more chairs. Of course, this is what we had hoped.

How many more chairs can we expect to sell if we increase advertising by $1,000,000? We
must be careful with the units of the data. Sales are in thousands of chairs and advertising
expense is in millions of dollars. The regression equation is:

𝑌̂ = −43.80 + 35.950𝑋

This equation indicates that an increase of 1 in X will result in an increase of 35.95 in Y. So
an increase of $1,000,000 in advertising will increase sales by 35,950 chairs. To put it another
way, it will cost $27.82 in additional advertising expense per chair sold, found by
$1,000,000/35,950.
What about the potential problem with autocorrelation? Many software packages, such as
Minitab, will calculate the value of the test statistic. To see the details of formula (16-4), we
use an Excel spreadsheet.

Month   Y     X     Ŷ          e_t         e_(t−1)       (e_t − e_(t−1))²   e_t²
1       153   5.5   153.9237   -0.92366                                     0.8531478
2       156   5.5   153.9237    2.07634    -0.92366       9                 4.3111878
3       153   5.3   146.7336    6.266378    2.07634001    17.5564176        39.267492
4       147   5.5   153.9237   -6.92366     6.26637791    173.9770998       47.937068
5       159   5.4   150.3286    8.671359   -6.92366       243.204616        75.192466
6       160   5.3   146.7336   13.26638     8.67135896    21.11419915       175.99678
7       147   5.5   153.9237   -6.92366    13.2663779     407.6376304       47.937068
8       147   5.7   161.1137  -14.1137     -6.92366       51.69664499       199.19647
9       152   5.9   168.3037  -16.3037    -14.113698      4.796266          265.8118
10      160   6.2   179.0888  -19.0888    -16.303736      7.756541652       364.382
11      169   6.3   182.6838  -13.6838    -19.088793      29.21382015       187.2467
12      176   5.9   168.3037    7.696264  -13.683812      457.1076412       59.232483
13      176   6.1   175.4938    0.506226    7.69626421    51.69664499       0.2562651
14      179   6.2   179.0888   -0.08879     0.50622631    0.35404755        0.0078841
15      184   6.2   179.0888    4.911207   -0.0887926     25                24.119958
16      181   6.5   189.8738   -8.87385     4.91120736    190.0277923       78.745205
17      192   6.7   197.0639   -5.06389    -8.8738495     14.51581121       25.642955
18      205   6.9   204.2539    0.746075   -5.0638874     33.75565961       0.5566275
19      215   6.5   189.8738   25.12615     0.74607472    594.3880959       631.32344
20      209   6.4   186.2788   22.72117    25.1261505     5.783933853       516.25154
Sum                                                        2338.582862       2744.2685

To investigate the possible autocorrelation, we need to determine the residuals for each
observation. We find the fitted values, that is, the Ŷ, for each of the 20 months. This
information is shown in the fourth column, column D. Next we find the residual, which is the
difference between the actual value and the fitted value. So for the first month:

Ŷ = -43.80 + 35.950X = -43.80 + 35.950(5.5) = 153.925

e_1 = Y_1 − Ŷ_1 = 153 − 153.925 = -0.925

The residual, reported in column E, is slightly different due to rounding in the software. Notice
in particular the string of five negative residuals in rows 8 through 12. In column F, we lag the
residuals one period. In column G, we find the difference between the current residual and the
residual in the previous period and square this difference. Using the values from the software:
(e_t − e_(t−1))² = (e_2 − e_1)² = [2.0763 − (-0.9237)]² = (3.0000)² = 9.0000

The other values in column G are found the same way. The values in column H are the squares
of those in column E.
(e_1)² = (-0.9237)² = 0.8531
To find the value of d, we need the sums of columns G and H. These sums are shown in the
last row of the spreadsheet.

d = Σ_{t=2}^{n} (e_t − e_(t−1))² / Σ_{t=1}^{n} e_t² = 2338.583 / 2744.269 = 0.8522

Now to answer the question as to whether there is significant autocorrelation. The null and the
alternative hypotheses are stated as follows.
H0: No residual correlation
H1: Positive autocorrelation

The critical value of d is found in Appendix B.10, a portion of which is shown below. There is
one independent variable, so k = 1, the level of significance is 0.05, and the sample size is 20.
We use the 0.05 table, the columns where k = 1, and the row for n = 20. The reported values are
dL = 1.20 and dU = 1.41. The null hypothesis is rejected if d < 1.20 and not rejected if d > 1.41.
No conclusion is reached if d is between 1.20 and 1.41.

         k = 1           k = 2
n     dL      dU      dL      dU
15    1.08    1.36    0.95    1.54
16    1.10    1.37    0.98    1.54
17    1.13    1.38    1.02    1.54
18    1.16    1.39    1.05    1.53
19    1.18    1.40    1.08    1.53
20    1.20    1.41    1.10    1.54
21    1.22    1.42    1.13    1.54
22    1.24    1.43    1.15    1.54
23    1.26    1.44    1.17    1.54
24    1.27    1.45    1.19    1.55
25    1.29    1.45    1.21    1.55

[Chart: Durbin-Watson decision regions. Values of d below dL = 1.20 fall in the "Reject H0"
region (positive autocorrelation), values between dL = 1.20 and dU = 1.41 are inconclusive,
and values above dU = 1.41 fall in the "Accept H0" region (no autocorrelation). The computed
d = 0.85 lies in the rejection region.]

Because the computed value of d is 0.8522, which is less than dL, we reject the null
hypothesis and accept the alternative hypothesis. We conclude that the residuals are autocorrelated.
We have violated one of the regression assumptions. What do we do? The presence of
autocorrelation usually means that the regression model has not been correctly specified. It is
likely we need to add one or more independent variables that have some time-ordered effect
on the dependent variable. The simplest independent variable to add is one that represents the
time periods.
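For readers who prefer R to a spreadsheet, here is a short sketch that reproduces the calculation above using the Banner Rocker data keyed in from the table (the variable names are mine, not from the source):

sales <- c(153,156,153,147,159,160,147,147,152,160,169,176,176,179,184,181,192,205,215,209)
advertising <- c(5.5,5.5,5.3,5.5,5.4,5.3,5.5,5.7,5.9,6.2,6.3,5.9,6.1,6.2,6.2,6.5,6.7,6.9,6.5,6.4)

fit <- lm(sales ~ advertising)
coef(fit)                          # roughly -43.8 and 35.95, as in the output above
e <- residuals(fit)
sum(diff(e)^2) / sum(e^2)          # Durbin-Watson d, about 0.85: below dL = 1.20, so reject H0

# Adding a variable for the time period, as suggested above, and recomputing d:
month <- 1:20
e2 <- residuals(lm(sales ~ advertising + month))
sum(diff(e2)^2) / sum(e2^2)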

CHAPTER 5
TIME SERIES
5.1 Time Series Models
A time series model accounts for patterns of the past movement of a variable and uses that
information to predict its future movements; i.e., it is a sophisticated method of extrapolating
data. At times it is desirable to smooth a time series and thus eliminate some of the more volatile
short-term fluctuations.
5.2 Modeling Trend By Using Polynomial Functions
We begin with simple models that can be used to forecast a time series on the basis of its past
behavior. Most of the series we encounter are not continuous in time; instead, they consist of
discrete observations made at regular intervals of time. We denote the values of a time series
by {y_t}, t = 1, 2, …, T.

Figure 5.1: Time Series Data

We sometimes can describe a time series y_t by using a trend model defined as

y_t = TR_t + ε_t   (5.1)

where TR_t is the trend in time period t.


No trend model: TR_t = β0   (5.2)

Linear Trend Model: TR_t = β0 + β1 t   (5.3)

Polynomial Trend of order p: TR_t = β0 + β1 t + β2 t² + … + β_p t^p   (5.4)

5.3 Autocorrelation
The assumption that errors corresponding to different observations are uncorrelated often
breaks down in time series data. When the error terms from different time periods are
correlated, we say that the error term is autocorrelated. For example, if we are predicting the
growth of stock dividends, an overestimate in one year is likely to lead to overestimates in
succeeding years.
Figure 5.2: Positive and Negative Autocorrelation

In this section we mainly deal with the problem of first-order autocorrelation, in which errors
in one time period are correlated directly with errors in the ensuing period. Autocorrelation can
be positive as well as negative.
Autocorrelation will not affect the unbiasedness or consistency of the OLS estimators, but it
does affect their efficiency.
We assume that each of the error terms in a linear regression model is drawn from a normal
population with zero expected value and constant variance, but that the errors are not
independent over time. Here the model is

Y_t = β0 + β1 X_1t + β2 X_2t + … + βk X_kt + ε_t,   t = 1, 2, …, T

with
ε_t = ρ ε_(t−1) + v_t,   0 ≤ |ρ| < 1   (5.5)
where v_t is distributed N(0, σ_v²) and is independent of other errors over time, and ε_t is
distributed N(0, σ_ε²) and is not independent of other errors over time.

Now V(ε_t) = σ_ε² = E(ε_t²) = E[(ρ ε_(t−1) + v_t)²] = ρ² E(ε_(t−1)²) + E(v_t²)
= ρ² σ_ε² + σ_v²  =>  V(ε_t) = σ_ε² = σ_v² / (1 − ρ²)   (5.6)

Cov(ε_t, ε_(t−1)) = E(ε_t ε_(t−1)) = E[(ρ ε_(t−1) + v_t) ε_(t−1)] = ρ E(ε_(t−1)²) = ρ σ_ε²   (5.7)

Likewise
Cov(ε_t, ε_(t−2)) = E(ε_t ε_(t−2)) = ρ² σ_ε²,  Cov(ε_t, ε_(t−3)) = E(ε_t ε_(t−3)) = ρ³ σ_ε², … ,
Cov(ε_t, ε_(t−r)) = E(ε_t ε_(t−r)) = ρ^r σ_ε²   (5.8)

A useful formula for the first-order autocorrelation coefficient ρ is
ρ = Cov(ε_t, ε_(t−1)) / σ_ε²   (5.9)
5.3.1 Tests for autocorrelation
Durbin-Watson Test
We shall now consider a test of the null hypothesis that no autocorrelation is present (ρ = 0).
By far the most popular test for autocorrelation is the Durbin-Watson test. This test involves the
calculation of a test statistic based on the OLS residuals. The statistic is defined as
DW = Σ_{t=2}^{T} (ε̂_t − ε̂_(t−1))² / Σ_{t=1}^{T} ε̂_t²   (5.10)

When successive values of 𝜀̂𝑡 are close to each other, the DW statistic will be low, indicating
the presence of positive autocorrelation. By taking several approximations, it is possible to
show that
𝐷𝑊 = 2(1 − 𝜌) (5.11)
The DW statistic will lie in the range of 0 to 4, with a value near 2 indicating no first order
autocorrelation. Positive autocorrelation is associated with DW values below 2, and negative
autocorrelation is associated with DW values above 2.
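A quick numerical check of the approximation DW ≈ 2(1 − ρ̂) on simulated AR(1) residuals (illustrative only, not the example data):

set.seed(8)
e <- as.numeric(arima.sim(model = list(ar = 0.5), n = 200))
DW   <- sum(diff(e)^2) / sum(e^2)
rho1 <- sum(e[-1] * e[-200]) / sum(e^2)   # first-order autocorrelation coefficient of e
c(DW = DW, approx = 2 * (1 - rho1))       # the two values should be close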
Table 5.1: Durbin-Watson Table
Value of DW Result
0<DW<dL Reject null hypothesis; positive autocorrelation
dL<DW<dU Indeterminate
dU<DW<2 Accept null hypothesis
2<DW<4-dU Accept null hypothesis
4-du <DW<4-dL Indeterminate
4-dL<DW<4 Reject null hypothesis; negative autocorrelation

Exact interpretation of the DW statistic is difficult because the sequence of error terms depends
not only on the sequence of the ε's, but also on the sequence of all the X values. For this reason,
most tables include test statistics which vary with the number of independent variables and the
number of observations. Two limits are given by Durbin and Watson (1950 and 1951), usually
labeled dL and dU. These limits help to summarize the DW test as shown in the above table.

Example 5.1:
Here we consider the time series data given by Chatterjee and Hadi.
Table 5.2: Consumer Expenditure Data
Consumer Expenditure Money Stock
214.6 159.3
217.7 161.2
219.6 162.8
227.2 164.6
230.9 165.9
233.3 167.9
234.1 168.3
232.3 169.7
233.7 170.5
236.5 171.6

Figure 5.3: Time Series Plot of Residuals for the Consumer Expenditure Data

The time series plot of residuals indicates that positive autocorrelation is present in the data. For
these data we obtain DW = 8195.21/7587.92 = 0.328.
At the 5% level of significance, the critical values corresponding to n = 20 are dL = 1.20 and
dU = 1.41. Since the observed value of d is less than dL, we reject the null hypothesis and
conclude that positive autocorrelation is present in our data.
5.3.2 Corrections for Autocorrelation
Generalized Difference Method
If ρ were known, it would be easy to adjust the OLS regression method to obtain efficient
estimates of the parameters. This procedure involves the use of generalized differencing to alter
the linear model into one in which the errors are independent.
Let us assume that the model

Y_t = β0 + β1 X_1t + β2 X_2t + … + βk X_kt + ε_t,   t = 1, 2, …, T   (5.12)

holds for all time periods. Then we also have

Y_(t−1) = β0 + β1 X_1,t−1 + β2 X_2,t−1 + … + βk X_k,t−1 + ε_(t−1),   t = 1, 2, …, T   (5.13)

Multiplying Eq. (5.13) by ρ and subtracting from Eq. (5.12), we obtain the transformed model

Y*_t = β0(1 − ρ) + β1 X*_1t + β2 X*_2t + … + βk X*_kt + v_t

where
Y*_t = Y_t − ρ Y_(t−1),  X*_1t = X_1t − ρ X_1,t−1, … , X*_kt = X_kt − ρ X_k,t−1,  and
v_t = ε_t − ρ ε_(t−1)   (5.14)
Now it is easy to show that

V(v_t) = E(v_t²) = E[(ε_t − ρ ε_(t−1))²] = E(ε_t²) − 2ρ E(ε_t ε_(t−1)) + ρ² E(ε_(t−1)²)
= σ_ε² − 2ρ² σ_ε² + ρ² σ_ε² = (1 − ρ²) σ_ε² = σ_v²   (5.15)

Cov(v_t, v_(t−1)) = E(v_t v_(t−1)) = E[(ε_t − ρ ε_(t−1))(ε_(t−1) − ρ ε_(t−2))]
= E(ε_t ε_(t−1)) − ρ E(ε_t ε_(t−2)) − ρ E(ε_(t−1)²) + ρ² E(ε_(t−1) ε_(t−2))
= ρ σ_ε² − ρ³ σ_ε² − ρ σ_ε² + ρ³ σ_ε² = 0   (5.16)
Hence we have an error process which is independently distributed with 0 mean and constant
variance. Thus the OLS estimators of the transformed model will be efficient.
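A minimal R sketch of generalized differencing when ρ is known, using simulated data (all names are illustrative); the transformed regression recovers the original slope, and the original intercept is recovered by dividing the transformed intercept by (1 − ρ):

set.seed(5)
n <- 100; rho <- 0.7                    # n plays the role of T in the text
x <- rnorm(n)
eps <- as.numeric(arima.sim(model = list(ar = rho), n = n))
y <- 1 + 2 * x + eps

ystar <- y[-1] - rho * y[-n]            # Y*_t = Y_t - rho * Y_{t-1}
xstar <- x[-1] - rho * x[-n]            # X*_t = X_t - rho * X_{t-1}
gd_fit <- lm(ystar ~ xstar)             # errors of the transformed model are serially independent
coef(gd_fit)
coef(gd_fit)[1] / (1 - rho)             # back out the original intercept beta0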
Cochrane-Orcutt Procedure
Cochrane and Orcutt (1949) suggested a procedure which involves a series of iterations, each
of which produces a better estimate of 𝜌 than does the previous one.
Step 1: The OLS method is used to estimate regression parameters and the correlation
coefficient is estimated by
ρ̂ = Σ_{t=2}^{T} ε̂_t ε̂_(t−1) / ( √(Σ_{t=2}^{T} ε̂_t²) · √(Σ_{t=2}^{T} ε̂_(t−1)²) )   (5.17)

Step 2: The estimated value of ρ is used to perform the generalized differences

Y*_t = Y_t − ρ Y_(t−1),  X*_1t = X_1t − ρ X_1,t−1, … , X*_kt = X_kt − ρ X_k,t−1

We estimate the regression parameters from the transformed model:

Y*_t = β0(1 − ρ) + β1 X*_1t + β2 X*_2t + … + βk X*_kt + v_t

Calculate a new set of residuals as

ε̂_t = Y_t − β̂0 − β̂1 X_1t − β̂2 X_2t − … − β̂k X_kt


Step 3: Recalculate the correlation coefficient
ρ̂ = Σ_{t=2}^{T} ε̂_t ε̂_(t−1) / ( √(Σ_{t=2}^{T} ε̂_t²) · √(Σ_{t=2}^{T} ε̂_(t−1)²) )   (5.18)

And so on.
The iterative procedure can be carried on for as many steps as desired. Standard procedure is
to stop the iterations when the new estimate of 𝜌 differs from the previous one by less than
0.001 or 0.0005.
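A compact R sketch of these iterations on simulated data, using the stopping rule mentioned above (successive ρ̂ estimates differing by less than 0.001); the data and variable names are illustrative, not the assignment's data:

set.seed(6)
n <- 100
x <- rnorm(n)
eps <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
y <- 1 + 2 * x + eps

b <- coef(lm(y ~ x))          # Step 1: ordinary least squares
rho <- 0
repeat {
  e <- y - b[1] - b[2] * x                                      # residuals from the original model
  rho_new <- sum(e[-1] * e[-n]) /
    (sqrt(sum(e[-1]^2)) * sqrt(sum(e[-n]^2)))                   # Eq. (5.17)/(5.18)
  ystar <- y[-1] - rho_new * y[-n]                              # Step 2: generalized differences
  xstar <- x[-1] - rho_new * x[-n]
  fit_star <- lm(ystar ~ xstar)
  b <- c(coef(fit_star)[1] / (1 - rho_new), coef(fit_star)[2])  # recover beta0, keep beta1
  if (abs(rho_new - rho) < 0.001) break                         # Step 3: stop when rho has converged
  rho <- rho_new
}
c(rho = rho_new, beta0 = unname(b[1]), beta1 = unname(b[2]))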
Example 5.2: Example 5.1 continued. For the Consumer Expenditure Data, we have ρ̂ =
0.751. We corrected both Y and X as described in Step 2. We fit the corrected Y on the corrected X,
compute the residuals, and also compute the revised Durbin-Watson statistic, which is 1.43.
From the table given by Durbin and Watson (for n = 19 and k = 1), we obtain dL = 1.18 and dU = 1.40.
Since the D-W statistic lies between dU and 2, we may accept the null hypothesis and conclude
that there is no evidence of autocorrelation in these data.
5.4 Comparisons of Different Methods
Minitab computes three measures of accuracy of the fitted model: MAPE, MAD, and MSD, for
each of the forecasting and smoothing methods. For all three measures, the smaller the value, the
better the fit of the model. Use these statistics to compare the fits of the different methods.
MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values.
It expresses accuracy as a percentage.
MAPE = ( Σ |(y_t − ŷ_t) / y_t| / T ) × 100   (5.19)

Where 𝑦𝑡 equals the actual value, 𝑦̂𝑡 equals the fitted value, and T equals the number of
observations.
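A small R sketch computing MAPE, together with MAD and MSD under their usual Minitab definitions (mean absolute deviation and mean squared deviation, an assumption since only MAPE is defined above); the fitted values come from a simple linear trend fitted to the consumer expenditure series of Table 5.2 and are for illustration only:

y    <- c(214.6, 217.7, 219.6, 227.2, 230.9, 233.3, 234.1, 232.3, 233.7, 236.5)
yhat <- fitted(lm(y ~ seq_along(y)))            # illustrative fitted values from a linear trend

MAPE <- mean(abs((y - yhat) / y)) * 100         # Eq. (5.19)
MAD  <- mean(abs(y - yhat))                     # mean absolute deviation (assumed Minitab definition)
MSD  <- mean((y - yhat)^2)                      # mean squared deviation (assumed Minitab definition)
c(MAPE = MAPE, MAD = MAD, MSD = MSD)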

Assignment-03
Problem 1: The data given in Table 1.1 below relate to the Indian economy for the period
1981-2005. The Y variable in the table is the aggregate (for the economy as a whole)
private final consumption expenditure (PFCE) and the X variable is gross domestic
product (GDP), a measure of aggregate income, both measured in rupee crore at
1999-2000 prices. Therefore, the data are in "real" terms; that is, they are measured in
constant (1999-2000) prices.
Table 1.1: Data on Y (Personal final consumption expenditure) and X (Gross Domestic
Product), both in 1999-00 prices measured in rupee crore.
Year Y X
1981 566866 678033
1982 572536 697861
1983 616974 752669
1984 634757 782484
1985 661249 815049
1986 682116 850217
1987 705495 880267
1988 749530 969702
1989 786725 1029178
1990 821863 1083572
1991 839593 1099072
1992 861245 1158025
1993 898682 1223816
1994 942359 1302076
1995 999729 1396974
1996 1077445 1508378
1997 1109656 1573263
1998 1181797 1678410
1999 1253643 1786526
2000 1292986 1864773
2001 1367758 1972912
2002 1397069 2047733
2003 1493871 2222591
2004 1579255 2389660
2005 1689861 2604532

Questions
1. Fit a simple linear regression model for the above data and make comments.
2. Construct the ANOVA table and test for the significance of regression.
3. Calculate t-statistics to assess the contribution of each regressor to the model. Use α = 0.05.

(R code for the solution)

d=read.csv(file.choose(),header=T);d
attach(d)
# Fitting regression model
model<- lm(Y~X, data=d);model
summary(model)
# ANOVA
anova(model)
# Two-sample t-test comparing X and Y (note: the regressor t-statistics are reported by summary(model))
t<-t.test(X,Y);t

Answer to the question number (1)


Assuming "Personal final consumption expenditure" (Y) to the "Gross Domestic Product" (X).
We have the regression equation is
Personal final consumption expenditure = 261.20 + 2.18 * Gross Domestic Product
The coefficient in front of the "Gross Domestic Product" (GDP) is 2.18. This indicates the
expected change in "Personal final consumption expenditure" for a one-unit change in GDP. In
other words, for each additional unit of GDP, the "Personal final consumption expenditure" is
expected to increase by 2.18 units, assuming all other factors remain constant.
Overall, this regression equation suggests that there is a positive linear relationship between
"Gross Domestic Product" and "Personal final consumption expenditure." As GDP increases,
it is estimated that personal consumption expenditure will also increase. The intercept term of
261.20 implies that even when GDP is zero (which is unlikely), there's still some predicted
consumption expenditure, possibly representing a base level of consumption or some other
factors not accounted for in the model.

Answer to the question number (2)

Source     DF   Sum of Squares   Mean Square   F-value   P-value
GDP        1    26961312         437810.38     20452     0.0021
Residual   23   24552.34         3905.054

The ANOVA table shows that the F-value is 20452 and the P-value is 0.0021, which is less than
the 5% significance level (α = 0.05). In this case, p-value (0.0021) < α (0.05).

So, we can conclude that GDP has a significant effect on PFCE and that the regression model is
statistically significant.

Answer to the question number (3)

From the output of R programming, we can see that the t value is 0.275 and, on 2 degrees of
freedom, we obtain the value 0.0241, which is less than the calculated value.
So, we can conclude that the test is statistically significant.

Problem 2: A soft drink bottler is analyzing the vending machine service routes in his
distribution system .He is interested in predicting the amount of time required by the
route driver to service the vending machines in an outlet. This service activity includes
stocking the machine with beverage products and minor maintenance or housekeeping.
The industrial engineer responsible for the study has suggested that the two most
important variables affecting the delivery time (y) are the number of cases of product
stocked (x1), and the distance walked by the route driver (x2). The engineer has collected
25 observations on delivery time, which are shown in Table 3.2.

Table 3.2: Delivery Time Data

Obs   Y          X     x    y         xi²   yi·xi
1     4.4567     6    -6   -4.218     36    25.308
2     5.77       7    -5   -2.9047    25    14.5235
3     5.9787     8    -4   -2.696     16    10.784
4     7.3317     9    -3   -1.343     9     4.029
5     7.3182     10   -2   -1.3565    4     2.713
6     6.5844     11   -1   -2.0903    1     2.0903
7     7.8182     12    0   -0.8565    0     0
8     7.8351     13    1   -0.8396    1     -0.8396
9     11.0223    14    2    2.3476    4     4.6952
10    10.6738    15    3    1.9991    9     5.9973
11    10.8361    16    4    2.1614    16    8.6456
12    13.615     17    5    4.9403    25    24.7015
13    13.531     18    6    4.8565    36    29.1378
Sum   112.7712   156   0    0         182   131.786

Questions
1. Fit a multiple linear regression model for the above data and make comments.
2. Construct the ANOVA table and test for the significance of regression.
3. Calculate R² and adjusted R².
4. Calculate t-statistics to assess the contribution of each regressor to the model. Use α = 0.05.
5. Construct a 95% confidence interval on the mean delivery time for an outlet requiring x1 = 8
cases and the distance x2 = 275 feet.

(R code for the solution)

d1=read.csv(file.choose(),header=T);d1
attach(d1)
# Fitting regression model
model1<- lm(Y~X, data=d1);model1
# ANOVA
anova(model1)
# Calculate R-square and Adj-R-square
summary(model1)
# t-statistic
t1<-t.test(X,Y);t1
Answer to the question number (1)
Assuming that the "Delivery time" (Y) to two independent variables: "Number of cases product
stocked" (X1) and "Route driver" (X2). We’ve got the multiple regression equation
Delivery time = 0.000143 + 0.722 * Number of cases product stocked + 0.0012 * Route
driver
The coefficient in front of the "Number of cases product stocked" (X1) is 0.722. This indicates
that, holding the "Route driver" constant, a one-unit increase in the "Number of cases product
stocked" is associated with an expected increase in "Delivery time" of 0.722 units. This
suggests that as the number of cases of products stocked increases, the delivery time is expected
to increase by a certain amount.
Overall, this multiple regression equation suggests that both the number of cases of products
stocked and the number of route drivers have an impact on the "Delivery time." The intercept
term and the coefficients of the independent variables provide insights into how changes in
these variables are associated with changes in the dependent variable.

Answer to the question number (2)

Source     DF   Sum of Squares   Mean Square   F-value    P-value
X          1    10157.5          10157.5       11527.35   0.000000022
x          1    0                0             0.0003     0.98
Residual   11   9.7              0.9

The two p-values are 0.000000022 and 0.98. Let's assess their significance:

A p-value of 0.000000022 is highly significant, indicating a strong likelihood that the
corresponding variable is a significant predictor of the "Delivery time", while a p-value of 0.98
is not significant, suggesting that the corresponding variable is likely not a meaningful predictor
of the "Delivery time".

Answer to the question number (3)

From the output of R programming, we have the Multiple R-Square value is 0.999 and the
Adjusted R-square value is 0.9989 which defines that the model fits the data extremely well
and that the included independent variables are highly effective in explaining the variation in
the dependent variable. In conclusion, we can say that these high values are indicative of a
well-fitting model.
Answer to the question number (4)
From the output of R programming, we can see that the t value is 0.286 and, on 11 degrees of
freedom, we obtain the value 0.87, which is greater than the calculated value.
So, we can conclude that the test is statistically insignificant.
Answer to the question number (5)
The calculated 95% confidence interval on the mean delivery time for an outlet requiring x 1 =8
cases and the distance x2 =275 feet is
95% Confidence Interval: (5.453894 , 6.758392)

The R codes for calculating 95% CI is given below:

# Given coefficients

intercept <- 0.000143

coef_x1 <- 0.722

coef_x2 <- 0.0012

# Given standard errors

se_intercept <-0.292692

se_x1 <-0.006765

se_x2 <-0.069909

sample_size <-13

# Given confidence level

confidence_level <- 0.95

# Combine the coefficient standard errors in quadrature (a rough approximation that ignores their covariances)

se_estimate <- sqrt(se_intercept^2 + (coef_x1 * se_x1)^2 + (coef_x2 * se_x2)^2)

# Calculate the margin of error

margin_of_error <- qt((1 + confidence_level) / 2, df = sample_size - 3) * se_estimate

# Calculate the predicted mean delivery time

predicted_mean <- intercept + coef_x1 * 8 + coef_x2 * 275

# Calculate the confidence interval

lower_ci <- predicted_mean - margin_of_error

upper_ci <- predicted_mean + margin_of_error

# Print the confidence interval

cat("95% Confidence Interval:", lower_ci, "to", upper_ci)


Problem 3: (Monte Carlo study): Refer to the 10 X values given in Table 2.4. Let β1 = 25
and β2 = 0.5. Assume u_i ~ N(0, 9); that is, the u_i are normally distributed with mean 0 and
variance 9.
Y      X
700    800
650    1000
900    1200
950    1400
1100   1600
1150   1800
1200   2000
1400   2200
1550   2400
1500   2600

Questions
1. Generate 100 samples using these values, obtaining 100 estimates of β1 and β2.
2. Graph these estimates.
3. What conclusions can you draw from the Monte Carlo study?
(R code for the solution)

********Monte Carlo Study********


d2<-read.csv(file.choose(),header =T );d2
attach(d2)
set.seed(1)
# Generating Random Numbers
e=rnorm(100,0,9);e
# Random Sample
x=sample(d2[,2],100,replace = T);x
# Model
y=25+0.5*x+e;y
# Linear Model
model=lm(y~x);model
summary(model)
anova(model)
plot(x,y)
abline(model)
monte.t=function() {
e=rnorm(100,0,3);e
x=sample(d2[,2],100,replace = T);x
y=25+0.5*x+e;y
model=lm(y~x);model
beta=c(beta1=coef(model)[1],beta2=coef(model)[2])
}
replicate(100,monte.t())
plot(density(beta))
curve(dnorm(x,y),0,9)

Answer to the question number (1)

[1] 2600 2600 1000 1000 1200 800 1000 1200 1200 1200 2200 2400 1000 2600
[15] 2200 2600 1400 1600 2400 1600 2000 1600 1800 1400 1000 800 1200 2200
[29] 2400 1800 800 1400 1600 2400 1600 2200 1400 800 2400 1600 800 1600
[43] 1400 2600 2600 2400 2200 1600 1600 1800 1800 1000 1000 2200 1400 2600
[57] 2200 1600 1600 2200 2200 2000 1400 1400 800 2600 1400 2400 2400 2400
[71] 2400 1800 1800 1400 1200 1200 2400 2400 2000 2400 1600 2000 1400 1400
[85] 2600 2200 800 2600 1000 2600 800 800 1400 1600 1600 1800 2400 2200
[99] 1600 800

Answer to the question number (2)

Problem 4: Child Mortality in relation to per Capita GNP and Female Literacy Rate
Consider the behavior of child mortality (CM) in relation to per capita GNP (PGNP). Consider
the data given in Table 6.4. These are cross-sectional data for 64 countries on child mortality
and a few other variables. For now, concentrate on the variables child mortality (CM) and per
capita GNP (PGNP). Keep in mind that CM is the number of deaths of children under five per
1000 live births; PGNP is per capita GNP in 1980; and the female literacy rate (FLR) is measured
in percent.
Obs   CM    FLFP   PGNP   TFR
1     128   37     1870   6.66
2     204   22     130    6.15
3     202   16     310    7
4     197   65     570    6.25
5     96    76     2050   3.81
6     209   26     200    6.44
7     170   45     670    6.19
8     240   29     300    5.89
9     241   11     120    5.89
10    55    55     290    2.36
11    75    87     1180   3.93
12    129   55     900    5.99
13    24    93     1730   3.5
14    165   31     1150   7.41
15    94    77     1160   4.21
16    96    80     1270   5
17    148   30     580    5.27
18    98    69     660    5.21
19    161   43     420    6.5
20    118   47     1080   6.12
21    269   17     290    6.19
22    189   35     270    5.05
23    126   58     560    6.16
24    12    81     4240   1.8
25    167   29     240    4.75
26    135   65     430    4.1
27    107   87     3020   6.66
28    72    63     1420   7.28
29    128   49     420    8.12

Question
a. Estimate the model and interpret the coefficients. Does the regression result make sense?
b. Obtain the ML estimators of the parameters using the likelihood ratio (LR) test. Test the
validity and make comments.
c. Compute Wald statistics and Lagrange Multiplier (LM) statistics.
d. Are all three tests asymptotically equivalent? What is the relationship between them?

Answer to the question number (1)


R codes for solution

d2<-read.csv(file.choose(),header =T );d2

attach(d2)

model<-lm(CM~FLFP+PGNP+TFR, data=d2);model
The multiple regression equation:
Child Mortality (CM) = 168.30 - 1.76 * FLFP - 0.0055 * PGNP + 12.86 * TFR
In this equation, FLFP (Female Labor Force Participation) is one of the independent variables.
For every unit increase in FLFP, the Child Mortality is expected to decrease by 1.76 units,
holding the other variables constant.
Again, PGNP (Per Capita Gross National Product) is another independent variable. For every
unit increase in PGNP, the Child Mortality is expected to decrease by 0.0055 units, holding the
other variables constant.
On the other hand, TFR (Total Fertility Rate) is the third independent variable. For every unit
increase in TFR, the Child Mortality is expected to increase by 12.86 units, holding the other
variables constant.

The constant term (168.30) represents the intercept or the Child Mortality value when all the
independent variables (FLFP, PGNP, TFR) are zero.

It's important to note that these interpretations assume a linear relationship between the
variables and that the assumptions of multiple regression (such as linearity, independence of
errors, homoscedasticity, and absence of multicollinearity) are met.

Answer to the question number (2)


R codes for solution

**************ML estimator using likelihood ratio test*********

# Simulated data

set.seed(123)

# Fit the two nested models

model_full <- lm(CM ~ FLFP+PGNP+TFR) # Full model

model_reduced <- lm(CM ~ FLFP) # Reduced model (intercept only)

# Likelihood ratio test statistic

lrt_stat <- 2 * (logLik(model_full) - logLik(model_reduced))

# Degrees of freedom difference between the models

df_diff <- df.residual(model_reduced) - df.residual(model_full)


# Calculate the p-value using chi-square distribution
p_value <- 1 - pchisq(lrt_stat, df = df_diff)
# Display results
cat("Likelihood Ratio Test Statistic:", lrt_stat, "\n")
cat("Degrees of Freedom Difference:", df_diff, "\n")
cat("P-value:", p_value, "\n")

From the R output, the likelihood ratio test statistic is 17.1786, which indicates that the full
model significantly improves the likelihood of the data compared to the reduced model.
Answer to the question no (3)

****************Wald statistics and Lagrange Multiple statistics*******

# Simulated data

set.seed(123)

# Fit the linear regression model

model <- lm(CM ~ FLFP+PGNP+TFR, data=d2);model

# Wald Test for a single coefficient (e.g., coefficient of x)

coef_estimate <- coef(model)[2] # Coefficient estimate for x

coef_std_error <- sqrt(vcov(model)[2, 2]) # Standard error of the coefficient


estimate

# Wald test statistic

wald_statistic <- (coef_estimate / coef_std_error)^2

# Degrees of freedom for the test

df_wald <- 1 # Since we're testing a single coefficient

# Calculate the p-value using the chi-square distribution

p_value_wald <- 1 - pchisq(wald_statistic, df = df_wald)


# Display results

cat("Wald Test Statistic:", wald_statistic, "\n")

cat("Degrees of Freedom:", df_wald, "\n")

cat("P-value:", p_value_wald, "\n")

# Lagrange Multiplier (LM) Test (Breusch-Pagan Test for heteroskedasticity)

# Fit the auxiliary regression to test for heteroskedasticity

model_reduced <- lm(residuals(model)^2 ~ FLFP)

# Lagrange Multiplier statistic

lm_statistic <- nobs(model) * summary(model_reduced)$r.squared

# Degrees of freedom for the test

df_lm <- 1 # Number of restrictions being tested

# Calculate the p-value using the chi-square distribution

p_value_lm <- 1 - pchisq(lm_statistic, df = df_lm)

# Display results

cat("Lagrange Multiplier (LM) Test Statistic:", lm_statistic, "\n")

cat("Degrees of Freedom:", df_lm, "\n")

cat("P-value:", p_value_lm, "\n")

From the output in R, we can see that the Wald test statistic is 50.81 and the Lagrange
Multiplier test statistic is 3.92. The large Wald statistic indicates that the coefficient is far
from zero, i.e., a strong effect.
Problem 5: Consumption Expenditure on Income and Wealth
Consider the consumption expenditure in relation to income and wealth. Consider the data
given in Table 10.5. For now, concentrate on the variables Consumption (Y), Income
(X2), and Wealth (X3).

Y     X2    X3
70    80    810
65    100   1009
90    120   1273
95    140   1425
110   160   1633
115   180   1876
120   200   2052
140   220   2201
155   240   2435
150   260   2686

Question
1. Estimate the model and interpret the coefficients. Does the regression result make sense?
2. Test the significance of regression and interpret.
3. Would you expect multicollinearity?
4. If collinearity is expected, how would you resolve the problem?

(R codes for solution)

**********Problem 5*************

d3<-read.csv(file.choose(),header=T);d3

attach(d3)

head(d3)

model3<-lm(Y~X2+X3, data=d3);model3

summary(model3)

anova(model3)

# Correlation Matrix

my_cor<-cor(d3);my_cor

round(my_cor,2)

# Eigen System Analysis

eigen(my_cor)$values

# Condition number and condition index

k<-max(eigen(my_cor)$values)/min(eigen(my_cor)$values);k

CI<-sqrt(k);CI

Answer to the question no( 1)


Here the multiple regression equation is Consumption (Y) = 24.77 + 0.94 (Income) – 0.04
(Wealth). From the coefficients alone, the model suggests that both income and wealth have
an influence on consumption. An increase in income is associated with an increase in
consumption, while an increase in wealth is associated with a slight decrease in consumption.
Answer to the question no( 2)
From the output in R, we can see the p-value is 0.02 which is less than 0.05. So, we can
conclude the model is statistically significant and the model suggests that the predictors
(Income & wealth) collectively have a meaningful impact on explaining the variation in
Consumption.
Answer to the question no( 3)

From the output in R, we can see that the Condition Index is 56.22, which implies that severe
multicollinearity exists, because the condition index is greater than 30.

Answer to the question no( 4)

Multicollinearity occurs when two or more independent variables in a regression model are
highly correlated with each other, which can lead to issues in interpreting the individual
coefficients and can affect the stability of the model. Here are several strategies to address
multicollinearity in R:

1. Feature Selection: Consider removing one or more of the correlated variables from the
model. This reduces the redundancy and can help improve model stability. You can use
techniques like domain knowledge, variable importance measures, or stepwise regression
for feature selection.

2. Combine Variables: If it makes sense in your domain, you could create composite variables
that represent a combination of correlated variables. This can help in capturing the shared
information without directly including the correlated variables.

3. Principal Component Analysis (PCA): PCA transforms correlated variables into a new set
of uncorrelated variables (principal components) that can be used in the regression.
However, the interpretation of these components might be challenging in terms of their
original meaning.

4. VIF (Variance Inflation Factor) Analysis: Calculate the VIF for each variable. High VIF
values (typically above 5 or 10) indicate high multicollinearity. If you find variables with
high VIF, consider removing one or addressing the correlation issue.

5. Partial Regression Plots: Visualize relationships between each predictor and the response
while controlling for other variables. High slopes in these plots might suggest
multicollinearity.

Problem 6: Consumption Function for United States, 1947-2000


Consider a concrete set of data on real consumption expenditure (C), real disposable
personal income (Yd), real wealth (W), and real interest rate (I) for the United States for
the period 1947-2000. The raw data are given in Table 10.7.

Question
1. Estimate the model and interpret the coefficients. Does the regression result make sense?
2. Test the significance of regression and interpret.
3. Calculate R² for this model and make comments.

Year C Yd W I
1947 976.4 1035.2 5166.815 -10.3509
1948 998.1 1090 5280.757 -4.7198
1949 1025.3 1095.6 5607.351 1.044063
1950 1090.9 1192.7 5759.515 0.407346
1951 1107.1 1227 6086.056 -5.28315
1952 1142.4 1266.8 6243.864 -0.27701
1953 1197.2 1327.5 6355.613 0.561137
1954 1221.9 1344 6797.027 -0.13848
1955 1310.4 1433.8 7172.242 0.261997
1956 1348.8 1502.3 7375.18 -0.73612
1957 1381.8 1539.5 7315.286 -0.26068
1958 1393 1553.7 7869.975 -0.57463
1959 1470.7 1623.8 8188.054 2.295943
1960 1510.8 1664.8 8351.757 1.511181
1961 1541.2 1720 8971.872 1.296432
1962 1617.3 1803.5 9091.545 1.395922
1964 1784.8 2006.9 10003.4 2.026599
1965 1897.6 2131 10562.81 2.111669
1966 2006.1 2244.6 10522.04 2.020251
1967 2066.2 2340.5 11312.07 1.212616
1968 2184.2 2448.2 12145.41 1.054986
1969 2264.8 2524.3 11672.25 1.732154
1970 2317.5 2630 11650.04 1.166228
1971 2405.2 2745.3 12312.92 -0.71224
1972 2550.5 2874.3 13499.92 -0.15574
1973 2675.9 3072.3 13080.96 1.413839
1974 2653.7 3051.9 11868.79 -1.04257
1975 2710.9 3108.5 12634.36 -3.53359
1976 2868.9 3243.5 13456.78 -0.65677
1977 2992.1 3360.7 13786.31 -1.19043
1978 3124.7 3527.5 14450.5 0.113048
1979 3203.2 3628.6 15340 1.70421
1980 3193 3658 15964.95 2.298496
1981 3236 3741.1 15964.99 4.703847
1982 3275.5 3791.7 16312.51 4.449027
1983 3454.3 3906.9 16944.85 4.690972
1984 3640.6 4207.6 17526.75 5.848332
1985 3820.9 4347.8 19068.35 4.330504
1986 3981.2 4486.6 20530.04 3.768031
1987 4113.4 4582.5 21235.69 2.819469
1988 4279.5 4784.1 22331.99 3.287061
1989 4393.7 4906.5 23659.8 4.317956
1990 4474.5 5014.2 23105.13 3.595025
1991 4466.6 5033 24050.21 1.802757
1992 4594.5 5189.3 24418.2 1.007439
1993 4748.9 5261.3 25092.33 0.62479
1994 4928.1 5397.2 25218.6 2.206002
1995 5075.6 5539.1 27439.73 3.333143
1996 5237.5 5677.7 29448.19 3.083201
1997 5423.9 5854.5 32664.07 3.12
1998 5683.7 6168.6 35587.02 3.583909
1999 5968.4 6320 39591.26 3.245271
2000 6257.8 6539.2 38167.72 3.57597

R codes for solution

***********6***********

d4<-read.csv(file.choose(),header=T);d4

attach(d4)

head(d4)

model4<-lm(C~Yd+W+I, data=d4);model4

summary(model4)

anova(model4)

Answer to the question no( 1)

The multiple regression equation is C=-20.633+0.7340Yd+0.036W-5.521 I


The slope 0.7340 suggest that if for every unit personal income increased then consumption
expenditure increased by about 0. 7340.If for every unit increased wealth then consumption
expenditure increased by about 0. 035.But for every unit increased interest rate consumption
expenditure will be decreased by about 5. 521.The intercept is -20.633 means that there the
personal income(I), real wealth (W), real interest rate (I) is zero, the real consumption
expenditure(C) is about -20.633.
Answer to the question no (2)

From the R output, the p-value of the overall F test is 0.00000024, which is less than 0.05. So we
can conclude that the regression is statistically significant: the predictors (real disposable personal
income, real wealth, and the real interest rate) collectively have a meaningful effect on real
consumption expenditure.
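The overall F statistic and its p-value can also be pulled directly out of the fitted object; a minimal sketch, assuming model4 has been fitted as in the code above:

# Extract the overall F statistic and its p-value from the summary object
fs<-summary(model4)$fstatistic          # F value, numerator df, denominator df
pf(fs[1], fs[2], fs[3], lower.tail=FALSE)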
Answer to the question no (3)

From the R output, the multiple R-squared is 0.9994. So the predictors (real disposable personal
income, real wealth, and the real interest rate) explain about 99.9% of the variation in real
consumption expenditure.
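Such a high R² with trending regressors often goes hand in hand with multicollinearity (Yd and W move closely together over time), so a quick supplementary collinearity check is natural here. A hedged sketch, assuming the car package is installed and that d4 contains the columns Yd, W and I used in model4:

# Supplementary check: correlations among the regressors and variance inflation factors
cor(d4[,c("Yd","W","I")])
library(car)      # provides vif()
vif(model4)       # VIF values above 10 are commonly taken as a sign of serious collinearity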

Problem 7: The Longley data, 1947-1962


Consider the Longley data. The data are time series for the years 1947-1962 and pertain
to Y = number of people employed, in thousands; X1 = GNP implicit price deflator;
X2 = GNP, millions of dollars; X3 = number of people unemployed, in thousands; X4 =
number of people in the armed forces; X5 = noninstitutionalized population over 14 years
of age; X6 = year, equal to 1 in 1947, 2 in 1948, and 16 in 1962. The raw data are given in
Table 10.8.
Year Y X1 X2 X3 X4 X5 X6 (TIME)
1947 60323 830 234289 2356 1590 107608 1
1948 61122 885 259426 2325 1456 108632 2
1949 60171 882 258054 3682 1616 109773 3
1950 61187 895 284599 3351 1650 110929 4
1951 63221 962 328975 2099 3099 112075 5
1952 63639 981 346999 1932 3594 113270 6
1953 64989 990 365385 1870 3547 115094 7
1954 63761 1000 363112 3578 3350 116219 8
1955 66019 1012 397469 2904 3048 117388 9
1956 67857 1046 419180 2822 2857 118734 10
1957 68169 1084 442769 2936 2798 120445 11
1958 66513 1108 444546 4681 2637 121950 12
1959 68655 1126 482704 3813 2552 123366 13
1960 69564 1142 502601 3931 2514 125368 14
1961 69331 1157 518173 4806 2572 127852 15

1. Estimate the regression coefficient.


2. Test for significance of regression.
3. Calculate the t statistics for testing the significance of the individual regression coefficients.
4. Detect multicollinearity problem.
(R codes for solution)

d5<-read.csv(file.choose(),header=T);d5

attach(d5)

model5<-lm(Y~X1+X2+X3+X4+X5, data=d5);model5

summary(model5)

anova(model5)

# t statistics for the individual coefficients are reported by summary(model5)
summary(model5)$coefficients
# Correlation Matrix

my_cor<-cor(d5);my_cor

round(my_cor,2)

# Eigen System Analysis

eigen(my_cor)$values

# Condition Number and Condition Index

k<-max(eigen(my_cor)$values)/min(eigen(my_cor)$values);k

CI<-sqrt(k);CI
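In addition to the eigenvalue and condition-index computations above, variance inflation factors can be obtained either from the car package or by hand from an auxiliary regression; a sketch, assuming model5 and d5 as defined above:

# Variance inflation factors via the car package
library(car)
vif(model5)

# Equivalently, the VIF for X2 from its auxiliary regression on the other regressors
aux<-lm(X2~X1+X3+X4+X5, data=d5)
1/(1-summary(aux)$r.squared)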

Answer to the question no(1)

The estimated multiple regression equation is:

Y = 0.122 - 2.83X1 + 0.809X2 - 2.60X3 - 1.97X4 - 1.30X5
In this equation, Y represents the number of people employed (in thousands), and X1, X2, X3,
X4, and X5 are different independent variables used to predict the number of people employed.
Let's break down the coefficients of each independent variable to understand their effects on
the number of people employed:
1. 0.122: This is the intercept term. It represents the estimated number of people employed
when all independent variables (X1 to X5) are zero. Since zero values for all the variables are
not practically meaningful, the intercept mainly serves to set the baseline of the model.

2. (-2.83X1): This term is associated with the GNP implicit price deflator (X1). The
coefficient is negative (-2.83), which means that as the GNP implicit price deflator
increases, the number of people employed (Y) is expected to decrease. The larger the
increase in X1, the larger the decrease in Y, all else being equal.

3. (0.809X2): This term is associated with GNP (X2) in millions of dollars. The coefficient
is positive (0.809), indicating that as the GNP increases, the number of people employed
(Y) is expected to increase. A larger GNP (X2) would lead to a larger Y, holding other
variables constant.

4. (-2.60X3): This term is associated with the number of people unemployed (X3) in
thousands. The coefficient is negative (-2.60), meaning that as the number of
unemployed people increases, the number of people employed (Y) is expected to
decrease. This is an intuitive relationship, as more unemployment typically corresponds
to fewer people being employed.

5. (-1.97X4): This term is associated with the number of people in the armed forces (X4).
The coefficient is negative (-1.97), suggesting that an increase in the number of people
in the armed forces would lead to a decrease in the number of people employed (Y),
assuming other factors remain constant.

6. (-1.30X5): This term is associated with the noninstitutionalized population over 14


years of age (X5). The coefficient is negative (-1.30), implying that as the population
over 14 years of age increases, the number of people employed (Y) is expected to
decrease, assuming other variables are held constant.
In summary, the multiple regression equation indicates how changes in the various independent
variables (GNP implicit price deflator, GNP, number of people unemployed, number of people
in the armed forces, and noninstitutionalized population over 14 years of age) are associated
with changes in the number of people employed (Y). The coefficients for each variable give us
insight into the direction and magnitude of these relationships.
Answer to the question no(2)

From the R output, the p-value of the overall F test is 0.00043, which is less than 0.05. So we can
conclude that the regression is statistically significant: the predictors (X1, X2, X3, X4, X5)
collectively have a meaningful impact on the number of people employed.
Answer to the question no(4)

From the R output, the condition index is 3954.68, which is far greater than the rule-of-thumb
threshold of 30. This indicates severe multicollinearity in the data.

Problem 8:
Table 10.13 gives data on imports, GDP, and the wholesale price index (WPI) for
India over the period 1980-81 to 2008-09 (the WPI serves as the price index, labelled CPI in the
model and code below). Consider the following model:
ln Imports_t = β1 + β2 ln GDP_t + β3 ln CPI_t + u_t

Questions:
a. Estimate the parameters of this model using the data given in the table.
b. Do you suspect that there is multicollinearity in the data?
c. Regress:
   1. ln Imports_t = A1 + A2 ln GDP_t
   2. ln Imports_t = B1 + B2 ln CPI_t
   3. ln GDP_t = C1 + C2 ln CPI_t
   On the basis of these regressions, what can you say about the nature of multicollinearity in the data?
d. Suppose there is multicollinearity in the data but β̂2 and β̂3 are individually significant at the 5 percent level and the overall F test is also significant. In this case should we worry about the collinearity problem?

Table 10.13 Imports, GDP at market price and WPI, 1980-81 to 2008-09

Year Imports GDP WPI
1980-81 12,549 143,762 37
1981-82 13,608 168,600 40
1982-83 14,293 188,262 41
1983-84 15,831 219,496 45
1984-85 17,134 245,515 49
1985-86 19,658 277,991 51
1986-87 20,096 311,177 54
1987-88 22,244 354,343 58
1988-89 28,235 421,567 62
1989-90 35,328 486,179 67
1990-91 43,193 568,674 74
1991-92 47,851 653,117 84
1992-93 63,375 748,367 92
1993-94 73,101 859,220 100
1994-95 89,971 1,012,770 113
1995-96 122,678 1,188,012 122
1996-97 138,920 1,368,209 127
1997-98 154,176 1,522,547 133
1998-99 178,332 1,740,985 141
1999-00 215,236 1,936,831 145
2000-01 230,873 2,089,500 156
2001-02 245,200 2,271,984 161
2002-03 297,206 2,454,561 167
2003-04 359,108 2,754,620 176
2004-05 501,065 3,149,407 187
2005-06 660,409 3,706,473 196
2006-07 840,506 4,283,979 206
2007-08 1,012,312 4,947,857 216
2008-09 1,374,436 5,574,448 234
(R codes for solutions)

#Read Data
library(readxl)
data=read_excel(file.choose())

data

attach(data)
head(data)
data=data.frame(data)
data

#Taken Log
ln_Imports=log(Imports)
ln_GDP=log(GDP)
Ln_CPI=log(WPI)

#Linear Regression Model


Model=lm(ln_Imports~ln_GDP+Ln_CPI)
Model
summary(Model)

#regress
model1=lm(ln_Imports~ln_GDP)
model2=lm(ln_Imports~Ln_CPI)
model3=lm(ln_GDP~Ln_CPI)
summary(model1)
summary(model2)
summary(model3)

# The best solution here may be to express imports and GDP in real terms by dividing each by the WPI
ln_impCpi=log(Imports/WPI)
ln_impCpi
ln_GDPcPI=log(GDP/WPI)
ln_GDPcPI
Model4=lm(ln_impCpi~ln_GDPcPI)
Model4
summary(Model4)
Answer to the question no (a)
Regressing the log of imports on the logs of GDP and the price index, we obtain the estimated
regression equation
ln Imports_t = 0.91206 - 0.00101 ln GDP_t + 2.31284 ln CPI_t

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.91206 1.21993 0.748 0.461

ln_GDP -0.00101 0.15387 -0.007 0.995

Ln_CPI 2.31284 0.29809 7.759 3.13e-08 ***

Residual standard error: 0.4871 on 26 degrees of freedom

Multiple R-squared: 0.8921, Adjusted R-squared: 0.8838


F-statistic: 107.5 on 2 and 26 DF, p-value: 2.68e-13

Answer to the question no (b)


A glance at these results suggests a multicollinearity problem: the R² value is very high, yet the
t ratio of one of the explanatory variables (lnGDP) is statistically insignificant. So there probably
is a multicollinearity problem in the data.
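This suspicion can be quantified with a short supplementary check (not part of the original solution); a sketch, assuming ln_GDP and Ln_CPI have been created as in the code above:

# Correlation between the two regressors and the implied VIF
cor(ln_GDP, Ln_CPI)
r2_aux<-summary(lm(ln_GDP~Ln_CPI))$r.squared   # auxiliary regression R-squared
1/(1-r2_aux)                                   # in a two-regressor model both VIFs equal 1/(1 - r^2)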

Answer to the question no (c)

i)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.5584 2.0279 -1.262 0.218

ln_GDP 1.0123 0.1454 6.964 1.75e-07 ***

Residual standard error: 0.8703 on 27 degrees of freedom

Multiple R-squared: 0.6423, Adjusted R-squared: 0.6291

F-statistic: 48.49 on 1 and 27 DF, p-value: 1.746e-07

Interpretation:
The regression of lnImports on lnGDP shows that the two variables are highly correlated,
suggesting that the data may suffer from a collinearity problem.

ii)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.9056 0.7158 1.265 0.217

Ln_CPI 2.3112 0.1547 14.943 1.41e-14 ***

Residual standard error: 0.478 on 27 degrees of freedom

Multiple R-squared: 0.8921, Adjusted R-squared: 0.8881

F-statistic: 223.3 on 1 and 27 DF, p-value: 1.412e-14

Interpretation:
The regression of lnImports on lnCPI shows that the two variables are also highly correlated
(R² = 0.89), again suggesting that the data may suffer from a collinearity problem.
iii)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.3552 0.9123 6.966 1.73e-07 ***

Ln_CPI 1.6444 0.1971 8.341 5.96e-09 ***

Residual standard error: 0.6092 on 27 degrees of freedom


Multiple R-squared: 0.7204, Adjusted R-squared: 0.7101

F-statistic: 69.57 on 1 and 27 DF, p-value: 5.964e-09

Interpretation:
The auxiliary regression of lnGDP on lnCPI shows that the two regressors are themselves
substantially correlated (R² = 0.72). Combined with the insignificant t ratio of lnGDP in the full
model, this suggests that collinearity between lnGDP and lnCPI is the source of the problem.

Answer to the question no (d)


If β̂2 and β̂3 are individually significant at the 5 percent level and the overall F test is also
significant, multicollinearity may not be a serious practical problem. Nevertheless, a remedy here
would be to express imports and GDP in real terms by dividing each by the price index.
The results are as follows:

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.5600 2.0360 0.766 0.4502

ln_GDPcPI 0.5762 0.2180 2.643 0.0135 *

Residual standard error: 0.8152 on 27 degrees of freedom

Multiple R-squared: 0.2056, Adjusted R-squared: 0.1761

F-statistic: 6.986 on 1 and 27 DF, p-value: 0.01351

Interpretation:
The regression of ln(Imports/WPI) on ln(GDP/WPI) shows that the two transformed variables are
not highly correlated, suggesting that expressing the data in real terms removes the collinearity
problem.

Problem 9:
Examples 11.1 & 11.2: Park Test and Glejser Test, Relationship between Compensation and
Productivity
Table 11.1 gives data on average wages per employee and labor productivity in 11
manufacturing industry groups for the year 1998-99. The data are averaged across three states
of India, namely Andhra Pradesh, Bihar and Gujarat.
Table 11.1 Wages per employee (Rs.) and Productivity (Rs.) in Manufacturing Industries in
India: 1998-99
Questions:
a. Estimate the parameters of this model and interpret the results. Do they make economic sense?
b. Would you expect the error variance in the preceding model to be heteroscedastic? Why?
c. Illustrate the Park approach to detect heteroscedasticity.
d. Illustrate the Glejser test to detect heteroscedasticity.

Wages per employee (W) Productivity (P) std
31660.94 561506.15 2527.91
39654.76 1027032.58 10082.87
16394.52 455223.97 5763.08
31139.27 687717.5 17663.02
56247.38 929562.01 8756.23
21316.48 538554.9 10311.67
21566.75 549645.86 5052.37
39175.69 749620.67 20937.34
47845.94 935242.07 38567.69
53601.11 785937.34 22479.52
24711.36 306195.84 19807.13

(R codes for solution)


#Read Data
library(readxl)
data=read_excel(file.choose())

data

attach(data)
head(data)
data=data.frame(data)
data

#Linear Model

### Glejser Test

model=lm(W~P)
model
summary(model)
ui=resid(model)
ui

absui=abs(ui)
absui
glejser=lm(absui~P)   # Glejser test: regress |residuals| on productivity
summary(glejser)

###park test
model=lm(W~P)
model
summary(model)
#Calculate Residual
res=resid(model)
res
#Squared Residual
u2=res^2
u2
# Take logs of the squared residuals and of productivity
lnu2=log(u2)
lnP=log(P)
# Park test: regress the log squared residuals on log productivity
park=lm(lnu2~lnP)
park
summary(park)

Answer to the question no (a)


Regressing wages per employee (W) on productivity (P), we obtain the estimated regression
equation
Wages = 1597.0044 + 0.0486 Productivity

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1597.0044 8601.4907 0.186 0.85682

P 0.0486 0.0120 4.051 0.00288 **

Residual standard error: 8526 on 9 degrees of freedom


Multiple R-squared: 0.6458, Adjusted R-squared: 0.6064

F-statistic: 16.41 on 1 and 9 DF, p-value: 0.002882

Interpretation:
The slope coefficient gives the change in wages for a one-rupee change in productivity: if
productivity rises by Rs. 1, average wages rise by about Rs. 0.0486. A positive relationship
between productivity and pay is what one would expect, so the estimates make economic sense.

Answer to the question no (c)

Park test:
The log of the squared residuals from the wage regression is regressed on the log of productivity,
as the Park test suggests, giving the following results:
ln û²_i = 29.25 - 0.92 ln P_i

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 29.2515 22.6025 1.294 0.228

lnP -0.9156 1.6886 -0.542 0.601

Residual standard error: 1.931 on 9 degrees of freedom

Multiple R-squared: 0.03164, Adjusted R-squared: -0.07596

F-statistic: 0.294 on 1 and 9 DF, p-value: 0.6008


Interpretation: Obviously, there is no statistically significant relationship between the two
variables. Following the Park test, one may conclude that there is no heteroscedasticity in the
error variance.

Answer to the question no(d)

Glejser Test:
The absolute values of the residuals obtained from the regression were regressed on average
productivity, giving the following results:
|û_i| = 5.093e+03 + 2.189e-03 P_i

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 5.093e+03 4.436e+03 1.148 0.281


P 2.189e-03 6.188e-03 0.354 0.732

Residual standard error: 4398 on 9 degrees of freedom

Multiple R-squared: 0.01372, Adjusted R-squared: -0.09587

F-statistic: 0.1252 on 1 and 9 DF, p-value: 0.7316

Interpretation: Obviously, there is no statistically significant relationship between the two
variables. Following the Glejser test, one may conclude that there is no heteroscedasticity in the
error variance.
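As a cross-check on the Park and Glejser results (not asked for in the problem), the studentized Breusch-Pagan test from the lmtest package could be applied to the original wage regression; a minimal sketch, assuming the columns are named W and P as in the code above (base_model is an illustrative name):

# Breusch-Pagan test on the regression of wages on productivity
library(lmtest)
base_model<-lm(W~P)
bptest(base_model)   # a large p-value would be consistent with the Park and Glejser findings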

Problem 11:

Refer to the data on the copper industry given in table 12.7.


Questions:
a. From these data estimate the following regression model: ln C_t = β1 + β2 ln I_t + β3 ln L_t + β4 ln H_t + β5 ln A_t + u_t. Interpret the results.
b. Obtain the residuals and standardized residuals from the preceding regression and plot them. What can you surmise about the presence of autocorrelation in these residuals?
c. Estimate the Durbin-Watson d statistic and comment on the nature of autocorrelation present in the data.
d. Carry out the run test and see if your answer differs from that just given in (c).
e. How would you find out if an AR(p) process better describes the autocorrelation than an AR(1) process?

Year C G I L H A
1951 21.89 330.2 45.1 220.4 1,491.00 19
52 22.29 347.2 50.9 259.5 1,504.00 19.41
53 19.63 366.1 53.3 256.3 1,438.00 20.93
54 22.85 366.3 53.6 249.3 1,551.00 21.78
55 33.77 399.3 54.6 352.3 1,646.00 23.68
56 39.18 420.7 61.1 329.1 1,349.00 26.01
57 30.58 442 61.9 219.6 1,224.00 27.52
58 26.3 447 57.9 234.8 1,382.00 26.89
59 30.7 483 64.8 237.4 1,553.70 26.85
60 32.1 506 66.2 245.8 1,296.10 27.23
61 30 523.3 66.7 229.2 1,365.00 25.46
62 30.8 563.8 72.2 233.9 1,492.50 23.88
63 30.8 594.7 76.5 234.2 1,634.90 22.62
64 32.6 635.7 81.7 347 1,561.00 23.72
65 35.4 688.1 89.8 468.1 1,509.70 24.5
66 36.6 753 97.8 555 1,195.80 24.5
67 38.6 796.3 100 418 1,321.90 24.98
68 42.2 868.5 106.3 525.2 1,545.40 25.58
69 47.9 935.5 111.1 620.7 1,499.50 27.18
70 58.2 982.4 107.8 588.6 1,469.00 28.72
71 52 1,063.40 109.6 444.4 2,084.50 29
72 51.2 1,171.10 119.7 427.8 2,378.50 26.67
73 59.5 1,306.60 129.8 727.1 2,057.50 25.33
74 77.3 1,412.90 129.3 877.6 1,352.50 34.06
75 64.2 1,528.80 117.8 556.6 1,171.40 39.79
76 69.6 1,700.10 129.8 780.6 1,547.60 44.49
77 66.8 1,887.20 137.1 750.7 1,989.80 51.23
78 66.5 2,127.60 145.2 709.8 2,023.30 54.42
79 98.3 2,628.80 152.5 935.7 1,749.20 61.01
80 101.4 2,633.10 147.1 940.9 1,298.50 70.87

(R codes for solution)

#Read Data
library(readxl)
data=read_excel(file.choose())

data

attach(data)
head(data)
data=data.frame(data)
data
H
class(H)
H=as.integer(H)
H
class(H)
#Taken Log
lnCt=log(C)
lnCt
lnIt=log(I)
lnIt
lnLt=log(L)
lnLt
lnHt=log(H)
lnHt
lnAt=log(A)
lnAt
#Linear Model
m=lm(lnCt~lnIt+lnLt+lnHt+lnAt)
m
summary(m)
#Calculate Residuals
res=resid(m)
res
#Standard Residuals
standard_res=rstandard(m)
standard_res
data.frame(res,standard_res)
plot(res,standard_res)
#Install Packages
install.packages('lmtest')
library('lmtest')
# Perform the Durbin-Watson test
dwtest(m)
# Alternatively, carry out the Durbin-Watson test with the car package
library(car)
durbinWatsonTest(m)
#perform Breusch-Godfrey test
bgtest(m,order = 1,data=data)
bgtest(m,order = 2,data=data)
bgtest(m,order = 3,data=data)
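The code above does not carry out the run test asked for in question (d). A hand-rolled version based on the signs of the residuals is sketched below (normal approximation to the distribution of the number of runs; res is the residual vector computed above):

# Run test on the signs of the residuals (normal approximation)
s<-sign(res[res!=0])                    # +1/-1 pattern of the residuals
n1<-sum(s>0); n2<-sum(s<0); n<-n1+n2
runs<-1+sum(s[-1]!=s[-length(s)])       # observed number of runs
mu<-2*n1*n2/n+1                         # expected number of runs under randomness
v<-2*n1*n2*(2*n1*n2-n)/(n^2*(n-1))      # variance of the number of runs
z<-(runs-mu)/sqrt(v)
2*pnorm(-abs(z))                        # two-sided p-value; a small value points to non-randomness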

Answer to the question no(a)


Regressing ln C_t on the four regressors ln I_t, ln L_t, ln H_t and ln A_t, we obtain the estimated
regression equation
ln C_t = -1.64967 + 0.42354 ln I_t + 0.30653 ln L_t + 0.02025 ln H_t + 0.44063 ln A_t

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.64967 0.53971 -3.057 0.005423 **

lnIt 0.42354 0.17499 2.420 0.023443 *

lnLt 0.30653 0.12491 2.454 0.021767 *


lnHt 0.02025 0.06066 0.334 0.741371

lnAt 0.44063 0.10686 4.123 0.000386 ***

Residual standard error: 0.1233 on 24 degrees of freedom

(1 observation deleted due to missingness)

Multiple R-squared: 0.9325, Adjusted R-squared: 0.9212


F-statistic: 82.85 on 4 and 24 DF, p-value: 1.097e-13

Interpretation: The coefficients of ln I, ln L and ln A are statistically significant and have an
economically meaningful impact on C, whereas the coefficient of ln H is not statistically
significant.

Answer to the question no (b)


Residuals Standard Residuals

1 0.023221931 0.2103062

3 -0.244654772 -2.1208100

4 -0.105724521 -0.9115771

5 0.133004568 1.2856635

6 0.217517839 1.8832038

7 0.065299848 0.5822464

8 -0.069953503 -0.6055174

9 0.031972362 0.2748056

10 0.054325844 0.4706137

11 0.033478166 0.2930961

12 0.005092719 0.1067780

13 0.043591106 0.3952097

14 -0.067973587 -0.5688233

15 -0.130943477 -1.1051998
16 -0.181224048 -1.5785523

17 -0.061124395 -0.5291117

18 -0.081443504 -0.6947837

19 -0.050786470 -0.4360024

20 0.149156767 1.2589808

21 0.104284912 0.8876354

22 0.097347239 0.8847658

23 0.076340437 0.7012328

24 0.160041001 1.4119118

25 0.087774845 0.7538771

26 -0.031058589 -0.2660261

27 -0.150567847 -1.3214143

28 -0.189168137 -1.6936295

29 0.048761866 0.4471027

30 0.033411400 0.3281660

Residual vs. standardized residual plot (figure omitted)

Interpretation: Plotted in observation order, the residuals and standardized residuals do not
scatter randomly around zero; they show sustained runs of positive values followed by runs of
negative values. This pattern suggests the presence of autocorrelation in the residuals.
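A time-ordered plot is usually the clearest way to see these runs; a small sketch, assuming res and standard_res are the vectors computed above:

# Plot the residuals and standardized residuals against observation order
plot(res, type="b", xlab="Observation", ylab="Residual"); abline(h=0, lty=2)
plot(standard_res, type="b", xlab="Observation", ylab="Standardized residual"); abline(h=0, lty=2)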
Answer to the question no (c)

Durbin-Watson test

data: m

DW = 1.0588, p-value = 0.0004505

alternative hypothesis: true autocorrelation is greater than 0


Interpretation: The Durbin-Watson test gives d = 1.0588. For n = 30, k = 4 and α = 0.05, the lower
critical value is d_L = 1.138. Since the computed d value lies below this lower limit, there is
evidence of positive autocorrelation.
Answer to the question no (d)

Breusch-Godfrey test:

Breusch-Godfrey test for serial correlation of order up to 1

data: m

LM test = 6.8851, df = 1, p-value = 0.008692


Breusch-Godfrey test for serial correlation of order up to 2

data: m

LM test = 9.8954, df = 2, p-value = 0.0071

Breusch-Godfrey test for serial correlation of order up to 3

data: m

LM test = 9.9135, df = 3, p-value = 0.01932


When the alternative is a first-order autoregressive scheme, AR(1), the BG test is also known as
Durbin's M test. A drawback of the BG test is that the value of p, the lag length, cannot be
specified a priori. Here the test already rejects the null of no serial correlation at order 1, and
adding higher-order lags does not strengthen the evidence, so an AR(1) scheme appears adequate
to describe the autocorrelation.

Problem 12:
The data are given in table 12.8.

Questions:
a. Verify that the Durbin-Watson d = 1.761.
b. Is there positive serial correlation in the disturbances?
c. If so, estimate ρ by
   1) the Theil-Nagar method,
   2) the Durbin two-step procedure,
   3) the Cochrane-Orcutt method.
d. Use the Theil-Nagar method to transform the data and run the regression on the transformed data.
e. Does the regression estimated in (d) exhibit autocorrelation? If so, how would you get rid of it?

Year Y X Estimated Y Residuals
1950-51 201,090 1 202,595.34 -1,505.34
1951-52 213,872 2 211,471.16 2,400.84
1952-53 222,503 3 220,346.98 2,156.02
1953-54 235,879 4 229,222.79 6,656.21
1954-55 243,617 5 238,098.61 5,518.39
1955-56 245,946 6 246,974.43 -1,028.43
1956-57 256,826 7 255,850.24 975.76
1957-58 251,753 8 264,726.06 -12,973.06
1958-59 274,864 9 273,601.88 1,262.12
1959-60 277,991 10 282,477.69 -4,486.69
1960-61 293,804 11 291,353.51 2,450.49
1961-62 298,813 12 300,229.32 -1,416.32
1962-63 302,706 13 309,105.14 -6,399.14
1963-64 313,966 14 317,980.96 -4,014.96
1964-65 332,722 15 326,856.77 5,865.23
1965-66 333,017 16 335,732.59 -2,715.59
1966-67 337,344 17 344,608.41 -7,264.41
1967-68 356,429 18 353,484.22 2,944.78
1968-69 365,792 19 362,360.04 3,431.96
1969-70 379,378 20 371,235.86 8,142.14

(R codes for solution)


#Read Data
library(readxl)
data=read_excel(file.choose())

data

attach(data)
head(data)
data=data.frame(data)
data
m=lm(Y~X)
m
summary(m)
library(lmtest)
dwtest(m)
plot(m)
#chochrane-orcutt
# estimate rho (correlation coefficient)

e=residuals(m)
e
resid(m)
et= e[2:20]
et
et1 <- e[1:(20-1)]
rhohat <- sum(et*et1) / sum(e^2)
rhohat
# Quasi-difference the series: y*_t = Y_t - rho*Y_(t-1), x*_t = X_t - rho*X_(t-1)
n <- length(Y)
y <- Y[-1] - rhohat*Y[-n]
x <- X[-1] - rhohat*X[-n]
x
y
Model12<- lm(y ~ x)
Model12
# Cochrane-Orcutt correction (requires the orcutt package)
library(orcutt)
ccl = cochrane.orcutt(m)
ccl
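The code above only implements the Cochrane-Orcutt route. For parts (c)(1) and (d), the Theil-Nagar estimate of ρ and the regression on the transformed data could be sketched as follows (rho_tn, y_star and x_star are illustrative names; n is the sample size and k the number of estimated coefficients including the intercept):

# Theil-Nagar estimate of rho and regression on the quasi-differenced data
d_stat<-dwtest(m)$statistic                     # Durbin-Watson d
n<-length(Y); k<-2                              # intercept and slope
rho_tn<-(n^2*(1-d_stat/2)+k^2)/(n^2-k^2)
rho_tn
y_star<-Y[-1]-rho_tn*Y[-n]                      # Y_t - rho*Y_(t-1)
x_star<-X[-1]-rho_tn*X[-n]                      # X_t - rho*X_(t-1)
summary(lm(y_star~x_star))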

Answer to the question no (a)


The regression results show that:

Ŷ = 193719.5 + 8875.8X

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 193719.5 2510.8 77.16 <2e-16 ***

X 8875.8 209.6 42.35 <2e-16 ***

Residual standard error: 5405 on 18 degrees of freedom

Multiple R-squared: 0.9901, Adjusted R-squared: 0.9895


F-statistic: 1793 on 1 and 18 DF, p-value: < 2.2e-16

Durbin-Watson test

data: m

DW = 1.7613, p-value = 0.2098

alternative hypothesis: true autocorrelation is greater than 0


(verified)

Answer to the question no(b)


For n = 20, k = 1 and α = 0.05, the lower and upper critical values are d_L = 1.20 and d_U = 1.41.
Since the computed d = 1.761 exceeds d_U (and the p-value of the Durbin-Watson test is 0.21),
there is no evidence of positive serial correlation in the disturbances at the 5 percent level.
