Data Analysis
Dr. Shahab Aziz
Bahria University, Islamabad
Outline
• Regression
• Assumptions of regressions
• Time series analysis
• Descriptive Analysis
• Diagnostic tests
• Graphs
• Panel Data Analysis
• Importing results
• Download data from WDI and perform Panel Data Analysis
Types of Data
• Primary Data [ Survey ]
• Collected by researcher
Panel Data
Cross-Sectional Data
Cross-sectional data, or a cross section of a study population, is in statistics and econometrics a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at one point or period of time.
Example: GDP of three developing countries in 1990.
Regression and Correlation
Regression
• Most frequently used technique in research.
• Analyses the relationship between independent and dependent variables.
• The dependent variable is the outcome variable.
• The independent variable is used to explain or predict that outcome.
• Test whether one independent variable, or a set of independent variables, has a significant relationship with the dependent variable.
• Make predictions.
• The 'independent' variable X is usually called the regressor (there may be one or more of these); the 'dependent' variable Y is the response variable.
• Least Squares Method:
• The regression equation of X on Y is: X = a + bY
• Where,
• X = dependent variable and Y = independent variable
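The least squares method can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the slides: the function name fit_line and the income/consumption figures are hypothetical, and the line is written as dependent = a + b * independent.

```python
def fit_line(independent, dependent):
    """Return (a, b) that minimise the sum of squared errors of
    dependent = a + b * independent."""
    n = len(independent)
    mean_x = sum(independent) / n
    mean_y = sum(dependent) / n
    # Slope: co-variation of the two variables over variation of the regressor
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(independent, dependent))
    sxx = sum((x - mean_x) ** 2 for x in independent)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data, invented for illustration only
income = [10, 20, 30, 40, 50]
consumption = [9, 17, 28, 35, 46]

a, b = fit_line(income, consumption)
prediction = a + b * 60  # predicted consumption at income = 60
print(round(a, 2), round(b, 2), round(prediction, 2))
```

The same a and b are what a regression command reports as the constant and the coefficient of the independent variable.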
1. Model should be linear in parameters.
6. Error term has constant variance (homoscedasticity); a violation is called heteroscedasticity.
Y = a + bX
3. None of the independent variables is correlated with e.
Suppose Y = a + bX + e.
If there is positive correlation between X and e, then as X increases e also increases. The coefficient b will then be overestimated, because it is showing the effect of the error term as well.
5. Mean of the error term is zero.
Gauss-Markov Theorem: If the 6 assumptions hold, the estimates of the parameters a and b will be BLUE (best linear unbiased estimators).
7. The error term is normally distributed.
Central Limit Theorem: As the sample size increases, the sampling distribution of the estimates approaches the normal distribution.
Normality is not required for estimation, as the 6 assumptions are enough. However, it is important for this assumption to hold when checking significance.
How to Enter Time Series Data
How to Enter Panel Data
Three Countries
Command for entering variables
1 is for country 1
2 for country 2
3 for country 3
Importing Data from Excel
Multicollinearity: Its Detection and Removal
Independent variables should not be highly correlated.
Y = a + bX1 + cX2
Multicollinearity Explained
SC = a + b1 PM + b2 FI
(SC = student consumption, PM = pocket money, FI = father's income)
Pocket money is also related to father's income. When income increases, pocket money also increases, and hence SC increases. SC is therefore affected not only through b1 but also through b2, and the two effects are hard to separate.
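When the independent variables (IVs) are correlated with each other, it is called multicollinearity. This is a violation of the assumption.
VIF = 1 / (1 - R2), where R2 comes from regressing one independent variable on the other independent variables.
VIF is related to the independent variables only.
Y = a + b1X + b2Z + b3A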
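The VIF formula can be illustrated with a short Python sketch for the simplest case of two independent variables, where the R2 of regressing one on the other equals their squared correlation. The function names and the pocket money / father's income figures are hypothetical and chosen only to show a high VIF.

```python
def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - R2). With only two regressors, R2 of x1 on x2
    is simply the squared correlation between them."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


# Hypothetical data, invented for illustration only
pocket_money = [5, 6, 8, 9, 12]
fathers_income = [50, 55, 80, 85, 120]

vif = vif_two_regressors(pocket_money, fathers_income)
print(round(vif, 1))  # far above 10, so multicollinearity is a concern
```

With more than two independent variables, R2 has to come from a full auxiliary regression of each variable on all the others, which is what statistical packages do when they report VIF.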
Leave the variables as they are if VIF is below the chosen threshold (10, 5, or 3.3).
Autocorrelation: Detection and Removal
Autocorrelation
• Assumption: the observations of the error term are independent of each other, i.e. they are not correlated with each other.
• When two variables move closely together, we call it correlation.
• When one variable is correlated with its own lagged value, we call it "autocorrelation".
• GDP, savings, income, consumption, investment etc. will mostly be correlated with their previous values.
X     X t-1
10    8
20    18
30    27
40    38
50    47
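The idea of a variable being correlated with its own lagged value can be sketched in Python as follows. The series is hypothetical, and lag1_autocorrelation is an illustrative helper, not a command from EViews, Stata or R.

```python
def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def lag1_autocorrelation(series):
    """Correlation between X(t) and X(t-1)."""
    current = series[1:]   # X(t), starting from the second observation
    lagged = series[:-1]   # X(t-1), the previous period's value
    return pearson_r(current, lagged)


# Hypothetical GDP-like series that grows every period
gdp = [100, 104, 109, 113, 120, 126]

r = lag1_autocorrelation(gdp)
print(round(r, 3))  # close to 1: each value is strongly tied to the previous one
```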
• Y = a + b1X + b2Z + b3A + e
• e is the error term: it captures the variables which are not part of the model.
• Error means a mistake that is not intentional; it should not be systematic.
• The error should be random; there should not be any trend in the error term.
• For example:
• Y = a + b1X + e. There was a variable Z related to Y that we have not included in the model. Z is omitted and is now a part of the error term, although Z was important for Y.
Autocorrelation
• If the variables themselves are autocorrelated, it is not an issue.
• Z is in the error term.
• Z is a time series and has autocorrelation.
• Z is systematic, which creates a trend. As Z sits in the error term, we see the error term correlated with its own lagged values: AUTOCORRELATION.
• There is an issue when the error terms are correlated, because this violates the assumption that the error should be random.
• The variables, however, can be autocorrelated.
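A variable which was significant before may become insignificant due to autocorrelation; hence the estimates are not reliable.
Y - Ŷ = error [actual - estimated = error]
Autocorrelation can be positive (+ Auto) or negative (- Auto). Under positive autocorrelation a positive error tends to be followed by another positive error (e = +, e = +).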
How to Detect Serial Correlation
LM Test
H0: no autocorrelation. H1: autocorrelation exists [if P > 0.05, H1 is rejected].
If the DW value is well away from 2 and the LM test confirms it (the null hypothesis is rejected), then autocorrelation exists.
Range of DW: 0 to 4. A value near 2 indicates no autocorrelation.
If the prob. value > 0.05, we cannot reject H0. This shows no autocorrelation.
If the prob. value < 0.05, we reject H0. This shows autocorrelation in the regression model.
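How the DW value behaves can be seen from a small Python sketch of the Durbin-Watson statistic, the sum of squared changes in the residuals divided by the sum of squared residuals. Both residual series below are hypothetical and chosen to show the two extremes.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); lies between 0 and 4."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift slowly: positive autocorrelation
drifting = [1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.9]
# Hypothetical residuals that flip sign every period: negative autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_positive = durbin_watson(drifting)
dw_negative = durbin_watson(alternating)
print(round(dw_positive, 2), round(dw_negative, 2))
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation; the LM test then gives the formal probability value.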
1. Bring the omitted variable into the model.
2. Cochrane-Orcutt Method / GLS
3. AR(1)
Removal
Addition of the relevant variable
Take different income groups and observe their consumption, then draw a regression line. There will be some errors, i.e. differences between the actual and estimated values. For each group the errors have a certain distribution; plot it vertically. If the variance is constant for all groups, this is called HOMOSCEDASTICITY.
If the variances are not constant, the assumption is violated. This is called HETEROSCEDASTICITY.
For example, there are three cities whose populations range from low to high. Tax revenues will differ for each city. A larger population brings more variance; with a smaller population the deviation will be less. So there must be a factor due to which the variance is not constant. This factor may or may not be part of the model, and it is called the "PROPORTIONAL FACTOR".
The coefficient B has a distribution with some mean and standard error. Variation in the error term does not change the mean value, but it changes the deviation. Beta remains the same; however, its variation might increase or decrease, which changes its distribution, as with autocorrelation. So there will be a problem with the significance of a variable, the way it happens under AUTOCORRELATION. The estimates are not reliable.
With cross-sectional data, for example, take 100 countries and find the relationship between consumption and income. Every entity will have an error, just as in a time series there is an error for every time period. If the variance is driven by population, the plot of the errors fans out: as population increases, the error increases. The factor can be anything, for example income.
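The difference between constant and non-constant error variance can be sketched in Python by comparing the variance of the residuals group by group. This is only an informal check, not a formal heteroscedasticity test, and the residuals and group names below are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical regression residuals, grouped by income level
residuals_by_group = {
    "low income": [-1.0, 0.5, 0.5, 1.0, -1.0],
    "middle income": [-3.0, 2.0, 1.0, 3.0, -3.0],
    "high income": [-6.0, 5.0, 1.0, 6.0, -6.0],
}

group_variances = {
    group: variance(residuals) for group, residuals in residuals_by_group.items()
}
# Ratio of the largest to the smallest group variance; near 1 under homoscedasticity
ratio = max(group_variances.values()) / min(group_variances.values())

for group, var in group_variances.items():
    print(group, round(var, 2))
print("largest / smallest variance:", round(ratio, 1))
```

Here the error variance grows with income, which is the fanning-out pattern described above.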
CORRELATION
REGRESSION
Plot Data / Draw Graphs
• plot gdp fdi inf smc to (then press Enter)
Summary statistics
• cor gdp fdi inf smc to (then press Enter)
Histogram
• hist gdp (then press Enter)
Regression
• ls gdp c fdi inf smc to (then press Enter)
Regression
Stata
Importing The Data
REGRESSION
• regress GDP INF TO SMC FDI (then press Enter)
Summary Statistics
• summarize
Correlation
Panel Data Analysis in Stata
• Basics about panel Data
• Model estimation
• Diagnostic check
• Selection of appropriate model
Models of Panel Data
Copy and Paste Directly
• Data can be copied from Excel and pasted into Stata.
• First copy from Excel, then click on the Data Editor in Stata and simply paste the data.
Analysis
Pooled OLS
Normality Test
Skewness and kurtosis should lie between -1 and +1.
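The skewness and kurtosis rule can be sketched in Python using the moment formulas. This sketch assumes the rule refers to excess kurtosis (a normal distribution scores 0); software that reports raw kurtosis (a normal distribution scores 3) needs 3 subtracted first. The function name and the sample are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical, roughly bell-shaped sample
sample = [2, 4, 4, 5, 5, 5, 6, 6, 8]

skew, kurt = skewness_and_excess_kurtosis(sample)
within_rule = -1 < skew < 1 and -1 < kurt < 1
print(round(skew, 3), round(kurt, 3), within_rule)
```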
Multicollinearity Test
Multicollinearity: VIF should be below the chosen threshold (3.3, 5, or 10).
Heteroscedasticity
If the probability value > 0.05, we cannot reject the null hypothesis, which means that there is no heteroscedasticity.
• Every cross section has some unique characteristics; therefore pooled regression is not a suitable choice.
• Moving towards
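Why pooled regression can mislead when every cross section has its own characteristics can be sketched in Python. The data are hypothetical: three countries share the same slope of 2 but have different intercepts. The within_slope helper subtracts each country's own means before estimating, which is the idea behind a fixed effects model; it is an illustration, not Stata's implementation.

```python
from collections import defaultdict


def pooled_slope(x, y):
    """Slope from one regression that ignores which country a row belongs to."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def within_slope(entity, x, y):
    """Slope after removing each entity's own mean from x and y."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for g, a, b in zip(entity, x, y):
        totals[g][0] += a
        totals[g][1] += b
        totals[g][2] += 1
    means = {g: (sx / n, sy / n) for g, (sx, sy, n) in totals.items()}
    x_dm = [a - means[g][0] for g, a in zip(entity, x)]
    y_dm = [b - means[g][1] for g, b in zip(entity, y)]
    return sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)


# Hypothetical panel: y = country intercept + 2 * x, intercepts 10, 0, -10
country = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [12, 14, 16, 8, 10, 12, 4, 6, 8]

pooled = pooled_slope(x, y)
within = within_slope(country, x, y)
print(pooled, within)  # the pooled slope even has the wrong sign
```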
• The first step is to declare the data as panel data in Stata.
The syntax command can also be used.
The result needs to be stored first, before moving on to the random effects model.
Random Effect Model
The syntax command can also be used.
• Final Model Selection Criterion
• Fixed or random
Hausman Test
If the probability value < 0.05, we reject the null hypothesis that the random effects model is appropriate, which means the fixed effects model is the appropriate model.
How to Import Results
Install asdoc using this command.
Click on this and the command will be generated. Add asdoc at the beginning of the command and press Enter.
Command for correlation
Command for correlation in a Word file
Graphs
Analysis in R
Interface
Import Data
Attach data
• attach(Time_Series_Data)
Draw Graph
plot(gdp,fdi,main="graph")
Correlation
cor(gdp,fdi)
Multiple Regression
Download Data from WDI & Perform Panel Data Analysis
CLEANING OF THE DATA
Replace double dots (..) with a single dot (.).
Double dots mark missing values, and Stata does not recognise them.
The data are in string form, so we need to "destring" all the variables.
Run the "destring" command for all the variables.
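The cleaning step can be sketched in Python to show what it does to each cell: double dots become missing values and everything else is converted from text to a number. The function name to_number and the sample column are hypothetical; in Stata the same job is done with the replacement and "destring" steps described above.

```python
def to_number(cell):
    """Convert one downloaded cell from text to a float.
    Double dots (and empty cells) are treated as missing and returned as None."""
    text = cell.strip()
    if text in ("..", ""):
        return None
    return float(text)


# Hypothetical column as it might arrive in string form
raw_gdp_growth = ["3.2", "..", "4.75", " 5.1 ", ".."]

cleaned = [to_number(cell) for cell in raw_gdp_growth]
missing_count = sum(1 for value in cleaned if value is None)
print(cleaned, missing_count)
```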
PANEL DATA ANALYSIS
Setting the data as panel
Fixed effect
Save Results
Random Effect
Save Results
To choose between fixed or random