Stata Application Part I
Variables window - all variables in the currently open dataset will appear here
Note that the double equal sign '==' is used to test for equality, while the single
equal sign '=' is used when assigning values (e.g. male = 1)
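As a quick illustration of the two operators (the variable names male and sex are hypothetical):

```stata
generate male = 1                 // '=' assigns the value 1 to every observation
replace male = 0 if sex == "F"    // '==' tests each observation for equality
```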
Ways of running Stata
There are two ways to operate Stata:
Interactive mode: Commands can be typed directly into the Command window and
executed by pressing Enter.
Batch mode: Commands can be written in a separate file (called a do-file) and executed
together in one step.
We will use interactive mode for our present discussion.
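For reference, a minimal do-file for batch mode might look like this (the file name myanalysis.do is illustrative); it is run in one step with the command do myanalysis:

```stata
* myanalysis.do -- executes all commands together in one step
log using myanalysis.log, replace
use achievement, clear
summarize
log close
```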
Stata commands
Stata syntax is case sensitive. All Stata command names must be in
lower case.
Stata commands can be grouped as:
Preliminary commands/Commands to examine dataset
Data Management and Analysis commands
Graph construction commands
Post-estimation commands
Preliminary Commands
Adjusting memory size
If you use large datasets you may have to increase the memory the computer
reserves for Stata from the default of 1 megabyte
Type the command: set memory 30m, which may be abbreviated set mem 30m
(allocates 30 megabytes to Stata).
Other suffixes: k for kilobytes, g for gigabytes
Be careful: if you allocate too much memory to Stata, processing may
become very slow
If you want all the results to be displayed without pausing, type: set more off,
permanently. This ensures Stata executes all commands without stopping.
Otherwise, if your output is long, the Results window fills up and Stata
displays --more-- at the bottom, pausing until you press a key
clear - clears the memory before loading the requested data file. It deletes all
data, variables, and labels from memory to get ready for a new data file.
However, it does not delete any data saved to the hard drive.
exit - closes Stata
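The preliminary commands together (the 30m allocation is just an example; from Stata 12 onward memory is managed automatically, so set memory is only needed in older versions):

```stata
clear                        // drop any data currently in memory
set memory 30m               // reserve 30 MB (older Stata versions only)
set more off, permanently    // never pause output with --more--
```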
Stata datasets always have the extension .dta.
Dataset Management commands
Stata Results window does not keep all the output you generate.
It only stores about 300-600 lines, and when it is full, it begins
to delete the old results as you add new results.
Thus, we need to use a log file to save the output (especially if you are using
a do-file), or copy the output of interest to MS Word.
To save a graph, select File > Save Graph.
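A typical logging session (the log file name is illustrative):

```stata
log using results.log, replace    // start recording output as plain text
summarize
log close                         // stop recording
```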
Practice
Suppose we want to estimate
yi = b0 + b1x1i + b2x2i + b3x3i + ei
Observe that meals has 85 missing values, and acs_k3 has an unusual minimum value,
which is negative (-21).
Focus on acs_k3.
summarize acs_k3, detail
tab acs_k3
Let's continue checking the data
histogram acs_k3
Multiple regression
The model: y = x'b + e,
where y = academic performance of the school (vector), and x' = a vector of independent variables
Use the data under file name: achievement
Estimate using the command below and interpret the results:
reg api00 acs_k3 meals full
reg api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
hist enroll
Post-estimation commands
Some post estimation commands:
predict yhat - creates a variable of predicted values called yhat
predict e, resid - creates a residual variable called e (the option may be spelled resid or residual)
generate e2=e*e - creates the squared residuals
mfx - provides marginal effects
scatter y yhat x - plots the variables named y and yhat versus x
scatter resids x - plots your residuals versus each of your x-variables
rvfplot - displays the residuals against the fitted (predicted) values. This
can be useful in checking for outliers, non-normality, non-linearity, etc.
rvpplot explanvarlist - graphs a residual-versus-individual-predictor plot
avplot varlist - detects unusual and influential data. It works not only for
the variables in the model but also for variables that are not in the model,
which is why it is called an added-variable plot.
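A sketch of a post-estimation sequence for the achievement data:

```stata
reg api00 acs_k3 meals full
predict yhat                 // fitted values
predict e, resid             // residuals
generate e2 = e*e            // squared residuals
rvfplot                      // residuals vs fitted values
avplot acs_k3                // added-variable plot for one predictor
```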
Regression Diagnostics
Testing the residuals for normality
H0: residuals are normally distributed - reject or fail to reject it based on the test
We can use the skewness-kurtosis test for normality. The command for the test is: sktest e
Shapiro-Wilk W test for normality (swilk e) - tests the null hypothesis that the
residuals are drawn from a normal distribution.
kdensity e, normal - the kdensity command (with the normal option) shows the
distribution of the residuals together with a normal overlay. It creates a "kernel density
plot", which is an estimate of the pdf that generated the data. The normal option overlays
a normal probability distribution with the same mean and variance. Here the results look
pretty close to normal
The pnorm command produces a normal probability plot and is another method of
testing whether the residuals from the regression are normally distributed (pnorm e).
qnorm command plots the quantiles of a variable against the quantiles of a normal distribution
The qnorm plot is more sensitive to deviations from normality in the tails of the distribution,
whereas the pnorm plot is more sensitive to deviations near the mean of the distribution
One solution is to choose robust statistics that are not sensitive to outliers, such as
the median over the mean (e.g. median regression via qreg); the regress command's
robust option instead makes the standard errors robust
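The normality checks above in sequence, assuming the residuals are saved as e:

```stata
reg api00 acs_k3 meals full
predict e, resid
sktest e               // skewness-kurtosis test for normality
swilk e                // Shapiro-Wilk W test
kdensity e, normal     // kernel density with a normal overlay
pnorm e                // normal probability plot
qnorm e                // quantile-normal plot
```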
Cont’d---
Tests to detect specification errors
linktest - creates two new variables, the variable of prediction, _hat, and
the variable of squared prediction, _hatsq. The model is then refit using these
two variables as predictors. _hat should be significant since it is the predicted
value. On the other hand, _hatsq should not be, because if our model is specified
correctly, the squared predictions should not have much explanatory power.
Ramsey's (1969) regression equation specification error test (ovtest or
ovtest, rhs) is a test of functional form
The ovtest command with the rhs option tests whether higher order trend
effects (e.g. squared, cubed) are present but omitted from the regression
model.
H0 is that there are no omitted variables (no significant higher-order
trends). If the test is significant, it suggests there are higher-order trends in
the data that we have overlooked.
NB: ovtest tests the significance of powers of the fitted values of the dependent
variable, while ovtest, rhs tests the significance of powers of the explanatory variables
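For example, with the achievement regression (rerun the model before each test, since linktest itself refits a model):

```stata
reg api00 acs_k3 meals full
linktest              // _hat should be significant, _hatsq should not
reg api00 acs_k3 meals full
ovtest                // RESET using powers of the fitted values
ovtest, rhs           // RESET using powers of the regressors
```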
Cont’d---
Checking Linearity
If the assumption of linearity is violated, the linear regression will try to fit a line to data
that does not follow a straight line. All we have to do is a scatter plot between the response
variable and the predictor to see if nonlinearity is present, such as a curved band or a big
wave-shaped curve.
We use the scatter y x command to show a scatterplot predicting y from x, lfit to
show a linear fit, and lowess to show a lowess smoother predicting y from x:
reg y x
twoway (scatter y x) (lfit y x) (lowess y x)
reg api00 enroll
twoway (scatter api00 enroll) (lfit api00 enroll) (lowess api00 enroll)
Checking the linearity assumption is not so straightforward in the case of multiple
regression. The most straightforward thing to do is to plot the standardized residuals
against each of the predictor variables in the regression model.
predict r, resid, then use the command acprplot (graphs an augmented component-
plus-residual plot, a.k.a. augmented partial residual plot).
acprplot x, lowess lsopts(bwidth(1)), then see whether the smoothed line is
close to the ordinary regression line or not.
cprplot - graphs a component-plus-residual plot, a.k.a. partial residual plot.
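Applied to the achievement data (using predictors that appear elsewhere in these notes):

```stata
reg api00 meals some_col
predict r, resid
acprplot meals, lowess lsopts(bwidth(1))   // augmented component-plus-residual plot
cprplot some_col                           // component-plus-residual plot
```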
Cont’d---
Heteroskedasticity tests: Use consumption data
A graphical test of heteroskedasticity: rvfplot (or rvpplot) with the yline(0) option
to put a reference line at y = 0. But it doesn't tell us which residuals are outliers.
Goldfeld-Quandt test
Cook-Weisberg test, by the command hettest - judge the result using the p-value.
estat imtest heteroskedasticity test (White's test) - Cameron and Trivedi (1990);
it also includes tests for higher-order moments of the residuals (skewness and
kurtosis): imtest or imtest, white
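For instance (the consumption-data variable names cons and income are assumptions):

```stata
reg cons income
rvfplot, yline(0)      // graphical check: residuals vs fitted, reference line at 0
hettest                // Cook-Weisberg test; a small p-value indicates heteroskedasticity
imtest, white          // White's test plus skewness/kurtosis components
```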
Multicollinearity test:
vif - for continuous variables
reg api00 acs_k3 avg_ed grad_sch col_grad some_col
vif
reg api00 acs_k3 grad_sch col_grad some_col
vif
CC in SPSS - for discrete variables
Cont’d---
Testing the residuals for Autocorrelation: Use consumption data
One can use the command dwstat after the regression to obtain the Durbin-
Watson d statistic to test for first-order autocorrelation. There are two steps to
follow:
1. Using the tsset command, declare that the data are time series and specify the
time variable (e.g. tsset year, yearly or tsset month, monthly), then regress y x
2. Type dwstat, then press Enter
For example, the following result is obtained after these two steps on the
consumption data. Since the DW d-statistic is less than 2 and also less than the lower
limit of the DW table value (1.20 at k=1, n=20 and α=5%), there is some degree of
positive autocorrelation in the data.
. dwstat
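A sketch of the two steps (the variable names cons and income for the consumption data are assumptions; in newer Stata versions the post-estimation command is estat dwatson):

```stata
tsset year, yearly     // declare year as the time variable
reg cons income
dwstat                 // Durbin-Watson d statistic
```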