
Stata Application Part I


DATA MANAGEMENT AND

ANALYSIS USING STATA

2017 By: Mesfin M.


OUTLINES
Introduction
Data management
Simple and Multiple Regression
Diagnostics
Introduction
 Stata is a modern, case-sensitive, full-featured and powerful command-driven
integrated statistical software package that provides the tools you need
for
 smart data management facilities,
 a wide array of up-to-date statistical techniques for data analysis, and
 an excellent system for producing publication-quality graphics
 It is general-purpose software (i.e. not specialized like EViews, PcGive)
 Cross section, panel, and time series data analysis
 Especially suited for the former two
 Stata is available in several versions: Stata/IC (the standard version), Stata/SE
(an extended version) and Stata/MP (for multiprocessing).
 The major difference between the versions is the number of variables allowed
in memory, which is limited to 2,047 in standard Stata/IC, but can be much
larger in Stata/SE or Stata/MP.
Stata windows and its Interfaces
Cont’d…
 The Stata windows give you all the key information about the data file you are using,
recent commands, and the results of those commands. Stata windows are:
 Results window - recent commands and all output except graphics are displayed here

 Command window - window where commands are entered for execution

 Variables window - all variables in the currently open dataset appear here

 Review window - previously used commands are listed here

 Other important interfaces and toolbars


 Stata Data Editor- to edit the data file (needs to be opened)

 Stata Do-file Editor - to write or edit a program (needs to be opened). It combines
many commands in a file that can be saved, modified, and rerun later

 Log file - records all commands and output of your Stata session

 Graph window - where graphs are displayed

 Stata Browser - to view the data file (needs to be opened)

 Stata Viewer - to get help on how to use Stata


The Stata menus and toolbar
 The three most important menus
 Data (for organising and managing the data),

 Graphics (for visual exploration & presentation),

 Statistics (for analysis).

Data, Graphics, Statistics


Cont’d…
 Stata Toolbar
 A few familiar icons are in the upper left for opening or saving files. Here
they’re for opening or saving datasets rather than programs.
Logical operators used in stata
Symbols Meaning
> Strictly greater than
< Strictly less than
>= Greater than or equal to
<= Less than or equal to
== Equal to
!= or ~= Not equal to
& And
| Or
^ To the power of

Note that the double equal sign ‘==’ is used to test for equality, while the single
equal sign ‘=’ is used in assigning values (e.g. gen male = 1)
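To make the distinction concrete, here is a minimal sketch (the variables sex and age are illustrative, not from the course dataset):

```stata
* single = assigns a value when creating a variable
generate male = 1

* double == tests for equality in an expression or if qualifier
list if sex == 1

* combining logical operators: males aged 25 or older
count if sex == 1 & age >= 25
```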
Ways of running Stata
 There are two ways to operate Stata:
 Interactive mode: Commands can be typed directly into the Command window and
executed by pressing Enter.
 Batch mode: Commands can be written in a separate file (called a do-file) and executed
together in one step.
 We will use interactive mode for our present discussion
 Stata commands
 Stata syntax is case sensitive. All Stata command names must be in
lower case.
 Stata commands can be grouped as:
 Preliminary commands/Commands to examine dataset
 Data Management and Analysis commands
 Graph construction commands
 Post-estimation commands
Preliminary Commands
 Adjusting memory size
If you use large datasets you may have to increase the memory the
computer reserves for Stata from the default of 1 megabyte
Type the command: set memory 30m (allocates 30 megabytes to Stata; mem is an accepted abbreviation).
Other suffixes: k - kilobyte, g - gigabyte
Note: from Stata 12 onwards, memory is managed automatically and set memory is not needed.
 Be careful: if you allocate too much memory to Stata, processing may
become very slow
 If you want all the results to be displayed even if you miss the upper part, just
type: set more off, permanently - it ensures Stata runs all commands without pausing.
Otherwise, if your output is long, the Results window fills up and
Stata displays --more-- at the bottom, pausing until you press a key
 clear - clears the memory before loading the requested data file. It deletes
all data, variables, and labels from memory to get ready for a new data file.
However, it does not delete any data saved to the hard drive.
 exit - closes Stata
 Exit- to close STATA
 Stata datasets always have the extension .dta.
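A typical start-of-session sequence combining the preliminary commands above might look like this (the filename mydata.dta is illustrative; set memory applies only to Stata 11 and earlier):

```stata
set more off, permanently   // show all output without pausing at --more--
clear                       // remove any data currently in memory
use mydata.dta              // load a Stata dataset (illustrative filename)
```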
Dataset Management commands

 Getting data into STATA/ Inputting Data


 Many Options:
 Manually enter data into the Stata Data Editor using the keyboard. See the testscr data
 To define/label variable names:
 Note: variables are automatically named var1, var2, …
 Double-click on top of column to view/edit “Variable Properties” and change the name.
 Via command: rename oldvarname newvarname
 E.g. rename var1 testscr
 Value labels can also be defined.
 Via command: edit - takes you back to the Data Editor
 Entering data in this way is very tedious, and you will make data input
errors frequently
 Copy data into the Data Editor from another spreadsheet (ex.: Excel) and then save the
data. To save data file:
-Via drop-menu: File → Save As or
- Preserve and then close or
-Via command: save pathname/datafilename.dta
 See annual data
 Open existing Stata Data file. E.g. auto
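Besides the Data Editor, small datasets can also be typed in directly with the input command; a minimal sketch (values and filenames are illustrative):

```stata
clear
* enter a small dataset from the keyboard
input testscr income
  620 22
  654 35
  681 41
end
rename testscr score        // rename a variable
save mydata.dta, replace    // save to the working directory
```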
Cont’d---

 Commands for Exploring Data


 describe: describe contents of dataset in memory
 list varlist: list values of variables/the entire dataset
 list in 1/10: lists observations 1 through 10
 list in -10/-1: lists the last 10 observations
 inspect varlist: provides mini-histograms of the data; gives a quick overview of the data file and
is useful for checking data accuracy
 count: counts the number of observations
 codebook varlist: produces a kind of electronic codebook and is a great tool for getting a quick
overview of the variables in the data file
 sort varlist: sorts observations in ascending order of varlist
 display: for any computation, e.g. display ln(10), display normal(1.96)
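The exploration commands above can be tried on the auto dataset shipped with Stata:

```stata
sysuse auto, clear      // load a dataset installed with Stata
describe                // variables, types, and labels
list make price in 1/5  // first five observations
codebook price          // detailed overview of one variable
count if foreign == 1   // number of foreign cars
sort price              // sort ascending by price
display ln(10)          // Stata as a calculator
```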
Cont’d---

 Commands to modify Data


 rename: rename a variable
 recode: recode the values of a variable
 generate/gen: creates a new variable. It is similar to “compute” in SPSS. E.g. gen
age2 = age*age
 egen: extended generate - has special functions that can be used when
creating a new variable. For example, egen avg = mean(cons) creates a
variable holding average consumption
 replace: replaces one value with another value
 drop: drops variables or observations, e.g. drop var1 drops the variable var1, while
drop in 1500/2000 drops observations 1500-2000 (a variable list and an in range
cannot be combined in one drop command)
 keep: the opposite of drop. It keeps only the variables specified,
which you want to use for your analysis.
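A short sketch of these data-modification commands, again using the auto dataset:

```stata
sysuse auto, clear
gen price2 = price^2                   // new variable: squared price
egen avgprice = mean(price)            // constant variable holding the mean price
replace price2 = . if price >= 15000   // set selected values to missing
rename price2 pricesq                  // rename the new variable
drop avgprice                          // drop a variable
keep make price pricesq foreign        // keep only the listed variables
```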
Cont’d---
 Commands for Data Analysis: Descriptive Analysis
 summarize/sum varlist, detail - provides summary statistics such as obs, mean, SD, range. “, detail” gives more
detail (skewness, percentiles, etc.)
 summarize in 1 - summary statistics for observation number 1
 summarize in 1/10 - summary statistics for observations 1 through 10
 summarize in -10/-1 - summary statistics for the 10th-from-last to the last observation
 ttest - is used to test a hypothesis about a mean or the difference between two means.
 Examples: ttest testscr = 0 tests the null hypothesis that the population mean of testscr is equal to 0 and computes a
95% confidence interval, while ttest testscr = 0, level(90) computes a 90% confidence interval
 correlate var1 var2 - displays a matrix of Pearson correlations for the variables listed
 tabstat - gives summary statistics for a set of continuous variables, optionally for each level of a categorical variable
 table - creates a table of statistics
 Three related commands produce frequency tables for discrete variables
 tabulate or tab - produces a frequency table for one or two variables
 tab1 - produces a one-way frequency table for each variable in the variable list
 tab2 - produces all possible two-variable tables from the list of variables
 OLS regression (estimation): command - reg/regress depvar expvars, or with the , robust option for
heteroskedasticity-robust standard errors
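A sketch of a descriptive-analysis session using the auto dataset (the hypothesized mean of 6000 is illustrative):

```stata
sysuse auto, clear
summarize price, detail                          // detailed summary statistics
ttest price == 6000                              // test H0: mean price = 6000
correlate price mpg weight                       // Pearson correlation matrix
tabstat price mpg, by(foreign) stat(mean sd n)   // statistics by group
tab foreign                                      // one-way frequency table
regress price mpg weight, robust                 // OLS with robust standard errors
```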
Cont’d---
 Presenting data with graph
 The commands that draw graphs are
 graph twoway - scatterplots, line plots
 graph matrix - scatterplot matrices
 graph bar - bar charts
 graph dot - dot charts
 graph box - box-and-whisker plots
 graph pie - pie charts
 Example
 graph twoway scatter c y
 We can show the regression line predicting consumption from income using the lfit plot type.
 twoway lfit c y
 Labeling graphs: scatter read write, title("title") subtitle("subtitle") xtitle("xtitle") ytitle("ytitle") note("note")
 Example
 scatter y c , title(Consumption pattern) xtitle(Income of the Household Head ) ytitle(Consumption of the
Household Head), or
 After the graph appears, you can edit it using the Graph Editor (either use File and then Start Graph Editor or push
the Graph Editor button)
Saving the Output

 Stata Results window does not keep all the output you generate.
 It only stores about 300-600 lines, and when it is full, it begins
to delete the old results as you add new results.
 Thus, we need to use a log file to save the output (if you are using a
do-file) or copy the output of interest to MS Word
 To save a graph: select File → Save Graph.
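A minimal sketch of recording a session in a log file (the log filename is illustrative):

```stata
log using session1.log, replace text   // start recording output to a text log
sysuse auto, clear
summarize price mpg
log close                              // stop recording
```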
Practice
 Suppose we want to estimate
yi = b0 + b1xi1 + b2xi2 + b3xi3 + ei

where yi = academic performance of school

x1 = average class size in kindergarten through third grade
x2 = the percentage of students receiving free meals
x3 = the percentage of teachers who have full teaching
credentials.
 Use the data under file name: achievement
 In the data, api00, acs_k3, meals, and full stand for yi, x1, x2, and x3, respectively.
Cont’d---
 describe
 list api00 acs_k3 meals full in 1/10
 Observe the four missing values for meals.
 codebook api00 acs_k3 meals

 summarize api00 acs_k3 meals full

 Observe that meals has 85 missing values, and acs_k3 has an unusual minimum value, which is
negative (-21).
 Focus on acs_k3.
 summarize acs_k3 , detail

 tab acs_k3
 Let's continue checking the data
 histogram acs_k3

 graph box acs_k3 - outliers are plotted as individual points; the box plot shows
dispersion and skewness
 tab full
 graph matrix api00 acs_k3 meals full, half
Simple and Multiple Linear Regression
 Simple Regression
 We want to estimate the model: yi = β0 + β1xi + εi
where yi = academic performance of school,
xi = number of students
 Use the data under file name: achievement2
 In the data set, api00 stands for yi and enroll stands for xi.
 Estimate and interpret the results.
 reg api00 enroll
 scatter api00 enroll
 twoway (scatter api00 enroll) (lfit api00 enroll)

 Multiple regression
 The model: y = x'b + e
, where y = academic performance of school (vector), and x‘= a vector of independent variables
 Use the data under file name: achievement
 Estimate using the commands below and interpret the results:
 reg api00 acs_k3 meals full
 reg api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
 hist enroll
Post-estimation commands
 Some post estimation commands:
 predict yhat - creates a variable of predicted values called yhat
 predict e, resid - creates a residual variable called e
 generate e2 = e*e
 mfx - provides marginal effects (after estimation)
 scatter y yhat x - plots variables named y and yhat against x
 scatter e x - plots your residuals against each of your x-variables
 rvfplot - displays the residuals against the fitted (predicted)
values. This can be useful in checking for outliers, non-normality, non-
linearity, etc.
 rvpplot explanvar - graphs a residual-versus-individual-predictor plot
 avplot varname - detects unusual and influential data. It
not only works for the variables in the model; it also works for variables
that are not in the model, which is why it is called an added-variable plot.
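The post-estimation commands above can be sketched as one session on the auto dataset:

```stata
sysuse auto, clear
regress price mpg weight
predict yhat                 // fitted values
predict e, resid             // residuals
rvfplot, yline(0)            // residuals vs fitted values, reference line at 0
avplot weight                // added-variable plot for one regressor
```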
Regression Diagnostics
 Test for normality of residuals
- H0: residuals are normally distributed - reject or fail to reject it
 We can use the skewness-kurtosis test. The command is: sktest e
 Shapiro-Wilk W test for normality (swilk e) - tests whether the
residuals are drawn from a normally distributed population.
 kdensity e, normal - the kdensity command (with the normal option) shows the
distribution of the residuals with a normal overlay. It creates a “kernel density
plot”, an estimate of the pdf that generated the data. The normal option overlays
a normal probability distribution with the same mean and variance. If the density looks close
to normal, the assumption is plausible.
 The pnorm command produces a normal probability plot and is another method of
testing whether the residuals from the regression are normally distributed (pnorm e).
 The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution
 The qnorm plot is more sensitive to deviations from normality in the tails of the distribution,
whereas the pnorm plot is more sensitive to deviations near the mean of the distribution
 One remedy for outliers is to use methods that are not sensitive to them, such as
those based on the median rather than the mean (e.g. qreg, median regression)
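The normality checks above can be run in sequence after a regression; a sketch on the auto dataset:

```stata
sysuse auto, clear
regress price mpg weight
predict e, resid
sktest e             // skewness-kurtosis test of normality
swilk e              // Shapiro-Wilk W test
kdensity e, normal   // kernel density with normal overlay
pnorm e              // normal probability plot (sensitive near the mean)
qnorm e              // quantile-normal plot (sensitive in the tails)
```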
Cont’d---
 Tests to detect specification errors
 linktest - creates two new variables: the prediction, _hat, and
the squared prediction, _hatsq. The model is then refit using these
two variables as predictors. _hat should be significant, since it is the predicted
value. _hatsq should not be, because if the model is specified
correctly, the squared predictions should have little explanatory power.
 Ramsey's (1969) regression equation specification error test: ovtest or
ovtest, rhs - a test of functional form
 The ovtest command with the rhs option tests whether higher-order
effects (e.g. squared, cubed) of the regressors are present but omitted from the
model.
 H0 is that there are no omitted variables (no significant higher-order
terms). If the test is significant, it suggests there are higher-order terms in
the data that we have overlooked.
 NB: ovtest tests the significance of powers of the fitted values, while ovtest,
rhs tests the significance of powers of the explanatory variables
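A sketch of both specification tests on the auto dataset (the model is refit before ovtest because linktest replaces the stored estimates):

```stata
sysuse auto, clear
regress price mpg weight
linktest                  // _hat should be significant, _hatsq should not
regress price mpg weight  // refit: linktest overwrote the estimates
ovtest                    // Ramsey RESET: powers of the fitted values
ovtest, rhs               // powers of the explanatory variables
```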
Cont’d---
 Checking Linearity
 If the assumption of linearity is violated, the linear regression will try to fit a line to data
that does not follow a straight line. All we have to do is a scatter plot between the response
variable and the predictor to see if nonlinearity is present, such as a curved band or a big
wave-shaped curve.
 We use the scatter y x command to show a scatterplot of y against x, lfit to
show a linear fit, and lowess to show a lowess smoother predicting y from x: reg y
x, then twoway (scatter y x) (lfit y x) (lowess y x)
 reg api00 enroll
 twoway (scatter api00 enroll) (lfit api00 enroll) (lowess api00 enroll)
 Checking the linearity assumption is not so straightforward in the case of multiple
regression. The most straightforward thing to do is to plot the standardized residuals
against each of the predictor variables in the regression model.
 predict r, resid, then use the command acprplot (graphs an augmented component-
plus-residual plot, a.k.a. augmented partial residual plot).
 acprplot x, lowess lsopts(bwidth(1)), then see whether the smoothed line is very
close to the ordinary regression line or not.
 cprplot - graphs component-plus-residual plot, a.k.a. residual plot.
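The linearity checks above, sketched on the auto dataset:

```stata
sysuse auto, clear
regress price mpg weight
* scatterplot with linear fit and lowess smoother for one predictor
twoway (scatter price mpg) (lfit price mpg) (lowess price mpg)
* component-plus-residual plots after the regression
acprplot mpg, lowess lsopts(bwidth(1))  // augmented component-plus-residual plot
cprplot mpg                             // component-plus-residual plot
```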
Cont’d---
 Heteroskedasticity tests: Use consumption data
 A graphical test of heteroskedasticity: rvfplot/rvpplot, with the yline(0) option to put
a reference line at y=0. But it does not tell us which residuals are outliers.
 Goldfeld-Quandt test
 Breusch-Pagan/Cook-Weisberg test via the command hettest; judge using the p-value.
 estat imtest - heteroskedasticity test (White's test) - Cameron and Trivedi (1990),
also includes tests for higher-order moments of the residuals (skewness and
kurtosis): imtest or imtest, white
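The heteroskedasticity tests above can be sketched as one sequence on the auto dataset:

```stata
sysuse auto, clear
regress price mpg weight
hettest              // Breusch-Pagan / Cook-Weisberg test
estat imtest, white  // White's test plus skewness/kurtosis components
rvfplot, yline(0)    // visual check of residuals vs fitted values
```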
 Multicollinearity test:
 vif - for continuous variables
 reg api00 acs_k3 avg_ed grad_sch col_grad some_col
 vif
 reg api00 acs_k3 grad_sch col_grad some_col
 vif
 CC in SPSS - for discrete variables
Cont’d---
 Testing the residuals for Autocorrelation: Use consumption data
 One can use the command, dwstat, after the regression to obtain the Durbin-
Watson d statistic to test for first-order autocorrelation. There are two steps to
follow:
 1. Using the tsset command, declare the data to be time series and specify the time
variable (e.g. tsset year or tsset month, monthly), then regress y x
 2. Type dwstat, then Enter

 For example, the following result is obtained after these two steps for the
consumption data. Since the DW d-statistic is less than 2 and also less than the lower limit of
the DW table value (1.20 at k=1, n=20 and α=5%), there is some degree of positive
autocorrelation in the data.
. dwstat

Durbin-Watson d-statistic( 2, 20) = 1.12391
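The two steps can be sketched as a do-file fragment (cons, inc, and year are assumed variable names for the consumption data):

```stata
tsset year           // declare the data as yearly time series
regress cons inc
estat dwatson        // Durbin-Watson d statistic (dwstat also works)
```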


Cont’d---
 The Breusch-Godfrey- general test
 The BG test/the LM test
 Since the Durbin-Watson test assumes strictly exogenous regressors (e.g. no
lagged dependent variable) and only detects first-order autocorrelation, you
really should use the Breusch-Godfrey test when these assumptions may be
violated
 Alternatively, generate a case-number variable to serve as the time index: gen casenum = _n
 Run the Ljung-Box Q statistic, which tests previous lags for
autocorrelation and partial autocorrelation. The significance of the AC
(autocorrelation) and PAC (partial autocorrelation) is shown in the
Prob column. The command is: corrgram e
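A sketch of the Breusch-Godfrey and Ljung-Box checks (cons, inc, and year are assumed variable names; the data must be tsset first):

```stata
tsset year                 // declare the time variable
regress cons inc
estat bgodfrey, lags(1 2)  // Breusch-Godfrey LM test at lags 1 and 2
predict e, resid
corrgram e                 // Ljung-Box Q with AC and PAC at each lag
```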
End of the Session
