Simons, 28-Jun-19
Useful Stata Commands (for Stata versions 13, 14, & 15)
Kenneth L. Simons
– This document is updated continually. For the latest version, open it from the course disk space. –
This document briefly summarizes Stata commands useful in ECON-4570 Econometrics and ECON-
6570 Advanced Econometrics.
This presumes a basic working knowledge of how to open Stata, use the menus, use the data editor, and
use the do-file editor. We will cover these topics in early Stata sessions in class. If you miss the
sessions, you might ask a fellow student to walk you through basic usage of Stata, and get the
recommended text about Stata for the course and use it to practice with Stata.
More complete information is available in Lawrence C. Hamilton’s Statistics with Stata, Christopher F.
Baum’s An Introduction to Modern Econometrics Using Stata, and A. Colin Cameron and Pravin K.
Trivedi’s Microeconometrics using Stata. See: http://www.stata.com/bookstore/books-on-stata/ .
Readers on the Internet: I apologize but I cannot generally answer Stata questions. Useful places to
direct Stata questions are: (1) built-in help and manuals (see Stata’s Help menu), (2) your friends and
colleagues, (3) Stata’s technical support staff (you will need your serial number), (4) Statalist
(http://www.stata.com/statalist/) (but check the Statalist archives before asking a question there).
Most commands work the same in Stata versions 12, 11, 10, and 9.
Throughout, estimation commands specify robust standard errors (Eicker-Huber-White heteroskedastic-consistent standard
errors). This does not imply that robust rather than conventional estimates of Var[b|X] should always be used, nor that they
are sufficient. Other estimators shown here include Davidson and MacKinnon’s improved small-sample robust estimators for
OLS, cluster-robust estimators useful when errors may be arbitrarily correlated within groups (one application is across time
for an individual), and the Newey-West estimator to allow for time series correlation of errors. Selected GLS estimators are
listed as well. Hopefully the constant presence of “vce(robust)” in estimation commands will make readers sensitive to the
need to account for heteroskedasticity and other properties of errors typical in real data and models.
Contents
J0. Copying and Pasting from Stata to a Word Processor or Spreadsheet Program
J1. Tables of Regression Results Using Stata’s Built-In Commands
J2. Tables of Regression Results Using Add-On Commands
J2a. Installing or Accessing the Add-On Commands
J2b. Storing Results and Making Tables
J2c. Near-Publication-Quality Tables
J2d. Understanding the Table Command’s Options
J2e. Saving Tables as Files
J2f. Wide Tables
J2g. Storing Additional Results
J2h. Clearing Stored Results
J2i. More Options and Related Commands
J3. Tables of Summary Statistics and Tabulations
K. Data Types, When 3.3 ≠ 3.3, and Missing Values
L. Results Returned after Commands
M. Do-Files and Programs
N. Monte-Carlo Simulations
O. Doing Things Once for Each Group
P. Generating Variables for Time-Series and Panel Data
P1. Creating a Time Variable
P1a. Time Variable that Starts from a First Time and Increases by 1 at Each Observation
P1b. Time Variable from a Date String
P1c. Time Variable from Multiple (e.g., Year and Month) Variables
P1d. Time Variable Representation in Stata
P2. Telling Stata You Have Time Series or Panel Data
P3. Lags, Forward Leads, and Differences
P4. Generating Means and Other Statistics by Individual, Year, or Group
Q. Panel Data Statistical Methods
Q1. Fixed Effects – Using Dummy Variables
Q2. Fixed Effects – De-Meaning
Q3. Other Panel Data Estimators
Q4. Time-Series Plots for Multiple Individuals
R. Probit and Logit Models
R1. Interpreting Coefficients in Probit and Logit Models
S. Other Models for Limited Dependent Variables
S1. Censored and Truncated Regressions with Normally Distributed Errors
S2. Count Data Models
S3. Survival Models (a.k.a. Hazard Models, Duration Models, Failure Time Models)
T. Instrumental Variables Regression
T1. GMM Instrumental Variables Regression
T2. Other Instrumental Variables Models
U. Time Series Models
U1. Autocorrelations
U2. Autoregressions (AR) and Autoregressive Distributed Lag (ADL) Models
U3. Information Criteria for Lag Length Selection
U4. Augmented Dickey Fuller Tests for Unit Roots
U5. Forecasting
U6. Break Tests
U6a. Breaks at Known Times
U6b. Breaks at Unknown Times
U7. Newey-West Heteroskedastic-and-Autocorrelation-Consistent Standard Errors
U8. Dynamic Multipliers and Cumulative Dynamic Multipliers
V. System Estimation Commands
V1. GMM System Estimators
V2. Three-Stage Least Squares
V3. Seemingly Unrelated Regression
V4. Multivariate Regression
W. Flexible Nonlinear Estimation Methods
W1. Nonlinear Least Squares
W2. Generalized Method of Moments Estimation for Custom Models
W3. Maximum Likelihood Estimation for Custom Models
X. Data Manipulation Tricks
X1. Combining Datasets: Adding Rows
X2. Combining Datasets: Adding Columns
X3. Reshaping Data
X4. Converting Between Strings and Numbers
X5. Labels
X6. Notes
X7. More Useful Commands
A. Loading Data
edit Opens the data editor, to type in or paste data. You must close the
data editor before you can run any further commands.
use "filename.dta" Reads in a Stata-format data file.
import delimited "filename.txt" Reads in text data (allowing for various text encodings), in Stata 14
or newer.
insheet using "filename.txt" Old way to read text data, faster for plain English-language text.
import excel "filename.xlsx", firstrow Reads data from an Excel file’s first worksheet, treating the first
row as variable names.
import excel "filename.xlsx", sheet("price data") firstrow Reads data from the worksheet named “price
data” in an Excel file, treating the first row as variable names.
save "filename.dta" Saves the data.
Before you load or save files, you may need to change to the right directory. Under the File menu,
choose “Change Working Directory…”, or use Stata’s “cd” command.
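For example (the folder path here is made up – substitute your own):
cd "C:/Users/me/EconData" Change the working directory; use and save commands then
refer to files in this folder.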
browse varlist Opens the data viewer, to look at data without changing them.
list varlist Lists data. If there’s more than 1 screenful, press space for the next
screen, or q to quit listing.
Should you need to distinguish reasons why data are missing, you could use Stata’s “extended
missing value” codes. These codes are written .a, .b, .c, …, .z. They all count as infinity when
compared with normal numbers, but relative to each other they are ranked as . < .a < .b < .c <
… < .z. For this reason, to check whether a number in variable varname is missing, use
“varname>=.” rather than “varname==.”
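For example, a sketch using hypothetical variables income and reason:
replace income = .a if reason == 1 Code income as missing for reason 1 (e.g., refused).
replace income = .b if reason == 2 Code income as missing for reason 2 (e.g., not asked).
count if income >= . Count all missing values: ., .a, .b, ….
count if income == .a Count only the .a missing values.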
F9. More
For functions available in equations in Stata, use Stata’s Help menu, choose Stata Command…,
and enter “functions”. To generate variables separately for different groups of observations, see
the commands in sections O and P4. For time-series and panel data, see section P, especially the
notations for lags, leads, and differences in section P3. If you need to refer to a specific
observation number, use a reference like x[3], meaning the value of the variable x in the 3rd
observation. In Stata “_n” means the current observation (when using generate or replace), so that
for example x[_n-1] means the value of x in the preceding observation, and “_N” means the
number of observations, so that x[_N] means the value of x in the last observation.
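For example, with data sorted in time order, a growth-rate sketch (gdp and year are hypothetical
variables):
sort year
gen growth = (gdp - gdp[_n-1]) / gdp[_n-1] Growth rate relative to the preceding observation;
the first observation gets a missing value, since
gdp[0] does not exist.
display gdp[_N] Show the value of gdp in the last observation.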
Other commands also report confidence intervals, and may be preferable because they do more,
such as computing a confidence interval for the difference in means between “by” groups (e.g.,
between men and women). See section G2. (Also, Stata’s “mean” command reports confidence
intervals.)
instead of “#” to add full interactions, for example c.age##i.male means age, male, and age×male.
Similarly, c.age##i.usstate means age, 49 state dummies, and 49 state dummies multiplied by age.
You can use “#” to create polynomials. For example, “age age#age age#age#age” is a third-
order polynomial, with variables age, age², and age³. Having done this, you can use Stata’s
“margins” command to compute marginal effects: the average value of the derivatives d(y)/d(age)
across all observations in the sample. This works even if your regression equation includes
interactions of age with other variables.
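For example, a sketch using a hypothetical wage regression:
reg wage c.age c.age#c.age c.age#c.age#c.age, vce(robust) Third-order polynomial in age.
margins, dydx(age) Average of d(wage)/d(age) across the sample, accounting for
all three polynomial terms.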
Here are some examples using automated category dummies and interactions, termed “factor
variables” in the Stata manuals (see the User’s Guide U11.4 for more information):
reg yvar x1 i.x2, vce(robust) Includes 0-1 dummy variables for the groups indicated by unique
values of variable x2.
reg wage c.age i.male c.age#i.male, vce(robust) Regress wage on age, male, and age×male.
reg wage c.age##i.male, vce(robust) Regress wage on age, male, and age×male.
reg wage c.age##i.male c.age#c.age, vce(robust) Regress wage on age, male, age×male, and age².
reg wage c.age##i.male c.age#c.age c.age#c.age#i.male, vce(robust) Regress wage on age, male,
age×male, age², and age²×male.
reg wage c.age##i.usstate c.age#c.age c.age#c.age#i.usstate, vce(robust) Regress wage on age,
49 state dummies, 49 variables that are age×statedummyk, age², and
49 variables that are age²×statedummyk (k=1,…,49).
Speed Tip: Don’t “generate” lots of dummy variables and interactions – instead use this “factor
notation” to compute your dummy variables and interactions “on the fly” during statistical
estimation. This usually is much faster and saves lots of memory, if you have a really big dataset.
*
R. Davidson and J. MacKinnon, Estimation and Inference in Econometrics, Oxford: Oxford University
Press, 1993, section 16.3.
I. Post-Estimation Commands
Commands described here work after OLS regression. They sometimes work after other estimation
commands, depending on the command.
However, you may need to carry out F-tests, as well as compute confidence intervals and t-tests for
“linear combinations” of coefficients in the model. Here are example commands. Note that when
a variable name is used in this subsection, it really refers to the coefficient (the bk) in front of that
variable in the model equation.
lincom logpl+logpk+logpf Compute the estimated sum of three model coefficients, which are the
coefficients in front of the variables named logpl, logpk, and logpf.
Along with this estimated sum, carry out a t-test with the null
hypothesis being that the linear combination equals zero, and
compute a confidence interval.
lincom 2*logpl+1*logpk-1*logpf Like the above, but now the formula is a different linear
combination of regression coefficients.
lincom 2*logpl+1*logpk-1*logpf, level(#) As above, but this time change the confidence interval
to #% (e.g. use 99 for a 99% confidence interval).
test logpl+logpk+logpf==1 Test the null hypothesis that the coefficients of variables
logpl, logpk, and logpf sum to 1. This only makes sense after a
regression involving variables with these names. After OLS
regression, this is an F-test. More generally, it is a Wald test.
test (logq2==logq1) (logq3==logq1) (logq4==logq1) (logq5==logq1) Test the null hypothesis
that four equations are all true simultaneously: the coefficient of
logq2 equals the coefficient of logq1, the coefficient of logq3 equals
the coefficient of logq1, the coefficient of logq4 equals the
coefficient of logq1, and the coefficient of logq5 equals the
coefficient of logq1; i.e., they are all equal to each other. After OLS
regression, this is an F-test. More generally, it is a Wald test.
test x3 x4 x5 Test the null hypothesis that the coefficient of x3 equals 0 and the
coefficient of x4 equals 0 and the coefficient of x5 equals 0. After
OLS regression, this is an F-test. More generally, it is a Wald test.
other heteroskedasticity tests that may be more appropriate. Stata’s imtest command also carries
out other tests, and the commands hettest and szroeter carry out different tests for
heteroskedasticity.
The Breusch-Pagan Lagrange multiplier test, which assumes normally distributed errors, can be
carried out after running a regression, by using the command:
estat hettest, normal Heteroskedasticity test – Breusch-Pagan Lagrange multiplier.
Other tests that do not require normally distributed errors include:
estat hettest, iid Heteroskedasticity test – Koenker’s (1981) score test, assumes iid
errors.
estat hettest, fstat Heteroskedasticity test – Wooldridge’s (2006) F-test, assumes iid errors.
estat szroeter, rhs mtest(bonf) Heteroskedasticity test – Szroeter (1978) rank test for null
hypothesis that variance of error term is unrelated to each variable.
estat imtest Heteroskedasticity test – Cameron and Trivedi (1990), also includes
tests for higher-order moments of residuals (skewness and kurtosis).
For further information see the Stata manuals.
See also the ivhettest command described in section T1 of this document. This makes available
the Pagan-Hall test which has advantages over the results from “estat imtest”.
margins, dydx(age) After a regression where the x-variables involve age, compute
d(y)/d(age) on average among individuals in the sample.
margins , at(age=(20 25 30)) After a regression where the x-variables involve age, compute the
predicted value of the dependent variable, y, for the average
individual in the sample, given three alternative counterfactual
assumptions for age. That is, first replace each person’s age with 20,
and compute the fitted value of y for each individual in the sample,
and report the average fitted value. Then replace age with 25 and
report the average fitted value, and do the same for age 30. This tells
you what is predicted to happen for the average person in the sample
if they were of a particular age. Hence it lets you compare, for the
average of the individuals actually in your sample, the estimated
effects of age.
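After a margins command like this one, Stata’s marginsplot command (built in since Stata 12)
graphs the computed margins with confidence intervals. A sketch, using a hypothetical wage
regression:
reg wage c.age##i.male, vce(robust)
margins , at(age=(20 25 30))
marginsplot Graph the three average predicted values against age.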
J0. Copying and Pasting from Stata to a Word Processor or Spreadsheet Program
To put results into Excel or Word, the following method is fiddly but sometimes helps. Select the
table you want to copy, or part of it, but do not select anything additional. Then choose Copy
Table from the Edit menu. Stata will copy information with tabs in the right places, to paste easily
into a spreadsheet or word processing program. For this to work, the part of the table you select
must be in a consistent format, i.e., it must have the same columns everywhere, and you must not
select any extra blank lines. (Stata figures out where the tabs go based on the white space between
columns.)
After pasting such tab-delimited text into Word, use Word’s “Convert Text to Table…”
command to turn it into a table. In Word 2007, from the Insert tab, in the Tables group, click
Table and select Convert Text to Table... (see: http://www.uwec.edu/help/Word07/tb-txttotable.htm );
choose Delimited data with Tab characters as delimiters. Or if in Stata you used
Copy instead of Copy Table, you can Convert Text to Table... and choose Fixed Width data and
indicate where the columns break – but this “fixed width” approach is dangerous because you can
easily make mistakes, especially if some numbers span multiple columns. In either case, you can
then adjust the font, borderlines, etc. appropriately.
In section J2, you will see how to save tables as files that you can open in Word, Excel, and other
programs. These files are often easier to use than copying and pasting, and will help avoid
mistakes.
a folder named “stata extensions”. You merely need to tell Stata where to look (you could copy
the relevant files anywhere, and just tell Stata where). Type the command listed below in Stata.
You only need to run this command once after you start or restart Stata. Put the command at the
beginning of your do-files (you also may need to include the command “eststo clear” to avoid any
confusion with previous results – see section J2h).
adopath + folderToLookIn
Here, replace folderToLookIn with the name of the folder, by using one of the following two
commands (the first for ECON-4570 or -6560, the second for ECON-6570):
adopath + "//hass11.win.rpi.edu/classes/ECON-4570-6560/stata extensions"
adopath + "//hass11.win.rpi.edu/classes/ECON-6570/stata extensions"
(Note the use of forward slashes above instead of the Windows standard of backslashes for file
paths. If you use backslashes, you will probably need to use four backslashes instead of two at the
front of the file path. Why? In certain settings, including in do-files, Stata converts two
backslashes in a row into just one – for Stata \$ means $, \` means `, and \\ means \. This provides
a way to tell Stata that a dollar sign is not the start of a global macro but is just a dollar sign, that a
backquote is not the start of a local macro but is just a backquote, or that a backslash is desired
even though backslashes are used this way to designate special characters. (A local macro is
Stata’s name for a programmer’s local variable in a program or do-file, and a global macro is
Stata’s name for a programmer’s global variable in a program or do-file.))
esttab est1 est2 using mytable, rtf b(a3) se(a3) star(+ 0.10 * 0.05 ** 0.01 *** 0.001) r2(3) ar2(3) scalars(F) nogaps
Save a near-publication-quality table, putting it in a rich text file
(“mytable.rtf”) that can be opened by Word.
estpost summarize w x Summarize data on variables named w and x. This works just like the
summarize command, except that it “posts” the results as if they
were the results of statistical estimation.
esttab ., cells("mean sd count") noobs Make a table showing the summary statistics, with a row
for each variable and a column for each of mean, standard deviation,
and N (the count of nonmissing observations). This must be done
immediately after using the previous command, before storing other
results with eststo.
For more flexibility, the “estpost tabstat” command lets you choose additional summary
statistics:
estpost tabstat w x, listwise statistics(mean variance skewness kurtosis)
esttab ., cells("w x")
Get help on the tabstat command and the estpost command to see more choices of statistics to
view and ways to format the table.
For tabulations, get help on “estpost” and use the “estpost tabulate” command:
estpost tabulate x A one-way tabulation
esttab ., cells("b pct(fmt(2)) cumpct(fmt(2))") noobs
estpost tabulate w x A two-way tabulation, of w within groups by x.
esttab ., cell(colpct(fmt(2))) unstack noobs Rows are for values of w, columns for values of x.
Get help on “estpost” and look at the “estpost tabulate” command to see some formatting
options.
For another way to control the formatting of tabulations and other tables, try the “tabout” add-on
command, which has an introduction at: http://www.ianwatson.com.au/stata/tabout_tutorial.pdf .
about 7 digits, and compares it to the number 3.3, which is immediately put into double-precision for the
calculation and hence is accurate to 16 digits, and hence is different from the rating. Hence the first
observation will not be listed. Instead you could do this:
list if rating == float(3.3) float(3.3) converts 3.3 to a number accurate to only about 7 digits,
the same as the rating variable.
Missing values of numbers in Stata are written as a period. They occur if you enter missing values to
begin with, or if they arise in a calculation that has for example 0/0 or a missing number plus another
number. For comparison purposes, missing values are treated like infinity, and when you’re not used to
this you can get some weird results. For example, “replace z = 0 if y>3” causes z to be replaced with 0
not only if y has a known value greater than 3 but also if the value of y is missing. Instead use
something like this: “replace z = 0 if y>3 & y<.”. The same caution applies when generating variables,
anytime you use an if-statement, etc. (see sections F2, F3, and F5).
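An equivalent guard uses Stata’s missing() function, which some find more readable:
replace z = 0 if y>3 & !missing(y) Same effect as “& y<.”: replace only when y is nonmissing.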
use "L:\myfolder\myfile.dta"
* I commented out the following three lines since I'm not using them now:
/* regress income age, vce(robust)
predict incomeHat
scatter incomeHat income age */
* Now do my polynomial age analyses:
gen age2 = age^2
gen age3 = age^3
eststo p3: regress income age age2 age3 bachelor, vce(robust)
eststo p2: regress income age age2 bachelor, vce(robust)
esttab p3 p2, b(a3) se(a3) star(+ 0.10 * 0.05 ** 0.01 *** 0.001) r2(3) ar2(3) scalars(F) nogaps
You can write programs in the do-file editor, and sometimes these are useful for repetitive tasks.
Here is a program to create some random data and compute the mean.
capture program drop randomMean Drops the program if it exists already.
program define randomMean, rclass Begins the program, which is “rclass”.
drop _all Drops all variables.
quietly set obs 30 Use 30 observations, and don’t say so.
gen r = uniform() Generate random numbers.
summarize r Compute mean.
return scalar average = r(mean) Return it in r(average).
end
Note above that “rclass” means the program can return a result. After doing this code in the do-file, you
can use the program in Stata. Be careful, as it will drop all of your data! It will then generate 30
uniformly-distributed random numbers, summarize them, and return the average. (By the way, you can
make the program work faster by using the “meanonly” option after the summarize command above,
although then the program will not display any output.)
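Having run the do-file to define the program, you might call it and inspect the returned result
like this:
randomMean Run the program once (this drops your data!).
return list List the results returned in r(), including r(average).
display r(average) Show just the returned average.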
N. Monte-Carlo Simulations
It would be nice to know how well our statistical methods work in practice. Often the only way to know
is to simulate what happens when we get some random data and apply our statistical methods. We do
this many times and see how close our estimator is to being unbiased, normally distributed, etc. (Our
OLS estimators will do better with larger sample sizes, when the x-variables are independent and have
larger variance, and when the random error terms are closer to normally distributed and have smaller
variance.) Here is a Stata command to call the above (at the end of section M) program 100,000 times
and record the result from each time.
simulate avg=r(average), reps(100000): randomMean
The result will be a dataset containing one variable, named avg, with 100,000 observations. Then you
can check the mean and distribution of the randomly generated sample averages, to see whether they
seem to be nearly unbiased and nearly normally distributed.
summarize avg
kdensity avg , normal
“Unbiased” means right on average. Since the sample mean, of say 30 independent draws of a random
variable, has been proven to give an unbiased estimate of the variable’s true population mean, you had
better find that the average (across all 100,000 experiments) result computed here is very close to the
true population mean. And the central limit theorem tells you that as a sample size gets larger, in this
case reaching the not-so-enormous size of 30 observations, the means you compute should have a
probability distribution that is getting close to normally distributed. By plotting the results from the
100,000 experiments, you can see how close to normally-distributed the sample mean is. Of course, we
would get slightly different results if we did another set of 100,000 random trials, and it is best to use as
many trials as possible – to get exactly the right answer we would need to do an infinite number of such
experiments.
Try similar simulations to check results of OLS regressions. You will need to change the program in
section M and alter the “simulate” command above. One approach is to change the program in section
M to return results named “b0”, “b1”, “b2”, etc., by setting them equal to the coefficient estimates
_b[varname], and then alter the “simulate” command above to use the regression coefficient estimates
instead of the mean (you might say “b0=r(b0) b1=r(b1) b2=r(b2)” in place of “avg=r(average)”). An
easier approach, though, is to get rid of the “, rclass” in the program at the end of section M, and just do
the regression in the program – the regression command itself will return results that you can use; your
simulate command might then be something like “simulate b0=_b[_cons] b1=_b[x1]
b2=_b[x2], reps(1000): randomReg”.
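As a sketch of that easier approach (the program name randomReg and the true coefficient values
1, 2, and 3 are made up for illustration):
capture program drop randomReg
program define randomReg
drop _all Drops all variables – beware!
quietly set obs 30
gen x1 = rnormal() Generate standard normal x-variables,
gen x2 = rnormal() then a y with known coefficients.
gen y = 1 + 2*x1 + 3*x2 + rnormal()
regress y x1 x2
end
simulate b0=_b[_cons] b1=_b[x1] b2=_b[x2], reps(1000): randomReg
summarize b0 b1 b2 The means should be close to 1, 2, and 3.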
P1a. Time Variable that Starts from a First Time and Increases by 1 at Each Observation
If you have not yet created a time variable, and your data are in order and do not have gaps, you
might create a year, quarter, or day variable as follows:
generate year = 1900 + _n - 1 Create a new variable that specifies the year, beginning with 1900
in the first observation and increasing by 1 thereafter. Be sure your
data are sorted in the right order first.
generate quarter = tq(1970q1) + _n - 1 Create a new variable that specifies the time, beginning
with 1970 quarter 1 in the first observation, and increasing by 1
quarter in each observation. Be sure your data are sorted in the right
order first. The result is an integer number increasing by 1 for each
quarter (1960 quarter 2 is specified as 1, 1960 quarter 3 is specified
as 2, etc.).
format quarter %tq Tell Stata to display values of quarter as quarters.
generate day = td(01jan1960) + _n - 1 Create a new variable that specifies the time, beginning
with 1 Jan. 1960 in the first observation, and increasing by 1 day in
each observation. Be sure your data are sorted in the right order
first. The result is an integer number increasing by 1 for each day
(01jan1960 is specified as 0, 02jan1960 is specified as 1, etc.).
format day %td Tell Stata to display values of day as dates.
Like the td(…) and tq(…) functions used above, you may also use tw(…) for week, tm(…) for
month, or th(…) for half-year. For more information, get help on “functions” and look under
“time-series functions”.
P1b. Time Variable from a Date String Variable
If you have a string variable containing dates, you can convert it to a date number (the
variable name datestr below is a placeholder):
generate t = daily(datestr, "MDY") Create a date number from a string variable containing
dates like “4/21/2003”, “April 21, 2003”, “jan1-2003”, etc. Note the "MDY", which tells Stata the
ordering of the month, day, and year in the variable. If the order
were year, month, day, you would use "YMD".
format t %td This tells Stata the variable is a date number that specifies a day.
Like the daily(…) function used above, the similar functions monthly(strvar, "ym") or
monthly(strvar, "my"), and quarterly(strvar, "yq") or quarterly(strvar, "qy"), allow monthly or
quarterly date formats. Use %tm or %tq, respectively, with the format command. These date
functions require a way to separate the parts. Dates like “20050421” are not allowed. If d1 is a
string variable with such dates, you could create dates with separators in a new variable d2 suitable
for daily(…), like this:
gen str10 d2 = substr(d1, 1, 4) +"-" + substr(d1, 5, 2) +"-" + substr(d1, 7, 2) This uses the
substr(…) function, which returns a substring – the part of a string
starting at the character position given by the first number, with
length given by the second number.
P1c. Time Variable from Multiple (e.g., Year and Month) Variables
What if you have a year variable and a month variable and need to create a single time variable?
Or what if you have some other set of time-period numbers and need to create a single time
variable? Stata has functions to build the time variable from its components:
gen t = ym(year, month) Create a single time variable t from separate year (the full 4-digit year)
and month (1 through 12) variables.
format t %tm This tells Stata to display the variable’s values in a human-readable
format like “2012m5” (meaning May 2012).
Other functions are available for other periods: yw(year, week) with display format %tw,
yq(year, quarter) with %tq, yh(year, halfyear) with %th, mdy(month, day, year) with %td,
and mdyhms(month, day, year, hour, minute, second)* with %tc.
*For data in milliseconds, data must be stored in double-precision number format (see section K
above), using “gen double t = mdyhms(month, day, year, hour, minute, second)”. For any of the
other periodicities above, you can use long or double data types to store a broader range of
numbers than is possible using the default float data type. For data in milliseconds, a version
accounting for leap seconds uses “Cmdyhms(month, day, year, hour, minute, second)” and “%tC”.
If your data do not match one of these standard periodicities, you can create your own time
variable as in section P1a, but without using the “format” command to specify a human-readable
format (the time numbers will just display as the numbers they are).
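For instance, with separate year and quarter variables (the names here are assumed), you might build and declare a quarterly time variable like this:

```stata
generate t = yq(year, quarter)   // combine year (e.g. 2012) and quarter (1-4)
format t %tq                     // display t in a readable form like 2012q3
tsset t                          // declare the data as quarterly time series
```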
All of Stata’s time formats work by defining the number 0 (zero) to correspond to some particular time. For example, in daily format, 0 means the
first day of January in 1960, 1 means the second day of January 1960, -1 means the last day of
December 1959, and so on. In other formats except yearly (%ty), 0 means the first time period
(quarter, day, second, etc.) at the beginning of the year 1960.
To figure out the number that corresponds to a given date, you can use functions like tq(1989q4)
for a quarter, tm(2016apr) for a month, or td(16jan1770) for a day.
The above methods generate values for every observation within each by-group (i.e., they create
a variable with sensible values in every observation). If you just want to create a dataset of
summary statistics, with one observation per by-group, try Stata’s collapse command.
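As a hypothetical illustration (the variable names income and state are placeholders), collapse reduces the data in memory to one observation per group:

```stata
collapse (mean) meanInc=income (sd) sdInc=income, by(state)
```

After this command the dataset contains one observation per state, with variables meanInc and sdInc; the original observations are gone, so save your data first.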
the absorb() option. For example, the byvar might be the state, to
include fixed effects for states. Coefficient estimates will not be
reported for these fixed effect dummy variables.
This would make plots of each company’s employment in each year, with a separate plot for
each company, arranged in a grid. However, you might prefer to overlay these plots in a single
graph. You could do this as follows:
tsset
xtline employment , overlay
The “xtline” command with the “overlay” option puts all companies’ plots in a single graph,
instead of having a separate plot for each company.
See also section E4, which talks briefly about graphing.
document to see how to use notations like “i.” to create dummy variables, “c.” to specify
continuous variables, and “#” and “##” to specify interaction effects. Then you can check the
marginal effects of variables that are interacted or are involved in polynomials. For example:
which no data are observed when the count is zero, or right-censored survival times; you can find
many such models in Stata.
S3. Survival Models (a.k.a. Hazard Models, Duration Models, Failure Time Models)
To fit survival models, or make plots or tables of survival or of the hazard of failure, you must first
tell Stata about your data. There are a lot of options and variants to this, so look for a book on the
subject if you really need to do this. A simple case is:
stset survivalTime, failure(dummyEqualToOneIfFailedElseZero) Tell Stata that you have
survival data, with each individual having one observation. The
variable survivalTime tells the elapsed time at which each individual
either failed or ceased to be studied. It is the norm in survival data
that some individuals are still surviving at the end of the study, and
hence that the survival times are censored from above, i.e., “right-
censored.” The variable dummyEqualToOneIfFailedElseZero
provides the relevant information on whether each individual failed
during the study (1) or was right-censored (0).
sts graph , survival yscale(log) Plot a graph showing the fraction of individuals surviving as a
function of elapsed time. The optional use of “yscale(log)” causes
the vertical axis to be logarithmic, in which case a line of constant
(negative) slope on the graph corresponds to a hazard rate that
remains constant over time. Another option is by(groupvar), in
which case separate survival curves are drawn for the different
groups each of which has a different value of groupvar. A hazard
curve can be fitted by specifying “hazard” instead of “survival”.
streg xvarlist, distribution(exponential) nohr vce(robust) After using stset, estimate an
exponential hazard model in which the hazard (Poisson arrival rate
of the first failure) is proportional to exp(xi'b) where xi includes the
independent variables in xvarlist. Other common models make the
hazard dependent on the elapsed time; such models can be specified
instead by setting the distribution() option to weibull, gamma,
gompertz, lognormal, loglogistic, or one of several other choices,
and a strata(groupvar) option can be used to assume that the function
of elapsed time differs between different groups.
stcox xvarlist, nohr vce(robust) After using stset, estimate a Cox hazard model in which the
hazard (Poisson arrival rate of the first failure) is proportional to
f(elapsed time) × exp(xi'b) where xi includes the independent
variables in xvarlist. The function of elapsed time is implicitly
estimated in a way that best fits the data, and a strata(groupvar)
option can be used to assume that the function of elapsed time differs
between different groups.
As always, see the Stata documentation and on-line help for lots more about survival analysis.
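Putting these commands together, a minimal survival-analysis session might be sketched as follows (the variable names spellLength, failed, group, x1, and x2 are placeholders):

```stata
stset spellLength, failure(failed)                    // declare the survival data
sts graph , survival by(group)                        // survival curves for each group
streg x1 x2, distribution(weibull) nohr vce(robust)   // parametric Weibull hazard model
stcox x1 x2, nohr vce(robust)                         // Cox model, flexible time function
```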
U1. Autocorrelations
corrgram varname Create a table showing autocorrelations (among other statistics) for
lagged values of the variable varname.
corrgram varname, lags(#) noplot You can specify the number of lags, and suppress the plot.
correlate x L.x L2.x L3.x L4.x L5.x L6.x L7.x L8.x Another way to compute autocorrelations,
for x with its first eight lags.
correlate L(0/8).x This more compact notation also uses the 0th through 8th lags of x and
computes the correlation.
correlate L(0/8).x, covariance This gives autocovariances instead of autocorrelations.
dfuller y, regress Show the associated regression when doing the Dickey-Fuller test.
dfuller y, lags(2) regress Carry out an augmented Dickey-Fuller test for nonstationarity using two
lags of y, checking the null hypothesis that y has a unit root, and
show the associated regression.
dfuller y, lags(2) trend regress As above, but now include a time trend term in the associated
regression.
For panel data unit root tests, see Stata’s “xtunitroot” command.
U5. Forecasting
regress y L.y L.x After a regression…
tsappend, add(1) Add an observation for one more time after the end of the sample. (Use
add(#) to add # observations.) Use browse after this to check what
happened. …
predict yhat, xb Then compute the predicted or forecasted value for each observation.
predict rmsfe, stdf And compute the standard error of the out-of-sample forecast.
If you want to compute multiple pseudo-out-of-sample forecasts, you could do something like
this:
gen actual = y
gen forecast = .
gen rmsfe = .
forvalues p = 30/50 {
regress y L.y if t<`p'
predict yhatTemp, xb
predict rmsfeTemp, stdf
replace forecast = yhatTemp if t==`p'+1
replace rmsfe = rmsfeTemp if t==`p'+1
drop yhatTemp rmsfeTemp
}
gen fcastErr = actual - forecast
tsline actual forecast Plot a graph of actual y versus forecasts made using prior data.
summarize fcastErr Check the mean and standard deviation of the forecast errors.
gen fcastLow = forecast - 1.96*rmsfe Low end of 95% forecast interval (using the rmsfe
variable created in the loop above), assuming normally distributed
and homoskedastic errors (otherwise the 1.96 would not be valid).
gen fcastHigh = forecast + 1.96*rmsfe High end of 95% forecast interval, under the same
assumptions.
tsline actual fcastLow forecast fcastHigh if forecast<. Add forecast intervals to the graph of
actual versus forecast values of the y-variable.
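To summarize forecast accuracy in a single number, you might also compute the root mean squared forecast error from the fcastErr variable created above:

```stata
generate sqErr = fcastErr^2    // squared forecast errors
quietly summarize sqErr        // mean squared error is stored in r(mean)
display "Root mean squared forecast error: " sqrt(r(mean))
```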
model, so that the model includes the dummy variable (to allow the constant term to change) and
the multiples of the original variables times the dummy (to allow the coefficients of the other
variables to change); and (3) test the null hypothesis that all of the variables added in step 2 have
coefficients of zero by using Stata’s “test” command (section I2, page 14). This test is often called
a Chow test.
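As a sketch, with a single regressor x and a candidate break just after 1973 quarter 3 (the variable names are placeholders), the three steps might look like this:

```stata
generate late = (t > tq(1973q3))   // step 1: dummy for observations after the break
generate lateX = late * x          // step 2: interact the dummy with each regressor
regress y x late lateX             // model allowing intercept and slope to change
test late lateX                    // step 3: Chow test that the break terms are zero
```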
Breaks at a known time are particularly easy to test for using Stata version 14. With Stata 14,
you can first estimate a time series regression without a break, then use a command like this:
estat sbknown, break(tq(1973q3)) After a time series regression, test for a break in all
coefficients, with the break occurring immediately after quarter 3 of
1973.
estat sbknown, break(tq(1973q3) tq(1979q2)) After a time series regression, test for two breaks in
all coefficients, with the first break occurring immediately after
quarter 3 of 1973, and the second occurring immediately after
quarter 2 of 1979. Additional breaks can be indicated with spaces in
between.
estat sbknown, break(tq(1973q3)) varlist(varlist, constant) After a time series regression, test
for a break in selected coefficients, with the break occurring
immediately after quarter 3 of 1973. Only allow the constant term
and the coefficients of the variables in varlist to have their
coefficients change at the time of the break. Omit the “, constant” if
you do not want the constant term to change.
below, the first and last times are arbitrarily given as -25 and 45, but you will need to figure out the
correct values for your problem, by trimming, say, 15% at each end. This Stata code should be used in the do-
file editor:
forvalues k = -25/45 { // Valid if break candidates after trimming are -25 to 45, change as needed!
generate late = t>`k'
generate lateL1y = late * L.y
generate lateL2y = late * L2.y
regress y L.y L2.y late lateL1y lateL2y
test late lateL1y lateL2y
display "`k' " r(F)
drop late lateL1y lateL2y
}
You could improve the above code. Putting “quietly” in front of a command suppresses its output.
You could also record the highest F-statistic so far from each F-test – to do, so (a) before the
forvalues loop you could put these commands:
scalar highestFSoFar = 0
scalar kAtHighestF = -99
Then (b) inside the forvalues loop, after the test command, you could put:
if r(F)>scalar(highestFSoFar) {
scalar highestFSoFar = r(F)
scalar kAtHighestF = `k'
}
Then (c) at the end you could put:
display "Highest F: " scalar(highestFSoFar)
display "k with highest F: " scalar(kAtHighestF)
If you are using a Stata version preceding 14, this is a hassle, but it has the benefit of getting you
used to valuable Stata programming techniques.
will try to minimize the sum of squared errors by searching through the space of all possible values
for the parameters. However, if we started by estimating b2 as zero, we might not be able to
search well – at that point, the estimate of b3 would have no effect on the sum of squared errors.
Instead, we start by estimating b2 as one, using the “{b2=1}”. The “=1” part tells Stata to start at
1 for this parameter.
Often you may have a linear combination of variables, as in the formula
yi = a1 + a2*exp(b1*x1i + b2*x2i + b3*x3i + b4*x4i) + ei. Stata has a shorthand notation, using “xb: varlist”, to enter the
linear combination:
nl ( y = {a1}+{a2=1}*exp({xb: x1 x2 x3 x4}) ) Estimate this nonlinear regression.
After a nonlinear regression, you might want to use the “nlcom” command to estimate a
nonlinear combination of the parameters.
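For example, after the nl command above, the ratio of the two multiplicative parameters could be estimated like this (parameters defined in nl are referenced as _b[/name]):

```stata
nlcom (ratio: _b[/a2] / _b[/a1])   // delta-method estimate of a2/a1, with standard error
```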
identification code but commonly also because for some reason you have more than one
entry for a code that you meant to be unique. Usually you know what kind of match
should arise, and Stata requires you to specify this information. Indeed, this is really
important to avoid horrible mistakes. For m:m matches, Stata’s merge command does not
do what you would expect – it does not create all the relevant pairs; instead you must use
Stata’s joinby command.
• Keeping matching, master-only, and using-only observations. It may be that not all
observations have matches across the master and using files. When no matches exist,
certain observations are orphans for which it is not possible to add columns of data from
the other dataset, so the added variables can only contain missing values. You may not
want these orphans kept in the resulting data, particularly for orphans from the using
dataset. Therefore Stata differentiates between matching (3), master-only (1), and using-
only (2) observations. It creates a variable named _merge that contains the numbers 3, 1,
and 2 respectively for these three cases. You can then drop observations that you do not
want after the match.
Even better, you can specify up-front to keep only observations of specific type(s). To
do so, use the keep(…) option. Inside the option, specify one or more of “match”,
“master”, and “using”, separated by spaces, to cause matching, master-only, and using-
only observations to be kept in the results. All other observations will be dropped in the
results (though not in any data files on your hard disk).
To ensure against mistakes, you can also use an “assert(…)” option to state that
everything should match, or that there should never be orphan observations from one of
the datasets. If your assertion that this is true turns out to be violated, then Stata will stop
with an error message so you can check what happened and fix possible problems. Inside
the assert(…) option, again specify one or more of “match”, “master”, and “using”,
separated by spaces, to assert that the results will contain only matching, master-only, and
using-only observations.
• Variables other than the identification code that are in both master and using datasets. If
a variable in the using dataset, other than the identification code, already exists in the
master dataset, then it is not possible to bring that column of data into the results as an
independent column (at least, not with the same variable name, and Stata just doesn’t
bring it in). The “update” option to the merge command causes missing values to be
filled in using values in the using dataset, and the “update” and “replace” options together
cause the reverse, in which the using dataset’s values are preferred except where they
have missing values. For matching observations with variables with the same name in
both datasets, if you use the update option, then _merge can take values not just of 1, 2,
or 3, but also of 4 when a missing value was updated from the other dataset, or of 5 when
there were conflicting non-missing values of a variable in the two datasets.
• Reading in only selected variables. If you only want to read in some of the variables
from the using dataset, use the keepusing(varlist) option.
Before using the merge command, therefore, you need to go through each of the above issues and
figure out what you want to do. Then you can use commands such as the following:
merge 1:1 personID using filename Match in observations from the using dataset, with
personID as the identification code variable.
merge 1:1 country year month using filename Match in observations from the using dataset, with
country, year, and month jointly as the identification code.
merge 1:1 personID using filename, keep(match master) Match in observations from the using
dataset, with personID as the identification code variable, and only
keep observations that match or are in the master dataset; ignore
observations that are in the using dataset only.
merge 1:1 personID using filename, assert(match) Match in observations from the using
dataset, with personID as the identification code variable, and assert
that all observations in each dataset match – if they do not, stop with
an error message.
merge 1:1 _n using filename This one-to-one merge assumes that each observation i in the master
dataset matches to each observation i in the using dataset. This is
dangerous because it’s easy to mistakenly have a wrong sort order,
so this is not recommended!
merge m:1 countryID using filename Match in observations from the using dataset, with
countryID as the identification code variable. This is specified as a
many-to-one match, so the master dataset may contain multiple
observations with the same countryID.
merge 1:m countryID using filename Match in observations from the using dataset, with
countryID as the identification code variable. This is specified as a
one-to-many match, so the using dataset may contain multiple
observations with the same countryID.
joinby familyID using filename Carry out a m:m match in which you want all pairs of matching
observations. This is not what you get using “merge m:m”, and you
should not use “merge m:m” unless you really know what you are
doing. See Stata’s help for this command for further information
about options.
The merge command will display the number of resulting observations with _merge equal to 1, 2,
and 3. Always check the values of _merge after merging two datasets, to avoid errors.
Wide Form:
personid income2005 income2006 income2007 birthyear
1 32437 33822 41079 1967
2 50061 23974 28553 1952
Long Form:
personid year income birthyear
1 2005 32437 1967
1 2006 33822 1967
1 2007 41079 1967
2 2005 50061 1952
2 2006 23974 1952
2 2007 28553 1952
This is a trivially simple example because usually you would have many variables, not just
income, that transpose between wide and long form, plus you would have many variables, not just
birthyear, that are specific to the personid and don’t vary with the year.
Trivial or complex, all such cases can be converted from wide to long form or vice versa using
Stata’s reshape command:
reshape long income, i(personid) j(year) // Starting from wide form, convert to long form.
reshape wide income, i(personid) j(year) // Starting from long form, convert to wide form.
If you have more variables that, like income, need to transpose between wide and long form, and
regardless of how many variables there are that don’t vary with the year, just list the transposing
variables after “reshape long” or “reshape wide”, e.g.:
reshape long income married yrseduc, i(personid) j(year) Starting from wide form, convert to
long form.
reshape wide income married yrseduc, i(personid) j(year) Starting from long form, convert to
wide form.
X5. Labels
What if you have string variables that contain something other than numbers, like “male” versus
“female”, or people’s names? It is sometimes useful to convert these values to categorical
variables, with values 1,2,3,…, instead of strings. At the same time, you would like to record
which numbers correspond to which strings. The association between numbers and strings is
achieved using what are called “value labels”. Stata’s encode command creates a labeled numeric
variable from a string variable. Stata’s decode command does the reverse. For example:
encode personName, generate(personNameN)
decode personNameN, generate(personNameS)
If you do a lot of encoding and decoding, try the add-on commands rencode, rdecode, rencodeall,
and rdecodeall (outside RPI computer labs: from Stata’s Help menu, choose Search… and click the
button for “Search net resources,” search for “rencode”, click on the link that results, and click
“click here to install”; in my econometrics classes at RPI: follow the procedure in section J2a).
This example started with a string variable named personName, generated a new numeric
variable named personNameN with corresponding labels, and then generated a new string variable
personNameS that was once again a string variable just like the original. If you browse the data,
personNameN will seem to be just like the string variable personName because Stata will
automatically show the labels that correspond to each name. However, the numeric version may
take up a lot less memory.
If you want to create your own value labels for a variable, that’s easy to do. For example,
suppose a variable named female equals 1 for females or 0 for males. Then you might label it as
follows:
label define femaleLab 0 "male" 1 "female" This defines a label named “femaleLab”.
label values female femaleLab This tells Stata that the values of the variable named female
should be labeled using the label named “femaleLab”.
Once you have created a (labeled) numeric variable, it would be incorrect to compare the
contents of a variable to a string:
summarize if country=="Canada" This causes an error if country is numeric!
However, Stata lets you look up the value corresponding to the label:
summarize if country=="Canada":countryLabel You can look up the values from a label this
way. In this case, countryLabel is the name of a label, and
"Canada":countryLabel is the number for which the label is
“Canada” according to the label definition named countryLabel.
If you do not know the name of the label for a variable, use the describe command, and it will tell
you the name of each variable’s label (if it has a label). You can list all the values of a label with
the command:
label list labelname This lists all values and their labels for the label named labelname.
Stata also lets you label a whole dataset, so that when you get information about the data, the
label appears. It also lets you label a variable, so that when you would display the name of the the
variable, instead the label appears. For example:
label data "physical characteristics of butterfly species" This labels the data.
label variable income "real income in 1996 Australian dollars" This labels a variable.
X6. Notes
You may find it useful to add notes to your data. You record a note like this:
note: This dataset is proprietary; theft will be prosecuted to the full extent of the law.
However, notes are not seen by users of the data unless the users make a point of reading them.
To see what notes there are, type:
notes
Notes are a way to keep track of information about the dataset or work you still need to do. You
can also add notes about specific variables:
note income: Inflation-adjusted using Australian census data.