STATA Tutorial
This handout includes basic information to get started with STATA. STATA is a widely used
statistical and econometric software package (and very likely the one you will need most in your
professional life). As there is no official guide for learning STATA, the best way to become familiar
with it is learning by doing. However, the software also comes with a manual that is useful
because it covers both the technical details and applications. There are other helpful sources on
the internet. For more information visit the following websites:
STATA website
http://www.stata.com/
http://www.stata.com/links/
http://www.stata.com/support/faqs/
http://www.stata.com/capabilities/session.html
Other websites
http://www.ats.ucla.edu/stat/stata/
http://data.princeton.edu/stata/
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
INTRODUCTION TO STATA
This is what the main STATA window looks like. Over the next few pages, I will go over what some
of the parts of the window and the icons mean.
Variable Window: Shows all of the variables in your dataset.
Review Window: Any command that you have entered during the session will be available here.
Instead of retyping the command in the “command window”, you can click on it in the
“review window” and the same command will then appear in the “command window”.
Note: Sometimes only the first two windows appear. To make the other two visible, go to
Window and select the Review and Variable. Alternatively, go to Edit, click on Preferences, Load
Preferences Set and click on Factory Settings.
Hint: You can alphabetically order the variables in the Variable Window by typing aorder in the
Command Window
- STATA Data Editor: Allows you to view and manually make any changes to your dataset (for
instance, changing the variable content by typing in a different value for a particular
observation). When you open the data editor, the observations are numbered on the left-
hand side and the variable names are listed on the top. Try clicking on the variable name to
see what information is available for that particular variable.
- STATA Data Browser: Allows you to see the data without making any changes
- Go Button: By default, if you type in a command whose output doesn’t fit in the “results window”,
STATA displays some of the results and lists --more-- (in blue) at the bottom of the
results screen. To see the rest of the results, click on the “go button” or click on the blue
“--more--” at the bottom of the results window.
To turn this option off so that STATA simply scrolls through all of your results at once, type
set more off, or
set more off, permanently if you would like STATA to apply this setting automatically in every session.
- Stop Button: Sometimes you type in a command and you realize that something is wrong
with the command. To stop STATA from continuing to estimate or display the results,
click on this button.
Whenever you do not know how a command works, you can find basic help within
STATA. This is very useful since some commands have special features and will not work
unless you use them in the right way. Go to Help, where you have three options for getting help:
1. If you click on Contents you can either type in a keyword or command name or you can
search through the different categories in the STATA index of commands,
2. Clicking on Search you will be able to type in any keyword or command to get information on
it, or
3. Clicking on Command allows you to type in any command to obtain information about it.
Alternatively, in the Command window you can type help followed by the name of the
command you have doubts about.
Example:
Click on Help, Contents, and then type in regress in the Contents window to see the help file for
this regression command
LOG-FILES
A log file saves all the commands and results (including error messages) from your STATA session
in a file that you can later access. It is good practice to open a log file at each STATA session to
track all of your work and save yourself from duplication.
In using either of the options, a window will pop up giving you the option to select the location
where you want to save the log file.
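The same steps can also be typed as commands. The lines below are a minimal sketch; the file name mysession.smcl is only a placeholder, and the log is written to the current working directory.

```stata
* Open a new log file; replace overwrites an existing file with the same name
* (use append instead of replace to add to an existing log)
log using "mysession.smcl", replace

* Everything typed from here on, and its output, is recorded
display "this line and its result are recorded"

* Suspend logging temporarily, then resume it
log off
log on

* End the log session
log close
```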
Note that when you have opened a log file this is indicated at the bottom right of the Results
window.
1. View snapshot of log file: This opens a window that shows all of the commands and
results that you have typed in during that particular log session
2. Close log file: Completely ends the log session and commands typed after this will not
appear in the log file. If you want to close the log file you can also type: log close
3. Suspend log file: Allows you to temporarily stop commands and results from being saved
in the log file. You can resume the log file by clicking on the “log icon” and clicking
“resume suspended log file”
A log file has an .smcl extension. To view a log file that you created previously, either:
1. Click on the Log File icon, open the log file by clicking on the Save button (if it is not already
open), and click on View existing file (read only), OR
2. Go to File, Log, View and then click on Browse to find the saved log file.
2. Append the results of the new session you are opening to this already saved log file, or
3. Overwrite the saved log file (WARNING: this will completely delete any commands and
results you had saved in this log file).
DO-Files
We advise you to always use a do-file. Do-files record a series of commands you may need to
run repeatedly in different STATA sessions. To open a do-file, click on the do-file icon,
clear
cd "c:\YourDirectory"
set mem 10m
use saving.dta, clear
summarize
tab black, summ(educ)
gen old=0
replace old=1 if age>40
tab old
You can save your do-file in the same way you save a program, a Word file, etc. If you want to
open an existing do-file:
1. Go to the Do-file Editor, then File, then Open. To run it, as before, go to Tools and Do.
These files are useful since the instructions you use will become much more complicated
than just a few lines.
Do-files allow you to program in STATA. As you advance in your research you will need to make
your own programs to satisfy your special needs.
If you want to insert comments in a do-file, that is, notes to yourself that
remind you of specific details about what you are doing, start the line with an
asterisk (*). Whatever you write after the * on that line will be ignored by STATA.
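A short sketch of a commented do-file follows. The variable x is made up for illustration. The // and /* */ comment styles shown here are intended for do-files rather than for the Command window.

```stata
* This whole line is a comment and is ignored by STATA
clear
set obs 5

* Create a variable that simply numbers the observations
generate x = _n    // text after a double slash is also ignored

/* A block comment
   can span several lines */
summarize x
```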
If you want to save the database you have opened in STATA type in:
save nameofdatabase.dta, replace
Why replace? If it already exists, you cannot change it unless you use this option.
- When the dataset is in STATA format, the extension for the dataset is .dta, for example,
filename.dta. There are different ways to open a dataset:
And type,
use nameofdatabase.dta, clear
Why clear? Because if another dataset is already open in STATA, the program
will not let you open a new one until the data in memory are cleared, so that
two different databases do not get mixed up.
Open the dataset SMOKE.dta (you will find it in the T drive). The dataset contains the
variables listed in the table below. Remember to first specify the directory from which
you will be working cd c:\Directory. Then type, use SMOKE.dta, clear
Variable    Description
educ        years of schooling
cigpric     state cigarette price, cents per pack
white       =1 if white
age         in years
income      annual income, $
cigs        cigarettes smoked per day
restaurn    =1 if state restaurant smoking restrictions
lincome     log(income)
agesq       age^2
lcigpric    log(cigprice)
When the dataset opens successfully, the variables that exist in this dataset will be listed in the
Variable Window.
Hint:
If you try to open a dataset, and you get an error message like this:
no room to add more observations
An attempt was made to increase the number of observations beyond what is currently possible. You have the
following alternatives:
1. Store your variables more efficiently; see help compress. (Think of STATA's data area as the area
of a rectangle; STATA can trade off width and length.)
3. Increase the amount of memory allocated to the data area using the set memory command; see
help memory.
r(901);
It means that you need to increase the size of the memory in STATA. To do so, type in the
command window:
set mem 200m
OR
set memory 200m
You actually have a number of options for the size of the memory – I happened to choose
200 megabytes in this case. See the help file for more information.
- Dataset files have a .dta extension. The datasets contain all the variables you use for
statistical/econometric analysis.
- Do-files are the programming files you need to run specific instructions and they can be
identified by the .do extension.
- Finally you may find the .smcl files that refer to log files and are the printable files that
store your output results.
7. MERGING AND APPENDING DATASETS
Merge data files: assume that you want to merge your current dataset (called the “master”
dataset) with another data file (called the “using” dataset) that includes, for each observation
identifier, additional variables that are of interest for the analysis. The two separate datasets are
shown below in tables 1 and 2. For analysis purposes you would like to have a dataset in which
var1, var2, var3 and var4 are listed together, as shown in table 3.
Table 1: Master dataset
id   var1   var2
1    10     50
2    20     40
3    30     30
4    40     20
5    50     10

Table 2: Using dataset
id   var3   var4
1    1      .
2    0      4
3    1      5
4    0      .
5    0      3

Table 3
id   var1   var2   var3   var4
1    10     50     1      .
2    20     40     0      4
3    30     30     1      5
4    40     20     0      .
5    50     10     0      3
As an exercise, find the dataset SMOKE_merge1.dta in the T drive, and merge it with the
dataset called SMOKE_merge2.dta. First, open both datasets and identify which variables
are the same in both files. This is the first step to identify the variable(s) that link the two
datasets (in our case this variable is id). If you go to help, you will find information on the
command merge and then type in the following:
Note: before merging you will need to sort each dataset by the variable that acts as the
link between the master and using datasets.
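As an illustration with the small tables 1 and 2 rather than the SMOKE files, a merge could look like the sketch below. It types the data in by hand and stores the using dataset in a temporary file (the name using_data is arbitrary). The merge 1:1 syntax assumes STATA 11 or later; in older versions the equivalent form is merge id using filename, which requires both files to be sorted by id.

```stata
* Build the "using" dataset (Table 2) and save it to a temporary file
clear
input id var3 var4
1 1 .
2 0 4
3 1 5
4 0 .
5 0 3
end
sort id
tempfile using_data
save `using_data'

* Build the "master" dataset (Table 1)
clear
input id var1 var2
1 10 50
2 20 40
3 30 30
4 40 20
5 50 10
end
sort id

* Merge on the linking variable id, then inspect the result
merge 1:1 id using `using_data'
tabulate _merge
list id var1 var2 var3 var4
```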
STATA will merge the two datasets and will generate an additional variable called
“_merge”. Find out what the values of this variable indicate (search in Help).
Append data files: you may need to put together two datasets that
contain the same variables but correspond to different groups of observations.
Table 4 shows five observations that have information on var1 and var2, whereas table 5
contains additional observations with information on the same variables. The objective is to
join them vertically such that the resulting dataset looks like table 6.
Table 4
id   var1   var2
1    10     50
2    20     40
3    30     30
4    40     20
5    50     10

Table 5
id   var1   var2
6    60     50
7    70     40
8    80     30
9    90     20
10   100    10
Table 6
id var1 var2
1 10 50
2 20 40
3 30 30
4 40 20
5 50 10
6 60 50
7 70 40
8 80 30
9 90 20
10 100 10
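A minimal sketch of the append step, typing tables 4 and 5 in by hand (the temporary file name second is arbitrary):

```stata
* Second block of observations (Table 5), saved to a temporary file
clear
input id var1 var2
6 60 50
7 70 40
8 80 30
9 90 20
10 100 10
end
tempfile second
save `second'

* First block of observations (Table 4)
clear
input id var1 var2
1 10 50
2 20 40
3 30 30
4 40 20
5 50 10
end

* Stack the second block underneath the first one
append using `second'
list
```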
Once you have opened a database you have variables that contain the information you want.
Each one appears under its name in the Variables window. The first thing you
can do is to see what they look like.
1. If you go to Window, and then Data Editor, you can look at the data as a list (as in Excel).
In general, if you see varlist written after the name of the command, this means that
you need to replace varlist with one or more variables of your choice from the
dataset that you have open. For instance, using the data SMOKE.dta if you type:
use SMOKE.dta, clear
summarize educ income
or
summarize educ income, detail
the summary statistics displayed will refer only to the variables educ and income
b. list varlist
c. If you have a variable that is discrete, i.e. years of schooling, gender, etc, you could
type:
tabulate varlist
Alternatively, you can obtain some data description and statistics by clicking on the Data and
Statistics options in the toolbar at the top left of the STATA window. For instance, if you need
a description of a varlist click on Data → Describe Data → Describe Data in Memory. You
have the option to choose the variables to be described. The output obtained is exactly the same
as the one you obtain by typing the command describe. If you would like to obtain some summary
statistics, click on Statistics → Summary, tables and tests → Summary and descriptive statistics →
Summary statistics and the output will be identical to using the command summarize.
Another example using US savings data will go into detail about how to use these commands. This
database contains information about individuals’ savings, income, size of the household, etc. The
database can be obtained by typing in the Command window:
Variable    Description
sav         Household annual savings in dollars.
inc         Household annual income in dollars.
size        Number of members in the household.
educ        Years of schooling of the household head.
age         Age of the household head.
black       Dummy variable (0 or 1) for whether the household head
            declares himself to be African American.
cons        Always equal to 1.
If you type describe you will obtain the following STATA output:
From the table above, you notice that it contains information for 100 individuals. You also note
that the mean level of savings is 1582.51 dollars, and actually some people have negative savings
(the minimum is -5777; what does that mean?). Look at the black variable. What does a mean of
0.07 mean? It means that 7% of the household heads in the sample declare themselves to be African
American. You recognize that it is a dummy variable because min=0, max=1.
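To see why the mean of a dummy variable is a share, here is a small made-up dataset (not SAVING.dta) with 100 observations, 7 of which have the dummy equal to 1:

```stata
* Artificial data: 100 observations, the first 7 have black = 1
clear
set obs 100
generate byte black = (_n <= 7)

* The mean of a 0/1 variable is the proportion of ones
summarize black
display r(mean)
```

After summarize, the mean is stored in r(mean), so display r(mean) shows the proportion of ones.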
Additional information can be obtained for discrete variables. For instance, educ contains only
integer values, from 2 to 20. If you want to see the number of observations in each education-
value, type in
tabulate educ
years educ, |
household |
head | Freq. Percent Cum.
------------+-----------------------------------
2 | 1 1.00 1.00
3 | 1 1.00 2.00
4 | 1 1.00 3.00
6 | 1 1.00 4.00
7 | 4 4.00 8.00
8 | 10 10.00 18.00
9 | 11 11.00 29.00
10 | 9 9.00 38.00
11 | 4 4.00 42.00
12 | 32 32.00 74.00
13 | 2 2.00 76.00
14 | 4 4.00 80.00
15 | 1 1.00 81.00
16 | 9 9.00 90.00
17 | 6 6.00 96.00
18 | 2 2.00 98.00
19 | 1 1.00 99.00
20 | 1 1.00 100.00
------------+-----------------------------------
Total | 100 100.00
In both cases above the command can be shortened, try typing des, sum or tab. STATA will
recognize the commands as describe, summarize and tabulate, respectively.
The by prefix allows you to “run command on subsets of data” (STATA help). It is extremely useful
when you want to compute statistics or generate variables separately for specific groups of
observations. For instance, suppose that you are interested in certain sub-groups of the whole
sample: you want to know whether there are differences in education levels by race (race and
gender are very interesting topics in Applied Econometrics), so you want to summarise the variable
educ separately for each value of black. Type in
Which problem do you encounter? You need to sort the variable. There are 2 ways:
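Two common ways of dealing with the sorting requirement are sketched below on a small made-up dataset (the values are invented; only the variable names follow SAVING.dta). Whether these are exactly the two ways the handout had in mind is an assumption.

```stata
* Small artificial dataset
clear
input black educ
0 12
1 8
0 16
1 10
0 9
end

* Way 1: sort the data first, then use the by prefix
sort black
by black: summarize educ

* Way 2: let STATA sort and run the command in one step
bysort black: summarize educ
```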
-> black = 0
-----------------------------------------------------------------------
-> black = 1
This is a very useful instruction. First, you specify which variable will be used to separate the
data into categories (there should not be too many; black has only 2). Second, for each category,
you can obtain information about other variables. Check for yourselves.
Alternatively, you can also type in tab black, summ(educ)
=1 if |
household | Summary of years educ, household
head is | head
black | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 11.806452 3.3435794 93
1 | 8.5714286 3.4572216 7
------------+------------------------------------
Total | 11.58 3.435348 100
So far we have seen examples of databases that have numeric variables. However, apart from
numeric variables there are also string variables. Numeric variables contain numbers and are
typically displayed in the Data Editor/Browser in black (an example is the variable age in the
figure below). String variables contain letters and usually represent some kind of qualitative
variable. String variables are shown in red in the Data Editor/Browser (for instance, the variable
gender shown in the figure below). Summary and descriptive statistics are easily obtained using the
commands above for numeric variables. However, for string variables the summarize command
provides no information: STATA will not calculate any statistics because the data within the
variable cannot be used to compute numerical descriptions.
To see the difference between numeric and string variables use the database WAGEM.dta. This
database contains information of wages, age and other working related variables. The variables
are listed below:
Variable    Description
earns       total earnings
age         in years
educ        years of schooling
gender      male, female
marr        married, not married
union       =1 if belong to union
exper       age - educ - 6
yrsmarr     years married
Type in use WAGEM.dta, clear. Then type sum; no statistics will be computed for the string
variables. However, you can obtain frequencies for these variables: typing tab gender or tab
marital_status gives you the table of frequencies. Similarly, if you type codebook gender
marital_status, you will only obtain frequencies but no statistics.
9. UNDERSTANDING THE COMMAND LINE
When you examine a command in the help file, STATA has a set way of displaying the syntax of
the command. For example, if you search for regress in the STATA help file, you will see:
This is just a brief overview of how to interpret this line. For more detailed information, refer to
the STATA reference manual.
Everything that is not in brackets is compulsory for that command. For instance, if you only typed
regress
you would get the error message:
The “if” expression allows you to narrow down the subsample to which the command
applies. The if qualifier accepts logical statements. For instance, using again the dataset
SAVING.dta, if you type summarize inc if age>40, STATA offers descriptive information about
the income variable for the subsample of observations for which age>40.
                                               Relational
 Arithmetic              Logical               (numeric and string)
 --------------------    ------------------    ---------------------
 +  addition             ~  not                >   greater than
 -  subtraction          !  not                <   less than
 *  multiplication       |  or                 >=  > or equal
 /  division             &  and                <=  < or equal
 ^  power                                      ==  equal
                                               ~=  not equal
 +  string concatenation                       !=  not equal
Note that a double equal sign (==) is used for equality testing.
Some examples:
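The original examples are not reproduced here. As a stand-in, the sketch below combines the if qualifier with the operators from the table on a small made-up dataset (the values are invented; the variable names follow SAVING.dta).

```stata
* Small artificial dataset; the last observation has missing income
clear
input age inc black
25 3000 0
45 5000 1
52 4200 0
38 . 0
end

* Relational operator: only observations with age above 40
summarize inc if age > 40

* Logical "and": age above 40 and black equal to 0 (note the double ==)
summarize inc if age > 40 & black == 0

* Logical "or": age 40 or below, or black equal to 1
count if age <= 40 | black == 1

* "Not equal": list observations whose income is not missing
list if inc != .
```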
• generate: This command creates a new variable. Note that you cannot create the same
variable twice (if you try, STATA will tell you that you cannot). The corresponding
command is
generate newvar = exp
• replace: If you want to modify an existing variable, you have to use
this command.
• rename: you may want to change the names of variables at your convenience (for example, so
that the name alone describes the content).
As an example, let’s use the database we opened before. Suppose we want to create two
subgroups. One for young people (less or equal to 40 years old) and other for old people (more
than 40 years old). Then you can type:
generate old=0
replace old=1 if age>40
(The actual command is generate, but you can just type the abbreviation gen and it will be
recognized.)1
You have now generated another dummy variable, called old. What does it look like?
tabulate old
If you want to change the name of that variable, i.e. you want it to be “Old” instead of “old”, then
type rename old Old
1 Suppose you insist on generating the same variable again. If you type generate old=0, the
following error message will appear:
old already defined
r(110);
Functions
Sometimes you need to manipulate variables using functions of existing variables. Some
examples are the following:
Type help functions to obtain a list of all the available mathematical and statistical functions.
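The handout's own examples are not reproduced here. As a stand-in, here is a minimal sketch on made-up data of how functions are combined with generate; the new variable names are arbitrary.

```stata
* Small artificial dataset
clear
input inc age
1000 30
2500 45
4000 60
end

* Natural logarithm of income
generate lninc = ln(inc)

* Age squared, using the power operator
generate agesq = age^2

* Square root of income
generate sqrtinc = sqrt(inc)

list
```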
Missing values
Missing values are displayed in different ways depending on the type of variable. In a numeric
variable a missing value is shown as a dot (.); in a string variable it is shown as a blank. To
refer to a missing value of a string variable in a command, type an empty pair of quotation
marks ("").
If you try to divide by zero, or take the logarithm of a negative number, STATA will replace the
result with a missing value. You will recognize it by a dot (.). In the examples above, if any of
the variables used to compute the new variable contains a missing value, the new variable will be
missing for that observation.
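A small sketch with made-up numbers shows both cases. STATA reports how many missing values each generate command produced.

```stata
* Small artificial dataset
clear
input x y
10 2
5 0
-4 1
end

* Second observation: division by zero gives a missing value
generate ratio = x / y

* Third observation: logarithm of a negative number gives a missing value
generate lnx = ln(x)

list
count if missing(ratio)
count if missing(lnx)
```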
Labels
2 STATA is case sensitive. That is, it matters whether you type database or Database or dATaBaSe;
each is considered a different name. All commands need to be typed in lower case.
If you type in tabulate old
Dummy |
variable, 0 |
if age<=40, |
1 otherwise | Freq. Percent Cum.
------------+-----------------------------------
0 | 61 61.00 61.00
1 | 39 39.00 100.00
------------+-----------------------------------
Total | 100 100.00
- Label values
You may want to give some labels to values of a variable that represent different categories.
For instance, the variable old is coded as 0 and 1, but we don’t know which category
corresponds to each age breakdown. We can attach labels to these values such that we will
be able to identify the value with its corresponding category.
If you type tab Old, are there any changes in the variable? We have just asked STATA to store
these labels for variable Old under the name “old”. However, we need to ask STATA to
actually attach these labels to the variable. Type in
Type in tab Old and see any changes made. You should obtain a STATA output like the one
below,
Dummy |
variable, 0 |
if age<=40, |
1 otherwise | Freq. Percent Cum.
------------+-----------------------------------
no | 61 61.00 61.00
yes | 39 39.00 100.00
------------+-----------------------------------
Total | 100 100.00
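The two labelling steps described above (storing the labels under the name “old”, then attaching them to the variable Old) can be sketched as follows. The data are made up, and the exact commands used in the original handout are an assumption; label define and label values are the standard STATA commands for this purpose.

```stata
* Small artificial dataset with a 0/1 variable called Old
clear
input Old
0
1
1
0
end

* Describe the variable itself
label variable Old "Dummy variable, 0 if age<=40, 1 otherwise"

* Step 1: store the value labels under the name "old"
label define old 0 "no" 1 "yes"

* Step 2: attach the stored labels to the variable Old
label values Old old

tabulate Old
```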
When you open your dataset there may be variables that you do not need at all. For instance,
you may observe that variables var1 and var2 do not contain any information and you may want
to delete them.
Type drop var1 var2 and these two variables will be removed from the dataset in memory.
Alternatively, you can type keep var1 var2 var100 and only var1, var2 and var100 will be
kept, whereas all the variables not specified in the command line are deleted by STATA.
Move variables
Variables may not be listed in the order that you prefer. In your dataset your variables would
be displayed in the following order
You can use the command “move varname1 varname2”. This will relocate the variable as
desired. However, you need to do this as many times as relocations are required. Alternatively, you
can type order sav educ age inc size black old cons. Using only one command, STATA will arrange
the variables in the order specified.
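Sort: you may also want to arrange the observations according to the values of a variable, in
ascending or descending order.
Type sort age and the observations will be arranged in ascending order of age (for descending
order, type gsort -age). If the variable contains missing values they will be placed last in
ascending order, because STATA treats numeric missing values as larger than any number. String
variables can also be sorted.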
Recode variables: this will help you to change any value within a variable. For instance, variable
black is coded as 0 and 1, and you may want to have it coded as 1 and 2. If you type in
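One way to write such a recode is sketched below on made-up data. Whether this is the exact command the handout intended is an assumption. Each rule is applied to the original values, so the 0s become 1s and the 1s become 2s without interfering with each other.

```stata
* Small artificial dataset
clear
input black
0
1
0
end

* Change the coding from 0/1 to 1/2
recode black (0 = 1) (1 = 2)
tabulate black
```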
Encode variables: some variables are stored as strings. You may want to encode them and
have them as numeric. Using WAGEM.dta, type
encode marital_status, gen(married). What is this command generating? Which changes do you
observe?
(From STATA manual) “Dates and times are called %t values. %t values are numerical and
integral. The integral value records the number of time units that have passed from an agreed-
upon base, which for STATA is 1960.
Coding and interpretation of date and time (%t) values are as follows:”
+---------------------------------------------------------------------+
| | | ----- Numerical value & interpretation ------ |
| Format | Meaning | Value = -1 | Value = 0 | Value = 1 |
|--------+------------+---------------+---------------+---------------|
| %tc | clock | 31dec1959 | 01jan1960 | 01jan1960 |
| | | 23:59:59.999 | 00:00:00.000 | 00:00:00.001 |
| | | | | |
| %td | days | 31dec1959 | 01jan1960 | 02jan1960 |
| | | | | |
| %tw | weeks | 1959w52 | 1960w1 | 1960w2 |
| | | | | |
| %tm | months | 1959m12 | 1960m1 | 1960m2 |
| | | | | |
| %tq | quarters | 1959q4 | 1960q1 | 1960q2 |
| | | | | |
| %th | half-years | 1959h2 | 1960h1 | 1960h2 |
| | | | | |
| %tg | generic | -1 | 0 | 1 |
+---------------------------------------------------------------------+
If you have a date variable that is stored as a numerical value like above, you can generate new
variables that contain the day, month or year associated with this variable. Check the variable
date. The way this variable is displayed does not convey any information. If you would like STATA
to display the date corresponding to the numerical value you need to format the variable:
format date %td (see the table above for more options on how dates can be displayed; try for
instance “format date %tq”).
You may want to generate a different set of variables that capture the day, month and year in
which the event happened. The way to do so is to extract this information from the variable date.
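A minimal sketch of that extraction, using three made-up daily values (0, 1 and 366 days after the base date of 01jan1960) and the functions day(), month() and year():

```stata
* Artificial daily dates, stored as days since 01jan1960
clear
input date
0
1
366
end

* Display the numbers as dates
format date %td

* Extract the components into separate variables
generate day = day(date)
generate month = month(date)
generate year = year(date)
list
```

Because 1960 was a leap year, the value 366 corresponds to the first day of 1961.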
Copying and pasting results
If you would like to copy the results in the STATA output window and paste them into Word, it is
best to select the output, copy it as text or as a table, and paste it into a Word document using
the font Courier New in size 8 or 9.
When you have finished your session and you want to save your dataset because it has been
modified you can click on File → Save as and you will be prompted with a window that will give
you the option to save the new dataset in your preferred location. It is always desirable to keep
your initial dataset (SAVING.dta) unmodified and save any changes you make to the data into
another data file (i.e. SAVING2.dta).
If you click on the exit button by mistake, no worries! STATA will always ask you whether you
want to save any changes made. You can also close the session typing exit into the command
window and if the data has been modified you will get the following error message in the results
window:
It is up to you to decide whether to save it or not. If you want to save it you just follow the steps
mentioned above (click on File → Save as) and save it in your preferred location.
Note: if you put the name of the dataset you open before (in our example SAVING.dta) then you
will replace the old SAVING.dta with this one. It is advisable to save the work with a new name
(like SAVING2.dta).
If you don’t want to save any changes you can also type clear in the command window and exit
STATA straight away.
Correlation
After learning the basic concepts of STATA, here we work through an application of the kind of
analysis you can undertake with it. Using the US savings data SAVING.dta, suppose we want to
estimate the following model:
$$sav = \beta_0 + \beta_1 \cdot inc + u$$
In this case, we are interested in the relation between income and savings. As your income
increases, you expect that your savings also increase. But by how much? From an introductory
course in Macroeconomics you know that $\beta_1$ is called the “propensity to save”. If
$\beta_1 = 1$ it means that for each dollar increase (decrease) in people’s income, on average
they also increase (decrease) their savings by one dollar. If $\beta_1 = 0.5$ it means that for
each dollar increase (decrease) in people’s income, on average they increase (decrease) their
savings by fifty cents, and so on.
The first thing we can check about these variables is whether there is a correlation between them.
| sav inc
-------------+------------------
sav | 1.0000
inc | 0.2493 1.0000
(The actual command is correlate, but you can just type corr and it will be recognized.)
| sav inc
-------------+------------------
sav | 1.0000
|
|
inc | 0.2493 1.0000
| 0.0124
The p-value of the test is 0.0124 (you should remember how the test statistic is constructed).
Therefore you reject the null hypothesis that the correlation is zero.
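As a reminder of how that test statistic is built, the usual t-statistic for a correlation coefficient is $t = r\sqrt{N-2}/\sqrt{1-r^2}$ with N-2 degrees of freedom. The sketch below recomputes it from the rounded values in the output above (r = 0.2493, N = 100), so the results should only agree with the reported p-value up to rounding. The scalar names are arbitrary.

```stata
* Values taken from the correlation output above
scalar r_sav_inc = 0.2493
scalar n_obs = 100

* t-statistic for H0: correlation = 0
scalar t_stat = r_sav_inc * sqrt(n_obs - 2) / sqrt(1 - r_sav_inc^2)
display "t = " t_stat

* Two-sided p-value from the t distribution with N-2 degrees of freedom
display "p = " 2 * ttail(n_obs - 2, abs(t_stat))
```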
Simple regression
If you want the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ (estimates of the parameters of
interest), type in the Command window:
------------------------------------------------------------------------------
sav | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inc | .1466283 .0575488 2.55 0.012 .0324247 .260832
_cons | 124.8424 655.3931 0.19 0.849 -1175.764 1425.449
------------------------------------------------------------------------------
A formal test is given by the t-test. In each case, it provides the t-value for a test that the
specified coefficient is zero (i.e. $H_0: \beta = 0$; $H_a: \beta \neq 0$). For the significance
just look at the p-value (which appears below P>|t|). For the constant, we obtain a p-value
of 0.849, which means that it is not significant at any useful significance level. However,
the slope coefficient has a p-value of 0.012, which means it is significant at the 10% and 5%
significance levels, although it is not at 1%.
A small p-value means the estimated coefficient is significant; a large one means it is not. For
instance a p-value of 0.002 says that you reject the null hypothesis that the coefficient is
zero at the 1% significance level (since 0.002<0.01) and consequently also at the 5% and 10%
levels. On the other hand a p-value of 0.035 means that you cannot reject the null hypothesis at
the 1% significance level, but you can at 5% and 10% (since 0.01<0.035<0.05<0.1). Finally a
p-value like 0.56 tells you that your estimate is not even significant at the 10% level
(0.56>0.1>0.05>0.01).
• At the top right, you have a test for all the coefficients (other than the constant) being
jointly zero, $H_0: \beta_1 = 0$. This is the F-test. Look at the p-value: 0.0124. That means you
reject the null hypothesis. Note that in the simple regression case, the F-test is equivalent
to the t-test of $\beta_1$ (for that reason you obtain the same p-value!)
By the way, do you know how to get the degrees of freedom for the F-test? Note that
the degrees of freedom are 1 for the numerator and N-2=98 for the denominator. This
can also be seen in the regression output.
The regression output also provides information about TSS (Total Sum of Squares), RSS
(Regression Sum of Squares) and ESS (Error Sum of Squares). From the top left you get:
RSS | 66368437
ESS | 1.0019e+09
-----------+---------------
TSS | 1.0683e+09
You obtain the following F-value:
$$F_{1,N-2} = \frac{RSS}{ESS/(N-2)} = \frac{66368437}{1.0019 \times 10^{9}/98} = 6.49$$
Finally, note that the p-value of the t-statistic, the F-statistic and the correlation
coefficient test is the same: 0.0124. Why?
$$R^2 = \frac{RSS}{TSS} = \frac{66368437}{1.0683 \times 10^{9}} = 0.0621$$
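Both numbers can be checked by hand with display, using the sums of squares printed in the output above. Because ESS and TSS are shown rounded, the results should match the regression output only approximately (about 6.49 and 0.062). The scalar names are arbitrary.

```stata
* Sums of squares copied from the regression output above
scalar RSS = 66368437
scalar ESS = 1.0019e+09
scalar TSS = 1.0683e+09

* F-statistic with 1 and N-2 = 98 degrees of freedom
display "F  = " RSS / (ESS / 98)

* R-squared
display "R2 = " RSS / TSS
```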
After you run a regression, basic information about your results is stored. Type
ereturn list
scalars:
e(N) = 100
e(df_m) = 1
e(df_r) = 98
e(F) = 6.491777898938016
e(r2) = .0621271647345948
e(rmse) = 3197.414708235473
e(mss) = 66368436.97882748
e(rss) = 1001899160.011173
e(r2_a) = .0525570337624989
e(ll) = -947.893503812726
e(ll_0) = -951.1005492750204
macros:
e(title) : "Linear regression"
e(depvar) : "sav"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 2
e(V) : 2 x 2
functions:
e(sample)
• Scalars: they are just numbers that you can use in other applications. For instance, if you
want to display R2 again you can type
display e(r2)
.06212716
• Matrices: here you can get the coefficients as a vector and their variance-covariance matrix.
matrix list e(b)
e(b)[1,2]
           inc      _cons
y1   .14662835  124.84241
matrix list e(V)
symmetric e(V)[2,2]
               inc       _cons
  inc    .00331186
_cons   -32.924014   429540.14
Individual coefficients can also be displayed directly:
display _b[inc]
.14662835
display _b[_cons]
124.84241
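The stored results are enough to reproduce the F-value and R2 computed by hand earlier. The following is a minimal sketch, assuming the tutorial's savings data (variables sav and inc) are in memory; the names F_check, R2_check and V_check are illustrative. Mind the naming: what this handout calls RSS (regression sum of squares) is stored by STATA as e(mss), and what it calls ESS (error sum of squares) is stored as e(rss).

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
quietly regress sav inc

* Naming caution: e(mss) is the handout's RSS, e(rss) is the handout's ESS.
scalar F_check  = (e(mss)/e(df_m)) / (e(rss)/e(df_r))
scalar R2_check = e(mss) / (e(mss) + e(rss))
display "F by hand:  " F_check  "   e(F):  " e(F)
display "R2 by hand: " R2_check "   e(r2): " e(r2)

* Standard error of the slope from the variance-covariance matrix e(V)
matrix V_check = e(V)
display "se(inc) from e(V): " sqrt(V_check[1,1]) "   _se[inc]: " _se[inc]
```

Each pair of numbers should coincide, which shows that the regression table is built from these stored quantities.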
Testing after regression
After you run a regression you can test other hypotheses. For instance, suppose you are
interested in H0: β1 = 1. From your knowledge of Econometric Theory you would like to use
$$t_{\beta_1 = 1} = \frac{\hat{\beta}_1 - 1}{\sqrt{\widehat{Var}(\hat{\beta}_1)}}$$
which follows a t distribution with N-K degrees of freedom, N being the number of observations
and K the number of parameters estimated (in our case just 2). In STATA this is easy; type
test inc=1
( 1) inc = 1
F( 1, 98) = 219.89
Prob > F = 0.0000
Why does it use an F-test? The F-test allows you to work with more complicated hypotheses. And
if t is a random variable following a t distribution with d degrees of freedom, then its square,
t², has the same distribution as an F(1,d) random variable.
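To see the link between the two tests in numbers, the t-statistic can be built by hand and squared. This is a sketch, assuming the savings data are in memory; t_check and p_check are illustrative names. With the estimates listed earlier (coefficient .14662835, variance .00331186), t is about -14.83 and its square about 219.9, which matches F(1, 98) = 219.89.

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
quietly regress sav inc

* t-statistic for H0: beta1 = 1, and its two-sided p-value
scalar t_check = (_b[inc] - 1) / _se[inc]
scalar p_check = 2*ttail(e(df_r), abs(t_check))
display "t = " t_check "   t^2 = " t_check^2 "   p-value = " p_check

* The same hypothesis through test, which reports the F version
quietly test inc = 1
display "F from test: " r(F) "   p-value: " r(p)
```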
If you want to replicate the F-test which appears at the top right of your regression output you
can type:
test inc=0
( 1) inc = 0
F( 1, 98) = 6.49
Prob > F = 0.0124
You can also test more complex hypotheses. For instance, a test that both coefficients are zero
can be done by:
( 1) inc = 0
( 2) _cons = 0
F( 2, 98) = 15.49
Prob > F = 0.0000
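One way to request a joint test like this is sketched below. It assumes the savings data are in memory and uses the parenthesised form of test, which is only one of several equivalent ways of writing the command (the form used to produce the output above is not shown).

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
quietly regress sav inc

* Joint test of two restrictions: slope = 0 and constant = 0
test (inc = 0) (_cons = 0)
```

The numerator degrees of freedom equal the number of restrictions (2), while the denominator keeps the residual degrees of freedom of the regression (98).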
Special case 1: Suppose we want to estimate a different model, without a constant term, such as
sav = β1 * inc + u
------------------------------------------------------------------------------
sav | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inc | .1561974 .0279389 5.59 0.000 .1007606 .2116343
------------------------------------------------------------------------------
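A regression through the origin can be requested with the noconstant option of regress. The sketch below assumes the savings data are in memory and that this option is what produced the table above.

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
* noconstant drops the intercept, so only the slope on inc is estimated.
regress sav inc, noconstant
display "slope without constant: " _b[inc]
```

Note that the R2 reported for a model without a constant is computed differently (around zero rather than around the sample mean), so it is not comparable with the R2 of the model that includes a constant.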
Special case 2: In the statistics module you have studied the behaviour of random variables. Suppose
that sav is a random variable which is normally distributed with unknown mean and unknown
variance, i.e. sav ~ N(µ, σ²). What is an unbiased estimator of µ? As you may know, that is just
the sample mean, µ̂ = $\overline{sav}$.
To get the sample mean you can just type summarize sav. An alternative way is using regression.
Type:
regress sav
------------------------------------------------------------------------------
sav | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 1582.51 328.4902 4.82 0.000 930.7142 2234.306
------------------------------------------------------------------------------
Then, you get µ̂ = 1582.51. What about testing H0: µ=0? This is exactly the t-test provided in the
regression output. The p-value is 0.000, which means you reject the null hypothesis!
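The equivalence can be checked directly. This is a sketch, assuming the savings data are in memory; mean_sav and se_mean are illustrative names. The standard error of the constant should equal the usual standard error of a sample mean, sd/sqrt(N), and ttest should report the same t-statistic as the regression.

```stata
* Assumes the tutorial's savings data (variable sav) are in memory.
quietly summarize sav
scalar mean_sav = r(mean)
scalar se_mean  = r(sd)/sqrt(r(N))

* Regression on a constant only
quietly regress sav
display "mean:       " mean_sav "   _b[_cons]:  " _b[_cons]
display "sd/sqrt(N): " se_mean  "   _se[_cons]: " _se[_cons]

* One-sample t-test of H0: mean = 0
ttest sav == 0
```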
$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left( sav_i - \hat{\beta}_0 - \hat{\beta}_1 \cdot inc_i \right)^2$$
How large is this quantity? After the regression of sav on inc (run it again if you have
estimated another model since), type:
predict u, resid
(This creates a new variable, whose name you need to specify, in our case u, with the
residuals. You are generating a variable u that contains the estimated error term
û = sav − β̂0 − β̂1 * inc.) [3]
generate u2=u^2
summarize u2
display r(sum)
(When you type summarize, STATA calculates more things than it reports. One of them is
stored under the name r(sum), which contains the sum of all the values of the variable specified
after the summarize command. If you want to see this value you need the display command. Other
statistics can be seen by typing return list.)
generate u2_=(sav-_b[_cons]-_b[inc]*inc)^2
summarize u2_
display r(sum)
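The same sum of squared residuals is also stored by regress as e(rss), so it can serve as a cross-check. The sketch below assumes the savings data are in memory; u_chk and u_chk2 are illustrative names, chosen so that they do not clash with the variables created above. Storing them as double keeps rounding error small.

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
quietly regress sav inc
predict double u_chk, resid
generate double u_chk2 = u_chk^2
quietly summarize u_chk2
display "sum of squared residuals: " %15.0f r(sum)
display "e(rss):                   " %15.0f e(rss)
```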
How do we know that this value is the minimum? You could create another variable using
other values for the betas:
generate u2_alt=(sav-2-1*inc)^2
summarize u2_alt
display r(sum)
[3] Alternatively, you could have created this variable by typing the instruction:
gen u=sav - _b[_cons] - _b[inc]*inc
You should get a larger sum, since 2 and 1 are not the values that minimize the sum of
squared residuals; that is, βˆ0 and βˆ1 are in general not equal to 2 and 1.
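The same idea can be tried with values close to the OLS estimates. This sketch assumes the savings data are in memory; the shifts of ±0.05 in the slope and the variable name sq_try are arbitrary choices for illustration. The sum of squares should be smallest when the shift is 0.

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
quietly regress sav inc

* Recompute the sum of squares with the slope shifted away from the OLS estimate
foreach d of numlist -0.05 0 0.05 {
    capture drop sq_try
    quietly generate double sq_try = (sav - _b[_cons] - (_b[inc] + `d')*inc)^2
    quietly summarize sq_try
    display "slope shifted by " %5.2f `d' ":  sum of squares = " %15.0f r(sum)
}
drop sq_try
```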
You also know that OLS has other properties. Try to prove the following:
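1) $\sum_{i=1}^{n} \hat{u}_i = \sum_{i=1}^{n} \left( sav_i - \hat{\beta}_0 - \hat{\beta}_1 \cdot inc_i \right) = 0$
2) $\sum_{i=1}^{n} inc_i \cdot \hat{u}_i = 0$
3) $\sum_{i=1}^{n} \widehat{sav}_i \cdot \hat{u}_i = 0$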
Solution:
1) Try this:
summarize u
display r(sum)
2) Try this:
gen inc_u=inc*u
summarize inc_u
display r(sum)
3) Try this:
predict sav_hat
gen sav_hat_u=sav_hat*u
summarize sav_hat_u
display r(sum)
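Property 3 is not an independent fact: it follows from properties 1 and 2, because the predicted value is a linear function of inc. A short derivation, using the same notation as above:

```latex
\sum_{i=1}^{n} \widehat{sav}_i \, \hat{u}_i
  = \sum_{i=1}^{n} \left( \hat{\beta}_0 + \hat{\beta}_1 \cdot inc_i \right) \hat{u}_i
  = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} inc_i \cdot \hat{u}_i
  = \hat{\beta}_0 \cdot 0 + \hat{\beta}_1 \cdot 0 = 0
```

In practice the three sums displayed by STATA will be very small numbers rather than exact zeros, because of rounding in the stored variables.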
ii. Graphs
Histograms. The first graph you can use is a histogram. It shows what the distribution
of a given variable looks like.
histogram sav
[Figure: histogram of sav; vertical axis: Density]
What can you infer about it? First, note that there seem to be outliers in the right tail. In
other words, savings have a positively skewed distribution. Why?
Another way to see that is to superimpose a normal density function on the histogram.
With the normal curve overlaid it is clearer that values above $20000 would be very
unlikely if savings were normally distributed.
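One way to draw such an overlay is sketched below, assuming the savings data are in memory and that the normal option of histogram is what is intended here. The option draws a normal density with the same mean and standard deviation as the variable.

```stata
* Assumes the tutorial's savings data (variable sav) are in memory.
* The normal option overlays a normal density on the histogram.
histogram sav, normal
```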
In fact, we can very easily get the skewness and kurtosis. Type
summarize sav, detail
annual savings, $
-------------------------------------------------------------
Percentiles Smallest
1% -4230 -5577
5% -1255 -2883
10% -287 -2749 Obs 100
25% 189 -1389 Sum of Wgt. 100
Note that skewness is positive (as expected; also note that the median, 982, is smaller
than the mean, 1582.51). Also, kurtosis is bigger than 3 (actually around 30), which is
strong evidence that savings are not normally distributed.
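These statistics are also stored after summarize with the detail option, so they can be displayed or reused. The sketch below assumes the savings data are in memory. The final command, sktest, is an addition that does not appear in this handout: it is STATA's skewness and kurtosis test for normality, included here as one possible formal check.

```stata
* Assumes the tutorial's savings data (variable sav) are in memory.
quietly summarize sav, detail
display "skewness: " r(skewness) "   kurtosis: " r(kurtosis)
display "median:   " r(p50)      "   mean:     " r(mean)

* Formal test of normality based on skewness and kurtosis
sktest sav
```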
Scatter plots and lines. Now suppose that you want to visually inspect the relation
between sav and inc. Then type
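A minimal sketch of such a plot, assuming the savings data are in memory and the simplest form of the scatter command (the first variable goes on the vertical axis, the second on the horizontal axis). The second command is an optional variation that adds a fitted line with lfit; neither is necessarily the exact command behind the original figure.

```stata
* Assumes the tutorial's savings data (variables sav and inc) are in memory.
* Basic scatter plot: sav on the vertical axis, inc on the horizontal axis
scatter sav inc

* Optional variation: the same scatter with a fitted regression line
twoway (scatter sav inc) (lfit sav inc)
```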
What did the regression do? It predicted some values, and we can see them in this way.
Now we are not creating the residuals, but instead sav_hat = βˆ0 + βˆ1 * inc, that is, the
predicted value [4].
Now type
sort inc
(This sorts, i.e. orders, the data depending on their income values. If you want to learn
more about this type help sort or help gsort. The latter allows you to order in ascending
or descending order.)
scatter sav inc || line sav_hat inc, title("Income and savings") t2("Actual values and linear prediction")
(Type the whole instruction above on ONE line, and use straight quotes.)
If you want to know which other options you have about graphs, type:
help graph
[4] Alternatively, you could have created this variable by typing the instruction:
gen sav_hat=_b[_cons] + _b[inc]*inc
iii. Multiple regression
In general, in applied econometric analysis you will use multiple regression, that is, regression
involving more than one explanatory variable. For instance, younger people may save in a
different manner from older people, even if they have the same level of income. If we want to
model this hypothesis we would be interested in the following model:
$$sav = \beta_0 + \beta_1 \cdot inc + \beta_2 \cdot age + u$$
What do you think the sign of β2 is? The life-cycle theory predicts that young people save for
retirement. Therefore we expect this coefficient to be negative. However, older individuals
also tend to have better jobs, which could give us the opposite sign.
In STATA, multiple regression works in the same way as simple regression. Type
------------------------------------------------------------------------------
sav | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
inc | .1555624 .0596697 2.61 0.011 .0371346 .2739903
age | -26.72823 45.03278 -0.59 0.554 -116.1058 62.64938
_cons | 1072.28 1726.415 0.62 0.536 -2354.176 4498.736
------------------------------------------------------------------------------
Note that the estimate of β1 is similar to that obtained in the simple regression model. The
coefficient on age is negative, but not statistically significant (p-value 0.554): once income is
held fixed, we cannot reject the hypothesis that age has no effect on savings in this sample.
It is a good idea to spend some time interpreting those coefficients. When you have a multiple
regression model, you have to interpret the effect of one variable conditional on the remaining
explanatory variables. For instance, β1=0.155 means that increasing income by one dollar (keeping
age fixed) would increase savings by about 15 cents on average. This idea is similar to the
expression ceteris paribus that you often use in Economics.
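A sketch of how this interpretation can be turned into numbers, assuming the savings data are in memory and that the table above comes from regressing sav on inc and age. The 1000-dollar increase is an arbitrary illustration. With the coefficient reported above (.1555624), it corresponds to about 155.6 dollars of additional savings, holding age fixed; lincom adds a standard error and confidence interval for that quantity.

```stata
* Assumes the tutorial's savings data (variables sav, inc and age) are in memory.
regress sav inc age

* Predicted change in savings for 1000 dollars of extra income, age held fixed
display "effect of 1000 dollars more income: " 1000*_b[inc]

* The same linear combination with its standard error and confidence interval
lincom 1000*inc
```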
In other words, to get β1=0.155, you are taking out the influence of age on both savings and
income. The following example might help you. I can get β1=0.155 in the following way, just
using simple regressions:
reg sav age
predict u_sav, resid
reg inc age
predict u_inc, resid
reg u_sav u_inc
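The first two instructions run a simple regression of savings on age and construct the
residuals. Note that these residuals are, by construction, uncorrelated with age (Why? Remember
that in a simple regression model $\sum_{i=1}^{n} \hat{u}_i \cdot X_i = 0$). Instructions 3 and 4
do the same but with income and age. The last instruction regresses the savings residuals on the
income residuals; its slope coincides with the estimate of β1 in the multiple regression.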
Note: Check the t-statistic and the p-value in the last regression and compare them with those
of inc in the initial multiple regression. They differ slightly because the last regression does
not use the correct degrees of freedom (it uses N-2=98 instead of N-3=97).
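The comparison described in this note can be sketched as follows, assuming the savings data are in memory. The names b_full, se_full, r_sav and r_inc are illustrative, chosen so that they do not clash with u_sav and u_inc above. Both regressions have the same sum of squared residuals, so rescaling the last standard error by sqrt(98/97) should recover the one from the multiple regression.

```stata
* Assumes the tutorial's savings data (variables sav, inc and age) are in memory.
quietly regress sav inc age
scalar b_full  = _b[inc]
scalar se_full = _se[inc]

* Partial out age from savings and from income, then regress residual on residual
quietly regress sav age
predict double r_sav, resid
quietly regress inc age
predict double r_inc, resid
quietly regress r_sav r_inc

* Same slope; the standard error matches after correcting the degrees of freedom
display "slope: " b_full "   " _b[r_inc]
display "se:    " se_full "   " _se[r_inc]*sqrt(e(df_r)/(e(df_r) - 1))
```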
Go back to the multiple regression model. The other parts of the regression output are
interpreted as before.
The t-test is specific for each coefficient. For instance, the t-test for the significance of income can
be obtained by
test inc
( 1) inc = 0
F( 1, 97) = 6.80
Prob > F = 0.0106
The analogous test for the significance of age gives:
( 1) age = 0
F( 1, 97) = 0.35
Prob > F = 0.5542
The F-test (joint test for the significance of income and age) can be obtained similarly as
( 1) inc = 0
( 2) age = 0
F( 2, 97) = 3.40
Prob > F = 0.0374
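The tests for age and for the joint significance of income and age can be requested as sketched below, assuming the savings data are in memory. Listing the variable names after test is one of several equivalent ways of writing these hypotheses and is not necessarily the form used to produce the output above. The joint F-statistic should coincide with the one stored by regress in e(F).

```stata
* Assumes the tutorial's savings data (variables sav, inc and age) are in memory.
quietly regress sav inc age

* Single-coefficient test for age (the F version of its t-test)
test age

* Joint test that the coefficients on inc and age are both zero
test inc age
display "F from test: " r(F) "   e(F) from regress: " e(F)
```

As in the simple regression, each single-coefficient F is the square of the corresponding t-statistic in the regression table.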