
STATA Tutorial


Gabriel Montes-Rojas/Victoria Serra-Sastre


Department of Economics
City University London

This handout provides the basic information you need to get started with STATA, a widely used
statistical and econometric software package (and quite possibly the one you will need most in
your professional life). There is no official step-by-step guide for learning STATA, so the best
way to become familiar with it is learning by doing. The software does, however, come with a
manual, which is useful because it covers both the technical details and applications. Other
helpful sources are available on the internet. For more information visit the following websites:

STATA website
http://www.stata.com/
http://www.stata.com/links/
http://www.stata.com/support/faqs/
http://www.stata.com/capabilities/session.html

Other websites
http://www.ats.ucla.edu/stat/stata/
http://data.princeton.edu/stata/
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/

Table of contents

1. HOW TO ACCESS THE STATA SOFTWARE
2. WHAT THE DIFFERENT WINDOWS IN STATA MEAN
3. HELPFUL STATA BUTTONS
4. HOW TO GET HELP IN STATA
5. LOG-FILES AND DO-FILES
6. HOW TO OPEN A DATASET
7. MERGING AND APPENDING DATASETS
8. INITIAL EXPLORATION OF THE DATA: BASIC STATISTICS
9. UNDERSTANDING THE COMMAND LINE
10. CREATING AND MANAGING VARIABLES
11. OTHER USEFUL COMMANDS AND TIPS
12. EXITING STATA
13. EXAMPLE: USING STATA FOR REGRESSION ANALYSIS


INTRODUCTION TO STATA

1. HOW TO ACCESS THE STATA SOFTWARE


• On one of the school computers, click on Start → Programs → S to T
• Scroll down and click on StataSE 10

The main STATA window then opens. The next few pages explain what the parts of the window and
the icons mean.


2. WHAT THE DIFFERENT WINDOWS IN STATA MEAN

Results Window: All of the results and any error messages will be displayed here. It shows you
the commands that you have entered and the results associated with these commands.

Command Window: If you type in a command here and hit enter, the command and the results will
show up in the Results Window. Unless otherwise noted, each line corresponds to a different
instruction.


Variable Window: Shows all of the variables in your dataset.

Review Window: Any command that you have entered during the session will be available here.
Instead of retyping the command in the “command window”, you can click on it in the “review
window” and the same command will then appear in the “command window”.

Note: Sometimes only the first two windows appear. To make the other two visible, go to
Window and select Review and Variables. Alternatively, go to Edit, click on Preferences, Load
Preferences Set and click on Factory Settings.

Hint: You can order the variables in the Variable Window alphabetically by typing aorder in the
Command Window.


3. HELPFUL STATA BUTTONS

[Figure: STATA toolbar with the Log File, Do-File, STATA Data Editor, STATA Data Browser, Go and
Stop buttons labelled]

The following are brief explanations of the buttons indicated above:


- Log File: Allows you to begin, append, or overwrite a log file.
- Do-file: Allows you to save commands you may need to run many times.


- STATA Data Editor: Allows you to view and manually make any changes to your dataset (for
instance, changing the variable content by typing in a different value for a particular
observation). When you open the data editor, the observations are numbered on the left-
hand side and the variable names are listed on the top. Try clicking on the variable name to
see what information is available for that particular variable.

- STATA Data Browser: Allows you to see the data without making any changes

- Go Button: By default, if you type in a command whose output doesn’t fit in the “results
window”, STATA displays part of the results and shows --more-- (in blue) at the bottom of the
results screen. To see the rest of the results, click on the “go button” or click on the blue
“--more--” at the bottom of the results window.

To turn this option off so that STATA simply scrolls through all of your results at once, type
set more off, or
set more off, permanently if you would like STATA to apply this setting automatically in every
session.

[Figure: Results window with the Go button and the blue --more-- prompt; clicking on either one
shows the rest of the results]

- Stop Button: Sometimes you type in a command and you realize that something is wrong
with the command. To stop STATA from continuing to estimate or display the results,
click on this button.


4. HOW TO GET HELP IN STATA

Whenever you do not know how a command works, you can look it up in the built-in STATA help.
This is very useful since some commands have special features and will not work unless you use
them in the right way. Go to Help, where you have three options for getting help:

1. If you click on Contents you can either type in a keyword or command name or you can
search through the different categories in the STATA index of commands,
2. Clicking on Search you will be able to type in any keyword or command to get information on
it, or
3. Clicking on Command allows you to type in any command to obtain information about it.

Alternatively, in the Command window you can type help followed by the name of the
command you have doubts about.

Example:
Click on Help, Contents, and then type in regress in the Contents window to see the help file for
this regression command
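
The same help can also be requested from the Command window. A minimal sketch: regress is the
command from the example above, and search is the STATA command that looks up a keyword rather
than an exact command name.

help regress
search regression

The first line opens the help file for regress; the second lists the help entries related to the
keyword regression, which is useful when you do not know the name of the command you need.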


5. LOG-FILES AND DO-FILES

LOG-FILES

A log file saves all the commands and results (including error messages) from your STATA session
in a file that you can later access. It is good practice to open a log file at each STATA session to
track all of your work and save yourself from duplication.

To start a new log file, you have two options:


1. Click on the Log File icon (see above), give the file a name (nameoflogfile.smcl), and save it
to the folder of your choice, OR
2. Go to File, Log, Begin and proceed as in option (1)

In using either of the options, a window will pop up giving you the option to select the location
where you want to save the log file.


Alternatively, you can type: log using nameoflogfile.smcl, replace


Once it is open, everything you type, together with the results, will be stored in the file
nameoflogfile.smcl. An indicator at the bottom right of the Results window shows that a log file
is open.

After you have opened a log file, clicking on the Log File icon gives you a number of options:

1. View snapshot of log file: This opens a window that shows all of the commands and
results that you have typed in during that particular log session

2. Close log file: Completely ends the log session and commands typed after this will not
appear in the log file. If you want to close the log file you can also type: log close

3. Suspend log file: Allows you to temporarily stop commands and results from being saved
in the log file. You can resume the log file by clicking on the “log icon” and clicking
“resume suspended log file”

A log file has an .smcl extension. To view a log file that you created previously, either:

1. Click on the Log File icon, open the log file by clicking on the Save button (if it is not already
open), and click on View existing file (read only), OR

2. Go to File, Log, View and then click on Browse to find the saved log file.

Hint: You have 3 options with log files:

1. Simply view a previously created file

2. Append the results of the new session you are opening to this already saved log file, or

3. Overwrite the saved log file (WARNING: this will completely delete any commands and
results you had saved in this log file).
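
The log-file operations described above also have command equivalents, which are convenient at
the top and bottom of a do-file. A minimal sketch, in which mysession.smcl is a hypothetical file
name and the lines starting with * are comments that STATA ignores:

* open a log file, overwriting any existing file with the same name
* (use the option append instead of replace to add to an existing log)
log using mysession.smcl, replace
* suspend the log: commands typed from here on are not recorded
log off
* resume the suspended log
log on
* close the log at the end of the session
log close

Choosing replace or append here corresponds to options 3 and 2 of the hint above.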

DO-FILES

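We advise you to always use a do-file. Do-files record a series of commands you may need to
run repeatedly in different STATA sessions. To open a do-file, click on the Do-file icon.

Below is an example of a do-file:

* start with an empty memory and move to the working directory
clear
cd "c:\YourDirectory"
* allocate 10 megabytes of memory to the data area
set mem 10m
* load the dataset and explore it
use saving.dta, clear
summarize
tab black, summ(educ)
* create a dummy variable equal to 1 if age is above 40
gen old=0
replace old=1 if age>40
tab old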

You can save your do-file in the same way you save any other document (a Word file, for
example). If you want to open and run an existing do-file, there are two ways:

1. In the Do-file Editor, go to File, then Open, and select the file. To run it, go to Tools and
then Do.

2. From the Command window, type: do nameofdofile.do

Do-files become especially useful once your instructions grow much more complicated than just
a few lines.

Do-files allow you to program in STATA. As you advance in your research you will need to make
your own programs to satisfy your special needs.

If you want to insert comments in a do-file, that is, notes to yourself that remind you of
specific details about what you are doing, start the line with an *. STATA ignores everything on
a line that begins with an *. For instance, the line in a do-file

* This is a STATA tutorial.

will not have any effect on what STATA does.
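
Besides the *, do-files accept two other comment styles: // for a comment that runs to the end
of a line, and /* ... */ for a comment that can span several lines. A short sketch, using
variables from the do-file example above:

* a whole-line comment
summarize educ    // a comment at the end of a command line
/* a comment that
   spans several lines */
tab old

Note that the // and /* */ styles are meant for do-files; in the Command window only the * style
can be used.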


6. HOW TO OPEN A DATASET


- When the dataset is in Excel format:

o Open the dataset filename.xls in Excel
o Copy all columns
o Paste them in the Data Editor

If you want to save the dataset you have opened in STATA, type in:
save nameofdatabase.dta, replace

Why replace? If a file with that name already exists, STATA will not overwrite it unless you use
this option.

- When the dataset is in STATA format, the extension for the dataset is .dta, for example,
filename.dta. There are different ways to open a dataset:

o Click on the Open File icon, or
o Click on File, Open, or
o Set the directory where you have the file saved: in the Command window type
cd c:\Directory1\Directory2 (whatever the location of the file)

and then type
use nameofdatabase.dta, clear
Why clear? Because if there is another dataset open in STATA with unsaved changes, the program
will not allow you to open a new one on top of it. The clear option discards the data currently
in memory first.

Open the dataset SMOKE.dta (you will find it in the T drive). The dataset contains the
variables listed in the table below. Remember to first specify the directory from which
you will be working, cd c:\Directory. Then type use SMOKE.dta, clear

Variable Description
educ years of schooling
cigpric state cigarette price, cents per pack
white =1 if white
age in years
income annual income, $
cigs Cigarettes smoked per day
restaurn =1 if state restaurant smoking restrictions
lincome log(income)
agesq age^2
lcigpric log(cigprice)
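
Putting the steps of this section together, a typical opening sequence could look like the
sketch below. The directory is a placeholder, and SMOKE_copy.dta is a hypothetical file name,
chosen so that the original file is not overwritten:

* move to the folder that contains the data
cd "c:\Directory"
* load the dataset, discarding whatever is in memory
use SMOKE.dta, clear
* check that the variables in the table above are there
describe
* save a working copy under a different name
save SMOKE_copy.dta, replace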


When the dataset opens successfully, the variables that exist in this dataset will be listed in the
Variable Window.

Hint:
If you try to open a dataset, and you get an error message like this:
no room to add more observations
An attempt was made to increase the number of observations beyond what is currently possible. You have the
following alternatives:

1. Store your variables more efficiently; see help compress. (Think of STATA's data area as the area
of a rectangle; STATA can trade off width and length.)

2. Drop some variables or observations; see help drop.

3. Increase the amount of memory allocated to the data area using the set memory command; see
help memory.
r(901);
It means that you need to increase the size of the memory in STATA. To do so, type in the
command window:
set mem 200m
OR
set memory 200m
You can choose among a number of memory sizes; 200 megabytes is just the value chosen for this
example. See the help file for more information.

Note: So far you need to recognize 3 different STATA files:

- Dataset files have a .dta extension. The datasets contain all the variables you use for
statistical/econometric analysis.

- Do-files are the programming files you need to run specific instructions and they can be
identified by the .do extension.

- Finally you may find the .smcl files that refer to log files and are the printable files that
store your output results.

7. MERGING AND APPENDING DATASETS

Merge data files: assume that you want to merge your current dataset (called the “master”
dataset) with another data file (called the “using” dataset) that includes, for each observation
identifier, additional variables that are of interest for the analysis. The two separate datasets
are shown below in tables 1 and 2. For analysis purposes you would like to have a single dataset
in which var1, var2, var3 and var4 are listed together, as shown in table 3.

Table 1: Master dataset        Table 2: Using dataset
id    var1    var2             id    var3    var4
 1      10      50              1       1       .
 2      20      40              2       0       4
 3      30      30              3       1       5
 4      40      20              4       0       .
 5      50      10              5       0       3

Table 3
id    var1    var2    var3    var4
 1      10      50       1       .
 2      20      40       0       4
 3      30      30       1       5
 4      40      20       0       .
 5      50      10       0       3

As an exercise, find the dataset SMOKE_merge1.dta in the T drive, and merge it with the
dataset called SMOKE_merge2.dta. First, open both datasets and identify which variables
appear in both files. This is the first step in identifying the variable(s) that link the two
datasets (in our case this variable is id). Look up the command merge in Help, and then type in
the following:

merge id using SMOKE_merge2.dta

Did you get an error message?

Note: before merging you will need to sort each dataset by the variable that acts as the
link between the master and using datasets.

STATA will merge the two datasets and will generate an additional variable called
“_merge”. Find out what the values of this variable indicate (search in Help).
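
Because of the sorting requirement, the full sequence takes a few more lines than the single
merge command. A sketch of one way to do it with the syntax used in this handout (STATA 10); it
assumes that both files are in the current directory and that you are allowed to overwrite
SMOKE_merge2.dta with its sorted version:

* sort the using dataset by the link variable and save it
use SMOKE_merge2.dta, clear
sort id
save SMOKE_merge2.dta, replace
* sort the master dataset and merge
use SMOKE_merge1.dta, clear
sort id
merge id using SMOKE_merge2.dta
* check how the observations were matched
tabulate _merge

In STATA 11 and later the syntax changed: there you would type merge 1:1 id using
SMOKE_merge2.dta (assuming id identifies each observation uniquely in both files), and sorting
beforehand is no longer required.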

Append data files: you may need to put together two datasets that contain the same variables
but correspond to different groups of observations. Table 4 contains five observations with
information on var1 and var2, whereas table 5 contains additional observations with information
on the same variables. The objective is to join them vertically such that the resulting dataset
looks like table 6.

Table 4                        Table 5
id    var1    var2             id    var1    var2
 1      10      50              6      60      50
 2      20      40              7      70      40
 3      30      30              8      80      30
 4      40      20              9      90      20
 5      50      10             10     100      10

Table 6
id    var1    var2
 1      10      50
 2      20      40
 3      30      30
 4      40      20
 5      50      10
 6      60      50
 7      70      40
 8      80      30
 9      90      20
10     100      10

Example: Use SMOKE_append1.dta and SMOKE_append2.dta. Both datasets have the same variables
but different observations. Suppose you would like to append them in order to analyse the whole
data. The corresponding command is append. Searching in Help for this command will give you more
hints on how it works. With SMOKE_append1.dta open, type in:

append using SMOKE_append2.dta

Note that, as opposed to merging, the datasets do not need to be sorted.
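
A sketch of the complete sequence, assuming both files are in the current directory;
SMOKE_all.dta is a hypothetical name for the combined file:

* open the first dataset and check how many observations it has
use SMOKE_append1.dta, clear
count
* add the observations of the second dataset underneath
append using SMOKE_append2.dta
* the number of observations should now be the sum of the two files
count
save SMOKE_all.dta, replace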


8. INITIAL EXPLORATION OF THE DATA: BASIC STATISTICS

Once you have opened a dataset, each of its variables appears, with its name, in the Variables
window. The first thing you can do is to see what the data look like.

1. If you go to Window, and then Data Editor, you can look at the data as a list (as in Excel).

2. Otherwise you can type:

a. describe varlist
(it lists the basic information about the variables: storage type, display format and labels)
or
summarize varlist
(it gives you the basic statistics for the variables, i.e. mean, std, min, max)
or
codebook varlist
(it describes the contents of each variable in more detail)

In general, if you see varlist written after the name of the command, this means that
you can replace varlist with one or more variables of your choice from the dataset that
you have open; if you leave it out, the commands above apply to all the variables. For
instance, using the data SMOKE.dta if you type:
use SMOKE.dta, clear
summarize educ income
or
summarize educ income, detail
the summary statistics displayed will refer only to the variables educ and income.

b. list varlist
(it lists the values of the variables, observation by observation)
c. If you have a variable that is discrete, e.g. years of schooling, gender, etc., you could
type:
tabulate varlist

Alternatively, you can obtain data descriptions and statistics by clicking on the Data and
Statistics menus in the tool bar at the top left of the STATA window. For instance, if you want
a description of a varlist, click on Data → Describe Data → Describe Data in Memory. You
have the option to choose the variables to be described. The output is exactly the same as the
one you obtain by typing the command describe. If you would like to obtain some summary
statistics, click on Statistics → Summary, tables and tests → Summary and descriptive statistics →
Summary statistics and the output will be identical to using the command summarize.

Another example, using US savings data, will show in detail how to use these commands. This
dataset contains information about individuals’ savings, income, size of the household, etc. The
dataset can be obtained by typing in the Command window:

use http://fmwww.bc.edu/ec-p/data/wooldridge/SAVING, clear


The variables in this dataset are:

Variable    Description
sav         Household annual savings in dollars.
inc         Household annual income in dollars.
size        Number of members in the household.
educ        Years of schooling of the household head.
age         Age of the household head.
black       Dummy variable (0 or 1) for whether the household head declares himself to be
            African American.
cons        Household annual consumption in dollars.

If you type describe you will obtain the following STATA output:

Contains data from http://fmwww.bc.edu/ec-p/data/wooldridge/SAVING.dta
  obs:           100
 vars:             7                          31 Jan 2006 11:46
 size:         1,400 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
sav             int    %9.0g                  annual savings, $
inc             int    %9.0g                  annual income, $
size            byte   %9.0g                  family size
educ            byte   %9.0g                  years educ, household head
age             byte   %9.0g                  age of household head
black           byte   %9.0g                  =1 if household head is black
cons            int    %9.0g                  annual consumption, $
-------------------------------------------------------------------------------
Sorted by:

If you type summarize, the output looks like this:
(The actual command is summarize, but you can just type summ and it will be recognized.)

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sav |       100     1582.51    3284.902      -5577      25405
         inc |       100     9941.24    5583.998        750      32080
        size |       100        4.35    1.493251          2         10
        educ |       100       11.58    3.435348          2         20
         age |       100       38.77    7.398956         26         54
       black |       100         .07    .2564324          0          1
        cons |       100     8358.73    5729.535     -13055      30280

From the table above, you notice that it contains information for 100 households. You also note
that the mean level of savings is 1582.51 dollars, and that some households actually have
negative savings (the minimum is -5577; what does that mean?). Look at the black variable. What
does a mean of 0.07 mean? It means that 7% of the household heads in the sample declare
themselves to be African American. You recognize that it is a dummy variable because min=0 and
max=1.


Additional information can be obtained for discrete variables. For instance, educ contains only
integer values, from 2 to 20. If you want to see the number of observations for each value of
education, type in

tabulate educ

years educ, |
household |
head | Freq. Percent Cum.
------------+-----------------------------------
2 | 1 1.00 1.00
3 | 1 1.00 2.00
4 | 1 1.00 3.00
6 | 1 1.00 4.00
7 | 4 4.00 8.00
8 | 10 10.00 18.00
9 | 11 11.00 29.00
10 | 9 9.00 38.00
11 | 4 4.00 42.00
12 | 32 32.00 74.00
13 | 2 2.00 76.00
14 | 4 4.00 80.00
15 | 1 1.00 81.00
16 | 9 9.00 90.00
17 | 6 6.00 96.00
18 | 2 2.00 98.00
19 | 1 1.00 99.00
20 | 1 1.00 100.00
------------+-----------------------------------
Total | 100 100.00

In all the cases above the command can be shortened: try typing des, sum or tab. STATA will
recognize the commands as describe, summarize and tabulate, respectively.

The “by” expression

It allows you to “run command on subsets of data” (STATA help). It is extremely useful when you
want to compute statistics or generate variables separately for specific groups of observations.
For instance, suppose you are interested in whether there are differences in education levels by
race (race and gender are very interesting topics in Applied Econometrics) and you want to
summarize the variable educ separately for each value of black. Type in

by black: summarize educ

Which problem do you encounter? STATA reports that the data are not sorted: you need to sort the
data by the grouping variable first. There are 2 ways:

o type in sort black and then run the command above, or
o type bysort black: summarize educ


The STATA output obtained is as follows:

bys black: summarize educ

-> black = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        educ |        93    11.80645    3.343579          3         20

-----------------------------------------------------------------------
-> black = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        educ |         7    8.571429    3.457222          2         12

This is a very useful instruction. First, you specify which variable will be used to separate the
sample into categories (there should not be too many; black has only 2). Second, for each
category, you can obtain information about other variables. Check for yourselves.
Alternatively, you can also type in

tabulate black, summ(educ)

=1 if |
household | Summary of years educ, household
head is | head
black | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 11.806452 3.3435794 93
1 | 8.5714286 3.4572216 7
------------+------------------------------------
Total | 11.58 3.435348 100

We can see the following:

1. The average education level is 11.58 (years of schooling).
2. Household heads who are black have on average 8.57 years of schooling, compared with 11.81
for those who are not.
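
The by prefix is not limited to summarize; it can also be used to generate variables group by
group. A minimal sketch with the SAVING data, where mean_educ and educ_dev are hypothetical
variable names:

* average education within each value of black, stored in every observation
bysort black: egen mean_educ = mean(educ)
* deviation of each household head from the average of his or her group
generate educ_dev = educ - mean_educ
bysort black: summarize educ_dev

Because educ_dev is measured relative to the group average, its mean should be (approximately)
zero in both groups.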

So far we have seen examples of datasets that have numeric variables. However, apart from
numeric variables there are also string variables. Numeric variables contain numbers and are
typically displayed in the Data Editor/Browser in black (an example is the variable age in the
figure below). String variables contain letters and usually represent some kind of qualitative
information. String variables are shown in red in the Data Editor/Browser (for instance, the
variable gender shown in the figure below). Summary and descriptive statistics are easily
obtained for numeric variables using the commands above. For string variables, however, the
summarize command provides no information: STATA will not calculate any statistics, as the data
within the variable cannot be used to compute numerical descriptions.


[Figure: Data Editor showing a numeric variable (age, in black) and a string variable (gender,
in red)]

To see the difference between numeric and string variables use the dataset WAGEM.dta. This
dataset contains information on wages, age and other work-related variables. The variables
are listed below:

Variable Description
earns total earnings
age in years
educ years of schooling
gender male, female
marr married, not married
union =1 if belong to union
exper age - educ - 6
yrsmarr years married

Type in use WAGEM.dta, clear. Then type sum; no statistics will be computed for the string
variables. However, you can obtain a table of frequencies for these variables by typing tab
gender or tab marital_status. Similarly, if you type codebook gender marital_status, you will
only obtain frequencies but no statistics.

9. UNDERSTANDING THE COMMAND LINE

When you examine a command in the help file, STATA has a set way of displaying the syntax of
the command. For example, if you search for regress in the STATA help file, you will see:

regress depvar [indepvars] [if] [in] [weight] [, options]

This is just a brief overview of how to interpret this line. For more detailed information, refer to
the STATA reference manual.

Everything that is not in brackets is compulsory for that command. For instance, if you only typed
regress
you would get the error message:

last estimates not found
r(301);

because you did not include depvar.

• depvar represents the dependent variable in your regression
• indepvars represents the list of independent variables in your regression
• if and in allow you to choose a particular subset of the data on which you want your
command to run
• weight allows you to weight observations in a particular way
• , options allows you to make your command more specific. We will see the different choices
for options when we get into more complex issues like bootstrapping.
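
To make the syntax diagram concrete, here are a few ways of filling it in. This is only a sketch
using the SAVING data introduced earlier; the choice of variables is illustrative, not a
recommended model:

* depvar and one indepvar
regress sav inc
* several indepvars, restricted with if to household heads older than 40
regress sav inc size educ if age>40
* restricted with in to the first 50 observations
regress sav inc in 1/50
* an option after the comma: heteroskedasticity-robust standard errors
regress sav inc, robust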

The “if” expression

The “if” expression allows you to narrow down the subsample to which you want the command
to apply. The if instruction accepts logical statements. For instance, using again the dataset
SAVING.dta, if you type summarize inc if age>40, STATA offers descriptive information about
the income variable for the subsample of observations for which age>40.

After an if you can use the following operators:

                                                  Relational
     Arithmetic                 Logical           (numeric and string)
--------------------       ------------------     ---------------------
 +   addition               ~   not                >    greater than
 -   subtraction            !   not                <    less than
 *   multiplication         |   or                 >=   > or equal
 /   division               &   and                <=   < or equal
 ^   power                                         ==   equal
                                                   ~=   not equal
 +   string concatenation                          !=   not equal

Note that a double equal sign (==) is used for equality testing.


Some examples:

summarize inc if age>40
(summarize income for all those individuals aged above 40)

summarize inc if ~(age>40)
(summarize income for all those individuals aged 40 and below)
This is equivalent to summarize inc if age<=40

summarize inc if black==1
(summarize income for those individuals with an African American household head)

summarize inc if black!=1
(summarize income for observations for which the household head is not African American)

summarize inc if age>40 | black==1
The operator | is equivalent to the UNION in mathematics.
(summarize income for individuals aged above 40 or with an African American household head)

summarize inc if age>40 & black==1
The operator & is equivalent to the INTERSECTION in mathematics.
(summarize income for individuals aged above 40 and with an African American household head)
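
One point to keep in mind when writing if conditions: STATA treats a numeric missing value (.)
as larger than any number, so a condition such as age>40 is also true for observations where age
is missing. The SAVING data have no missing values in age, so the following sketch only
illustrates the idea:

* counts observations with age above 40, including any with missing age
count if age>40
* excludes observations with missing age
count if age>40 & age<.

With the SAVING data both commands should report the same number.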


10. CREATING AND MANAGING VARIABLES

• generate: This command creates a new variable. Note that you cannot create the same variable
twice (if you try, STATA will tell you that the variable is already defined). The corresponding
command is
generate newvar = exp

• replace: If you want to modify an existing variable, you have to use this command.

• rename: you may want to change the names of variables at your convenience (for example, so
that the name alone describes the content).

As an example, let’s use the SAVING dataset we opened before. Suppose we want to create two
subgroups: one for young people (40 years old or less) and another for old people (more
than 40 years old). Then you can type:

generate old=0
replace old=1 if age>40

(The actual command is generate, but you can just type the abbreviation gen and it will be
recognized.) [1]

You have now generated another dummy variable, called old. What does it look like?

tabulate old

old | Freq. Percent Cum.


------------+-----------------------------------
0 | 61 61.00 61.00
1 | 39 39.00 100.00
------------+-----------------------------------
Total | 100 100.00
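
The same dummy can also be created in a single line, because a logical expression in STATA
evaluates to 1 when it is true and 0 when it is false. A sketch, where old2 is a hypothetical
variable name:

* 1 if age is above 40, 0 otherwise; left missing if age is missing
generate old2 = age>40 if age<.
* every observation should fall on the diagonal of this table
tabulate old old2

The if age<. part matters only when age has missing values; without it, observations with
missing age would be coded as 1.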

If you want to change the name of that variable, e.g. you want it to be “Old” instead of “old”,
then type

rename old Old

Now there is no variable named old, but one named Old (check in the Variables window). [2]

[1] Suppose you insist on generating the same variable again. If you type generate old=0, the
following error message will appear:
old already defined
r(110);


Functions

Sometimes you need to create variables using functions of existing variables. Some
examples are the following:

gen newvar1=1 + educ^2 + age*2
(Here you are creating a new variable, where each observation contains 1 plus the square of the
corresponding element in educ and twice the age. Although both educ and age can be seen as
vectors, STATA allows you to work as if you were modifying element by element.)

gen newvar2=ln(educ)
(natural logarithm: here you are creating a new variable, where each observation contains the
logarithm of the value in educ. If educ contains a non-positive value, a missing value will
appear in the new variable for that observation.)

gen newvar3=exp(educ)
(exponential function: each observation contains e^educ)

Type help functions to obtain a list of all the available mathematical and statistical functions.

Missing values

Missing values are displayed in the data in different ways. In a numeric variable they are
represented by a dot (.); in a string variable they appear as a blank value. When you refer to a
missing value of a string variable in a command, you need to type it in quotation marks ("").

If you try to divide by zero, or apply the logarithm to a non-positive number, STATA will replace
the result with a missing value. You will recognize it by a dot (.). In the examples above, if
any of the variables used to compute the new variable contains a missing value, the new variable
will be assigned a missing value for that observation.
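
The SAVING data offer a quick way to see this rule at work, since some households have negative
savings. A sketch, where lsav is a hypothetical variable name:

* the logarithm is undefined for sav<=0, so those observations become missing
generate lsav = ln(sav)
* count the missing values that were created
count if lsav==.
* summarize uses only the non-missing observations
summarize sav lsav

In the summarize output, the Obs column for lsav should be smaller than the one for sav by
exactly the number reported by count.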

Labels

- Label variables: attaches a label to a variable. If you then run codebook or tabulate on the
variable, the label will be displayed at the top of the output. For instance, using the same
data as above (SAVING.dta), you can give a label to the variable Old by typing

label var Old "Dummy variable, 0 if age<=40, 1 otherwise"

[2] STATA is case sensitive. That is, it matters whether you type database or Database or
dATaBaSe. Each is considered a different name. All commands need to be typed in lower case.

If you type in tabulate Old, you obtain:

Dummy |
variable, 0 |
if age<=40, |
1 otherwise | Freq. Percent Cum.
------------+-----------------------------------
0 | 61 61.00 61.00
1 | 39 39.00 100.00
------------+-----------------------------------
Total | 100 100.00

- Label values

You may want to give labels to the values of a variable that represent different categories.
For instance, the variable Old is coded as 0 and 1, but the output does not tell us which age
group corresponds to each value. We can attach labels to these values so that each value can be
identified with its corresponding category.

First we need to define the labels. Type in

label define old 0 no 1 yes

If you type tab Old, are there any changes in the variable? Not yet: we have just asked STATA to
store these labels under the name “old”. We still need to ask STATA to actually attach these
labels to the variable. Type in

label values Old old

Type in tab Old to see the changes made. You should obtain a STATA output like the one
below,

Dummy |
variable, 0 |
if age<=40, |
1 otherwise | Freq. Percent Cum.
------------+-----------------------------------
no | 61 61.00 61.00
yes | 39 39.00 100.00
------------+-----------------------------------
Total | 100 100.00
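
The two kinds of labels can be tried out together in a minimal sketch. It assumes the example dataset auto.dta that ships with STATA; the dummy heavy, the 3000 cut-off and the label name heavylbl are hypothetical choices for the illustration.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
* build a 0/1 dummy, in the same spirit as the variable old
generate byte heavy = weight > 3000
* variable label: describes the variable itself
label variable heavy "Dummy variable, 0 if weight<=3000, 1 otherwise"
* value labels: first define them, then attach them to the variable
label define heavylbl 0 "no" 1 "yes"
label values heavy heavylbl
tabulate heavy
```

Giving the value label a name different from the variable (heavylbl rather than heavy) is only a matter of taste, but it makes the two steps easier to tell apart.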


11.OTHER USEFUL COMMANDS AND TIPS

Drop or keep variables

When you open your dataset there may be variables that you do not need at all. For instance,
you may observe that variables var1 and var2 do not contain any information and you may want
to delete them.

Type drop var1 var2 and these two variables will be removed from the dataset in memory.
Alternatively, you can type keep var1 var2 var100: the variables specified in the command line
(var1, var2 and var100) are kept by STATA, and all the other variables are deleted.
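
A minimal sketch of the difference between the two commands, assuming the example dataset auto.dta that ships with STATA rather than the tutorial's data:

```stata
* Assumption: auto.dta (shipped with STATA) stands in for your own dataset
sysuse auto, clear
* drop removes the listed variables and keeps everything else
drop headroom trunk
describe, short
* keep does the opposite: only the listed variables survive
keep make price mpg
describe, short
```

Neither command changes the file on disk; only the copy of the data in memory is modified until you save it.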

Move variables

Variables may not be listed in the order that one may prefer. In your dataset your variables would
be displayed in the following order

sav inc size educ age black cons old

If you would like to have the variables in the following order...

sav educ age inc size black old cons

You can use the command move varname1 varname2. This relocates varname1 to the position
of varname2 and shifts the remaining variables to make room. However, you need to repeat this
once for every variable you want to relocate. Alternatively, you can type
order sav educ age inc size black old cons. With a single command STATA will arrange
the variables in the order specified.
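
As a sketch of the order command, assuming the example dataset auto.dta that ships with STATA and a version of STATA (11 or later) in which order accepts the after() option:

```stata
* Assumption: auto.dta (shipped with STATA) stands in for your own dataset
sysuse auto, clear
* put the listed variables first, in this order; the rest keep their order
order make mpg price
* move a single variable right after another one
order foreign, after(make)
* list the variable names in their new order
describe, simple
```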

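Sort: you may also want to arrange the observations in ascending or descending order of a
variable.

Type sort age and the observations will be arranged in ascending order of age. If the variable
contains missing values they will be placed last, because STATA treats a missing value as larger
than any number. String variables can also be sorted. For descending order use gsort with a minus
sign, for instance gsort -age.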
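
A short sketch of both commands, assuming the example dataset auto.dta that ships with STATA (its variable rep78 contains a few missing values):

```stata
* Assumption: auto.dta (shipped with STATA) stands in for your own dataset
sysuse auto, clear
* ascending sort: the observations with missing rep78 end up at the bottom
sort rep78
list make rep78 in -6/l
* descending sort on price: the most expensive cars come first
gsort -price
list make price in 1/3
```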

Recode variables: this will help you to change any value within a variable. For instance, variable
black is coded as 0 and 1, and you may want to have it coded as 1 and 2. If you type in

recode black 0=2

then all values equal to 0 will be replaced by 2, while values equal to 1 are left unchanged.
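
If you prefer not to overwrite the original variable, recode can write the result into a new one. The sketch below assumes the example dataset auto.dta that ships with STATA; foreign2 is a hypothetical name for the new variable.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
* foreign is coded 0/1; create a copy in which 0 is recoded to 2
recode foreign (0 = 2), generate(foreign2)
* cross-tabulate the old and new coding to check the result
tabulate foreign foreign2
```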


Encode variables: some variables are stored as strings. You may want to encode them, that is,
convert them into numeric variables with value labels. Using WAGEM.dta, type

encode marital_status, gen(married)

What is this command generating? Which changes do you observe?
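
To see what encode does without WAGEM.dta, here is a minimal sketch that assumes the example dataset auto.dta shipped with STATA. It first builds a string variable with decode (the reverse command) and then encodes it again; origin_str and origin_num are hypothetical names.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for WAGEM.dta
sysuse auto, clear
* decode turns the labelled numeric variable foreign into a string
decode foreign, generate(origin_str)
* encode turns the string back into a numeric variable with value labels;
* codes are assigned in alphabetical order, starting at 1
encode origin_str, generate(origin_num)
describe origin_str origin_num
tabulate origin_num, nolabel
```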

Managing dates and time variables

(From STATA manual) “Dates and times are called %t values. %t values are numerical and
integral. The integral value records the number of time units that have passed from an agreed-
upon base, which for STATA is 1960.

Coding and interpretation of date and time (%t) values are as follows:”

+---------------------------------------------------------------------+
| | | ----- Numerical value & interpretation ------ |
| Format | Meaning | Value = -1 | Value = 0 | Value = 1 |
|--------+------------+---------------+---------------+---------------|
| %tc | clock | 31dec1959 | 01jan1960 | 01jan1960 |
| | | 23:59:59.999 | 00:00:00.000 | 00:00:00.001 |
| | | | | |
| %td | days | 31dec1959 | 01jan1960 | 02jan1960 |
| | | | | |
| %tw | weeks | 1959w52 | 1960w1 | 1960w2 |
| | | | | |
| %tm | months | 1959m12 | 1960m1 | 1960m2 |
| | | | | |
| %tq | quarters | 1959q4 | 1960q1 | 1960q2 |
| | | | | |
| %th | half-years | 1959h2 | 1960h1 | 1960h2 |
| | | | | |
| %tg | generic | -1 | 0 | 1 |
+---------------------------------------------------------------------+

If you have a date variable that is stored as a numerical value like those above, you can display
it as a date and generate new variables that contain the day, month or year associated with it.
Check the variable date. Displayed as a raw number, it does not convey any information. If you
would like STATA to display the date corresponding to the numerical value you need to format the
variable:

format date %td

(See the table above for more options on how dates can be displayed. Keep in mind that a format
only changes how the stored number is displayed: a daily value must first be converted, for
instance with qofd(date), before a format such as %tq shows the right quarter.)

You may want to generate a different set of variables that capture the day, month and year in
which the event happened. The way to do so is to extract this information from the variable date.

gen day = day(date)
gen month = month(date)
gen year = year(date)
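
The steps above can be tried on a one-observation dataset built from scratch. This is only a sketch: the date 25 December 2010 and the variable quarter are made up for the illustration.

```stata
* build a tiny dataset with a single daily date (hypothetical example)
clear
set obs 1
generate date = mdy(12, 25, 2010)
* before formatting, the date is shown as the number of days since 01jan1960
list date
format date %td
list date
* extract the components of the date
generate day = day(date)
generate month = month(date)
generate year = year(date)
* convert the daily value into a quarterly value before using %tq
generate quarter = qofd(date)
format quarter %tq
list
```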

Copying and pasting results
If you would like to copy the results in the STATA output window and paste them in Word, it is
best to select the output, copy it as text or as a table, and paste it into a Word document using
the font Courier New in size 8 or 9, so that the columns stay aligned.

12. EXITING STATA

When you have finished your session and you want to save your dataset because it has been
modified you can click on File → Save as and you will be prompted with a window that will give
you the option to save the new dataset in your preferred location. It is always desirable to keep
your initial dataset (SAVING.dta) unmodified and save any changes you make to the data into
another data file (i.e. SAVING2.dta).

If you click on the exit button by mistake, no worries! STATA will always ask you whether you
want to save any changes made. You can also close the session by typing exit in the command
window; if the data have been modified, STATA will not exit and you will get the following error
message in the results window:

Saving the work you have done

It is up to you to decide whether to save it or not. If you want to save it, just follow the steps
mentioned above (click on File → Save as) and save it in your preferred location. Alternatively,
type

save nameofdatabase.dta, replace

Note: if you use the name of the dataset you opened before (in our example SAVING.dta) then you
will overwrite the old SAVING.dta with this one. It is advisable to save the work under a new name
(like SAVING2.dta).

If you don't want to save any changes you can type clear in the command window and then exit
STATA straight away.
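
As a sketch of saving under a new name, assuming the example dataset auto.dta that ships with STATA; the file name auto2.dta and the variable price_k are hypothetical, and the file is written to the current working directory.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
* modify the data in memory
generate price_k = price/1000
* save under a new (hypothetical) name so the original file stays untouched
save auto2.dta, replace
* reopen the new file to check that the change was stored
use auto2.dta, clear
describe price_k
```

After this, typing exit closes STATA without complaint, because the data in memory have not changed since they were last saved.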


13. EXAMPLE: USING STATA FOR REGRESSION ANALYSIS

i. Correlation and Simple Regression

Correlation
Having covered the basic concepts of STATA, we now apply them to a piece of analysis. Using the
US savings data SAVING.dta, suppose we want to estimate the following model:

sav = β0 + β1 * inc + u

In this case, we are interested in the relation between income and savings. As your income
increases, you expect your savings to increase as well. But by how much? From an introductory
course in Macroeconomics you know that β1 is called the "marginal propensity to save". If β1 = 1
it means that for each dollar increase (decrease) in people's income, on average their savings
increase (decrease) by one dollar. If β1 = 0.5 it means that for each dollar increase (decrease) in
people's income, on average their savings increase (decrease) by fifty cents, and so on.

The first thing we can check about these variables is whether they are correlated.

correlate sav inc

| sav inc
-------------+------------------
sav | 1.0000
inc | 0.2493 1.0000

(The actual command is correlate, but you can just type corr and it will be recognized.)

To test whether the correlation coefficient is different from zero, type

pwcorr sav inc, sig

| sav inc
-------------+------------------
sav | 1.0000
|
|
inc | 0.2493 1.0000
| 0.0124

The p-value of the test is 0.0124 (you should remember how the test statistic is constructed).
Therefore you reject the null hypothesis that the correlation is zero at the 5% significance level.
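
As a reminder, the usual test statistic for a zero correlation can be written as follows. This is the standard textbook formula, not something shown in the STATA output; r denotes the sample correlation (0.2493), N the number of observations (100) and ρ the population correlation.

```latex
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}
  = \frac{0.2493\,\sqrt{98}}{\sqrt{1-0.2493^{2}}}
  \approx 2.55,
\qquad
t \sim t_{N-2} \ \text{under} \ H_0:\ \rho = 0
```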


Simple regression

If you want β̂0 and β̂1 (the estimates of the parameters of interest), type in the
Command window:

regress sav inc


(The actual command is regress, but you can just type reg and it will be recognized.)

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =    6.49
       Model |    66368437     1    66368437           Prob > F      =  0.0124
    Residual |  1.0019e+09    98  10223460.8           R-squared     =  0.0621
-------------+------------------------------           Adj R-squared =  0.0526
       Total |  1.0683e+09    99  10790581.8           Root MSE      =  3197.4

------------------------------------------------------------------------------
         sav |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         inc |   .1466283   .0575488     2.55   0.012     .0324247     .260832
       _cons |   124.8424   655.3931     0.19   0.849    -1175.764    1425.449
------------------------------------------------------------------------------

The output of a regression will give you a lot of useful information.


• From the above table you notice that β̂0 = 124.84 and β̂1 = 0.1466. Are they
significant? Look at the standard error for each coefficient. For the constant the standard
error is 655, which is about 5 times bigger than the estimated coefficient. For the slope
coefficient, on the contrary, the standard error is less than half the coefficient. As a first
impression, the constant is not significant, but the slope is.

A formal test is given by the t-test. In each case, the output provides the t-value for a test
that the specified coefficient is zero (i.e. H0: β = 0; Ha: β ≠ 0). To judge significance just
look at the p-value (which appears below P>|t|). For the constant, we obtain a p-value
of 0.849, which means that it is not significant at any conventional significance level. The
slope coefficient, however, has a p-value of 0.012, which means it is significant at the 10%
and 5% significance levels, although not at 1%.

A small p-value means the estimated coefficient is significant; a large one means it is not. For
instance, a p-value of 0.002 says that you reject the null hypothesis that the coefficient is
zero at the 1% significance level (since 0.002<0.01) and consequently also at the 5% and 10%
levels. On the other hand, a p-value of 0.035 means that you cannot reject the null hypothesis at
the 1% significance level, but you can at 5% and 10% (since 0.01<0.035<0.05<0.1). Finally, a
p-value like 0.56 tells you that your estimate is not even significant at the 10% level
(0.56>0.1>0.05>0.01).


• At the top right, you have a test for all the coefficients (other than the constant) being
jointly zero, H0: β1 = 0. This is the F-test. Look at the p-value: 0.0124. That means you
reject the null hypothesis at the 5% level. Note that in the simple regression case, the F-test
is equivalent to the t-test of β1 (for that reason you obtain the same p-value!)

By the way, do you know how to get the degrees of freedom for the F-test? Note that
the degrees of freedom are 1 for the numerator and N-2=98 for the denominator. This
can also be read from the regression output.

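The regression output also provides information about TSS (Total Sum of Squares), RSS
(Regression Sum of Squares) and ESS (Error Sum of Squares). Be aware that STATA's stored
results use other names: e(mss) is the regression (model) sum of squares and e(rss) is the
error (residual) sum of squares. From the top left you get:

RSS | 66368437
ESS | 1.0019e+09
-----------+---------------
TSS | 1.0683e+09

You obtain the following F-value:

$$F_{1,N-2} = \frac{RSS}{ESS/(N-2)} = \frac{66368437}{1.0019\times 10^{9}/98} = 6.49$$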

Finally, note that the p-values of the t-statistic, the F-statistic and the correlation
coefficient test are the same: 0.0124. Why?

• R-squared is a measure of goodness of fit of the model and it is obtained as follows:

$$R^2 = \frac{RSS}{TSS} = \frac{66368437}{1.0683\times 10^{9}} = 0.0621$$

which is the value you get from the top-right.
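
The arithmetic in the bullets above can be reproduced with display, using only the numbers printed in the regression output. This is a sketch for checking the calculations by hand; because the inputs are rounded, the last digits may differ slightly from the output.

```stata
* t-statistic of the slope: coefficient divided by its standard error
display .1466283/.0575488
* two-sided p-value of that t-statistic with 98 degrees of freedom
display 2*ttail(98, .1466283/.0575488)
* F-statistic: regression sum of squares over the error mean square
display 66368437/(1.0019e+09/98)
* p-value of the F-statistic with (1, 98) degrees of freedom
display Ftail(1, 98, 66368437/(1.0019e+09/98))
* R-squared: regression sum of squares over total sum of squares
display 66368437/1.0683e+09
```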

After you run a regression, basic information about your results is stored. Type

ereturn list

scalars:
e(N) = 100
e(df_m) = 1
e(df_r) = 98
e(F) = 6.491777898938016
e(r2) = .0621271647345948
e(rmse) = 3197.414708235473
e(mss) = 66368436.97882748
e(rss) = 1001899160.011173
e(r2_a) = .0525570337624989
e(ll) = -947.893503812726
e(ll_0) = -951.1005492750204

macros:
e(title) : "Linear regression"
e(depvar) : "sav"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"

matrices:
e(b) : 1 x 2
e(V) : 2 x 2

functions:
e(sample)

Two basic element types can be distinguished here:

• Scalars: they are just numbers that you can use in other calculations. For instance, if you
want to display R2 again you can type

display e(r2)

.06212716

• Matrices: here you can get the coefficient as a vector and the variance-covariance matrix.

matrix list e(b)

e(b)[1,2]
inc _cons
y1 .14662835 124.84241

matrix list e(V)

symmetric e(V)[2,2]
inc _cons
inc .00331186
_cons -32.924014 429540.14

• Additionally, the coefficients can be obtained as _b[NAMECOEF]. For instance,

display _b[inc]
.14662835

display _b[_cons]
124.84241
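
The stored results can be combined to rebuild the statistics shown in the regression table. The sketch below assumes the example dataset auto.dta that ships with STATA (price regressed on mpg) instead of SAVING.dta; the matrix names b and V are arbitrary.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
regress price mpg
* R-squared from the stored sums of squares (e(mss) model, e(rss) residual)
display e(mss)/(e(mss) + e(rss))
* F-statistic from the stored sums of squares and degrees of freedom
display (e(mss)/e(df_m))/(e(rss)/e(df_r))
* t-statistic of the slope from the stored coefficient and standard error
display _b[mpg]/_se[mpg]
* the same t-statistic from the coefficient vector and variance matrix
matrix b = e(b)
matrix V = e(V)
display b[1,1]/sqrt(V[1,1])
```

Copying e(b) and e(V) into your own matrices is useful because the e() results are overwritten by the next estimation command.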

Testing after regression
After you run a regression you can test other hypotheses. For instance, suppose you are
interested in H0: β1 = 1. From your knowledge of Econometric Theory you would like to use

$$t_{\beta_1=1} = \frac{\hat{\beta}_1 - 1}{\sqrt{\widehat{Var}(\hat{\beta}_1)}}$$

which follows a t distribution with N-K degrees of freedom, N being the number of observations
and K the number of parameters estimated (in our case just 2). In STATA you simply type

test inc=1

( 1) inc = 1

F( 1, 98) = 219.89
Prob > F = 0.0000

Why does it use an F-test? The F-test allows you to work with more complicated hypotheses.
Moreover, if t is a random variable that follows a t distribution with d degrees of freedom, then
t² has the same distribution as an F(1,d) random variable, so both tests give the same p-value.
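
The equivalence between the squared t-statistic and the F-statistic reported by test can be verified directly. The sketch assumes the example dataset auto.dta that ships with STATA (price regressed on mpg) instead of SAVING.dta.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
regress price mpg
* test the hypothesis that the slope equals 1
test mpg = 1
* F-statistic stored by test
display r(F)
* square of the t-statistic built by hand
display ((_b[mpg] - 1)/_se[mpg])^2
* p-value from test and from the t distribution
display r(p)
display 2*ttail(e(df_r), abs((_b[mpg] - 1)/_se[mpg]))
```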

If you want to replicate the F-test which appears in the top-right of your regression output you
can type:

test inc=0

( 1) inc = 0

F( 1, 98) = 6.49
Prob > F = 0.0124

You can also test more complex hypotheses. For instance, a test that both coefficients are zero
can be done by:

test (inc=0) (_cons=0)

( 1) inc = 0
( 2) _cons = 0

F( 2, 98) = 15.49
Prob > F = 0.0000

Special case 1: If we want to estimate a different model, without constant term, like

sav = β1 * inc + u

then type instead

regress sav inc, nocons

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    99) =   31.26
       Model |   316431274     1   316431274           Prob > F      =  0.0000
    Residual |  1.0023e+09    99  10123940.5           R-squared     =  0.2400
-------------+------------------------------           Adj R-squared =  0.2323
       Total |  1.3187e+09   100  13187013.9           Root MSE      =  3181.8

------------------------------------------------------------------------------
         sav |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         inc |   .1561974   .0279389     5.59   0.000     .1007606    .2116343
------------------------------------------------------------------------------

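Note the changes in the degrees of freedom of the F-stat: with no constant only one parameter is
estimated, so the denominator has N-1=99 degrees of freedom. Note also that without a constant
STATA computes the R-squared from the uncentered total sum of squares (the Total row now has
100 degrees of freedom), so it is not comparable with the R-squared of the model with a constant.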

Special case 2: In your statistics module you have studied the behaviour of random variables.
Suppose that sav is a random variable which is normally distributed with unknown mean and unknown
variance, i.e. $sav \sim N(\mu, \sigma^2)$. What is an unbiased estimator of µ? As you may know,
that is just the sample mean, $\hat{\mu} = \overline{sav}$.

To get the sample mean you can just type summarize sav. An alternative way is using regression.
Type:

regress sav

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  0,    99) =    0.00
       Model |           0     0           .           Prob > F      =       .
    Residual |  1.0683e+09    99  10790581.8           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared =  0.0000
       Total |  1.0683e+09    99  10790581.8           Root MSE      =  3284.9

------------------------------------------------------------------------------
         sav |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |    1582.51   328.4902     4.82   0.000     930.7142    2234.306
------------------------------------------------------------------------------

Then, you get µ̂ = 1582.51. What about testing H0: µ=0? This is exactly the t-test provided in the
regression output. The p-value is 0.000, which means you reject the null hypothesis!
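
The link between summarize and a regression on a constant only can be checked numerically. The sketch assumes the example dataset auto.dta that ships with STATA; the scalar names mean_price and se_mean are hypothetical.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
summarize price
* keep the sample mean and its standard error, sd/sqrt(N), for later
scalar mean_price = r(mean)
scalar se_mean = r(sd)/sqrt(r(N))
* regression on a constant only
regress price
* the constant equals the sample mean, its standard error equals sd/sqrt(N)
display mean_price " vs " _b[_cons]
display se_mean " vs " _se[_cons]
```

The results of summarize are copied into scalars first because the r() results may be overwritten by later commands.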


What does regression do?

You know that a regression minimizes the sum of squared errors:

$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left( sav_i - \hat{\beta}_0 - \hat{\beta}_1 * inc_i \right)^2$$

How much is this quantity? After the regression, type:

predict u, resid
(This creates a new variable, whose name you need to specify, in our case u, with the residuals.
That is, you are generating a variable u that contains the estimated error term
û = sav − β̂0 − β̂1 * inc; see footnote 3.)

Now, we need to square it,

generate u2=u^2

How much is the sum of the new variable u2? Type

summarize u2
display r(sum)
(When you type summarize, STATA calculates more things than it reports. One of them is
stored under the name r(sum), which contains the sum of all the values of the variable specified
after the summarize command. If you want to see this value you need the display command. Other
stored statistics can be seen by typing return list.)

Alternatively, you could have created it using:

generate u2_=(sav-_b[_cons]-_b[inc]*inc)^2
summarize u2_
display r(sum)

How do we know that this value is the minimum? You could create any other variable using other
betas, for instance β0 = 2 and β1 = 1:

generate u2_alt=(sav-2-1*inc)^2
summarize u2_alt
display r(sum)
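
The same comparison can be sketched on the example dataset auto.dta that ships with STATA (an assumption; the tutorial uses SAVING.dta). The alternative line simply shifts the estimated constant by 100, an arbitrary choice, and the variable names u, u2 and u2_alt follow the text above.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
regress price mpg
* residuals and their squares, stored in double precision
predict double u, resid
generate double u2 = u^2
summarize u2
* sum of squared residuals: matches the residual sum of squares e(rss)
display r(sum)
display e(rss)
* any other line, here the OLS line shifted up by 100, gives a larger sum
generate double u2_alt = (price - (_b[_cons] + 100) - _b[mpg]*mpg)^2
summarize u2_alt
display r(sum)
```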

3
Alternatively, you could have created this variable by typing the instruction:
gen u=sav - _b[_cons] - _b[inc]*inc

You should get a higher sum, since 2 and 1 are not the values that minimize the sum of
squared errors; only the OLS estimates β̂0 and β̂1 do.

You also know that OLS has other properties. Try to prove the following:

1) $\sum_{i=1}^{n} \hat{u}_i = \sum_{i=1}^{n} \left( sav_i - \hat{\beta}_0 - \hat{\beta}_1 * inc_i \right) = 0$

2) $\sum_{i=1}^{n} inc_i * \hat{u}_i = 0$

3) $\sum_{i=1}^{n} \widehat{sav}_i * \hat{u}_i = 0$

Solution:

1) Try this:
summarize u
display r(sum)
2) Try this:
gen inc_u=inc*u
summarize inc_u
display r(sum)
3) Try this:
predict sav_hat
gen sav_hat_u=sav_hat*u
summarize sav_hat_u
display r(sum)
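
Property 3 does not need a separate proof: it follows from properties 1 and 2, as the short derivation below shows (using the definition of the fitted value as the estimated constant plus the estimated slope times income). In STATA the three sums will typically come out as very small numbers rather than exactly zero, because of rounding.

```latex
\sum_{i=1}^{n} \widehat{sav}_i \, \hat{u}_i
  = \sum_{i=1}^{n} \left( \hat{\beta}_0 + \hat{\beta}_1 \, inc_i \right) \hat{u}_i
  = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i
    + \hat{\beta}_1 \sum_{i=1}^{n} inc_i \, \hat{u}_i
  = \hat{\beta}_0 \cdot 0 + \hat{\beta}_1 \cdot 0
  = 0
```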


ii. Graphs
Histograms. The first graph you can use is a histogram. It shows what the distribution
of a given variable looks like.

histogram sav
[Figure: histogram of sav. Horizontal axis: annual savings, $ (from -10000 to 30000); vertical axis: Density.]

What can you infer about it? First, note that there seem to be outliers in the right tail. In
other words, savings have a positively skewed distribution. Why?

Another way to see that is to superimpose a normal density function on the histogram. Try

histogram sav, normal


[Figure: histogram of sav with a normal density curve overlaid. Horizontal axis: annual savings, $ (from -10000 to 30000); vertical axis: Density.]

Here it is clearer that values above $20000 would be very unlikely if savings were
normally distributed.


Actually, we can get the skewness and kurtosis very easily. Type

summ sav, detail

                      annual savings, $
-------------------------------------------------------------
      Percentiles      Smallest
 1%        -4230          -5577
 5%        -1255          -2883
10%         -287          -2749       Obs                 100
25%          189          -1389       Sum of Wgt.         100

50%          982                      Mean            1582.51
                        Largest       Std. Dev.      3284.902
75%       1838.5           6120
90%       4378.5          10668       Variance       1.08e+07
95%         5323          10733       Skewness       4.212466
99%        18069          25405       Kurtosis       29.90081

Note that skewness is positive (as expected; also note that the median, 982, is smaller
than the mean, 1582.51). Also kurtosis is bigger than 3, the value for a normal distribution
(actually it is around 30), which is strong evidence that savings are not normally distributed.
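
The skewness and kurtosis are also stored after summarize with the detail option, and STATA offers a formal normality test based on them (sktest). The sketch assumes the example dataset auto.dta that ships with STATA, whose variable price is used in place of sav.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
summarize price, detail
* stored skewness and kurtosis
display r(skewness)
display r(kurtosis)
* mean minus median: positive for a distribution skewed to the right
display r(mean) - r(p50)
* skewness and kurtosis test for normality
sktest price
```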

Scatter plots and lines. Now suppose that you want to visually inspect the relation
between sav and inc. Then type

scatter sav inc, title("Income and savings")

[Figure: scatter plot titled "Income and savings". Horizontal axis: annual income, $; vertical axis: annual savings, $.]


What did the regression do? It predicted some values, and we can see them in this way.

reg sav inc


predict sav_hat
(This creates a new variable, whose name you need to specify, in our case sav_hat, with
the predicted values. If sav_hat already exists from the previous exercise, type
drop sav_hat first.)

Now we are not creating the errors, but instead sav_hat = β̂0 + β̂1 * inc, that is, the
predicted value (see footnote 4).

Now type

sort inc
(This sorts, i.e. orders, the data according to their income values, so that the fitted line
is drawn from left to right. If you want to learn more about this type help sort or
help gsort. The latter allows you to order in ascending or descending order.)

scatter sav inc || line sav_hat inc, title("Income and savings") t2("Actual values and linear prediction")
(The whole instruction above should be typed on ONE line, not TWO.)

[Figure: scatter plot of annual savings, $ against annual income, $ with the fitted values drawn as a line. Title: "Income and savings"; second title: "Actual values and linear prediction".]

If you want to know which other options you have about graphs, type:

help graph

4
Alternatively, you could have created this variable by typing the instruction:
gen sav_hat=_b[_cons] + _b[inc]*inc

iii. Multiple regression
In general, in applied econometric analysis you will use multiple regression, that is, regression
involving more than one explanatory variable. For instance, younger people may save in a
different manner from older people, even if they have the same level of income. If we want to
model this hypothesis we would be interested in the following model:

sav = β0 + β1 * inc + β2 * age + u

What do you think is the sign of β2? The life-cycle theory predicts that young people save for
retirement. Therefore we expect this coefficient to be negative. However, older individuals may
also have better jobs, which could give us the opposite sign.

In STATA, multiple regression works in the same way as simple regression. Type

reg sav inc age

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =    3.40
       Model |  69993876.4     2  34996938.2           Prob > F      =  0.0374
    Residual |   998273721    97  10291481.7           R-squared     =  0.0655
-------------+------------------------------           Adj R-squared =  0.0463
       Total |  1.0683e+09    99  10790581.8           Root MSE      =    3208

------------------------------------------------------------------------------
         sav |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         inc |   .1555624   .0596697     2.61   0.011     .0371346    .2739903
         age |  -26.72823   45.03278    -0.59   0.554    -116.1058    62.64938
       _cons |    1072.28   1726.415     0.62   0.536    -2354.176    4498.736
------------------------------------------------------------------------------

Note that the estimate of β1 is similar to that obtained in the simple regression model. The
coefficient on age is negative, but not significant (p-value 0.554). This means that, once income
is controlled for, there is no statistical evidence that age helps to explain the level of savings.

It is a good idea to spend some time interpreting those coefficients. When you have a multiple
regression model, you have to interpret the effect of one variable conditional on the remaining
explanatory variables. For instance, β̂1 = 0.156 means that increasing income by one dollar
(keeping age fixed) raises savings by about 16 cents on average. This idea is similar to the
expression ceteris paribus that you often use in Economics.

In other words, to get β̂1 = 0.156, you are taking out the influence of age on both savings and
income. The following example might help you. You can get β̂1 = 0.156 in the following way, just
using simple regressions:
reg sav age
predict u_sav, resid
reg inc age
predict u_inc, resid
reg u_sav u_inc


The first two instructions run a simple regression of savings on age and construct the
residuals. Note that these residuals, by construction, are uncorrelated with age (Why? Remember
that in a simple regression model $\sum_{i=1}^{n} \hat{u}_i * X_i = 0$). Instructions 3 and 4 do
the same but with income and age. Finally, a simple regression between the two sets of residuals
is run.

Note: Check the t-statistic and the p-value in the last regression and compare them with those of
the initial multiple regression. They are slightly different because the last regression does not
use the correct degrees of freedom (98 instead of 97).
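
The whole exercise, including the degrees-of-freedom correction, can be sketched on the example dataset auto.dta that ships with STATA (an assumption; price, mpg and weight play the roles of sav, inc and age). The scalar names b_full, se_full and df_full are hypothetical.

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
* multiple regression: keep the coefficient, standard error and df of mpg
regress price mpg weight
scalar b_full = _b[mpg]
scalar se_full = _se[mpg]
scalar df_full = e(df_r)
* take the influence of weight out of price and out of mpg
regress price weight
predict double u_price, resid
regress mpg weight
predict double u_mpg, resid
* simple regression between the two sets of residuals
regress u_price u_mpg
* same coefficient as in the multiple regression
display b_full " vs " _b[u_mpg]
* same standard error once the degrees of freedom are corrected
display se_full " vs " _se[u_mpg]*sqrt(e(df_r)/df_full)
```

The residuals are stored in double precision so that the two coefficients agree to many decimal places.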

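Go back to the multiple regression model. The other parts of the regression output are
interpreted as before.

The t-test is specific to each coefficient. For instance, the test for the significance of income
can be obtained by typing the command below (the F-value reported is the square of the
t-statistic in the regression output, and the p-value is the same):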

test inc

( 1) inc = 0

F( 1, 97) = 6.80
Prob > F = 0.0106

Similarly, for age


test age

( 1) age = 0

F( 1, 97) = 0.35
Prob > F = 0.5542

The F-test (joint test for the significance of income and age) can be obtained similarly as

test inc age

( 1) inc = 0
( 2) age = 0

F( 2, 97) = 3.40
Prob > F = 0.0374
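
That the joint test reproduces the F-statistic of the regression output can be checked from the stored results. The sketch assumes the example dataset auto.dta that ships with STATA (price, mpg and weight in place of sav, inc and age).

```stata
* Assumption: auto.dta (shipped with STATA) stands in for SAVING.dta
sysuse auto, clear
regress price mpg weight
* joint test that both slope coefficients are zero
test mpg weight
* F-statistic from test and from the regression output
display r(F) " vs " e(F)
* p-value rebuilt from the F distribution, to compare with r(p)
display Ftail(r(df), r(df_r), r(F))
display r(p)
```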
