STATA Programming II
• Datasets not in Stata format can be classified along two dimensions: (1) whether the data are in a
proprietary binary file format associated with a spreadsheet, database, or statistical program
(Microsoft Excel, dBase, SPSS, etc.) or instead in raw text (ASCII) format; and (2) whether each
observation occupies a single record or multiple records. Raw text can be fixed format (each variable
starts and ends in specified columns, so the file looks like a giant rectangle, often with no spaces
between digits) or delimited (variables are set off from each other by spaces, commas, tabs, or other
characters). Since most proprietary programs can save their datasets as raw text files, the Stata
commands covered here import only raw text files (in addition to Stata format itself, of course).
• Single-record comma- or tab-delimited files are easiest to import. Set Stata’s memory sufficiently
high, then use the insheet command: insheet using filename.ext, clear [comma] [names],
where ext is the file’s extension (e.g., .csv for many comma-delimited files). The optional comma
option tells Stata that the file is comma-delimited (as opposed to tab-delimited), and the optional
names option tells it that the first row of the file holds the variable names.
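As a minimal sketch of the insheet step, the lines below first write a tiny comma-delimited file so that
the example is self-contained, then import it. The file name example.csv and the variables country,
year, and gdp are hypothetical, chosen only for illustration.

```stata
* Write a three-line comma-delimited file (hypothetical name and contents).
file open fh using "example.csv", write text replace
file write fh "country,year,gdp" _n
file write fh "USA,2001,10.1" _n
file write fh "CAN,2001,0.7" _n
file close fh

* Import it: comma = comma-delimited, names = first row holds variable names.
insheet using example.csv, clear comma names

* Inspect what was read in.
describe
list
```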
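• For space-delimited, fixed-format, or multiple-record files, use the infile command, usually
combined with a dictionary file that describes the file’s layout (simple space-delimited files can also
be read by infile without a dictionary). Refer to Stata’s help or my class example for more information.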
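The following is only a sketch of the dictionary approach for a fixed-format file, not the class example
itself. It assumes a hypothetical raw file, fixed.raw, in which id occupies column 1, year columns 2-5,
and gdp columns 6-8; both the raw file and the dictionary file, fixed.dct, are written on the fly so the
example can be run as is.

```stata
* Write a hypothetical fixed-format data file: id (col 1), year (cols 2-5), gdp (cols 6-8).
file open fh using "fixed.raw", write text replace
file write fh "120005.0" _n
file write fh "220003.1" _n
file close fh

* Write the dictionary describing where each variable starts and how wide it is.
file open fh using "fixed.dct", write text replace
file write fh "dictionary using fixed.raw {" _n
file write fh "    _column(1) id   %1f" _n
file write fh "    _column(2) year %4f" _n
file write fh "    _column(6) gdp  %3f" _n
file write fh "}" _n
file close fh

* Read the data through the dictionary.
infile using fixed.dct, clear
list
```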
• Once you have a dataset in Stata’s memory, you may need to check the variables’ storage types (to
make sure numeric variables are stored as numeric rather than as strings), modify the variable names
and descriptions, etc.
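A short sketch of these clean-up steps follows. The handout does not name specific commands for them;
destring, rename, and label variable are one common choice. The variable names are hypothetical.

```stata
* Hypothetical data in which a numeric quantity was read in as a string.
clear
input str5 income_str id
"1200" 1
"950"  2
end

* Check storage types: income_str shows up as a string (str5).
describe

* Convert the string to a numeric variable in place.
destring income_str, replace

* Give the variable a better name and a description.
rename income_str income
label variable income "Monthly income"

describe
```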
• To export a Stata dataset in memory into a comma-delimited format which can be read into Excel,
etc., just use the outsheet command: outsheet using filename.csv, comma names replace.
You can optionally export only selected variables by listing them between outsheet and using, or
only selected observations by adding an if condition before the comma.
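For illustration, a small sketch of both forms of outsheet. The dataset, variable names, and output file
names (all_rows.csv, subset.csv) are hypothetical.

```stata
* Hypothetical dataset to export.
clear
input id year gdp
1 2000 5.0
1 2001 5.5
2 2000 3.1
end

* Export everything, with variable names in the first row.
outsheet using all_rows.csv, comma names replace

* Export only id and gdp, and only the rows for 2001.
outsheet id gdp using subset.csv if year == 2001, comma names replace
```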
• To merge rows and variables from one dataset into another based on matching values of a variable or
variables, use the merge command.
• First, verify that each dataset has identically named match variable(s), with identical values
wherever matches should occur.
• Second, sort each dataset by the match variable(s) and save each file.
• Third, open the file into which you want to import the new rows/columns.
• Fourth, type merge matchvar using dataset2, where dataset2 is the dataset you want to import
from and matchvar is the match variable(s). This process creates a new variable, _merge, which
indicates (see help merge) which rows were originally in the dataset you started with and which came
solely from the dataset you imported from.
• By default, the merge command does not change the values of existing rows or columns if the ones in
the dataset you are importing from are different. Use the , update option to replace cells that were
missing in the original dataset with filled-in values from the dataset you are importing from. Add the
replace option to it (i.e., , update replace) to also overwrite filled-in values from the original
dataset with filled-in values from the dataset you are importing from. Stata will not replace filled-in
values from the original dataset with missing values from the dataset you are importing from.
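The merge steps can be sketched as follows, using the merge syntax given in this handout (newer Stata
releases prefer the form merge 1:1 matchvar using dataset2, and report the older form as old syntax).
The file names dataset1 and dataset2 and the variables id, gdp, and pop are hypothetical.

```stata
* Dataset to import from: sort by the match variable and save.
clear
input id gdp pop
1 9.9 280
2 3.1 30
3 7.4 60
end
sort id
save dataset2, replace

* Dataset to import into: note the missing gdp for id 2.
clear
input id gdp
1 5.0
2 .
4 1.2
end
sort id
save dataset1, replace

* Merge on id; update fills in the missing gdp for id 2 but leaves
* the existing, nonmissing gdp for id 1 untouched.
merge id using dataset2, update

* _merge records where each row came from and whether it was updated.
tabulate _merge
list
```

Adding replace after update in the merge line would instead overwrite the nonmissing gdp for id 1 with the
value from dataset2.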
Stata Programming II September 8, 2003
Eric Reinhardt (Department of Political Science, Emory University, Atlanta, GA 30322)
• Aggregate rows sharing a common value of any given variable using the collapse command. Type
collapse (function) newname1=varname1 (function) newname2=varname2, by(aggvar), where aggvar
is the variable that forms the basis of the aggregation, function is the name of the aggregating
function you would like performed (e.g., mean, min, max, sum, count), varname1 is the first variable
to which the function is applied, and newname1 is the name of the variable Stata will create to hold
the result. Sorting by the aggregation variable(s) first is optional; collapse does not require it.
• The collapse command creates a new, smaller dataset in memory and drops the existing one, so save
first if necessary.
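A minimal sketch of collapse, assuming a hypothetical country-year dataset; the file name
before_collapse and all variable names are illustrative only.

```stata
* Hypothetical country-year data.
clear
input str3 country year gdp
"USA" 2000 10
"USA" 2001 11
"CAN" 2000 1
"CAN" 2001 2
end

* collapse replaces the data in memory, so save a copy first.
save before_collapse, replace

* One row per country, holding the mean and the maximum of gdp.
collapse (mean) mean_gdp=gdp (max) max_gdp=gdp, by(country)
list
```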
• You can duplicate rows, creating x copies of each existing row (the original plus x - 1 new ones),
by typing expand x.
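A small sketch of expand on a hypothetical two-row dataset:

```stata
* Two rows to start with.
clear
input id
1
2
end

* Three copies of each row in total (the original plus two new ones).
expand 3
sort id
list
```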
Transposing
• You can shift your dataset’s structure (e.g., from a structure with values of one variable across time
recorded in separate columns for each time period, to a structure with all values of that variable
recorded in one column, with separate rows for each time period) with the reshape command.
• See help reshape for more information.
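As a sketch of the restructuring described above, the example below starts from a hypothetical dataset
with one gdp column per year and moves it to one row per id-year, then back again. The variable names
id, gdp2000, gdp2001, and year are assumptions made for the example.

```stata
* Wide structure: one column of gdp per time period.
clear
input id gdp2000 gdp2001
1 5.0 5.5
2 3.1 3.3
end

* Wide to long: one gdp column, one row per id and year.
reshape long gdp, i(id) j(year)
list

* Long back to wide.
reshape wide gdp, i(id) j(year)
list
```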
Graphing