Practical Exercise 2: Solution: Part A: Introduction To STATA
603969139.doc 1
How many variables and how many observations are contained in this dataset?
Answer: 8 variables and 22 167 observations.
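As a minimal sketch, assuming the practical's dataset is already open in STATA, these counts can be read off directly rather than counted by hand:

```stata
* Assumes the practical's dataset is already loaded in memory.
* Short description: reports the number of observations and variables only.
describe, short

* The same two numbers, displayed individually.
display _N
display c(k)
```

Here _N is the number of observations in memory and c(k) is the number of variables.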
What level of analysis would be performed with this sort of dataset? (e.g. are the observations in
the dataset individuals, households, firms, countries, etc?)
Each observation corresponds to an individual (their gender, age, race, etc.), so the
analysis can be performed at the individual (or disaggregated) level.
How would you describe the shape of this graph? What does the graph tell us about rent paid in
the sample?
The graph is heavily skewed to the right – most people pay much less than R2 000 per
month, but there are some who pay almost R8 000. These very large values might be
outliers.
9. By default, the sum command doesn’t give a range or median (the median is reported only
if you add the detail option). The tabstat command is useful as you can specify exactly
which descriptive statistics you want STATA to calculate:
tabstat valrent, statistics(n mean median range sd cv)
Note: p50 is the median; sd is the standard deviation; cv is the coefficient of variation
    variable |         N      mean       p50     range        sd        cv
-------------+------------------------------------------------------------
     valrent |      3351  536.8621       200      8000  801.9689  1.493808
--------------------------------------------------------------------------
Compare the mean and the median – what can you conclude? (Hint: look at your
histogram from above).
p50 is the median, i.e. 50% of the data values are at or below R200. The mean
(R536.86) is more than 2.5 times the median. A mean that far above the median
indicates a distribution that is heavily skewed to the right, as was shown by
the histogram.
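The comparison can also be checked numerically. The sketch below first uses the figures reported in the tabstat output, then recomputes the ratio from the data, assuming the dataset with valrent is in memory:

```stata
* Ratio of mean to median, using the reported figures (roughly 2.68).
display 536.8621/200

* summarize with the detail option reports percentiles (50% is the median)
* and the skewness; a positive skewness indicates a right-skewed variable.
summarize valrent, detail

* The same ratio, taken from the results stored by summarize, detail.
display r(mean)/r(p50)
display r(skewness)
```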
11. One way of seeing whether a variable is continuous or categorical is to use the codebook
command as shown below. It displays a large quantity of information about the variable
specified:
Try it first on a variable you know to be continuous:
codebook valrent
--------------------------------------------------------------------------------
valrent Total monthly rent paid
--------------------------------------------------------------------------------
mean: 536.862
std. dev: 801.969
What different sorts of information does STATA give, depending on the type of variable?
For both types of variable, STATA gives the range, number of unique (i.e. different)
values, units and number of missing values.
For a continuous variable (which takes on lots of different values), STATA also
calculates the mean and standard deviation, and the percentiles.
For a categorical variable (which takes on only a few values which correspond to
its categories), STATA also gives the frequency of each category, and the
category labels.
The variable for total household expenditure could be continuous, or recorded in categories.
Use codebook to determine its type.
--------------------------------------------------------------------------------
tothhexp Total household expenditure in last month
--------------------------------------------------------------------------------
examples: 1 R0-R399
2 R400-R799
3 R800-R1199
5 R1800-RR2499
This is a categorical variable. It has ten categories (we know this because the
variable takes on 10 unique values), but the codebook command only lists a randomly-
selected few of the categories.
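To see every category rather than a few examples, one option is sketched below, assuming the dataset is in memory. By default codebook tabulates a variable only when it has at most nine unique values, which is why the ten-category tothhexp is shown with examples instead; the tabulate() option raises that threshold.

```stata
* Frequency table of all categories, shown with their value labels.
tabulate tothhexp

* Ask codebook to tabulate variables with up to 10 unique values.
codebook tothhexp, tabulate(10)

* List all value labels defined in the dataset (codes and their text).
label list
```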
[Figure: bar chart of marital status; x-axis shows the category codes 0 to 5]
Is it easy to interpret this graph? For example, can you tell what proportion of the sample are
widows?
This graph is difficult to interpret because the categories on the x-axis are not
labelled.
Note that STATA labels the x-axis with the code it uses to store the categories. You can get
STATA to label the bars with the names of the categories, but it’s complex.
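One possible way to get labelled bars from the command line is sketched here. The variable name marstat is hypothetical (the actual name of the marital status variable is not given above), and the sketch assumes that value labels are attached to it.

```stata
* marstat is a hypothetical variable name; substitute the real one.
* discrete: one bar per category; percent: bar heights as percentages.
* valuelabel: label the x-axis with the category names instead of the codes.
histogram marstat, discrete percent xlabel(0(1)5, valuelabel angle(45))

* A frequency table gives the exact proportion in each category.
tabulate marstat
```

The tabulate output answers the question about widows directly, since it lists the percentage of the sample in each category.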
14. You can also represent categorical data using a pie chart. Again, use the drop-down menu for
‘Graphics’, then ‘Pie chart (by category)’.
[Figure: pie chart by category; slice shares 26.08%, 40.65%, 5.147%, 17.99%, 10.14%]
Because the pie chart function automatically adds a legend for the graph, it’s easy to see which
category is which.
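The same kind of chart can be produced with a command instead of the menu. This is a sketch only: marstat is a hypothetical name for the categorical variable being charted.

```stata
* marstat is a hypothetical variable name; substitute the real one.
* over(): one slice per category.
* plabel(): print each slice's percentage on the chart.
graph pie, over(marstat) plabel(_all percent, format(%4.1f))
```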
15. Create a pie chart for any other categorical variable. Write a short interpretation.
For example, using tothhexp:
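[Figure: pie chart of tothhexp. Slice shares, in the order extracted: 0.212%, 0.6541%,
1.453%, 4.61%, 8.562%, 20.91%, 6.113%, 9.947%, 32.03%, 15.51%. Legend: R0-R399,
R400-R799, R800-R1199, R1200-R1799, R1800-RR2499, R2500-R4999, R5000-R9999, >R10000,
Don't know, Refuse]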
Roughly a third of respondents (32%) report total household expenditure of between R400
and R799 per month. A further large share (about 21%) report even less, below R400 per
month. Interestingly, 6% of those surveyed refused to answer this question.
16. Take a look at what information STATA has stored in your log file.
You’ll notice that STATA has recorded all of the output from your Results window. When you
finish your prac session today, your log file will automatically close. Then at a later stage, you
can open your log file using Word, and edit it like any normal document.
However, your log file does not save your graphs.
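Since graphs are not kept in the log, they need to be saved separately. A minimal sketch follows; the file names are hypothetical, and it assumes no log is currently open (log using gives an error if one is).

```stata
* File names below are hypothetical examples.
* Open a plain-text log that can later be opened in Word.
log using "prac2.log", text replace

* Draw a graph, then save the current graph to an image file.
histogram valrent
graph export "valrent_hist.png", replace

* Close the log explicitly when finished.
log close
```

The exported image can then be inserted into the same Word document as the log output.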