Stata Survey Data Reference Manual: Release 13

STATA SURVEY DATA REFERENCE
MANUAL
RELEASE 13
A Stata Press Publication

StataCorp LP
College Station, Texas
Copyright © 1985–2013 StataCorp LP
All rights reserved
Version 13
Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in TEX
ISBN-10: 1-59718-125-0
ISBN-13: 978-1-59718-125-9
This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored
in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopy, recording, or
otherwise) without the prior written permission of StataCorp LP unless permitted subject to the terms and conditions
of a license granted to you by StataCorp LP to use the software and documentation. No license, express or implied,
by estoppel or otherwise, to any intellectual property rights is granted by this document.
StataCorp provides this manual "as is" without warranty of any kind, either expressed or implied, including, but
not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make
improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without
notice.
The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software
may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto
DVD, CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.
The automobile dataset appearing on the accompanying media is Copyright © 1979 by Consumers Union of U.S.,
Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979.
Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LP.
Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.
NetCourseNow is a trademark of StataCorp LP.
Other brand and product names are registered trademarks or trademarks of their respective companies.
For copyright information about the software, type help copyright within Stata.
The suggested citation for this software is

StataCorp. 2013. Stata: Release 13. Statistical Software. College Station, TX: StataCorp LP.
Contents

intro                      Introduction to survey data manual
survey                     Introduction to survey commands
bootstrap options          More options for bootstrap variance estimation
brr options                More options for BRR variance estimation
direct standardization     Direct standardization of means, proportions, and ratios
estat                      Postestimation statistics for survey data
jackknife options          More options for jackknife variance estimation
ml for svy                 Maximum pseudolikelihood estimation for survey data
poststratification         Poststratification for survey data
sdr options                More options for SDR variance estimation
subpopulation estimation   Subpopulation estimation for survey data
svy                        The survey prefix command
svy bootstrap              Bootstrap for survey data
svy brr                    Balanced repeated replication for survey data
svy estimation             Estimation commands for survey data
svy jackknife              Jackknife estimation for survey data
svy postestimation         Postestimation tools for svy
svy sdr                    Successive difference replication for survey data
svy: tabulate oneway       One-way tables for survey data
svy: tabulate twoway       Two-way tables for survey data
svydescribe                Describe survey data
svymarkout                 Mark observations for exclusion on the basis of survey characteristics
svyset                     Declare survey design for dataset
variance estimation        Variance estimation for survey data

Glossary
Subject and author index
Cross-referencing the documentation

When reading this manual, you will find references to other Stata manuals. For example,
[U] 26 Overview of Stata estimation commands
[R] regress
[D] reshape
The first example is a reference to chapter 26, Overview of Stata estimation commands, in the User's
Guide; the second is a reference to the regress entry in the Base Reference Manual; and the third
is a reference to the reshape entry in the Data Management Reference Manual.
All the manuals in the Stata Documentation have a shorthand notation:

[GSM]   Getting Started with Stata for Mac
[GSU]   Getting Started with Stata for Unix
[GSW]   Getting Started with Stata for Windows
[U]     Stata User's Guide
[R]     Stata Base Reference Manual
[D]     Stata Data Management Reference Manual
[G]     Stata Graphics Reference Manual
[XT]    Stata Longitudinal-Data/Panel-Data Reference Manual
[ME]    Stata Multilevel Mixed-Effects Reference Manual
[MI]    Stata Multiple-Imputation Reference Manual
[MV]    Stata Multivariate Statistics Reference Manual
[PSS]   Stata Power and Sample-Size Reference Manual
[P]     Stata Programming Reference Manual
[SEM]   Stata Structural Equation Modeling Reference Manual
[SVY]   Stata Survey Data Reference Manual
[ST]    Stata Survival Analysis and Epidemiological Tables Reference Manual
[TS]    Stata Time-Series Reference Manual
[TE]    Stata Treatment-Effects Reference Manual: Potential Outcomes/Counterfactual Outcomes
[I]     Stata Glossary and Index

[M]     Mata Reference Manual
Title
intro Introduction to survey data manual
Description
Remarks and examples
Also see
Description
This entry describes this manual and what has changed since Stata 12. See the next entry,
[SVY] survey, for an introduction to Stata's survey commands.

Remarks and examples
This manual documents the survey data commands and is referred to as [SVY] in references.
After this entry, [SVY] survey provides an overview of the survey commands. This manual is
arranged alphabetically. If you are new to Stata's survey data commands, we recommend that you
read the following sections first:

[SVY] survey               Introduction to survey commands
[SVY] svyset               Declare survey design for dataset
[SVY] svydescribe          Describe survey data
[SVY] svy estimation       Estimation commands for survey data
[SVY] svy postestimation   Postestimation tools for svy

Stata is continually being updated, and Stata users are continually writing new commands. To
find out about the latest survey data features, type search survey after installing the latest official
updates; see [R] update.
What's new
For a complete list of all the new features in Stata 13, see [U] 1.3 What's new.
Also see
[U] 1.3 What's new
[R] intro Introduction to base reference manual
Title
survey Introduction to survey commands
Description
Acknowledgments
References
Also see
Description
The Survey Data Reference Manual is organized alphabetically, making it easy to find an individual
entry if you know the name of a command. This overview organizes and presents the commands
conceptually, that is, according to the similarities in the functions they perform.
Survey design tools

[SVY] svyset                     Declare survey design for dataset
[SVY] svydescribe                Describe survey data

Survey data analysis tools

[SVY] svy                        The survey prefix command
[SVY] svy estimation             Estimation commands for survey data
[SVY] svy: tabulate oneway       One-way tables for survey data
[SVY] svy: tabulate twoway       Two-way tables for survey data
[SVY] svy postestimation         Postestimation tools for svy
[SVY] estat                      Postestimation statistics for survey data, such as design effects
[SVY] svy bootstrap              Bootstrap for survey data
[SVY] bootstrap options          More options for bootstrap variance estimation
[SVY] svy brr                    Balanced repeated replication for survey data
[SVY] brr options                More options for BRR variance estimation
[SVY] svy jackknife              Jackknife estimation for survey data
[SVY] jackknife options          More options for jackknife variance estimation
[SVY] svy sdr                    Successive difference replication for survey data
[SVY] sdr options                More options for SDR variance estimation

Survey data concepts

[SVY] variance estimation        Variance estimation for survey data
[SVY] subpopulation estimation   Subpopulation estimation for survey data
[SVY] direct standardization     Direct standardization of means, proportions, and ratios
[SVY] poststratification         Poststratification for survey data

Tools for programmers of new survey commands

[SVY] ml for svy                 Maximum pseudolikelihood estimation for survey data
[SVY] svymarkout                 Mark observations for exclusion on the basis of survey characteristics

Remarks are presented under the following headings:
Introduction
Survey design tools
Video example
Introduction
Stata's facilities for survey data analysis are centered around the svy prefix command. After you
identify the survey design characteristics with the svyset command, prefix the estimation commands
in your data analysis with svy:. For example, where you would normally use the regress command
to fit a linear regression model for nonsurvey data, use svy: regress to fit a linear regression model
for your survey data.
Why should you use the svy prefix command when you have survey data? To answer this question,
we need to discuss some of the characteristics of survey design and survey data collection because
these characteristics affect how we must perform our analysis if we want to get it right.
Survey data are characterized by the following:
Sampling weights, also called probability weights (pweights in Stata's terminology)
Cluster sampling
Stratification
These features arise from the design and details of the data collection procedure. Here's a brief
description of how these design features affect the analysis of the data:
Sampling weights. In sample surveys, observations are selected through a random process,
but different observations may have different probabilities of selection. Weights are equal to
(or proportional to) the inverse of the probability of being sampled. Various postsampling
adjustments to the weights are sometimes made, as well. A weight of w_j for the jth observation
means, roughly speaking, that the jth observation represents w_j elements in the population
from which the sample was drawn.
Omitting weights from the analysis results in estimates that may be biased, sometimes seriously
so. Sampling weights also play a role in estimating standard errors.
Clustering. Individuals are not sampled independently in most survey designs. Collections of
individuals (for example, counties, city blocks, or households) are typically sampled as a group,
known as a cluster.
There may also be further subsampling within the clusters. For example, counties may be
sampled, then city blocks within counties, then households within city blocks, and then finally
persons within households. The clusters at the first level of sampling are called primary sampling
units (PSUs); in this example, counties are the PSUs. In the absence of clustering, the PSUs
are defined to be the individuals, or, equivalently, clusters, each of size one.
Cluster sampling typically results in larger sample-to-sample variability than sampling individuals
directly. This increased variability must be accounted for in standard error estimates, hypothesis
testing, and other forms of inference.
Stratification. In surveys, different groups of clusters are often sampled separately. These groups
are called strata. For example, the 254 counties of a state might be divided into two strata, say,
urban counties and rural counties. Then 10 counties might be sampled from the urban stratum,
and 15 from the rural stratum.
Sampling is done independently across strata; the stratum divisions are fixed in advance. Thus
strata are statistically independent and can be analyzed as such. When the individual strata
are more homogeneous than the population as a whole, the homogeneity can be exploited to
produce smaller (and honestly so) estimates of standard errors.
To put it succinctly: using sampling weights is important to get the point estimates right. We must
consider the weighting, clustering, and stratification of the survey design to get the standard errors
right. If our analysis ignores the clustering in our design, we would probably produce standard errors
that are smaller than they should be. Stratification can be used to get smaller standard errors for a
given overall sample size.
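The summary above can be illustrated with a minimal sketch. It assumes the svyset highschool dataset used later in this entry (variables weight and sampwgt) can be loaded from the Stata Press site; no output is shown because none has been verified here.

```stata
* Sketch only: the same variable analyzed with increasing design awareness.
* Assumes highschool.dta is reachable and already svyset, as in the examples.
use http://www.stata-press.com/data/r13/highschool, clear

* (1) Ignores the design entirely: unweighted point estimate and standard error
mean weight

* (2) Sampling weights only: the point estimate now matches the svy estimate,
*     but the standard error still ignores clustering and stratification
mean weight [pweight=sampwgt]

* (3) Full design: weights, strata, PSUs, and finite population corrections
svy: mean weight
```

Comparing the three results side by side shows which part of the design moves the point estimate (the weights) and which parts move only the standard error.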
For more detailed introductions to complex survey data analysis, see Cochran (1977); Heeringa,
West, and Berglund (2010); Kish (1965); Levy and Lemeshow (2008); Scheaffer et al. (2012);
Skinner, Holt, and Smith (1989); Stuart (1984); Thompson (2012); and Williams (1978).
Survey design tools

Before using svy, first take a quick look at [SVY] svyset. Use the svyset command to specify
the variables that identify the survey design characteristics and default method for estimating standard
errors. Once set, svy will automatically use these design specifications until they are cleared or
changed or a new dataset is loaded into memory.
As the following two examples illustrate, svyset allows you to identify a wide range of complex
sampling designs. First, we show a simple single-stage design and then a complex multistage design.
Example 1: Survey data from a one-stage design

A commonly used single-stage survey design uses clustered sampling across several strata, where
the clusters are sampled without replacement. In a Stata dataset composed of survey data from this
design, the survey design variables identify information about the strata, PSUs (clusters), sampling
weights, and finite population correction. Here we use svyset to specify these variables, respectively
named strata, su1, pw, and fpc1.
. use http://www.stata-press.com/data/r13/stage5a
. svyset su1 [pweight=pw], strata(strata) fpc(fpc1)
pweight: pw
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: su1
FPC 1: fpc1
In addition to the variables we specified, svyset reports that the default method for estimating
standard errors is Taylor linearization and that svy will report missing values for the standard errors
when it encounters a stratum with one sampling unit (also called singleton strata).
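The handling of singleton strata is controlled by the singleunit() option of svyset. The following is a sketch, not part of the original example, that declares the same design but asks svy to treat singleton strata as certainty units rather than report missing standard errors; which method is appropriate depends on the survey.

```stata
* Sketch only: same design as Example 1 with a different singleton-strata rule.
use http://www.stata-press.com/data/r13/stage5a, clear

* singleunit() accepts missing (the default), certainty, scaled, or centered
svyset su1 [pweight=pw], strata(strata) fpc(fpc1) singleunit(certainty)

* Typing svyset without arguments redisplays the current settings
svyset
```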
Example 2: Multistage survey data

We have (fictional) data on American high school seniors (12th graders), and the data were collected
according to the following multistage design. In the first stage, counties were independently selected
within each state. In the second stage, schools were selected within each chosen county. Within each
chosen school, a questionnaire was filled out by every attending high school senior. We have entered
all the information into a Stata dataset called multistage.dta.
The survey design variables are as follows:
state contains the stratum identifiers.

county contains the first-stage sampling units.
ncounties contains the total number of counties within each state.
school contains the second-stage sampling units.
nschools contains the total number of schools within each county.
sampwgt contains the sampling weight for each sampled individual.
Here we load the dataset into memory and use svyset with the above variables to declare that
these data are survey data.
. use http://www.stata-press.com/data/r13/multistage
. svyset county [pw=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
pweight: sampwgt
VCE: linearized
Strata 1: state
SU 1: county
FPC 1: ncounties
Strata 2: <one>
SU 2: school
FPC 2: nschools
. save highschool
file highschool.dta saved
We saved the svyset dataset to highschool.dta. We can now use this new dataset without having
to worry about respecifying the design characteristics.
. clear
. describe
Contains data
  obs:             0
 vars:             0
 size:             0
Sorted by:
. use highschool
. svyset
pweight: sampwgt
VCE: linearized
Strata 1: state
SU 1: county
FPC 1: ncounties
Strata 2: <one>
SU 2: school
FPC 2: nschools
After the design characteristics have been svyset, you should also look at [SVY] svydescribe. Use
svydescribe to browse each stage of your survey data; svydescribe reports useful information
on sampling unit counts, missing data, and singleton strata.
Example 3: Survey describe

Here we use svydescribe to describe the first stage of our survey dataset of sampled high school
seniors. We specified the weight variable to get svydescribe to report on where it contains missing
values and how this affects the estimation sample.
. svydescribe weight
Survey: Describing stage 1 sampling units
      pweight: sampwgt
          VCE: linearized
     Strata 1: state
         SU 1: county
        FPC 1: ncounties
     Strata 2: <one>
         SU 2: school
        FPC 2: nschools

                                      #Obs with  #Obs with    #Obs per included Unit
             #Units     #Units         complete    missing    ----------------------
 Stratum   included    omitted             data       data      min     mean     max
--------   --------   --------         --------   --------    -----   ------   -----
       1          2          0               92          0       34     46.0      58
       2          2          0              112          0       51     56.0      61
       3          2          0               43          0       18     21.5      25
       4          2          0               37          0       14     18.5      23
       5          2          0               96          0       38     48.0      58
 (output omitted)
      46          2          0              115          0       56     57.5      59
      47          2          0               67          0       28     33.5      39
      48          2          0               56          0       23     28.0      33
      49          2          0               78          0       39     39.0      39
      50          2          0               64          0       31     32.0      33
--------   --------   --------         --------   --------    -----   ------   -----
      50        100          0             4071          0       14     40.7      81
                                       --------
                                           4071
From the output, we gather that there are 50 strata, each stratum contains two PSUs, the PSUs vary
in size, and the total sample size is 4,071 students. We can also see that there are no missing data
in the weight variable.
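svydescribe can also look beyond the first stage. The sketch below assumes the same highschool dataset and uses the stage() and single options of svydescribe; output is not reproduced here.

```stata
* Sketch only: browsing other aspects of the declared design.
use http://www.stata-press.com/data/r13/highschool, clear

* Describe the second-stage sampling units (schools within counties)
svydescribe weight, stage(2)

* List only strata that contain a single sampling unit, if there are any
svydescribe weight, single
```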

Stata's suite of survey data commands is governed by the svy prefix command; see [SVY] svy and
[SVY] svy estimation. svy runs the supplied estimation command while accounting for the survey
design characteristics in the point estimates and variance estimation method. The available variance
estimation methods are balanced repeated replication (BRR), the bootstrap, the jackknife, successive
difference replication, and first-order Taylor linearization. By default, svy computes standard errors
by using the linearized variance estimator, so called because it is based on a first-order Taylor series
linear approximation (Wolter 2007). In the nonsurvey context, we refer to this variance estimator as
the robust variance estimator, otherwise known in Stata as the Huber/White/sandwich estimator; see
[P] robust.
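As a brief sketch of switching variance estimators, the commands below fit the same mean twice with the highschool dataset: once with the default linearized estimator and once with the jackknife requested through the svy jackknife prefix. This is an illustration under the assumption that the dataset is svyset as in example 2; the two standard errors will generally differ slightly.

```stata
* Sketch only: same point estimate, two variance estimation methods.
use http://www.stata-press.com/data/r13/highschool, clear

* Default: first-order Taylor linearization
svy: mean weight

* Jackknife replication, requested for this one command
svy jackknife: mean weight
```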
Example 4: Estimating a population mean

Here we use the svy prefix with the mean command to estimate the average weight of high school
seniors in our population.
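. svy: mean weight
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      50          Number of obs    =      4071
Number of PSUs   =     100          Population size  =   8000000
                                    Design df        =        50

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      weight |   160.2863   .7412512      158.7974    161.7751
--------------------------------------------------------------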
In its header, svy reports the number of strata and PSUs from the first stage, the sample size, an
estimate of population size, and the design degrees of freedom. Just like the standard output from
the mean command, the table of estimation results contains the estimated mean and its standard error
as well as a confidence interval.
Example 5: Survey regression

Here we use the svy prefix with the regress command to model the association between weight
and height in our population of high school seniors.
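. svy: regress weight height
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        50          Number of obs     =       4071
Number of PSUs     =       100          Population size   =    8000000
                                        Design df         =         50
                                        F(   1,     50)   =     593.99
                                        Prob > F          =     0.0000
                                        R-squared         =     0.2787

------------------------------------------------------------------------------
             |             Linearized
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .7163115   .0293908    24.37   0.000     .6572784    .7753447
       _cons |  -149.6183   12.57265   -11.90   0.000    -174.8712   -124.3654
------------------------------------------------------------------------------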
In addition to the header elements we saw in the previous example using svy: mean, the command
svy: regress also reports a model F test and estimated R-squared. Although many of Stata's model-fitting
commands report Z statistics for testing coefficients against zero, svy always reports t statistics and
uses the design degrees of freedom to compute p-values.
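The role of the design degrees of freedom can be checked by hand. The sketch below assumes the highschool dataset and relies on the stored results e(N_psu), e(N_strata), and e(df_r) left behind by svy; it recomputes the two-sided p-value for height from the t distribution.

```stata
* Sketch only: reproduce the reference distribution that svy uses for p-values.
use http://www.stata-press.com/data/r13/highschool, clear
svy: regress weight height

* Design df is the number of first-stage PSUs minus the number of strata
display "PSUs - strata = " e(N_psu) - e(N_strata)
display "e(df_r)       = " e(df_r)

* Two-sided p-value for height from a t distribution with e(df_r) df
scalar tstat = _b[height]/_se[height]
display "t = " tstat "   p = " 2*ttail(e(df_r), abs(tstat))
```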
The svy prefix can be used with many estimation commands in Stata. Here is the list of estimation
commands that support the svy prefix.
Descriptive statistics
mean          [R] mean          Estimate means
proportion    [R] proportion    Estimate proportions
ratio         [R] ratio         Estimate ratios
total         [R] total         Estimate totals

Linear regression models
cnsreg        [R] cnsreg        Constrained linear regression
etregress     [TE] etregress    Linear regression with endogenous treatment effects
glm           [R] glm           Generalized linear models
intreg        [R] intreg        Interval regression
nl            [R] nl            Nonlinear least-squares estimation
regress       [R] regress       Linear regression
tobit         [R] tobit         Tobit regression
truncreg      [R] truncreg      Truncated regression

Structural equation models
sem           [SEM] sem         Structural equation model estimation command

Survival-data regression models
stcox         [ST] stcox        Cox proportional hazards model
streg         [ST] streg        Parametric survival models

Binary-response regression models
biprobit      [R] biprobit      Bivariate probit regression
cloglog       [R] cloglog       Complementary log-log regression
hetprobit     [R] hetprobit     Heteroskedastic probit model
logistic      [R] logistic      Logistic regression, reporting odds ratios
logit         [R] logit         Logistic regression, reporting coefficients
probit        [R] probit        Probit regression
scobit        [R] scobit        Skewed logistic regression

Discrete-response regression models
clogit        [R] clogit        Conditional (fixed-effects) logistic regression
mlogit        [R] mlogit        Multinomial (polytomous) logistic regression
mprobit       [R] mprobit       Multinomial probit regression
ologit        [R] ologit        Ordered logistic regression
oprobit       [R] oprobit       Ordered probit regression
slogit        [R] slogit        Stereotype logistic regression

Poisson regression models
gnbreg        Generalized negative binomial regression in [R] nbreg
nbreg         [R] nbreg         Negative binomial regression
poisson       [R] poisson       Poisson regression
tnbreg        [R] tnbreg        Truncated negative binomial regression
tpoisson      [R] tpoisson      Truncated Poisson regression
zinb          [R] zinb          Zero-inflated negative binomial regression
zip           [R] zip           Zero-inflated Poisson regression

Instrumental-variables regression models
ivprobit      [R] ivprobit      Probit model with continuous endogenous regressors
ivregress     [R] ivregress     Single-equation instrumental-variables regression
ivtobit       [R] ivtobit       Tobit model with continuous endogenous regressors

Regression models with selection
heckman       [R] heckman       Heckman selection model
heckoprobit   [R] heckoprobit   Ordered probit model with sample selection
heckprobit    [R] heckprobit    Probit model with sample selection

Example 6: Cox's proportional hazards model

Suppose that we want to model the incidence of lung cancer by using three risk factors: smoking
status, sex, and place of residence. Our dataset comes from a longitudinal health survey: the First
National Health and Nutrition Examination Survey (NHANES I) (Miller 1973; Engel et al. 1978) and its
1992 Epidemiologic Follow-up Study (NHEFS) (Cox et al. 1997); see the National Center for Health
Statistics website at http://www.cdc.gov/nchs/. We will be using data from the samples identified by
NHANES I examination locations 1-65 and 66-100; thus we will svyset the revised pseudo-PSU and
strata variables associated with these locations. Similarly, our pweight variable was generated using
the sampling weights for the nutrition and detailed samples for locations 1-65 and the weights for
the detailed sample for locations 66-100.
. use http://www.stata-press.com/data/r13/nhefs
. svyset psu2 [pw=swgt2], strata(strata2)
pweight: swgt2
VCE: linearized
Strata 1: strata2
SU 1: psu2
FPC 1: <zero>
The lung cancer information was taken from the 1992 NHEFS interview data. We use the participants'
ages for the time scale. Participants who never had lung cancer and were alive for the 1992 interview
were considered censored. Participants who never had lung cancer and died before the 1992 interview
were also considered censored at their age of death.
. stset age_lung_cancer [pw=swgt2], fail(lung_cancer)

     failure event:  lung_cancer != 0 & lung_cancer < .
obs. time interval:  (0, age_lung_cancer]
 exit on or before:  failure
            weight:  [pweight=swgt2]

------------------------------------------------------------------------------
      14407  total observations
       5126  event time missing (age_lung_cancer>=.)             PROBABLE ERROR
------------------------------------------------------------------------------
       9281  observations remaining, representing
         83  failures in single-record/single-failure data
     599691  total analysis time at risk and under observation
                                              at risk from t =         0
                                   earliest observed entry t =         0
                                        last observed exit t =        97
Although stset warns us that it is a probable error to have 5,126 observations with missing event
times, we can verify from the 1992 NHEFS documentation that there were indeed 9,281 participants
with complete information.
For our proportional hazards model, we pulled the risk factor information from the NHANES I and
1992 NHEFS datasets. Smoking status was taken from the 1992 NHEFS interview data, but we filled
in all but 132 missing values by using the general medical history supplement data in NHANES I.
Smoking status is represented by separate indicator variables for former smokers and current smokers;
the base comparison group is nonsmokers. Sex was determined using the 1992 NHEFS vitality data
and is represented by an indicator variable for males. Place-of-residence information was taken from
the medical history questionnaire in NHANES I and is represented by separate indicator variables for
rural and heavily populated (more than 1 million people) urban residences; the base comparison group
is urban residences with populations of fewer than 1 million people.
. svy: stcox former_smoker smoker male urban1 rural
(running stcox on estimation sample)

Survey: Cox regression

Number of strata   =        35          Number of obs     =       9149
Number of PSUs     =       105          Population size   =  151327827
                                        Design df         =         70
                                        F(   5,     66)   =      14.07
                                        Prob > F          =     0.0000

-------------------------------------------------------------------------------
              |             Linearized
           _t | Haz. Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
former_smoker |   2.788113   .6205102     4.61   0.000     1.788705    4.345923
       smoker |   7.849483   2.593249     6.24   0.000     4.061457    15.17051
         male |   1.187611   .3445315     0.59   0.555     .6658757    2.118142
       urban1 |   .8035074   .3285144    -0.54   0.594     .3555123    1.816039
        rural |   1.581674   .5281859     1.37   0.174     .8125799    3.078702
-------------------------------------------------------------------------------
From the above results, we can see that both former and current smokers have a significantly
higher risk for developing lung cancer than do nonsmokers.
svy: tabulate can be used to produce one-way and two-way tables with survey data and can
produce survey-adjusted tests of independence for two-way contingency tables; see [SVY] svy: tabulate
oneway and [SVY] svy: tabulate twoway.
Example 7: Two-way tables for survey data

With data from the Second National Health and Nutrition Examination Survey (NHANES II)
(McDowell et al. 1981), we use svy: tabulate to produce a two-way table of cell proportions along
with their standard errors and confidence intervals (the survey design characteristics have already
been svyset). We also use the format() option to get svy: tabulate to report the cell values
and marginals to four decimal places.
. use http://www.stata-press.com/data/r13/nhanes2b
. svy: tabulate race diabetes, row se ci format(%7.4f)
(running tabulate on estimation sample)

Number of strata   =        31          Number of obs     =      10349
Number of PSUs     =        62          Population size   =  117131111
                                        Design df         =         31

-------------------------------------------------------------
1=white,  |
2=black,  |             diabetes, 1=yes, 0=no
3=other   |               0                1            Total
----------+--------------------------------------------------
    White |          0.9680           0.0320           1.0000
          |        (0.0020)         (0.0020)
          | [0.9638,0.9718]  [0.0282,0.0362]
          |
    Black |          0.9410           0.0590           1.0000
          |        (0.0061)         (0.0061)
          | [0.9271,0.9523]  [0.0477,0.0729]
          |
    Other |          0.9797           0.0203           1.0000
          |        (0.0076)         (0.0076)
          | [0.9566,0.9906]  [0.0094,0.0434]
          |
    Total |          0.9658           0.0342           1.0000
          |        (0.0018)         (0.0018)
          | [0.9619,0.9693]  [0.0307,0.0381]
-------------------------------------------------------------
  Key:  row proportions
        (linearized standard errors of row proportions)
        [95% confidence intervals for row proportions]

  Pearson:
    Uncorrected   chi2(2)         =   21.3483
    Design-based  F(1.52, 47.26)  =   15.0056     P = 0.0000
svy: tabulate has many options, such as the format() option, for controlling how the table
looks. See [SVY] svy: tabulate twoway for a discussion of the different design-based and unadjusted
tests of association.
All the standard postestimation commands (for example, estimates, lincom, margins, nlcom,
test, testnl) are also available after svy.
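A short sketch of two of these commands after a survey regression, assuming the highschool dataset from example 2. The multiplier 10 is an arbitrary illustration and not taken from the manual.

```stata
* Sketch only: standard postestimation commands after a svy estimation.
use http://www.stata-press.com/data/r13/highschool, clear
svy: regress weight height

* Adjusted Wald test that the height coefficient is zero
test height

* Linear combination: expected change in weight for a 10-unit change in height
lincom 10*height
```

Both commands pick up the design-based variance estimates and the design degrees of freedom from the svy results, so no survey options need to be repeated.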
Example 8: Comparing means

Going back to our high school survey data in example 2, we estimate the mean of weight (in
pounds) for each subpopulation identified by the categories of the sex variable (male and female).
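. use http://www.stata-press.com/data/r13/highschool
. svy: mean weight, over(sex)

Number of strata =      50          Number of obs    =      4071
Number of PSUs   =     100          Population size  =   8000000
                                    Design df        =        50

         male: sex = male
       female: sex = female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
weight       |
        male |   175.4809   1.116802      173.2377    177.7241
      female |    146.204   .9004157      144.3955    148.0125
--------------------------------------------------------------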
Here we use the test command to test the hypothesis that the average male is 30 pounds heavier
than the average female; from the results, we cannot reject this hypothesis at the 5% level.
. test [weight]male - [weight]female = 30

Adjusted Wald test

 ( 1)  [weight]male - [weight]female = 30

       F(  1,    50) =    0.23
            Prob > F =    0.6353
estat has specific subroutines for use after svy; see [SVY] estat.
estat svyset reports the survey design settings used to produce the current estimation results.
estat effects and estat lceffects report a table of design and misspecification effects
for point estimates and linear combinations of point estimates, respectively.
estat size reports a table of sample and subpopulation sizes after svy: mean, svy: proportion, svy: ratio, and svy: total.
estat sd reports subpopulation standard deviations on the basis of the estimation results from
mean and svy: mean.
estat strata reports the number of singleton and certainty strata within each sampling stage.
estat cv reports the coefficient of variation for each coefficient in the current estimation results.
estat gof reports a goodness-of-fit test for binary response models using survey data.
Example 9: Design effects

Here we use estat effects to report the design effects DEFF and DEFT for the mean estimates
from the previous example.
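. estat effects

         male: sex = male
       female: sex = female

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
weight       |
        male |   175.4809   1.116802    2.61016   1.61519
      female |    146.204   .9004157     1.7328   1.31603
----------------------------------------------------------
Note: weights must represent population totals for deff to
      be correct when using an FPC; however, deft is
      invariant to the scale of weights.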
Now we use estat lceffects to report the design effects DEFF and DEFT for the difference of
the mean estimates from the previous example.
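. estat lceffects [weight]male - [weight]female

 ( 1)  [weight]male - [weight]female = 0

----------------------------------------------------------
        Mean |      Coef.   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
         (1) |   29.27691   1.515201    2.42759   1.55768
----------------------------------------------------------
Note: weights must represent population totals for deff to
      be correct when using an FPC; however, deft is
      invariant to the scale of weights.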
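estat effects can report the misspecification effects alongside the design effects in one table. The following sketch assumes the highschool dataset and the deff, deft, meff, and meft options of estat effects; see [SVY] estat for their definitions.

```stata
* Sketch only: design effects and misspecification effects in one table.
use http://www.stata-press.com/data/r13/highschool, clear
svy: mean weight, over(sex)

* DEFF and DEFT compare the design-based variance with a simple random
* sampling benchmark; MEFF and MEFT compare it with a misspecified
* (design-ignoring) variance estimate
estat effects, deff deft meff meft
```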
The svy brr prefix command produces point and variance estimates by using the BRR method;
see [SVY] svy brr. BRR was first introduced by McCarthy (1966, 1969a, and 1969b) as a method of
variance estimation for designs with two PSUs in every stratum. The BRR variance estimator tends to
give more reasonable variance estimates for this design than the linearized variance estimator, which
can result in large values and undesirably wide confidence intervals.
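To make the idea of balanced half-samples concrete, here is a minimal Python sketch for a design with two PSUs in each of three strata. The PSU totals are made-up numbers, and the Hadamard matrix is the standard 4 x 4 Sylvester construction. Each replicate keeps one PSU per stratum and doubles its weight. For a total, a fully balanced set of replicates reproduces the usual with-replacement linearized variance; this is an illustration of the technique, not Stata's implementation.

```python
# Hypothetical weighted PSU totals: (PSU 1, PSU 2) in each of three strata.
psu_totals = [(10.0, 14.0), (7.0, 4.0), (20.0, 25.0)]

# 4x4 Sylvester Hadamard matrix; columns 1..3 are mutually orthogonal and sum to zero.
hadamard = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

full_total = sum(a + b for a, b in psu_totals)

def replicate_total(row):
    """Half-sample total: keep one PSU per stratum and double its weight."""
    total = 0.0
    for h, (a, b) in enumerate(psu_totals):
        chosen = a if row[h + 1] == 1 else b  # column 0 (all ones) is skipped
        total += 2.0 * chosen
    return total

replicates = [replicate_total(row) for row in hadamard]

# BRR variance: average squared deviation of the replicates from the full-sample total.
brr_variance = sum((t - full_total) ** 2 for t in replicates) / len(replicates)

# With two PSUs per stratum, the linearized variance of a total is sum_h (y_h1 - y_h2)^2.
linearized_variance = sum((a - b) ** 2 for a, b in psu_totals)
print(brr_variance, linearized_variance)
```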
The svy jackknife prefix command produces point and variance estimates by using the jackknife
replication method; see [SVY] svy jackknife. The jackknife is a data-driven variance estimation
method that can be used with model-fitting procedures for which the linearized variance estimator is
not implemented, even though a linearized variance estimator is theoretically possible to derive (Shao
and Tu 1995).
To protect the privacy of survey participants, public survey datasets may contain replicate-weight
variables instead of variables that identify the PSUs and strata. These replicate-weight variables can be
used with the appropriate replication method for variance estimation instead of the linearized variance
estimator; see [SVY] svyset.
The svy brr and svy jackknife prefix commands can be used with those commands that may
not be fully supported by svy but are compatible with the BRR and the jackknife replication methods.
They can also be used to produce point estimates for expressions of estimation results from a prefixed
command.
The svy bootstrap and svy sdr prefix commands work only with replicate weights. Both assume
that you have obtained these weight variables externally.
The svy bootstrap prefix command produces variance estimates that have been adjusted for
bootstrap sampling. Bootstrap sampling of complex survey data has become more popular in recent years
and is the variance-estimation method used in the National Population Health Survey conducted by
Statistics Canada; see [SVY] svy bootstrap and [SVY] variance estimation for more details.
The svy sdr prefix command produces variance estimates that implement successive difference
replication (SDR), first introduced by Fay and Train (1995) as a method for annual demographic
supplements to the Current Population Survey. This method is typically applied to systematic samples
where the observed sampling units follow a natural order; see [SVY] svy sdr and [SVY] variance
estimation for more details.
Example 10: BRR and replicate-weight variables

The survey design for the NHANES II data (McDowell et al. 1981) is specifically suited to BRR:
there are two PSUs in every stratum.
. use http://www.stata-press.com/data/r13/nhanes2
. svydescribe
      pweight: finalwgt
          VCE: linearized
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

                                      #Obs per Unit
 Stratum    #Units     #Obs       min      mean       max
       1         2      380       165     190.0       215
       2         2      185        67      92.5       118
       3         2      348       149     174.0       199
       4         2      460       229     230.0       231
       5         2      252       105     126.0       147
 (output omitted)
      29         2      503       215     251.5       288
      30         2      365       166     182.5       199
      31         2      308       143     154.0       165
      32         2      450       211     225.0       239

      31        62    10351        67     167.0       288
Here is a privacy-conscious dataset equivalent to the one above; all the variables and values
remain, except that strata and psu are replaced with BRR replicate-weight variables. The BRR
replicate-weight variables are already svyset, and the default method for variance estimation is
vce(brr).
. use http://www.stata-press.com/data/r13/nhanes2brr
. svyset
pweight: finalwgt
VCE: brr
MSE: off
brrweight: brr_1 brr_2 brr_3 brr_4 brr_5 brr_6 brr_7 brr_8 brr_9 brr_10
brr_11 brr_12 brr_13 brr_14 brr_15 brr_16 brr_17 brr_18 brr_19
brr_29 brr_30 brr_31 brr_32
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>
Suppose that we were interested in the population ratio of weight to height. Here we use total
to estimate the population totals of weight and height and the svy brr prefix to estimate their ratio
and variance; we use total instead of ratio (which is otherwise preferable here) to show how to
specify an expression when using svy: brr.
. svy brr WtoH = (_b[weight]/_b[height]): total weight height
(running total on estimation sample)
BRR replications (32)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
................................
BRR results                         Number of obs   =      10351
                                    Population size =  117157513
                                    Replications    =         32
                                    Design df       =         31
      command: total weight height
         WtoH: _b[weight]/_b[height]

                              BRR
                   Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
        WtoH    .4268116   .0008904    479.36    0.000     .4249957    .4286276

The variance estimation methods that Stata uses are discussed in [SVY] variance estimation.
Subpopulation estimation involves computing point and variance estimates for part of the population.
This method is not the same as restricting the estimation sample to the collection of observations
within the subpopulation because variance estimation for survey data measures sample-to-sample
variability, assuming that the same survey design is used to collect the data. Use the subpop() option
of the svy prefix to perform subpopulation estimation, and use if and in only when you need to
make restrictions on the estimation sample; see [SVY] subpopulation estimation.
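The distinction can be seen in a small numeric sketch. The Python code below uses made-up data from a single-stage design with equal weights and applies the standard with-replacement linearized variance formula for a total. Treating the subpopulation as a domain (all sampled units retained, nonmembers contributing zero) and restricting the sample to members give the same point estimate but different variance estimates. The function name and data are hypothetical.

```python
# Toy single-stage sample: equal weights, PSUs are the observations (hypothetical data).
weights = [10.0] * 6
y = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]
in_sub = [1, 0, 1, 1, 0, 0]

def total_and_variance(z):
    """Estimated total and its with-replacement variance: n/(n-1) * sum((z_i - zbar)^2)."""
    n = len(z)
    zbar = sum(z) / n
    return sum(z), n / (n - 1) * sum((zi - zbar) ** 2 for zi in z)

# Subpopulation approach: keep all n observations and zero out nonmembers.
z_all = [w * yi * s for w, yi, s in zip(weights, y, in_sub)]
total_subpop, var_subpop = total_and_variance(z_all)

# Restricting the sample: drop nonmembers, which treats the subsample size as fixed.
z_if = [w * yi for w, yi, s in zip(weights, y, in_sub) if s]
total_if, var_if = total_and_variance(z_if)

print(total_subpop, var_subpop)
print(total_if, var_if)
```

The restricted-sample variance ignores the sample-to-sample variability in how many subpopulation members are drawn, which is why it differs from the subpopulation variance.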
Example 11: Subpopulation estimation

Here we will use our svyset high school data to model the association between weight and
height in the subpopulation of male high school seniors. First, we describe the sex variable to
determine how to identify the males in the dataset. We then use label list to verify that the variable
label agrees with the value labels.
. describe sex
              storage   display    value
variable name   type    format     label      variable label
sex             byte    %9.0g      sex        1=male, 2=female
. label list sex
sex:
           1 male
           2 female
Here we generate a variable named male so that we can easily identify the male high school
seniors. We specified if !missing(sex); doing so will cause the generated male variable to contain
a missing value at each observation where the sex variable does. This is done on purpose (although it
is not necessary if sex is free of missing values) because missing values should not be misinterpreted
to imply female.
. generate male = sex == 1 if !missing(sex)
Now we specify subpop(male) as an option to the svy prefix in our model fit.
. svy, subpop(male): regress weight height
Number of strata   =        50       Number of obs      =      4071
Number of PSUs     =       100       Population size    =   8000000
                                     Subpop. no. of obs =      1938
                                     Subpop. size       = 3848021.4
                                     Design df          =        50
                                     F(   1,     50)    =    225.38
                                     Prob > F           =    0.0000
                                     R-squared          =    0.2347

                          Linearized
      weight        Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
      height     .7632911   .0508432     15.01    0.000     .6611696    .8654127
       _cons    -168.6532    22.5708     -7.47    0.000     -213.988   -123.3184
Although the table of estimation results contains the same columns as earlier, svy reports some
extra subpopulation information in the header. Here the extra header information tells us that 1,938
of the 4,071 sampled high school seniors are male, and the estimated number of male high school
seniors in the population is 3,848,021 (rounded down).
Direct standardization is an estimation method that allows comparing rates that come from different
frequency distributions; see [SVY] direct standardization. In direct standardization, estimated rates
(means, proportions, and ratios) are adjusted according to the frequency distribution of a standard
population. The standard population is partitioned into categories, called standard strata. The stratum
frequencies for the standard population are called standard weights. In the standardizing frequency
distribution, the standard strata are most commonly identified by demographic information such as
age, sex, and ethnicity. The standardized rate estimate is the weighted sum of unadjusted rates, where
the weights are the relative frequencies taken from the standardizing frequency distribution. Direct
standardization is available with svy: mean, svy: proportion, and svy: ratio.
Example 12: Standardized rates

Table 3.12-6 of Korn and Graubard (1999, 156) contains enumerated data for two districts of
London for the years 1840-1841. The age variable identifies the age groups in 5-year increments,
bgliving contains the number of people living in the Bethnal Green district at the beginning of
1840, bgdeaths contains the number of people who died in Bethnal Green that year, hsliving
contains the number of people living in St. George's Hanover Square at the beginning of 1840, and
hsdeaths contains the number of people who died in Hanover Square that year.
. use http://www.stata-press.com/data/r13/stdize, clear
. list, noobs sep(0) sum

      age    bgliving   bgdeaths   hsliving   hsdeaths
      0-5       10739        850       5738        463
     5-10        9180         76       4591         55
    10-15        8006         38       4148         28
    15-20        7096         37       6168         36
    20-25        6579         38       9440         68
    25-30        5829         51       8675         78
    30-35        5749         51       7513         64
    35-40        4490         56       5091         78
    40-45        4385         47       4930         85
    45-50        2955         66       2883         66
    50-55        2995         74       2711         77
    55-60        1644         67       1275         55
    60-65        1835         64       1469         61
    65-70        1042         64        649         55
    70-75         879         68        619         58
    75-80         366         47        233         51
    80-85         173         39        136         20
    85-90          71         22         48         15
    90-95          21          6         10          4
   95-100           4          2          2          1
  unknown          50          1        124          0

      Sum       74088       1764      66453       1418
We can use svy: ratio to compute the death rates for each district in 1840. Because this
dataset is identified as census data, we will create an FPC variable that will contain a sampling
rate of 100%. This method will result in zero standard errors, which are interpreted to mean no
variability; this is appropriate because our point estimates came from the entire population.
. gen fpc = 1
. svyset, fpc(fpc)
      pweight: <none>
          VCE: linearized
     Strata 1: <one>
        FPC 1: fpc
. svy: ratio (Bethnal: bgdeaths/bgliving) (Hanover: hsdeaths/hsliving)
(running ratio on estimation sample)
Survey: Ratio estimation
Number of strata =       1        Number of obs    =        21
Number of PSUs   =      21        Population size  =        21
                                  Design df        =        20
      Bethnal: bgdeaths/bgliving
      Hanover: hsdeaths/hsliving

                          Linearized
                    Ratio   Std. Err.     [95% Conf. Interval]
     Bethnal     .0238095          0             .           .
     Hanover     .0213384          0             .           .

Note: zero standard errors because of 100% sampling rate
      detected for FPC in the first stage.
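The point estimates above are simply total deaths divided by total population, and the zero standard errors follow from the finite population correction. The Python sketch below (not Stata code) reproduces both by hand, assuming the usual correction factor of one minus the sampling rate; the function name is hypothetical and the totals are the Sum row of the listing.

```python
# Column sums from the listing: deaths and population at the start of 1840.
deaths = {"Bethnal": 1764, "Hanover": 1418}
living = {"Bethnal": 74088, "Hanover": 66453}

# Crude death rate = total deaths / total living.
crude_rate = {d: deaths[d] / living[d] for d in deaths}

def fpc_factor(sampling_rate):
    """Finite population correction multiplying a without-replacement variance."""
    return 1.0 - sampling_rate

# fpc = 1 declares a 100% sampling rate, so every variance is multiplied by zero.
census_factor = fpc_factor(1.0)

for district, rate in crude_rate.items():
    print(f"{district}: {rate:.7f}")
print("variance multiplier:", census_factor)
```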
The death rates are 2.38% for Bethnal Green and 2.13% for St. George's Hanover Square. These
observed death rates are not really comparable because they come from two different age distributions.
We can standardize based on the age distribution from Bethnal Green. Here age identifies our standard
strata and bgliving contains the associated population sizes.
. svy: ratio (Bethnal: bgdeaths/bgliving) (Hanover: hsdeaths/hsliving),
> stdize(age) stdweight(bgliving)
Number of strata =       1        Number of obs    =        21
Number of PSUs   =      21        Population size  =        21
N. of std strata =      21        Design df        =        20

                          Linearized
                    Ratio   Std. Err.     [95% Conf. Interval]
     Bethnal     .0238095          0             .           .
     Hanover     .0266409          0             .           .

The standardized death rate for St. George's Hanover Square, 2.66%, is larger than the death rate
for Bethnal Green.
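The standardized rates can be reproduced by hand as a weighted sum of the age-specific rates, with weights equal to the Bethnal Green age distribution. The following Python sketch (not Stata code) uses the counts from the listing earlier in this example; the variable names mirror the dataset, and the computation assumes only the weighted-sum definition given in the text.

```python
# Counts copied from the listing (age groups 0-5 through 95-100, then unknown).
bgliving = [10739, 9180, 8006, 7096, 6579, 5829, 5749, 4490, 4385, 2955, 2995,
            1644, 1835, 1042, 879, 366, 173, 71, 21, 4, 50]
bgdeaths = [850, 76, 38, 37, 38, 51, 51, 56, 47, 66, 74,
            67, 64, 64, 68, 47, 39, 22, 6, 2, 1]
hsliving = [5738, 4591, 4148, 6168, 9440, 8675, 7513, 5091, 4930, 2883, 2711,
            1275, 1469, 649, 619, 233, 136, 48, 10, 2, 124]
hsdeaths = [463, 55, 28, 36, 68, 78, 64, 78, 85, 66, 77,
            55, 61, 55, 58, 51, 20, 15, 4, 1, 0]

# Standard weights: relative frequencies of the Bethnal Green age distribution.
total_std = sum(bgliving)
std_weights = [n / total_std for n in bgliving]

def standardized_rate(deaths, living):
    """Weighted sum of stratum-specific rates using the standard weights."""
    return sum(w * d / n for w, d, n in zip(std_weights, deaths, living))

bethnal_std = standardized_rate(bgdeaths, bgliving)
hanover_std = standardized_rate(hsdeaths, hsliving)
print(f"Bethnal: {bethnal_std:.7f}  Hanover: {hanover_std:.7f}")
```

Standardizing Bethnal Green to its own age distribution leaves its rate unchanged, which is why only the Hanover Square estimate moves.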
Poststratification is a method for adjusting the sampling weights, usually to account for underrepresented groups in the population; see [SVY] poststratification. This method usually reduces the bias
due to nonresponse and underrepresented groups in the population. It also tends to
result in smaller variance estimates. Poststratification is available for all survey estimation commands
and is specified using svyset; see [SVY] svyset.
Example 13: Poststratified mean

Levy and Lemeshow (2008, sec. 6.6) give an example of poststratification by using simple survey
data from a veterinarian's client list. The data in poststrata.dta were collected using simple
random sampling (SRS) without replacement. The totexp variable contains the total expenses to the
client, type identifies the cats and dogs, postwgt contains the poststratum sizes (450 for cats and
850 for dogs), and fpc contains the total number of clients (850 + 450 = 1300).
. use http://www.stata-press.com/data/r13/poststrata, clear
. svyset, poststrata(type) postweight(postwgt) fpc(fpc)
      pweight: <none>
          VCE: linearized
   Poststrata: type
   Postweight: postwgt
     Strata 1: <one>
        FPC 1: fpc
. svy: mean totexp
Number of strata =       1        Number of obs    =        50
Number of PSUs   =      50        Population size  =      1300
N. of poststrata =       2        Design df        =        49

                          Linearized
                     Mean   Std. Err.     [95% Conf. Interval]
      totexp     40.11513   1.163498      37.77699    42.45327
The mean total expenses is $40.12 with a standard error of $1.16. In the following, we omit the
poststratification information from svyset, resulting in mean total expenses of $39.73 with standard
error $2.22. The difference between the mean estimates is explained by the facts that expenses tend
to be larger for dogs than for cats and that the dogs were slightly underrepresented in the sample
(850/1,300 ≈ 0.65 for the population; 32/50 = 0.64 for the sample). This reasoning also explains why
the variance estimate from the poststratified mean is smaller than the one that was not poststratified.
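A short sketch may help show what the adjustment does to the weights. Assuming equal initial weights under SRS, each poststratified weight is the poststratum size divided by the number of sampled units in that poststratum. The Python code below (not Stata code) uses the counts stated above; the figure of 18 sampled cats is inferred from 32 dogs among 50 sampled clients.

```python
# Poststratum sizes from postwgt and sample counts implied by 32 dogs out of 50 clients.
poststratum_size = {"dogs": 850, "cats": 450}
sample_count = {"dogs": 32, "cats": 50 - 32}

# Adjusted weight per sampled client (assumes equal initial weights under SRS).
adjusted_weight = {k: poststratum_size[k] / sample_count[k] for k in poststratum_size}

# The adjusted weights sum to the population size over the whole sample.
weighted_total = sum(adjusted_weight[k] * sample_count[k] for k in sample_count)

population_share_dogs = poststratum_size["dogs"] / sum(poststratum_size.values())
sample_share_dogs = sample_count["dogs"] / sum(sample_count.values())
print(adjusted_weight, weighted_total)
print(round(population_share_dogs, 4), sample_share_dogs)
```

Dogs receive a slightly larger weight than cats, which pulls the poststratified mean toward the (larger) dog expenses.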
. svyset, fpc(fpc)
      pweight: <none>
          VCE: linearized
     Strata 1: <one>
        FPC 1: fpc
. svy: mean totexp
Number of strata =       1        Number of obs    =        50
Number of PSUs   =      50        Population size  =        50
                                  Design df        =        49

                          Linearized
                     Mean   Std. Err.     [95% Conf. Interval]
      totexp      39.7254   2.221747      35.26063    44.19017

The ml command can be used to fit a model by the method of maximum likelihood. When the
svy option is specified, ml performs maximum pseudolikelihood, applying sampling weights and
design-based linearization automatically; see [R] ml and Gould, Pitblado, and Poi (2010).
Example 14
The ml command requires a program that computes likelihood values to perform maximum
likelihood. Here is a likelihood evaluator used in Gould, Pitblado, and Poi (2010) to fit linear
regression models using the likelihood from the normal distribution.
program mynormal_lf
        version 13
        args lnf mu lnsigma
        quietly replace lnf = ln(normalden($ML_y1,`mu',exp(`lnsigma')))
end
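For readers who want to see the quantity the evaluator computes, here is a minimal Python sketch (not Stata code) of the per-observation normal log density and of a weighted sum of those values, which is what a log pseudolikelihood amounts to when sampling weights are applied. The function names are hypothetical, and the weighting shown is the generic definition rather than a description of ml's internals.

```python
import math

def normal_logdensity(y, mu, lnsigma):
    """ln of the normal density with mean mu and standard deviation exp(lnsigma)."""
    sigma = math.exp(lnsigma)
    return -0.5 * math.log(2.0 * math.pi) - lnsigma - 0.5 * ((y - mu) / sigma) ** 2

def log_pseudolikelihood(ys, mus, lnsigma, weights):
    """Sampling-weighted sum of the per-observation log densities."""
    return sum(w * normal_logdensity(y, m, lnsigma)
               for y, m, w in zip(ys, mus, weights))

# Standard normal density at zero: ln(1/sqrt(2*pi)).
print(normal_logdensity(0.0, 0.0, 0.0))
```

Parameterizing the scale as lnsigma keeps the standard deviation positive without constraining the optimizer.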
Back in example 5, we fit a linear regression model using the high school survey data. Here we
use ml and mynormal_lf to fit the same survey regression model.
. ml model lf mynormal_lf (mu: weight = height) /lnsigma, svy
. ml max
initial:       log pseudolikelihood =     -<inf>  (could not be evaluated)
feasible:      log pseudolikelihood = -7.301e+08
rescale:       log pseudolikelihood =  -51944380
rescale eq:    log pseudolikelihood =  -47565331
Iteration 0:   log pseudolikelihood =  -47565331
Iteration 1:   log pseudolikelihood =  -41221759
Iteration 2:   log pseudolikelihood =  -41218957
Iteration 3:   log pseudolikelihood =  -41170544
Iteration 4:   log pseudolikelihood =  -41145411
Iteration 5:   log pseudolikelihood =  -41123161
Iteration 6:   log pseudolikelihood =  -41103001
Iteration 7:   log pseudolikelihood =  -41083551
Iteration 8:   log pseudolikelihood =  -38467683
Iteration 9:   log pseudolikelihood =  -38329015
Iteration 10:  log pseudolikelihood =  -38328739
Iteration 11:  log pseudolikelihood =  -38328739
(six of the iterations above are flagged "(not concave)" and a later one "(backed up)")

Number of strata   =        50       Number of obs      =      4071
Number of PSUs     =       100       Population size    =   8000000
                                     Design df          =        50
                                     F(   1,     50)    =    593.99
                                     Prob > F           =    0.0000

                          Linearized
      weight        Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
mu
      height     .7163115   .0293908     24.37    0.000     .6572784    .7753447
       _cons    -149.6183   12.57265    -11.90    0.000    -174.8712   -124.3654
lnsigma
       _cons     3.372154   .0180777    186.54    0.000     3.335844    3.408464
svymarkout is a programmer's command that resets the values in a variable that identifies the
estimation sample, dropping observations for which any of the survey characteristic variables contain
missing values. This tool is most helpful for developing estimation commands that use ml to fit
models using maximum pseudolikelihood directly, instead of relying on the svy prefix.
Video example
Basic introduction to the analysis of complex survey data in Stata
Acknowledgments
Many of the svy commands were developed in collaboration with John L. Eltinge of the Bureau
of Labor Statistics. We thank him for his invaluable assistance.
We thank Wayne Johnson of the National Center for Health Statistics for providing the NHANES II
dataset.
We thank Nicholas Winter of the Politics Department at the University of Virginia for his diligent
efforts to keep Stata up to date with mainstream variance estimation methods for survey data, as well
as for providing versions of svy brr and svy jackknife.

William Gemmell Cochran (1909-1980) was born in Rutherglen, Scotland, and educated at the
Universities of Glasgow and Cambridge. He accepted a post at Rothamsted before finishing his
doctorate. Cochran emigrated to the United States in 1939 and worked at Iowa State, North
Carolina State, Johns Hopkins, and Harvard. He made many major contributions across several
fields of statistics, including experimental design, the analysis of counted data, sample surveys,
and observational studies, and was author or coauthor (with Gertrude M. Cox and George W.
Snedecor) of various widely used texts.
Leslie Kish (1910-2000) was born in Poprad, Hungary, and entered the United States with his
family in 1926. He worked as a lab assistant at the Rockefeller Institute for Medical Research and
studied at the College of the City of New York, fighting in the Spanish Civil War before receiving
his first degree in mathematics. Kish worked for the Bureau of the Census, the Department of
Agriculture, the Army Air Corps, and the University of Michigan. He carried out pioneering
work in the theory and practice of survey sampling, including design effects, BRR, response
errors, rolling samples and censuses, controlled selection, multipurpose designs, and small-area
estimation.
References
Cochran, W. G. 1977. Sampling Techniques. 3rd ed. New York: Wiley.
Cox, C. S., M. E. Mussolino, S. T. Rothwell, M. A. Lane, C. D. Golden, J. H. Madans, and J. J. Feldman. 1997.
Plan and operation of the NHANES I Epidemiologic Followup Study, 1992. In Vital and Health Statistics, series 1,
no. 35. Hyattsville, MD: National Center for Health Statistics.
Engel, A., R. S. Murphy, K. Maurer, and E. Collins. 1978. Plan and operation of the HANES I augmentation survey
of adults 25-74 years: United States 1974-75. In Vital and Health Statistics, series 1, no. 14. Hyattsville, MD:
National Center for Health Statistics.
Fay, R. E., and G. F. Train. 1995. Aspects of survey and model-based postcensal estimation of income and poverty
characteristics for states and counties. In Proceedings of the Government Statistics Section, 154-159. American
Statistical Association.
Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College
Station, TX: Stata Press.
Heeringa, S. G., B. T. West, and P. A. Berglund. 2010. Applied Survey Data Analysis. Boca Raton, FL: Chapman
& Hall/CRC.
Kish, L. 1965. Survey Sampling. New York: Wiley.
Korn, E. L., and B. I. Graubard. 1999. Analysis of Health Surveys. New York: Wiley.
Kreuter, F., and R. Valliant. 2007. A survey on survey statistics: What is done and can be done in Stata. Stata Journal
7: 1-21.
Levy, P. S., and S. A. Lemeshow. 2008. Sampling of Populations: Methods and Applications. 4th ed. Hoboken, NJ:
Wiley.
McCarthy, P. J. 1966. Replication: An approach to the analysis of data from complex surveys. In Vital and Health
Statistics, series 2. Hyattsville, MD: National Center for Health Statistics.
McCarthy, P. J. 1969a. Pseudoreplication: Further evaluation and application of the balanced half-sample technique. In Vital
and Health Statistics, series 2. Hyattsville, MD: National Center for Health Statistics.
McCarthy, P. J. 1969b. Pseudo-replication: Half-samples. Revue de l'Institut International de Statistique 37: 239-264.
McDowell, A., A. Engel, J. T. Massey, and K. Maurer. 1981. Plan and operation of the Second National Health and
Nutrition Examination Survey, 1976-1980. Vital and Health Statistics 1(15): 1-144.
Miller, H. W. 1973. Plan and operation of the Health and Nutrition Examination Survey: United States 1971-1973.
Hyattsville, MD: National Center for Health Statistics.
Scheaffer, R. L., W. Mendenhall, III, R. L. Ott, and K. G. Gerow. 2012. Elementary Survey Sampling. 7th ed.
Boston: Brooks/Cole.
Shao, J., and D. Tu. 1995. The Jackknife and Bootstrap. New York: Springer.
Skinner, C. J., D. Holt, and T. M. F. Smith, ed. 1989. Analysis of Complex Surveys. New York: Wiley.
Stuart, A. 1984. The Ideas of Sampling. 3rd ed. New York: Griffin.
Thompson, S. K. 2012. Sampling. 3rd ed. Hoboken, NJ: Wiley.
Williams, B. 1978. A Sampler on Sampling. New York: Wiley.
Wolter, K. M. 2007. Introduction to Variance Estimation. 2nd ed. New York: Springer.
Also see
[SVY] svyset Declare survey design for dataset
[SVY] svy The survey prefix command
[SVY] svy estimation Estimation commands for survey data
[P] robust Robust variance estimates
Title
bootstrap options More options for bootstrap variance estimation
Syntax
Description
Options
Also see
Syntax
bootstrap_options         Description
SE
  mse                     use MSE formula for variance
  nodots                  suppress replication dots
  bsn(#)                  bootstrap mean-weight adjustment

  saving(filename, ...)   save results to filename
  verbose                 display the full table legend
  noisily                 display any output from command
  trace                   trace command
  title(text)             use text as the title for results
  nodrop                  do not drop observations
  reject(exp)             identify invalid results
saving, verbose, noisily, trace, title(), nodrop, and reject() are not shown in the dialog boxes for estimation
commands.
Description
svy accepts more options when performing bootstrap variance estimation. See [SVY] svy bootstrap
for a complete discussion.
Options
SE
mse specifies that svy compute the variance by using deviations of the replicates from the observed
value of the statistics based on the entire dataset. By default, svy computes the variance by using
deviations of the replicates from their mean.
nodots suppresses display of the replication dots. By default, one dot character is printed for each
successful replication. A red x is displayed if command returns with an error, and e is displayed
if at least one of the values in the exp list is missing.
bsn(#) specifies that # bootstrap replicate-weight variables were used to generate each bootstrap
mean-weight variable specified in the bsrweight() option of svyset. The bsn() option of
bootstrap overrides the bsn() option of svyset; see [SVY] svyset.
saving(), verbose, noisily, trace, title(), nodrop, reject(); see [SVY] svy bootstrap.
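The difference between the default variance and the mse variance described above can be sketched numerically. The Python code below (not Stata code) uses made-up replicate estimates and ignores any method-specific scaling constant; it shows only that the two formulas differ in the point around which deviations are taken, and that the MSE form exceeds the default by the squared gap between the replicate mean and the full-sample estimate.

```python
# Hypothetical replicate estimates and full-sample estimate (not from any dataset).
replicates = [1.0, 2.0, 3.0, 6.0]
observed = 2.5

n_rep = len(replicates)
replicate_mean = sum(replicates) / n_rep

# Default: deviations of the replicates from their own mean.
var_default = sum((r - replicate_mean) ** 2 for r in replicates) / n_rep

# mse: deviations of the replicates from the full-sample (observed) estimate.
var_mse = sum((r - observed) ** 2 for r in replicates) / n_rep

print(var_default, var_mse)
```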
Also see
[SVY] svy bootstrap Bootstrap for survey data
Title
brr options More options for BRR variance estimation
Syntax
Description
Options
Also see
Syntax
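brr_options               Description
SE
  mse                     use MSE formula for variance
  nodots                  suppress replication dots
  hadamard(matrix)        Hadamard matrix
  fay(#)                  Fay's adjustment

  saving(filename, ...)   save results to filename
  verbose                 display the full table legend
  noisily                 display any output from command
  trace                   trace command
  title(text)             use text as the title for results
  nodrop                  do not drop observations
  reject(exp)             identify invalid results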
saving(), verbose, noisily, trace, title(), nodrop, and reject() are not shown in the dialog boxes for
estimation commands.
Description
svy accepts more options when performing BRR variance estimation. See [SVY] svy brr for a
complete discussion.
Options
SE
hadamard(matrix) specifies the Hadamard matrix to be used to determine which PSUs are chosen
for each replicate.
fay(#) specifies Fay's adjustment. This option overrides the fay(#) option of svyset; see [SVY] svyset.
saving(), verbose, noisily, trace, title(), nodrop, reject(); see [SVY] svy brr.
Also see
[SVY] svy brr Balanced repeated replication for survey data
Title
direct standardization Direct standardization of means, proportions, and ratios
Description
References

Also see
Methods and formulas
Description
Direct standardization is an estimation method that allows comparing rates that come from
different frequency distributions. The mean, proportion, and ratio commands can estimate means,
proportions, and ratios by using direct standardization.
See [SVY] poststratification for a similar estimation method given population sizes for strata not
used in the sampling design.

In direct standardization, estimated rates (means, proportions, and ratios) are adjusted according
to the frequency distribution of a standard population. The standard population is partitioned into
categories, called standard strata. The stratum frequencies for the standard population are called
standard weights. In the standardizing frequency distribution, the standard strata are most commonly
identified by demographic information such as age, sex, and ethnicity.
Stata's mean, proportion, and ratio estimation commands have options for estimating means,
proportions, and ratios by using direct standardization. The stdize() option takes a variable that
identifies the standard strata, and the stdweight() option takes a variable that contains the standard
weights.
The standard strata (specified using stdize()) from the standardizing population are not the same
as the strata (specified using svyset's strata() option) from the sampling design. In the output
header, Number of strata is the number of strata in the first stage of the sampling design, and
N. of std strata is the number of standard strata.
In the following example, we use direct standardization to compare the death rates between two
districts of London in 1840.
Example 1: Standardized rates

Table 3.12-6 of Korn and Graubard (1999, 156) contains enumerated data for two districts of
London for the years 1840-1841. The age variable identifies the age groups in 5-year increments,
bgliving contains the number of people living in the Bethnal Green district at the beginning of
1840, bgdeaths contains the number of people who died in Bethnal Green that year, hsliving
contains the number of people living in St. George's Hanover Square at the beginning of 1840, and
hsdeaths contains the number of people who died in Hanover Square that year.

. use http://www.stata-press.com/data/r13/stdize
. list, noobs sep(0) sum

      age    bgliving   bgdeaths   hsliving   hsdeaths
      0-5       10739        850       5738        463
     5-10        9180         76       4591         55
    10-15        8006         38       4148         28
    15-20        7096         37       6168         36
    20-25        6579         38       9440         68
    25-30        5829         51       8675         78
    30-35        5749         51       7513         64
    35-40        4490         56       5091         78
    40-45        4385         47       4930         85
    45-50        2955         66       2883         66
    50-55        2995         74       2711         77
    55-60        1644         67       1275         55
    60-65        1835         64       1469         61
    65-70        1042         64        649         55
    70-75         879         68        619         58
    75-80         366         47        233         51
    80-85         173         39        136         20
    85-90          71         22         48         15
    90-95          21          6         10          4
   95-100           4          2          2          1
  unknown          50          1        124          0

      Sum       74088       1764      66453       1418
We can use svy: ratio to compute the death rates for each district in 1840. Because this
dataset is identified as census data, we will create an FPC variable that will contain a sampling
rate of 100%. This method will result in zero standard errors, which are interpreted to mean no
variability; this is appropriate because our point estimates came from the entire population.
. gen fpc = 1
. svyset, fpc(fpc)
      pweight: <none>
          VCE: linearized
     Strata 1: <one>
        FPC 1: fpc
. svy: ratio (Bethnal: bgdeaths/bgliving) (Hanover: hsdeaths/hsliving)
Number of strata =       1        Number of obs    =        21
Number of PSUs   =      21        Population size  =        21
                                  Design df        =        20

                          Linearized
                    Ratio   Std. Err.     [95% Conf. Interval]
     Bethnal     .0238095          0             .           .
     Hanover     .0213384          0             .           .
The death rates are 2.38% for Bethnal Green and 2.13% for St. George's Hanover Square. These
observed death rates are not really comparable because they come from two different age distributions.
We can standardize based on the age distribution from Bethnal Green. Here age identifies our standard
strata and bgliving contains the associated population sizes.
. svy: ratio (Bethnal: bgdeaths/bgliving) (Hanover: hsdeaths/hsliving),
> stdize(age) stdweight(bgliving)
Number of strata =       1        Number of obs    =        21
Number of PSUs   =      21        Population size  =        21
N. of std strata =      21        Design df        =        20

                          Linearized
                    Ratio   Std. Err.     [95% Conf. Interval]
     Bethnal     .0238095          0             .           .
     Hanover     .0266409          0             .           .

The standardized death rate for St. George's Hanover Square, 2.66%, is larger than the death rate
for Bethnal Green.
For this example, we could have used dstdize to compute the death rates; however, dstdize will
not compute the correct standard errors for survey data. Furthermore, dstdize is not an estimation
command, so test and the other postestimation commands are not available.
Technical note
The values in the variable supplied to the stdweight() option are normalized so that (1) is true;
see Methods and formulas. Thus the stdweight() variable can contain either population sizes or
population proportions for the associated standard strata.
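The effect of this normalization can be illustrated with a small Python sketch (not Stata code). The stratum totals and weights below are made up, and the function is a hypothetical helper; it only demonstrates that rescaling the standard weights by a constant leaves the standardized ratio unchanged once the weights are normalized to sum to one.

```python
def standardized_ratio(std_weights, numerators, denominators):
    """Weighted sum of stratum ratios after normalizing the weights to sum to 1."""
    total = sum(std_weights)
    return sum((w / total) * (y / x)
               for w, y, x in zip(std_weights, numerators, denominators))

# Hypothetical stratum totals for the numerator and denominator variables.
y_hat = [12.0, 9.0, 5.0]
x_hat = [400.0, 150.0, 50.0]

# The same standard population expressed as sizes and as proportions.
sizes = [600.0, 300.0, 100.0]
proportions = [0.6, 0.3, 0.1]

ratio_from_sizes = standardized_ratio(sizes, y_hat, x_hat)
ratio_from_proportions = standardized_ratio(proportions, y_hat, x_hat)
print(ratio_from_sizes, ratio_from_proportions)
```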

Methods and formulas
The following discussion assumes that you are already familiar with the topics discussed in
[SVY] variance estimation.
In direct standardization, a weighted sum of the point estimates from the standard strata is used to
produce an overall point estimate for the population. This section will show how direct standardization
affects the ratio estimator. The mean and proportion estimators are special cases of the ratio estimator.
Suppose that you used a complex survey design to sample $m$ individuals from a population of
size $M$. Let $D_g$ be the set of individuals in the sample that belong to the $g$th standard stratum, and
let $I_{D_g}(j)$ indicate if the $j$th individual is in standard stratum $g$, where

$$I_{D_g}(j) = \begin{cases} 1, & \text{if } j \in D_g \\ 0, & \text{otherwise} \end{cases}$$

Also let $L_D$ be the number of standard strata, and let $\pi_g$ be the proportion of the population that
belongs to standard stratum $g$.

$$\sum_{g=1}^{L_D} \pi_g = 1 \tag{1}$$

In subpopulation estimation, $\pi_g$ is set to zero if none of the individuals in standard stratum $g$ are in
the subpopulation. Then the standard stratum proportions are renormalized.
Let $y_j$ and $x_j$ be the items of interest and $w_j$ be the sampling weight for the $j$th sampled individual.
The estimator for the standardized ratio of $R = Y/X$ is

$$\hat{R}^D = \sum_{g=1}^{L_D} \pi_g \frac{\hat{Y}_g}{\hat{X}_g}$$

where

$$\hat{Y}_g = \sum_{j=1}^{m} I_{D_g}(j)\, w_j y_j$$

with $\hat{X}_g$ similarly defined.
For replication-based variance estimation, replicates of the standardized values are used in the
variance formulas.
The score variable for the linearized variance estimator of the standardized ratio is

$$z_j(\hat{R}^D) = \sum_{g=1}^{L_D} \pi_g\, I_{D_g}(j)\, \frac{\hat{X}_g y_j - \hat{Y}_g x_j}{\hat{X}_g^2}$$
This score variable was derived using the method described in [SVY] variance estimation and is a
direct result of the methods described in Deville (1999), Demnati and Rao (2004), and Shah (2004).
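The estimator and its score variable can be traced through a small example. The Python sketch below (not Stata code) uses made-up data for five sampled individuals in two standard strata and follows the formulas in this section directly. It also checks one consequence of the score formula: the weighted scores sum to zero, because within each standard stratum the two terms of the numerator cancel. All names and values are hypothetical.

```python
# Hypothetical sample: weights, items of interest, and standard-stratum membership.
w = [10.0, 10.0, 20.0, 20.0, 15.0]
y = [1.0, 0.0, 1.0, 1.0, 0.0]
x = [2.0, 3.0, 1.0, 4.0, 2.0]
stratum = [0, 0, 1, 1, 1]
pi = [0.4, 0.6]  # standard-stratum proportions, summing to 1 as in (1)

n_strata = len(pi)

# Weighted stratum totals: Y_g = sum_j I_g(j) w_j y_j, and X_g likewise.
Y = [sum(wj * yj for wj, yj, g in zip(w, y, stratum) if g == h) for h in range(n_strata)]
X = [sum(wj * xj for wj, xj, g in zip(w, x, stratum) if g == h) for h in range(n_strata)]

# Standardized ratio: sum_g pi_g * Y_g / X_g.
r_std = sum(pi[h] * Y[h] / X[h] for h in range(n_strata))

# Score for individual j: pi_g * (X_g y_j - Y_g x_j) / X_g^2 for j's own stratum g.
z = [pi[g] * (X[g] * yj - Y[g] * xj) / X[g] ** 2 for yj, xj, g in zip(y, x, stratum)]

weighted_score_sum = sum(wj * zj for wj, zj in zip(w, z))
print(r_std, weighted_score_sum)
```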
For the mean and proportion commands, the mean estimator is a ratio estimator with the
denominator variable equal to one ($x_j = 1$) and the proportion estimator is the mean estimator with
an indicator variable in the numerator ($y_j \in \{0, 1\}$).
References
Demnati, A., and J. N. K. Rao. 2004. Linearization variance estimators for survey data. Survey Methodology 30:
17-26.
Deville, J.-C. 1999. Variance estimation for complex statistics and estimators: Linearization and residual techniques.
Survey Methodology 25: 193-203.
Shah, B. V. 2004. Comment [on Demnati and Rao (2004)]. Survey Methodology 30: 29.
Also see
[SVY] poststratification Poststratification for survey data
[SVY] survey Introduction to survey commands
Title
estat Postestimation statistics for survey data
Syntax
Options for estat effects
Options for estat sd
Options for estat vce
Menu
Options for estat lceffects
Options for estat cv
References
Description
Options for estat size
Options for estat gof
Stored results
Also see
Syntax
Survey design characteristics
    estat svyset
Design and misspecification effects for point estimates
    estat effects [, estat_effects_options]
Design and misspecification effects for linear combinations of point estimates
    estat lceffects exp [, estat_lceffects_options]
Subpopulation sizes
    estat size [, estat_size_options]
Subpopulation standard-deviation estimates
    estat sd [, estat_sd_options]
Singleton and certainty strata
    estat strata
Coefficient of variation for survey data
    estat cv [, estat_cv_options]
Goodness-of-fit test for binary response models using survey data
    estat gof [if] [in] [, estat_gof_options]
Display covariance matrix estimates
    estat vce [, estat_vce_options]
estat_effects_options     Description
  deff                    report DEFF design effects
  deft                    report DEFT design effects
  srssubpop               report design effects, assuming SRS within subpopulation
  meff                    report MEFF design effects
  meft                    report MEFT design effects
  display_options         control spacing and display of omitted variables and base and
                          empty cells

estat_lceffects_options   Description
  deff                    report DEFF design effects
  deft                    report DEFT design effects
  srssubpop               report design effects, assuming SRS within subpopulation
  meff                    report MEFF design effects
  meft                    report MEFT design effects

estat_size_options        Description
  obs                     report number of observations (within subpopulation)
  size                    report subpopulation sizes

estat_sd_options          Description
  variance                report subpopulation variances instead of standard deviations
  srssubpop               report standard deviation, assuming SRS within subpopulation

estat_cv_options          Description
  nolegend                suppress the table legend
  display_options         control spacing and display of omitted variables and base and
                          empty cells

estat_gof_options         Description
  group(#)                compute test statistic using # quantiles
  total                   compute test statistic using the total estimator instead of the mean
                          estimator
  all                     execute test for all observations in the data

estat_vce_options         Description
  covariance              display as covariance matrix; the default
  correlation             display as correlation matrix
  equation(spec)          display only specified equations
  block                   display submatrices by equation
  diag                    display submatrices by equation; diagonal blocks only
  format(%fmt)            display format for covariances and correlations
  nolines                 suppress lines between equations
  display_options         control display of omitted variables and base and empty cells
33
Menu
Statistics > Survey data analysis > DEFF, MEFF, and other statistics
Description
estat svyset reports the survey design characteristics associated with the current estimation
results.
estat effects displays a table of design and misspecification effects for each estimated parameter.
estat lceffects displays a table of design and misspecification effects for a user-specified linear
combination of the parameter estimates.
estat size displays a table of sample and subpopulation sizes for each estimated subpopulation
mean, proportion, ratio, or total. This command is available only after svy: mean, svy: proportion,
svy: ratio, and svy: total; see [R] mean, [R] proportion, [R] ratio, and [R] total.
estat sd reports subpopulation standard deviations based on the estimation results from mean
and svy: mean; see [R] mean. estat sd is not appropriate with estimation results that used direct
standardization or poststratification.
estat strata displays a table of the number of singleton and certainty strata within each
sampling stage. The variance scaling factors are also displayed for estimation results where
singleunit(scaled) was svyset.
estat cv reports the coefficient of variation (CV) for each coefficient in the current estimation
results. The CV for coefficient b is
$$\mathrm{CV}(b) = \frac{\mathrm{SE}(b)}{|b|} \times 100\%$$
estat gof reports a goodness-of-fit test for binary response models using survey data. This
command is available only after svy: logistic, svy: logit, and svy: probit; see [R] logistic,
[R] logit, and [R] probit.
estat vce displays the covariance or correlation matrix of the parameter estimates of the previous
model. See [R] estat vce for examples.
Options for estat effects

deff and deft request that the design-effect measures DEFF and DEFT be displayed. This is the
default, unless direct standardization or poststratification was used.
The deff and deft options are not allowed with estimation results that used direct standardization
or poststratification. These methods obscure the measure of design effect because they adjust the
frequency distribution of the target population.
srssubpop requests that DEFF and DEFT be computed using an estimate of simple random sampling
(SRS) variance for sampling within a subpopulation. By default, DEFF and DEFT are computed using
an estimate of the SRS variance for sampling from the entire population. Typically, srssubpop is
used when computing subpopulation estimates by strata or by groups of strata.
meff and meft request that the misspecification-effect measures MEFF and MEFT be displayed.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Options for estat lceffects

deff and deft request that the design-effect measures DEFF and DEFT be displayed. This is the
default, unless direct standardization or poststratification was used.
The deff and deft options are not allowed with estimation results that used direct standardization
or poststratification. These methods obscure the measure of design effect because they adjust the
frequency distribution of the target population.
srssubpop requests that DEFF and DEFT be computed using an estimate of simple random sampling
(SRS) variance for sampling within a subpopulation. By default, DEFF and DEFT are computed using
an estimate of the SRS variance for sampling from the entire population. Typically, srssubpop is
used when computing subpopulation estimates by strata or by groups of strata.
meff and meft request that the misspecification-effect measures MEFF and MEFT be displayed.
Options for estat size

obs requests that the number of observations used to compute the estimate be displayed for each row
of estimates.
size requests that the estimate of the subpopulation size be displayed for each row of estimates. The
subpopulation size estimate equals the sum of the weights for those observations in the estimation
sample that are also in the specified subpopulation. The estimated population size is reported when
a subpopulation is not specified.
Options for estat sd

variance requests that the subpopulation variance be displayed instead of the standard deviation.
srssubpop requests that the standard deviation be computed using an estimate of SRS variance for
sampling within a subpopulation. By default, the standard deviation is computed using an estimate
of the SRS variance for sampling from the entire population. Typically, srssubpop is given when
computing subpopulation estimates by strata or by groups of strata.
Options for estat cv

nolegend prevents the table legend identifying the subpopulations from being displayed.
display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels; see [R] estimation options.
Options for estat gof

group(#) specifies the number of quantiles to be used to group the data for the goodness-of-fit test.
The minimum allowed value is group(2). The maximum allowed value is group(df), where df
is the design degrees of freedom (e(df_r)). The default is group(10).
total requests that the goodness-of-fit test statistic be computed using the total estimator instead of
the mean estimator.
all requests that the goodness-of-fit test statistic be computed for all observations in the data, ignoring
any if or in restrictions specified with the model fit.
Options for estat vce

covariance displays the matrix as a variance-covariance matrix; this is the default.
correlation displays the matrix as a correlation matrix rather than a variance-covariance matrix.
rho is a synonym for correlation.
equation(spec) selects the part of the VCE to be displayed. If spec = eqlist, the VCE for the
listed equations is displayed. If spec = eqlist1 \ eqlist2, the part of the VCE associated with
the equations in eqlist1 (rowwise) and eqlist2 (columnwise) is displayed. * is shorthand for all
equations. equation() implies block if diag is not specified.
block displays the submatrices pertaining to distinct equations separately.
diag displays the diagonal submatrices pertaining to distinct equations separately.
format(%fmt) specifies the display format for displaying the elements of the matrix. The default is
format(%10.0g) for covariances and format(%8.4f) for correlations. See [U] 12.5 Formats:
Controlling how data are displayed for more information.
nolines suppresses lines between equations.
display_options: noomitted, noemptycells, baselevels, allbaselevels; see [R] estimation
options.
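The estat vce options above can be combined. The following is a minimal sketch, assuming that svy estimation results (for example, from svy: mean) are the active estimation results; the particular display format is only an illustration.

```stata
* Assumes survey estimation results are already in memory.
* Covariance matrix of the estimates (the default display)
estat vce

* Same matrix shown as correlations, with a shorter display format
estat vce, correlation format(%6.3f)

* The displayed matrix is also returned in r(V)
matrix list r(V)
```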

Example 1
Using data from the Second National Health and Nutrition Examination Survey (NHANES II)
(McDowell et al. 1981), let's estimate the population means for total serum cholesterol (tcresult)
and for serum triglycerides (tgresult).

. svy: mean tcresult tgresult

Number of strata =      31        Number of obs   =       5050
Number of PSUs   =      62        Population size =   56820832
                                  Design df       =         31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    tcresult |   211.3975   1.252274      208.8435    213.9515
    tgresult |    138.576   2.071934      134.3503    142.8018
--------------------------------------------------------------
We can use estat svyset to remind us of the survey design characteristics that were used to
produce these results.
. estat svyset
      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>
estat effects reports a table of design and misspecification effects for each mean we estimated.
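. estat effects, deff deft meff meft

----------------------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       DEFF      DEFT      MEFF      MEFT
-------------+--------------------------------------------------------------
    tcresult |   211.3975   1.252274    3.57141   1.88982   3.46105   1.86039
    tgresult |    138.576   2.071934    2.35697   1.53524   2.32821   1.52585
----------------------------------------------------------------------------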
estat size reports a table that contains sample and population sizes.
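. estat size

------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.        Obs        Size
-------------+----------------------------------------------
    tcresult |   211.3975   1.252274        5050    56820832
    tgresult |    138.576   2.071934        5050    56820832
------------------------------------------------------------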
estat size can also report a table of subpopulation sizes.

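. svy: mean tcresult, over(sex)
(output omitted)
. estat size
         Male: sex = Male
       Female: sex = Female

------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.        Obs        Size
-------------+----------------------------------------------
tcresult     |
        Male |   210.7937   1.312967        4915    56159480
      Female |   215.2188   1.193853        5436    60998033
------------------------------------------------------------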
estat sd reports a table of subpopulation standard deviations.

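. estat sd
         Male: sex = Male
       Female: sex = Female

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
tcresult     |
        Male |   210.7937   45.79065
      Female |   215.2188   50.72563
-------------------------------------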
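estat cv reports a table of coefficients of variation for the estimates.

. estat cv
         Male: sex = Male
       Female: sex = Female

------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     CV (%)
-------------+----------------------------------
tcresult     |
        Male |   210.7937   1.312967     .622868
      Female |   215.2188   1.193853     .554716
------------------------------------------------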
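The CV column is the standard error divided by the absolute value of the estimate, times 100. As a minimal sketch, assuming the svy: mean tcresult, over(sex) results above are still the active estimation results, the Male entry can be recomputed by hand with the same coefficient-reference syntax used elsewhere in this entry:

```stata
* Assumes the results of -svy: mean tcresult, over(sex)- are active.
estat cv

* The estimates, standard errors, and CVs are returned as matrices
matrix list r(cv)

* Recompute the CV (%) for the Male mean: 100 * SE(b) / |b|
display 100 * [tcresult]_se[Male] / abs([tcresult]_b[Male])
```

With the values in the table, this is 100 × 1.312967/210.7937, which agrees with the reported CV of about 0.62%.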
Example 2: Design effects with subpopulations

When there are subpopulations, estat effects can compute design effects with respect to one
of two different hypothetical SRS designs. The default design is one in which SRS is conducted
across the full population. The alternate design is one in which SRS is conducted entirely within the
subpopulation of interest. This alternate design is used when the srssubpop option is specified.
Deciding which design is preferable depends on the nature of the subpopulations. If we can imagine
identifying members of the subpopulations before sampling them, the alternate design is preferable.
This case arises primarily when the subpopulations are strata or groups of strata. Otherwise, we may
prefer to use the default.
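Here is an example using the default with the NHANES II data.

. svy: mean iron, over(sex)
(output omitted)
. estat effects
         Male: sex = Male
       Female: sex = Female

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
iron         |
        Male |   104.7969    .557267    1.36097   1.16661
      Female |   97.16247   .6743344    2.01403   1.41916
----------------------------------------------------------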
Thus, for males, the design-based variance estimate is about 36% larger than the estimate from the
hypothetical SRS design including the full population (DEFF = 1.36). We can get DEFF and DEFT for
the alternate SRS design by using the srssubpop option.
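. estat effects, srssubpop
         Male: sex = Male
       Female: sex = Female

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
iron         |
        Male |   104.7969    .557267      1.348   1.16104
      Female |   97.16247   .6743344    2.03132   1.42524
----------------------------------------------------------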
Because the NHANES II did not stratify on sex, we think it problematic to consider design effects
with respect to SRS of the female (or male) subpopulation. Consequently, we would prefer to use the
default here, although the values of DEFF differ little between the two in this case.
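To compare the two sets of design effects side by side, the vectors that estat effects leaves in r() can be copied into matrices. This is a minimal sketch, assuming the NHANES II data are in memory and svyset as above; the matrix names deff_full and deff_sub are hypothetical.

```stata
* Assumes the NHANES II data are in memory and already svyset.
svy: mean iron, over(sex)

* Design effects relative to SRS from the full population (the default)
estat effects
matrix deff_full = r(deff)

* Design effects relative to SRS within each subpopulation
estat effects, srssubpop
matrix deff_sub = r(deffsub)

* List the two vectors for comparison
matrix list deff_full
matrix list deff_sub
```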
For other examples (generally involving heavy oversampling or undersampling of specified subpopulations), the differences in DEFF for the two schemes can be much more dramatic.
Consider the NMIHS data (Gonzalez, Krauss, and Scott 1992), and compute the mean of birthwgt
over race:
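. use http://www.stata-press.com/data/r13/nmihs
. svy: mean birthwgt, over(race)
(output omitted)
. estat effects
     nonblack: race = nonblack
        black: race = black

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
birthwgt     |
    nonblack |    3402.32   7.609532    1.44376   1.20157
       black |   3127.834   6.529814    .172041   .414778
----------------------------------------------------------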
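. estat effects, srssubpop
     nonblack: race = nonblack
        black: race = black

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       DEFF      DEFT
-------------+--------------------------------------------
birthwgt     |
    nonblack |    3402.32   7.609532    .826842   .909308
       black |   3127.834   6.529814    .528963   .727298
----------------------------------------------------------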
Because the NMIHS survey was stratified on race, marital status, age, and birthweight, we believe it
reasonable to consider design effects computed with respect to SRS within an individual race group.
Consequently, we would recommend here the alternative hypothetical design for computing design
effects; that is, we would use the srssubpop option.
Example 3: Misspecification effects

Misspecification effects assess biases in variance estimators that are computed under the wrong
assumptions. The survey literature (for example, Scott and Holt 1982, 850; Skinner 1989) defines
misspecification effects with respect to a general set of wrong variance estimators. estat effects
considers only one specific form: variance estimators computed under the incorrect assumption that
our observed sample was selected through SRS.
The resulting misspecification effect measure is informative primarily when an unweighted point
estimator is approximately unbiased for the parameter of interest. See Eltinge and Sribney (1996a)
for a detailed discussion of extensions of misspecification effects that are appropriate for biased point
estimators.
Note the difference between a misspecification effect and a design effect. For a design effect, we
compare our complex-design-based variance estimate with an estimate of the true variance that we
would have obtained under a hypothetical true simple random sample. For a misspecification effect,
we compare our complex-design-based variance estimate with an estimate of the variance from fitting
the same model without weighting, clustering, or stratification.
estat effects defines MEFF and MEFT as
$$\mathrm{MEFF} = \widehat{V}/\widehat{V}_{\mathrm{msp}} \qquad\qquad \mathrm{MEFT} = \sqrt{\mathrm{MEFF}}$$
where $\widehat{V}$ is the appropriate design-based estimate of variance and $\widehat{V}_{\mathrm{msp}}$ is the variance estimate
computed with a misspecified design, ignoring the sampling weights, stratification, and clustering.
Here we request that the misspecification effects be displayed for the estimation of mean zinc
levels from our NHANES II data.
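. svy: mean zinc, over(sex)
(output omitted)
. estat effects, meff meft
         Male: sex = Male
       Female: sex = Female

----------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.       MEFF      MEFT
-------------+--------------------------------------------
zinc         |
        Male |   90.74543   .5850741    6.28254    2.5065
      Female |    83.8635   .4689532    6.32648   2.51525
----------------------------------------------------------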
If we run ci without weights, we get the standard errors that are $(\widehat{V}_{\mathrm{msp}})^{1/2}$.

. sort sex
. ci zinc if sex == "Male":sex

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
        zinc |       4375    89.53143    .2334228         89.0738    89.98906

. display [zinc]_se[Male]/r(se)
2.5064994
. display ([zinc]_se[Male]/r(se))^2
6.2825393
. ci zinc if sex == "Female":sex

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
        zinc |       4827    83.76652     .186444        83.40101    84.13204

. display [zinc]_se[Female]/r(se)
2.515249
. display ([zinc]_se[Female]/r(se))^2
6.3264774
Example 4: Design and misspecification effects for linear combinations

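Let's compare the mean of total serum cholesterol (tcresult) between men and women in the
NHANES II dataset.

. svy: mean tcresult, over(sex)

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31

         Male: sex = Male
       Female: sex = Female

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
tcresult     |
        Male |   210.7937   1.312967      208.1159    213.4715
      Female |   215.2188   1.193853       212.784    217.6537
--------------------------------------------------------------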
We can use estat lceffects to report the standard error, design effects, and misspecification effects
of the difference between the above means.
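. estat lceffects [tcresult]Male - [tcresult]Female, deff deft meff meft
 ( 1)  [tcresult]Male - [tcresult]Female = 0

----------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.       DEFF      DEFT      MEFF      MEFT
-------------+--------------------------------------------------------------
         (1) |  -4.425109   1.086786    1.31241    1.1456   1.27473   1.12904
----------------------------------------------------------------------------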
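The quantities in this table are also left behind in r(), so they can be reused in later calculations. The following is a minimal sketch, assuming the svy: mean tcresult, over(sex) results are still active. The lincom call is an added cross-check, not part of the original example; it is expected to report the same point estimate and standard error for the difference.

```stata
* Assumes the results of -svy: mean tcresult, over(sex)- are active.
estat lceffects [tcresult]Male - [tcresult]Female, deff deft

* Scalars stored by estat lceffects
display r(estimate)
display r(se)
display r(deff)
display r(deft)

* Cross-check of the point estimate and standard error
lincom [tcresult]Male - [tcresult]Female
```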
Example 5: Using survey data to determine Neyman allocation

Suppose that we have partitioned our population into $L$ strata and stratum $h$ contains $N_h$ individuals.
Also let $\sigma_h$ represent the standard deviation of a quantity we wish to sample from the population.
According to Cochran (1977, sec. 5.5), we can minimize the variance of the stratified mean estimator,
for a fixed sample size $n$, if we choose the stratum sample sizes according to Neyman allocation:
$$n_h = n\,\frac{N_h \sigma_h}{\sum_{i=1}^{L} N_i \sigma_i} \qquad (1)$$
We can use estat sd with our current survey data to produce a table of subpopulation
standard-deviation estimates. Then we could plug these estimates into (1) to improve our survey design for
the next time we sample from our population.
Here is an example using birthweight from the NMIHS data. First, we need estimation results from
svy: mean over the strata.
. svyset [pw=finwgt], strata(stratan)
pweight: finwgt
VCE: linearized
Strata 1: stratan
FPC 1: <zero>
. svy: mean birthwgt, over(stratan)
(output omitted )
Next we will use estat size to report the table of stratum sizes. We will also generate matrix
p_obs to contain the observed percent allocations for each stratum. In the matrix expression, r(_N)
is a row vector of stratum sample sizes and e(N) contains the total sample size. r(_N_subp) is a
row vector of the estimated population stratum sizes.

. estat size
            1: stratan = 1
            2: stratan = 2
            3: stratan = 3
            4: stratan = 4
            5: stratan = 5
            6: stratan = 6

-----------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.        Obs             Size
-------------+---------------------------------------------------
birthwgt     |
           1 |   1049.434   19.00149         841      18402.98161
           2 |   2189.561   9.162736         803      67650.95932
           3 |   3303.492    7.38429        3578      579104.6188
           4 |   1036.626   12.32294         710      29814.93215
           5 |   2211.217   9.864682         714     153379.07445
           6 |    3485.42   8.057648        3300    3047209.10519
-----------------------------------------------------------------

. matrix p_obs = 100 * r(_N)/e(N)
. matrix nsubp = r(_N_subp)
Now we call estat sd to report the stratum standard-deviation estimates and generate matrix
p_neyman to contain the percent allocations according to (1). In the matrix expression, r(sd) is a
vector of the stratum standard deviations.

. estat sd
            1: stratan = 1
            2: stratan = 2
            3: stratan = 3
            4: stratan = 4
            5: stratan = 5
            6: stratan = 6

-------------------------------------
        Over |       Mean   Std. Dev.
-------------+-----------------------
birthwgt     |
           1 |   1049.434   2305.931
           2 |   2189.561   555.7971
           3 |   3303.492   687.3575
           4 |   1036.626   999.0867
           5 |   2211.217   349.8068
           6 |    3485.42   300.6945
-------------------------------------

. matrix p_neyman = 100 * hadamard(nsubp,r(sd))/el(nsubp*r(sd)',1,1)

. matrix list p_obs, format(%4.1f)
p_obs[1,6]
    birthwgt:  birthwgt:  birthwgt:  birthwgt:  birthwgt:  birthwgt:
           1          2          3          4          5          6
r1       8.5        8.1       36.0        7.1        7.2       33.2

. matrix list p_neyman, format(%4.1f)
p_neyman[1,6]
    birthwgt:  birthwgt:  birthwgt:  birthwgt:  birthwgt:  birthwgt:
           1          2          3          4          5          6
r1       2.9        2.5       26.9        2.0        3.6       62.0
We can see that strata 3 and 6 each contain about one-third of the observed data, with the rest of
the observations spread out roughly equally to the remaining strata. However, plugging our sample
estimates into (1) indicates that stratum 6 should get 62% of the sampling units, stratum 3 should
get about 27%, and the remaining strata should get a roughly equal distribution of sampling units.
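As a check on the allocation, (1) can be evaluated directly from the stratum sizes and standard deviations printed in the two tables, without rerunning the estimation. This is a minimal sketch; the matrix names Nh, Sh, and p_check are hypothetical, and the values are typed in from the estat size and estat sd output.

```stata
* Estimated stratum sizes (Size column of estat size), typed in by hand
matrix Nh = (18402.98161, 67650.95932, 579104.6188, 29814.93215, 153379.07445, 3047209.10519)

* Stratum standard deviations (Std. Dev. column of estat sd)
matrix Sh = (2305.931, 555.7971, 687.3575, 999.0867, 349.8068, 300.6945)

* Percent allocation: 100 * N_h*sigma_h / sum_i(N_i*sigma_i)
* Nh*Sh' is a 1 x 1 matrix holding the sum of the products
matrix p_check = 100 * hadamard(Nh, Sh) / el(Nh*Sh', 1, 1)
matrix list p_check, format(%4.1f)
```

The listed row should agree with the p_neyman row above; the products for strata 3 and 6 dominate the sum, which is why those two strata receive most of the allocation.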
Example 6: Summarizing singleton and certainty strata

Use estat strata with svy estimation results to produce a table that reports the number of
singleton and certainty strata in each sampling stage. Here is an example using (fictional) data from
a complex survey with five sampling stages (the dataset is already svyset). If singleton strata are
present, estat strata will report their effect on the standard errors.
. use http://www.stata-press.com/data/r13/strata5
. svy: total y
(output omitted )
. estat strata

---------------------------------------
          Singleton  Certainty    Total
  Stage      strata     strata   strata
---------------------------------------
      1           0          1        4
      2           1          0       10
      3           0          3       29
      4           2          0      110
      5         204        311      865
---------------------------------------
Note: missing standard error because of stratum with single sampling unit.

estat strata also reports the scale factor used when the singleunit(scaled) option is
svyset. Of the 865 strata in the last stage, 204 are singleton strata and 311 are certainty strata. Thus
the scaling factor for the last stage is
$$\frac{865 - 311}{865 - 311 - 204} \approx 1.58$$
. svyset, singleunit(scaled) noclear
(output omitted )
. svy: total y
(output omitted )
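. estat strata

-------------------------------------------------
          Singleton  Certainty    Total     Scale
  Stage      strata     strata   strata    factor
-------------------------------------------------
      1           0          1        4         1
      2           1          0       10      1.11
      3           0          3       29         1
      4           2          0      110      1.02
      5         204        311      865      1.58
-------------------------------------------------
Note: variances scaled within each stage to handle strata with a single sampling unit.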
The singleunit(scaled) option of svyset is one of three methods by which Stata's svy commands
can automatically handle singleton strata when performing variance estimation; see [SVY] variance
estimation for a brief discussion of these methods.
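The scale factors in the table follow the same pattern as the stage 5 calculation: the number of noncertainty strata divided by the number of noncertainty strata that are not singletons. This is a minimal sketch of that arithmetic for the stages with singleton strata, using counts read from the table; the final command assumes the estat strata results above are still in r().

```stata
* Stage 5: 865 strata, 311 certainty, 204 singleton (about 1.58)
display (865 - 311) / (865 - 311 - 204)

* Stage 2: 10 strata, 0 certainty, 1 singleton (about 1.11)
display (10 - 0) / (10 - 0 - 1)

* Stage 4: 110 strata, 0 certainty, 2 singleton (about 1.02)
display (110 - 0) / (110 - 0 - 2)

* The factors actually used are returned by estat strata
estat strata
matrix list r(scale)
```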
Example 7: Goodness-of-fit test for svy: logistic

From example 2 in [SVY] svy estimation, we modeled the incidence of high blood pressure as a
function of height, weight, age, and sex (using the female indicator variable).
. use http://www.stata-press.com/data/r13/nhanes2d
. svyset
      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>
. svy: logistic highbp height weight age female
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata =      31        Number of obs   =      10351
Number of PSUs   =      62        Population size =  117157513
                                  Design df       =         31
                                  F(   4,     28) =     368.33
                                  Prob > F        =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9657022   .0051511    -6.54   0.000     .9552534    .9762654
      weight |   1.053023   .0026902    20.22   0.000     1.047551    1.058524
         age |   1.050059   .0019761    25.96   0.000     1.046037    1.054097
      female |   .6272129   .0368195    -7.95   0.000     .5564402     .706987
       _cons |    .716868   .6106878    -0.39   0.699     .1261491    4.073749
------------------------------------------------------------------------------

We can use estat gof to perform a goodness-of-fit test for this model.

. estat gof

Logistic model for highbp, goodness-of-fit test

                     F(9,23) =         5.32
                     Prob > F =       0.0006
The F statistic is significant at the 5% level, indicating that the model is not a good fit for these data.
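The number of quantile groups can be changed with group(), and the test results are available in r() for further use. This is a minimal sketch, assuming the svy: logistic results above are still active; the choice of 5 groups is only an illustration.

```stata
* Assumes the results of -svy: logistic highbp height weight age female- are active.
* Repeat the test with 5 quantile groups instead of the default 10
estat gof, group(5)

* Stored results: F statistic, its degrees of freedom, and the p-value
display r(F)
display r(df1)
display r(df2)
display r(p)
```

With 31 design degrees of freedom and 5 groups, the formulas in Methods and formulas imply 4 numerator and 28 denominator degrees of freedom.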
Stored results
estat svyset stores the following in r():

Scalars
    r(stages)             number of sampling stages
Macros
    r(wtype)              weight type
    r(wexp)               weight expression
    r(wvar)               weight variable name
    r(su#)                variable identifying sampling units for stage #
    r(strata#)            variable identifying strata for stage #
    r(fpc#)               FPC for stage #
    r(bsrweight)          bsrweight() variable list
    r(bsn)                bootstrap mean-weight adjustment
    r(brrweight)          brrweight() variable list
    r(fay)                Fay's adjustment
    r(jkrweight)          jkrweight() variable list
    r(sdrweight)          sdrweight() variable list
    r(sdrfpc)             fpc() value from within sdrweight()
    r(vce)                vcetype specified in vce()
    r(dof)                dof() value
    r(mse)                mse, if specified
    r(poststrata)         poststrata() variable
    r(postweight)         postweight() variable
    r(settings)           svyset arguments to reproduce the current settings
    r(singleunit)         singleunit() setting

estat strata stores the following in r():

Matrices
    r(_N_strata_single)   number of strata with one sampling unit
    r(_N_strata_certain)  number of certainty strata
    r(_N_strata)          number of strata
    r(scale)              variance scale factors used when singleunit(scaled) is svyset

estat effects stores the following in r():

Matrices
    r(deff)               vector of DEFF estimates
    r(deft)               vector of DEFT estimates
    r(deffsub)            vector of DEFF estimates for srssubpop
    r(deftsub)            vector of DEFT estimates for srssubpop
    r(meff)               vector of MEFF estimates
    r(meft)               vector of MEFT estimates

estat lceffects stores the following in r():

Scalars
    r(estimate)           point estimate
    r(se)                 estimate of standard error
    r(df)                 degrees of freedom
    r(deff)               DEFF estimate
    r(deft)               DEFT estimate
    r(deffsub)            DEFF estimate for srssubpop
    r(deftsub)            DEFT estimate for srssubpop
    r(meff)               MEFF estimate
    r(meft)               MEFT estimate

estat size stores the following in r():

Matrices
    r(_N)                 vector of numbers of nonmissing observations
    r(_N_subp)            vector of subpopulation size estimates

estat sd stores the following in r():

Macros
    r(srssubpop)          srssubpop, if specified
Matrices
    r(mean)               vector of subpopulation mean estimates
    r(sd)                 vector of subpopulation standard-deviation estimates
    r(variance)           vector of subpopulation variance estimates

estat cv stores the following in r():

Matrices
    r(b)                  estimates
    r(se)                 standard errors of the estimates
    r(cv)                 coefficients of variation of the estimates

estat gof stores the following in r():

Scalars
    r(p)                  p-value associated with the test statistic
    r(F)                  F statistic, if e(df_r) was stored by estimation command
    r(df1)                numerator degrees of freedom for F statistic
    r(df2)                denominator degrees of freedom for F statistic
    r(chi2)               χ² statistic, if e(df_r) was not stored by estimation command
    r(df)                 degrees of freedom for χ² statistic

estat vce stores the following in r():

Matrices
    r(V)                  VCE or correlation matrix

Methods and formulas
Methods and formulas are presented under the following headings:
Design effects
Linear combinations
Misspecification effects
Population and subpopulation standard deviations
Coefficient of variation
Goodness of fit for binary response models
Design effects
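estat effects produces two estimators of design effect, DEFF and DEFT.
DEFF is estimated as described in Kish (1965) as
$$\mathrm{DEFF} = \frac{\widehat{V}(\widehat{\theta})}{\widehat{V}_{\mathrm{srswor}}(\widetilde{\theta}_{\mathrm{srs}})}$$
where $\widehat{V}(\widehat{\theta})$ is the design-based estimate of variance for a parameter, $\theta$, and $\widehat{V}_{\mathrm{srswor}}(\widetilde{\theta}_{\mathrm{srs}})$ is an
estimate of the variance for an estimator, $\widetilde{\theta}_{\mathrm{srs}}$, that would be obtained from a similar hypothetical
survey conducted using SRS without replacement (wor) and with the same number of sample elements,
$m$, as in the actual survey. For example, if $\theta$ is a total $Y$, then
$$\widehat{V}_{\mathrm{srswor}}(\widetilde{\theta}_{\mathrm{srs}}) = (1 - f)\,\frac{\widehat{M}}{m - 1} \sum_{j=1}^{m} w_j \left( y_j - \widehat{\overline{Y}} \right)^2 \qquad (1)$$
where $\widehat{\overline{Y}} = \widehat{Y}/\widehat{M}$. The factor $(1 - f)$ is a finite population correction. If the user sets an FPC for
the first stage, $f = m/\widehat{M}$ is used; otherwise, $f = 0$.
DEFT is estimated as described in Kish (1987, 41) as
$$\mathrm{DEFT} = \sqrt{\frac{\widehat{V}(\widehat{\theta})}{\widehat{V}_{\mathrm{srswr}}(\widetilde{\theta}_{\mathrm{srs}})}}$$
where $\widehat{V}_{\mathrm{srswr}}(\widetilde{\theta}_{\mathrm{srs}})$ is an estimate of the variance for an estimator, $\widetilde{\theta}_{\mathrm{srs}}$, obtained from a similar survey
conducted using SRS with replacement (wr). $\widehat{V}_{\mathrm{srswr}}(\widetilde{\theta}_{\mathrm{srs}})$ is computed using (1) with $f = 0$.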
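Because the two SRS variance estimates come from the same formula (1) and differ only in the finite population correction, DEFF and DEFT are directly related. The following short derivation is a sketch that uses only the definitions above:

```latex
\widehat{V}_{\mathrm{srswor}}(\widetilde{\theta}_{\mathrm{srs}})
   = (1 - f)\,\widehat{V}_{\mathrm{srswr}}(\widetilde{\theta}_{\mathrm{srs}})
\quad\Longrightarrow\quad
\mathrm{DEFF}
   = \frac{\widehat{V}(\widehat{\theta})}{(1 - f)\,\widehat{V}_{\mathrm{srswr}}(\widetilde{\theta}_{\mathrm{srs}})}
   = \frac{\mathrm{DEFT}^2}{1 - f}
```

When no FPC is set for the first stage, $f = 0$ and DEFT is simply the square root of DEFF. For example, for tcresult in example 1 (where FPC 1 is <zero>), $\sqrt{3.57141} \approx 1.88982$, which matches the reported DEFT.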
When computing estimates for a subpopulation, $S$, and the srssubpop option is not specified
(that is, the default), (1) is used with $w_{Sj} = I_S(j)\,w_j$ in place of $w_j$, where
$$I_S(j) = \begin{cases} 1, & \text{if } j \in S \\ 0, & \text{otherwise} \end{cases}$$
The sums in (1) are still calculated over all elements in the sample, regardless of whether they belong
to the subpopulation: by default, the SRS is assumed to be done across the full population.
When the srssubpop option is specified, the SRS is carried out within subpopulation $S$. Here
(1) is used with the sums restricted to those elements belonging to the subpopulation; $m$ is replaced
with $m_S$, the number of sample elements from the subpopulation; $\widehat{M}$ is replaced with $\widehat{M}_S$, the sum
of the weights from the subpopulation; and $\widehat{\overline{Y}}$ is replaced with $\widehat{\overline{Y}}_S = \widehat{Y}_S/\widehat{M}_S$, the weighted mean
across the subpopulation.
Linear combinations
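estat lceffects estimates $\eta = C\theta$, where $\theta$ is a $q \times 1$ vector of parameters (for example,
population means or population regression coefficients) and $C$ is any $1 \times q$ vector of constants. The
estimate of $\eta$ is $\widehat{\eta} = C\widehat{\theta}$, and its variance estimate is
$$\widehat{V}(\widehat{\eta}) = C\,\widehat{V}(\widehat{\theta})\,C'$$
Similarly, the SRS without replacement (srswor) variance estimator used in the computation of DEFF
is
$$\widehat{V}_{\mathrm{srswor}}(\widetilde{\eta}_{\mathrm{srs}}) = C\,\widehat{V}_{\mathrm{srswor}}(\widehat{\theta}_{\mathrm{srs}})\,C'$$
and the SRS with replacement (srswr) variance estimator used in the computation of DEFT is
$$\widehat{V}_{\mathrm{srswr}}(\widetilde{\eta}_{\mathrm{srs}}) = C\,\widehat{V}_{\mathrm{srswr}}(\widehat{\theta}_{\mathrm{srs}})\,C'$$
The variance estimator used in computing MEFF and MEFT is
$$\widehat{V}_{\mathrm{msp}}(\widetilde{\eta}_{\mathrm{msp}}) = C\,\widehat{V}_{\mathrm{msp}}(\widehat{\theta}_{\mathrm{msp}})\,C'$$
estat lceffects was originally developed under a different command name; see Eltinge and
Sribney (1996b).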
Misspecification effects
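estat effects produces two estimators of misspecification effect, MEFF and MEFT.
$$\mathrm{MEFF} = \frac{\widehat{V}(\widehat{\theta})}{\widehat{V}_{\mathrm{msp}}(\widehat{\theta}_{\mathrm{msp}})} \qquad\qquad \mathrm{MEFT} = \sqrt{\mathrm{MEFF}}$$
where $\widehat{V}(\widehat{\theta})$ is the design-based estimate of variance for a parameter, $\theta$, and $\widehat{V}_{\mathrm{msp}}(\widehat{\theta}_{\mathrm{msp}})$ is the variance
estimate for $\widehat{\theta}_{\mathrm{msp}}$. These estimators, $\widehat{\theta}_{\mathrm{msp}}$ and $\widehat{V}_{\mathrm{msp}}(\widehat{\theta}_{\mathrm{msp}})$, are based on the incorrect assumption
that the observations were obtained through SRS with replacement: they are the estimators obtained
by simply ignoring weights, stratification, and clustering. When $\theta$ is a total $Y$, the estimator and its
variance estimate are computed using the standard formulas for an unweighted total:
$$\widehat{Y}_{\mathrm{msp}} = \widehat{M}\,\overline{y} = \frac{\widehat{M}}{m} \sum_{j=1}^{m} y_j$$
$$\widehat{V}_{\mathrm{msp}}(\widehat{Y}_{\mathrm{msp}}) = \frac{\widehat{M}^2}{m(m - 1)} \sum_{j=1}^{m} \left( y_j - \overline{y} \right)^2$$
When computing MEFF and MEFT for a subpopulation, sums are restricted to those elements
belonging to the subpopulation, and $m_S$ and $\widehat{M}_S$ are used in place of $m$ and $\widehat{M}$.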
Population and subpopulation standard deviations

For srswr designs, the variance of the mean estimator is
$$V_{\mathrm{srswr}}(\overline{y}) = \sigma^2/n$$
where $n$ is the sample size and $\sigma$ is the population standard deviation. estat sd uses this formula
and the results from mean and svy: mean to estimate the population standard deviation via
$$\widehat{\sigma} = \sqrt{n\,\widehat{V}_{\mathrm{srswr}}(\overline{y})}$$
Subpopulation standard deviations are computed similarly, using the corresponding variance estimate
and sample size.
Coefficient of variation
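The coefficient of variation (CV) for estimate $\widehat{\theta}$ is
$$\mathrm{CV}(\widehat{\theta}) = \frac{\sqrt{\widehat{V}(\widehat{\theta})}}{|\widehat{\theta}|} \times 100\%$$
A missing value is reported when $\widehat{\theta}$ is zero.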
Goodness of fit for binary response models

Let $y_j$ be the $j$th observed value of the dependent variable, $\widehat{p}_j$ be the predicted probability of a
positive outcome, and $\widehat{r}_j = y_j - \widehat{p}_j$. Let $g$ be the requested number of groups from the group()
option; then the $\widehat{r}_j$ are placed in $g$ quantile groups as described in Methods and formulas for the
xtile command in [D] pctile. Let $\overline{r} = (\overline{r}_1, \ldots, \overline{r}_g)$, where $\overline{r}_i$ is the subpopulation mean of the $\widehat{r}_j$
for the $i$th quantile group. The standard Wald statistic for testing $H_0\colon \overline{r} = 0$ is
$$\widehat{X}^2 = \overline{r}\,\{\widehat{V}(\overline{r})\}^{-1}\,\overline{r}'$$
where $\widehat{V}(\overline{r})$ is the design-based variance estimate for $\overline{r}$. Here $\widehat{X}^2$ is approximately distributed as a
$\chi^2$ with $g - 1$ degrees of freedom. This Wald statistic is one of the three goodness-of-fit statistics
discussed in Graubard, Korn, and Midthune (1997). estat gof reports this statistic when the design
degrees of freedom is missing, such as with svy bootstrap results.
According to Archer and Lemeshow (2006), the F-adjusted mean residual test is given by
$$\widehat{F} = \widehat{X}^2\,(d - g + 2)/(dg)$$
where $d$ is the design degrees of freedom. Here $\widehat{F}$ is approximately distributed as an $F$ with $g - 1$
numerator and $d - g + 2$ denominator degrees of freedom.
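As a worked check against example 7, the NHANES II design has $d = 31$ design degrees of freedom and the default is $g = 10$ groups. Substituting into the expressions above gives the degrees of freedom shown in the estat gof output; the implied Wald statistic is only approximate because the reported $F$ is rounded.

```latex
g - 1 = 10 - 1 = 9, \qquad d - g + 2 = 31 - 10 + 2 = 23
\quad\Longrightarrow\quad \widehat{F} \sim F(9,\,23)

\widehat{F} = \widehat{X}^2\,\frac{23}{31 \times 10}
\quad\Longrightarrow\quad
\widehat{X}^2 = \widehat{F}\,\frac{310}{23} \approx 5.32 \times 13.48 \approx 71.7
```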
With the total option, estat gof uses the subpopulation total estimator instead of the subpopulation
mean estimator.
References
Archer, K. J., and S. A. Lemeshow. 2006. Goodness-of-fit test for a logistic regression model fitted using survey
sample data. Stata Journal 6: 97–105.
Eltinge, J. L., and W. M. Sribney. 1996a. Accounting for point-estimation bias in assessment of misspecification
effects, confidence-set coverage rates and test sizes. Unpublished manuscript, Department of Statistics, Texas A&M
University.
Eltinge, J. L., and W. M. Sribney. 1996b. svy5: Estimates of linear combinations and hypothesis tests for survey data. Stata Technical Bulletin
31: 31–42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246–259. College Station, TX: Stata Press.
Gonzalez, J. F., Jr., N. Krauss, and C. Scott. 1992. Estimation in the 1988 National Maternal and Infant Health
Survey. Proceedings of the Section on Statistics Education, American Statistical Association 343–348.
Graubard, B. I., E. L. Korn, and D. Midthune. 1997. Testing goodness-of-fit for logistic regression with survey data.
In Proceedings of the Section on Survey Research Methods, Joint Statistical Meetings, 170–174. Alexandria, VA:
American Statistical Association.
Kish, L. 1965. Survey Sampling. New York: Wiley.
Kish, L. 1987. Statistical Design for Research. New York: Wiley.
Scott, A. J., and D. Holt. 1982. The effect of two-stage sampling on ordinary least squares methods. Journal of the
American Statistical Association 77: 848–854.
Skinner, C. J. 1989. Introduction to part A. In Analysis of Complex Surveys, ed. C. J. Skinner, D. Holt, and
T. M. F. Smith, 23–58. New York: Wiley.
West, B. T., and S. E. McCabe. 2012. Incorporating complex sample design effects when only final survey weights
are available. Stata Journal 12: 718–725.
Also see
[SVY] svy postestimation Postestimation tools for svy
[SVY] subpopulation estimation Subpopulation estimation for survey data
[SVY] variance estimation Variance estimation for survey data
Title
jackknife options More options for jackknife variance estimation
Syntax
Description
Options
Also see
Syntax
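jackknife_options         Description
SE
  mse                     use MSE formula for variance
  nodots                  suppress replication dots

  saving(filename, ...)   save results to filename
  keep                    keep pseudovalues
  verbose                 display the full table legend
  noisily                 display any output from command
  trace                   trace command
  title(text)             use text as title for results
  nodrop                  do not drop observations
  reject(exp)             identify invalid results

saving(), keep, verbose, noisily, trace, title(), nodrop, and reject() are not shown in the dialog boxes
for estimation commands.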
Description
svy accepts more options when performing jackknife variance estimation.
Options
SE
mse specifies that svy compute the variance by using deviations of the replicates from the observed
value of the statistic based on the entire dataset. By default, svy computes the variance by using
deviations of the pseudovalues from their mean.
nodots suppresses display of the replication dots. By default, one dot character is printed for each
successful replication. A red 'x' is displayed if command returns with an error, 'e' is displayed
if at least one of the values in the exp list is missing, 'n' is displayed if the sample size is not
correct, and a yellow 's' is displayed if the dropped sampling unit is outside the subpopulation
sample.
saving(), keep, verbose, noisily, trace, title(), nodrop, reject(); see [SVY] svy jackknife.
Also see
[SVY] svy jackknife Jackknife estimation for survey data
Title
ml for svy Maximum pseudolikelihood estimation for survey data
Reference
Also see

Stata's ml command can fit maximum likelihood-based models for survey data. Many ml-based
estimators can now be modified to handle one or more stages of clustering, stratification, sampling
weights, finite population correction, poststratification, and subpopulation estimation. See [R] ml for
details.
See [P] program properties for a discussion of the programming requirements for an estimation
command to work with the svy prefix. See Gould, Pitblado, and Poi (2010) for examples of user-written
estimation commands that support the svy prefix.
Example 1: User-written survey regression

The ml command requires a program that computes likelihood values to perform maximum
likelihood. Here is a likelihood evaluator used in Gould, Pitblado, and Poi (2010) to fit linear
regression models using likelihood from the normal distribution.
program mynormal_lf
version 13
args lnf mu lnsigma
quietly replace lnf = ln(normalden($ML_y1,mu,exp(lnsigma)))
end
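As a cross-check on what the evaluator computes, the following Python sketch (not Stata code; the function names and the toy data are illustrative assumptions) evaluates the same normal log density, ln normalden(y, mu, exp(lnsigma)), and sums it with sampling weights. It assumes that the log pseudolikelihood is the weighted sum of the observation-level log densities.

```python
import math

def normal_lnf(y, mu, lnsigma):
    """Log of the normal density with mean mu and standard deviation exp(lnsigma)."""
    sigma = math.exp(lnsigma)
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - lnsigma - 0.5 * z * z

def log_pseudolikelihood(ys, mus, lnsigma, weights):
    """Weighted sum of the observation-level log densities."""
    return sum(w * normal_lnf(y, m, lnsigma) for y, m, w in zip(ys, mus, weights))

# Toy data (hypothetical): three observations with sampling weights.
ys = [1.0, 2.0, 3.0]
mus = [1.5, 2.0, 2.5]
weights = [10.0, 20.0, 30.0]
print(log_pseudolikelihood(ys, mus, 0.0, weights))
```

Parameterizing the standard deviation as exp(lnsigma) keeps it positive for any real value of lnsigma, which is why the evaluator works with lnsigma rather than sigma.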
Here we fit a survey regression model using a multistage survey dataset with ml and the above
likelihood evaluator.
. use http://www.stata-press.com/data/r13/multistage
. svyset county [pw=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
pweight: sampwgt
VCE: linearized
Strata 1: state
SU 1: county
FPC 1: ncounties
Strata 2: <one>
SU 2: school
FPC 2: nschools
. ml model lf mynormal_lf (mu: weight = height) /lnsigma, svy
. ml max
initial:       log pseudolikelihood =     -<inf>  (could not be evaluated)
feasible:      log pseudolikelihood = -7.301e+08
rescale:       log pseudolikelihood =  -51944380
rescale eq:    log pseudolikelihood =  -47565331
Iteration 0:   log pseudolikelihood =  -47565331
Iteration 1:   log pseudolikelihood =  -41221759
Iteration 2:   log pseudolikelihood =  -41218957
Iteration 3:   log pseudolikelihood =  -41170544
Iteration 4:   log pseudolikelihood =  -41145411
Iteration 5:   log pseudolikelihood =  -41123161
Iteration 6:   log pseudolikelihood =  -41103001
Iteration 7:   log pseudolikelihood =  -41083551
Iteration 8:   log pseudolikelihood =  -38467683
Iteration 9:   log pseudolikelihood =  -38329015
Iteration 10:  log pseudolikelihood =  -38328739
Iteration 11:  log pseudolikelihood =  -38328739
(The log also marks six iterations as "not concave" and one as "backed up".)

Number of strata =      50          Number of obs    =      4071
Number of PSUs   =     100          Population size  =   8000000
                                    Design df        =        50
                                    F(1, 50)         =    593.99
                                    Prob > F         =    0.0000

                            Linearized
      weight       Coef.    Std. Err.        t    P>|t|    [95% Conf. Interval]
mu
      height    .7163115    .0293908     24.37    0.000    .6572784    .7753447
       _cons   -149.6183    12.57265    -11.90    0.000   -174.8712   -124.3654
lnsigma
       _cons    3.372154    .0180777    186.54    0.000    3.335844    3.408464
Reference
Gould, W. W., J. S. Pitblado, and B. P. Poi. 2010. Maximum Likelihood Estimation with Stata. 4th ed. College
Station, TX: Stata Press.
Also see
[P] program properties Properties of user-defined programs
[R] maximize Details of iterative maximization
[R] ml Maximum likelihood estimation
Title
poststratification Poststratification for survey data
Description
References

Also see
Description
Poststratification is a method for adjusting the sampling weights, usually to account for underrepresented groups in the population.
See [SVY] direct standardization for a similar method of adjustment that allows the comparison
of rates that come from different frequency distributions.

Overview
Video example
Overview
Poststratification involves adjusting the sampling weights so that they sum to the population
sizes within each poststratum. This usually results in decreasing bias because of nonresponse and
underrepresented groups in the population. Poststratification also tends to result in smaller variance
estimates.
The svyset command has options to set variables for applying poststratification adjustments to the
sampling weights. The poststrata() option takes a variable that contains poststratum identifiers,
and the postweight() option takes a variable that contains the poststratum population sizes.
In the following example, we use an example from Levy and Lemeshow (2008) to show how
poststratification affects the point estimates and their variance.
Example 1: Poststratified mean

Levy and Lemeshow (2008, sec. 6.6) give an example of poststratification by using simple survey
data from a veterinarian's client list. The data in poststrata.dta were collected using simple
random sampling without replacement. The totexp variable contains the total expenses to the client,
type identifies the cats and dogs, postwgt contains the poststratum sizes (450 for cats and 850 for
dogs), and fpc contains the total number of clients (850 + 450 = 1300).
. use http://www.stata-press.com/data/r13/poststrata
. svyset, poststrata(type) postweight(postwgt) fpc(fpc)
pweight: <none>
VCE: linearized
Poststrata: type
Postweight: postwgt
Strata 1: <one>
FPC 1: fpc
. svy: mean totexp
Number of strata =       1          Number of obs    =        50
Number of PSUs   =      50          Population size  =      1300
N. of poststrata =       2          Design df        =        49

                            Linearized
                    Mean    Std. Err.    [95% Conf. Interval]
      totexp    40.11513    1.163498     37.77699    42.45327
The mean total expenses is $40.12 with a standard error of $1.16. In the following, we omit the
poststratification information from svyset, resulting in mean total expenses of $39.73 with standard
error $2.22. The difference between the mean estimates is explained by the facts that expenses tend
to be larger for dogs than for cats and that the dogs were slightly underrepresented in the sample
(850/1,300 ≈ 0.65 for the population; 32/50 = 0.64 for the sample). This reasoning also explains why
the variance estimate from the poststratified mean is smaller than the one that was not poststratified.
. svyset, fpc(fpc)
pweight: <none>
VCE: linearized
Strata 1: <one>
FPC 1: fpc
. svy: mean totexp
Number of strata =       1          Number of obs    =        50
Number of PSUs   =      50          Population size  =        50
                                    Design df        =        49

                            Linearized
                    Mean    Std. Err.    [95% Conf. Interval]
      totexp     39.7254    2.221747     35.26063    44.19017
Video example
Specifying the poststratification of survey data to Stata

Suppose that you used a complex survey design to sample $m$ individuals from a population of
size $M$. Let $P_k$ be the set of individuals in the sample that belong to poststratum $k$, and let $I_{P_k}(j)$
indicate if the $j$th individual is in poststratum $k$, where

$$I_{P_k}(j) = \begin{cases} 1, & \text{if } j \in P_k \\ 0, & \text{otherwise} \end{cases}$$

Also let $L_P$ be the number of poststrata and $M_k$ be the population size for poststratum $k$.
If $w_j$ is the unadjusted sampling weight for the $j$th sampled individual, the poststratification
adjusted sampling weight is

$$w_j^* = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\widehat{M}_k} w_j$$

where $\widehat{M}_k$ is

$$\widehat{M}_k = \sum_{j=1}^{m} I_{P_k}(j) w_j$$

The point estimates are computed using these adjusted weights. For example, the poststratified total
estimator is

$$\widehat{Y}^P = \sum_{j=1}^{m} w_j^* y_j$$

where $y_j$ is an item from the $j$th sampled individual.
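The weight adjustment can be checked numerically. The Python sketch below is an illustration rather than Stata's implementation: the function name is hypothetical, and it assumes unit sampling weights (as when no pweight is svyset) together with the counts from Example 1 (32 dogs and 18 cats sampled; 850 dogs and 450 cats in the population).

```python
from collections import defaultdict

def poststratify(weights, strata, pop_sizes):
    """Return adjusted weights (M_k / Mhat_k) * w_j for each unit j in poststratum k."""
    mhat = defaultdict(float)
    for w, k in zip(weights, strata):
        mhat[k] += w          # Mhat_k: sum of the unadjusted weights in poststratum k
    return [pop_sizes[k] / mhat[k] * w for w, k in zip(weights, strata)]

strata = ["dog"] * 32 + ["cat"] * 18
weights = [1.0] * 50          # assumption: unit weights
adjusted = poststratify(weights, strata, {"dog": 850, "cat": 450})

dog_total = sum(w for w, k in zip(adjusted, strata) if k == "dog")
cat_total = sum(w for w, k in zip(adjusted, strata) if k == "cat")
print(adjusted[0], adjusted[-1], dog_total, cat_total)
```

Each sampled dog ends up representing 850/32 clients and each sampled cat 450/18 clients, so the adjusted weights sum to the poststratum sizes and to 1,300 overall.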

For replication-based variance estimation, the BRR and jackknife replicate-weight variables are
similarly adjusted to produce the replicate values used in the respective variance formulas.
The score variable for the linearized variance estimator of a poststratified total is

$$z_j(\widehat{Y}^P) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\widehat{M}_k} \left( y_j - \frac{\widehat{Y}_k}{\widehat{M}_k} \right) \tag{1}$$

where $\widehat{Y}_k$ is the total estimator for the $k$th poststratum,

$$\widehat{Y}_k = \sum_{j=1}^{m} I_{P_k}(j) w_j y_j$$

For the poststratified ratio estimator, the score variable is

$$z_j(\widehat{R}^P) = \frac{\widehat{X}^P z_j(\widehat{Y}^P) - \widehat{Y}^P z_j(\widehat{X}^P)}{(\widehat{X}^P)^2} \tag{2}$$

where $\widehat{X}^P$ is the poststratified total estimator for item $x_j$. For regression models, the equation-level
scores are adjusted as in (1). These score variables were derived using the method described in
[SVY] variance estimation for the ratio estimator and are a direct result of the methods described in
Deville (1999), Demnati and Rao (2004), and Shah (2004).
References
1726.
Levy, P. S., and S. A. Lemeshow. 2008. Sampling of Populations: Methods and Applications. 4th ed. Hoboken, NJ:
Wiley.
Also see
Title
sdr options More options for SDR variance estimation
Syntax
Description
Options
Also see
Syntax
    sdr options           Description
    SE
      mse
      nodots
      saving()
      verbose
      noisily
      trace               trace command
      title(text)
      nodrop
      reject(exp)
saving(), verbose, noisily, trace, title(), nodrop, and reject() are not shown in the dialog boxes for
estimation commands.
Description
svy accepts more options when performing successive difference replication (SDR) variance
estimation. See [SVY] svy sdr for a complete discussion.
Options
SE
saving(), verbose, noisily, trace, title(), nodrop, reject(); see [SVY] svy sdr.
Also see
[SVY] svy sdr Successive difference replication for survey data
Title
subpopulation estimation Subpopulation estimation for survey data
Description
References

Also see
Description
Subpopulation estimation focuses on part of the population. This entry discusses subpopulation
estimation and explains why you should use the subpop() option instead of if and in for your
survey data analysis.

Subpopulation estimation involves computing point and variance estimates for part of the population.
This is not the same as restricting the estimation sample to the collection of observations within
the subpopulation because variance estimation for survey data measures sample-to-sample variability,
assuming that the same survey design is used to collect the data; see Methods and formulas
for a detailed explanation. West, Berglund, and Heeringa (2008) provide further information on
subpopulation analysis.
The svy prefix command's subpop() option performs subpopulation estimation. The svy: mean,
svy: proportion, svy: ratio, and svy: total commands also have the over() option to perform
estimation for multiple subpopulations.
The following examples illustrate how to use the subpop() and over() options.
Example 1
Suppose that we are interested in estimating the proportion of women in our population who have
had a heart attack. In our NHANES II dataset (McDowell et al. 1981), the female participants can
be identified using the female variable, and the heartatk variable indicates whether an individual
has ever had a heart attack. Below we use svy: mean with the heartatk variable to estimate the
proportion of individuals who have had a heart attack, and we use subpop(female) to identify our
subpopulation of interest.
. svy, subpop(female): mean heartatk
Number of strata =      31          Number of obs    =     10349
Number of PSUs   =      62          Population size  = 117131111
                                    Subpop. no. obs  =      5434
                                    Subpop. size     =  60971631
                                    Design df        =        31

                            Linearized
                    Mean    Std. Err.    [95% Conf. Interval]
    heartatk    .0193276    .0017021     .0158562    .0227991
The subpop(varname) option takes a 0/1 variable, and the subpopulation of interest is defined by
varname = 1. All other members of the sample not in the subpopulation are indicated by varname = 0.
If a person's subpopulation status is unknown, varname should be set to missing (.), so those
observations will be omitted from the analysis. For instance, in the preceding analysis, if a person's
sex was not recorded, female should be coded as missing rather than as male (female = 0).
Technical note
Actually, the subpop(varname) option takes a zero/nonzero variable, and the subpopulation is
defined by varname ≠ 0 and not missing. All other members of the sample not in the subpopulation
are indicated by varname = 0, but 0, 1, and missing are typically the only values used for the
subpop() variable.
Furthermore, you can specify an if qualifier within subpop() to identify a subpopulation. The
result is the same as generating a variable equal to the conditional expression and supplying it as the
subpop() variable. If a varname and an if qualifier are specified within the subpop() option, the
subpopulation is identified by their logical conjunction (logical and ), and observations with missing
values in either are dropped from the estimation sample.
Example 2: Multiple subpopulation estimation

Means, proportions, ratios, and totals for multiple subpopulations can be estimated using the
over() option with svy: mean, svy: proportion, svy: ratio, and svy: total, respectively.
Here is an example using the NMIHS data (Gonzalez, Krauss, and Scott 1992), estimating mean
birthweight over the categories of the race variable.
. svy: mean birthwgt, over(race)
Number of strata =       6          Number of obs    =      9946
Number of PSUs   =    9946          Population size  =   3895562
                                    Design df        =      9940
     nonblack: race = nonblack
        black: race = black

                            Linearized
        Over        Mean    Std. Err.    [95% Conf. Interval]
birthwgt
    nonblack     3402.32    7.609532     3387.404    3417.236
       black    3127.834    6.529814     3115.035    3140.634
More than one variable can be used in the over() option.

. svy: mean birthwgt, over(race marital)
Number of strata =       6          Number of obs    =      9946
Number of PSUs   =    9946          Population size  =   3895562
                                    Design df        =      9940
Over: race marital
    _subpop_1: nonblack single
    _subpop_2: nonblack married
    _subpop_3: black single
    _subpop_4: black married

                            Linearized
        Over        Mean    Std. Err.    [95% Conf. Interval]
birthwgt
   _subpop_1    3291.045    20.18795     3251.472    3330.617
   _subpop_2    3426.407    8.379497     3409.982    3442.833
   _subpop_3    3073.122    8.752553     3055.965    3090.279
   _subpop_4    3221.616    12.42687     3197.257    3245.975
Here the race and marital variables have value labels. race has the value 0 labeled nonblack
(that is, white and other) and 1 labeled black; marital has the value 0 labeled single and 1
labeled married. Value labels on the over() variables make for a more informative legend above
the table of point estimates. See [U] 12.6.3 Value labels for information on creating value labels.
We can also combine the subpop() option with the over() option.
. generate nonblack = (race == 0) if !missing(race)
. svy, subpop(nonblack): mean birthwgt, over(marital age20)
Number of strata =       3          Number of obs    =      4724
Number of PSUs   =    4724          Population size  =   3230403
                                    Subpop. no. obs  =      4724
                                    Subpop. size     =   3230403
                                    Design df        =      4721
Over: marital age20
    _subpop_1: single age20+
    _subpop_2: single age<20
    _subpop_3: married age20+
    _subpop_4: married age<20

                            Linearized
        Over        Mean    Std. Err.    [95% Conf. Interval]
birthwgt
   _subpop_1    3312.012     24.2869     3264.398    3359.625
   _subpop_2    3244.709    36.85934     3172.448    3316.971
   _subpop_3    3434.923    8.674633     3417.916    3451.929
   _subpop_4    3287.301    34.15988     3220.332    3354.271
Note: 3 strata omitted because they contain no subpopulation members.
This time, we estimated means for the marital status and age (<20 or ≥20) subpopulations for race
== 0 (nonblack) only. We carefully define nonblack so that it is missing when race is missing.
If we omitted the if !missing(race) in our generate statement, then nonblack would be 0
when race was missing. This would improperly assume that all individuals with a missing value
for race were black and could cause our results to have incorrect standard errors. The standard
errors could be incorrect because those observations for which race is missing would be counted as
part of the estimation sample, potentially inflating the number of PSUs used in the formula for the
variance estimator. For this reason, observations with missing values for any of the over() variables
are omitted from the analysis.
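The coding rule can be illustrated outside Stata. In the Python sketch below, None stands in for Stata's missing value, the race values are hypothetical, and the helper name is an assumption; the first coding mirrors generate nonblack = (race == 0) if !missing(race), and the second shows what happens when the missing-value guard is dropped.

```python
def subpop_indicator(values, in_subpop):
    """Code 1 for members, 0 for nonmembers, and None when membership is unknown."""
    return [None if v is None else int(in_subpop(v)) for v in values]

race = [0, 1, None, 0]   # hypothetical values; None plays the role of missing (.)

# Careful coding: unknown race stays unknown.
nonblack = subpop_indicator(race, lambda r: r == 0)
print(nonblack)          # [1, 0, None, 1]

# Naive coding: unknown race is silently coded 0, that is, treated as black.
naive = [int(r == 0) for r in race]
print(naive)             # [1, 0, 0, 1]
```

With the careful coding, the third observation is excluded from the estimation sample instead of being counted as a nonmember of the subpopulation.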

Cochran (1977, sec. 2.13) discusses a method by which you can derive estimates for subpopulation
totals. This section uses this method to derive the formulas for a subpopulation total from a simple
random sample (without replacement) to explain how the subpop() option works, shows why this
method will often produce different results from those produced using an equivalent if (or in)
qualifier (outside the subpop() option), and discusses how this method applies to subpopulation
means, proportions, ratios, and regression models.
Subpopulation totals
Subpopulation estimates other than the total
Subpopulation with replication methods
Subpopulation totals
Let $Y_j$ be a survey item for individual $j$ in the population, where $j = 1, \ldots, N$ and $N$ is the
population size. Let $S$ be a subset of individuals in the population and $I_S(j)$ indicate if the $j$th
individual is in $S$, where

$$I_S(j) = \begin{cases} 1, & \text{if } j \in S \\ 0, & \text{otherwise} \end{cases}$$

The subpopulation total is

$$Y_S = \sum_{j=1}^{N} I_S(j) Y_j$$

and the subpopulation size is

$$N_S = \sum_{j=1}^{N} I_S(j)$$

Let $y_j$ be the items for those individuals selected in the sample, where $j = 1, \ldots, n$ and $n$ is the
sample size. The number of individuals sampled from the subpopulation is

$$n_S = \sum_{j=1}^{n} I_S(j)$$

The estimator for the subpopulation total is

$$\widehat{Y}_S = \sum_{j=1}^{n} I_S(j) w_j y_j \tag{1}$$

where $w_j = N/n$ is the unadjusted sampling weight for this design. The estimator for $N_S$ is

$$\widehat{N}_S = \sum_{j=1}^{n} I_S(j) w_j$$

The replicate values for the BRR and jackknife variance estimators are computed using the same
method.
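The linearized variance estimator for $\widehat{Y}_S$ is

$$\widehat{V}(\widehat{Y}_S) = \left(1 - \frac{n}{N}\right) \frac{n}{n-1} \sum_{j=1}^{n} \left\{ I_S(j) w_j y_j - \frac{1}{n} \widehat{Y}_S \right\}^2 \tag{2}$$

The covariance estimator for the subpopulation totals $\widehat{Y}_S$ and $\widehat{X}_S$ (notation for $X_S$ is defined similarly
to that of $Y_S$) is

$$\widehat{\mathrm{Cov}}(\widehat{Y}_S, \widehat{X}_S) = \left(1 - \frac{n}{N}\right) \frac{n}{n-1} \sum_{j=1}^{n} \left\{ I_S(j) w_j y_j - \frac{1}{n} \widehat{Y}_S \right\} \left\{ I_S(j) w_j x_j - \frac{1}{n} \widehat{X}_S \right\} \tag{3}$$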
Equation (2) is not the same formula that results from restricting the estimation sample to
the observations within $S$. The formula using this restricted sample (assuming a svyset with the
corresponding FPC) is

$$\widetilde{V}(\widehat{Y}_S) = \left(1 - \frac{n_S}{\widehat{N}_S}\right) \frac{n_S}{n_S - 1} \sum_{j=1}^{n} I_S(j) \left\{ w_j y_j - \frac{1}{n_S} \widehat{Y}_S \right\}^2 \tag{4}$$

These variance estimators, (2) and (4), assume two different survey designs. In (2), $n$ individuals are
sampled without replacement from the population comprising the $N_S$ values from the subpopulation
with $N - N_S$ additional zeros. In (4), $n_S$ individuals are sampled without replacement from the
subpopulation of $N_S$ values. We discourage using (4) by warning against using the if and in
qualifiers for subpopulation estimation because this variance estimator does not accurately measure
the sample-to-sample variability of the subpopulation estimates for the survey design that was used
to collect the data.
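A small numeric illustration makes the difference between the two estimators concrete. The Python sketch below is an illustration only, with a hypothetical sample and function name; it assumes a simple random sample drawn without replacement with weight N/n and computes the full-sample estimator (2) and the restricted-sample estimator (4) for the same subpopulation total.

```python
def subpop_total_variances(y, in_s, N):
    """Return (total, V2, V4) for a simple random sample drawn without replacement.

    V2 follows equation (2): all n sampled units are kept, with zeros outside S.
    V4 follows equation (4): the sample is restricted to the n_S units in S.
    """
    n = len(y)
    w = N / n                                   # sampling weight w_j = N/n
    z = [w * yj if s else 0.0 for yj, s in zip(y, in_s)]
    total = sum(z)                              # estimated subpopulation total
    v2 = (1 - n / N) * n / (n - 1) * sum((zj - total / n) ** 2 for zj in z)

    n_s = sum(1 for s in in_s if s)
    N_s_hat = w * n_s                           # estimated subpopulation size
    v4 = (1 - n_s / N_s_hat) * n_s / (n_s - 1) * sum(
        (zj - total / n_s) ** 2 for zj, s in zip(z, in_s) if s
    )
    return total, v2, v4

# Hypothetical sample: n = 5 units from N = 50, three of them in S.
total, v2, v4 = subpop_total_variances([1, 2, 3, 4, 5], [1, 1, 0, 0, 1], N=50)
print(total, v2, v4)   # about 80, 1935, and 1170
```

Both estimators use the same point estimate of the total, but in this sample the restricted-sample variance is much smaller because it ignores the variability in how many sampled units fall inside the subpopulation.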
For survey data, there are only a few circumstances that require using the if qualifier. For example,
if you suspected laboratory error for a certain set of measurements, then using the if qualifier to
omit these observations from the analysis might be proper.
Subpopulation estimates other than the total

To generalize the above results, note that the other point estimators, such as means, proportions,
ratios, and regression coefficients, yield a linearized variance estimator based on one or more
(equation-level) score variables. For example, the weighted sample estimation equations of a regression
model for a given subpopulation (see (3) from [SVY] variance estimation) are

$$\widehat{G}(\beta_S) = \sum_{j=1}^{n} I_S(j) w_j S(\beta_S; y_j, x_j) = 0 \tag{5}$$

You can write $\widehat{G}(\beta_S)$ as

$$\widehat{G}(\beta_S) = \sum_{j=1}^{n} I_S(j) w_j d_j$$

which is an estimator for the subpopulation total $G(\beta_S)$, so its variance estimator can be computed
using the design-based variance estimator for a subpopulation total.
Subpopulation with replication methods

The above comparison between the variance estimator from the subpop() option and the variance
estimator from the if and in qualifiers is also true for the replication methods.
For the BRR method, the same number of replicates is produced with or without the subpop()
option. The difference is how the replicate values are computed. Using the if and in qualifiers may
cause an error because svy brr checks that there are two PSUs in every stratum within the restricted
sample.
For the jackknife method, every PSU produces a replicate, even if it does not contain an observation
within the subpopulation specified using the subpop() option. When the if and in qualifiers are
used, only the PSUs that have at least 1 observation within the restricted sample will produce a
replicate.
For methods using replicate weight variables, every weight variable produces a replicate, even if
it does not contain an observation within the subpopulation specified using the subpop() option.
When the if and in qualifiers are used, only the PSUs that have at least 1 observation within the
restricted sample will produce a replicate.
References
West, B. T., P. A. Berglund, and S. G. Heeringa. 2008. A closer examination of subpopulation analysis of complex-sample
survey data. Stata Journal 8: 520531.
Also see
Title
svy The survey prefix command
Syntax
Stored results
Description
Options
References

Also see
Syntax
    svy [vcetype] [, svy options eform option] : command

    vcetype               Description
    SE
      linearized          Taylor-linearized variance estimation
      bootstrap           bootstrap variance estimation; see [SVY] svy bootstrap
      brr                 BRR variance estimation; see [SVY] svy brr
      jackknife           jackknife variance estimation; see [SVY] svy jackknife
      sdr                 SDR variance estimation; see [SVY] svy sdr
Specifying a vcetype overrides the default from svyset.

    svy options           Description
    if/in
      subpop([varname] [if])
                          identify a subpopulation
    SE
      dof(#)              design degrees of freedom
      bootstrap options   more options allowed with bootstrap variance estimation;
                            see [SVY] bootstrap options
      brr options         more options allowed with BRR variance estimation;
                            see [SVY] brr options
      jackknife options   more options allowed with jackknife variance estimation;
                            see [SVY] jackknife options
      sdr options         more options allowed with SDR variance estimation;
                            see [SVY] sdr options
    Reporting
      level(#)            set confidence level; default is level(95)
      nocnsreport         do not display constraints
      display options     control column formats, row spacing, line width, display of omitted
                            variables and base and empty cells, and factor-variable labeling

      noheader            suppress table header
      nolegend            suppress table legend
      noadjust            do not adjust model Wald statistic
      noisily             display any output from command
      trace               trace command
      coeflegend          display legend instead of statistics
svy requires that the survey design variables be identified using svyset; see [SVY] svyset.
mi estimate may be used with svy linearized if the estimation command allows mi estimate; it may not be
used with svy bootstrap, svy brr, svy jackknife, or svy sdr.
noheader, nolegend, noadjust, noisily, trace, and coeflegend are not shown in the dialog boxes for estimation
commands.
Warning: Using if or in restrictions will often not produce correct variance estimates for subpopulations. To compute
estimates for subpopulations, use the subpop() option.
See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.
Description
svy fits statistical models for complex survey data. Typing
. svy: command
executes command while accounting for the survey settings identified by svyset.
command defines the estimation command to be executed. Not all estimation commands are
supported by svy. See [SVY] svy estimation for a list of Stata's estimation commands that are
supported by svy. See [P] program properties for a discussion of what is required for svy to support
an estimation command. The by prefix may not be part of command.
Options
if/in
subpop(subpop) specifies that estimates be computed for the single subpopulation identified by
subpop, which is
    [varname] [if]
Thus the subpopulation is defined by the observations for which varname ≠ 0 that also meet
the if conditions. Typically, varname = 1 defines the subpopulation, and varname = 0 indicates
observations not belonging to the subpopulation. For observations whose subpopulation status is
uncertain, varname should be set to a missing value; such observations are dropped from the
estimation sample.
See [SVY] subpopulation estimation and [SVY] estat.
SE
dof(#) specifies the design degrees of freedom, overriding the default calculation, df = N_psu − N_strata.
bootstrap options are other options that are allowed with bootstrap variance estimation specified by svy
bootstrap or specified as svyset using the vce(bootstrap) option; see [SVY] bootstrap options.
brr options are other options that are allowed with BRR variance estimation specified by svy brr or
specified as svyset using the vce(brr) option; see [SVY] brr options.
jackknife options are other options that are allowed with jackknife variance estimation specified by svy
jackknife or specified as svyset using the vce(jackknife) option; see [SVY] jackknife options.
sdr options are other options that are allowed with SDR variance estimation specified by svy sdr or
specified as svyset using the vce(sdr) option; see [SVY] sdr options.
Reporting
level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is
level(95) or as set by set level; see [U] 20.7 Specifying the width of confidence intervals.
nocnsreport; see [R] estimation options.
display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#),
fvwrapon(style), cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R] estimation options.
The following options are available with svy but are not shown in the dialog boxes:
noheader prevents the table header from being displayed. This option implies nolegend.
nolegend prevents the table legend identifying the subpopulations from being displayed.
noadjust specifies that the model Wald test be carried out as W/k ~ F(k, d), where W is the Wald
test statistic, k is the number of terms in the model excluding the constant term, d is the total
number of sampled PSUs minus the total number of strata, and F(k, d) is an F distribution with
k numerator degrees of freedom and d denominator degrees of freedom. By default, an adjusted
Wald test is conducted: (d − k + 1)W/(kd) ~ F(k, d − k + 1).
See Korn and Graubard (1990) for a discussion of the Wald test and the adjustments thereof. Using
the noadjust option is not recommended.
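The two forms of the test statistic are easy to compare numerically. The Python sketch below is an illustration only: the function name is an assumption, and the Wald statistic W = 542.5 is a hypothetical value. The design degrees of freedom d = 31 (62 PSUs minus 31 strata) and k = 7 model terms match the NHANES II regression in Example 3, whose header reports F(7, 25).

```python
def wald_f(W, k, d, adjust=True):
    """Return (F statistic, numerator df, denominator df) for the model Wald test."""
    if adjust:
        # Default: (d - k + 1) W / (k d), referred to F(k, d - k + 1)
        return (d - k + 1) * W / (k * d), k, d - k + 1
    # noadjust: W / k, referred to F(k, d)
    return W / k, k, d

print(wald_f(542.5, 7, 31))                 # (62.5, 7, 25)
print(wald_f(542.5, 7, 31, adjust=False))   # (77.5, 7, 31)
```

The adjustment shrinks the statistic and reduces the denominator degrees of freedom, and both effects grow as k approaches d.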
noisily requests that any output from command be displayed.
trace causes a trace of the execution of command to be displayed.
coeflegend; see [R] estimation options.
The following option is usually available with svy at the time of estimation or on replay but is not
shown in all dialog boxes:
eform option; see [R] eform option.

The svy prefix is designed for use with complex survey data. Typical survey design characteristics
include sampling weights, one or more stages of clustered sampling, and stratification. For a general
discussion of various aspects of survey designs, including multistage designs, see [SVY] svyset.
Below we present an example of the effects of weights, clustering, and stratification. This is a
typical case, but drawing general rules from any one example is still dangerous. You could find
particular analyses from other surveys that are counterexamples for each of the trends for standard
errors exhibited here.
Example 1: The effects of weights, clustering, and stratification

We use data from the Second National Health and Nutrition Examination Survey (NHANES II)
(McDowell et al. 1981) as our example. This is a national survey, and the dataset has sampling
weights, strata, and clustering. In this example, we will consider the estimation of the mean serum
zinc level of all adults in the United States.
First, consider a proper design-based analysis, which accounts for weighting, clustering, and
stratification. Before we issue our svy estimation command, we set the weight, strata, and PSU
identifier variables:
. use http://www.stata-press.com/data/r13/nhanes2f
. svyset psuid [pweight=finalwgt], strata(stratid)
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
We now estimate the mean by using the proper design-based analysis:

. svy: mean zinc
Number of strata =      31          Number of obs    =      9189
Number of PSUs   =      62          Population size  = 104176071
                                    Design df        =        31

                            Linearized
                    Mean    Std. Err.    [95% Conf. Interval]
        zinc    87.18207    .4944827     86.17356    88.19057
If we ignore the survey design and use mean to estimate the mean, we get
. mean zinc
Mean estimation                     Number of obs    =      9189

                    Mean    Std. Err.    [95% Conf. Interval]
        zinc    86.51518    .1510744     86.21904    86.81132
The point estimate from the unweighted analysis is smaller by more than one standard error than
the proper design-based estimate. Also, design-based analysis produced a standard error that is 3.27
times larger than the standard error produced by our incorrect analysis.
Example 2: Halfway is not enough: the importance of stratification and clustering

When some people analyze survey data, they say, "I know I have to use my survey weights, but
I will just ignore the stratification and clustering information." If we follow this strategy, we will
obtain the proper design-based point estimates, but our standard errors, confidence intervals, and test
statistics will usually be wrong.
To illustrate this effect, suppose that we used the svy: mean procedure with pweights only.
. svyset [pweight=finalwgt]
pweight: finalwgt
VCE: linearized
Strata 1: <one>
FPC 1: <zero>
. svy: mean zinc
Number of strata =       1          Number of obs    =      9189
Number of PSUs   =    9189          Population size  = 104176071
                                    Design df        =      9188

                            Linearized
                    Mean    Std. Err.    [95% Conf. Interval]
        zinc    87.18207    .1828747     86.82359    87.54054
This approach gives us the same point estimate as our design-based analysis, but the reported
standard error is less than one-half the design-based standard error. If we accounted only for clustering
and weights and ignored stratification in NHANES II, we would obtain the following analysis:
. svyset psuid [pweight=finalwgt]
pweight: finalwgt
VCE: linearized
Strata 1: <one>
SU 1: psuid
FPC 1: <zero>
. svy: mean zinc
Number of strata =       1          Number of obs    =      9189
Number of PSUs   =       2          Population size  = 104176071
                                    Design df        =         1

                            Linearized
                    Mean    Std. Err.    [95% Conf. Interval]
        zinc    87.18207    .7426221     77.74616    96.61798
Here our standard error is about 50% larger than what we obtained in our proper design-based analysis.
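The comparisons quoted in Examples 1 and 2 follow directly from the reported standard errors. The short Python check below only redoes that arithmetic; the variable names are illustrative.

```python
se_design = 0.4944827        # weights, strata, and PSUs (proper design-based analysis)
se_ignore_all = 0.1510744    # unweighted mean with the design ignored
se_weights_only = 0.1828747  # weights only
se_no_strata = 0.7426221     # weights and clustering, stratification ignored

print(round(se_design / se_ignore_all, 2))     # 3.27
print(round(se_weights_only / se_design, 2))   # 0.37
print(round(se_no_strata / se_design, 2))      # 1.5
```

The ratios reproduce the statements in the text: the design-based standard error is 3.27 times the naive one, the weights-only standard error is less than one-half of it, and ignoring stratification inflates it by about 50%.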
Example 3
Let's look at a regression. We model zinc on the basis of age, weight, sex, race, and rural or urban
residence. We compare a proper design-based analysis with an ordinary regression (which assumes
independent and identically distributed error).
Here is our design-based analysis:

. svyset psuid [pweight=finalwgt], strata(stratid)
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
. svy: regress zinc age c.age#c.age weight female black orace rural
Number of strata =      31          Number of obs    =      9189
Number of PSUs   =      62          Population size  = 104176071
                                    Design df        =        31
                                    F(7, 25)         =     62.50
                                    Prob > F         =    0.0000
                                    R-squared        =    0.0698

                            Linearized
        zinc       Coef.    Std. Err.        t    P>|t|    [95% Conf. Interval]
         age   -.1701161    .0844192     -2.02    0.053   -.3422901     .002058
 c.age#c.age    .0008744    .0008655      1.01    0.320   -.0008907    .0026396
      weight    .0535225    .0139115      3.85    0.001    .0251499    .0818951
      female   -6.134161    .4403625    -13.93    0.000   -7.032286   -5.236035
       black   -2.881813    1.075958     -2.68    0.012   -5.076244    -.687381
       orace   -4.118051    1.621121     -2.54    0.016   -7.424349   -.8117528
       rural   -.5386327    .6171836     -0.87    0.390   -1.797387    .7201216
       _cons    92.47495    2.228263     41.50    0.000    87.93038    97.01952
If we had improperly ignored our survey weights, stratification, and clustering (that is, if we had
used the usual Stata regress command), we would have obtained the following results:
. regress zinc age c.age#c.age weight female black orace rural
      Source          SS       df          MS          Number of obs =    9189
                                                       F(7, 9181)    =   79.72
       Model  110417.827        7  15773.9753          Prob > F      =  0.0000
    Residual   1816535.3     9181   197.85811          R-squared     =  0.0573
                                                       Adj R-squared =  0.0566
       Total  1926953.13     9188  209.724982          Root MSE      =  14.066

        zinc       Coef.    Std. Err.        t    P>|t|    [95% Conf. Interval]
         age    -.090298    .0638452     -1.41    0.157   -.2154488    .0348528
 c.age#c.age   -.0000324    .0006788     -0.05    0.962   -.0013631    .0012983
      weight    .0606481    .0105986      5.72    0.000    .0398725    .0814237
      female   -5.021949    .3194705    -15.72    0.000   -5.648182   -4.395716
       black   -2.311753    .5073536     -4.56    0.000   -3.306279   -1.317227
       orace   -3.390879    1.060981     -3.20    0.001   -5.470637   -1.311121
       rural   -.0966462    .3098948     -0.31    0.755   -.7041089    .5108166
       _cons    89.49465    1.477528     60.57    0.000    86.59836    92.39093
The point estimates differ by 3%–100%, and the standard errors for the proper design-based analysis
are 30%–110% larger. The differences are not as dramatic as we saw with the estimation of the
mean, but they are still substantial.
Stored results
svy stores the following in e():
Scalars
    e(N)                    number of observations
    e(N_sub)                subpopulation observations
    e(N_strata)             number of strata
    e(N_strata_omit)        number of strata omitted
    e(singleton)            1 if singleton strata, 0 otherwise
    e(census)               1 if census data, 0 otherwise
    e(F)                    model F statistic
    e(df_m)                 model degrees of freedom
    e(df_r)                 variance degrees of freedom
    e(N_pop)                estimate of population size
    e(N_subpop)             estimate of subpopulation size
    e(N_psu)                number of sampled PSUs
    e(stages)
    e(k_eq)                 number of equations in e(b)
    e(k_aux)                number of ancillary parameters
    e(p)                    p-value
    e(rank)                 rank of e(V)
Macros
    e(prefix)               svy
    e(cmdname)              command name from command
    e(cmd)                  same as e(cmdname) or e(vce)
    e(command)              command
    e(cmdline)              command as typed
    e(wtype)                weight type
    e(wexp)                 weight expression
    e(wvar)
    e(singleunit)
    e(strata)               strata() variable
    e(strata#)
    e(psu)                  psu() variable
    e(su#)
    e(fpc)                  fpc() variable
    e(fpc#)                 FPC for stage #
    e(title)                title in estimation output
    e(poststrata)
    e(postweight)
    e(vce)
    e(vcetype)              title used to label Std. Err.
    e(mse)                  mse, if specified
    e(subpop)               subpop from subpop()
    e(adjust)               noadjust, if specified
    e(properties)           b V
    e(estat_cmd)            program used to implement estat
    e(predict)              program used to implement predict
    e(marginsnotok)         predictions disallowed by margins
Matrices
    e(b)                    estimates
    e(V)                    design-based variance
    e(V_srs)                simple-random-sampling-without-replacement variance, $\widehat{V}_{\mathrm{srswor}}$
    e(V_srssub)             subpopulation simple-random-sampling-without-replacement variance, $\widehat{V}_{\mathrm{srswor}}$
                              (created only when subpop() is specified)
    e(V_srswr)              simple-random-sampling-with-replacement variance, $\widehat{V}_{\mathrm{srswr}}$
                              (created only when fpc() option is svyset)
    e(V_srssubwr)           subpopulation simple-random-sampling-with-replacement variance, $\widehat{V}_{\mathrm{srswr}}$
                              (created only when subpop() is specified)
    e(V_modelbased)         model-based variance
    e(V_msp)                variance from misspecified model fit, $\widehat{V}_{\mathrm{msp}}$
    e(_N_strata_single)     number of strata with one sampling unit
    e(_N_strata_certain)    number of certainty strata
    e(_N_strata)            number of strata
Functions
    e(sample)               marks estimation sample
svy also carries forward most of the results already in e() from command.
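The stored results can be inspected directly after any svy estimation. The following is a minimal sketch, assuming the nhanes2f dataset used elsewhere in this manual is already svyset; the particular model is chosen only for illustration.

```stata
use http://www.stata-press.com/data/r13/nhanes2f, clear
svy: regress zinc age weight female
* List everything left behind in e()
ereturn list
* Variance degrees of freedom: sampled PSUs minus strata
display e(N_psu) - e(N_strata)
display e(df_r)
* Design-based variance and its simple-random-sampling counterpart
matrix list e(V)
matrix list e(V_srs)
```

Comparing e(V) with e(V_srs) gives a rough sense of the design effect that estat effects reports more formally.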

See [SVY] variance estimation for all the details behind the point estimate and variance calculations
made by svy.
References
Korn, E. L., and B. I. Graubard. 1990. Simultaneous testing of regression coefficients with complex survey data: Use
of Bonferroni t statistics. American Statistician 44: 270–276.
Also see
[U] 20 Estimation and postestimation commands

Title
svy bootstrap Bootstrap for survey data
Syntax
    svy bootstrap exp_list [, svy_options bootstrap_options eform_option] : command

svy_options                 Description
  if/in
    subpop([varname] [if])
  Reporting
    level(#)
    noheader
    nolegend
    noadjust
    nocnsreport
    display_options
    coeflegend

coeflegend is not shown in the dialog boxes for estimation commands.

bootstrap_options           Description
  Main
    bsn(#)
  Options
    saving(filename, ...)   save results to filename; save statistics in double precision;
                            save results to filename every # replications
    mse
  Reporting
    verbose
    nodots
    noisily
    trace                   trace command
    title(text)             use text as title for bootstrap results
  Advanced
    nodrop
    reject(exp)
    dof(#)

svy bootstrap requires that the bootstrap replicate weights be identified using svyset.

exp_list contains           (name: elist)
                            elist
                            eexp
elist contains              newvarname = (exp)
                            (exp)
eexp is                     specname
                            [eqno]specname
specname is                 _b
                            _b[]
                            _se
                            _se[]
eqno is                     ##
                            name

exp is a standard Stata expression; see [U] 13 Functions and expressions.

Distinguish between the [ ] shown in specname and eexp, which are to be typed, and the [ ] in the
syntax diagram, which indicate optional arguments.
Menu
Statistics > Survey data analysis > Resampling > Bootstrap estimation
Description
svy bootstrap performs bootstrap replication for complex survey data. Typing
. svy bootstrap exp_list: command
executes command once for each replicate, using sampling weights that are adjusted according to the
bootstrap methodology.
command defines the statistical command to be executed. Most Stata commands and user-written
programs can be used with svy bootstrap as long as they follow standard Stata syntax, allow the
if qualifier, and allow pweights and iweights; see [U] 11 Language syntax. The by prefix may
not be part of command.
exp_list specifies the statistics to be collected from the execution of command. exp_list is required
unless command has the svyb program property, in which case exp_list defaults to _b; see [P] program
properties.
Options
svy options; see [SVY] svy.
Main
bsn(#) specifies that # bootstrap replicate-weight variables were used to generate each bootstrap
mean-weight variable specified in the bsrweight() option of svyset. The default is bsn(1).
The bsn() option of svy bootstrap overrides the bsn() option of svyset; see [SVY] svyset.
Options

saving(filename[, suboptions]) creates a Stata data file (.dta file) consisting of (for each statistic
in exp_list) a variable containing the replicates.
double specifies that the results for each replication be saved as doubles, meaning 8-byte reals.
By default, they are saved as floats, meaning 4-byte reals. This option may be used without
the saving() option to compute the variance estimates by using double precision.
every(#) specifies that results be written to disk every #th replication. every() should be specified
in conjunction with saving() only when command takes a long time for each replication.
This will allow recovery of partial results should some other software crash your computer.
See [P] postfile.
replace indicates that filename be overwritten if it exists. This option is not shown on the dialog
box.
mse specifies that svy bootstrap compute the variance by using deviations of the replicates from the
observed value of the statistics based on the entire dataset. By default, svy bootstrap computes
the variance by using deviations of the replicates from their mean.
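The difference between the two forms can be seen with a small numeric sketch. All values below are hypothetical, and the divisor is assumed to be the number of replicates; see [SVY] variance estimation for the formula that svy bootstrap actually uses.

```stata
mata:
// Hypothetical values for illustration only
theta_hat = 10                              // estimate from the full dataset
reps = (9.8 \ 10.4 \ 10.1 \ 9.9 \ 10.3)     // replicate estimates
r = rows(reps)

// Default: deviations of the replicates from their own mean
v_mean = sum((reps :- mean(reps)):^2) / r
// mse: deviations of the replicates from the full-dataset estimate
v_mse = sum((reps :- theta_hat):^2) / r

// Display both, plus the squared gap between replicate mean and estimate
v_mean, v_mse, (mean(reps) - theta_hat)^2
end
```

Under these assumptions the mse form exceeds the default by the squared difference between the replicate mean and the full-dataset estimate, so it can never be smaller.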
Reporting
verbose requests that the full table legend be displayed.
nodots suppresses display of the replication dots. By default, one dot character is printed for each
successful replication. A red x is printed if command returns with an error, and e is printed if
one of the values in exp_list is missing.
noisily requests that any output from command be displayed. This option implies the nodots
option.
trace causes a trace of the execution of command to be displayed. This option implies the noisily
option.
title(text) specifies a title to be displayed above the table of bootstrap results; the default title is
Bootstrap results.
eform_option; see [R] eform_option. This option is ignored if exp_list is not _b.
Advanced
nodrop prevents observations outside e(sample) and the if and in qualifiers from being dropped
before the data are resampled.
reject(exp) identifies an expression that indicates when results should be rejected. When exp is
true, the resulting values are reset to missing values.

The bootstrap methods for survey data used in recent years are largely due to McCarthy and
Snowden (1985), Rao and Wu (1988), and Rao, Wu, and Yue (1992). For example, Yeo, Mantel, and
Liu (1999) cites Rao, Wu, and Yue (1992) as the method for variance estimation used in the National
Population Health Survey conducted by Statistics Canada.
In the survey bootstrap, the model is fit multiple times, once for each of a set of adjusted sampling
weights. The variance is estimated using the resulting replicated point estimates.
Example 1
Suppose that we need to estimate the average birthweight for the population represented by the
National Maternal and Infant Health Survey (NMIHS) (Gonzalez, Krauss, and Scott 1992).
In [SVY] svy estimation, the dataset nmihs.dta contained the following design information:
Primary sampling units are mothers; that is, PSUs are individual observations, and there is no
separate PSU variable.
The finwgt variable contains the sampling weights.
The stratan variable identifies strata.
There is no variable for the finite population correction.
nmihs_bs.dta is equivalent to nmihs.dta except that the stratum identifier variable stratan is
replaced by bootstrap replicate-weight variables. The replicate-weight variables are already svyset,
and the default method for variance estimation is vce(bootstrap).
. use http://www.stata-press.com/data/r13/nmihs_bs
. svyset
pweight: finwgt
VCE: bootstrap
MSE: off
bsrweight: bsrw1 bsrw2 bsrw3 bsrw4 bsrw5 bsrw6 bsrw7 bsrw8 bsrw9 bsrw10
bsrw11 bsrw12 bsrw13 bsrw14 bsrw15 bsrw16 bsrw17 bsrw18 bsrw19
(output omitted )
bsrw989 bsrw990 bsrw991 bsrw992 bsrw993 bsrw994 bsrw995
bsrw996 bsrw997 bsrw998 bsrw999 bsrw1000
Strata 1: <one>
FPC 1: <zero>
Now we can use svy: mean to estimate the average birthweight for our population, and the standard
errors will be estimated using the survey bootstrap.
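. svy, nodots: mean birthwgt

                                  Number of obs    =       9946
                                  Population size  =    3895562
                                  Replications     =       1000

--------------------------------------------------------------
             |   Observed   Bootstrap         Normal-based
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    birthwgt |   3355.452   6.520637      3342.672    3368.233
--------------------------------------------------------------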
From these results, we are 95% confident that the mean birthweight for our population is between
3,343 and 3,368 grams.
To accommodate privacy concerns, many public-use datasets contain replicate-weight variables

derived from the mean bootstrap described by Yung (1997). In the mean bootstrap, each adjusted
weight is derived from more than one bootstrap sample. When replicate-weight variables for the
mean bootstrap are svyset, the bsn() option identifying the number of bootstrap samples used to
generate the adjusted-weight variables should also be specified. This number is used in the variance
calculation; see [SVY] variance estimation.
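As a sketch of how such a design might be declared by hand, the commands below clear the stored settings of the nmihs_mbs dataset (used in the next example) and set them again. The wildcard mbsrw* assumes that the mean bootstrap weights are the only variables whose names begin with mbsrw.

```stata
use http://www.stata-press.com/data/r13/nmihs_mbs, clear
* Discard the stored design, then declare it again
svyset, clear
svyset [pweight=finwgt], vce(bootstrap) bsrweight(mbsrw*) bsn(5)
svy, nodots: mean birthwgt
* bsn() given to svy bootstrap overrides the value stored by svyset
svy bootstrap, bsn(5) nodots: mean birthwgt
```

Omitting bsn() for mean bootstrap weights would leave the default bsn(1) in force, and the variance calculation would then not reflect how the weights were generated.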
Example 2
nmihs_mbs.dta is equivalent to nmihs.dta except that the strata identifier variable stratan is
replaced by mean bootstrap replicate-weight variables. The replicate-weight variables and variance
adjustment are already svyset, and the default method for variance estimation is vce(bootstrap).
. use http://www.stata-press.com/data/r13/nmihs_mbs
. svyset
pweight: finwgt
VCE: bootstrap
MSE: off
bsrweight: mbsrw1 mbsrw2 mbsrw3 mbsrw4 mbsrw5 mbsrw6 mbsrw7 mbsrw8 mbsrw9
mbsrw10 mbsrw11 mbsrw12 mbsrw13 mbsrw14 mbsrw15 mbsrw16
(output omitted )
mbsrw192 mbsrw193 mbsrw194 mbsrw195 mbsrw196 mbsrw197 mbsrw198
mbsrw199 mbsrw200
bsn: 5
Strata 1: <one>
FPC 1: <zero>
Notice that each of the 200 mean bootstrap replicate-weight variables was generated from 5 bootstrap
samples (bsn: 5); in fact, the mean bootstrap weight variables in nmihs_mbs.dta were generated from
the 1,000 bootstrap weight variables in nmihs_bs.dta.
Here we use svy: mean to estimate the average birthweight for our population.
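. svy, nodots: mean birthwgt

                                  Number of obs    =       9946
                                  Population size  =    3895562
                                  Replications     =        200

--------------------------------------------------------------
             |   Observed   Bootstrap         Normal-based
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    birthwgt |   3355.452   5.712574      3344.256    3366.649
--------------------------------------------------------------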
The standard error and confidence limits differ from the previous example. This merely illustrates
that the mean bootstrap is not numerically equivalent to the standard bootstrap, even when the
replicate-weight variables are generated from the same resampled datasets.
Stored results
In addition to the results documented in [SVY] svy, svy bootstrap stores the following in e():
Scalars
    e(N_reps)               number of replications
    e(N_misreps)            number of replications with missing values
    e(k_exp)                number of standard expressions
    e(k_eexp)               number of _b/_se expressions
    e(k_extra)              number of extra estimates added to _b
    e(bsn)                  bootstrap mean-weight adjustment
Macros
    e(cmdname)              command name from command
    e(cmd)                  same as e(cmdname) or bootstrap
    e(vce)                  bootstrap
    e(exp#)                 #th expression
    e(bsrweight)            bsrweight() variable list
Matrices
    e(b_bs)                 bootstrap means
    e(V)                    bootstrap variance estimates
When exp_list is _b, svy bootstrap will also carry forward most of the results already in e() from
command.

See [SVY] variance estimation for details regarding bootstrap variance estimation.
References
Kolenikov, S. 2010. Resampling variance estimation for complex survey data. Stata Journal 10: 165–199.
McCarthy, P. J., and C. B. Snowden. 1985. The bootstrap and finite population sampling. In Vital and Health Statistics,
1–23. Washington, DC: U.S. Government Printing Office.
Rao, J. N. K., and C. F. J. Wu. 1988. Resampling inference with complex survey data. Journal of the American
Statistical Association 83: 231–241.
Rao, J. N. K., C. F. J. Wu, and K. Yue. 1992. Some recent work on resampling methods for complex surveys. Survey
Methodology 18: 209–217.
Yeo, D., H. Mantel, and T.-P. Liu. 1999. Bootstrap variance estimation for the National Population Health Survey.
In Proceedings of the Survey Research Methods Section, 778–785. American Statistical Association.
Yung, W. 1997. Variance estimation for public use files under confidentiality constraints. In Proceedings of the Survey
Research Methods Section, 434–439. American Statistical Association.
Also see
[R] bootstrap Bootstrap sampling and estimation

Title
svy brr Balanced repeated replication for survey data
Syntax
    svy brr exp_list [, svy_options brr_options eform_option] : command

svy_options                 Description
  if/in
    subpop([varname] [if])
  Reporting
    level(#)
    noheader
    nolegend
    noadjust
    nocnsreport
    display_options
    coeflegend

brr_options                 Description
  Main
    hadamard(matrix)        Hadamard matrix
    fay(#)                  Fay's adjustment
  Options
    mse
  Reporting
    verbose
    nodots
    noisily
    trace                   trace command
    title(text)             use text as title for BRR results
  Advanced
    nodrop
    reject(exp)
    dof(#)

exp_list contains           (name: elist)
                            elist
                            eexp
elist contains              newvarname = (exp)
                            (exp)
eexp is                     specname
                            [eqno]specname
specname is                 _b
                            _b[]
                            _se
                            _se[]
eqno is                     ##
                            name

Menu
Statistics > Survey data analysis > Resampling > Balanced repeated replications estimation
Description
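svy brr performs balanced repeated replication (BRR) for complex survey data. Typing
. svy brr exp_list: command
executes command once for each replicate, using sampling weights that are adjusted according to the
BRR methodology.
command defines the statistical command to be executed. Most Stata commands and user-written
programs can be used with svy brr as long as they follow standard Stata syntax, allow the if
qualifier, and allow pweights and iweights; see [U] 11 Language syntax. The by prefix may not
be part of command.
exp_list specifies the statistics to be collected from the execution of command. exp_list is required
unless command has the svyb program property, in which case exp_list defaults to _b; see [P] program
properties.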
Options
Main
hadamard(matrix) specifies the Hadamard matrix to be used to determine which PSUs are chosen
for each replicate.
fay(#) specifies Fay's adjustment (Judkins 1990), where 0 ≤ # ≤ 2, but excluding 1. This option
overrides the fay(#) option of svyset; see [SVY] svyset.
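Options
saving(filename[, suboptions]) creates a Stata data file (.dta file) consisting of (for each statistic
in exp_list) a variable containing the replicates.
double specifies that the results for each replication be saved as doubles, meaning 8-byte reals.
By default, they are saved as floats, meaning 4-byte reals. This option may be used without
the saving() option to compute the variance estimates by using double precision.
every(#) specifies that results be written to disk every #th replication. every() should be specified
in conjunction with saving() only when command takes a long time for each replication.
This will allow recovery of partial results should some other software crash your computer.
See [P] postfile.
replace indicates that filename be overwritten if it exists. This option is not shown on the dialog
box.
mse specifies that svy brr compute the variance by using deviations of the replicates from the
observed value of the statistics based on the entire dataset. By default, svy brr computes the
variance by using deviations of the replicates from their mean.
Reporting
verbose requests that the full table legend be displayed.
nodots suppresses display of the replication dots. By default, one dot character is printed for each
successful replication. A red x is printed if command returns with an error, and e is printed if
one of the values in exp_list is missing.
noisily requests that any output from command be displayed. This option implies the nodots
option.
trace causes a trace of the execution of command to be displayed. This option implies the noisily
option.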
title(text) specifies a title to be displayed above the table of BRR results; the default title is BRR
results.
Advanced

BRR was first introduced by McCarthy (1966, 1969a, 1969b) as a method of variance estimation
for designs with two PSUs in every stratum. The BRR variance estimator tends to give more reasonable
variance estimates for this design than the linearized variance estimator, which can result in large
values and undesirably wide confidence intervals.
In BRR, the model is fit multiple times, once for each of a balanced set of combinations where
one PSU is dropped from each stratum. The variance is estimated using the resulting replicated point
estimates. Although the BRR method has since been generalized to include other designs, Stata's
implementation of BRR requires two PSUs per stratum.
To protect the privacy of survey participants, public survey datasets may contain replicate-weight
variables instead of variables that identify the PSUs and strata. These replicate-weight variables are
adjusted copies of the sampling weights. For BRR, the sampling weights are adjusted for dropping
one PSU from each stratum; see [SVY] variance estimation for more details.
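The weight adjustment can be sketched on a toy dataset. Everything below is hypothetical: the four PSUs, their weights, and the choice of which PSU is kept in each stratum (in practice a Hadamard matrix determines that choice). The adjustment is assumed to multiply the weight by 2 − f for a kept PSU and by f for a dropped PSU, with f = 0 unless Fay's adjustment is used.

```stata
clear
input stratum psu w
1 1 10
1 2 12
2 1  8
2 2  9
end
* Hypothetical replicate: keep PSU 1 in stratum 1 and PSU 2 in stratum 2
generate byte kept = (stratum == 1 & psu == 1) | (stratum == 2 & psu == 2)
* Plain BRR: a dropped PSU gets weight 0, a kept PSU gets double weight
generate brr_w = cond(kept, 2*w, 0)
* Fay's adjustment with f = 0.5: f*w if dropped, (2 - f)*w if kept
generate fay_w = cond(kept, (2 - 0.5)*w, 0.5*w)
list, sepby(stratum)
```

With Fay's adjustment no PSU is removed outright, which is one reason it is sometimes preferred for statistics that are unstable in half-samples.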
Example 1: BRR replicate-weight variables

The survey design for the NHANES II data (McDowell et al. 1981) is specifically suited to BRR;
there are two PSUs in every stratum.
. svydescribe
      pweight: finalwgt
          VCE: linearized
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

                                      #Obs per Unit
                              ----------------------------
 Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
       1         2       380       165     190.0       215
       2         2       185        67      92.5       118
       3         2       348       149     174.0       199
       4         2       460       229     230.0       231
       5         2       252       105     126.0       147
 (output omitted )
      29         2       503       215     251.5       288
      30         2       365       166     182.5       199
      31         2       308       143     154.0       165
      32         2       450       211     225.0       239
--------  --------  --------  --------  --------  --------
      31        62     10351        67     167.0       288
Here is a privacy-conscious dataset equivalent to the one above; all the variables and values remain,
except strata and psu are replaced with BRR replicate-weight variables. The BRR replicate-weight
variables are already svyset, and the default method for variance estimation is vce(brr).
. use http://www.stata-press.com/data/r13/nhanes2brr
. svyset
pweight: finalwgt
VCE: brr
MSE: off
brrweight: brr_1 brr_2 brr_3 brr_4 brr_5 brr_6 brr_7 brr_8 brr_9 brr_10
brr_29 brr_30 brr_31 brr_32
Strata 1: <one>
FPC 1: <zero>
Suppose that we were interested in the population ratio of weight to height. Here we use total
to estimate the population totals of weight and height and the svy brr prefix to estimate their
ratio and variance; we use total instead of ratio (which is otherwise preferable here) to illustrate
how to specify an exp list.
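. svy brr WtoH = (_b[weight]/_b[height]): total weight height
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
................................

BRR results                         Number of obs    =       10351
                                    Population size  =   117157513
                                    Replications     =          32
                                    Design df        =          31

------------------------------------------------------------------------------
             |                 BRR
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        WtoH |   .4268116   .0008904   479.36   0.000     .4249957    .4286276
------------------------------------------------------------------------------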
The mse option causes svy brr to use the MSE form of the BRR variance estimator. This variance
estimator will tend to be larger than the previous because of the addition of the familiar squared
bias term in the MSE; see [SVY] variance estimation for more details. The header for the column of
standard errors in the table of results is BRR * for the BRR variance estimator using the MSE formula.
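. svy brr WtoH = (_b[weight]/_b[height]), mse: total weight height
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
................................

BRR results                         Number of obs    =       10351
                                    Population size  =   117157513
                                    Replications     =          32
                                    Design df        =          31

------------------------------------------------------------------------------
             |               BRR *
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        WtoH |   .4268116   .0008904   479.36   0.000     .4249957    .4286276
------------------------------------------------------------------------------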
The bias term here is too small to see any difference in the standard errors.
Example 2: Survey data without replicate-weight variables

For survey data with the PSU and strata variables but no replication weights, svy brr can compute
adjusted sampling weights within its replication loop. Here the hadamard() option must be supplied
with the name of a Stata matrix that is a Hadamard matrix of appropriate order for the number of
strata in your dataset (see the following technical note for a quick introduction to Hadamard matrices).
There are 31 strata in nhanes2.dta, so we need a Hadamard matrix of order 32 (or more) to
use svy brr with this dataset. Here we use h32 (from the following technical note) to estimate the
population ratio of weight to height by using the BRR variance estimator.
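. svy brr, hadamard(h32): ratio (WtoH: weight/height)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
................................

Number of strata =      31          Number of obs    =       10351
Number of PSUs   =      62          Population size  =   117157513
                                    Replications     =          32
                                    Design df        =          31

         WtoH: weight/height

--------------------------------------------------------------
             |                 BRR
             |      Ratio   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        WtoH |   .4268116   .0008904      .4249957    .4286276
--------------------------------------------------------------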
Technical note
A Hadamard matrix is a square matrix with $r$ rows and columns that has the property
$$H_r' H_r = r I_r$$
where $I_r$ is the identity matrix of order $r$. Generating a Hadamard matrix with order $r = 2^p$ is
easily accomplished. Start with a Hadamard matrix of order 2 ($H_2$), and build your $H_r$ by repeatedly
applying Kronecker products with $H_2$. Here is the Stata code to generate the Hadamard matrix for
the previous example.
matrix h2 = (-1, 1 \ 1, 1)
matrix h32 = h2
forvalues i = 1/4 {
matrix h32 = h2 # h32
}
svy brr consumes Hadamard matrices from left to right, so it is best to make sure that r is greater
than the number of strata and that the last column is the one consisting of all 1s. This will ensure
full orthogonal balance according to Wolter (2007).
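Both properties mentioned in this note can be checked numerically. The sketch below rebuilds h32 as above and then verifies the defining property and the all-1s last column; the matrix names lhs, rhs, lastcol, and colsum are introduced here only for illustration.

```stata
matrix h2 = (-1, 1 \ 1, 1)
matrix h32 = h2
forvalues i = 1/4 {
    matrix h32 = h2 # h32
}
* Defining property: H'H should equal r times the identity, with r = 32
matrix lhs = h32' * h32
matrix rhs = 32 * I(32)
display mreldif(lhs, rhs)
* Last column: its elements should sum to 32 if every entry is 1
matrix lastcol = h32[1..., 32]
matrix colsum = J(1, 32, 1) * lastcol
display colsum[1,1]
```

Because every entry of a Hadamard matrix is 1 or −1, a column sum equal to the order is enough to confirm that the column consists entirely of 1s.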
Stored results
In addition to the results documented in [SVY] svy, svy brr stores the following in e():
Scalars
    e(N_reps)               number of replications
    e(N_misreps)            number of replications with missing values
    e(k_exp)                number of standard expressions
    e(k_eexp)               number of _b/_se expressions
    e(k_extra)              number of extra estimates added to _b
    e(fay)                  Fay's adjustment
Macros
    e(cmdname)              command name from command
    e(cmd)                  same as e(cmdname) or brr
    e(vce)                  brr
    e(brrweight)            brrweight() variable list
Matrices
    e(b_brr)                BRR means
    e(V)                    BRR variance estimates
When exp_list is _b, svy brr will also carry forward most of the results already in e() from
command.

See [SVY] variance estimation for details regarding BRR variance estimation.
References
Judkins, D. R. 1990. Fay's method for variance estimation. Journal of Official Statistics 6: 223–239.
Also see

Title
svy estimation Estimation commands for survey data
Description
Survey data analysis in Stata is essentially the same as standard data analysis. The standard syntax
applies; you just need to also remember the following:
Use svyset to identify the survey design characteristics.

Prefix the estimation commands with svy:.
For example,
. use http://www.stata-press.com/data/r13/nhanes2f
. svy: regress zinc age c.age#c.age weight female black orace rural
See [SVY] svyset and [SVY] svy.

The following estimation commands support the svy prefix:

    mean          [R] mean          Estimate means
    proportion    [R] proportion    Estimate proportions
    ratio         [R] ratio         Estimate ratios
    total         [R] total         Estimate totals

Linear regression models
    cnsreg        [R] cnsreg        Constrained linear regression
    etregress     [TE] etregress    Linear regression with endogenous treatment effects
    glm           [R] glm           Generalized linear models
    intreg        [R] intreg        Interval regression
    nl            [R] nl            Nonlinear least-squares estimation
    regress       [R] regress       Linear regression
    tobit         [R] tobit         Tobit regression
    truncreg      [R] truncreg      Truncated regression

Structural equation models
    sem           [SEM] sem         Structural equation model estimation command

Survival-data regression models
    stcox         [ST] stcox        Cox proportional hazards model
    streg         [ST] streg        Parametric survival models

Binary-response regression models
    biprobit      [R] biprobit      Bivariate probit regression
    cloglog       [R] cloglog       Complementary log-log regression
    hetprobit     [R] hetprobit     Heteroskedastic probit model
    logistic      [R] logistic      Logistic regression, reporting odds ratios
    logit         [R] logit         Logistic regression, reporting coefficients
    probit        [R] probit        Probit regression
    scobit        [R] scobit        Skewed logistic regression

Discrete-response regression models
    clogit        [R] clogit        Conditional (fixed-effects) logistic regression
    mlogit        [R] mlogit        Multinomial (polytomous) logistic regression
    mprobit       [R] mprobit       Multinomial probit regression
    ologit        [R] ologit        Ordered logistic regression
    oprobit       [R] oprobit       Ordered probit regression
    slogit        [R] slogit        Stereotype logistic regression

Poisson regression models
    gnbreg        Generalized negative binomial regression in [R] nbreg
    nbreg         [R] nbreg         Negative binomial regression
    poisson       [R] poisson       Poisson regression
    tnbreg        [R] tnbreg        Truncated negative binomial regression
    tpoisson      [R] tpoisson      Truncated Poisson regression
    zinb          [R] zinb          Zero-inflated negative binomial regression
    zip           [R] zip           Zero-inflated Poisson regression

Instrumental-variables regression models
    ivprobit      [R] ivprobit      Probit model with continuous endogenous regressors
    ivregress     [R] ivregress     Single-equation instrumental-variables regression
    ivtobit       [R] ivtobit       Tobit model with continuous endogenous regressors

Regression models with selection
    heckman       [R] heckman       Heckman selection model
    heckoprobit   [R] heckoprobit   Ordered probit model with sample selection
    heckprobit    [R] heckprobit    Probit model with sample selection
Menu
Statistics > Survey data analysis > ...
Dialog boxes for all statistical estimators that support svy can be found on the above menu path.
In addition, you can access survey data estimation from standard dialog boxes on the SE/Robust or
SE/Cluster tab.

Overview of survey analysis in Stata
Regression models
Health surveys
Overview of survey analysis in Stata

Many Stata commands estimate the parameters of a process or population by using sample data.
For example, mean estimates means, ratio estimates ratios, regress fits linear regression models,
poisson fits Poisson regression models, and logistic fits logistic regression models. Some of these
estimation commands support the svy prefix, that is, they may be prefixed by svy: to produce results
appropriate for complex survey data. Whereas poisson is used with standard, nonsurvey data, svy:
poisson is used with survey data. In what follows, we refer to any estimation command not prefixed
by svy: as the standard command. A standard command prefixed by svy: is referred to as a svy
command.
Most standard commands (and all standard commands supported by svy) allow pweights and
the vce(cluster clustvar) option, where clustvar corresponds to the PSU variable that you svyset.
If your survey data exhibit only sampling weights or first-stage clusters (or both), you can get by
with using the standard command with pweights, vce(cluster clustvar), or both. Your parameter
estimates will always be identical to those you would have obtained from the svy command, and the
standard command uses the same robust (linearization) variance estimator as the svy command with
a similarly svyset design.
Most standard commands are also fit using maximum likelihood. When used with independently
distributed, nonweighted data, the likelihood to be maximized reflects the joint probability distribution
of the data given the chosen model. With complex survey data, however, this interpretation of the
likelihood is no longer valid, because survey data are weighted, not independently distributed, or
both. Yet for survey data, (valid) parameter estimates for a given model can be obtained using the
associated likelihood function with appropriate weighting. Because the probabilistic interpretation no
longer holds, the likelihood here is instead called a pseudolikelihood, but likelihood-ratio tests are no
longer valid. See Skinner (1989, sec. 3.4.4) for a discussion of maximum pseudolikelihood estimators.
Here we highlight the other features of svy commands:
svy commands handle stratified sampling, but none of the standard commands do. Because
stratification usually makes standard errors smaller, ignoring stratification is usually conservative.
So not using svy with stratified sample data is not a terrible thing to do. However, to get the
smallest possible honest standard-error estimates for stratified sampling, use svy.
svy commands use t statistics with $n - L$ degrees of freedom to test the significance of
coefficients, where n is the total number of sampled PSUs (clusters) and L is the number of
strata in the first stage. Some of the standard commands use t statistics, but most use z statistics.
If the standard command uses z statistics for its standard variance estimator, then it also uses
z statistics with the robust (linearization) variance estimator. Strictly speaking, t statistics are
appropriate with the robust (linearization) variance estimator; see [P] robust for the theoretical
rationale. But, using z rather than t statistics yields a nontrivial difference only when there is
a small number of clusters (< 50). If a regression model command uses t statistics and the
vce(cluster clustvar) option is specified, then the degrees of freedom used is the same as
that of the svy command (in the absence of stratification).
svy commands produce an adjusted Wald test for the model test, and test can be used to
produce adjusted Wald tests for other hypotheses after svy commands. Only unadjusted Wald
tests are available if the svy prefix is not used. The adjustment can be important when the
degrees of freedom, $n - L$, is small relative to the dimension of the test. (If the dimension is
one, then the adjusted and unadjusted Wald tests are identical.) This fact along with the point
made in the second bullet make using the svy command important if the number of sampled
PSUs (clusters) is small (< 50).
svy: regress differs slightly from regress and svy: ivregress differs slightly from
ivregress in that they use different multipliers for the variance estimator. regress and
ivregress (when the small option is specified) use a multiplier of $\{(N-1)/(N-k)\}\{n/(n-1)\}$,
where $N$ is the number of observations, $n$ is the number of clusters (PSUs), and $k$ is
the number of regressors including the constant. svy: regress and svy: ivregress use
$n/(n-1)$ instead. Thus they produce slightly different standard errors. The $(N-1)/(N-k)$
is ad hoc and has no rigorous theoretical justification; hence, the purist svy commands do not
use it. The svy commands tacitly assume that $N \gg k$. If $(N-1)/(N-k)$ is not close to 1,
you may be well advised to use regress or ivregress so that some punishment is inflicted on
your variance estimates. Maximum likelihood estimators in Stata (for example, logit) do no
such adjustment but rely on the sensibilities of the analyst to ensure that N is reasonably larger
than k . Thus the maximum pseudolikelihood estimators (for example, svy: logit) produce
the same standard errors as the corresponding maximum likelihood commands (for example,
logit), but p-values are slightly different because of the point made in the second bullet.
svy commands can produce proper estimates for subpopulations by using the subpop() option.
Using an if restriction with svy or standard commands can yield incorrect standard-error estimates for subpopulations. Often an if restriction will yield the same standard error as subpop();
most other times, the two standard errors will be slightly different; but sometimes (usually for
thinly sampled subpopulations) the standard errors can be appreciably different. Hence, the
svy command with the subpop() option should be used to obtain estimates for thinly sampled
subpopulations. See [SVY] subpopulation estimation for more information.
svy commands handle zero sampling weights properly. Standard commands ignore any observation with a weight of zero. Usually, this will yield the same standard errors, but sometimes
they will differ. Sampling weights of zero can arise from various postsampling adjustment
procedures. If the sum of weights for one or more PSUs is zero, svy and standard commands
will produce different standard errors, but usually this difference is very small.
You can svyset iweights and let these weights be negative. Negative sampling weights can
arise from various postsampling adjustment procedures. If you want to use negative sampling
weights, then you must svyset iweights instead of pweights; no standard command will
allow negative sampling weights.
The svy commands compute finite population corrections (FPCs).
After a svy command, estat effects will compute the design effects DEFF and DEFT and
the misspecification effects MEFF and MEFT.
svy commands can perform variance estimation that accounts for multiple stages of clustered
sampling.
svy commands can perform variance estimation that accounts for poststratification adjustments
to the sampling weights.
Some standard options are not allowed with the svy prefix. For example, vce() and weights
cannot be specified when using the svy prefix because svy is already using the variance
estimation and sampling weights identified by svyset. Some options are not allowed with
survey data because they would be statistically invalid, such as noskip for producing optional
likelihood-ratio tests. Other options are not allowed because they change how estimation results
are reported (for example, nodisplay, first, plus) or are not compatible with svy's variance
estimation methods (for example, irls, mse1, hc2, hc3).
Estimation results are presented in the standard way, except that svy has its own table header:
In addition to the sample size, model test, and $R^2$ (if present in the output from the standard
command), svy will also report the following information in the header:
a. number of strata and PSUs
b. number of poststrata, if specified to svyset
c. population size estimate
d. subpopulation sizes, if the subpop() option was specified
e. design degrees of freedom
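Two of the points above involve simple arithmetic that can be checked directly. The sketch below uses figures that appear in this manual's NHANES II examples (62 PSUs, 31 strata, 9,189 observations, and 8 regressors including the constant) purely as illustrative inputs; the scalar names are introduced here for illustration.

```stata
* Design degrees of freedom: sampled PSUs minus strata
scalar n_psu = 62
scalar n_strata = 31
scalar df = n_psu - n_strata
* Two-sided 95% critical values: t with df degrees of freedom versus z
display "t critical value: " invttail(scalar(df), 0.025)
display "z critical value: " invnormal(0.975)

* Variance multipliers: regress-style versus svy-style
scalar n_obs = 9189
scalar k_reg = 8
scalar m_regress = ((n_obs - 1)/(n_obs - k_reg)) * (n_psu/(n_psu - 1))
scalar m_svy = n_psu/(n_psu - 1)
display "ratio of multipliers: " scalar(m_regress)/scalar(m_svy)
```

With few design degrees of freedom the t critical value is noticeably larger than the z value, while the two multipliers are nearly equal whenever the number of observations is much larger than the number of regressors.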
Use svy: mean, svy: ratio, svy: proportion, and svy: total to estimate finite population and
subpopulation means, ratios, proportions, and totals, respectively. You can also estimate standardized
means, ratios, and proportions for survey data; see [SVY] direct standardization. Estimates for multiple
subpopulations can be obtained using the over() option; see [SVY] subpopulation estimation.
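As a sketch of these options, the commands below estimate mean zinc for subpopulations defined by the female indicator. They assume the nhanes2f dataset used elsewhere in this manual is already svyset; the choice of variables is for illustration only.

```stata
use http://www.stata-press.com/data/r13/nhanes2f, clear
* Means for every level of female in a single call
svy: mean zinc, over(female)
* An if restriction, shown only for comparison with subpop()
svy: mean zinc if female
* The subpop() option keeps the full design in the variance calculation
svy, subpop(female): mean zinc
```

The point estimates from the last two commands should agree; any difference appears in the standard errors, as discussed in [SVY] subpopulation estimation.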
Example 1
Suppose that we need to estimate the average birthweight for the population represented by the
National Maternal and Infant Health Survey (NMIHS) (Gonzalez, Krauss, and Scott 1992).
First, we gather the survey design information.
Primary sampling units are mothers; that is, PSUs are individual observations, and there is no
separate PSU variable.
The finwgt variable contains the sampling weights.
The stratan variable identifies strata.
There is no variable for the finite population correction.
Then we use svyset to identify the variables for sampling weights and stratification.
. svyset [pweight=finwgt], strata(stratan)
pweight: finwgt
VCE: linearized
Strata 1: stratan
FPC 1: <zero>
Now we can use svy: mean to estimate the average birthweight for our population.
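. svy: mean birthwgt

Number of strata =       6        Number of obs    =      9946
Number of PSUs   =    9946        Population size  =   3895562
                                  Design df        =      9940

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    birthwgt |   3355.452   6.402741      3342.902    3368.003
--------------------------------------------------------------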
From these results, we are 95% confident that the mean birthweight for our population is between
3,343 and 3,368 grams.
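As a rough check on the interval above, the limits can be reproduced from the reported mean, standard error, and design degrees of freedom. The sketch below assumes that the interval is the usual t-based one (estimate plus or minus the critical value times the standard error, with 9,940 degrees of freedom). The scalar names are illustrative, and the numbers are copied from the output, so small rounding differences are to be expected.

```stata
* Reported estimate, standard error, and design df from svy: mean
scalar est = 3355.452
scalar se  = 6.402741
scalar df  = 9940

* Two-sided 95% critical value of Student's t with df degrees of freedom
scalar tcrit = invttail(scalar(df), 0.025)

* Lower and upper confidence limits
display "lower limit: " %9.3f scalar(est) - scalar(tcrit)*scalar(se)
display "upper limit: " %9.3f scalar(est) + scalar(tcrit)*scalar(se)
```

Both limits should agree with the table to within rounding of the displayed inputs.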
Regression models
As exhibited in the table at the beginning of this manual entry, many of Stata's regression model
commands support the svy prefix. If you know how to use one of these commands with standard
data, then you can also use the corresponding svy command with your survey data.
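The correspondence between a standard command and its svy counterpart can be sketched as follows. This is a minimal illustration, assuming the nhanes2d dataset (used elsewhere in this manual) is reachable and already carries its svyset design; the point estimates and standard errors from the two fits will generally differ because only the second one uses the weights, strata, and PSUs.

```stata
* nhanes2d is assumed to ship with its survey design already declared
use http://www.stata-press.com/data/r13/nhanes2d, clear

* Standard command: ignores sampling weights, strata, and PSUs
logistic highbp height weight age female

* Same model with the svy prefix: uses the design identified by svyset
svy: logistic highbp height weight age female
```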
Example 2
Let's model the incidence of high blood pressure with a dataset from the Second National Health and
Nutrition Examination Survey (NHANES II) (McDowell et al. 1981). The survey design characteristics
are already svyset, so we will just replay them.
. svyset
pweight: finalwgt
VCE: linearized
Strata 1: strata
SU 1: psu
FPC 1: <zero>
Now we can use svy: logistic to model the incidence of high blood pressure as a function of
height, weight, age, and sex (using the female indicator variable).

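. svy: logistic highbp height weight age female

Number of strata   =        31        Number of obs     =      10351
Number of PSUs     =        62        Population size   =  117157513
                                      Design df         =         31
                                      F(   4,     28)   =     368.33
                                      Prob > F          =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9657022   .0051511    -6.54   0.000     .9552534    .9762654
      weight |   1.053023   .0026902    20.22   0.000     1.047551    1.058524
         age |   1.050059   .0019761    25.96   0.000     1.046037    1.054097
      female |   .6272129   .0368195    -7.95   0.000     .5564402     .706987
       _cons |    .716868   .6106878    -0.39   0.699     .1261491    4.073749
------------------------------------------------------------------------------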
The odds ratio for the female predictor is 0.63 (rounded to two decimal places) and is significantly
less than 1. This finding implies that females have a lower incidence of high blood pressure than do
males.
Here we use the subpop() option to model the incidence of high blood pressure in the subpopulation
identified by the female variable.
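. svy, subpop(female): logistic highbp height weight age

Number of strata   =        31        Number of obs      =      10351
Number of PSUs     =        62        Population size    =  117157513
                                      Subpop. no. of obs =       5436
                                      Subpop. size       =   60998033
                                      Design df          =         31
                                      F(   3,     29)    =     227.53
                                      Prob > F           =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9630557   .0074892    -4.84   0.000     .9479018    .9784518
      weight |   1.053197    .003579    15.25   0.000     1.045923    1.060522
         age |   1.066112   .0034457    19.81   0.000     1.059107    1.073163
       _cons |   .3372393   .4045108    -0.91   0.372      .029208    3.893807
------------------------------------------------------------------------------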
Because the odds ratio for the age predictor is significantly greater than 1, we can conclude that
older females are more likely to have high blood pressure than are younger females.
Health surveys
There are many sources of bias when modeling the association between a disease and its risk
factors (Korn, Graubard, and Midthune 1997; Korn and Graubard 1999, sec. 3.7). In cross-sectional
health surveys, inference is typically restricted to the target population as it stood when the data were
collected. This type of survey cannot capture the fact that participants may change their habits over
time. Some health surveys collect data retrospectively, relying on the participants to recall the status
of risk factors as they stood in the past. This type of survey is vulnerable to recall bias.
Longitudinal surveys collect data over time, monitoring the survey participants over several years.
Although the above biases are minimized, analysts are still faced with some tough choices
when modeling time-to-event data. For example:
Time scale. When studying cancer, should we measure the time scale by using the participant's
age or the initial date from which data were collected?
Time-varying covariates. Were all relevant risk factors sampled over time, or do we have only
the baseline measurement?
Competing risks. When studying mortality, do we have the data specific to cause of death?
Binder (1983) provides the foundation for fitting most of the common parametric models by using
survey data. Similarly, Lin and Wei (1989) provide the foundational theory for robust inference by
using the proportional hazards model. Binder (1992) describes how to estimate standard errors for
the proportional hazards model from survey data, and Lin (2000) provides a rigorous justification for
Binder's method. Korn and Graubard (1999) discuss many aspects of model fitting by using data from
health surveys. O'Donnell et al. (2008, chap. 10) use Stata survey commands to perform multivariate
analysis using health survey data.
Example 3: Cox's proportional hazards model

Suppose that we want to model the incidence of lung cancer by using three risk factors: smoking
status, sex, and place of residence. Our dataset comes from a longitudinal health survey: the First
National Health and Nutrition Examination Survey (NHANES I) (Miller 1973; Engel et al. 1978) and its
1992 Epidemiologic Follow-up Study (NHEFS) (Cox et al. 1997); see the National Center for Health
Statistics website at http://www.cdc.gov/nchs/. We will be using data from the samples identified by
NHANES I examination locations 1–65 and 66–100; thus we will svyset the revised pseudo-PSU and
strata variables associated with these locations. Similarly, our pweight variable was generated using
the sampling weights for the nutrition and detailed samples for locations 1–65 and the weights for
the detailed sample for locations 66–100.
. use http://www.stata-press.com/data/r13/nhefs
. svyset psu2 [pw=swgt2], strata(strata2)
pweight: swgt2
VCE: linearized
Strata 1: strata2
SU 1: psu2
FPC 1: <zero>
The lung cancer information was taken from the 1992 NHEFS interview data. We use the participants'
ages for the time scale. Participants who never had lung cancer and were alive for the 1992 interview
were considered censored. Participants who never had lung cancer and died before the 1992 interview
were also considered censored at their age of death.
. stset age_lung_cancer [pw=swgt2], fail(lung_cancer)

     failure event:  lung_cancer != 0 & lung_cancer < .
obs. time interval:  (0, age_lung_cancer]
 exit on or before:  failure
            weight:  [pweight=swgt2]

------------------------------------------------------------------------------
    14407  total observations
     5126  event time missing (age_lung_cancer>=.)             PROBABLE ERROR
------------------------------------------------------------------------------
     9281  observations remaining, representing
       83  failures in single-record/single-failure data
   599691  total analysis time at risk and under observation
                                              at risk from t =         0
                                   earliest observed entry t =         0
                                        last observed exit t =        97
Although stset warns us that it is a probable error to have 5,126 observations with missing event
times, we can verify from the 1992 NHEFS documentation that there were indeed 9,281 participants
with complete information.
For our proportional hazards model, we pulled the risk factor information from the NHANES I and
1992 NHEFS datasets. Smoking status was taken from the 1992 NHEFS interview data, but we filled
in all but 132 missing values by using the general medical history supplement data in NHANES I.
Smoking status is represented by separate indicator variables for former smokers and current smokers;
the base comparison group is nonsmokers. Sex was determined using the 1992 NHEFS vitality data
and is represented by an indicator variable for males. Place-of-residence information was taken from
the medical history questionnaire in NHANES I and is represented by separate indicator variables for
rural and heavily populated (more than 1 million people) urban residences; the base comparison group
is urban residences with populations of fewer than 1 million people.
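. svy: stcox former_smoker smoker male urban1 rural

Number of strata   =        35        Number of obs     =       9149
Number of PSUs     =       105        Population size   =  151327827
                                      Design df         =         70
                                      F(   5,     66)   =      14.07
                                      Prob > F          =     0.0000

-------------------------------------------------------------------------------
              |             Linearized
           _t | Haz. Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
former_smoker |   2.788113   .6205102     4.61   0.000     1.788705    4.345923
       smoker |   7.849483   2.593249     6.24   0.000     4.061457    15.17051
         male |   1.187611   .3445315     0.59   0.555     .6658757    2.118142
       urban1 |   .8035074   .3285144    -0.54   0.594     .3555123    1.816039
        rural |   1.581674   .5281859     1.37   0.174     .8125799    3.078702
-------------------------------------------------------------------------------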
From the above results, we can see that both former and current smokers have a significantly
higher risk for developing lung cancer than do nonsmokers.
Technical note
In the previous example, we specified a sampling weight variable in the calls to both svyset
and stset. When the svy prefix is used with stcox and streg, it identifies the sampling weight
variable by using the data characteristics from both svyset and stset. svy will report an error if
the svyset pweight variable is different from the stset pweight variable. The svy prefix will use
the specified pweight variable, even if it is svyset but not stset. If a pweight variable is stset
but not svyset, svy will note that it will be using the stset pweight variable and then svyset it.
The standard st commands will not use the svyset pweight variable if it is not also stset.
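The weight-handling rules in this note can be sketched with the nhefs data from the previous example. This is only an illustration of the case in which the weight is svyset but not stset; the behavior described in the comments restates the note rather than adding anything new.

```stata
use http://www.stata-press.com/data/r13/nhefs, clear

* Declare the design, including the sampling weight
svyset psu2 [pw=swgt2], strata(strata2)

* Declare the survival-time data without repeating the weight
stset age_lung_cancer, fail(lung_cancer)

* Per the note, svy uses the svyset pweight variable here
svy: stcox former_smoker smoker male urban1 rural

* Per the note, a standard st command does not use swgt2 in this setup
stcox former_smoker smoker male urban1 rural
```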
Example 4: Multiple baseline hazards

We can assess the proportional-hazards assumption across the observed race categories for the
model fit in the previous example. The race information in our 1992 NHEFS dataset is contained in
the revised_race variable. We will use stphplot to produce a log-log plot for each category of
revised_race. As described in [ST] stcox PH-assumption tests, if the plotted lines are reasonably
parallel, the proportional-hazards assumption has not been violated. We will use the zero option to
reset the risk factors to their base comparison group.
. stphplot, strata(revised_race) adjust(former_smoker smoker male urban1 rural)
> zero legend(col(1))
         failure _d:  lung_cancer
   analysis time _t:  age_lung_cancer
             weight:  [pweight=swgt2]

(Figure: -ln[-ln(Survival Probability)] plotted against ln(analysis time), with one line for each
category of revised_race: Aleut, Eskimo or American Indian; Asian/Pacific Islander; Black; White;
Other.)
As we can see from the graph produced above, the lines for the black and white race categories
intersect. This indicates a violation of the proportional-hazards assumption, so we should consider
using separate baseline hazard functions for each race category in our model fit. We do this next, by
specifying strata(revised_race) in our call to svy: stcox.
. svy: stcox former_smoker smoker male urban1 rural, strata(revised_race)

Number of strata   =        35        Number of obs     =       9149
Number of PSUs     =       105        Population size   =  151327827
                                      Design df         =         70
                                      F(   5,     66)   =      13.95
                                      Prob > F          =     0.0000

-------------------------------------------------------------------------------
              |             Linearized
           _t | Haz. Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
former_smoker |   2.801797   .6280352     4.60   0.000     1.791761    4.381201
       smoker |   7.954921   2.640022     6.25   0.000     4.103709    15.42038
         male |   1.165724   .3390339     0.53   0.600     .6526527    2.082139
       urban1 |    .784031   .3120525    -0.61   0.543     .3544764     1.73412
        rural |   1.490269   .5048569     1.18   0.243     .7582848    2.928851
-------------------------------------------------------------------------------
                                                   Stratified by revised_race
References
Binder, D. A. 1983. On the variances of asymptotically normal estimators from complex surveys. International
Statistical Review 51: 279–292.
Binder, D. A. 1992. Fitting Cox's proportional hazards models for survey data. Biometrika 79: 139–147.
Cox, C. S., M. E. Mussolino, S. T. Rothwell, M. A. Lane, C. D. Golden, J. H. Madans, and J. J. Feldman. 1997.
Plan and operation of the NHANES I Epidemiologic Followup Study, 1992. In Vital and Health Statistics, series 1,
no. 35. Hyattsville, MD: National Center for Health Statistics.
Engel, A., R. S. Murphy, K. Maurer, and E. Collins. 1978. Plan and operation of the HANES I augmentation survey
of adults 25–74 years: United States 1974–75. In Vital and Health Statistics, series 1, no. 14. Hyattsville, MD:
National Center for Health Statistics.
Korn, E. L., B. I. Graubard, and D. Midthune. 1997. Time-to-event analysis of longitudinal follow-up of a survey:
Choice of time-scale. American Journal of Epidemiology 145: 72–80.
Lin, D. Y. 2000. On fitting Cox's proportional hazards models to survey data. Biometrika 87: 37–47.
Lin, D. Y., and L. J. Wei. 1989. The robust inference for the Cox proportional hazards model. Journal of the American
Miller, H. W. 1973. Plan and operation of the Health and Nutrition Examination Survey: United States 1971–1973.
Hyattsville, MD: National Center for Health Statistics.
O'Donnell, O., E. van Doorslaer, A. Wagstaff, and M. Lindelow. 2008. Analyzing Health Equity Using Household
Survey Data: A Guide to Techniques and Their Implementation. Washington, DC: The World Bank.
Also see
[SVY] estat Postestimation statistics for survey data
[SVY] direct standardization Direct standardization of means, proportions, and ratios

Title
svy jackknife Jackknife estimation for survey data
Syntax
Options
Menu
References
Description
Stored results
Also see
Syntax
    svy jackknife exp_list [, svy_options jackknife_options eform_option] : command

    svy_options                 Description
    -------------------------------------------------------------------------
    if/in
      subpop([varname] [if])
    Reporting
      level(#)
      noheader
      nolegend
      noadjust
      nocnsreport
      display_options
      coeflegend
    -------------------------------------------------------------------------

    jackknife_options           Description
    -------------------------------------------------------------------------
    Main
      eclass                    number of observations is in e(N)
      rclass                    number of observations is in r(N)
      n(exp)                    specify exp that evaluates to number of observations used
    Options
      keep                      keep pseudovalues
      mse
    Reporting
      verbose
      nodots
      noisily
      trace                     trace command
      title(text)               use text as title for jackknife results
    Advanced
      nodrop
      reject(exp)
      dof(#)
    -------------------------------------------------------------------------

    exp_list contains           (name: elist)
                                elist
                                eexp
    elist contains              newvarname = (exp)
                                (exp)
    eexp is                     specname
                                [eqno]specname
    specname is                 _b
                                _b[]
                                _se
                                _se[]
    eqno is                     ##
                                name
Menu
Statistics > Survey data analysis > Resampling > Jackknife estimation
Description
svy jackknife performs jackknife estimation for complex survey data. Typing
. svy jackknife exp_list: command
executes command once for each primary sampling unit (PSU) in the dataset, leaving the associated
PSU out of the calculations that make up exp_list.
Commands and programs can be used with svy jackknife as long as they follow standard Stata syntax, allow the
if qualifier, and allow pweights and iweights; see [U] 11 Language syntax. The by prefix may
not be part of command.
exp_list is required unless command has the svyj program property, in which case exp_list defaults to _b; see [P] program
properties.
Options
Main
eclass, rclass, and n(exp) specify where command stores the number of observations on which
it based the calculated results. We strongly advise you to specify one of these options.
eclass specifies that command store the number of observations in e(N).
rclass specifies that command store the number of observations in r(N).
n(exp) allows you to specify an expression that evaluates to the number of observations used.
Specifying n(r(N)) is equivalent to specifying the rclass option. Specifying n(e(N)) is equivalent to specifying the eclass option. If command stores the number of observations in r(N1),
specify n(r(N1)).
If you specify none of these options, svy jackknife will assume eclass or rclass depending
upon which of e(N) and r(N) is not missing (in that order). If both e(N) and r(N) are missing,
svy jackknife assumes that all observations in the dataset contribute to the calculated result. If
that assumption is incorrect, then the reported standard errors will be incorrect. For instance, say
that you specify
. svy jackknife coef=_b[x2]: myreg y x1 x2 x3
where myreg uses e(n) instead of e(N) to identify the number of observations used in calculations.
Further assume that observation 42 in the dataset has x3 equal to missing. The 42nd observation
plays no role in obtaining the estimates, but svy jackknife has no way of knowing that and will
use the wrong N. If, on the other hand, you specify
. svy jackknife coef=_b[x2], n(e(n)): myreg y x1 x2 x3
then svy jackknife will notice that observation 42 plays no role. The n(e(n)) option is
specified because myreg is an estimation command, but it stores the number of observations
used in e(n) (instead of the standard e(N)). When svy jackknife runs the regression omitting
the 42nd observation, svy jackknife will observe that e(n) has the same value as when svy
jackknife previously ran the regression by using all the observations. Thus svy jackknife will
know that myreg did not use the observation.
Options

See [P] postfile.
keep specifies that new variables be added to the dataset containing the pseudovalues of the requested
statistics. For instance, if you typed
. svy jackknife coef=_b[x2], eclass keep: regress y x1 x2 x3
then the new variable coef would be added to the dataset containing the pseudovalues for _b[x2].
Let b be defined as the value of _b[x2] when all observations are used to fit the model, and let
b(j) be the value when the jth observation is omitted. The pseudovalues are defined as
$$\text{pseudovalue}_j = N\{b - b(j)\} + b(j)$$
where N is the number of observations used to produce b.
keep implies the nodrop option.
mse specifies that svy jackknife compute the variance by using deviations of the replicates from the
observed value of the statistics based on the entire dataset. By default, svy jackknife computes
the variance by using deviations of the pseudovalues from their mean.
Reporting

nodots suppresses display of the replication dots. By default, one dot character is displayed for each
successful replication. A red 'x' is printed if command returns with an error, 'e' is printed if one
of the values in exp_list is missing, 'n' is printed if the sample size is not correct, and a yellow
's' is printed if the dropped sampling unit is outside the subpopulation sample.
title(text) specifies a title to be displayed above the table of jackknife results; the default title is
Jackknife results.
Advanced

The jackknife is
an alternative, first-order unbiased estimator for a statistic;

a data-dependent way to calculate the standard error of the statistic and to obtain significance
levels and confidence intervals; and
a way of producing measures called pseudovalues for each observation, reflecting the observation's
influence on the overall statistic.
The idea behind the simplest form of the jackknife, the one implemented in [R] jackknife, is to
repeatedly calculate the statistic in question, each time omitting just one of the dataset's observations.
Assume that our statistic of interest is the sample mean. Let $y_j$ be the jth observation of our data
on some measurement y, where $j = 1, \ldots, N$ and N is the sample size. If $\bar{y}$ is the sample mean of
y using the entire dataset and $\bar{y}_{(j)}$ is the mean when the jth observation is omitted, then
$$\bar{y} = \frac{(N-1)\,\bar{y}_{(j)} + y_j}{N}$$
Solving for $y_j$, we obtain
$$y_j = N\,\bar{y} - (N-1)\,\bar{y}_{(j)}$$
These are the pseudovalues that svy jackknife calculates. To move this discussion beyond the
sample mean, let $b$ be the value of our statistic (not necessarily the sample mean) using the entire
dataset, and let $b_{(j)}$ be the computed value of our statistic with the jth observation omitted. The
pseudovalue for the jth observation is
$$b_j = N\,b - (N-1)\,b_{(j)}$$
The mean of the pseudovalues is the alternative, first-order unbiased estimator mentioned above,
and the standard error of the mean of the pseudovalues is an estimator for the standard error of b
(Tukey 1958, Shao and Tu 1995).
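The pseudovalue algebra for the sample mean can be checked by hand on a toy dataset. The sketch below uses three made-up values and illustrative variable and scalar names (loo_mean, pseudo, n_obs, mean_all); it covers only the simple delete-one-observation jackknife, not the survey version with PSUs and adjusted weights.

```stata
clear
input double y
2
4
9
end

* Full-sample size and mean
quietly summarize y
scalar n_obs    = r(N)
scalar mean_all = r(mean)

* Leave-one-out mean: the mean of y with observation j omitted
generate double loo_mean = (scalar(n_obs)*scalar(mean_all) - y)/(scalar(n_obs) - 1)

* Pseudovalue: N*ybar - (N-1)*ybar(j); for the mean this recovers y itself
generate double pseudo = scalar(n_obs)*scalar(mean_all) - (scalar(n_obs) - 1)*loo_mean
list y loo_mean pseudo

* Mean of the pseudovalues and the standard error of that mean
quietly summarize pseudo
display "jackknife estimate:  " r(mean)
display "jackknife std. err.: " r(sd)/sqrt(r(N))
```

For the mean, the pseudovalues coincide with the original observations, so the jackknife estimate and its standard error reduce to the ordinary sample mean and its usual standard error.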
When the jackknife is applied to survey data, PSUs are omitted instead of observations, N is the
number of PSUs instead of the sample size, and the sampling weights are adjusted owing to omitting
PSUs; see [SVY] variance estimation for more details.
Because of privacy concerns, many public survey datasets contain jackknife replication-weight
variables instead of variables containing information on the PSUs and strata. These replication-weight
variables are the adjusted sampling weights, and there is one replication-weight variable for each
omitted PSU.
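For a dataset of that kind, the design might be declared along the following lines. This is a sketch under stated assumptions: the sampling weight is assumed to be named finalwgt and the replicate weights jkw_1, jkw_2, and so on, and no stratum or multiplier information for the replicate weights is supplied.

```stata
* Assumed layout: sampling weight finalwgt plus replicate weights jkw_1, jkw_2, ...
svyset [pweight=finalwgt], jkrweight(jkw_*) vce(jackknife)

* Later svy commands then default to jackknife variance estimation
svy: regress weight height
```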
Example 1: Jackknife with information on PSUs and strata

Suppose that we were interested in a measure of association between the weight and height of
individuals in the population represented by the NHANES II data (McDowell et al. 1981). To measure
the association, we will use the slope estimate from a linear regression of weight on height. We
also use svy jackknife to estimate the variance of the slope.
. svyset
pweight: finalwgt
VCE: linearized
Strata 1: strata
SU 1: psu
FPC 1: <zero>
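. svy jackknife slope = _b[height]: regress weight height

Jackknife replications (62)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
............

Linear regression

Number of strata   =        31        Number of obs     =      10351
Number of PSUs     =        62        Population size   =  117157513
                                      Replications      =         62
                                      Design df         =         31

      command:  regress weight height
        slope:  _b[height]
          n():  e(N)

------------------------------------------------------------------------------
             |              Jackknife
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       slope |   .8014753   .0160281    50.00   0.000     .7687858    .8341648
------------------------------------------------------------------------------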
Example 2: Jackknife replicate-weight variables

nhanes2jknife.dta is a privacy-conscious dataset equivalent to nhanes2.dta; all the variables
and values remain, except that strata and psu are replaced with jackknife replicate-weight variables.
The replicate-weight variables are already svyset, and the default method for variance estimation is
vce(jackknife).

. use http://www.stata-press.com/data/r13/nhanes2jknife
. svyset
      pweight: finalwgt
          VCE: jackknife
          MSE: off
    jkrweight: jkw_1 jkw_2 jkw_3 jkw_4 jkw_5 jkw_6 jkw_7 jkw_8 jkw_9 jkw_10
               jkw_11 jkw_12 jkw_13 jkw_14 jkw_15 jkw_16 jkw_17 jkw_18 jkw_19
               ... jkw_62
     Strata 1: <one>
        FPC 1: <zero>
Here we perform the same analysis as in the previous example, using jackknife replication weights.
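. svy jackknife slope = _b[height], nodots: regress weight height

Linear regression

Number of strata   =        31        Number of obs     =      10351
                                      Population size   =  117157513
                                      Replications      =         62
                                      Design df         =         31

      command:  regress weight height
        slope:  _b[height]

------------------------------------------------------------------------------
             |              Jackknife
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       slope |   .8014753   .0160281    50.00   0.000     .7687858    .8341648
------------------------------------------------------------------------------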
The mse option causes svy jackknife to use the MSE form of the jackknife variance estimator.
This variance estimator will tend to be larger than the previous because of the addition of the familiar
squared bias term in the MSE; see [SVY] variance estimation for more details. The header for the
column of standard errors in the table of results is Jknife * for the jackknife variance estimator,
which uses the MSE formula.
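. svy jackknife slope = _b[height], mse nodots: regress weight height

Linear regression

Number of strata   =        31        Number of obs     =      10351
                                      Population size   =  117157513
                                      Replications      =         62
                                      Design df         =         31

      command:  regress weight height
        slope:  _b[height]

------------------------------------------------------------------------------
             |              Jknife *
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       slope |   .8014753   .0160284    50.00   0.000     .7687852    .8341654
------------------------------------------------------------------------------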
Stored results
In addition to the results documented in [SVY] svy, svy jackknife stores the following in e():
Scalars
    e(N_reps)        number of replications
    e(N_misreps)     number of replications with missing values
    e(k_exp)         number of standard expressions
    e(k_eexp)        number of _b/_se expressions
    e(k_extra)       number of extra estimates added to _b
Macros
    e(cmdname)
    e(cmd)           same as e(cmdname) or jackknife
    e(vce)           jackknife
    e(exp#)          #th expression
    e(jkrweight)
Matrices
    e(b_jk)          jackknife means
    e(V)             jackknife variance estimates
When exp_list is _b, svy jackknife will also carry forward most of the results already in e() from
command.

See [SVY] variance estimation for details regarding jackknife variance estimation.
References
Tukey, J. W. 1958. Bias and confidence in not-quite large samples. Abstract in Annals of Mathematical Statistics 29:
614.
Also see
[R] jackknife Jackknife estimation

Title
svy postestimation Postestimation tools for svy
Description
Syntax for predict
References
Also see
Description
The following postestimation commands are available after svy:
command        Description
------------------------------------------------------------------------------------------
contrast       contrasts and ANOVA-style joint tests of estimates
estat (svy)    postestimation statistics for survey data
estimates      cataloging estimation results
lincom         point estimates, standard errors, testing, and inference for linear
               combinations of coefficients
margins        marginal means, predictive margins, marginal effects, and average marginal
               effects
marginsplot    graph the results from margins (profile plots, interaction plots, etc.)
nlcom          point estimates, standard errors, testing, and inference for nonlinear
               combinations of coefficients
predict        predictions, residuals, influence statistics, and other diagnostic measures
predictnl      point estimates, standard errors, testing, and inference for generalized
               predictions
pwcompare      pairwise comparisons of estimates
suest          seemingly unrelated estimation
test           Wald tests of simple and composite linear hypotheses
testnl         Wald tests of nonlinear hypotheses
------------------------------------------------------------------------------------------
See [SVY] estat.
Syntax for predict

The syntax of predict (and even whether predict is allowed) after svy depends on the command used
with svy. Specifically, predict is not allowed after svy: mean, svy: proportion, svy: ratio,
svy: tabulate, or svy: total.

What follows are some examples of applications of postestimation commands using survey data.
The examples are meant only to introduce the commands in a survey context and explore a few of
the possibilities for postestimation analysis. See the individual entries for each command in the Base
Reference Manual for complete syntax and many more examples.
Example 1: Linear and nonlinear combinations

lincom will display an estimate of a linear combination of parameters, along with its standard
error, a confidence interval, and a test that the linear combination is zero. nlcom will do likewise for
nonlinear combinations of parameters.
lincom is commonly used to compute the differences of two subpopulation means. For example,
suppose that we wish to estimate the difference of zinc levels in white males versus black males
in the population represented by the NHANES II data (McDowell et al. 1981). Because the survey
design characteristics are already svyset in nhanes2.dta, we only need to generate a variable for
identifying the male subpopulation before using svy: mean.
. generate male = (sex == 1)
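. svy, subpop(male): mean zinc, over(race)

Number of strata =      31        Number of obs     =       9811
Number of PSUs   =      62        Population size   =  111127314
                                  Subpop. no. obs   =       4375
                                  Subpop. size      =   50129281
                                  Design df         =         31

        White: race = White
        Black: race = Black
        Other: race = Other

--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
zinc         |
       White |   91.15725    .541625       90.0526     92.2619
       Black |     88.269   1.208336      85.80458    90.73342
       Other |   85.54716   2.608974      80.22612     90.8682
--------------------------------------------------------------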
Then we run lincom to estimate the difference of zinc levels between the two subpopulations.
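. lincom [zinc]White - [zinc]Black
 ( 1)  [zinc]White - [zinc]Black = 0

------------------------------------------------------------------------------
        Mean |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   2.888249   1.103999     2.62   0.014     .6366288    5.139868
------------------------------------------------------------------------------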
The t statistic and its p-value give a survey analysis equivalent of a two-sample t test.
lincom and nlcom can be used after any of the estimation commands described in [SVY] svy
estimation. lincom can, for example, display results as odds ratios after svy: logit and can be
used to compute odds ratios for one covariate group relative to another. nlcom can display odds
ratios, as well, and allows more general nonlinear combinations of the parameters. See [R] lincom
and [R] nlcom for full details. Also see Eltinge and Sribney (1996) for an earlier implementation of
lincom for survey data.
Finally, lincom and nlcom operate on the estimated parameters only. To obtain estimates and
inference for functions of the parameters and of the data, such as for an exponentiated linear predictor
or a predicted probability of success from a logit model, use predictnl; see [R] predictnl.
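The odds-ratio uses of lincom and nlcom mentioned above might look like the following sketch. It assumes the nhanes2d data with its svyset design and reuses the highbp model from [SVY] svy estimation; the label or_female and the 10-year age contrast are chosen here purely for illustration.

```stata
use http://www.stata-press.com/data/r13/nhanes2d, clear
svy: logit highbp height weight age female

* Coefficient on female, reported as an odds ratio
lincom female, or

* The same odds ratio written as a nonlinear function of the coefficient
nlcom (or_female: exp(_b[female]))

* Odds ratio for a 10-year difference in age
lincom 10*age, or
```

lincom exponentiates the estimate and its confidence limits, whereas nlcom applies the delta method to exp(_b[female]), so the two intervals need not coincide.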
Example 2: Quadratic terms

From example 2 in [SVY] svy estimation, we modeled the incidence of high blood pressure as a
function of height, weight, age, and sex (using the female indicator variable). Here we also include
c.age#c.age, a squared term for age.
. use http://www.stata-press.com/data/r13/nhanes2d, clear
. svy: logistic highbp height weight age c.age#c.age female

Number of strata   =        31        Number of obs     =      10351
Number of PSUs     =        62        Population size   =  117157513
                                      Design df         =         31
                                      F(   5,     27)   =     284.33
                                      Prob > F          =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9656421    .005163    -6.54   0.000     .9551693    .9762298
      weight |   1.052911   .0026433    20.54   0.000     1.047534    1.058316
         age |   1.055829   .0133575     4.29   0.000     1.028935    1.083426
             |
 c.age#c.age |   .9999399   .0001379    -0.44   0.666     .9996588    1.000221
             |
      female |   .6260988   .0364945    -8.03   0.000     .5559217    .7051347
       _cons |    .653647   .5744564    -0.48   0.632      .108869    3.924483
------------------------------------------------------------------------------
Because our model includes a quadratic in the age variable, the peak incidence of high blood
pressure with respect to age will occur at -_b[age]/(2*_b[c.age#c.age]), which we can estimate,
along with its standard error, using nlcom.
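. nlcom peak: -_b[age]/(2*_b[c.age#c.age])

        peak:  -_b[age]/(2*_b[c.age#c.age])

------------------------------------------------------------------------------
      highbp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        peak |   452.0979    933.512     0.48   0.628    -1377.552    2281.748
------------------------------------------------------------------------------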
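The expression passed to nlcom comes from setting the derivative of the age terms in the linear predictor to zero. In the derivation below, $\beta_{\text{age}}$ and $\beta_{\text{age}^2}$ stand for _b[age] and _b[c.age#c.age]; this notation is introduced here only for illustration.

```latex
\begin{aligned}
\eta(\text{age}) &= \beta_{\text{age}}\,\text{age}
                  + \beta_{\text{age}^2}\,\text{age}^2
                  + (\text{terms not involving age}) \\[4pt]
\frac{\partial \eta}{\partial\,\text{age}}
                 &= \beta_{\text{age}} + 2\,\beta_{\text{age}^2}\,\text{age} = 0
\quad\Longrightarrow\quad
\text{age}^{*} = -\frac{\beta_{\text{age}}}{2\,\beta_{\text{age}^2}}
\end{aligned}
```

The stationary point is a maximum only when $\beta_{\text{age}^2}$ is negative. The coefficients are on the log-odds scale, so they are the natural logarithms of the reported odds ratios: $\ln(1.055829) \approx 0.0543$ and $\ln(0.9999399) \approx -0.0000601$, which gives a value of roughly 452, in line with the nlcom estimate.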
Or we can use testnl to test that the peak incidence of high blood pressure in the population is
70 years.
. testnl -_b[age]/(2*_b[c.age#c.age]) = 70

  (1)  -_b[age]/(2*_b[c.age#c.age]) = 70

               F(1, 31) =        0.17
               Prob > F =        0.6851
These data do not reject our theory. testnl allows multiple hypotheses to be tested jointly and
applies the degrees-of-freedom adjustment for survey results; see [R] testnl.
Example 3: Predictive margins

Changing our logistic regression for high blood pressure slightly, we add a factor variable for the
levels of race. Level 1 of race represents whites, level 2 represents blacks, and level 3 represents
others. We also specify that female is a factor variable, which does not change its coefficient but
does increase its functionality with some postestimation commands.
. svy: logistic highbp height weight age c.age#c.age i.female i.race, baselevels

Number of strata   =        31        Number of obs     =      10351
Number of PSUs     =        62        Population size   =  117157513
                                      Design df         =         31
                                      F(   7,     25)   =     230.16
                                      Prob > F          =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      highbp | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .9675961   .0052361    -6.09   0.000     .9569758    .9783343
      weight |   1.052683   .0026091    20.72   0.000     1.047376    1.058018
         age |   1.056628   .0134451     4.33   0.000     1.029559    1.084408
             |
 c.age#c.age |   .9999402   .0001382    -0.43   0.668     .9996585    1.000222
             |
      female |
          0  |          1  (base)
          1  |   .6382331   .0377648    -7.59   0.000     .5656774     .720095
             |
        race |
      White  |          1  (base)
      Black  |   1.422003   .1556023     3.22   0.003     1.137569    1.777557
      Other  |    1.63456   .2929919     2.74   0.010      1.13405    2.355971
             |
       _cons |   .4312846    .378572    -0.96   0.345     .0719901    2.583777
------------------------------------------------------------------------------
Our point estimates indicate that the odds of females having high blood pressure is about 64% of
the odds for men and that the odds of blacks having high blood pressure is about 1.4 times that of
whites. The odds ratios give us the relative effects of their covariates, but they do not give us any
sense of the absolute size of the effects. The odds ratio comparing blacks with whites is clearly large
and statistically significant, but does it represent a sizable change? One way to answer that question
is to explore the probabilities of high blood pressure from our fitted model. Let's first look at the
predictive margins of the probability of high blood pressure for the three levels of race.
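. margins race, vce(unconditional)

Predictive margins                                Number of obs   =      10351

Expression   : Pr(highbp), predict()

------------------------------------------------------------------------------
             |             Linearized
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      White  |   .3600722   .0150121    23.99   0.000     .3294548    .3906895
      Black  |   .4256413   .0211311    20.14   0.000     .3825441    .4687385
      Other  |   .4523404   .0311137    14.54   0.000     .3888836    .5157972
------------------------------------------------------------------------------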
Because our response is a probability, these margins are sometimes called predicted marginal
proportions or model-adjusted risks. They let us compare the effect of our three racial groups while
controlling for the distribution of other covariates in the groups. Computationally, these predictive
margins are the weighted average of the predicted probabilities for each observation in the estimation
sample. The marginal probability for whites is the average probability, assuming that everyone in the
sample is white; the margin for blacks assumes that everyone is black; and the margin for others
assumes that everyone is something other than black or white.
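The computation described above can be sketched by hand for one level of race. The following assumes that the svy: logistic model with i.female and i.race has just been fit on the nhanes2d data, that finalwgt is the svyset sampling weight, and that level 2 of race is Black; p_black is a name chosen for illustration. It gives only the point estimate, because a valid standard error requires margins with vce(unconditional).

```stata
* Assumes the svy: logistic model with i.female and i.race was just fit
preserve

* Counterfactual: treat every observation as level 2 of race (Black)
replace race = 2

* Predicted probability of high blood pressure under that counterfactual
predict double p_black if e(sample), pr

* Weighted average over the estimation sample (point estimate only)
summarize p_black [aweight=finalwgt]
display "margin for Black: " r(mean)

restore
```

Under these assumptions, the displayed value should be close to the margin that margins reports for blacks.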
There is a sizable difference in blood pressure between whites and blacks, with the marginal
probability of high blood pressure for whites being about 36% and that for blacks being just over
42%. These are the adjusted probability levels. A more direct answer to our question about whether
the odds ratios represent a substantial effect requires looking at the differences of these marginal
probabilities. Researchers in the health-related sciences call such differences risk differences, whereas
researchers in the social sciences usually call them average marginal effects or average partial effects.
Regardless of terminology, we are interested in the difference in the probability of blacks having
high blood pressure as compared with whites, while adjusting for all other covariates in the model.
We request risk differences by specifying the variables of interest in a dydx() option.
. margins, vce(unconditional) dydx(race)
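Average marginal effects                          Number of obs   =      10351
Expression   : Pr(highbp), predict()
dy/dx w.r.t. : 2.race 3.race

------------------------------------------------------------------------------
             |            Linearized
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      Black  |   .0655691   .0204063     3.21   0.003     .0239501    .1071881
      Other  |   .0922682   .0343809     2.68   0.012     .0221478    .1623886
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.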
Looking in the column labeled dy/dx, we see that the risk difference between blacks and whites
is about 6.6% (0.0656). That is a sizable as well as significant difference.
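Because race is a factor variable, each dy/dx value is simply the difference between two predictive margins. The short sketch below checks this with the margins reported earlier; the scalar names are made up for illustration.

```stata
* Predictive margins reported by -margins race, vce(unconditional)-
scalar m_white = .3600722
scalar m_black = .4256413
scalar m_other = .4523404

* Risk differences relative to the base level (White)
scalar rd_black = m_black - m_white
scalar rd_other = m_other - m_white
display "Black vs White: " %9.7f rd_black
display "Other vs White: " %9.7f rd_other
```

The two differences match the dy/dx column for Black and Other.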
Because they are population-weighted averages over the whole sample, these margins are estimates
of the population average risk differences. And because we specified the vce(unconditional)
option, their standard errors and confidence intervals can be used to make inferences about the
population average risk differences. See Methods and formulas in [R] margins for details.
We can also compute margins or risk differences for subpopulations. To compute risk differences
for the four subpopulations that are the regions of the United States (Northeast, Midwest, South,
and West), we add the over(region) option.
. margins, vce(unconditional) dydx(race) over(region)
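Average marginal effects                          Number of obs   =      10351
Expression   : Pr(highbp), predict()
dy/dx w.r.t. : 2.race 3.race
over         : region

------------------------------------------------------------------------------
             |            Linearized
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.race       |
      region |
         NE  |   .0662951   .0207354     3.20   0.003     .0240051    .1085852
         MW  |    .065088   .0204357     3.19   0.003     .0234091     .106767
          S  |   .0663448   .0202173     3.28   0.003     .0251112    .1075783
          W  |   .0647221   .0203523     3.18   0.003     .0232134    .1062308
-------------+----------------------------------------------------------------
3.race       |
      region |
         NE  |    .093168   .0343919     2.71   0.011     .0230253    .1633106
         MW  |    .091685    .034247     2.68   0.012     .0218379    .1615322
          S  |   .0932303   .0345933     2.70   0.011     .0226769    .1637837
          W  |   .0912062    .034322     2.66   0.012      .021206    .1612063
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.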
The differences in the covariate distributions across the regions have little effect on the risk
differences between blacks and whites, or between other races and whites.
Rather than explore the probabilities after logistic regression, we might have explored the hazards
or mean survival times after fitting a survival model. See [R] margins for many more applications of
margins.
Example 4: Predictive means with replication-based variance estimators

When performing estimations with linearized standard errors, we use the vce(unconditional)
option to compute marginal effects so that we can use the results to make inferences on the population.
margins with vce(unconditional) uses linearization to compute the unconditional variance of the
marginal means.
The vce(unconditional) option, therefore, cannot be used when a different variance estimation
method has been specified for the model. If you are using a replication-based method to estimate
the variance in your model, you may want to use this method to perform the variance estimation for
your margins as well. To do that, you can write a program that performs both your main estimation
and the computation of your margins and use the replication method with your program.
Continuing with the logistic example, we will see how to estimate the marginal means for race
by using the jackknife variance estimator. The program below accepts an argument that contains the
estimation command line. Notice that the program should accept the if qualifier and also weights.
In addition, the set buildfvinfo on command is included so that margins checks for estimability.
buildfvinfo is usually set on, but replication methods set it off because it increases the
computation time; thus you need to set it on inside the program. The post option of margins posts
the results to e(b), so they can be used by svy jackknife.
program mymargins, eclass
        version 13
        syntax anything [if] [iw pw]
        if "`weight'" != "" {
                local wgtexp "[`weight' `exp']"
        }
        set buildfvinfo on
        `anything' `if' `wgtexp'
        margins race, post
end
We can now type

. local mycmdline logistic highbp height weight age c.age#c.age i.race i.female
. quietly mymargins `mycmdline'
. svy jackknife _b: mymargins `mycmdline'
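(running mymargins on estimation sample)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
............
Predictive margins                                Number of obs   =      10351
                                                  Replications    =         62

------------------------------------------------------------------------------
             |             Jackknife
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      White  |   .3600722   .0150128    23.98   0.000     .3294534     .390691
      Black  |   .4256413   .0211504    20.12   0.000     .3825048    .4687778
      Other  |   .4523404   .0322488    14.03   0.000     .3865684    .5181124
------------------------------------------------------------------------------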
You can see that now the jackknife standard errors are being reported.
Example 5: Nonlinear predictions and their standard errors

Continuing with the NHANES II data, we fit a linear regression of log of blood lead level on age,
age-squared, gender, race, and region.
. svy: regress loglead age c.age#c.age i.female i.race i.region
Number of strata   =        31                  Number of obs     =       4948
Number of PSUs     =        62                  Population size   =   56405414
                                                Design df         =         31
                                                F(   8,     24)   =     156.24
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2379

------------------------------------------------------------------------------
             |             Linearized
     loglead |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0158388   .0027352     5.79   0.000     .0102603    .0214173
             |
 c.age#c.age |  -.0001464   .0000295    -4.96   0.000    -.0002066   -.0000862
             |
    1.female |  -.3655338   .0116157   -31.47   0.000    -.3892242   -.3418434
             |
        race |
      Black  |    .178402   .0314173     5.68   0.000      .114326     .242478
      Other  |  -.0516952   .0402381    -1.28   0.208    -.1337614     .030371
             |
      region |
         MW  |    -.02283   .0389823    -0.59   0.562    -.1023349    .0566749
          S  |  -.1685453    .056004    -3.01   0.005    -.2827662   -.0543244
          W  |  -.0362295   .0387508    -0.93   0.357    -.1152623    .0428032
             |
       _cons |   2.440671   .0627987    38.86   0.000     2.312592    2.568749
------------------------------------------------------------------------------
Given that we modeled the natural log of the lead measurement, we can use predictnl to compute
the exponentiated linear prediction (in the original units of the lead variable), along with its standard
error.
. predictnl leadhat = exp(xb()) if e(sample), se(leadhat_se)
(5403 missing values generated)
. sort lead leadhat
. gen showobs = inrange(_n,1,5) + inrange(_n,2501,2505) + inrange(_n,4945,4948)
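. list lead leadhat leadhat_se age c.age#c.age if showobs, abbrev(10)

       +-----------------------------------------------+
       |                                        c.age# |
       | lead    leadhat   leadhat_se   age      c.age |
       |-----------------------------------------------|
    1. |    2   9.419804     .5433255    29        841 |
    2. |    3   8.966098     .5301117    23        529 |
    3. |    3   9.046788     .5298448    24        576 |
    4. |    3   9.046788     .5298448    24        576 |
    5. |    3    9.27693     .5347956    27        729 |
       |-----------------------------------------------|
 2501. |   13   16.88317     .7728783    37       1369 |
 2502. |   13   16.90057     2.296082    71       5041 |
 2503. |   13   16.90057     2.296082    71       5041 |
 2504. |   13   16.90237     1.501056    48       2304 |
 2505. |   13   16.90852     2.018708    60       3600 |
       |-----------------------------------------------|
 4945. |   61   17.18581     2.052034    58       3364 |
 4946. |   64   15.08437      .647629    24        576 |
 4947. |   66   17.78698     1.641349    56       3136 |
 4948. |   80   16.85864     1.333927    42       1764 |
       +-----------------------------------------------+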
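Because the prediction is exp(xb), the delta method gives a standard error of approximately exp(xb) multiplied by the standard error of the linear prediction. The sketch below illustrates that calculation. It assumes that nhanes2d is the dataset used in this example (an assumption; the dataset name is not shown here), and the variable names xbhat, xbse, and leadhat_se2 are made up for illustration. predictnl computes its derivatives numerically, so small differences from leadhat_se are to be expected.

```stata
* Assumption: nhanes2d is the NHANES II extract used in this example
use http://www.stata-press.com/data/r13/nhanes2d, clear
svy: regress loglead age c.age#c.age i.female i.race i.region
predictnl leadhat = exp(xb()) if e(sample), se(leadhat_se)

* Linear prediction and its standard error from the fitted model
predict double xbhat if e(sample), xb
predict double xbse  if e(sample), stdp

* Delta method: se{exp(xb)} is approximately exp(xb) * se(xb)
generate double leadhat_se2 = exp(xbhat) * xbse

* Compare with the standard error computed by predictnl
summarize leadhat_se leadhat_se2
```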
Example 6: Multiple-hypothesis testing

Joint-hypothesis tests can be performed after svy commands with the test command. Using the
results from the regression model fit in the previous example, we can use test to test the joint
significance of 2.region, 3.region, and 4.region. (1.region is the Northeast, 2.region is the
Midwest, 3.region is the South, and 4.region is the West.) We test the hypothesis that 2.region
= 0, 3.region = 0, and 4.region = 0.
. test 2.region 3.region 4.region
Adjusted Wald test
 ( 1)  2.region = 0
 ( 2)  3.region = 0
 ( 3)  4.region = 0
       F(  3,    29) =    2.96
            Prob > F =    0.0486
The nosvyadjust option on test produces an unadjusted Wald test.

. test 2.region 3.region 4.region, nosvyadjust
Unadjusted Wald test
 ( 1)  2.region = 0
 ( 2)  3.region = 0
 ( 3)  4.region = 0
       F(  3,    31) =    3.17
            Prob > F =    0.0382
For one-dimensional tests, the adjusted and unadjusted F statistics are identical, but they differ for
higher-dimensional tests. Using the nosvyadjust option is not recommended because the unadjusted
F statistic can produce extremely anticonservative p-values (that is, p-values that are too small) when
the variance degrees of freedom (equal to the number of sampled PSUs minus the number of strata)
is not large relative to the dimension of the test.
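The relationship between the two statistics can be checked by hand. The sketch below assumes the usual form of the adjustment, F_adj = (d - k + 1)/(k*d) * W with W = k * F_unadj, where d is the design degrees of freedom and k is the dimension of the test; see [R] test for the authoritative definition. The scalar names are made up for illustration.

```stata
scalar d = 31            // design df = 62 PSUs - 31 strata
scalar k = 3             // number of constraints tested
scalar F_unadj = 3.17    // reported by test ..., nosvyadjust

* Unadjusted Wald statistic and its survey-adjusted F
scalar W = k * F_unadj
scalar F_adj = (d - k + 1) / (k * d) * W

display "denominator df = " (d - k + 1)
display "adjusted F     = " %5.3f F_adj
display "p-value        = " %6.4f Ftail(k, d - k + 1, F_adj)
```

The denominator degrees of freedom come out as 29, as in the adjusted test above, and the adjusted F agrees with the reported 2.96 up to rounding of the inputs. When k = 1, the factor (d - k + 1)/d equals 1, which is why the two statistics coincide for one-dimensional tests.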
Bonferroni-adjusted p-values can also be computed:

. test 2.region 3.region 4.region, mtest(bonferroni)
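Adjusted Wald test
 ( 1)  2.region = 0
 ( 2)  3.region = 0
 ( 3)  4.region = 0

---------------------------------------
       |  F(df,29)     df       p
-------+-------------------------------
  (1)  |      0.34      1     1.0000 #
  (2)  |      9.06      1     0.0155 #
  (3)  |      0.87      1     1.0000 #
-------+-------------------------------
  all  |      2.96      3     0.0486
---------------------------------------
     # Bonferroni-adjusted p-values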
See Korn and Graubard (1990) for a discussion of these three different procedures for conducting
joint-hypothesis tests. See Eltinge and Sribney (1996) for an earlier implementation of test for
survey data.
Example 7: Contrasts
After svy commands, we can estimate contrasts and make pairwise comparisons with the contrast
and pwcompare commands. First, we will fit a regression of serum zinc levels on health status:
. use http://www.stata-press.com/data/r13/nhanes2f, clear
. label list hlthgrp
hlthgrp:
           1 poor
           2 fair
           3 average
           4 good
           5 excellent
. svy: regress zinc i.health
. svy: regress zinc i.health

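Number of strata   =        31                  Number of obs     =       9188
Number of PSUs     =        62                  Population size   =  104162204
                                                Design df         =         31
                                                F(   4,     28)   =      15.61
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0098

------------------------------------------------------------------------------
             |             Linearized
        zinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      health |
       fair  |   .9272308   .7690396     1.21   0.237    -.6412357    2.495697
    average  |   2.444004   .6407097     3.81   0.001     1.137268     3.75074
       good  |   4.038285   .6830349     5.91   0.000     2.645226    5.431344
  excellent  |   4.770911   .7151641     6.67   0.000     3.312324    6.229498
             |
       _cons |   83.94729   .8523379    98.49   0.000     82.20893    85.68564
------------------------------------------------------------------------------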
Higher levels of zinc are associated with better health. We can use reverse adjacent contrasts to
compare each health status with the preceding status.
. contrast ar.health
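Contrasts of marginal linear predictions
                                                Design df         =         31
Margins      : asbalanced

---------------------------------------------------------
                      |         df           F        P>F
----------------------+----------------------------------
               health |
       (fair vs poor) |          1        1.45     0.2371
    (average vs fair) |          1        5.49     0.0257
    (good vs average) |          1       10.92     0.0024
  (excellent vs good) |          1        1.93     0.1744
                Joint |          4       15.61     0.0000
                      |
               Design |         31
---------------------------------------------------------
Note: F statistics are adjusted for the survey design.

-----------------------------------------------------------------------
                      |   Contrast   Std. Err.     [95% Conf. Interval]
----------------------+------------------------------------------------
               health |
       (fair vs poor) |   .9272308   .7690396     -.6412357    2.495697
    (average vs fair) |   1.516773   .6474771      .1962347    2.837311
    (good vs average) |   1.594281   .4824634      .6102904    2.578271
  (excellent vs good) |   .7326264   .5270869     -.3423744    1.807627
-----------------------------------------------------------------------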
The first table reports significance tests for each contrast, along with a joint test of all the contrasts.
The row labeled (fair vs poor), for example, tests the null hypothesis that the first two health
statuses have the same mean zinc level. The test statistics are automatically adjusted for the survey
design.
The second table reports estimates, standard errors, and confidence limits for each contrast. The
row labeled (good vs average), for example, shows that those in good health have a mean zinc
level about 1.6 units higher than those of average health. The standard errors and confidence intervals
also account for the survey design.
If we would like to go further and make all possible pairwise comparisons of the health groups,
we can use the pwcompare command. We will specify the mcompare(sidak) option to account for
multiple comparisons and the cformat(%3.1f) option to reduce the number of decimal places in
the output:

. pwcompare health, mcompare(sidak) cformat(%3.1f)
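Pairwise comparisons of marginal linear predictions
                                                Design df         =         31
Margins      : asbalanced

---------------------------
             |    Number of
             |  Comparisons
-------------+-------------
      health |           10
---------------------------

-------------------------------------------------------------------------
                       |                              Sidak
                       |   Contrast   Std. Err.     [95% Conf. Interval]
-----------------------+-------------------------------------------------
                health |
         fair vs poor  |        0.9        0.8          -1.4         3.2
      average vs poor  |        2.4        0.6           0.5         4.4
         good vs poor  |        4.0        0.7           2.0         6.1
    excellent vs poor  |        4.8        0.7           2.6         6.9
      average vs fair  |        1.5        0.6          -0.4         3.5
         good vs fair  |        3.1        0.5           1.5         4.8
    excellent vs fair  |        3.8        0.7           1.7         6.0
      good vs average  |        1.6        0.5           0.1         3.0
 excellent vs average  |        2.3        0.7           0.3         4.4
    excellent vs good  |        0.7        0.5          -0.9         2.3
-------------------------------------------------------------------------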
Seven of the ten Šidák intervals exclude the null value of zero. See [R] pwcompare for more
information on pairwise comparisons and multiple-comparison adjustments.
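To see what the Šidák adjustment does to the intervals, we can compute the per-comparison error rate implied by a 5% familywise rate over 10 comparisons. This is a sketch of the standard Šidák formula, 1 - (1 - alpha)^(1/m), with the Bonferroni rate alpha/m shown for comparison; the scalar names are made up for illustration.

```stata
scalar m     = 10                        // number of pairwise comparisons
scalar alpha = .05                       // familywise error rate
scalar a_sidak = 1 - (1 - alpha)^(1/m)   // Sidak per-comparison level
scalar a_bonf  = alpha / m               // Bonferroni level, for comparison

display "Sidak per-comparison level      = " %8.6f a_sidak
display "Bonferroni per-comparison level = " %8.6f a_bonf

* Two-sided critical value with 31 design degrees of freedom
display "Sidak critical value            = " %6.3f invttail(31, a_sidak/2)
```

Each interval is then the contrast plus or minus this critical value times its standard error, which is why the intervals are wider than unadjusted 95% intervals would be.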
Example 8: Using suest with survey data, the svy prefix

suest can be used to obtain the variance estimates for a series of estimators that used the svy
prefix. To use suest for this purpose, perform the following steps:
1. Be sure to set the survey design characteristics correctly by using svyset. Do not use the
vce() option to change the default variance estimator from the linearized variance estimator.
vce(brr) and vce(jackknife) are not supported by suest.
2. Fit the model or models by using the svy prefix command, optionally including subpopulation
estimation with the subpop() option.
3. Store the estimation results with estimates store name.
In the following, we illustrate how to use suest to compare the parameter estimates between two
ordered logistic regression models.
In the NHANES II dataset, we have the variable health containing self-reported health status,
which takes on the values 1–5, with 1 being poor and 5 being excellent. Because this is an
ordered categorical variable, it makes sense to model it by using svy: ologit. We use some basic
demographic variables as predictors: female (an indicator of female individuals), black (an indicator
for black individuals), age in years, and c.age#c.age (age squared).

. use http://www.stata-press.com/data/r13/nhanes2f, clear
. svyset psuid [pw=finalwgt], strata(stratid)
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
. svy: ologit health female black age c.age#c.age
(running ologit on estimation sample)
Survey: Ordered logistic regression
Number of strata   =        31                  Number of obs     =      10335
Number of PSUs     =        62                  Population size   =  116997257
                                                Design df         =         31
                                                F(   4,     28)   =     223.27
                                                Prob > F          =     0.0000

------------------------------------------------------------------------------
             |             Linearized
      health |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -.1615219   .0523678    -3.08   0.004    -.2683267    -.054717
       black |   -.986568   .0790277   -12.48   0.000    -1.147746   -.8253899
         age |  -.0119491   .0082974    -1.44   0.160    -.0288717    .0049736
             |
 c.age#c.age |  -.0003234    .000091    -3.55   0.001     -.000509   -.0001377
-------------+----------------------------------------------------------------
       /cut1 |  -4.566229   .1632561   -27.97   0.000    -4.899192   -4.233266
       /cut2 |  -3.057415   .1699944   -17.99   0.000    -3.404121   -2.710709
       /cut3 |  -1.520596   .1714342    -8.87   0.000    -1.870239   -1.170954
       /cut4 |   -.242785   .1703965    -1.42   0.164     -.590311     .104741
------------------------------------------------------------------------------
The self-reported health variable takes five categories. Categories 1 and 2 denote negative
categories, whereas categories 4 and 5 denote positive categories. We wonder whether the distinctions
between the two positive categories and between the two negative categories are produced in accordance
with one latent dimension, which is an assumption of the ordered logistic model. To test
one-dimensionality, we will collapse the five-point health measure into a three-point measure, refit the
ordered logistic model, and compare the regression coefficients and cutpoints between the two analyses.
If the single latent variable assumption is valid, the coefficients and cutpoints should match. This can
be seen as a Hausman-style specification test. Estimation of the ordered logistic model parameters
for survey data is by maximum pseudolikelihood. Neither estimator is fully efficient, and thus the
assumptions for the classic Hausman test and for the hausman command are not satisfied. With
suest, we can obtain an appropriate Hausman test for survey data.
To perform the Hausman test, we are already almost halfway there by following steps 1 and 2 for
one of the models. We just need to store the current estimation results before moving on to the next
model. Here we store the results with estimates store under the name H5, indicating that in this
analysis, the dependent variable health has five categories.
. estimates store H5
We proceed by generating a new dependent variable health3, which maps values 1 and 2 into
2, 3 into 3, and 4 and 5 into 4. This transformation is conveniently accomplished with the clip()
function. We then fit an ologit model with this new dependent variable and store the estimation
results under the name H3.
. gen health3 = clip(health, 2, 4)

. svy: ologit health3 female black age c.age#c.age
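(running ologit on estimation sample)
Survey: Ordered logistic regression
Number of strata   =        31                  Number of obs     =      10335
Number of PSUs     =        62                  Population size   =  116997257
                                                Design df         =         31
                                                F(   4,     28)   =     197.08
                                                Prob > F          =     0.0000

------------------------------------------------------------------------------
             |             Linearized
     health3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -.1551238   .0563809    -2.75   0.010    -.2701133   -.0401342
       black |  -1.046316   .0728274   -14.37   0.000    -1.194849   -.8977836
         age |  -.0365408   .0073653    -4.96   0.000    -.0515624   -.0215192
             |
 c.age#c.age |    -.00009   .0000791    -1.14   0.264    -.0002512    .0000713
-------------+----------------------------------------------------------------
       /cut1 |  -3.655498   .1610211   -22.70   0.000    -3.983903   -3.327093
       /cut2 |  -2.109584   .1597057   -13.21   0.000    -2.435306   -1.783862
------------------------------------------------------------------------------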
. estimates store H3
We can now obtain the combined estimation results of the two models stored under H5 and H3
with design-based standard errors.

. suest H5 H3
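Simultaneous survey results for H5, H3
Number of strata   =        31                  Number of obs     =      10335
Number of PSUs     =        62                  Population size   =  116997257
                                                Design df         =         31

------------------------------------------------------------------------------
             |             Linearized
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
H5_health    |
      female |  -.1615219   .0523678    -3.08   0.004    -.2683267    -.054717
       black |   -.986568   .0790277   -12.48   0.000    -1.147746   -.8253899
         age |  -.0119491   .0082974    -1.44   0.160    -.0288717    .0049736
             |
 c.age#c.age |  -.0003234    .000091    -3.55   0.001     -.000509   -.0001377
-------------+----------------------------------------------------------------
H5_cut1      |
       _cons |  -4.566229   .1632561   -27.97   0.000    -4.899192   -4.233266
-------------+----------------------------------------------------------------
H5_cut2      |
       _cons |  -3.057415   .1699944   -17.99   0.000    -3.404121   -2.710709
-------------+----------------------------------------------------------------
H5_cut3      |
       _cons |  -1.520596   .1714342    -8.87   0.000    -1.870239   -1.170954
-------------+----------------------------------------------------------------
H5_cut4      |
       _cons |   -.242785   .1703965    -1.42   0.164     -.590311     .104741
-------------+----------------------------------------------------------------
H3_health3   |
      female |  -.1551238   .0563809    -2.75   0.010    -.2701133   -.0401342
       black |  -1.046316   .0728274   -14.37   0.000    -1.194849   -.8977836
         age |  -.0365408   .0073653    -4.96   0.000    -.0515624   -.0215192
             |
 c.age#c.age |    -.00009   .0000791    -1.14   0.264    -.0002512    .0000713
-------------+----------------------------------------------------------------
H3_cut1      |
       _cons |  -3.655498   .1610211   -22.70   0.000    -3.983903   -3.327093
-------------+----------------------------------------------------------------
H3_cut2      |
       _cons |  -2.109584   .1597057   -13.21   0.000    -2.435306   -1.783862
------------------------------------------------------------------------------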
The coefficients of H3 and H5 look rather similar. We now use test to perform a formal
Hausman-type test for the hypothesis that the regression coefficients are indeed the same, as we
would expect if there is indeed a one-dimensional latent dimension for health. Thus we test that the
coefficients in the equation H5_health are equal to those in H3_health3.
. test [H5_health=H3_health3]
Adjusted Wald test
 ( 1)  [H5_health]female - [H3_health3]female = 0
 ( 2)  [H5_health]black - [H3_health3]black = 0
 ( 3)  [H5_health]age - [H3_health3]age = 0
 ( 4)  [H5_health]c.age#c.age - [H3_health3]c.age#c.age = 0
       F(  4,    28) =   17.13
            Prob > F =    0.0000
We can reject the null hypothesis, which indicates that the ordered logistic regression model is
indeed misspecified. Another specification test can be conducted with respect to the cutpoints. Variable
health3 was constructed from health by collapsing the two worst categories into value 2 and the
two best categories into value 4. This action effectively has removed two cutpoints, but if the model
fits the data, it should not affect the other two cutpoints. The comparison is hampered by a difference
in the names of the cutpoints between the models, as illustrated in the figure below:
H5   latent     --------x--------x--------x--------x--------
                      cut1     cut2     cut3     cut4
     observed      1        2        3        4        5

H3   latent     -----------------x--------x-----------------
                               cut1     cut2
     observed           2            3            4
Cutpoint /cut2 of model H5 should be compared with cutpoint /cut1 of H3, and similarly, /cut3
of H5 with /cut2 of H3.
. test ([H5_cut2]_cons=[H3_cut1]_cons) ([H5_cut3]_cons=[H3_cut2]_cons)
Adjusted Wald test
 ( 1)  [H5_cut2]_cons - [H3_cut1]_cons = 0
 ( 2)  [H5_cut3]_cons - [H3_cut2]_cons = 0
       F(  2,    30) =   33.49
            Prob > F =    0.0000
We conclude that the invariance of the cutpoints under the collapse of categories is not supported
by the data, again providing evidence against the reduced specification of the ordered logistic model
in this case.
Example 9: Using suest with survey data, the svy option

Not all estimation commands support the svy prefix, but you can use the svy option with suest
to get survey estimation results. If you can use suest after a command, you can use suest, svy.
Here are the corresponding Stata commands to perform the analysis in the previous example, using
the svy option instead of the svy prefix.
. use http://www.stata-press.com/data/r13/nhanes2f, clear
. svyset psuid [pw=finalwgt], strata(stratid)
. ologit health female black age c.age#c.age [iw=finalwgt]
. estimates store H5
. gen health3 = clip(health,2,4)
. ologit health3 female black age c.age#c.age [iw=finalwgt]
. estimates store H3
. suest H5 H3, svy
. test [H5_health=H3_health3]
. test ([H5_cut2]_cons=[H3_cut1]_cons) ([H5_cut3]_cons=[H3_cut2]_cons)
The calls to ologit now use iweights instead of the svy prefix, and the svy option was added
to suest. No other changes are required.
References
Eltinge, J. L., and W. M. Sribney. 1996. svy5: Estimates of linear combinations and hypothesis tests for survey data.
Stata Technical Bulletin 31: 31–42. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 246–259. College
Graubard, B. I., and E. L. Korn. 2004. Predictive margins with survey data. Biometrics 55: 652–659.
Also see
[SVY] estat Postestimation statistics for survey data
[U] 13.5 Accessing coefficients and standard errors
Title
svy sdr Successive difference replication for survey data
Syntax
Options
Menu
Reference
Description
Stored results
Also see
Syntax
svy sdr exp_list [, svy_options sdr_options eform_option] : command

svy_options
Description
if/in

subpop( varname
if )
Reporting
level(#)
noheader
nolegend
noadjust
nocnsreport
display options

coeflegend
Description
sdr options
Options

mse

Reporting

trace command
use text as title for SDR results
verbose
nodots
noisily
trace
title(text)
Advanced

nodrop
reject(exp)
dof(#)
svy sdr requires that the successive difference replicate weights be identified using svyset.
exp_list contains     (name: elist)
                      elist
                      eexp
elist contains        newvarname = (exp)
                      (exp)
eexp is               specname
                      [eqno]specname
specname is           _b
                      _b[]
                      _se
                      _se[]
eqno is               ##
                      name

Menu
Statistics
>
>
Resampling
>
Successive difference replications estimation
Description
svy sdr performs successive difference replication (SDR) for complex survey data. Typing
. svy sdr exp_list: command
SDR methodology.
programs can be used with svy sdr as long as they follow standard Stata syntax, allow the if
qualifier, and allow pweights and iweights; see [U] 11 Language syntax. The by prefix may not
be part of command.
properties.
Options
Options

double specifies that the results for each replication be stored as doubles, meaning 8-byte reals.
By default, they are stored as floats, meaning 4-byte reals. This option may be used without
See [P] postfile.
box.
mse specifies that svy sdr compute the variance by using deviations of the replicates from the
observed value of the statistics based on the entire dataset. By default, svy sdr computes the
variance by using deviations of the replicates from their mean.
Reporting

option.
option.
title(text) specifies a title to be displayed above the table of SDR results; the default title is SDR
results.
Advanced

SDR was first introduced by Fay and Train (1995) as a method of variance estimation for annual
demographic supplements to the Current Population Survey (CPS). In SDR, the model is fit multiple
times, once for each of a set of adjusted sampling weights. The variance is estimated using the
resulting replicated point estimates.
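As a sketch of the computation (stated here from the general description of SDR rather than from this entry; see [SVY] variance estimation for the authoritative formula), with R replicate weights and no finite population correction, the SDR variance estimator has the form

```latex
\widehat{V}(\hat\theta)
  = \frac{4}{R} \sum_{r=1}^{R} \left( \hat\theta_{(r)} - \bar\theta_{(\cdot)} \right)^{2},
\qquad
\bar\theta_{(\cdot)} = \frac{1}{R} \sum_{r=1}^{R} \hat\theta_{(r)}
```

where \hat\theta_{(r)} denotes the point estimate computed with the rth replicate weight (the notation is ours). When the mse option is specified, the full-sample estimate \hat\theta takes the place of the replicate mean \bar\theta_{(\cdot)}.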
Example 1
The U.S. Census Bureau publishes public-use data from several of its surveys. This data can
be downloaded from http://factfinder.census.gov. We downloaded the American Community Survey
(ACS) Public Use Microdata Sample (PUMS) data collected in 2007. We extracted data for the state of
Texas and kept the variables containing age, sex, and sampling weight for each person in the dataset.
This sample dataset also contains 80 SDR weight variables.
. use http://www.stata-press.com/data/r13/ss07ptx
. svyset
pweight: pwgtp
VCE: sdr
MSE: off
sdrweight: pwgtp1 pwgtp2 pwgtp3 pwgtp4 pwgtp5 pwgtp6 pwgtp7 pwgtp8 pwgtp9
pwgtp10 pwgtp11 pwgtp12 pwgtp13 pwgtp14 pwgtp15 pwgtp16
(output omitted )
pwgtp73 pwgtp74 pwgtp75 pwgtp76 pwgtp77 pwgtp78 pwgtp79
pwgtp80
Strata 1: <one>
FPC 1: <zero>
This dataset was already svyset as

. svyset [pw=pwgtp], sdrweight(pwgtp1-pwgtp80) vce(sdr)
Here we estimate the average age of the males and of the females for our Texas subpopulation.
The standard errors are estimated using SDR.

. svy: mean agep, over(sex)
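SDR replications (80)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..............................
                                  Number of obs    =     230817
                                  Population size  =   23904380
                                  Replications     =         80

         Male: sex = Male
       Female: sex = Female

--------------------------------------------------------------
             |                 SDR
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
agep         |
        Male |   33.24486   .0470986      33.15255    33.33717
      Female |   35.23908   .0386393      35.16335    35.31481
--------------------------------------------------------------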
Stored results
In addition to the results documented in [SVY] svy, svy sdr stores the following in e():
Scalars
    e(N_reps)         number of replications
    e(N_misreps)      number of replications with missing values
    e(k_exp)          number of standard expressions
    e(k_eexp)         number of _b/_se expressions
    e(k_extra)        number of extra estimates added to _b
Macros
    e(cmdname)        command name from command
    e(cmd)            same as e(cmdname) or sdr
    e(vce)            sdr
    e(exp#)           #th expression
    e(sdrweight)      sdrweight() variable list
Matrices
    e(b_sdr)          SDR means
    e(V)              SDR variance estimates

When exp_list is _b, svy sdr will also carry forward most of the results already in e() from
command.

See [SVY] variance estimation for details regarding SDR variance estimation.
Reference
Also see

Title
svy: tabulate oneway One-way tables for survey data
Syntax
Options
Menu
Reference
Description
Stored results
Also see
Syntax
Basic syntax
svy: tabulate varname
Full syntax

svy [vcetype] [, svy_options] : tabulate varname [if] [in]
        [, tabulate_options display_items display_options]

Syntax to report results

svy [, display_items display_options]
vcetype
Description
SE
linearized
bootstrap
brr
jackknife
sdr

svy options
Description
if/in

subpop( varname
if )
SE
bootstrap options
brr options
jackknife options
sdr options

tabulate_options        Description
Model
  stdize(varname)       variable identifying strata for standardization
  stdweight(varname)    weight variable for standardization
  tab(varname)          variable for which to compute cell totals/proportions
  missing               treat missing values like other values
display_items           Description
Table items
  cell                  cell proportions
  count                 weighted cell counts
  se                    standard errors
  ci                    confidence intervals
  deff                  display the DEFF design effects
  deft                  display the DEFT design effects
  cv                    display the coefficient of variation
  srssubpop             report design effects assuming SRS within subpopulation
  obs                   cell observations
When any of se, ci, deff, deft, cv, or srssubpop is specified, only one of cell or count can be specified. If
none of se, ci, deff, deft, cv, or srssubpop is specified, both cell and count can be specified.
display_options         Description
Reporting
  level(#)              set confidence level; default is level(95)
  proportion            display proportions; the default
  percent               display percentages instead of proportions
  nomarginal            suppress column marginal
  nolabel               suppress displaying value labels
  cellwidth(#)          cell width
  csepwidth(#)          column-separation width
  stubwidth(#)          stub width
  format(%fmt)          cell format; default is format(%6.0g)
proportion is not shown in the dialog box.
Menu
Statistics
>
>
Tables
>
One-way tables
Description
svy: tabulate produces one-way tabulations for complex survey data. See [SVY] svy: tabulate
twoway for two-way tabulations for complex survey data.
Options
Model
stdize(varname) specifies that the point estimates be adjusted by direct standardization across the
strata identified by varname. This option requires the stdweight() option.
stdweight(varname) specifies the weight variable associated with the standard strata identified in
the stdize() option. The standardization weights must be constant within the standard strata.
tab(varname) specifies that counts be cell totals of this variable and that proportions (or percentages)
be relative to (that is, weighted by) this variable. For example, if this variable denotes income, then
the cell counts are instead totals of income for each cell, and the cell proportions are proportions
of income for each cell.
missing specifies that missing values in varname be treated as another row category rather than be
omitted from the analysis (the default).
Table items
cell requests that cell proportions (or percentages) be displayed. This is the default if count is not
specified.
count requests that weighted cell counts be displayed.
se requests that the standard errors of cell proportions (the default) or weighted counts be displayed.
When se (or ci, deff, deft, or cv) is specified, only one of cell or count can be selected.
The standard error computed is the standard error of the one selected.
ci requests confidence intervals for cell proportions or weighted counts.

deff and deft request that the design-effect measures DEFF and DEFT be displayed for each cell
proportion or weighted count. See [SVY] estat for details.
or poststratification.
cv requests that the coefficient of variation be displayed for each cell proportion, count, or row or
column proportion. See [SVY] estat for details.
srssubpop requests that DEFF and DEFT be computed using an estimate of SRS (simple random
sampling) variance for sampling within a subpopulation. By default, DEFF and DEFT are computed
using an estimate of the SRS variance for sampling from the entire population. Typically, srssubpop
would be given when computing subpopulation estimates by strata or by groups of strata.
obs requests that the number of observations for each cell be displayed.
Reporting
proportion, the default, requests that proportions be displayed.
percent requests that percentages be displayed instead of proportions.
nomarginal requests that the column marginal not be displayed.
nolabel requests that variable labels and value labels be ignored.
cellwidth(#), csepwidth(#), and stubwidth(#) specify widths of table elements in the output;
see [P] tabdisp. Acceptable values for the stubwidth() option range from 4 to 32.
format(% fmt) specifies a format for the items in the table. The default is format(%6.0g). See
[U] 12.5 Formats: Controlling how data are displayed.
svy: tabulate uses the tabdisp command (see [P] tabdisp) to produce the table. Only five items
can be displayed in the table at one time. The ci option implies two items. If too many items are
selected, a warning will appear immediately. To view more items, redisplay the table while specifying
different options.

Despite the long list of options for svy: tabulate, it is a simple command to use. Using the
svy: tabulate command is just like using tabulate to produce one-way tables for ordinary data.
The main difference is that svy: tabulate computes standard errors appropriate for complex survey
data.
Standard errors and confidence intervals can optionally be displayed for weighted counts or cell
proportions. The confidence intervals for proportions are constructed using a logit transform so that
their endpoints always lie between 0 and 1; see [SVY] svy: tabulate twoway. Associated design
effects (DEFF and DEFT) can be viewed for the variance estimates.
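The logit-transformed interval can be sketched as follows. The proportion is the estimate for the Black category from the example below; the standard error is a hypothetical value chosen only for illustration, and the scalar names are made up. The interval is built on the logit scale, where the delta-method standard error is se/(p(1 - p)), and then transformed back, so the endpoints cannot fall outside (0, 1).

```stata
scalar p  = .0955        // estimated proportion
scalar se = .0127        // hypothetical standard error, for illustration only
scalar df = 31           // design degrees of freedom

scalar f     = logit(p)              // ln(p/(1-p))
scalar se_f  = se / (p * (1 - p))    // standard error on the logit scale
scalar tcrit = invttail(df, .025)    // two-sided 95% critical value

scalar lb = invlogit(f - tcrit * se_f)
scalar ub = invlogit(f + tcrit * se_f)
display "95% CI: [" %6.4f lb ", " %6.4f ub "]"
```

Unlike an interval of the form p plus or minus a multiple of se, this interval is not symmetric around p.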
Example 1
Here we use svy: tabulate to estimate the distribution of the race category variable from our
NHANES II dataset (McDowell et al. 1981). Before calling svy: tabulate, we use svyset to declare
the survey structure of the data.

pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
. svy: tabulate race
Number of strata   =        31                  Number of obs     =      10351
Number of PSUs     =        62                  Population size   =  117157513
                                                Design df         =         31

----------------------
1=white,  |
2=black,  |
3=other   | proportions
----------+-----------
    White |       .8792
    Black |       .0955
    Other |       .0253
          |
    Total |           1
----------------------
  Key:  proportions  =  cell proportions
Here we display weighted counts for each category of race along with the 95% confidence bounds,
as well as the design effects DEFF and DEFT. We also use the format() option to improve the look
of the table.
. svy: tabulate race, format(%11.3g) count ci deff deft

Number of strata   =        31        Number of obs     =      10351
Number of PSUs     =        62        Population size   =  117157513
                                      Design df         =         31

1=white,  |
2=black,  |
3=other   |       count           lb           ub      deff      deft
----------+----------------------------------------------------------
    White |   102999549     97060400    108938698      60.2      7.76
    Black |    11189236      8213964     14164508      18.6      4.31
    Other |     2968728       414930      5522526      47.9      6.92
          |
    Total |   117157513
  Key:  count  =  weighted counts
        lb     =  lower 95% confidence bounds for weighted counts
        ub     =  upper 95% confidence bounds for weighted counts
        deff   =  deff for variances of weighted counts
        deft   =  deft for variances of weighted counts

From the above results, we can conclude with 95% confidence that the number of people in the
population that fall within the White category is between 97,060,400 and 108,938,698.
Stored results
In addition to the results documented in [SVY] svy, svy: tabulate stores the following in e():
Scalars
    e(r)              number of rows
    e(total)          weighted sum of tab() variable

Macros
    e(cmd)            tabulate
    e(tab)            tab() variable
    e(rowlab)         label or empty
    e(rowvlab)        row variable label
    e(rowvar)         varname, the row variable
    e(setype)         cell or count

Matrices
    e(Prop)           matrix of cell proportions
    e(Obs)            matrix of observation counts
    e(Deff)           DEFF vector for e(setype) items
    e(Deft)           DEFT vector for e(setype) items
    e(Row)            values for row variable
    e(V_row)          variance for row totals
    e(V_srs_row)      Vsrs for row totals
    e(Deff_row)       DEFF for row totals
    e(Deft_row)       DEFT for row totals

See Methods and formulas in [SVY] svy: tabulate twoway for a discussion of how table items
and confidence intervals are computed. A one-way table is really just a two-way table that has one
row or column.
Reference
Also see
[SVY] svydescribe Describe survey data
[R] tabulate oneway One-way table of frequencies
[SVY] svy: tabulate twoway Two-way tables for survey data

Title
svy: tabulate twoway Two-way tables for survey data
Syntax
Options
Menu
References
Description
Stored results
Also see
Syntax
Basic syntax

    svy: tabulate varname1 varname2

Full syntax

    svy [vcetype] [, svy_options] : tabulate varname1 varname2 [if] [in]
        [, tabulate_options display_items display_options statistic_options]

Syntax to report results

    svy [, display_items display_options statistic_options]

vcetype
  SE
    linearized
    bootstrap
    brr
    jackknife
    sdr

svy_options
  if/in
    subpop([varname] [if])
  SE
    bootstrap_options
    brr_options
    jackknife_options
    sdr_options

tabulate_options        Description
  Model
    stdize(varname)       variable identifying strata for standardization
    stdweight(varname)    weight variable for standardization
    tab(varname)          variable for which to compute cell totals/proportions
    missing               treat missing values like other values

display_items           Description
  Table items
    cell                  cell proportions
    count                 weighted cell counts
    column                within-column proportions
    row                   within-row proportions
    se                    standard errors
    ci                    confidence intervals
    deff                  display the DEFF design effects
    deft                  display the DEFT design effects
    cv                    display the coefficient of variation
    srssubpop             report design effects assuming SRS within subpopulation
    obs                   cell observations

When any of se, ci, deff, deft, cv, or srssubpop is specified, only one of cell, count, column, or row can
be specified. If none of se, ci, deff, deft, cv, or srssubpop is specified, any or all of cell, count, column,
and row can be specified.
display_options         Description
  Reporting
    level(#)              set confidence level; default is level(95)
    proportion            display proportions; the default
    percent               display percentages instead of proportions
    vertical              stack confidence interval endpoints vertically
    nomarginals           suppress row and column marginals
    nolabel               suppress displaying value labels
    notable               suppress displaying the table
    cellwidth(#)          cell width
    csepwidth(#)          column-separation width
    stubwidth(#)          stub width
    format(%fmt)          cell format; default is format(%6.0g)

proportion and notable are not shown in the dialog box.

statistic_options       Description
  Test statistics
    pearson               Pearson's chi-squared
    lr                    likelihood ratio
    null                  display null-based statistics
    wald                  adjusted Wald
    llwald                adjusted log-linear Wald
    noadjust              report unadjusted Wald statistics

Menu
Statistics > Survey data analysis > Tables > Two-way tables
Description
svy: tabulate produces two-way tabulations with tests of independence for complex survey data.
See [SVY] svy: tabulate oneway for one-way tabulations for complex survey data.
Options
Model
stdize(varname) specifies that the point estimates be adjusted by direct standardization across the
strata identified by varname. This option requires the stdweight() option.
stdweight(varname) specifies the weight variable associated with the standard strata identified in
the stdize() option. The standardization weights must be constant within the standard strata.
tab(varname) specifies that counts be cell totals of this variable and that proportions (or percentages)
be relative to (that is, weighted by) this variable. For example, if this variable denotes income, the
cell counts are instead totals of income for each cell, and the cell proportions are proportions
of income for each cell.
missing specifies that missing values in varname1 and varname2 be treated as another row or column
category rather than be omitted from the analysis (the default).
Table items
cell requests that cell proportions (or percentages) be displayed. This is the default if none of count,
row, or column is specified.
count requests that weighted cell counts be displayed.
column or row requests that column or row proportions (or percentages) be displayed.
se requests that the standard errors of cell proportions (the default), weighted counts, or row or
column proportions be displayed. When se (or ci, deff, deft, or cv) is specified, only one of
cell, count, row, or column can be selected. The standard error computed is the standard error
of the one selected.
ci requests confidence intervals for cell proportions, weighted counts, or row or column proportions.
The confidence intervals are constructed using a logit transform so that their endpoints always lie
between 0 and 1.
deff and deft request that the design-effect measures DEFF and DEFT be displayed for each cell
proportion, count, or row or column proportion. See [SVY] estat for details. The mean generalized
DEFF is also displayed when deff, deft, or subpop is requested; see Methods and formulas for
an explanation.
Options deff and deft are not allowed with estimation results that used direct standardization
or poststratification.
cv requests that the coefficient of variation be displayed for each cell proportion, count, or row or
column proportion. See [SVY] estat for details.
srssubpop requests that DEFF and DEFT be computed using an estimate of SRS (simple random
sampling) variance for sampling within a subpopulation. By default, DEFF and DEFT are computed
using an estimate of the SRS variance for sampling from the entire population. Typically, srssubpop
would be given when computing subpopulation estimates by strata or by groups of strata.
obs requests that the number of observations for each cell be displayed.
Reporting
proportion, the default, requests that proportions be displayed.
percent requests that percentages be displayed instead of proportions.
vertical requests that the endpoints of confidence intervals be stacked vertically on display.
nomarginals requests that row and column marginals not be displayed.
nolabel requests that variable labels and value labels be ignored.
notable prevents the header and table from being displayed in the output. When specified, only the
results of the requested test statistics are displayed. This option may not be specified with any
other option in display_options except the level() option.
cellwidth(#), csepwidth(#), and stubwidth(#) specify widths of table elements in the output;
see [P] tabdisp. Acceptable values for the stubwidth() option range from 4 to 32.
format(%fmt) specifies a format for the items in the table. The default is format(%6.0g). See
[U] 12.5 Formats: Controlling how data are displayed.
Test statistics
pearson requests that the Pearson $\chi^2$ statistic be computed. By default, this is the test of independence
that is displayed. The Pearson $\chi^2$ statistic is corrected for the survey design with the second-order
correction of Rao and Scott (1984) and is converted into an F statistic. One term in the correction
formula can be calculated using either observed cell proportions or proportions under the null
hypothesis (that is, the product of the marginals). By default, observed cell proportions are used.
If the null option is selected, then a statistic corrected using proportions under the null hypothesis
is displayed as well.
lr requests that the likelihood-ratio test statistic for proportions be computed. This statistic is not
defined when there are one or more zero cells in the table. The statistic is corrected for the survey
design by using the same correction procedure that is used with the pearson statistic. Again either
observed cell proportions or proportions under the null hypothesis can be used in the correction
formula. By default, the former is used; specifying the null option gives both the former and the
latter. Neither variant of this statistic is recommended for sparse tables. For nonsparse tables, the
lr statistics are similar to the corresponding pearson statistics.
null modifies the pearson and lr options only. If null is specified, two corrected statistics are
displayed. The statistic labeled D-B (null) (D-B stands for design-based) uses proportions under
the null hypothesis (that is, the product of the marginals) in the Rao and Scott (1984) correction.
The statistic labeled merely Design-based uses observed cell proportions. If null is not specified,
only the correction that uses observed proportions is displayed.
wald requests a Wald test of whether observed weighted counts equal the product of the marginals
(Koch, Freeman, and Freeman 1975). By default, an adjusted F statistic is produced; an unadjusted
statistic can be produced by specifying noadjust. The unadjusted F statistic can yield extremely
anticonservative p-values (that is, p-values that are too small) when the degrees of freedom of the
variance estimates (the number of sampled PSUs minus the number of strata) are small relative
to the $(R-1)(C-1)$ degrees of freedom of the table (where $R$ is the number of rows and $C$
is the number of columns). Hence, the statistic produced by wald and noadjust should not be
used for inference unless it is essentially identical to the adjusted statistic.
This option must be specified at run time in order to be used on subsequent calls to svy to report
results.
llwald requests a Wald test of the log-linear model of independence (Koch, Freeman, and Freeman 1975). The statistic is not defined when there are one or more zero cells in the table. The
adjusted statistic (the default) can produce anticonservative p-values, especially for sparse tables,
when the degrees of freedom of the variance estimates are small relative to the degrees of freedom of
the table. Specifying noadjust yields a statistic with more severe problems. Neither the adjusted
nor the unadjusted statistic is recommended for inference; the statistics are made available only
for pedagogical purposes.
noadjust modifies the wald and llwald options only. It requests that an unadjusted F statistic be
displayed in addition to the adjusted statistic.
svy: tabulate uses the tabdisp command (see [P] tabdisp) to produce the table. Only five items
can be displayed in the table at one time. The ci option implies two items. If too many items are
selected, a warning will appear immediately. To view more items, redisplay the table while specifying
different options.

Introduction
The Rao and Scott correction
Wald statistics
Properties of the statistics
Introduction
Despite the long list of options for svy: tabulate, it is a simple command to use. Using the
svy: tabulate command is just like using tabulate to produce two-way tables for ordinary data.
The main difference is that svy: tabulate computes a test of independence that is appropriate for
complex survey data.
The test of independence that is displayed by default is based on the usual Pearson $\chi^2$ statistic
for two-way tables. To account for the survey design, the statistic is turned into an F statistic with
noninteger degrees of freedom by using a second-order Rao and Scott (1981, 1984) correction.
Although the theory behind the Rao and Scott correction is complicated, the p-value for the corrected
F statistic can be interpreted in the same way as a p-value for the Pearson $\chi^2$ statistic for ordinary
data (that is, data that are assumed independent and identically distributed [i.i.d.]).
svy: tabulate, in fact, computes four statistics for the test of independence with two variants
of each, for a total of eight statistics. The option combinations for the eight statistics are the
following:
1. pearson (the default)
2. pearson null
3. lr
4. lr null
5. wald
6. wald noadjust
7. llwald
8. llwald noadjust
The wald and llwald options with noadjust yield the statistics developed by Koch, Freeman, and
Freeman (1975), which have been implemented in the CROSSTAB procedure of the SUDAAN software
(Research Triangle Institute 1997, release 7.5).
These eight statistics, along with other variants, have been evaluated in simulations (Sribney 1998).
On the basis of these simulations, we advise researchers to use the default statistic (the pearson
option) in all situations. We recommend that the other statistics be used only for comparative or
pedagogical purposes. Sribney (1998) gives a detailed comparison of the statistics; a summary of his
conclusions is provided later in this entry.
Other than the test-statistic options (statistic_options) and the survey design options (svy_options),
most of the other options of svy: tabulate simply relate to different choices for what can be
displayed in the body of the table. By default, cell proportions are displayed, but viewing either row
or column proportions or weighted counts usually makes more sense.
Standard errors and confidence intervals can optionally be displayed for weighted counts or cell,
row, or column proportions. The confidence intervals for proportions are constructed using a logit
transform so that their endpoints always lie between 0 and 1. Associated design effects (DEFF and
DEFT) can be viewed for the variance estimates. The mean generalized DEFF (Rao and Scott 1984)
is also displayed when option deff, deft, or srssubpop is specified. The mean generalized DEFF
is essentially a design effect for the asymptotic distribution of the test statistic; see the Methods and
formulas section at the end of this entry.
Example 1
Using data from the Second National Health and Nutrition Examination Survey (NHANES II)
(McDowell et al. 1981), we identify the survey design characteristics with svyset and then produce
a two-way table of cell proportions with svy: tabulate.
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
. svy: tabulate race diabetes

Number of strata   =        31        Number of obs     =      10349
Number of PSUs     =        62        Population size   =  117131111
                                      Design df         =         31

1=white,  |
2=black,  |   diabetes, 1=yes, 0=no
3=other   |        0         1     Total
----------+-----------------------------
    White |     .851     .0281     .8791
    Black |    .0899     .0056     .0955
    Other |    .0248   5.2e-04     .0253
          |
    Total |    .9658     .0342         1
  Key:  cell proportions

  Pearson:
    Uncorrected   chi2(2)          =   21.3483
    Design-based  F(1.52, 47.26)   =   15.0056     P = 0.0000

The default table displays only cell proportions, and this makes it difficult to compare the incidence
of diabetes in white, black, and other racial groups. It would be better to look at row proportions.
This can be done by redisplaying the results (that is, reissuing the command without specifying any
variables) with the row option.

. svy: tabulate, row

Number of strata   =        31        Number of obs     =      10349
Number of PSUs     =        62        Population size   =  117131111
                                      Design df         =         31

1=white,  |
2=black,  |   diabetes, 1=yes, 0=no
3=other   |        0         1     Total
----------+-----------------------------
    White |     .968      .032         1
    Black |     .941      .059         1
    Other |    .9797     .0203         1
          |
    Total |    .9658     .0342         1
  Key:  row proportions

  Pearson:
    Uncorrected   chi2(2)          =   21.3483
    Design-based  F(1.52, 47.26)   =   15.0056     P = 0.0000

This table is much easier to interpret. A larger proportion of blacks have diabetes than do whites
or persons in the other racial category. The test of independence for a two-way contingency table
is equivalent to the test of homogeneity of row (or column) proportions. Hence, we can conclude
that there is a highly significant difference in the incidence of diabetes among the three racial
groups.
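As a rough check on how the two displays relate, each row proportion is the cell proportion divided by its row marginal. The sketch below is an illustration in Python rather than Stata; the function name row_proportions is hypothetical, and the inputs are the rounded cell proportions printed in the first table of this example, so the results only approximate the row proportions that Stata reports from unrounded estimates.

```python
def row_proportions(cells):
    """Divide every cell proportion by the sum of its row (the row marginal)."""
    result = []
    for row in cells:
        row_total = sum(row)
        result.append([value / row_total for value in row])
    return result


# Rounded cell proportions copied from the cell-proportion table
# (rows: White, Black, Other; columns: diabetes = 0, diabetes = 1).
cells = [
    [0.851, 0.0281],
    [0.0899, 0.0056],
    [0.0248, 0.00052],
]

for label, row in zip(["White", "Black", "Other"], row_proportions(cells)):
    print(label, [round(value, 3) for value in row])
```

Because the inputs are rounded, small discrepancies in the third decimal place against the row-proportion table are expected.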
We may now wish to compute confidence intervals for the row proportions. If we try to redisplay,
specifying ci along with row, we get the following result:
. svy: tabulate, row ci
confidence intervals are only available for cells
to compute row confidence intervals, rerun command with row and ci options
r(111);
There are limits to what svy: tabulate can redisplay. Basically, any of the options relating to
variance estimation (that is, se, ci, deff, and deft) must be specified at run time along with the
single item (that is, count, cell, row, or column) for which you want standard errors, confidence
intervals, DEFF, or DEFT. So to get confidence intervals for row proportions, we must rerun the
command. We do so below, requesting not only ci but also se.

. svy: tabulate race diabetes, row se ci format(%7.4f)

Number of strata   =        31        Number of obs     =      10349
Number of PSUs     =        62        Population size   =  117131111
                                      Design df         =         31

1=white,  |
2=black,  |              diabetes, 1=yes, 0=no
3=other   |               0                 1             Total
----------+----------------------------------------------------
    White |          0.9680            0.0320            1.0000
          |        (0.0020)          (0.0020)
          | [0.9638,0.9718]   [0.0282,0.0362]
          |
    Black |          0.9410            0.0590            1.0000
          |        (0.0061)          (0.0061)
          | [0.9271,0.9523]   [0.0477,0.0729]
          |
    Other |          0.9797            0.0203            1.0000
          |        (0.0076)          (0.0076)
          | [0.9566,0.9906]   [0.0094,0.0434]
          |
    Total |          0.9658            0.0342            1.0000
          |        (0.0018)          (0.0018)
          | [0.9619,0.9693]   [0.0307,0.0381]
  Key:  row proportions
        (linearized standard errors of row proportions)
        [95% confidence intervals for row proportions]

  Pearson:
    Uncorrected   chi2(2)          =   21.3483
    Design-based  F(1.52, 47.26)   =   15.0056     P = 0.0000

In the above table, we specified a %7.4f format rather than using the default %6.0g format.
The single format applies to every item in the table. We can omit the marginal totals by specifying
nomarginals. If the above style for displaying the confidence intervals is obtrusive (and it can be
in a wider table), we can use the vertical option to stack the endpoints of the confidence interval,
one over the other, and omit the brackets (the parentheses around the standard errors are also omitted
when vertical is specified). To express results as percentages, as with the tabulate command (see
[R] tabulate twoway), we can use the percent option. Or we can play around with these display
options until we get a table that we are satisfied with, first making changes to the options on redisplay
(that is, omitting the cross-tabulated variables when we issue the command).
Technical note
The standard errors computed by svy: tabulate are the same as those produced by svy: mean,
svy: proportion, and svy: ratio. Indeed, svy: tabulate uses these commands as subroutines
to produce its table.
In the previous example, the estimate of the proportion of African Americans with diabetes (the
second proportion in the second row of the preceding table) is simply a ratio estimate; hence, we can
also obtain the same estimates by using svy: ratio:
. drop black
. gen black = (race==2) if !missing(race)
. gen diablk = diabetes*black
. svy: ratio diablk/black

Number of strata   =        31        Number of obs     =      10349
Number of PSUs     =        62        Population size   =  117131111
                                      Design df         =         31

     _ratio_1: diablk/black

             |             Linearized
             |      Ratio   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    _ratio_1 |   .0590349   .0061443      .0465035    .0715662

Although the standard errors are the same, the confidence intervals are slightly different. The
svy: tabulate command produced the confidence interval [ 0.0477, 0.0729 ], and svy: ratio
gave [ 0.0465, 0.0716 ]. The difference is because svy: tabulate uses a logit transform to produce
confidence intervals whose endpoints are always between 0 and 1. This transformation also shifts the
confidence intervals slightly toward 0.5, which is beneficial because the untransformed confidence
intervals tend to be, on average, biased away from 0.5. See Methods and formulas for details.
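The two intervals can be compared with a small calculation. The following Python sketch is an illustration, not Stata's implementation: it assumes the logit interval is formed by adding and subtracting t times se/(p(1-p)) on the logit scale and then back-transforming, and it hard-codes the 97.5th percentile of Student's t with 31 degrees of freedom (about 2.0395) as an assumption so that only the standard library is needed. The function names are hypothetical.

```python
import math

# Assumption: 97.5th percentile of Student's t with 31 design degrees of
# freedom, hard-coded so that the sketch needs only the standard library.
T_CRIT_DF31 = 2.0395


def linear_ci(p, se, t_crit):
    """Untransformed interval: p +/- t * se."""
    return p - t_crit * se, p + t_crit * se


def logit_ci(p, se, t_crit):
    """Interval built on the logit scale and mapped back into (0, 1)."""
    logit = math.log(p / (1.0 - p))
    # Delta method: the standard error of logit(p) is se / (p * (1 - p)).
    se_logit = se / (p * (1.0 - p))
    lower = logit - t_crit * se_logit
    upper = logit + t_crit * se_logit
    return 1.0 / (1.0 + math.exp(-lower)), 1.0 / (1.0 + math.exp(-upper))


# Ratio estimate and linearized standard error from the svy: ratio output.
p_hat, se_hat = 0.0590349, 0.0061443
print("untransformed:", linear_ci(p_hat, se_hat, T_CRIT_DF31))
print("logit:        ", logit_ci(p_hat, se_hat, T_CRIT_DF31))
```

Rounded to four decimal places, the untransformed endpoints agree with the interval reported by svy: ratio and the logit endpoints agree with the interval reported by svy: tabulate; both logit endpoints sit slightly closer to 0.5.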
Example 2: The tab() option

The tab() option allows us to compute proportions relative to a certain variable. Suppose that
we wish to compare the proportion of total income among different racial groups in males with that
of females. We do so below with fictitious data:
. use http://www.stata-press.com/data/r13/svy_tabopt, clear
. svy: tabulate gender race, tab(income) row

Number of strata   =        31        Number of obs     =      10351
Number of PSUs     =        62        Population size   =  117157513
                                      Design df         =         31

          |                 Race
   Gender |    White     Black     Other     Total
----------+---------------------------------------
     Male |    .8857     .0875     .0268         1
   Female |     .884      .094      .022         1
          |
    Total |    .8848     .0909     .0243         1
  Tabulated variable:  income
  Key:  row proportions

  Pearson:
    Uncorrected   chi2(2)   =    3.6241
    Design-based  F         =    0.8626     P = 0.4227

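To make the effect of tab() concrete, the sketch below computes row proportions of a tabulated variable from individual records: each cell holds the weighted total of the variable, and each row is then scaled by its own total. This is a Python illustration under stated assumptions, not Stata code; the function name, the record layout, and the four records are all hypothetical.

```python
from collections import defaultdict


def tab_row_proportions(records):
    """Row proportions of a tabulated variable.

    records: iterable of (row_category, column_category, weight, value),
    where value plays the role of the tab() variable (for example, income).
    """
    cell_total = defaultdict(float)
    row_total = defaultdict(float)
    for row_cat, col_cat, weight, value in records:
        # Each observation contributes its weighted value to its cell.
        cell_total[(row_cat, col_cat)] += weight * value
        row_total[row_cat] += weight * value
    return {key: total / row_total[key[0]] for key, total in cell_total.items()}


# Hypothetical records: (gender, race, sampling weight, income).
records = [
    ("male", "white", 2.0, 100.0),
    ("male", "black", 1.0, 50.0),
    ("female", "white", 1.0, 300.0),
    ("female", "black", 3.0, 100.0),
]
for key, share in sorted(tab_row_proportions(records).items()):
    print(key, share)
```

With these made-up records, white males hold 200 of the 250 weighted income units in the male row, so their row proportion is 0.8.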
The Rao and Scott correction

svy: tabulate can produce eight different statistics for the test of independence. By default,
svy: tabulate displays the Pearson $\chi^2$ statistic with the Rao and Scott (1981, 1984) second-order
correction. On the basis of simulations (Sribney 1998), we recommend that you use this statistic
in all situations. The statistical literature, however, contains several alternatives, along with other
possibilities for implementing the Rao and Scott correction. Hence, for comparative or pedagogical
purposes, you may want to view some of the other statistics computed by svy: tabulate. This
section briefly describes the differences among these statistics; for a more detailed discussion, see
Sribney (1998).
Two statistics commonly used for i.i.d. data for the test of independence of $R \times C$ tables ($R$ rows
and $C$ columns) are the Pearson $\chi^2$ statistic

$$X_P^2 = m \sum_{r=1}^{R} \sum_{c=1}^{C} (\hat{p}_{rc} - \hat{p}_{0rc})^2 / \hat{p}_{0rc}$$

and the likelihood-ratio $\chi^2$ statistic

$$X_{LR}^2 = 2m \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{p}_{rc} \ln(\hat{p}_{rc} / \hat{p}_{0rc})$$

where $m$ is the total number of sampled individuals, $\hat{p}_{rc}$ is the estimated proportion for the cell in the
$r$th row and $c$th column of the table, and $\hat{p}_{0rc}$ is the estimated proportion under the null hypothesis of
independence; that is, $\hat{p}_{0rc} = \hat{p}_{r\cdot}\,\hat{p}_{\cdot c}$, the product of the row and column marginals:
$\hat{p}_{r\cdot} = \sum_{c=1}^{C} \hat{p}_{rc}$ and $\hat{p}_{\cdot c} = \sum_{r=1}^{R} \hat{p}_{rc}$.
For i.i.d. data, both these statistics are distributed asymptotically as $\chi^2_{(R-1)(C-1)}$. The likelihood-ratio
statistic is not defined when one or more of the cells in the table are empty. The Pearson statistic,
however, can be calculated when one or more cells in the table are empty; the statistic may not have
good properties in this case, but it still has a computable value.
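The two uncorrected statistics can be sketched directly from their definitions. The Python function below is an illustration with a hypothetical name; it takes a table of estimated cell proportions and the sample size, forms the null proportions as products of the marginals, and returns None for the likelihood-ratio statistic when any cell is empty. It assumes every row and column marginal is positive and does not apply any survey-design correction.

```python
import math


def pearson_and_lr(p, m):
    """Uncorrected Pearson and likelihood-ratio statistics for an R x C table.

    p: list of rows of estimated cell proportions (summing to 1 overall)
    m: total number of sampled individuals
    Returns (X2_P, X2_LR); X2_LR is None if any cell proportion is zero.
    """
    row_marg = [sum(row) for row in p]
    col_marg = [sum(col) for col in zip(*p)]
    pearson_sum = 0.0
    lr_sum = 0.0
    lr_defined = True
    for r, row in enumerate(p):
        for c, p_rc in enumerate(row):
            p0_rc = row_marg[r] * col_marg[c]  # proportion under independence
            pearson_sum += (p_rc - p0_rc) ** 2 / p0_rc
            if p_rc > 0:
                lr_sum += p_rc * math.log(p_rc / p0_rc)
            else:
                lr_defined = False  # log of zero: statistic is undefined
    return m * pearson_sum, (2 * m * lr_sum if lr_defined else None)


# Hypothetical 2 x 2 table of proportions with m = 100 sampled individuals.
print(pearson_and_lr([[0.3, 0.2], [0.2, 0.3]], 100))
# A table with an empty cell: Pearson is computable, the LR statistic is not.
print(pearson_and_lr([[0.5, 0.0], [0.25, 0.25]], 100))
```

Returning None rather than raising an error mirrors the text: the Pearson statistic still has a computable value when a cell is empty, whereas the likelihood-ratio statistic does not.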
For survey data, $X_P^2$ and $X_{LR}^2$ can be computed using weighted estimates of $\hat{p}_{rc}$ and $\hat{p}_{0rc}$. However,
for a complex sampling design, one can no longer claim that they are distributed as $\chi^2_{(R-1)(C-1)}$, but
you can estimate the variance of $\hat{p}_{rc}$ under the sampling design. For instance, in Stata, this variance
can be estimated via linearization methods by using svy: mean or svy: ratio.
Rao and Scott (1981, 1984) derived the asymptotic distribution of $X_P^2$ and $X_{LR}^2$ in terms of the
variance of $\hat{p}_{rc}$. Unfortunately, the result (see (1) in Methods and formulas) is not computationally
feasible, but it can be approximated using correction formulas. svy: tabulate uses the second-order
correction developed by Rao and Scott (1984). By default, or when the pearson option is specified,
svy: tabulate displays the second-order correction of the Pearson statistic. The lr option gives the
second-order correction of the likelihood-ratio statistic. Because it is the default of svy: tabulate,
the correction computed with $\hat{p}_{rc}$ is referred to as the default correction.
The Rao and Scott papers, however, left some details outstanding about the computation of the
correction. One term in the correction formula can be computed using either $\hat{p}_{rc}$ or $\hat{p}_{0rc}$. Because
under the null hypothesis both are asymptotically equivalent, theory offers no guidance about which
is best. By default, svy: tabulate uses $\hat{p}_{rc}$ for the corrections of the Pearson and likelihood-ratio
statistics. If the null option is specified, the correction is computed using $\hat{p}_{0rc}$. For nonsparse tables,
these two correction methods yield almost identical results. However, in simulations of sparse tables,
Sribney (1998) found that the null-corrected statistics were extremely anticonservative for 2 × 2 tables
(that is, under the null, significance was declared too often) and were too conservative for other
tables. The default correction, however, had better properties. Hence, we do not recommend using
null.
For the computational details of the Rao and Scott-corrected statistics, see Methods and formulas.
Wald statistics
Prior to the work by Rao and Scott (1981, 1984), Wald tests for the test of independence for
two-way tables were developed by Koch, Freeman, and Freeman (1975). Two Wald statistics have
been proposed. The first, similar to the Pearson statistic, is based on

$$\hat{Y}_{rc} = \hat{N}_{rc} - \hat{N}_{r\cdot}\,\hat{N}_{\cdot c} / \hat{N}_{\cdot\cdot}$$

where $\hat{N}_{rc}$ is the estimated weighted count for the $r,c$th cell. The delta method can be used to
approximate the variance of $\hat{Y}_{rc}$, and a Wald statistic can be calculated as usual. A second Wald
statistic can be constructed based on a log-linear model for the table. Like the likelihood-ratio statistic,
this statistic is undefined when there is a zero proportion in the table.
These Wald statistics are initially $\chi^2$ statistics, but they have better properties when converted
into F statistics with denominator degrees of freedom that account for the degrees of freedom of the
variance estimator. They can be converted to F statistics in two ways.
One method is the standard manner: divide by the $\chi^2$ degrees of freedom $d_0 = (R-1)(C-1)$
to get an F statistic with $d_0$ numerator degrees of freedom and $\nu = n - L$ (the number of sampled
PSUs minus the number of strata) denominator degrees of
freedom. This is the form of the F statistic suggested by Koch, Freeman, and Freeman (1975) and
implemented in the CROSSTAB procedure of the SUDAAN software (Research Triangle Institute 1997,
release 7.5), and it is the method used by svy: tabulate when the noadjust option is specified
with wald or llwald.
Another technique is to adjust the F statistic by using

$$F_{\mathrm{adj}} = (\nu - d_0 + 1)\,W / (\nu d_0)$$

with

$$F_{\mathrm{adj}} \sim F(d_0,\ \nu - d_0 + 1)$$

This is the default adjustment for svy: tabulate. test and the other svy estimation commands produce adjusted F statistics by default, using the same adjustment procedure. See Korn
and Graubard (1990) for a justification of the procedure.
The adjusted F statistic is identical to the unadjusted F statistic when $d_0 = 1$, that is, for 2 × 2
tables.
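The two conversions are simple enough to verify by hand. The Python sketch below is an illustration with a hypothetical function name; it takes the Wald chi-squared statistic W, the design counts, and the table dimensions and returns both F statistics with their degrees of freedom. It does not compute p-values, which would require an F distribution function outside the standard library. The inputs in the usage lines are the Pearson Wald values from Example 3 of this entry (W = 21.2704, 62 PSUs, 31 strata, a 2 × 8 table).

```python
def wald_f_statistics(w, n_psu, n_strata, n_rows, n_cols):
    """Convert a Wald chi-squared statistic into unadjusted and adjusted F."""
    d0 = (n_rows - 1) * (n_cols - 1)  # table degrees of freedom
    nu = n_psu - n_strata             # variance (design) degrees of freedom
    f_unadj = w / d0                             # referred to F(d0, nu)
    f_adj = (nu - d0 + 1) * w / (nu * d0)        # referred to F(d0, nu - d0 + 1)
    return {
        "F_unadj": f_unadj,
        "df_unadj": (d0, nu),
        "F_adj": f_adj,
        "df_adj": (d0, nu - d0 + 1),
    }


result = wald_f_statistics(21.2704, n_psu=62, n_strata=31, n_rows=2, n_cols=8)
print(result)
```

For these inputs the unadjusted statistic is about 3.0386 on (7, 31) degrees of freedom and the adjusted statistic is about 2.4505 on (7, 25), matching the Wald (Pearson) lines of Example 3. When d0 is 1, the factor (nu - d0 + 1)/nu equals 1 and the two statistics coincide.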
As Thomas and Rao (1987) point out (also see Korn and Graubard [1990]), the unadjusted
F statistics can become extremely anticonservative as $d_0$ increases when $\nu$ is small or moderate;
that is, under the null, the statistics are significant far more often than they should be. Because
the unadjusted statistics behave so poorly for larger tables when $\nu$ is not large, their use can be
justified only for small tables or when $\nu$ is large. But when the table is small or when $\nu$ is large,
the unadjusted statistic is essentially identical to the adjusted statistic. Hence, for statistical inference,
looking at the unadjusted statistics has no point.
The adjusted Pearson Wald F statistic usually behaves reasonably under the null. However, even
the adjusted F statistic for the log-linear Wald test tends to be moderately anticonservative when
$\nu$ is not large (Thomas and Rao 1987; Sribney 1998).
Example 3
With the NHANES II data, we tabulate, for the male subpopulation, high blood pressure (highbp)
versus a variable (sizplace) that indicates the degree of urbanity/ruralness. We request that all eight
statistics for the test of independence be displayed.

. gen male = (sex==1) if !missing(sex)
. svy, subpop(male): tabulate highbp sizplace, col obs pearson lr null wald
> llwald noadj

Number of strata   =        31        Number of obs      =      10351
Number of PSUs     =        62        Population size    =  117157513
                                      Subpop. no. of obs =       4915
                                      Subpop. size       =   56159480
                                      Design df          =         31

1 if BP > |
140/90, 0 |                          1=urban,..., 8=rural
otherwise |       1       2       3       4       5       6       7       8   Total
----------+------------------------------------------------------------------------
        0 |   .8489   .8929   .9213   .8509   .8413   .9242   .8707   .8674   .8764
          |     431     527     558     371     186     210     314    1619    4216
          |
        1 |   .1511   .1071   .0787   .1491   .1587   .0758   .1293   .1326   .1236
          |      95      80      64      74      36      20      57     273     699
          |
    Total |       1       1       1       1       1       1       1       1       1
          |     526     607     622     445     222     230     371    1892    4915
  Key:  column proportions
        number of observations

  Pearson:
    Uncorrected   chi2(7)          =   64.4581
    D-B (null)    F(5.30, 164.45)  =    2.2078     P = 0.0522
    Design-based  F(5.54, 171.87)  =    2.6863     P = 0.0189

  Likelihood ratio:
    Uncorrected   chi2(7)          =   68.2365
    D-B (null)    F(5.30, 164.45)  =    2.3372     P = 0.0408
    Design-based  F(5.54, 171.87)  =    2.8437     P = 0.0138

  Wald (Pearson):
    Unadjusted    chi2(7)          =   21.2704
    Unadjusted    F(7, 31)         =    3.0386     P = 0.0149
    Adjusted      F(7, 25)         =    2.4505     P = 0.0465

  Wald (log-linear):
    Unadjusted    chi2(7)          =   25.7644
    Unadjusted    F(7, 31)         =    3.6806     P = 0.0052
    Adjusted      F(7, 25)         =    2.9683     P = 0.0208

The p-values from the null-corrected Pearson and likelihood-ratio statistics (lines labeled D-B
(null); D-B stands for design-based) are bigger than the corresponding default-corrected statistics
(lines labeled Design-based). Simulations (Sribney 1998) show that the null-corrected statistics are
overly conservative for many sparse tables (except 2 × 2 tables); this appears to be the case here,
although this table is hardly sparse. The default-corrected Pearson statistic has good properties under
the null for both sparse and nonsparse tables; hence, the smaller p-value for it should be considered
reliable.
The default-corrected likelihood-ratio statistic is usually similar to the default-corrected Pearson
statistic except for sparse tables, when it tends to be anticonservative. This example follows this
pattern, with its p-value being slightly smaller than that of the default-corrected Pearson statistic.
For tables of these dimensions (2 × 8), the unadjusted Pearson Wald and log-linear Wald
F statistics are extremely anticonservative under the null when the variance degrees of freedom is
small. Here the variance degrees of freedom is only 31 (62 PSUs minus 31 strata), so we expect that
the unadjusted Wald F statistics yield smaller p-values than the adjusted F statistics. Because of
their poor behavior under the null for small variance degrees of freedom, they cannot be trusted here.
Simulations show that although the adjusted Pearson Wald F statistic has good properties under
the null, it is often less powerful than the default Rao and Scott-corrected statistics. That is probably
the explanation for the larger p-value for the adjusted Pearson Wald F statistic than that for the
default-corrected Pearson and likelihood-ratio statistics.
The p-value for the adjusted log-linear Wald F statistic is about the same as that for the trustworthy
default-corrected Pearson statistic. However, that is probably because of the anticonservatism of the
log-linear Wald under the null balancing out its lower power under alternative hypotheses.
The uncorrected $\chi^2$ Pearson and likelihood-ratio statistics displayed in the table are misspecified
statistics; that is, they are based on an i.i.d. assumption, which is not valid for complex survey data.
Hence, they are not correct, even asymptotically. The unadjusted Wald $\chi^2$ statistics, on the other
hand, are completely different. They are valid asymptotically as the variance degrees of freedom
becomes large.
Properties of the statistics

This section briefly summarizes the properties of the eight statistics computed by svy: tabulate.
For details, see Sribney (1998), Rao and Thomas (1989), Thomas and Rao (1987), and Korn and
Graubard (1990).
pearson is the Rao and Scott (1984) second-order corrected Pearson statistic, computed using $\hat{p}_{rc}$
in the correction (default correction). It is displayed by default. Simulations show it to have good
properties under the null for both sparse and nonsparse tables. Its power is similar to that of the
lr statistic in most situations. It often appears to be more powerful than the adjusted Pearson
Wald F statistic (wald option), especially for larger tables. We recommend using this statistic in
all situations.
pearson null is the Rao and Scott second-order corrected Pearson statistic, computed using $\hat{p}_{0rc}$ in
the correction. It is numerically similar to the pearson statistic for nonsparse tables. For sparse
tables, it can be erratic. Under the null, it can be anticonservative for sparse 2 × 2 tables but
conservative for larger sparse tables.
lr is the Rao and Scott second-order corrected likelihood-ratio statistic, computed using $\hat{p}_{rc}$ in the
correction (default correction). The correction is identical to that for pearson. It is numerically
similar to the pearson statistic for nonsparse tables. It can be anticonservative (p-values too small)
in sparse tables. If there is a zero cell, it cannot be computed.
lr null is the Rao and Scott second-order corrected likelihood-ratio statistic, computed using $\hat{p}_{0rc}$
in the correction. The correction is identical to that for pearson null. It is numerically similar
to the lr statistic for nonsparse tables. For sparse tables, it can be overly conservative. If there is
a zero cell, it cannot be computed.
wald statistic is the adjusted Pearson Wald F statistic. It has good properties under the null for
nonsparse tables. It can be erratic for sparse $2 \times 2$ tables and some sparse large tables. The pearson
statistic often appears to be more powerful.
wald noadjust is the unadjusted Pearson Wald F statistic. It can be extremely anticonservative
under the null when the table degrees of freedom (number of rows minus one times the number of
columns minus one) approaches the variance degrees of freedom (number of sampled PSUs minus
the number of strata). It is the same as the adjusted wald statistic for $2 \times 2$ tables. It is similar
to the adjusted wald statistic for small tables, large variance degrees of freedom, or both.
llwald statistic is the adjusted log-linear Wald F statistic. It can be anticonservative for both sparse
and nonsparse tables. If there is a zero cell, it cannot be computed.
llwald noadjust statistic is the unadjusted log-linear Wald F statistic. Like wald noadjust, it
can be extremely anticonservative under the null when the table degrees of freedom approaches
the variance degrees of freedom. It also suffers from the same general anticonservatism of the
llwald statistic. If there is a zero cell, it cannot be computed.
Stored results
In addition to the results documented in [SVY] svy, svy: tabulate stores the following in e():
Scalars
    e(r)            number of rows
    e(c)            number of columns
    e(cvgdeff)      c.v. of generalized DEFF eigenvalues
    e(mgdeff)       mean generalized DEFF
    e(total)        weighted sum of tab() variable
    e(F_Pear)       default-corrected Pearson F
    e(df1_Pear)     numerator d.f. for e(F_Pear)
    e(df2_Pear)     denominator d.f. for e(F_Pear)
    e(p_Pear)       p-value for e(F_Pear)
    e(cun_Pear)     uncorrected Pearson $\chi^2$
    e(F_Penl)       null-corrected Pearson F
    e(df1_Penl)     numerator d.f. for e(F_Penl)
    e(df2_Penl)     denominator d.f. for e(F_Penl)
    e(p_Penl)       p-value for e(F_Penl)
    e(cun_Penl)     null variant uncorrected Pearson $\chi^2$
    e(F_LR)         default-corrected likelihood-ratio F
    e(df1_LR)       numerator d.f. for e(F_LR)
    e(df2_LR)       denominator d.f. for e(F_LR)
    e(p_LR)         p-value for e(F_LR)
    e(cun_LR)       uncorrected likelihood-ratio $\chi^2$
    e(F_LRnl)       null-corrected likelihood-ratio F
    e(df1_LRnl)     numerator d.f. for e(F_LRnl)
    e(df2_LRnl)     denominator d.f. for e(F_LRnl)
    e(p_LRnl)       p-value for e(F_LRnl)
    e(cun_LRln)     null variant uncorrected likelihood-ratio $\chi^2$
    e(F_Wald)       adjusted Pearson Wald F
    e(p_Wald)       p-value for e(F_Wald)
    e(Fun_Wald)     unadjusted Pearson Wald F
    e(pun_Wald)     p-value for e(Fun_Wald)
    e(cun_Wald)     unadjusted Pearson Wald $\chi^2$
    e(F_LLW)        adjusted log-linear Wald F
    e(p_LLW)        p-value for e(F_LLW)
    e(Fun_LLW)      unadjusted log-linear Wald F
    e(pun_LLW)      p-value for e(Fun_LLW)
    e(cun_LLW)      unadjusted log-linear Wald $\chi^2$

Macros
    e(cmd)          tabulate
    e(tab)          tab() variable
    e(rowlab)       label or empty
    e(collab)       label or empty
    e(rowvlab)      row variable label
    e(colvlab)      column variable label
    e(rowvar)       varname1, the row variable
    e(colvar)       varname2, the column variable
    e(setype)       cell, count, column, or row

Matrices
    e(Prop)         matrix of cell proportions
    e(Obs)          matrix of observation counts
    e(Deff)         DEFF vector for e(setype) items
    e(Deft)         DEFT vector for e(setype) items
    e(Row)          values for row variable
    e(Col)          values for column variable
    e(V_row)        variance for row totals
    e(V_col)        variance for column totals
    e(V_srs_row)    Vsrs for row totals
    e(V_srs_col)    Vsrs for column totals
    e(Deff_row)     DEFF for row totals
    e(Deff_col)     DEFF for column totals
    e(Deft_row)     DEFT for row totals
    e(Deft_col)     DEFT for column totals

Methods and formulas

Methods and formulas are presented under the following headings:

    The table items
    Confidence intervals
    The test statistics

See Coefficient of variation under Methods and formulas of [SVY] estat for information on the
coefficient of variation (the cv option).
The table items

For a table of $R$ rows by $C$ columns with cells indexed by $r, c$, let

$$
y_{(rc)j} =
\begin{cases}
1 & \text{if the $j$th observation of the data is in the $r,c$th cell} \\
0 & \text{otherwise}
\end{cases}
$$

where $j = 1, \ldots, m$ indexes individuals in the sample. Weighted cell counts (count option) are

$$
\hat{N}_{rc} = \sum_{j=1}^{m} w_j \, y_{(rc)j}
$$

where $w_j$ is a sampling weight. If a variable, $x_j$, is specified with the tab() option, $\hat{N}_{rc}$ becomes

$$
\hat{N}_{rc} = \sum_{j=1}^{m} w_j \, x_j \, y_{(rc)j}
$$

Let

$$
\hat{N}_{r\cdot} = \sum_{c=1}^{C} \hat{N}_{rc}, \qquad
\hat{N}_{\cdot c} = \sum_{r=1}^{R} \hat{N}_{rc}, \qquad \text{and} \qquad
\hat{N}_{\cdot\cdot} = \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{N}_{rc}
$$

Estimated cell proportions are $\hat{p}_{rc} = \hat{N}_{rc}/\hat{N}_{\cdot\cdot}$; estimated row proportions (row option) are
$\hat{p}_{\mathrm{row}\,rc} = \hat{N}_{rc}/\hat{N}_{r\cdot}$; estimated column proportions (column option) are
$\hat{p}_{\mathrm{col}\,rc} = \hat{N}_{rc}/\hat{N}_{\cdot c}$; estimated row marginals are $\hat{p}_{r\cdot} = \hat{N}_{r\cdot}/\hat{N}_{\cdot\cdot}$; and estimated column
marginals are $\hat{p}_{\cdot c} = \hat{N}_{\cdot c}/\hat{N}_{\cdot\cdot}$.

$\hat{N}_{rc}$ is a total, the proportion estimators are ratios, and their variances can be estimated using
linearization methods as outlined in [SVY] variance estimation. svy: tabulate computes the variance
estimates by using svy: mean, svy: ratio, and svy: total.
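The point estimates described in this subsection can be illustrated with a small numeric sketch. The Python code below is not Stata code and is not how svy: tabulate is implemented; it only accumulates weighted cell counts and the resulting proportions for made-up data. The function name weighted_table and the toy inputs are assumptions introduced for illustration, and no variance estimation is attempted.

```python
def weighted_table(rows, cols, weights, R, C):
    """Weighted cell counts and proportions for 0-based row/column indices."""
    N = [[0.0] * C for _ in range(R)]
    for r, c, w in zip(rows, cols, weights):
        N[r][c] += w                       # N_rc = sum over j of w_j * y_(rc)j
    total = sum(sum(row) for row in N)     # grand total N_..
    p = [[N[r][c] / total for c in range(C)] for r in range(R)]
    p_row = [sum(p[r]) for r in range(R)]                        # row marginals
    p_col = [sum(p[r][c] for r in range(R)) for c in range(C)]   # column marginals
    return N, p, p_row, p_col


# Hypothetical sample of five observations in a 2 x 2 table.
rows = [0, 0, 1, 1, 1]
cols = [0, 1, 0, 1, 1]
weights = [2.0, 1.0, 3.0, 2.0, 2.0]
N, p, p_row, p_col = weighted_table(rows, cols, weights, 2, 2)
print(N)   # weighted cell counts
print(p)   # estimated cell proportions
```

With a tab() variable, each weight would be multiplied by the corresponding $x_j$ before accumulation.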
Confidence intervals for proportions are calculated using a logit transform so that the endpoints
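lie between 0 and 1. Let $\hat{p}$ be an estimated proportion and $\hat{s}$ be an estimate of its standard error. Let

$$
f(\hat{p}) = \ln\left( \frac{\hat{p}}{1 - \hat{p}} \right)
$$

be the logit transform of the proportion. In this metric, an estimate of the standard error is

$$
\widehat{\mathrm{SE}}\{f(\hat{p})\} = f'(\hat{p})\,\hat{s} = \frac{\hat{s}}{\hat{p}(1 - \hat{p})}
$$

Thus a $100(1 - \alpha)\%$ confidence interval in this metric is

$$
\ln\left( \frac{\hat{p}}{1 - \hat{p}} \right) \pm \frac{t_{1-\alpha/2,\nu}\,\hat{s}}{\hat{p}(1 - \hat{p})}
$$

where $t_{1-\alpha/2,\nu}$ is the $(1 - \alpha/2)$th quantile of Student's t distribution with $\nu$ degrees of freedom.
The endpoints of this confidence interval are transformed back to the proportion metric by using the
inverse of the logit transform

$$
f^{-1}(y) = \frac{e^{y}}{1 + e^{y}}
$$

Hence, the displayed confidence intervals for proportions are

$$
f^{-1}\left\{ \ln\left( \frac{\hat{p}}{1 - \hat{p}} \right) \pm \frac{t_{1-\alpha/2,\nu}\,\hat{s}}{\hat{p}(1 - \hat{p})} \right\}
$$
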
Confidence intervals for weighted counts are untransformed and are identical to the intervals produced
by svy: total.
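The logit-transformed interval for a proportion can be sketched numerically as follows. This Python fragment is only an illustration of the formulas in this subsection, not Stata's code. The function name logit_ci is hypothetical, and the t critical value is supplied by the caller because the Python standard library has no Student's t quantile function; the inputs in the example are made up.

```python
import math


def logit_ci(p_hat, se, t_crit):
    """Logit-transformed confidence interval for a proportion in (0, 1).

    t_crit is the caller-supplied critical value t_{1-alpha/2, nu}.
    """
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    center = math.log(p_hat / (1.0 - p_hat))          # f(p) = ln{p/(1-p)}
    half = t_crit * se / (p_hat * (1.0 - p_hat))      # t * SE{f(p)}

    def inv_logit(y):
        return 1.0 / (1.0 + math.exp(-y))             # equals e^y / (1 + e^y)

    return inv_logit(center - half), inv_logit(center + half)


# Hypothetical inputs: p = 0.10, SE = 0.03, and 2.045 as an approximate
# 97.5th percentile of Student's t with 29 degrees of freedom.
lo, hi = logit_ci(0.10, 0.03, 2.045)
print(lo, hi)
```

The interval is asymmetric around the estimate and always stays inside (0, 1), which is the reason the transform is used.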
The test statistics

The uncorrected Pearson $\chi^2$ statistic is

$$
X_P^2 = m \sum_{r=1}^{R} \sum_{c=1}^{C} (\hat{p}_{rc} - \hat{p}_{0rc})^2 / \hat{p}_{0rc}
$$

and the uncorrected likelihood-ratio $\chi^2$ statistic is

$$
X_{LR}^2 = 2m \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{p}_{rc} \ln\left( \hat{p}_{rc} / \hat{p}_{0rc} \right)
$$

where $m$ is the total number of sampled individuals, $\hat{p}_{rc}$ is the estimated proportion for the cell in the
$r$th row and $c$th column of the table as defined earlier, and $\hat{p}_{0rc}$ is the estimated proportion under the
null hypothesis of independence; that is, $\hat{p}_{0rc} = \hat{p}_{r\cdot}\,\hat{p}_{\cdot c}$, the product of the row and column marginals.
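As a numeric illustration of the two uncorrected statistics, the following Python sketch evaluates them for a table of estimated cell proportions. It is not Stata's implementation; the function name uncorrected_stats and the example table are assumptions, and the sketch assumes that no row or column marginal is zero. The likelihood-ratio statistic is returned as NaN when the table has a zero cell, mirroring the statement that it cannot be computed in that case.

```python
import math


def uncorrected_stats(p, m):
    """Return (Pearson X2, likelihood-ratio X2) for cell proportions p."""
    R, C = len(p), len(p[0])
    p_row = [sum(p[r]) for r in range(R)]
    p_col = [sum(p[r][c] for r in range(R)) for c in range(C)]
    has_zero = any(v == 0.0 for row in p for v in row)
    pearson = 0.0
    lr = 0.0
    for r in range(R):
        for c in range(C):
            p0 = p_row[r] * p_col[c]              # proportion under independence
            pearson += (p[r][c] - p0) ** 2 / p0
            if not has_zero:
                lr += p[r][c] * math.log(p[r][c] / p0)
    x2_p = m * pearson
    x2_lr = float("nan") if has_zero else 2.0 * m * lr
    return x2_p, x2_lr


# Hypothetical 2 x 2 table of estimated proportions from m = 100 individuals.
x2_p, x2_lr = uncorrected_stats([[0.2, 0.1], [0.3, 0.4]], 100)
print(x2_p, x2_lr)
```

These are the i.i.d. statistics only; the survey-design corrections described in the rest of this subsection are not applied here.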
Rao and Scott (1981, 1984) show that, asymptotically, $X_P^2$ and $X_{LR}^2$ are distributed as

$$
X^2 \sim \sum_{k=1}^{(R-1)(C-1)} \delta_k W_k \tag{1}
$$

where the $W_k$ are independent $\chi^2_1$ variables and the $\delta_k$ are the eigenvalues of

$$
\Delta = (\widetilde{X}_2' V_{\mathrm{srs}} \widetilde{X}_2)^{-1} (\widetilde{X}_2' V \widetilde{X}_2) \tag{2}
$$

where $V$ is the variance of the $\hat{p}_{rc}$ under the survey design and $V_{\mathrm{srs}}$ is the variance of the $\hat{p}_{rc}$ that
you would have if the design were simple random sampling; namely, $V_{\mathrm{srs}}$ has diagonal elements
$p_{rc}(1 - p_{rc})/m$ and off-diagonal elements $-p_{rc}\,p_{st}/m$.

$\widetilde{X}_2$ is calculated as follows. Rao and Scott do their development in a log-linear modeling context,
so consider $[\,1 \mid X_1 \mid X_2\,]$ as predictors for the cell counts of the $R \times C$ table in a log-linear model.
The $X_1$ matrix of dimension $RC \times (R + C - 2)$ contains the $R - 1$ main effects for the rows
and the $C - 1$ main effects for the columns. The $X_2$ matrix of dimension $RC \times (R - 1)(C - 1)$
contains the row and column interactions. Hence, fitting $[\,1 \mid X_1 \mid X_2\,]$ gives the fully saturated
model (that is, fits the observed values perfectly) and $[\,1 \mid X_1\,]$ gives the independence model. The
$\widetilde{X}_2$ matrix is the projection of $X_2$ onto the orthogonal complement of the space spanned by the
columns of $X_1$, where the orthogonality is defined with respect to $V_{\mathrm{srs}}$; that is, $\widetilde{X}_2' V_{\mathrm{srs}} X_1 = 0$.
See Rao and Scott (1984) for the proof justifying (1) and (2). However, even without a full
understanding, you can get a feeling for $\Delta$. It is like a ratio (although remember that it is a matrix) of
two variances. The variance in the numerator involves the variance under the true survey design, and
the variance in the denominator involves the variance assuming that the design was simple random
sampling. The design effect DEFF for an estimated proportion (see [SVY] estat) is defined as

$$
\mathrm{DEFF} = \frac{\widehat{V}(\hat{p}_{rc})}{\widetilde{V}_{\mathrm{srsor}}(\tilde{p}_{rc})}
$$

Hence, $\Delta$ can be regarded as a design-effects matrix, and Rao and Scott call its eigenvalues, the $\delta_k$s,
the generalized design effects.
Computing an estimate for $\Delta$ by using estimates for $V$ and $V_{\mathrm{srs}}$ is easy. Rao and Scott (1984)
derive a simpler formula for $\widehat{\Delta}$:

$$
\widehat{\Delta} = \left( C' D_{\hat{p}}^{-1} \widehat{V}_{\mathrm{srs}} D_{\hat{p}}^{-1} C \right)^{-1} \left( C' D_{\hat{p}}^{-1} \widehat{V} D_{\hat{p}}^{-1} C \right)
$$

Here $C$ is a contrast matrix that is any $RC \times (R - 1)(C - 1)$ full-rank matrix orthogonal to $[\,1 \mid X_1\,]$;
that is, $C'1 = 0$ and $C'X_1 = 0$. $D_{\hat{p}}$ is a diagonal matrix with the estimated proportions $\hat{p}_{rc}$ on the
diagonal. When one of the $\hat{p}_{rc}$ is zero, the corresponding variance estimate is also zero; hence, the
corresponding element for $D_{\hat{p}}^{-1}$ is immaterial for computing $\widehat{\Delta}$.
Unfortunately, (1) is not practical for computing a p-value. However, you can compute simple
first-order and second-order corrections based on it. A first-order correction is based on downweighting
the i.i.d. statistics by the average eigenvalue of $\widehat{\Delta}$; namely, you compute

$$
X_P^2(\hat{\delta}_\cdot) = X_P^2 / \hat{\delta}_\cdot \qquad \text{and} \qquad X_{LR}^2(\hat{\delta}_\cdot) = X_{LR}^2 / \hat{\delta}_\cdot
$$

where $\hat{\delta}_\cdot$ is the mean generalized DEFF

$$
\hat{\delta}_\cdot = \frac{1}{(R - 1)(C - 1)} \sum_{k=1}^{(R-1)(C-1)} \hat{\delta}_k
$$

These corrected statistics are asymptotically distributed as $\chi^2_{(R-1)(C-1)}$. Thus, to first-order, you can
view the i.i.d. statistics $X_P^2$ and $X_{LR}^2$ as being "too big" by a factor of $\hat{\delta}_\cdot$ for true survey design.
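A better second-order correction can be obtained by using the Satterthwaite approximation to the
distribution of a weighted sum of $\chi^2_1$ variables. Here the Pearson statistic becomes

$$
X_P^2(\hat{\delta}_\cdot, \hat{a}) = \frac{X_P^2}{\hat{\delta}_\cdot (\hat{a}^2 + 1)} \tag{3}
$$

where $\hat{a}$ is the coefficient of variation of the eigenvalues:

$$
\hat{a}^2 = \frac{\sum \hat{\delta}_k^2}{(R - 1)(C - 1)\,\hat{\delta}_\cdot^2} - 1
$$

Because $\sum \hat{\delta}_k = \mathrm{tr}\,\widehat{\Delta}$ and $\sum \hat{\delta}_k^2 = \mathrm{tr}\,\widehat{\Delta}^2$, (3) can be written in an easily computable form as

$$
X_P^2(\hat{\delta}_\cdot, \hat{a}) = \frac{\mathrm{tr}\,\widehat{\Delta}}{\mathrm{tr}\,\widehat{\Delta}^2}\, X_P^2
$$

These corrected statistics are asymptotically distributed as $\chi^2_d$, with

$$
d = \frac{(R - 1)(C - 1)}{\hat{a}^2 + 1} = \frac{(\mathrm{tr}\,\widehat{\Delta})^2}{\mathrm{tr}\,\widehat{\Delta}^2}
$$

that is, a $\chi^2$ with, in general, noninteger degrees of freedom. The likelihood-ratio statistic $X_{LR}^2$ can
also be given this second-order correction in an identical manner.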
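The arithmetic of the first-order and second-order corrections can be checked with a short Python sketch. It takes the generalized design effects (the eigenvalues) as given rather than computing them from survey data, so it illustrates only the algebra, not what svy: tabulate does internally. The function name second_order_correction and the example eigenvalues are made up for illustration.

```python
def second_order_correction(x2, deltas):
    """Apply the second-order correction to an uncorrected chi-squared statistic.

    deltas holds the (R-1)(C-1) estimated generalized design effects.
    Returns (mean DEFF, squared c.v., corrected statistic, degrees of freedom d).
    """
    d0 = len(deltas)                          # (R-1)(C-1)
    tr = sum(deltas)                          # sum of eigenvalues = tr(Delta)
    tr2 = sum(d * d for d in deltas)          # sum of squares = tr(Delta^2)
    delta_bar = tr / d0                       # mean generalized DEFF
    a2 = tr2 / (d0 * delta_bar ** 2) - 1.0    # squared c.v. of the eigenvalues
    x2_corr = x2 / (delta_bar * (a2 + 1.0))   # equation (3)
    d = d0 / (a2 + 1.0)                       # noninteger degrees of freedom
    return delta_bar, a2, x2_corr, d


# Hypothetical values: X2 = 12 and three generalized design effects.
x2 = 12.0
deltas = [1.0, 2.0, 3.0]
delta_bar, a2, x2_corr, d = second_order_correction(x2, deltas)

# The trace forms give the same numbers.
tr = sum(deltas)
tr2 = sum(v * v for v in deltas)
print(x2_corr, x2 * tr / tr2)
print(d, tr * tr / tr2)
```

When all eigenvalues are equal, the squared coefficient of variation is zero and the second-order correction reduces to the first-order one.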
Two issues remain. First, there are two possible ways to compute the variance estimate $\widehat{V}_{\mathrm{srs}}$,
which is used to compute $\widehat{\Delta}$. $V_{\mathrm{srs}}$ has diagonal elements $p_{rc}(1 - p_{rc})/m$ and off-diagonal elements
$-p_{rc}\,p_{st}/m$, but here $p_{rc}$ is the true, not estimated, proportion. Hence, the question is what to use
to estimate $p_{rc}$: the observed proportions, $\hat{p}_{rc}$, or the proportions estimated under the null hypothesis
of independence, $\hat{p}_{0rc} = \hat{p}_{r\cdot}\,\hat{p}_{\cdot c}$? Rao and Scott (1984, 53) leave this as an open question.
Because of the question of using $\hat{p}_{rc}$ or $\hat{p}_{0rc}$ to compute $\widehat{V}_{\mathrm{srs}}$, svy: tabulate can compute both
corrections. By default, when the null option is not specified, only the correction based on $\hat{p}_{rc}$ is
displayed. If null is specified, two corrected statistics and corresponding p-values are displayed, one
computed using $\hat{p}_{rc}$ and the other using $\hat{p}_{0rc}$.
The second outstanding issue concerns the degrees of freedom resulting from the variance estimate,
$\widehat{V}$, of the cell proportions under the survey design. The customary degrees of freedom for t statistics
resulting from this variance estimate is $\nu = n - L$, where $n$ is the number of PSUs in the sample
and $L$ is the number of strata.

Rao and Thomas (1989) suggest turning the corrected $\chi^2$ statistic into an F statistic by dividing
it by its degrees of freedom, $d_0 = (R - 1)(C - 1)$. The F statistic is then taken to have numerator
degrees of freedom equal to $d_0$ and denominator degrees of freedom equal to $\nu d_0$. Hence, the corrected
Pearson F statistic is

$$
F_P = \frac{X_P^2}{\mathrm{tr}\,\widehat{\Delta}} \quad \text{with} \quad F_P \sim F(d, \nu d) \quad \text{where} \quad d = \frac{(\mathrm{tr}\,\widehat{\Delta})^2}{\mathrm{tr}\,\widehat{\Delta}^2} \quad \text{and} \quad \nu = n - L \tag{4}
$$

This is the corrected statistic that svy: tabulate displays by default or when the pearson option
is specified. When the lr option is specified, an identical correction is produced for the likelihood-ratio
statistic $X_{LR}^2$. When null is specified, (4) is also used. For the statistic labeled "D-B (null)", $\widehat{\Delta}$ is
computed using $\hat{p}_{0rc}$. For the statistic labeled "Design-based", $\widehat{\Delta}$ is computed using $\hat{p}_{rc}$.
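A brief Python sketch of the corrected F statistic and its degrees of freedom follows. As before, it is a numeric illustration under stated assumptions rather than Stata's implementation: the eigenvalues are taken as given, the name corrected_f and the inputs are hypothetical, and no p-value is computed because the Python standard library does not provide the F distribution.

```python
def corrected_f(x2, deltas, n_psu, n_strata):
    """Corrected F statistic with numerator and denominator degrees of freedom."""
    tr = sum(deltas)                     # tr(Delta)
    tr2 = sum(v * v for v in deltas)     # tr(Delta^2)
    d = tr * tr / tr2                    # numerator degrees of freedom
    nu = n_psu - n_strata                # variance degrees of freedom, n - L
    return x2 / tr, d, nu * d


# Hypothetical inputs: X2 = 12, three eigenvalues, 62 PSUs in 31 strata.
F, df1, df2 = corrected_f(12.0, [1.0, 2.0, 3.0], 62, 31)
print(F, df1, df2)
```

Dividing the second-order corrected chi-squared statistic by d gives the same F value, which is why only the trace of the estimated design-effects matrix appears in the denominator.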
The Wald statistics computed by svy: tabulate with the wald and llwald options were developed
by Koch, Freeman, and Freeman (1975). The statistic given by the wald option is similar to the
Pearson statistic because it is based on

$$
\widehat{Y}_{rc} = \hat{N}_{rc} - \hat{N}_{r\cdot}\,\hat{N}_{\cdot c} / \hat{N}_{\cdot\cdot}
$$

where $r = 1, \ldots, R - 1$ and $c = 1, \ldots, C - 1$. The delta method can be used to estimate the
variance of $\widehat{Y}$ (which is $\widehat{Y}_{rc}$ stacked into a vector), and a Wald statistic can be constructed in the
usual manner:

$$
W = \widehat{Y}' \left\{ J_N \widehat{V}(\hat{N}) J_N' \right\}^{-1} \widehat{Y} \qquad \text{where} \qquad J_N = \partial \widehat{Y} / \partial \hat{N}'
$$

The statistic given by the llwald option is based on the log-linear model with predictors $[\,1 \mid X_1 \mid X_2\,]$
that was mentioned earlier. This Wald statistic is

$$
W_{LL} = \left( X_2' \ln \hat{p} \right)' \left\{ X_2' J_p \widehat{V}(\hat{p}) J_p' X_2 \right\}^{-1} \left( X_2' \ln \hat{p} \right)
$$

where $J_p$ is the matrix of first derivatives of $\ln \hat{p}$ with respect to $\hat{p}$, which is, of course, just a matrix
with $\hat{p}_{rc}^{-1}$ on the diagonal and zero elsewhere. This log-linear Wald statistic is undefined when there
is a zero cell in the table.
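Unadjusted F statistics (noadjust option) are produced using

$$
F_{\mathrm{unadj}} = W / d_0 \qquad \text{with} \qquad F_{\mathrm{unadj}} \sim F(d_0, \nu)
$$

Adjusted F statistics are produced using

$$
F_{\mathrm{adj}} = (\nu - d_0 + 1)\,W / (\nu d_0) \qquad \text{with} \qquad F_{\mathrm{adj}} \sim F(d_0, \nu - d_0 + 1)
$$
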
The other svy estimators also use this adjustment procedure for F statistics. See Korn and
Graubard (1990) for a justification of the procedure.
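The relationship between the unadjusted and adjusted F statistics can be seen in a small Python sketch. The Wald statistic W is taken as given; the function name wald_f and the numbers are assumptions for illustration only, and this is not Stata's code.

```python
def wald_f(W, R, C, nu, adjust=True):
    """Convert a Wald chi-squared statistic W into an F statistic.

    Returns (F, numerator df, denominator df), where nu is the variance
    degrees of freedom (number of sampled PSUs minus number of strata).
    """
    d0 = (R - 1) * (C - 1)                  # table degrees of freedom
    if not adjust:
        return W / d0, d0, nu
    df2 = nu - d0 + 1
    if df2 <= 0:
        raise ValueError("adjusted F requires nu - d0 + 1 > 0")
    return df2 * W / (nu * d0), d0, df2


# Hypothetical values: W = 10 for a 3 x 3 table with nu = 31.
print(wald_f(10.0, 3, 3, 31, adjust=False))   # unadjusted
print(wald_f(10.0, 3, 3, 31, adjust=True))    # adjusted

# For a 2 x 2 table, d0 = 1 and the two versions coincide.
print(wald_f(10.0, 2, 2, 31, adjust=False) == wald_f(10.0, 2, 2, 31, adjust=True))
```

The adjustment shrinks the statistic and the denominator degrees of freedom more as the table degrees of freedom approach the variance degrees of freedom, which is the situation in which the unadjusted statistics are described above as extremely anticonservative.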
References
Fuller, W. A., W. J. Kennedy, Jr., D. Schnell, G. Sullivan, and H. J. Park. 1986. PC CARP. Software package. Ames,
IA: Statistical Laboratory, Iowa State University.
Jann, B. 2008. Multinomial goodness-of-fit: Large-sample tests with survey design correction and exact tests for small
samples. Stata Journal 8: 147-169.
Koch, G. G., D. H. Freeman, Jr., and J. L. Freeman. 1975. Strategies in the multivariate analysis of data from
complex surveys. International Statistical Review 43: 59-78.
Rao, J. N. K., and A. J. Scott. 1981. The analysis of categorical data from complex sample surveys: Chi-squared
tests for goodness of fit and independence in two-way tables. Journal of the American Statistical Association 76:
221-230.
Rao, J. N. K., and A. J. Scott. 1984. On chi-squared tests for multiway contingency tables with cell proportions
estimated from survey data. Annals of Statistics 12: 46-60.
Rao, J. N. K., and D. R. Thomas. 1989. Chi-squared tests for contingency tables. In Analysis of Complex Surveys,
ed. C. J. Skinner, D. Holt, and T. M. F. Smith, 89-114. New York: Wiley.
Research Triangle Institute. 1997. SUDAAN User's Manual, Release 7.5. Research Triangle Park, NC: Research
Triangle Institute.
Sribney, W. M. 1998. svy7: Two-way contingency tables for survey or clustered data. Stata Technical Bulletin 45:
33-49. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 297-322. College Station, TX: Stata Press.
Thomas, D. R., and J. N. K. Rao. 1987. Small-sample comparisons of level and power for simple goodness-of-fit
statistics under cluster sampling. Journal of the American Statistical Association 82: 630-636.
Also see
[R] tabulate twoway - Two-way table of frequencies
[R] test - Test linear hypotheses after estimation
[SVY] svy: tabulate oneway - One-way tables for survey data

Title
svydescribe - Describe survey data
Syntax
Menu
References
Description
Also see
Options
Syntax
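    svydescribe [varlist] [if] [in] [, options]

options              Description
----------------------------------------------------------------------------
Main
  stage(#)           sampling stage to describe; default is stage(1)
  finalstage         display information per sampling unit in the final stage
  single             display only the strata with one sampling unit
  generate(newvar)   generate a variable identifying strata with one sampling unit
----------------------------------------------------------------------------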
svydescribe requires that the survey design variables be identified using svyset; see [SVY] svyset.
Menu
Statistics
>
>
Setup and utilities
>
Description
svydescribe displays a table that describes the strata and the sampling units for a given sampling
stage in a survey dataset.
Options
Main
stage(#) specifies the sampling stage to describe. The default is stage(1).

finalstage specifies that results be displayed for each sampling unit in the final sampling stage;
that is, a separate line of output is produced for every sampling unit in the final sampling stage.
This option is not allowed with stage(), single, or generate().
single specifies that only the strata containing one sampling unit be displayed in the table.
generate(newvar) stores a variable that identifies strata containing one sampling unit for a given
sampling stage.

Survey datasets are typically the result of a stratified survey design with cluster sampling in one
or more stages. Within a stratum for a given sampling stage, there are sampling units, which may be
either clusters of observations or individual observations.
svydescribe displays a table that describes the strata and sampling units for a given sampling
stage. One row of the table is produced for each stratum. Each row contains the number of sampling
units, the range and mean of the number of observations per sampling unit, and the total number
of observations. If the finalstage option is specified, one row of the table is produced for each
sampling unit of the final stage. Here each row contains the number of observations for the respective
sampling unit.
If a varlist is specified, svydescribe reports the number of sampling units that contain at least
one observation with complete data (that is, no missing values) for all variables in varlist. These are
the sampling units that would be used to compute point estimates by using the variables in varlist
with a given svy estimation command.
Example 1: Strata with one sampling unit

We use data from the Second National Health and Nutrition Examination Survey (NHANES II)
(McDowell et al. 1981) as our example. First, we set the PSU, pweight, and strata variables.
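. use http://www.stata-press.com/data/r13/nhanes2b
. svyset psuid [pweight=finalwgt], strata(stratid)
      pweight: finalwgt
          VCE: linearized
     Strata 1: stratid
         SU 1: psuid
        FPC 1: <zero>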
svydescribe will display the strata and PSU arrangement of the dataset.
. svydescribe
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
                                       #Obs per Unit
 Stratum    #Units     #Obs       min      mean       max
       1         2      380       165     190.0       215
       2         2      185        67      92.5       118
       3         2      348       149     174.0       199
 (output omitted)
      17         2      393       180     196.5       213
      18         2      359       144     179.5       215
      20         2      285       125     142.5       160
      21         2      214       102     107.0       112
 (output omitted)
      31         2      308       143     154.0       165
      32         2      450       211     225.0       239

      31        62    10351        67     167.0       288
Our NHANES II dataset has 31 strata (stratum 19 is missing) and two PSUs per stratum.
The hdresult variable contains serum levels of high-density lipoprotein (HDL). If we try to
estimate the mean of hdresult, we get a missing value for the standard-error estimate and a note
explaining why.
. svy: mean hdresult
Number of strata =      31        Number of obs    =       8720
Number of PSUs   =      60        Population size  =   98725345
                                  Design df        =         29

                            Linearized
                     Mean    Std. Err.
    hdresult     49.67141            .

Note: missing standard error because of stratum with single sampling unit.
Running svydescribe with hdresult and the single option will show which strata have only one
PSU.
. svydescribe hdresult, single
Survey: Describing strata with a single sampling unit in stage 1
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
                               #Obs with  #Obs with     #Obs per included Unit
            #Units    #Units    complete    missing
 Stratum  included   omitted        data       data      min     mean      max
       1        1*         1         114        266      114    114.0      114
       2        1*         1          98         87       98     98.0       98
Both stratid = 1 and stratid = 2 have only one PSU with nonmissing values of hdresult.
Because this dataset has only 62 PSUs, the finalstage option produces a manageable amount of
output:
. svydes hdresult, finalstage

Survey: Describing final stage sampling units
pweight: finalwgt
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: <zero>
                     #Obs with  #Obs with
                      complete    missing
 Stratum      Unit        data       data
       1         1           0        215
       1         2         114         51
       2         1          98         20
       2         2           0         67
 (output omitted)
      32         2         203          8

      31        62        8720       1631
                                    10351
It is rather striking that there are two PSUs with no values for hdresult. All other PSUs have only
a moderate number of missing values. Obviously, here a data analyst should first try to ascertain why
these data are missing. The answer here (C. L. Johnson, 1995, pers. comm.) is that HDL measurements
could not be collected until the third survey location. Thus there are no hdresult data for the first
two locations: stratid = 1, psuid = 1 and stratid = 2, psuid = 2.
Assuming that we wish to go ahead and analyze the hdresult data, we must collapse strata (that
is, merge them) so that every stratum has at least two PSUs with some nonmissing values. We can
accomplish this by collapsing stratid = 1 into stratid = 2. To perform the stratum collapse, we
create a new strata identifier, newstr, and a new PSU identifier, newpsu.
. gen newstr = stratid
. gen newpsu = psuid
. replace newpsu = psuid + 2 if stratid == 1
(380 real changes made)
. replace newstr = 2 if stratid == 1
(380 real changes made)
svyset the new PSU and strata variables.

. svyset newpsu [pweight=finalwgt], strata(newstr)
pweight: finalwgt
VCE: linearized
Strata 1: newstr
SU 1: newpsu
FPC 1: <zero>
Then use svydescribe to check what we have done.

. svydes hdresult, finalstage
Survey: Describing final stage sampling units
pweight: finalwgt
VCE: linearized
Strata 1: newstr
SU 1: newpsu
FPC 1: <zero>
                     #Obs with  #Obs with
                      complete    missing
 Stratum      Unit        data       data
       2         1          98         20
       2         2           0         67
       2         3           0        215
       2         4         114         51
       3         1         161         38
       3         2         116         33
 (output omitted)
      32         1         180         59
      32         2         203          8

      30        62        8720       1631
                                    10351
The new stratum, newstr = 2, has four PSUs, two of which contain some nonmissing values of
hdresult. This is sufficient to allow us to estimate the mean of hdresult and get a nonmissing
standard-error estimate.
. svy: mean hdresult
Number of strata =      30        Number of obs    =       8720
Number of PSUs   =      60        Population size  =   98725345
                                  Design df        =         30

                            Linearized
                     Mean    Std. Err.     [95% Conf. Interval]
    hdresult     49.67141     .3830147      48.88919    50.45364
Example 2: Using e(sample) to find strata with one sampling unit

Some estimation commands drop observations from the estimation sample when they encounter
collinear predictors or perfect predictors. Ascertaining which strata contain one sampling unit is
therefore difficult. We can then use if e(sample) instead of varlist when faced with the problem
of strata with one sampling unit. We revisit the previous analysis to illustrate.
. use http://www.stata-press.com/data/r13/nhanes2b, clear

. svy: mean hdresult
Number of strata =      31        Number of obs    =       8720
Number of PSUs   =      60        Population size  =   98725345
                                  Design df        =         29

                            Linearized
                     Mean    Std. Err.
    hdresult     49.67141            .

Note: missing standard error because of stratum with single sampling unit.
. svydes if e(sample), single
Survey: Describing strata with a single sampling unit in stage 1
      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: stratid
         SU 1: psuid
        FPC 1: <zero>

                                       #Obs per Unit
 Stratum    #Units     #Obs       min      mean       max
       1        1*      114       114     114.0       114
       2        1*       98        98      98.0        98

See Eltinge and Sribney (1996) for an earlier implementation of svydescribe.
References
Eltinge, J. L., and W. M. Sribney. 1996. svy3: Describing survey data: Sampling design and missing data. Stata
Technical Bulletin 31: 23-26. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 235-239. College Station,
TX: Stata Press.
Also see
Title
svymarkout - Mark observations for exclusion on the basis of survey characteristics
Syntax
Description
Stored results
Also see
Syntax

svymarkout markvar
Description
svymarkout is a programmer's command that resets the values of markvar to contain 0 wherever
any of the survey-characteristic variables (previously set by svyset) contain missing values.
svymarkout assumes that markvar was created by marksample or mark; see [P] mark. This
command is most helpful for developing estimation commands that use ml to fit models using
maximum pseudolikelihood directly, instead of relying on the svy prefix; see [P] program properties
for a discussion of how to write programs to be used with the svy prefix.
Example 1
program mysvyprogram, ...
...
syntax ...
marksample touse
svymarkout touse
...
end
Stored results
svymarkout stores the following in s():
Macros
s(weight)
weight variable set by svyset
Also see
[P] mark - Mark observations for inclusion
Title
svyset - Declare survey design for dataset
Syntax
Options
References
Menu
Also see
Description
Stored results
Syntax
Single-stage design

    svyset [psu] [weight] [, design_options options]

Multiple-stage design

    svyset psu [weight] [, design_options] [|| ssu, design_options] ... [options]

Clear the current settings

    svyset, clear

Report the current settings

    svyset

design_options             Description
----------------------------------------------------------------------------
Main
  strata(varname)          variable identifying strata
  fpc(varname)             finite population correction
----------------------------------------------------------------------------

options                    Description
----------------------------------------------------------------------------
Weights
  brrweight(varlist)       balanced repeated replicate (BRR) weights
  fay(#)                   Fay's adjustment
  bsrweight(varlist)       bootstrap replicate weights
  bsn(#)                   bootstrap mean-weight adjustment
  jkrweight(varlist, ...)  jackknife replicate weights
  sdrweight(varlist, ...)  successive difference replicate (SDR) weights

SE
  vce(linearized)          Taylor linearized variance estimation
  vce(bootstrap)           bootstrap variance estimation
  vce(brr)                 BRR variance estimation
  vce(jackknife)           jackknife variance estimation
  vce(sdr)                 SDR variance estimation
  dof(#)                   design degrees of freedom
  mse                      use the MSE formula with vce(bootstrap), vce(brr),
                             vce(jackknife), or vce(sdr)
  singleunit(method)       strata with a single sampling unit; method may be missing,
                             certainty, scaled, or centered

Poststratification
  poststrata(varname)      variable identifying poststrata
  postweight(varname)      poststratum population sizes

  clear                    clear all settings from the data
  noclear                  change some of the settings without clearing the others
  clear(opnames)           clear specified settings without clearing all others; opnames may be
                             one or more of weight, vce, dof, mse, bsrweight, brrweight,
                             jkrweight, sdrweight, or poststrata
----------------------------------------------------------------------------
pweights and iweights are allowed; see [U] 11.1.6 weight.

The full specification for jkrweight() is
    jkrweight(varlist [, stratum(# [# ...]) fpc(# [# ...]) multiplier(# [# ...]) reset])
The full specification for sdrweight() is
    sdrweight(varlist [, fpc(#)])
clear, noclear, and clear() are not shown in the dialog box.
Menu
Statistics
>
>
Setup and utilities
>
Description
svyset declares the data to be complex survey data, designates variables that contain information
about the survey design, and specifies the default method for variance estimation. You must svyset
your data before using any svy command; see [SVY] svy estimation.
psu is _n or the name of a variable (numeric or string) that contains identifiers for the primary
sampling units (clusters). Use _n to indicate that individuals (instead of clusters) were randomly
sampled if the design does not involve clustered sampling. In the single-stage syntax, psu is optional
and defaults to _n.
ssu is _n or the name of a variable (numeric or string) that contains identifiers for sampling
units (clusters) in subsequent stages of the survey design. Use _n to indicate that individuals were
randomly sampled within the last sampling stage.
Settings made by svyset are saved with a dataset. So, if a dataset is saved after it has been
svyset, it does not have to be set again.
The current settings are reported when svyset is called without arguments:
. svyset
Use the clear option to remove the current settings:

. svyset, clear
See [SVY] poststratification for a discussion with examples using the poststrata() and postweight() options.
Options
Main
strata(varname) specifies the name of a variable (numeric or string) that contains stratum identifiers.
fpc(varname) requests a finite population correction for the variance estimates. If varname has
values less than or equal to 1, it is interpreted as a stratum sampling rate $f_h = n_h/N_h$, where
$n_h$ = number of units sampled from stratum $h$ and $N_h$ = total number of units in the population
belonging to stratum $h$. If varname has values greater than or equal to $n_h$, it is interpreted as
containing $N_h$. It is an error for varname to have values between 1 and $n_h$ or to have a mixture
of sampling rates and stratum sizes.
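The rule for interpreting the fpc() variable can be written out as a small decision procedure. The Python sketch below restates the rule exactly as described above; it is not Stata's implementation, the function name classify_fpc is hypothetical, and it assumes one (fpc value, number of sampled units) pair per stratum. Values that the description does not cover, such as negative numbers, are not given special handling.

```python
def classify_fpc(strata):
    """Classify fpc values as sampling rates or stratum sizes.

    strata is a list of (fpc_value, n_h) pairs, one per stratum.
    Returns "rate" or "size"; raises ValueError for the error cases.
    """
    if not strata:
        raise ValueError("no strata supplied")
    kinds = set()
    for fpc, n_h in strata:
        if fpc <= 1:
            kinds.add("rate")      # sampling rate f_h = n_h / N_h
        elif fpc >= n_h:
            kinds.add("size")      # stratum population size N_h
        else:
            raise ValueError("fpc value between 1 and n_h")
    if len(kinds) > 1:
        raise ValueError("mixture of sampling rates and stratum sizes")
    return kinds.pop()


print(classify_fpc([(0.10, 2), (0.25, 2)]))   # sampling rates
print(classify_fpc([(500, 2), (40, 3)]))      # stratum sizes
```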
Weights
brrweight(varlist) specifies the replicate-weight variables to be used with vce(brr) or with svy
brr.
fay(#) specifies Fay's adjustment (Judkins 1990). The value specified in fay(#) is used to adjust
the BRR weights and is present in the BRR variance formulas.
The sampling weight of the selected PSUs for a given replicate is multiplied by 2 - #, and the
sampling weight for the unselected PSUs is multiplied by #. When brrweight(varlist) is specified,
the replicate-weight variables in varlist are assumed to be adjusted using #.
fay(0) is the default and is equivalent to the original BRR method. # must be between 0 and 2,
inclusive, and excluding 1. fay(1) is not allowed because this results in unadjusted weights.
bsrweight(varlist) specifies the replicate-weight variables to be used with vce(bootstrap) or
with svy bootstrap.
bsn(#) specifies that # bootstrap replicate-weight variables were used to generate each bootstrap
mean-weight variable specified in the bsrweight() option. The default is bsn(1). The value
specified in bsn(#) is used to adjust the variance estimate to account for mean bootstrap weights.
jkrweight(varlist, ...) specifies the replicate-weight variables to be used with vce(jackknife)
or with svy jackknife.
The following options set characteristics on the jackknife replicate-weight variables. If one value
is specified, all the specified jackknife replicate-weight variables will be supplied with the same
characteristic. If multiple values are specified, each replicate-weight variable will be supplied with
the corresponding value according to the order specified. These options are not shown in the dialog
box.

stratum(# [# ...]) specifies an identifier for the stratum in which the sampling weights have
been adjusted.

fpc(# [# ...]) specifies the FPC value to be added as a characteristic of the jackknife
replicate-weight variables. The values set by this suboption have the same interpretation as
the fpc(varname) option.

multiplier(# [# ...]) specifies the value of a jackknife multiplier to be added as a characteristic
of the jackknife replicate-weight variables.
reset indicates that the characteristics for the replicate-weight variables may be overwritten or
reset to the default, if they exist.
sdrweight(varlist, . . . ) specifies the replicate-weight variables to be used with vce(sdr) or with
svy sdr.
fpc(#) specifies the FPC value associated with the SDR weights. The value set by this suboption
has the same interpretation as the fpc(varname) option. This option is not shown in the dialog
box.
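The weight adjustment described for fay(#) among the options above can be illustrated with a few lines of Python. This is a sketch of the stated multiplication rule only, not of how Stata builds or uses BRR replicates; the function name fay_replicate_weights, the weights, and the selection pattern are made up.

```python
def fay_replicate_weights(weights, selected, fay=0.0):
    """Adjust sampling weights for one BRR replicate using Fay's method.

    selected[i] is True when observation i belongs to a PSU selected for
    the replicate. Selected weights are multiplied by 2 - fay, and
    unselected weights by fay.
    """
    if not 0.0 <= fay <= 2.0 or fay == 1.0:
        raise ValueError("fay must be in [0, 2] and not equal to 1")
    return [w * (2.0 - fay) if s else w * fay
            for w, s in zip(weights, selected)]


weights = [10.0, 10.0, 20.0, 20.0]
selected = [True, False, False, True]
print(fay_replicate_weights(weights, selected))            # original BRR (fay = 0)
print(fay_replicate_weights(weights, selected, fay=0.5))   # Fay's adjustment
```

With fay = 0, unselected PSUs receive zero weight, which is the original BRR method; intermediate values keep every PSU in each replicate with a reduced or inflated weight.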
SE
vce(vcetype) specifies the default method for variance estimation; see [SVY] variance estimation.
vce(linearized) sets the default to Taylor linearization.
vce(bootstrap) sets the default to the bootstrap; also see [SVY] svy bootstrap.
vce(brr) sets the default to BRR; also see [SVY] svy brr.
vce(jackknife) sets the default to the jackknife; also see [SVY] svy jackknife.
vce(sdr) sets the default to the SDR; also see [SVY] svy sdr.
mse specifies that the MSE formula be used when vce(bootstrap), vce(brr), vce(jackknife),
or vce(sdr) is specified. This option requires vce(bootstrap), vce(brr), vce(jackknife),
or vce(sdr).
singleunit(method) specifies how to handle strata with one sampling unit.
singleunit(missing) results in missing values for the standard errors and is the default.
singleunit(certainty) causes strata with single sampling units to be treated as certainty units.
Certainty units contribute nothing to the standard error.
singleunit(scaled) results in a scaled version of singleunit(certainty). The scaling factor
comes from using the average of the variances from the strata with multiple sampling units for
each stratum with one sampling unit.
singleunit(centered) specifies that strata with one sampling unit are centered at the grand
mean instead of the stratum mean.
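The singleunit() choices above can be tried side by side. The following is a minimal sketch, assuming the stage5a example dataset and its variables su1, strata, and pw that are used later in this entry; whether that dataset actually contains strata with one sampling unit is not checked here, so the commands only illustrate the syntax.

```stata
* Sketch only: su1, strata, and pw are assumed to be the stage5a design variables
use http://www.stata-press.com/data/r13/stage5a, clear

* Treat single-unit strata as certainty units (they add nothing to the variance)
svyset su1 [pweight=pw], strata(strata) singleunit(certainty)

* Alternatively, scale by the average variance of the multiple-unit strata
svyset su1 [pweight=pw], strata(strata) singleunit(scaled)

* Typing svyset alone redisplays the current settings
svyset
```

Each call replaces the previous settings, so only the last singleunit() choice remains in effect.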
Poststratification
poststrata(varname) specifies the name of the variable (numeric or string) that contains poststratum
identifiers.
postweight(varname) specifies the name of the numeric variable that contains poststratum population
totals (or sizes), that is, the number of elementary sampling units in the population within each
poststratum.
The following options are available with svyset but are not shown in the dialog box:
clear clears all the settings from the data. Typing
. svyset, clear
clears the survey design characteristics from the data in memory. Although this option may be
specified with some of the other svyset options, it is redundant because svyset automatically
clears the previous settings before setting new survey design characteristics.
noclear allows some of the options in options to be changed without clearing all the other settings.
This option is not allowed with psu, ssu, design options, or clear.
clear(opnames) allows some of the options in options to be cleared without clearing all the other
settings. opnames refers to an option name and may be one or more of the following:
weight vce
poststrata
dof
mse
brrweight
bsrweight
jkrweight
sdrweight
This option implies the noclear option.

Introduction to survey design characteristics
Finite population correction (FPC)
Multiple-stage designs and with-replacement sampling
Replication-weight variables
Combining datasets from multiple surveys
Video example
Introduction to survey design characteristics

Stata's suite of commands for survey data analysis relies on properly identified survey design
characteristics for point estimation, model fitting, and variance estimation. In fact, the svy prefix
will report an error if no survey design characteristics have been identified using svyset. Typical
survey design characteristics include sampling weights, one or more stages of clustered sampling, and
stratification. O'Donnell et al. (2008, 26–27) show four survey sample designs with the corresponding
svyset specification. Use svyset to declare your dataset to be complex survey data by specifying
the survey design variables. We will use the following contrived dataset for the examples in this
section.
. use http://www.stata-press.com/data/r13/stage5a
Example 1: Simple random sampling with replacement

Use _n for psu to specify that the primary sampling units (PSUs) are the sampled individuals.
. svyset _n
      pweight: <none>
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>
The output from svyset states that there are no sampling weights (each observation is given a
sampling weight of 1), there is only one stratum (which is the same as no stratification), and the
PSUs are the observed individuals.
Example 2: One-stage clustered design with stratification

The most commonly specified design, one-stage clustered design with stratification, can be used to
approximate multiple-stage designs when only the first-stage information is available. In this design,
the population is partitioned into strata and the PSUs are sampled independently within each stratum.
A dataset from this design will have a variable that identifies the strata, another variable that identifies
the PSUs, and a variable containing the sampling weights. Let's assume that these variables are,
respectively, strata, su1, and pw.
. svyset su1 [pweight=pw], strata(strata)
      pweight: pw
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: su1
        FPC 1: <zero>
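After declaring a design like this one, it can help to confirm that the strata and PSUs were identified as intended. The following is a minimal sketch, assuming the stage5a dataset loaded at the start of this section; svydescribe is documented in [SVY] svydescribe, and its output is not reproduced here.

```stata
* Sketch only: assumes the stage5a example data with variables su1, strata, and pw
use http://www.stata-press.com/data/r13/stage5a, clear
svyset su1 [pweight=pw], strata(strata)

* Tabulate the number of sampled PSUs and observations within each stratum
svydescribe
```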
Example 3: Two-stage designs

In two-stage designs, the PSUs are sampled without replacement and then collections of individuals
are sampled within the selected PSUs. svyset uses || (double "or" bars) to separate the stage-specific
design specifications. The first-stage information is specified before ||, and the second-stage
information is specified afterward. We will assume that the variables containing the finite population
correction (FPC) information for the two stages are named fpc1 and fpc2; see Finite population
correction (FPC) for a discussion about the FPC.
Use _n for ssu to specify that the second-stage sampling units are the sampled individuals.
. svyset su1 [pweight=pw], fpc(fpc1) || _n, fpc(fpc2)
      pweight: pw
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: su1
        FPC 1: fpc1
     Strata 2: <one>
         SU 2: <observations>
        FPC 2: fpc2
Suppose that su2 identifies the clusters of individuals sampled in the second stage.
. svyset su1 [pweight=pw], fpc(fpc1) || su2, fpc(fpc2)
pweight: pw
VCE: linearized
Strata 1: <one>
SU 1: su1
FPC 1: fpc1
Strata 2: <one>
SU 2: su2
FPC 2: fpc2
Stratification can take place in one or both of the sampling stages. Suppose that strata identifies
the second-stage strata and the first stage was not stratified.
. svyset su1 [pweight=pw], fpc(fpc1) || su2, fpc(fpc2) strata(strata)
pweight: pw
VCE: linearized
Strata 1: <one>
SU 1: su1
FPC 1: fpc1
Strata 2: strata
SU 2: su2
FPC 2: fpc2
Example 4: Multiple-stage designs

Specifying designs with three or more stages is not much more difficult than specifying two-stage
designs. Each stage will have its own variables for identifying strata, sampling units, and the FPC.
Not all stages will be stratified and some will be sampled with replacement; thus some stages may
not have a variable for identifying strata or the FPC.
Suppose that we have a three-stage design with variables su# and fpc# for the sampling unit and
FPC information in stage #. Also assume that the design called for stratification in the first stage only.
. svyset su1 [pweight=pw], fpc(fpc1) strata(strata)
>     || su2, fpc(fpc2)
>     || su3, fpc(fpc3)
pweight: pw
VCE: linearized
Strata 1: strata
SU 1: su1
FPC 1: fpc1
Strata 2: <one>
SU 2: su2
FPC 2: fpc2
Strata 3: <one>
SU 3: su3
FPC 3: fpc3
Use _n for ssu in the last stage if the individuals are sampled within the third stage of clustered
sampling.
. svyset su1 [pweight=pw], fpc(fpc1) strata(strata)
>     || su2, fpc(fpc2)
>     || su3, fpc(fpc3)
>     || _n
      pweight: pw
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: su1
        FPC 1: fpc1
     Strata 2: <one>
         SU 2: su2
        FPC 2: fpc2
     Strata 3: <one>
         SU 3: su3
        FPC 3: fpc3
     Strata 4: <one>
         SU 4: <observations>
        FPC 4: <zero>
Finite population correction (FPC)

An FPC accounts for the reduction in variance that occurs when sampling without replacement from
a finite population compared to sampling with replacement from the same population. Specifying
an FPC variable for stage i indicates that the sampling units in that stage were sampled without
replacement. See Cochran (1977) for an introduction to variance estimation and sampling without
replacement.
Example 5
Consider the following dataset:
. use http://www.stata-press.com/data/r13/fpc
. list
       stratid   psuid   weight   nh   Nh     x
  1.         1       1        3    5   15   2.8
  2.         1       2        3    5   15   4.1
  3.         1       3        3    5   15   6.8
  4.         1       4        3    5   15   6.8
  5.         1       5        3    5   15   9.2

  6.         2       1        4    3   12   3.7
  7.         2       2        4    3   12   6.6
  8.         2       3        4    3   12   4.2
Here the variable nh is the number of PSUs per stratum that were sampled, Nh is the total number of
PSUs per stratum in the sampling frame (that is, the population), and x is our survey item of interest.
If we wish to use a finite population correction in our computations, we must svyset an FPC
variable when we specify the variables for sampling weights, PSUs, and strata. The FPC variable
typically contains the number of sampling units per stratum in the population; Nh is our FPC variable.
Here we estimate the population mean of x assuming sampling without replacement.
. svyset psuid [pweight=weight], strata(stratid) fpc(Nh)

pweight: weight
VCE: linearized
Strata 1: stratid
SU 1: psuid
FPC 1: Nh
. svy: mean x
Number of strata =       2        Number of obs   =          8
Number of PSUs   =       8        Population size =         27
                                  Design df       =          6

                           Linearized
                    Mean    Std. Err.     [95% Conf. Interval]
           x    5.448148    .6160407      3.940751    6.955545
We must respecify the survey design before estimating the population mean of x assuming sampling
with replacement.
. svyset psuid [pweight=weight], strata(stratid)
      pweight: weight
          VCE: linearized
  Single unit: missing
     Strata 1: stratid
         SU 1: psuid
        FPC 1: <zero>
. svy: mean x
Number of strata =       2        Number of obs   =          8
Number of PSUs   =       8        Population size =         27
                                  Design df       =          6

                           Linearized
                    Mean    Std. Err.     [95% Conf. Interval]
           x    5.448148    .7412683       3.63433    7.261966
Including an FPC always reduces the variance estimate. However, the reduction in the variance estimates
will be small when the $N_h$ are large relative to the $n_h$.
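The size of the reduction in Example 5 can be checked by hand. The sketch below only reuses numbers already shown in that example (the two standard errors, and nh and Nh for each stratum); it is an arithmetic illustration, not part of the svy machinery.

```stata
* Standard errors copied from the two svy: mean runs in Example 5
display (.6160407/.7412683)^2    // ratio of the variance estimates, roughly 0.69

* Stratum-level factors (1 - nh/Nh) implied by the fpc dataset
display 1 - 5/15                 // stratum 1, roughly 0.67
display 1 - 3/12                 // stratum 2, 0.75
```

The overall ratio falls between the two stratum-level factors, as one would expect when each stratum's variance contribution is scaled by its own factor.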
Rather than having a variable that represents the total number of PSUs per stratum in the sampling
frame, we sometimes have a variable that represents a sampling rate $f_h = n_h/N_h$. The syntax for
svyset is the same whether the FPC variable contains $N_h$ or $f_h$. The survey variance-estimation
routines in Stata are smart enough to identify what type of FPC information has been specified. If
the FPC variable is less than or equal to 1, it is interpreted as a sampling rate; if it is greater than
or equal to $n_h$, it is interpreted as containing $N_h$. It is an error for the FPC variable to have values
between 1 and $n_h$ or to have a mixture of sampling rates and stratum sizes.
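To illustrate the sampling-rate form of the FPC, the following sketch builds a rate variable from the nh and Nh variables in the fpc dataset of Example 5. The variable name fh is our own choice for illustration and is not part of the dataset.

```stata
* Sketch only: fh is a hypothetical new variable holding the sampling rate nh/Nh
use http://www.stata-press.com/data/r13/fpc, clear
generate double fh = nh/Nh

* Values of fh are at most 1, so svyset should treat them as sampling rates
svyset psuid [pweight=weight], strata(stratid) fpc(fh)
svy: mean x
```

Under the rules described above, this specification should lead to the same variance estimate as fpc(Nh).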
Multiple-stage designs and with-replacement sampling

Although survey data are seldom collected using with-replacement sampling, dropping the FPC
information when the sampling fractions are small is common. In either case, svyset ignores the
design variables specified in later sampling stages because this information is not necessary for
variance estimation. In the following, we describe why this is true.
Example 6
Consider the two-stage design where PSUs are sampled with replacement and individuals are
sampled without replacement within the selected PSUs. Sampling the individuals with replacement
would change some of the details in the following discussion, but the result would be the same.
Our population contains 100 PSUs, with five individuals in each, so our population size is 500.
We will sample 10 PSUs with replacement and then sample two individuals without replacement from
within each selected PSU. This results in a dataset with 10 PSUs, each with 2 observations, for a total
of 20 observations. If our dataset contained the PSU information in variable su1 and the second-stage
FPC information in variable fpc2, our svyset command would be as follows.
. use http://www.stata-press.com/data/r13/svyset_wr
. svyset su1 || _n, fpc(fpc2)
Note: stage 1 is sampled with replacement; all further stages will be ignored
pweight: <none>
VCE: linearized
Strata 1: <one>
SU 1: su1
FPC 1: <zero>
As expected, svyset tells us that it is ignoring the second-stage information because the first-stage
units were sampled with replacement. Because we do not have an FPC variable for the first stage, we
can regard the sampling of PSUs as a series of independently and identically distributed draws. The
second-sampled PSU is drawn independently from the first and has the same sampling distribution
because the first-sampled PSU is eligible to be sampled again.
Consider the following alternative scenario. Because there are 10 ways to pick two people of five,
let's expand the 100 PSUs to form 100 × 10 = 1,000 new PSUs (NPSUs), each of size 2, representing
all possible two-person groups that can be sampled from the original 100 groups of five people. We
now have a population of 1,000 × 2 = 2,000 "new" people; each original person was replicated four
times. We can select 10 NPSUs with replacement to end up with a dataset consisting of 10 groups of
two to form samples of 20 people. If our new dataset contained the PSU information in variable
nsu1, our svyset command would be as follows:
. svyset nsu1
      pweight: <none>
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: nsu1
        FPC 1: <zero>
There is nothing from a sampling standpoint to distinguish between our two scenarios. The
information contained in the variables su1 and nsu1 is equivalent; thus svyset can behave as if our
dataset came from the second scenario.
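The counts used in the second scenario can be reproduced with Stata's comb() function. This is only an arithmetic check of the numbers quoted above.

```stata
display comb(5, 2)           // two-person groups per original PSU: 10
display 100*comb(5, 2)       // number of NPSUs: 1000
display 100*comb(5, 2)*2     // "new" people in the expanded population: 2000
display comb(4, 1)           // groups containing a given person: 4
```

The last line explains why each original person is replicated four times: a given person can be paired with any one of the four other members of the same PSU.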
The following questions may spring to mind after reading the above:
The population in the first scenario has 500 people; the second has 2,000. Does that not
invalidate the comparison between the two scenarios?
Although the populations are different, the sampling schemes described for each scenario result
in the same sampling space. By construction, each possible sample from the first scenario is
also a possible sample from the second scenario. For the first scenario, the number of possible
samples of 10 of 100 PSUs sampled with replacement, where two of five individuals are sampled
without replacement, is
$$100^{10} \times \binom{5}{2}^{10} = 10^{30}$$
For the second scenario, the number of possible samples of 10 of 1,000 NPSUs sampled with
replacement, where each NPSU is sampled as a whole, is
$$1{,}000^{10} = 10^{30}$$
Does the probability of being in the sample not depend on what happens in the first sampling
stage?
Not when the first stage is sampled with replacement. Sampling with replacement means that
all PSUs have the same chance of being selected even after one of the PSUs has been selected.
Thus each of the two-person groups that can possibly be sampled has the same chance of being
sampled even after a specific two-person group has been selected.
Is it valid to have replicated people in the population like the one in the second scenario?
Yes, because each person in the population can be sampled more than once. Sampling with
replacement allows us to construct the replicated people.
Replication-weight variables
Many groups that collect survey data for public use have taken steps to protect the privacy of
the survey participants. This may result in datasets that have replicate-weight variables instead of
variables that identify the strata and sampling units from the sampling stages. These datasets require
replication methods for variance estimation.
The brrweight(), jkrweight(), bsrweight(), and sdrweight() options allow svyset to
identify the set of replication weights for use with BRR, jackknife, bootstrap, and SDR variance
estimation (svy brr, svy jackknife, svy bootstrap, and svy sdr), respectively. In addition to
the weight variables, svyset also allows you to change the default variance estimation method from
linearization to BRR, jackknife, bootstrap, or SDR.
Example 7
Here are two simple examples using jackknife replication weights.
1. Data containing only sampling weights and jackknife replication weights, and we set the default
variance estimator to the jackknife:

. use http://www.stata-press.com/data/r13/stage5a_jkw
. svyset [pweight=pw], jkrweight(jkw_*) vce(jackknife)
pweight: pw
VCE: jackknife
MSE: off
jkrweight: jkw_1 jkw_2 jkw_3 jkw_4 jkw_5 jkw_6 jkw_7 jkw_8 jkw_9
Strata 1: <one>
FPC 1: <zero>
2. Data containing only sampling weights and jackknife replication weights, and we set the default
variance estimator to the jackknife by using the MSE formula:
. svyset [pweight=pw], jkrweight(jkw_*) vce(jackknife) mse
pweight: pw
VCE: jackknife
MSE: on
jkrweight: jkw_1 jkw_2 jkw_3 jkw_4 jkw_5 jkw_6 jkw_7 jkw_8 jkw_9
Strata 1: <one>
FPC 1: <zero>
Example 8: Characteristics for jackknife replicate-weight variables

The jkrweight() option has suboptions that allow you to identify certain characteristics of the
jackknife replicate-weight variables. These characteristics include the following:
An identifier for the stratum in which the sampling weights have been adjusted because one
of its PSUs was dropped. We use the stratum() suboption to set these values. The default is
one stratum for all the replicate-weight variables.
The FPC value. We use the fpc() suboption to set these values. The default value is zero.
This characteristic is ignored when the mse option is supplied to svy jackknife.
A jackknife multiplier used in the formula for variance estimation. The multiplier for the
standard leave-one-out jackknife method is
$$\frac{n_h - 1}{n_h}$$
where $n_h$ is the number of PSUs sampled from stratum $h$. We use the multiplier() suboption
to set these values. The default is derived from the above formula, assuming that $n_h$ is equal
to the number of replicate-weight variables for stratum $h$.
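As an illustration of the multiplier() suboption, the sketch below supplies the leave-one-out multiplier explicitly for the nine replicate-weight variables of Example 7. It assumes a single stratum, so that the multiplier is 8/9; the local macro name mult is our own choice, and the reset suboption is included only in case the characteristics already exist.

```stata
* Sketch only: assumes stage5a_jkw with weights pw and replicate weights jkw_1-jkw_9
use http://www.stata-press.com/data/r13/stage5a_jkw, clear

* Leave-one-out multiplier (n - 1)/n with n = 9 replicate-weight variables
local mult = 8/9
svyset [pweight=pw], jkrweight(jkw_*, multiplier(`mult') reset) vce(jackknife)
```

With one stratum and nine replicate-weight variables, this value should coincide with the default described above, so the explicit suboption mainly documents the choice.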
Because of privacy concerns, public survey datasets may not contain stratum-specific information.
However, the population size and an overall jackknife multiplier will probably be provided. You must
then supply this information to svyset for the jackknife replicate-weight variables. We will use the
1999–2000 NHANES data to illustrate how to set these characteristics.
The NHANES datasets for years 1999–2000 are available for download from the Centers for Disease
Control and Prevention (CDC) website, http://www.cdc.gov. This particular release of the NHANES data
contains jackknife replication weights in addition to the "usual" PSU and stratum information. These
variables are contained in the demographic dataset. In our web browser, we saved the demographic data
from the CDC website ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/1999-2000/DEMO.xpt. We
suggest that you rename the data to demo.xpt.
The 1999–2000 NHANES datasets are distributed in SAS Transport format, so we use Stata's
import sasxport command to read the data into memory. Because of the nature of the survey
design, the demographic dataset demo.xpt has two sampling-weight variables. wtint2yr contains
the sampling weights appropriate for the interview data, and wtmec2yr contains the sampling weights
appropriate for the Mobile Examination Center (MEC) exam data. Consequently, there are two sets of
jackknife replicate-weight variables. The jackknife replicate-weight variables for the interview data
are named wtirep01, wtirep02, . . . , wtirep52. The jackknife replicate-weight variables for the
MEC exam data are named wtmrep01, wtmrep02, . . . , wtmrep52. The documentation published with
the NHANES data gives guidance on which weight variables to use.
. import sasxport demo.xpt
. describe wtint2yr wtmec2yr wtirep01 wtmrep01
                storage   display    value
variable name   type      format     label      variable label
wtint2yr        double    %10.0g                Full Sample 2 Year Interview Weight
wtmec2yr        double    %10.0g                Full Sample 2 Year MEC Exam Weight
wtirep01        double    %10.0g                Interview Weight Jack Knife Replicate 01
wtmrep01        double    %10.0g                MEC Exam Weight Jack Knife Replicate 01
The number of PSUs in the NHANES population is not apparent, so we will not set an FPC value, but
we can set the standard jackknife multiplier for the 52 replicate-weight variables and save the results
as a Stata dataset for future use. Also the NHANES datasets all contain a variable called seqn. This
variable has a respondent sequence number that allows the dataset users to merge the demographic
dataset with other 1999–2000 NHANES datasets, so we sort on seqn before saving demo99_00.dta.
. local mult = 51/52
. svyset, jkrweight(wtmrep*, multiplier(`mult'))
(output omitted )
. svyset, jkrweight(wtirep*, multiplier(`mult'))
(output omitted )
. svyset, clear
. sort seqn
. save demo99_00
file demo99_00.dta saved
To complete this example, we will perform a simple analysis using the blood pressure data;
however, before we can perform any analysis, we have to merge the blood pressure dataset, bpx.xpt,
with our demographic dataset, demo99_00.dta. In our web browser, we saved the blood pressure
data from the CDC website ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/1999-2000/BPX.xpt.
We suggest that you rename the data to bpx.xpt.
We can then use import sasxport to read in the blood pressure data, sort on seqn, and save
the resulting dataset to bpx99_00.dta. We read in our copy of the demographic data, drop the
irrelevant weight variables, and merge in the blood pressure data from bpx99_00.dta. A quick call
to tabulate on the _merge variable generated by merge indicates that 683 observations in the
demographic data are not present in the blood pressure data. We do not drop these observations;
otherwise, the estimate of the population size will be incorrect. Finally, we set the appropriate sampling
and replicate-weight variables with svyset before replacing bpx99_00.dta with a more complete
copy of the blood pressure data.

. import sasxport bpx.xpt
. sort seqn
. save bpx99_00
file bpx99_00.dta saved
. use demo99_00
. drop wtint?yr wtirep*
. merge 1:1 seqn using bpx99_00
    Result                           # of obs.
    not matched                            683
        from master                        683  (_merge==1)
        from using                           0  (_merge==2)
    matched                              9,282  (_merge==3)
. drop _merge
. svyset [pw=wtmec2yr], jkrweight(wtmrep*) vce(jackknife)
(output omitted )
. save bpx99_00, replace
Having saved our merged dataset (with svysettings), we estimate the mean systolic blood pressure
for the population, using the MEC exam replication weights for jackknife variance estimation.
. svy: mean bpxsar
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..
Number of strata =                Number of obs   =       7898
                                  Population size =  231756417
                                  Replications    =         52
                                  Design df       =         51

                            Jackknife
                    Mean    Std. Err.     [95% Conf. Interval]
      bpxsar    119.7056    .5109122      118.6799    120.7313
Combining datasets from multiple surveys

The 2001–2002 NHANES datasets are also available from the CDC website, http://www.cdc.gov.
The guidelines that are published with these datasets recommend that the 1999–2000 and 2001–2002
NHANES datasets be combined to increase the accuracy of results. Combining datasets from multiple
surveys is a complicated process, and Stata has no specific tools for this task. However, the distributors
of the NHANES datasets provide sampling-weight variables for the 1999–2002 combined data in the
respective demographic datasets. They also provide some simple instructions on how to combine the
datasets from these two surveys.
In the previous example, we worked with the 1999–2000 NHANES data. The 2001–2002 NHANES
demographics data are contained in demo_b.xpt, and the blood pressure data are contained in
bpx_b.xpt. We follow the same steps as in the previous example to merge the blood pressure data
with the demographic data for 2001–2002.
Visit the following CDC websites and save the data:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/BPX_B.xpt
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/DEMO_B.xpt
We suggest that you rename the data to bpx_b.xpt and demo_b.xpt. We can then continue with
our example:
. import sasxport bpx_b.xpt
. sort seqn
. save bpx01_02
. import sasxport demo_b.xpt
. drop wtint?yr
. sort seqn
. merge 1:1 seqn using bpx01_02
    Result                           # of obs.
    not matched                            562
        from master                        562  (_merge==1)
        from using                           0  (_merge==2)
    matched                             10,477  (_merge==3)
. drop _merge
. svyset sdmvpsu [pw=wtmec2yr], strata(sdmvstra)
pweight: wtmec2yr
VCE: linearized
Strata 1: sdmvstra
SU 1: sdmvpsu
FPC 1: <zero>
. save bpx01_02, replace
The demographic dataset for 2001–2002 does not contain replicate-weight variables, but there are
variables that provide information on PSUs and strata for variance estimation. The PSU information
is contained in sdmvpsu, and the stratum information is in sdmvstra. See the documentation that
comes with the NHANES datasets for the details regarding these variables.
This new blood pressure dataset (bpx01_02.dta) is all we need if we are interested in analyzing
blood pressure data only for 2001–2002. However, we want to use the 1999–2002 combined data,
so we will follow the advice in the guidelines and just combine the datasets from the two surveys.
For those concerned about overlapping stratum identifiers between the two survey datasets, it is a
simple exercise to check that sdmvstra ranges from 1 to 13 for 1999–2000 but ranges from 14 to
28 for 2001–2002. Thus the stratum identifiers do not overlap, so we can simply append the data.
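The check on the stratum identifiers can be sketched as follows, assuming the two merged datasets saved earlier in this entry (bpx99_00.dta and bpx01_02.dta) are in the current working directory.

```stata
* Sketch only: assumes bpx99_00.dta and bpx01_02.dta were saved as described above
use bpx99_00, clear
summarize sdmvstra      // inspect the minimum and maximum for 1999-2000

use bpx01_02, clear
summarize sdmvstra      // inspect the minimum and maximum for 2001-2002
```

If the maximum in the first dataset is below the minimum in the second, the identifiers do not overlap and appending is safe in this respect.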
The 2001–2002 NHANES demographic dataset has no jackknife replicate-weight variables, so
we drop the replicate-weight variables from the 1999–2000 dataset. The sampling-weight variable
wtmec2yr is no longer appropriate for use with the combined data because its values are based on
the survey designs individually, so we drop it from the combined dataset. Finally, we use svyset
to identify the design variables for the combined surveys. wtmec4yr is the sampling-weight variable
for the MEC exam data developed by the data producers for the combined 1999–2002 NHANES data.
. use bpx99_00
. drop wt?rep*
. append using bpx01_02
. drop wtmec2yr
. svyset sdmvpsu [pw=wtmec4yr], strata(sdmvstra)
      pweight: wtmec4yr
          VCE: linearized
  Single unit: missing
     Strata 1: sdmvstra
         SU 1: sdmvpsu
        FPC 1: <zero>
. save bpx99_02
Now we can estimate the mean systolic blood pressure for our population by using the combined
surveys and jackknife variance estimation.
. svy jackknife: mean bpxsar
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
.......
Number of strata =      28        Number of obs   =      16297
Number of PSUs   =      57        Population size =  237466080
                                  Replications    =         57
                                  Design df       =         29

                            Jackknife
                    Mean    Std. Err.     [95% Conf. Interval]
      bpxsar    119.8914    .3828434      119.1084    120.6744
Video example
Specifying the design of your survey data to Stata
Stored results
svyset stores the following in r():
Scalars
    r(stages)
Macros
    r(wtype)         weight type
    r(wexp)          weight expression
    r(wvar)
    r(su#)
    r(strata#)
    r(fpc#)          FPC for stage #
    r(bsrweight)
    r(bsn)
    r(brrweight)
    r(fay)           Fay's adjustment
    r(jkrweight)
    r(sdrweight)
    r(sdrfpc)        fpc() value from within sdrweight()
    r(vce)
    r(dof)           dof() value
    r(mse)           mse, if specified
    r(poststrata)
    r(postweight)
    r(settings)      svyset arguments to reproduce the current settings
    r(singleunit)
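The stored results can be inspected with return list, and r(settings) can be saved and passed back to svyset later. The following is a minimal sketch using the fpc dataset from Example 5; the local macro name saved is our own choice for illustration.

```stata
* Sketch only: uses the fpc example dataset and its design variables
use http://www.stata-press.com/data/r13/fpc, clear
svyset psuid [pweight=weight], strata(stratid) fpc(Nh)

* List everything svyset left in r()
return list

* Keep the settings in a local macro, clear them, and then restore them
local saved `r(settings)'
svyset, clear
svyset `saved'
```

Copying r(settings) into a local macro first matters because the later svyset calls overwrite r().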
References
O'Donnell, O., E. van Doorslaer, A. Wagstaff, and M. Lindelow. 2008. Analyzing Health Equity Using Household
Survey Data: A Guide to Techniques and Their Implementation. Washington, DC: The World Bank.
Also see
Title
variance estimation Variance estimation for survey data
Description
References
Also see
Description
Stata's suite of estimation commands for survey data uses the most commonly used variance estimation
techniques: bootstrap, balanced repeated replication, jackknife, successive difference replication,
and linearization. The bootstrap, balanced repeated replication, jackknife, and successive difference
replication techniques are known as replication methods in the survey literature. We stick with that
nomenclature here, but note that these techniques are also known as resampling methods. This entry
discusses the details of these variance estimation techniques.
Also see Cochran (1977), Wolter (2007), and Shao and Tu (1995) for some background on these
variance estimators.

Variance of the total
Stratified single-stage design
Stratified two-stage design
Variance for census data
Certainty sampling units
Strata with one sampling unit
Ratios and other functions of survey data
Revisiting the total estimator
The ratio estimator
A note about score variables
Linearized/robust variance estimation
The bootstrap
BRR
The jackknife
The delete-one jackknife
The delete-k jackknife
Successive difference replication
Variance of the total

This section describes the methods and formulas for svy: total. The variance estimators not
using replication methods use the variance of a total as an important ingredient; this section therefore
also introduces variance estimation for survey data.
We will discuss the variance estimators for two complex survey designs:
1. The stratified single-stage design is the simplest design that has the elements present in most
complex survey designs.
2. Adding a second stage of clustering to the previous design results in a variance estimator for
designs with multiple stages of clustered sampling.
Stratified single-stage design

The population is partitioned into groups called strata. Clusters of observations are randomly
sampled, with or without replacement, from within each stratum. These clusters are called primary
sampling units (PSUs). In single-stage designs, data are collected from every member of the sampled
PSUs. When the observed data are analyzed, sampling weights are used to account for the survey
design. If the PSUs were sampled without replacement, a finite population correction (FPC) is applied
to the variance estimator.
The svyset syntax to specify this design is
svyset psu [pweight=weight], strata(strata) fpc(fpc)
The stratum identifiers are contained in the variable named strata, PSU identifiers are contained in
variable psu, the sampling weights are contained in variable weight, and the values for the FPC are
contained in variable fpc.
Let $h = 1, \ldots, L$ count the strata and $(h, i)$ denote the $i$th PSU in stratum $h$, where $i = 1, \ldots, N_h$
and $N_h$ is the number of PSUs in stratum $h$. Let $(h, i, j)$ denote the $j$th individual from PSU $(h, i)$
and $M_{hi}$ be the number of individuals in PSU $(h, i)$; then
$$M = \sum_{h=1}^{L} \sum_{i=1}^{N_h} M_{hi}$$
is the number of individuals in the population. Let $Y_{hij}$ be a survey item for individual $(h, i, j)$; for
example, $Y_{hij}$ might be income for adult $j$ living in block $i$ of county $h$. The associated population
total is
$$Y = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} Y_{hij}$$
Let $y_{hij}$ denote the items for individuals who are members of the sampled PSUs; here $h = 1,
\ldots, L$; $i = 1, \ldots, n_h$; and $j = 1, \ldots, m_{hi}$. The number of individuals in the sample (number of
observations) is
$$m = \sum_{h=1}^{L} \sum_{i=1}^{n_h} m_{hi}$$
The estimator for $Y$ is
$$\widehat{Y} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$
where $w_{hij}$ is a sampling weight, and its unadjusted value for this design is $w_{hij} = N_h/n_h$. The
estimator for the number of individuals in the population (population size) is
$$\widehat{M} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$$
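The estimator for the variance of $\widehat{Y}$ is
$$\widehat{V}(\widehat{Y}) = \sum_{h=1}^{L} (1 - f_h)\, \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2 \qquad (1)$$
where $y_{hi}$ is the weighted total for PSU $(h, i)$,
$$y_{hi} = \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$
and $\bar{y}_h$ is the mean of the PSU totals for stratum $h$:
$$\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$$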
The factor $(1 - f_h)$ is the FPC for stratum $h$, and $f_h$ is the sampling rate for stratum $h$. The sampling
rate $f_h$ is derived from the variable specified in the fpc() option of svyset. If an FPC variable is
not svyset, then $f_h = 0$. If an FPC variable is set and its values are greater than or equal to $n_h$,
then the variable is assumed to contain the values of $N_h$, and $f_h$ is given by $f_h = n_h/N_h$. If its
values are less than or equal to 1, then the variable is assumed to contain the sampling rates $f_h$.
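As a worked illustration of (1), consider the eight observations of the fpc dataset listed in [SVY] svyset, with x as the survey item and Nh as the FPC variable. Each sampled PSU there holds one observation, so the PSU totals are simply $y_{hi} = w_{hi} x_{hi}$. The arithmetic below is our own hand calculation, not output from Stata.

```latex
\begin{aligned}
\text{Stratum 1 } (n_1 = 5,\ N_1 = 15):\quad
  & y_{1i} = 8.4,\ 12.3,\ 20.4,\ 20.4,\ 27.6, \qquad \bar{y}_1 = 17.82, \\
  & \textstyle\sum_i (y_{1i} - \bar{y}_1)^2 = 228.168 \\
\text{Stratum 2 } (n_2 = 3,\ N_2 = 12):\quad
  & y_{2i} = 14.8,\ 26.4,\ 16.8, \qquad \bar{y}_2 = 58/3, \\
  & \textstyle\sum_i (y_{2i} - \bar{y}_2)^2 \approx 76.9067 \\[4pt]
\widehat{V}(\widehat{Y})
  &= \left(1 - \tfrac{5}{15}\right)\tfrac{5}{4}\,(228.168)
   + \left(1 - \tfrac{3}{12}\right)\tfrac{3}{2}\,(76.9067) \\
  &\approx 190.14 + 86.52 = 276.66
\end{aligned}
```

The estimated total is $\widehat{Y} = 89.1 + 58 = 147.1$ with standard error $\sqrt{276.66} \approx 16.63$. In this dataset the weights are constant within each stratum, so the estimated population size of 27 contributes no sampling variability; dividing the standard error by 27 gives about 0.6160, which agrees with the linearized standard error reported for svy: mean x with the FPC in that entry.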
If multiple variables are supplied to svy: total, covariances are also computed. The estimator
for the covariance between $\widehat{Y}$ and $\widehat{X}$ (notation for $X$ is defined similarly to that of $Y$) is
$$\widehat{\mathrm{Cov}}(\widehat{Y}, \widehat{X}) = \sum_{h=1}^{L} (1 - f_h)\, \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)$$
Stratified two-stage design

The population is partitioned into strata. PSUs are randomly sampled without replacement from within
each stratum. Clusters of observations are then randomly sampled, with or without replacement,
from within the sampled PSUs. These clusters are called secondary sampling units (SSUs). Data are then
collected from every member of the sampled SSUs. When the observed data are analyzed, sampling
weights are used to account for the survey design. Each sampling stage provides a component to the
variance estimator and has its own FPC.
The svyset syntax to specify this design is
svyset psu [pweight=weight], strata(strata) fpc(fpc1) || ssu, fpc(fpc2)
The stratum identifiers are contained in the variable named strata, PSU identifiers are contained in
variable psu, the sampling weights are contained in variable weight, the values for the FPC for the
first sampling stage are contained in variable fpc1, SSU identifiers are contained in variable ssu, and
the values for the FPC for the second sampling stage are contained in variable fpc2.
The notation for this design is based on the previous notation. There still are $L$ strata, and $(h, i)$
identifies the $i$th PSU in stratum $h$. Let $M_{hi}$ be the number of SSUs in PSU $(h, i)$, $M_{hij}$ be the number
of individuals in SSU $(h, i, j)$, and
\[
M = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} M_{hij}
\]
be the population size. Let $Y_{hijk}$ be a survey item for individual $(h, i, j, k)$; for example, $Y_{hijk}$ might
be income for adult $k$ living in block $j$ of county $i$ of state $h$. The associated population total is
\[
Y = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} \sum_{k=1}^{M_{hij}} Y_{hijk}
\]
Let $y_{hijk}$ denote the items for individuals who are members of the sampled SSUs; here $h = 1, \ldots, L$;
$i = 1, \ldots, n_h$; $j = 1, \ldots, m_{hi}$; and $k = 1, \ldots, m_{hij}$. The number of observations is
\[
m = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} m_{hij}
\]
The estimator for $Y$ is
\[
\widehat{Y} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \sum_{k=1}^{m_{hij}} w_{hijk}\, y_{hijk}
\]
where $w_{hijk}$ is a sampling weight, and its unadjusted value for this design is
\[
w_{hijk} = \left( \frac{N_h}{n_h} \right) \left( \frac{M_{hi}}{m_{hi}} \right)
\]
The estimator for the population size is
\[
\widehat{M} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \sum_{k=1}^{m_{hij}} w_{hijk}
\]
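The estimator for the variance of $\widehat{Y}$ is
\[
\widehat{V}(\widehat{Y}) = \sum_{h=1}^{L} (1 - f_h) \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2
+ \sum_{h=1}^{L} f_h \sum_{i=1}^{n_h} (1 - f_{hi}) \frac{m_{hi}}{m_{hi} - 1} \sum_{j=1}^{m_{hi}} (y_{hij} - \bar{y}_{hi})^2 \tag{2}
\]
where $y_{hi}$ is the weighted total for PSU $(h, i)$; $\bar{y}_h$ is the mean of the PSU totals for stratum $h$; $y_{hij}$
is the weighted total for SSU $(h, i, j)$,
\[
y_{hij} = \sum_{k=1}^{m_{hij}} w_{hijk}\, y_{hijk}
\]
and $\bar{y}_{hi}$ is the mean of the SSU totals for PSU $(h, i)$,
\[
\bar{y}_{hi} = \frac{1}{m_{hi}} \sum_{j=1}^{m_{hi}} y_{hij}
\]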
Equation (2) is equivalent to (1) with an added term representing the increase in variability because
of the second stage of sampling. The factor $(1 - f_h)$ is the FPC, and $f_h$ is the sampling rate for the
first stage of sampling. The factor $(1 - f_{hi})$ is the FPC, and $f_{hi}$ is the sampling rate for PSU $(h, i)$.
The sampling rate $f_{hi}$ is derived in the same manner as $f_h$.
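If multiple variables are supplied to svy: total, covariances are also computed. For estimated
totals $\widehat{Y}$ and $\widehat{X}$ (notation for $X$ is defined similarly to that of $Y$), the covariance estimator is
\[
\widehat{\mathrm{Cov}}(\widehat{Y}, \widehat{X}) = \sum_{h=1}^{L} (1 - f_h) \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)
+ \sum_{h=1}^{L} f_h \sum_{i=1}^{n_h} (1 - f_{hi}) \frac{m_{hi}}{m_{hi} - 1} \sum_{j=1}^{m_{hi}} (y_{hij} - \bar{y}_{hi})(x_{hij} - \bar{x}_{hi})
\]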
On the basis of the formulas (1) and (2), writing down the variance estimator for a survey design
with three or more stages is a matter of deriving the variance component for each sampling stage.
The sampling units from a given stage pose as strata for the next sampling stage.
All but the last stage must be sampled without replacement to get nonzero variance components
from each stage of clustered sampling. For example, if fh = 0 in (2), the second stage contributes
nothing to the variance estimator.
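As a numeric illustration of the two components in (2), the following Python sketch computes the first-stage and second-stage terms for a scalar total. It is not Stata's implementation; the function name, the row layout, and the toy data are assumptions made for this example, and every stratum and PSU is assumed to contain at least two sampled units.

```python
from collections import defaultdict

def two_stage_total_variance(rows, f1, f2):
    """Return the (stage 1, stage 2) terms of equation (2).

    rows: iterable of (stratum, psu, ssu, weight, y)   [hypothetical layout]
    f1[h]: first-stage sampling rate f_h
    f2[(h, i)]: second-stage sampling rate f_hi
    """
    ssu_tot = defaultdict(float)              # y_hij: weighted SSU totals
    for h, i, j, w, y in rows:
        ssu_tot[(h, i, j)] += w * y
    psu_ssus = defaultdict(list)              # SSU totals grouped by PSU
    for (h, i, j), t in ssu_tot.items():
        psu_ssus[(h, i)].append(t)
    stratum_psus = defaultdict(list)          # (PSU key, y_hi) grouped by stratum
    for key, totals in psu_ssus.items():
        stratum_psus[key[0]].append((key, sum(totals)))

    def scaled_ss(values):
        # n/(n-1) times the sum of squared deviations from the mean
        n = len(values)
        mean = sum(values) / n
        return n / (n - 1) * sum((v - mean) ** 2 for v in values)

    stage1 = stage2 = 0.0
    for h, psus in stratum_psus.items():
        stage1 += (1 - f1[h]) * scaled_ss([t for _, t in psus])
        for key, _ in psus:
            stage2 += f1[h] * (1 - f2[key]) * scaled_ss(psu_ssus[key])
    return stage1, stage2

# Toy data: one stratum, two PSUs, two SSUs per PSU, unit weights.
rows = [(1, 1, 1, 1.0, 1.0), (1, 1, 2, 1.0, 3.0),
        (1, 2, 1, 1.0, 2.0), (1, 2, 2, 1.0, 6.0)]
f2 = {(1, 1): 0.5, (1, 2): 0.5}
stage1, stage2 = two_stage_total_variance(rows, {1: 0.5}, f2)
print(stage1, stage2)
```

Calling the function again with a first-stage rate of zero returns a second-stage term of zero, which matches the remark above about with-replacement sampling in the first stage.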
Variance for census data

The point estimates that result from the analysis of census data, in which the entire population
was sampled without replacement, are the population's parameters instead of random variables. As
such, there is no sample-to-sample variation if we consider the population fixed. Here the sampling
fraction is one; thus, if the FPC variable you svyset for the first sampling stage is one, Stata will
report a standard error of zero.
Certainty sampling units

Stata's svy commands identify strata with an FPC equal to one as units sampled with certainty.
To properly determine the design degrees of freedom, certainty sampling units should be contained
within their own strata, one for each certainty unit, in each sampling stage. Although the observations
contained in certainty units from a given sampling stage play a role in parameter estimation, they
contribute nothing to the variance for that stage.
Strata with one sampling unit

By default, Stata's svy commands report missing standard errors when they encounter a stratum
with one sampling unit. Although the best way to solve this problem is to reassign the sampling unit
to another appropriately chosen stratum, there are three automatic alternatives that you can choose
from, in the singleunit() option, when you svyset your data.
singleunit(certainty) treats the strata with single sampling units as certainty units.
singleunit(scaled) treats the strata with single sampling units as certainty units but multiplies
the variance components from each stage by a scaling factor. For a given sampling stage, suppose
that $L$ is the total number of strata, $L_c$ is the number of certainty strata, and $L_s$ is the number
of strata with one sampling unit; then the scaling factor is $(L - L_c)/(L - L_c - L_s)$. Using
this scaling factor is the same as using the average of the variances from the strata with multiple
sampling units for each stratum with one sampling unit.
singleunit(centered) specifies that strata with one sampling unit are centered at the population
mean instead of the stratum mean. The quotient $n_h/(n_h - 1)$ in the variance formula is also taken
to be 1 if $n_h = 1$.
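The claim that the singleunit(scaled) factor amounts to imputing the average variance can be checked with a few lines of Python. This is only a sketch under assumed inputs: the helper name and the per-stratum variance contributions are hypothetical.

```python
def singleunit_scale(L, L_c, L_s):
    """Scaling factor (L - L_c) / (L - L_c - L_s) for one sampling stage."""
    denom = L - L_c - L_s
    if denom <= 0:
        raise ValueError("need at least one stratum with multiple sampling units")
    return (L - L_c) / denom

# Hypothetical stage: 5 strata, 1 certainty stratum, 1 single-unit stratum,
# and 3 strata with multiple units whose variance contributions are below.
multi_unit_variances = [4.0, 6.0, 2.0]
scaled = singleunit_scale(5, 1, 1) * sum(multi_unit_variances)

# Same result: give the single-unit stratum the average of the other variances.
average = sum(multi_unit_variances) / len(multi_unit_variances)
imputed = sum(multi_unit_variances) + 1 * average
print(scaled, imputed)
```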
Ratios and other functions of survey data

Shah (2004) points out a simple procedure for deriving the linearized variance for functions of
survey data that are continuous functions of the sampling weights. Let $\theta$ be a (possibly vector-valued)
function of the population data and $\widehat{\theta}$ be its associated estimator based on survey data.
1. Define the $j$th observation of the score variable by
\[
z_j = \frac{\partial \widehat{\theta}}{\partial w_j}
\]
If $\widehat{\theta}$ is implicitly defined through estimating equations, $z_j$ can be computed by taking the partial
derivative of the estimating equations with respect to $w_j$.
2. Define the weighted total of the score variable by
\[
\widehat{Z} = \sum_{j=1}^{m} w_j z_j
\]
3. Estimate the variance $V(\widehat{Z})$ by using the design-based variance estimator for the total $\widehat{Z}$. This
variance estimator is an approximation of $V(\widehat{\theta})$.
Revisiting the total estimator
As a first example, we derive the variance of the total from a stratified single-stage design. Here
you have $\widehat{\theta} = \widehat{Y}$, and deriving the score variable for $\widehat{Y}$ results in the original values of the variable
of interest.
\[
z_j(\widehat{\theta}) = z_j(\widehat{Y}) = \frac{\partial \widehat{Y}}{\partial w_j} = y_j
\]
Thus you trivially recover the variance of the total given in (1) and (2).
The ratio estimator
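The estimator for the population ratio is
\[
\widehat{R} = \frac{\widehat{Y}}{\widehat{X}}
\]
and its score variable is
\[
z_j(\widehat{R}) = \frac{\partial \widehat{R}}{\partial w_j} = \frac{y_j - \widehat{R}\, x_j}{\widehat{X}}
\]
Plugging this into (1) or (2) results in a variance estimator that is algebraically equivalent to the
variance estimator derived from directly applying the delta method (a first-order Taylor expansion
with respect to $y$ and $x$)
\[
\widehat{V}(\widehat{R}) = \frac{1}{\widehat{X}^2} \left\{ \widehat{V}(\widehat{Y}) - 2 \widehat{R}\, \widehat{\mathrm{Cov}}(\widehat{Y}, \widehat{X}) + \widehat{R}^2\, \widehat{V}(\widehat{X}) \right\}
\]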
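The equivalence between the score-variable route and the delta-method formula for the ratio can be verified numerically. The Python sketch below assumes a stratified single-stage design with $f_h = 0$; the helper `total_cov`, the row layout, and the data values are all hypothetical.

```python
from collections import defaultdict

def total_cov(rows, a, b):
    """Single-stage stratified covariance of two weighted totals, with f_h = 0.

    rows: dicts with keys 'h' (stratum), 'psu', 'w' (weight), and data columns.
    """
    tot = defaultdict(lambda: [0.0, 0.0])     # weighted PSU totals of a and b
    for r in rows:
        t = tot[(r["h"], r["psu"])]
        t[0] += r["w"] * r[a]
        t[1] += r["w"] * r[b]
    by_h = defaultdict(list)
    for (h, _), t in tot.items():
        by_h[h].append(t)
    cov = 0.0
    for psus in by_h.values():
        n = len(psus)
        mean_a = sum(t[0] for t in psus) / n
        mean_b = sum(t[1] for t in psus) / n
        cov += n / (n - 1) * sum((t[0] - mean_a) * (t[1] - mean_b) for t in psus)
    return cov

rows = [
    {"h": 1, "psu": 1, "w": 2.0, "y": 3.0, "x": 1.0},
    {"h": 1, "psu": 1, "w": 2.0, "y": 5.0, "x": 2.0},
    {"h": 1, "psu": 2, "w": 2.0, "y": 4.0, "x": 3.0},
    {"h": 2, "psu": 1, "w": 1.5, "y": 6.0, "x": 2.0},
    {"h": 2, "psu": 2, "w": 1.5, "y": 1.0, "x": 1.0},
    {"h": 2, "psu": 3, "w": 1.5, "y": 2.0, "x": 4.0},
]
Y = sum(r["w"] * r["y"] for r in rows)
X = sum(r["w"] * r["x"] for r in rows)
R = Y / X
for r in rows:
    r["z"] = (r["y"] - R * r["x"]) / X        # score variable z_j for the ratio

v_score = total_cov(rows, "z", "z")           # variance of the total of z
v_delta = (total_cov(rows, "y", "y") - 2 * R * total_cov(rows, "y", "x")
           + R ** 2 * total_cov(rows, "x", "x")) / X ** 2
print(v_score, v_delta)
```

The two printed values agree up to floating-point rounding, because the weighted PSU totals of $z_j$ are linear combinations of the PSU totals of $y$ and $x$.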
A note about score variables

The functional form of the score variable for each estimation command is detailed in the Methods
and formulas section of its manual entry; see [R] total, [R] ratio, and [R] mean.
Although Deville (1999) and Demnati and Rao (2004) refer to zj as the linearized variable, here
it is referred to as the score variable to tie it more closely to the model-based estimators discussed
in the following section.
Linearized/robust variance estimation

The regression models for survey data that allow the vce(linearized) option use linearization-based
variance estimators that are natural extensions of the variance estimator for totals. For general
background on regression and generalized linear model analysis of complex survey data, see
Binder (1983); Cochran (1977); Fuller (1975); Godambe (1991); Kish and Frankel (1974); Särndal,
Swensson, and Wretman (1992); and Skinner (1989).
Suppose that you observed $(Y_j, x_j)$ for the entire population and are interested in modeling the
relationship between $Y_j$ and $x_j$ by the vector of parameters $\beta$ that solve the following estimating
equations:
\[
G(\beta) = \sum_{j=1}^{M} S(\beta; Y_j, x_j) = 0
\]
For ordinary least squares, $G(\beta)$ is the normal equations
\[
G(\beta) = X'Y - X'X\beta = 0
\]
where $Y$ is the vector of outcomes for the full population and $X$ is the matrix of explanatory variables
for the full population. For a pseudolikelihood model, such as logistic regression, $G(\beta)$ is the first
derivative of the log-pseudolikelihood function with respect to $\beta$. Estimate $\beta$ by solving for $\widehat{\beta}$ from
the weighted sample estimating equations
\[
\widehat{G}(\beta) = \sum_{j=1}^{m} w_j S(\beta; y_j, x_j) = 0 \tag{3}
\]
The associated estimation command with iweights will produce point estimates $\widehat{\beta}$ equal to the
solution of (3).
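A first-order matrix Taylor-series expansion yields
\[
\widehat{\beta} - \beta \approx - \left\{ \frac{\partial \widehat{G}(\beta)}{\partial \beta} \right\}^{-1} \widehat{G}(\beta)
\]
with the following variance estimator for $\widehat{\beta}$:
\[
\widehat{V}(\widehat{\beta}) = \left[ \left\{ \frac{\partial \widehat{G}(\beta)}{\partial \beta} \right\}^{-1} \widehat{V}\{\widehat{G}(\beta)\} \left\{ \frac{\partial \widehat{G}(\beta)}{\partial \beta} \right\}^{-T} \right] \Bigg|_{\beta = \widehat{\beta}}
= D\, \widehat{V}\{\widehat{G}(\beta)\} \big|_{\beta = \widehat{\beta}}\, D'
\]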
where $D$ is $(X_s' W X_s)^{-1}$ for linear regression (where $W$ is a diagonal matrix of the sampling weights
and $X_s$ is the matrix of sampled explanatory variables) or the inverse of the negative Hessian matrix
from the pseudolikelihood model. Write $\widehat{G}(\beta)$ as
\[
\widehat{G}(\beta) = \sum_{j=1}^{m} w_j d_j
\]
where $d_j = s_j x_j$ and $s_j$ is a residual for linear regression or an equation-level score from the
pseudolikelihood model. The term equation-level score means the derivative of the log pseudolikelihood
with respect to $x_j \beta$. In either case, $\widehat{G}(\widehat{\beta})$ is an estimator for the total $G(\beta)$, and the variance estimator
$\widehat{V}\{\widehat{G}(\beta)\}|_{\beta = \widehat{\beta}}$ is computed using the design-based variance estimator for a total.
The above result is easily extended to models with ancillary parameters, multiple regression
equations, or both.
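To make the sandwich form $D\,\widehat{V}\{\widehat{G}(\beta)\}\,D'$ concrete, here is a minimal Python sketch for the simplest possible case: a single-coefficient weighted least-squares fit through the origin under a stratified single-stage design with $f_h = 0$. The one-parameter model, the function name, and the data are assumptions chosen to keep the algebra scalar; this is not how any Stata command is coded.

```python
from collections import defaultdict

def linearized_slope_variance(rows):
    """Fit y = beta * x by weighted least squares; return (beta_hat, variance).

    rows: iterable of (stratum, psu, weight, y, x)   [hypothetical layout]
    """
    sxx = sum(w * x * x for _, _, w, _, x in rows)
    sxy = sum(w * x * y for _, _, w, y, x in rows)
    beta = sxy / sxx
    D = 1.0 / sxx                                     # (X_s' W X_s)^{-1}, scalar here

    psu_tot = defaultdict(float)                      # weighted PSU totals of d_j
    for h, i, w, y, x in rows:
        psu_tot[(h, i)] += w * (y - x * beta) * x     # d_j = s_j x_j, s_j = residual
    by_h = defaultdict(list)
    for (h, _), t in psu_tot.items():
        by_h[h].append(t)

    v_G = 0.0                                         # design-based variance of G-hat
    for totals in by_h.values():
        n = len(totals)
        mean = sum(totals) / n
        v_G += n / (n - 1) * sum((t - mean) ** 2 for t in totals)
    return beta, D * v_G * D

# Toy data: one stratum, two PSUs with one observation each.
beta_hat, v_beta = linearized_slope_variance([(1, 1, 1.0, 1.0, 1.0),
                                              (1, 2, 1.0, 3.0, 1.0)])
print(beta_hat, v_beta)
```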
The bootstrap
The bootstrap methods for survey data used in recent years are largely due to McCarthy and
Snowden (1985), Rao and Wu (1988), and Rao, Wu, and Yue (1992). For example, Yeo, Mantel,
and Liu (1999) cite Rao, Wu, and Yue (1992) with the method for variance estimation used in the
National Population Health Survey conducted by Statistics Canada.
In the survey bootstrap, the model is fit multiple times, once for each of a set of adjusted sampling
weights that mimic bootstrap resampling. The variance is estimated using the resulting replicated
point estimates.
Let $\widehat{\theta}$ be the vector of point estimates computed using the sampling weights for a given survey
dataset (for example, $\widehat{\theta}$ could be a vector of means, ratios, or regression coefficients). Each bootstrap
replicate is produced by fitting the model with adjusted sampling weights. The adjusted sampling
weights are derived from the method used to resample the original survey data.
According to Yeo, Mantel, and Liu (1999), if $n_h$ is the number of observed PSUs in stratum $h$,
then $n_h - 1$ PSUs are sampled with replacement from within stratum $h$. This sampling is performed
independently across the strata to produce one bootstrap sample of the survey data. Let $r$ be the
number of bootstrap samples. Suppose that we are about to generate the adjusted-weight variable for
one bootstrap replication and $w_{hij}$ is the sampling weight attached to the $j$th observation in the
$i$th PSU of stratum $h$. The adjusted weight is
\[
w^*_{hij} = \frac{n_h}{n_h - 1}\, m^*_{hi}\, w_{hij}
\]
where $m^*_{hi}$ is the number of times the $i$th cluster in stratum $h$ was resampled.
To accommodate privacy concerns, many public-use datasets contain replicate-weight variables
derived from the mean bootstrap described by Yung (1997). In the mean bootstrap, each adjusted
weight is derived from $b$ bootstrap samples instead of one. The adjusted weight is
\[
w^*_{hij} = \frac{n_h}{n_h - 1}\, \bar{m}^*_{hi}\, w_{hij}
\]
where
\[
\bar{m}^*_{hi} = \frac{1}{b} \sum_{k=1}^{b} m^*_{hik}
\]
is the average of the number of times the $i$th cluster in stratum $h$ was resampled among the $b$ bootstrap
samples.
Each replicate is produced using an adjusted-weight variable with the estimation command that
computed $\widehat{\theta}$. The adjusted-weight variables must be supplied to svyset with the bsrweight() option.
For the mean bootstrap, $b$ must also be supplied to svyset with the bsn() option; otherwise, bsn(1)
is assumed. We call the variables supplied to the bsrweight() option bootstrap replicate-weight
variables when $b = 1$ and mean bootstrap replicate-weight variables when $b > 1$.
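Let $\widehat{\theta}_{(i)}$ be the vector of point estimates from the $i$th replication. When the mse option is specified,
the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \frac{b}{r} \sum_{i=1}^{r} \{\widehat{\theta}_{(i)} - \widehat{\theta}\} \{\widehat{\theta}_{(i)} - \widehat{\theta}\}'
\]
Otherwise, the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \frac{b}{r} \sum_{i=1}^{r} \{\widehat{\theta}_{(i)} - \bar{\theta}_{(.)}\} \{\widehat{\theta}_{(i)} - \bar{\theta}_{(.)}\}'
\]
where $\bar{\theta}_{(.)}$ is the bootstrap mean,
\[
\bar{\theta}_{(.)} = \frac{1}{r} \sum_{i=1}^{r} \widehat{\theta}_{(i)}
\]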
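The resampling rule and the variance formulas for the survey bootstrap can be sketched in Python as follows. This is a scalar-parameter illustration only; the function names, the dictionary layout for PSUs and weights, and the seed are assumptions, and the code does not reproduce how Stata or any agency generates replicate weights.

```python
import random
from collections import Counter

def bootstrap_weights(psus_by_stratum, weights, rng):
    """Return one replicate of adjusted weights.

    psus_by_stratum: {h: [psu ids]}; weights: {(h, psu, j): w_hij}
    """
    adjusted = {}
    for h, psus in psus_by_stratum.items():
        n_h = len(psus)
        # m*_hi: number of times PSU i is drawn when n_h - 1 PSUs are
        # sampled with replacement from stratum h
        draws = Counter(rng.choices(psus, k=n_h - 1))
        for (hh, i, j), w in weights.items():
            if hh == h:
                adjusted[(hh, i, j)] = n_h / (n_h - 1) * draws[i] * w
    return adjusted

def bootstrap_variance(replicates, theta_hat=None, b=1):
    """Scalar form of (b / r) * sum of squared deviations.

    Deviations are taken from theta_hat when it is given (the mse form)
    and from the bootstrap mean otherwise.
    """
    r = len(replicates)
    center = theta_hat if theta_hat is not None else sum(replicates) / r
    return b / r * sum((t - center) ** 2 for t in replicates)

psus = {1: ["a", "b", "c"]}
w = {(1, "a", 1): 2.0, (1, "b", 1): 2.0, (1, "c", 1): 2.0}
w_star = bootstrap_weights(psus, w, random.Random(1))
print(w_star)
```

With equal weights and one observation per PSU, as in this toy stratum, each replicate preserves the stratum's weight total, because the $n_h - 1$ draws are scaled by $n_h/(n_h - 1)$.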
BRR
BRR was first introduced by McCarthy (1966, 1969a, and 1969b) as a method of variance estimation
for designs with two PSUs in every stratum. The BRR variance estimator tends to give more reasonable
variance estimates for this design than the linearized variance estimator, which can result in large
values and undesirably wide confidence intervals.
The model is fit multiple times, once for each of a balanced set of combinations where one PSU is
dropped (or downweighted) from each stratum. The variance is estimated using the resulting replicated
point estimates (replicates). Although the BRR method has since been generalized to include other
designs, Stata's implementation of BRR requires two PSUs per stratum.
Let $\widehat{\theta}$ be the vector of point estimates computed using the sampling weights for a given stratified
survey design (for example, $\widehat{\theta}$ could be a vector of means, ratios, or regression coefficients). Each BRR
replicate is produced by dropping (or downweighting) a PSU from every stratum. This could result
in as many as $2^L$ replicates for a dataset with $L$ strata; however, the BRR method uses Hadamard
matrices to identify a balanced subset of the combinations from which to produce the replicates.
A Hadamard matrix is a square matrix, $H_r$ (with $r$ rows and columns), such that $H_r' H_r = rI$,
where $I$ is the identity matrix. The elements of $H_r$ are $+1$ and $-1$; $-1$ causes the first PSU to be
downweighted and $+1$ causes the second PSU to be downweighted. Thus $r$ must be greater than or
equal to the number of strata.
Suppose that we are about to generate the adjusted-weight variable for the $i$th replication and $w_j$
is the sampling weight attached to the $j$th observation, which happens to be in the first PSU of stratum
$h$. The adjusted weight is
\[
w^*_j = \begin{cases}
f\, w_j, & \text{if } H_r[i, h] = -1 \\
(2 - f)\, w_j, & \text{if } H_r[i, h] = +1
\end{cases}
\]
where $f$ is Fay's adjustment (Judkins 1990). By default, $f = 0$.
Each replicate is produced using an adjusted-weight variable with the estimation command that
computed $\widehat{\theta}$. The adjusted-weight variables can be generated by Stata or supplied to svyset with the
brrweight() option. We call the variables supplied to the brrweight() option BRR replicate-weight
variables.
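Let $\widehat{\theta}_{(i)}$ be the vector of point estimates from the $i$th replication. When the mse option is specified,
the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \frac{1}{r(1 - f)^2} \sum_{i=1}^{r} \{\widehat{\theta}_{(i)} - \widehat{\theta}\} \{\widehat{\theta}_{(i)} - \widehat{\theta}\}'
\]
Otherwise, the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \frac{1}{r(1 - f)^2} \sum_{i=1}^{r} \{\widehat{\theta}_{(i)} - \bar{\theta}_{(.)}\} \{\widehat{\theta}_{(i)} - \bar{\theta}_{(.)}\}'
\]
where $\bar{\theta}_{(.)}$ is the BRR mean,
\[
\bar{\theta}_{(.)} = \frac{1}{r} \sum_{i=1}^{r} \widehat{\theta}_{(i)}
\]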
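A small Python sketch may help tie the Hadamard matrix to the replicate weights and the variance formula. It uses a hard-coded order-4 matrix for three strata, zero-based indices, and a weighted total as the statistic; the helper names and data are hypothetical, and the second-PSU rule is inferred from the description above (a $+1$ entry downweights the second PSU).

```python
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]

def brr_weights(rows, H, i, fay=0.0):
    """Adjusted weights for replicate i (0-based row of H).

    rows: (stratum index h starting at 0, psu in {1, 2}, weight, y)
    """
    out = []
    for h, psu, w, _ in rows:
        first_down = H[i][h] == -1            # -1 downweights the first PSU
        down = first_down if psu == 1 else not first_down
        out.append(fay * w if down else (2 - fay) * w)
    return out

def brr_variance(replicates, theta_hat=None, fay=0.0):
    """Scalar BRR variance; deviations from theta_hat (mse) or the BRR mean."""
    r = len(replicates)
    center = theta_hat if theta_hat is not None else sum(replicates) / r
    return sum((t - center) ** 2 for t in replicates) / (r * (1 - fay) ** 2)

# Toy data: three strata, two PSUs each, one observation per PSU.
rows = [(0, 1, 1.0, 3.0), (0, 2, 1.0, 5.0),
        (1, 1, 2.0, 1.0), (1, 2, 2.0, 4.0),
        (2, 1, 1.0, 7.0), (2, 2, 1.0, 2.0)]
theta_hat = sum(w * y for _, _, w, y in rows)
replicates = [sum(ws * r[3] for ws, r in zip(brr_weights(rows, H4, i), rows))
              for i in range(len(H4))]
v_brr = brr_variance(replicates, theta_hat=theta_hat)
print(replicates, v_brr)
```

For a total with two PSUs per stratum and $f = 0$, the result equals the sum over strata of the squared difference between the two weighted PSU totals, which is also what the linearized estimator gives when the FPC is ignored.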
The jackknife
The jackknife method for variance estimation is appropriate for many models and survey designs.
The model is fit multiple times, and each time one or more PSUs are dropped from the estimation
sample. The variance is estimated using the resulting replicates (replicated point estimates).
Let $\widehat{\theta}$ be the vector of point estimates computed using the sampling weights for a given survey
design (for example, $\widehat{\theta}$ could be a vector of means, ratios, or regression coefficients). The dataset
is resampled by dropping one or more PSUs from one stratum and adjusting the sampling weights
before recomputing a replicate for $\widehat{\theta}$.
Let $w_{hij}$ be the sampling weight for the $j$th individual from PSU $i$ in stratum $h$. Suppose that
you are about to generate the adjusted weights for the replicate resulting from dropping $k$ PSUs from
stratum $h$. The adjusted weight is
\[
w^*_{abj} = \begin{cases}
0, & \text{if } a = h \text{ and } b \text{ is dropped} \\
\dfrac{n_h}{n_h - k}\, w_{abj}, & \text{if } a = h \text{ and } b \text{ is not dropped} \\
w_{abj}, & \text{otherwise}
\end{cases}
\]
Each replicate is produced by using the adjusted-weight variable with the estimation command
that produced $\widehat{\theta}$. For the delete-one jackknife (where one PSU is dropped for each replicate), adjusted
weights can be generated by Stata or supplied to svyset with the jkrweight() option. For the
delete-$k$ jackknife (where $k > 1$ PSUs are dropped for each replicate), the adjusted-weight variables must
be supplied to svyset using the jkrweight() option. The variables supplied to the jkrweight()
option are called jackknife replicate-weight variables.
The delete-one jackknife

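Let $\widehat{\theta}_{(h,i)}$ be the point estimates (replicate) from leaving out the $i$th PSU from stratum $h$. The
pseudovalue for replicate $(h, i)$ is
\[
\widehat{\theta}^*_{h,i} = \widehat{\theta}_{(h,i)} + n_h \{\widehat{\theta} - \widehat{\theta}_{(h,i)}\}
\]
When the mse option is specified, the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\widehat{\theta}_{(h,i)} - \widehat{\theta}\} \{\widehat{\theta}_{(h,i)} - \widehat{\theta}\}'
\]
and the jackknife mean is
\[
\bar{\theta}_{(.)} = \frac{1}{n} \sum_{h=1}^{L} \sum_{i=1}^{n_h} \widehat{\theta}_{(h,i)}
\]
where $f_h$ is the sampling rate and $m_h$ is the jackknife multiplier associated with stratum $h$. Otherwise,
the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\widehat{\theta}_{(h,i)} - \bar{\theta}_h\} \{\widehat{\theta}_{(h,i)} - \bar{\theta}_h\}', \qquad
\bar{\theta}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} \widehat{\theta}_{(h,i)}
\]
and the jackknife mean is
\[
\bar{\theta}^* = \frac{1}{n} \sum_{h=1}^{L} \sum_{i=1}^{n_h} \widehat{\theta}^*_{h,i}
\]
The multiplier for the delete-one jackknife is
\[
m_h = \frac{n_h - 1}{n_h}
\]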
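The delete-one jackknife can be sketched in Python by combining the weight adjustment with the variance formula and the multiplier $m_h = (n_h - 1)/n_h$. The function below handles a scalar statistic only; its name, the tuple layout, and the `estimator` callback are assumptions for illustration.

```python
def jackknife_delete_one(rows, estimator, f=None, mse=False):
    """Delete-one jackknife variance for a scalar statistic.

    rows: tuples (stratum, psu, weight, *data)   [hypothetical layout]
    estimator: function mapping such rows to a float
    f: optional {stratum: sampling rate f_h}; strata not listed use 0
    """
    f = f or {}
    theta_hat = estimator(rows)
    strata = {}
    for r in rows:
        strata.setdefault(r[0], set()).add(r[1])
    variance = 0.0
    for h, psus in strata.items():
        n_h = len(psus)
        reps = []
        for dropped in psus:
            adjusted = []
            for r in rows:
                if r[0] == h and r[1] == dropped:
                    continue                          # adjusted weight is 0
                w = r[2] * n_h / (n_h - 1) if r[0] == h else r[2]
                adjusted.append((r[0], r[1], w) + tuple(r[3:]))
            reps.append(estimator(adjusted))
        center = theta_hat if mse else sum(reps) / n_h
        m_h = (n_h - 1) / n_h                         # delete-one multiplier
        variance += (1 - f.get(h, 0.0)) * m_h * sum((t - center) ** 2 for t in reps)
    return theta_hat, variance

def weighted_total(rs):
    return sum(r[2] * r[3] for r in rs)

rows = [(0, 1, 1.0, 3.0), (0, 2, 1.0, 5.0),
        (1, 1, 2.0, 1.0), (1, 2, 2.0, 4.0),
        (2, 1, 1.0, 7.0), (2, 2, 1.0, 2.0)]
theta_hat, v_jk = jackknife_delete_one(rows, weighted_total)
print(theta_hat, v_jk)
```

Passing a different `estimator`, for example a ratio of two weighted totals, reuses the same replication loop.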
The delete-k jackknife

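Let $\widetilde{\theta}_{(h,d)}$ be one of the point estimates that resulted from leaving out $k$ PSUs from stratum $h$. Let
$c_h$ be the number of such combinations that were used to generate a replicate for stratum $h$; then
$d = 1, \ldots, c_h$. If all combinations were used, then
\[
c_h = \frac{n_h!}{(n_h - k)!\, k!}
\]
The pseudovalue for replicate $(h, d)$ is
\[
\widetilde{\theta}^*_{h,d} = \widetilde{\theta}_{(h,d)} + c_h \{\widehat{\theta} - \widetilde{\theta}_{(h,d)}\}
\]
When the mse option is specified, the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{d=1}^{c_h} \{\widetilde{\theta}_{(h,d)} - \widehat{\theta}\} \{\widetilde{\theta}_{(h,d)} - \widehat{\theta}\}'
\]
and the jackknife mean is
\[
\bar{\theta}_{(.)} = \frac{1}{C} \sum_{h=1}^{L} \sum_{d=1}^{c_h} \widetilde{\theta}_{(h,d)}, \qquad C = \sum_{h=1}^{L} c_h
\]
Otherwise, the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{d=1}^{c_h} \{\widetilde{\theta}_{(h,d)} - \bar{\theta}_h\} \{\widetilde{\theta}_{(h,d)} - \bar{\theta}_h\}', \qquad
\bar{\theta}_h = \frac{1}{c_h} \sum_{d=1}^{c_h} \widetilde{\theta}_{(h,d)}
\]
and the jackknife mean is
\[
\bar{\theta}^* = \frac{1}{C} \sum_{h=1}^{L} \sum_{d=1}^{c_h} \widetilde{\theta}^*_{h,d}
\]
The multiplier for the delete-$k$ jackknife is
\[
m_h = \frac{n_h - k}{c_h k}
\]
Variables containing the values for the stratum identifier $h$, the sampling rate $f_h$, and the jackknife
multiplier $m_h$ can be svyset using the respective suboptions of the jkrweight() option: stratum(),
fpc(), and multiplier().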
Successive difference replication

Successive difference replication (SDR) was first introduced by Fay and Train (1995) as a method
of variance estimation for annual demographic supplements to the Current Population Survey. This
method is typically applied to systematic samples, where the observed sampling units are somehow
ordered.
In SDR, the model is fit multiple times, once for each of a set of adjusted sampling weights. The
variance is estimated using the resulting replicated point estimates.
Let $\widehat{\theta}$ be the vector of point estimates computed using the sampling weights for a given survey
dataset (for example, $\widehat{\theta}$ could be a vector of means, ratios, or regression coefficients). Each SDR
replicate is produced by fitting the model with adjusted sampling weights. The SDR method uses
Hadamard matrices to generate these adjustments.
A Hadamard matrix is a square matrix, $H_r$ (with $r$ rows and columns), such that $H_r' H_r = rI$,
where $I$ is the identity matrix. Let $h_{ij}$ be an element of $H_r$; then $h_{ij} = 1$ or $h_{ij} = -1$. In SDR, if
$n$ is the number of PSUs, then we must find $H_r$ with $r \geq n + 2$.
Without loss of generality, we will assume the ordered PSUs are individuals instead of clusters.
Suppose that we are about to generate the adjusted-weight variable for the $i$th replication and that $w_j$
is the sampling weight attached to the $j$th observation. The adjusted weight is $w^*_j = f_{ji} w_j$, where
$f_{ji}$ is
\[
f_{ji} = 1 + \frac{1}{2\sqrt{2}} (h_{j+1,i} - h_{j+2,i})
\]
Here we assume that the elements of the first row of $H_r$ are all 1.
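The SDR adjustment factors are easy to tabulate for a small case. The Python sketch below assumes $n = 2$ ordered PSUs and a hard-coded Hadamard matrix of order $r = 4 = n + 2$ whose first row is all ones; the function name and the index translation are assumptions.

```python
import math

H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]

def sdr_factor(H, j, i):
    """f_ji for PSU j and replicate i, both 1-based as in the text."""
    # Rows j+1 and j+2 in 1-based notation are H[j] and H[j + 1] in Python.
    return 1 + (H[j][i - 1] - H[j + 1][i - 1]) / (2 * math.sqrt(2))

r = len(H4)
factors = {j: [sdr_factor(H4, j, i) for i in range(1, r + 1)] for j in (1, 2)}
for j, fs in factors.items():
    print(j, [round(f, 4) for f in fs])
```

Each factor is 1, $1 + 1/\sqrt{2}$, or $1 - 1/\sqrt{2}$. Because distinct rows of $H_r$ are orthogonal, $(4/r) \sum_i (f_{ji} - 1)^2 = 1$ for every PSU, which is where the $4/r$ multiplier in the SDR variance formula comes from.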
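Each replicate is produced using an adjusted-weight variable with the estimation command that
computed $\widehat{\theta}$. The adjusted-weight variables must be supplied to svyset with the sdrweight()
option. We call the variables supplied to the sdrweight() option SDR replicate-weight variables.
Let $\widehat{\theta}_{(i)}$ be the vector of point estimates from the $i$th replication, and let $f$ be the sampling fraction
computed using the FPC information svyset in the fpc() suboption of the sdrweight() option,
where $f = 0$ when fpc() is not specified. When the mse option is specified, the variance estimator
is
\[
\widehat{V}(\widehat{\theta}) = (1 - f) \frac{4}{r} \sum_{i=1}^{r} \{\widehat{\theta}_{(i)} - \widehat{\theta}\} \{\widehat{\theta}_{(i)} - \widehat{\theta}\}'
\]
Otherwise, the variance estimator is
\[
\widehat{V}(\widehat{\theta}) = (1 - f) \frac{4}{r} \sum_{i=1}^{r} \{\widehat{\theta}_{(i)} - \bar{\theta}_{(.)}\} \{\widehat{\theta}_{(i)} - \bar{\theta}_{(.)}\}'
\]
where $\bar{\theta}_{(.)}$ is the SDR mean,
\[
\bar{\theta}_{(.)} = \frac{1}{r} \sum_{i=1}^{r} \widehat{\theta}_{(i)}
\]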
In survey data analysis, the customary number of degrees of freedom attributed to a test statistic is
$d = n - L$, where $n$ is the number of PSUs and $L$ is the number of strata. Under regularity conditions,
an approximate $100(1 - \alpha)\%$ confidence interval for a parameter $\theta$ (for example, $\theta$ could be a total,
ratio, or regression coefficient) is
\[
\widehat{\theta} \pm t_{1-\alpha/2,\, d}\, \{\widehat{V}(\widehat{\theta})\}^{1/2}
\]
Cochran (1977, sec. 2.8) and Korn and Graubard (1990) give some theoretical justification for
using $d = n - L$ to compute univariate confidence intervals and p-values. However, for some cases,
inferences based on the customary $n - L$ degrees-of-freedom calculation may be excessively liberal;
the resulting confidence intervals may have coverage rates substantially less than the nominal $1 - \alpha$.
This problem generally is of the greatest practical concern when the population of interest has a
skewed or heavy-tailed distribution or is concentrated in a few PSUs. In some of these cases, the user
may want to consider constructing confidence intervals based on alternative degrees-of-freedom terms,
based on the Satterthwaite (1941, 1946) approximation and modifications thereof; see, for example,
Cochran (1977, sec. 5.4) and Eltinge and Jang (1996).
Sometimes there is no information on $n$ or $L$ for datasets that contain replicate-weight variables
but no PSU or strata variables. Each of svy's replication commands has its own default behavior
when the design degrees of freedom are not svyset or specified using the dof() option. svy brr:
and svy jackknife: use $d = r - 1$, where $r$ is the number of replications. svy bootstrap: and
svy sdr: use $z_{1-\alpha/2}$ for the critical value instead of $t_{1-\alpha/2,\, d}$.
References
Binder, D. A. 1983. On the variances of asymptotically normal estimators from complex surveys. International
Statistical Review 51: 279–292.
1726.
Eltinge, J. L., and D. S. Jang. 1996. Stability measures for variance component estimators under a stratified multistage
design. Survey Methodology 22: 157–165.
Fuller, W. A. 1975. Regression analysis for sample survey. Sankhyā, Series C 37: 117–132.
Godambe, V. P., ed. 1991. Estimating Functions. Oxford: Oxford University Press.
Kish, L., and M. R. Frankel. 1974. Inference from complex samples. Journal of the Royal Statistical Society, Series
B 36: 1–37.
Kolenikov, S. 2010. Resampling variance estimation for complex survey data. Stata Journal 10: 165–199.
McCarthy, P. J., and C. B. Snowden. 1985. The bootstrap and finite population sampling. In Vital and Health Statistics,
1–23. Washington, DC: U.S. Government Printing Office.
Rao, J. N. K., and C. F. J. Wu. 1988. Resampling inference with complex survey data. Journal of the American
Statistical Association 83: 231–241.
Rao, J. N. K., C. F. J. Wu, and K. Yue. 1992. Some recent work on resampling methods for complex surveys. Survey
Methodology 18: 209–217.
Särndal, C.-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. New York: Springer.
Satterthwaite, F. E. 1941. Synthesis of variance. Psychometrika 6: 309–316.
Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2: 110–114.
Yeo, D., H. Mantel, and T.-P. Liu. 1999. Bootstrap variance estimation for the National Population Health Survey.
In Proceedings of the Survey Research Methods Section, 778–785. American Statistical Association.
Yung, W. 1997. Variance estimation for public use files under confidentiality constraints. In Proceedings of the Survey
Research Methods Section, 434–439. American Statistical Association.
Also see
Glossary
100% sample. See census.
balanced repeated replication. Balanced repeated replication (BRR) is a method of variance estimation
for designs with two PSUs in every stratum. The BRR variance estimator tends to give more reasonable
variance estimates for this design than does the linearized variance estimator, which can result in
large values and undesirably wide confidence intervals. The BRR variance estimator is described
in [SVY] variance estimation.
bootstrap. The bootstrap is a method of variance estimation. The bootstrap variance estimator for
survey data is described in [SVY] variance estimation.
BRR. See balanced repeated replication.
census. When a census of the population is conducted, every individual in the population participates
in the survey. Because of the time, cost, and other constraints, the data collected in a census are
typically limited to items that can be quickly and easily determined, usually through a questionnaire.
cluster. A cluster is a collection of individuals that are sampled as a group. Although the cost in time
and money can be greatly decreased, cluster sampling usually results in larger variance estimates
when compared with designs in which individuals are sampled independently.
DEFF and DEFT. DEFF and DEFT are design effects. Design effects compare the sample-to-sample
variability from a given survey dataset with a hypothetical SRS design with the same number of
individuals sampled from the population.
DEFF is the ratio of two variance estimates. The design-based variance is in the numerator; the
hypothetical SRS variance is in the denominator.
DEFT is the ratio of two standard-error estimates. The design-based standard error is in the
numerator; the hypothetical SRS with-replacement standard error is in the denominator. If the given
survey design is sampled with replacement, DEFT is the square root of DEFF.
delta method. See linearization.

design effects. See DEFF and DEFT.
direct standardization. Direct standardization is an estimation method that allows comparing rates
that come from different frequency distributions.
Estimated rates (means, proportions, and ratios) are adjusted according to the frequency distribution
from a standard population. The standard population is partitioned into categories called standard
strata. The stratum frequencies for the standard population are called standard weights. The
standardizing frequency distribution typically comes from census data, and the standard strata are
most commonly identified by demographic information such as age, sex, and ethnicity.
finite population correction. Finite population correction (FPC) is an adjustment applied to the variance
of a point estimator because of sampling without replacement, resulting in variance estimates that
are smaller than the variance estimates from comparable with-replacement sampling designs.
FPC. See finite population correction.
Hadamard matrix. A Hadamard matrix is a square matrix with $r$ rows and columns that has the
property
\[
H_r' H_r = r I_r
\]
where $I_r$ is the identity matrix of order $r$. Generating a Hadamard matrix with order $r = 2^p$
is easily accomplished. Start with a Hadamard matrix of order 2 ($H_2$), and build your $H_r$ by
repeatedly applying Kronecker products with $H_2$.
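The Kronecker-product construction can be sketched in a few lines of Python. The function names are hypothetical, and the starting matrix $H_2$ with rows $(1, 1)$ and $(1, -1)$ is the usual choice rather than something this glossary entry specifies.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def hadamard(p):
    """Hadamard matrix of order 2**p (p >= 1) from repeated products with H2."""
    H2 = [[1, 1], [1, -1]]
    H = H2
    for _ in range(p - 1):
        H = kron(H, H2)
    return H

H8 = hadamard(3)
for row in H8:
    print(row)
```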
jackknife. The jackknife is a data-dependent way to estimate the variance of a statistic, such as a
mean, ratio, or regression coefficient. Unlike BRR, the jackknife can be applied to practically any
survey design. The jackknife variance estimator is described in [SVY] variance estimation.
linearization. Linearization is short for Taylor linearization. Also known as the delta method or
the Huber/White/robust sandwich variance estimator, linearization is a method for deriving an
approximation to the variance of a point estimator, such as a ratio or regression coefficient. The
linearized variance estimator is described in [SVY] variance estimation.
MEFF and MEFT. MEFF and MEFT are misspecification effects. Misspecification effects compare
the variance estimate from a given survey dataset with the variance from a misspecified model. In
Stata, the misspecified model is fit without weighting, clustering, or stratification.
MEFF is the ratio of two variance estimates. The design-based variance is in the numerator; the
misspecified variance is in the denominator.

MEFT is the ratio of two standard-error estimates. The design-based standard error is in the
numerator; the misspecified standard error is in the denominator. MEFT is the square root of MEFF.
misspecification effects. See MEFF and MEFT.

point estimate. A point estimate is another name for a statistic, such as a mean or regression
coefficient.
poststratification. Poststratification is a method for adjusting sampling weights, usually to account
for underrepresented groups in the population. This usually results in decreased bias because of
nonresponse and underrepresented groups in the population. Poststratification also tends to result
in smaller variance estimates.
The population is partitioned into categories, called poststrata. The sampling weights are adjusted
so that the sum of the weights within each poststratum is equal to the respective poststratum size.
The poststratum size is the number of individuals in the population that are in the poststratum.
The frequency distribution of the poststrata typically comes from census data, and the poststrata
are most commonly identified by demographic information such as age, sex, and ethnicity.
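The weight adjustment described in this entry can be illustrated with a short Python sketch. The function name, the labels, and the poststratum sizes are made up for the example; real poststratum sizes would come from census or other external counts.

```python
def poststratify(weights, poststrata, sizes):
    """Rescale weights so each poststratum's weights sum to its population size.

    weights: list of sampling weights
    poststrata: list of poststratum labels, one per observation
    sizes: {label: number of individuals in the population}
    """
    sums = {}
    for w, g in zip(weights, poststrata):
        sums[g] = sums.get(g, 0.0) + w
    return [w * sizes[g] / sums[g] for w, g in zip(weights, poststrata)]

weights = [10.0, 20.0, 30.0, 40.0]
groups = ["a", "a", "b", "b"]
adjusted = poststratify(weights, groups, {"a": 60, "b": 35})
print(adjusted)
```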
predictive margins. Predictive margins provide a way of exploring the response surface of a fitted
model in any response metric of interest: means, linear predictions, probabilities, marginal effects,
risk differences, and so on. Predictive margins are estimates of responses (or outcomes) for
the groups represented by the levels of a factor variable, controlling for the differing covariate
distributions across the groups. They are the survey-data and nonlinear response analogue to what
are often called estimated marginal means or least-squares means for linear models.
Because these margins are population-weighted averages over the estimation sample or subsamples,
and because they take account of the sampling distribution of the covariates, they can be used to
make inferences about treatment effects for the population.
primary sampling unit. Primary sampling unit (PSU) is a cluster that was sampled in the first sampling
stage; see cluster.
probability weight. Probability weight is another term for sampling weight.
pseudolikelihood. A pseudolikelihood is a weighted likelihood that is used for point estimation.
Pseudolikelihoods are not true likelihoods because they do not represent the distribution function
for the sample data from a survey. The sampling distribution is instead determined by the survey
design.
PSU. See primary sampling unit.

replicate-weight variable. A replicate-weight variable contains sampling weight values that were
adjusted for resampling the data; see [SVY] variance estimation for more details.
resampling. Resampling refers to the process of sampling from the dataset. In the delete-one jackknife,
the dataset is resampled by dropping one PSU and producing a replicate of the point estimates. In
the BRR method, the dataset is resampled by dropping combinations of one PSU from each stratum.
The resulting replicates of the point estimates are used to estimate their variances and covariances.
sample. A sample is the collection of individuals in the population that were chosen as part of the
survey. Sample is also used to refer to the data, typically in the form of answered questions,
collected from the sampled individuals.
sampling stage. Complex survey data are typically collected using multiple stages of clustered
sampling. In the first stage, the PSUs are independently selected within each stratum. In the second
stage, smaller sampling units are selected within the PSUs. In later stages, smaller and smaller
sampling units are selected within the clusters from the previous stage.
sampling unit. A sampling unit is an individual or collection of individuals from the population that
can be selected in a specific stage of a given survey design. Examples of sampling units include
city blocks, high schools, hospitals, and houses.
sampling weight. Given a survey design, the sampling weight for an individual is the reciprocal of
the probability of being sampled. The probability for being sampled is derived from stratification
and clustering in the survey design. A sampling weight is typically considered to be the number
of individuals in the population represented by the sampled individual.
sampling with and without replacement. Sampling units may be chosen more than once in designs
that use sampling with replacement. Sampling units may be chosen at most once in designs that
use sampling without replacement. Variance estimates from with-replacement designs tend to be
larger than those from corresponding without-replacement designs.
SDR. See successive difference replication.
secondary sampling unit. Secondary sampling unit (SSU) is a cluster that was sampled from within a
PSU in the second sampling stage. SSU is also used as a generic term for any sampling
unit that is not from the first sampling stage.
simple random sample. In a simple random sample (SRS), individuals are independently sampled
each with the same probability of being chosen.
SRS. See simple random sample.
SSU. See secondary sampling unit.
standard strata. See direct standardization.
standard weights. See direct standardization.
stratification. The population is partitioned into well-defined groups of individuals, called strata.
In the first sampling stage, PSUs are independently sampled from within each stratum. In later
sampling stages, SSUs are independently sampled from within each stratum for that stage.
Survey designs that use stratification typically result in smaller variance estimates than do similar
designs that do not use stratification. Stratification is most effective in decreasing variability when
sampling units are more similar within the strata than between them.
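The effect described above can be sketched numerically. The example is written in Python rather than Stata, and the two strata and their values are hypothetical, chosen so that units are similar within strata and different between them. It compares the textbook design variance of the estimated mean under simple random sampling without replacement against a stratified design with one unit drawn per stratum.

```python
from statistics import variance

# Hypothetical population partitioned into two homogeneous strata.
strata = {"A": [1, 2, 3], "B": [10, 11, 12]}
population = [y for values in strata.values() for y in values]
N = len(population)

# Unstratified SRS without replacement, n = 2 units in total.
n = 2
var_srs = (1 - n / N) * variance(population) / n

# Stratified design, n_h = 1 unit per stratum (also 2 units in total):
# Var = sum over strata of W_h^2 * (1 - n_h/N_h) * S_h^2 / n_h, with W_h = N_h / N.
n_h = 1
var_strat = sum(
    (len(v) / N) ** 2 * (1 - n_h / len(v)) * variance(v) / n_h
    for v in strata.values()
)

print(f"SRS variance:        {var_srs:.4f}")
print(f"stratified variance: {var_strat:.4f}")
```

Because the between-stratum differences are removed from the sampling variability, the stratified variance is far smaller here. With strata that are no more homogeneous than the population as a whole, the gain would largely disappear.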
subpopulation estimation. Subpopulation estimation focuses on computing point and variance estimates for part of the population. The variance estimates measure the sample-to-sample variability,
assuming that the same survey design is used to select individuals for observation from the
population. This approach results in a different variance than measuring the sample-to-sample variability by restricting the samples to individuals within the subpopulation; see [SVY] subpopulation
estimation.
successive difference replication. Successive difference replication (SDR) is a method of variance
estimation typically applied to systematic samples, where the observed sampling units are ordered.
The SDR variance estimator is described in [SVY] variance estimation.
survey data. Survey data consist of information about individuals that were sampled from a population
according to a survey design. Survey data are distinguished from other forms of data by the
complex manner in which individuals are selected from the population.
In survey data analysis, the sample is used to draw inferences about the population. Furthermore,
the variance estimates measure the sample-to-sample variability that results from the survey design
applied to the fixed population. This approach differs from standard statistical analysis, in which the
sample is used to draw inferences about a physical process and the variance measures the sample-to-sample variability that results from independently collecting the same number of observations
from the same process.
survey design. A survey design describes how to sample individuals from the population. Survey
designs typically include stratification and cluster sampling at one or more stages.
Taylor linearization. See linearization.
variance estimation. Variance estimation refers to the collection of methods used to measure the
amount of sample-to-sample variation of point estimates; see [SVY] variance estimation.
Subject and author index

This is the subject and author index for the Survey Data
Reference Manual. Readers interested in topics other
than survey data should see the combined subject index
(and the combined author index) in the Glossary and
Index.
Symbols
100% sample, [SVY] Glossary
A
Archer, K. J., [SVY] estat
association test, [SVY] svy: tabulate twoway
B
balanced repeated replication, [SVY] brr options,
[SVY] svy brr, [SVY] variance estimation,
[SVY] Glossary
balanced repeated replication standard errors, [SVY] svy
brr, [SVY] variance estimation
Berglund, P. A., [SVY] subpopulation estimation,
[SVY] survey
Binder, D. A., [SVY] svy estimation, [SVY] variance
estimation
bivariate probit regression, [SVY] svy estimation
bootstrap estimation, [SVY] bootstrap options,
[SVY] svy bootstrap, [SVY] variance
estimation, [SVY] Glossary
bootstrap options, [SVY] bootstrap options
bootstrap standard errors, [SVY] svy bootstrap,
BRR, see balanced repeated replication
brr options, [SVY] brr options
C
categorical data, [SVY] svy estimation,
[SVY] svy: tabulate oneway,
census, [SVY] Glossary
data, [SVY] direct standardization, [SVY] survey,
certainty
strata, [SVY] estat
units, [SVY] variance estimation
chi-squared, test of independence, [SVY] svy: tabulate
twoway
cluster, [SVY] survey, [SVY] svy estimation,
[SVY] svyset, [SVY] variance estimation,
[SVY] Glossary
Cochran, W. G., [SVY] estat, [SVY] subpopulation
estimation, [SVY] survey, [SVY] svyset,
coefficient of variation, [SVY] estat
Collins, E., [SVY] survey, [SVY] svy estimation
complementary log-log regression, [SVY] svy
estimation
conditional (fixed-effects) logistic regression, [SVY] svy
estimation
confidence interval, [SVY] variance estimation
for tabulated proportions, [SVY] svy: tabulate
twoway
linear combinations, [SVY] svy postestimation
constrained linear regression, [SVY] svy estimation
contingency tables, [SVY] svy: tabulate twoway
contrast command, [SVY] svy postestimation
correlated errors, see robust, Huber/White/sandwich
estimator of variance
count-time data, [SVY] svy estimation
Cox, C. S., [SVY] survey, [SVY] svy estimation
Cox proportional hazards model, [SVY] svy estimation
cv, estat subcommand, [SVY] estat
D
data, survey, see survey data
DEFF, see design effects
DEFT, see design effects
delta method, [SVY] variance estimation,
[SVY] Glossary
Demnati, A., [SVY] direct standardization,
[SVY] poststratification, [SVY] variance
estimation
design effects, [SVY] estat, [SVY] svy: tabulate
oneway, [SVY] svy: tabulate twoway,
[SVY] Glossary
Deville, J.-C., [SVY] direct standardization,
estimation
differences of two means test, [SVY] svy
postestimation
direct standardization, [SVY] direct standardization,
[SVY] Glossary
E
effects, estat subcommand, [SVY] estat
Eltinge, J. L., [SVY] estat, [SVY] survey, [SVY] svy
postestimation, [SVY] svydescribe,
endogenous variable, [SVY] svy estimation
Engel, A., [SVY] estat, [SVY] subpopulation
estimation, [SVY] survey, [SVY] svy,
[SVY] svy brr, [SVY] svy estimation,
[SVY] svy jackknife, [SVY] svy
postestimation, [SVY] svy: tabulate oneway,
[SVY] svy: tabulate twoway, [SVY] svydescribe
equality test of coefficients, [SVY] svy postestimation
equality test of means, [SVY] svy postestimation
estat
cv command, [SVY] estat
effects command, [SVY] estat
gof command, [SVY] estat
lceffects command, [SVY] estat
sd command, [SVY] estat
size command, [SVY] estat
strata command, [SVY] estat
svyset command, [SVY] estat
vce command, [SVY] estat
estimates command, [SVY] svy postestimation
exp list, [SVY] svy bootstrap, [SVY] svy brr,
[SVY] svy jackknife, [SVY] svy sdr
F
failure-time model, see survival analysis
Fay, R. E., [SVY] survey, [SVY] svy sdr,
Feldman, J. J., [SVY] survey, [SVY] svy estimation
finite population correction, [SVY] survey, [SVY] svy
estimation, [SVY] svyset, [SVY] variance
FPC, see finite population correction
Frankel, M. R., [SVY] variance estimation
Freeman, D. H., Jr., [SVY] svy: tabulate twoway
Freeman, J. L., [SVY] svy: tabulate twoway
frequencies, table of, [SVY] svy: tabulate oneway,
Fuller, W. A., [SVY] svy: tabulate twoway,
G
generalized
linear models, [SVY] svy estimation
negative binomial regression, [SVY] svy estimation
Gerow, K. G., [SVY] survey
Godambe, V. P., [SVY] variance estimation
gof, estat subcommand, [SVY] estat
Golden, C. D., [SVY] survey, [SVY] svy estimation
Gonzalez, J. F., Jr., [SVY] estat, [SVY] subpopulation
estimation, [SVY] svy bootstrap, [SVY] svy
estimation
goodness of fit, [SVY] estat
Gould, W. W., [SVY] ml for svy, [SVY] survey
Graubard, B. I., [SVY] direct standardization,
[SVY] estat, [SVY] survey, [SVY] svy,
[SVY] svy estimation, [SVY] svy
postestimation, [SVY] svy: tabulate twoway,
H
Hadamard matrix, [SVY] svy brr, [SVY] Glossary
Heckman selection model, [SVY] svy estimation
Heeringa, S. G., [SVY] subpopulation estimation,
[SVY] survey
heteroskedastic probit regression, [SVY] svy estimation
heteroskedasticity robust variances, see robust,
Holt, D., [SVY] estat, [SVY] survey
Hosmer–Lemeshow goodness of fit, [SVY] estat
Huber/White/sandwich estimator of variance, see robust,
Huber/White/sandwich estimator of variance
I
independence test, [SVY] svy: tabulate twoway
instrumental-variables regression, [SVY] svy estimation
interval regression, [SVY] svy estimation
J
jackknife estimation, [SVY] jackknife options,
[SVY] svy jackknife, [SVY] variance
jackknife options, [SVY] jackknife options
jackknife standard errors, [SVY] svy jackknife,
Jang, D. S., [SVY] variance estimation
Jann, B., [SVY] svy: tabulate twoway
Johnson, W., [SVY] survey
Judkins, D. R., [SVY] svy brr, [SVY] svyset,
K
Kennedy, W. J., Jr., [SVY] svy: tabulate twoway
Kish design effects, [SVY] estat
Kish, L., [SVY] estat, [SVY] survey, [SVY] variance
estimation
Koch, G. G., [SVY] svy: tabulate twoway
Kolenikov, S., [SVY] svy bootstrap, [SVY] variance
estimation
Korn, E. L., [SVY] direct standardization,
[SVY] estat, [SVY] survey, [SVY] svy,
Krauss, N., [SVY] estat, [SVY] subpopulation
estimation
Kreuter, F., [SVY] survey
L
Lane, M. A., [SVY] survey, [SVY] svy estimation
lceffects, estat subcommand, [SVY] estat
Lemeshow, S. A., [SVY] estat,
[SVY] poststratification, [SVY] survey
Levy, P. S., [SVY] poststratification, [SVY] survey
Lin, D. Y., [SVY] svy estimation
lincom command, [SVY] svy postestimation
Lindelow, M., [SVY] svy estimation, [SVY] svyset
linear combinations, [SVY] estat, [SVY] svy
postestimation
linear regression, [SVY] svy estimation
linearization, see linearized variance estimator
linearized variance estimator, [SVY] variance
Liu, T.-P., [SVY] svy bootstrap, [SVY] variance
estimation
logistic and logit regression, [SVY] svy estimation
log-linear model, [SVY] svy estimation
longitudinal survey data, [SVY] svy estimation
M
Madans, J. H., [SVY] survey, [SVY] svy estimation
Mantel, H., [SVY] svy bootstrap, [SVY] variance
estimation
margins command, [SVY] svy postestimation
Massey, J. T., [SVY] estat, [SVY] subpopulation
Maurer, K., [SVY] estat, [SVY] subpopulation
maximum pseudolikelihood estimation, [SVY] ml for
svy, [SVY] variance estimation
McCabe, S. E., [SVY] estat
McCarthy, P. J., [SVY] survey, [SVY] svy bootstrap,
[SVY] svy brr, [SVY] variance estimation
McDowell, A., [SVY] estat, [SVY] subpopulation
means, survey data, [SVY] svy estimation
MEFF, see misspecification effects
MEFT, see misspecification effects
Mendenhall, W., III, [SVY] survey
Midthune, D., [SVY] estat, [SVY] svy estimation
Miller, H. W., [SVY] survey, [SVY] svy estimation
misspecification effects, [SVY] estat, [SVY] Glossary
ml command, [SVY] ml for svy
model coefficients test, [SVY] svy postestimation
multinomial
logistic regression, [SVY] svy estimation
probit regression, [SVY] svy estimation
multistage clustered sampling, [SVY] survey,
[SVY] svydescribe, [SVY] svyset
Murphy, R. S., [SVY] survey, [SVY] svy estimation
Mussolino, M. E., [SVY] survey, [SVY] svy estimation
N
Neyman allocation, [SVY] estat
nlcom command, [SVY] svy postestimation
nonconstant variance, [SVY] variance estimation
nonlinear combinations, predictions, and tests,
nonlinear least squares, [SVY] svy estimation
nonlinear test, [SVY] svy postestimation
O
odds ratio, [SVY] svy estimation
differences, [SVY] svy postestimation
O'Donnell, O., [SVY] svy estimation, [SVY] svyset
ordered
logistic regression, [SVY] svy estimation
probit with sample selection, [SVY] svy estimation
Ott, R. L., [SVY] survey
P
parametric survival models, [SVY] svy estimation
Park, H. J., [SVY] svy: tabulate twoway
Pitblado, J. S., [SVY] ml for svy, [SVY] survey
Poi, B. P., [SVY] ml for svy, [SVY] survey
point estimate, [SVY] Glossary
Poisson regression, [SVY] svy estimation
polytomous logistic regression, [SVY] svy estimation
population standard deviation, see subpopulation,
standard deviations of
postestimation command, [SVY] estat, [SVY] svy
postestimation
poststratification, [SVY] poststratification,
[SVY] Glossary
predict command, [SVY] svy postestimation
predictions, [SVY] svy postestimation
predictive margins, [SVY] Glossary
predictnl command, [SVY] svy postestimation
primary sampling unit, [SVY] svydescribe,
[SVY] svyset, [SVY] Glossary
probability weight, see sampling weight
with endogenous regressors, [SVY] svy estimation
with sample selection, [SVY] svy estimation
proportional hazards model, [SVY] svy estimation
proportions, survey data, [SVY] svy estimation,
pseudolikelihood, [SVY] Glossary
PSU, see primary sampling unit
pwcompare command, [SVY] svy postestimation
pweight, see sampling weight
Q
quadratic terms, [SVY] svy postestimation
qualitative dependent variables, [SVY] svy estimation
R
Rao, J. N. K., [SVY] direct standardization,
[SVY] poststratification, [SVY] svy bootstrap,
[SVY] svy: tabulate twoway, [SVY] variance
estimation
ratios, survey data, [SVY] svy estimation,
regression diagnostics, [SVY] estat, [SVY] svy
postestimation
replicate-weight variable, [SVY] survey, [SVY] svy
bootstrap, [SVY] svy brr, [SVY] svy jackknife,
[SVY] svy sdr, [SVY] svyset, [SVY] Glossary
replication method, [SVY] svy bootstrap, [SVY] svy
brr, [SVY] svy jackknife, [SVY] svy sdr,
[SVY] svyset, [SVY] variance estimation
resampling, [SVY] Glossary
Research Triangle Institute, [SVY] svy: tabulate
twoway
robust, Huber/White/sandwich estimator of variance,
Rothwell, S. T., [SVY] survey, [SVY] svy estimation
S
sample, [SVY] Glossary
sampling, [SVY] survey, [SVY] svydescribe,
stage, [SVY] estat, [SVY] Glossary
unit, [SVY] survey, [SVY] Glossary, also see
primary sampling unit
weight, [SVY] poststratification, [SVY] survey,
[SVY] Glossary
with and without replacement, [SVY] Glossary
sandwich/Huber/White estimator of variance, see robust,
Särndal, C.-E., [SVY] variance estimation
Satterthwaite, F. E., [SVY] variance estimation
Scheaffer, R. L., [SVY] survey
Schnell, D., [SVY] svy: tabulate twoway
Scott, A. J., [SVY] estat, [SVY] svy: tabulate twoway
Scott, C., [SVY] estat, [SVY] subpopulation
estimation
sd, estat subcommand, [SVY] estat
SDR, see successive difference replication
sdr options, [SVY] sdr options
secondary sampling unit, [SVY] Glossary
selection models, [SVY] svy estimation
Shah, B. V., [SVY] direct standardization,
estimation
Shao, J., [SVY] survey, [SVY] svy jackknife,
simple random sample, [SVY] Glossary
singleton strata, [SVY] estat, [SVY] variance
estimation
size, estat subcommand, [SVY] estat
skewed logistic regression, [SVY] svy estimation
Skinner, C. J., [SVY] estat, [SVY] survey, [SVY] svy
estimation, [SVY] variance estimation
Smith, T. M. F., [SVY] survey
Snowden, C. B., [SVY] svy bootstrap, [SVY] variance
estimation
Sribney, W. M., [SVY] estat, [SVY] svy
[SVY] svydescribe
SRS, see simple random sample
SSU, see secondary sampling unit
standard deviations, subpopulations, see subpopulation,
standard deviations of
standard errors,
balanced repeated replication, see balanced repeated
replication standard errors
bootstrap, see bootstrap standard errors
jackknife, see jackknife standard errors
successive difference replication, see successive
difference replication standard errors
standard strata, see direct standardization
standard weights, see direct standardization
stereotype logistic regression, [SVY] svy estimation
strata, estat subcommand, [SVY] estat
strata with one sampling unit, [SVY] variance
estimation
stratification, see stratified sampling
stratified sampling, [SVY] survey, [SVY] svydescribe,
stratum collapse, [SVY] svydescribe
structural equation modeling, [SVY] svy estimation
Stuart, A., [SVY] survey
subpopulation
differences, [SVY] survey, [SVY] svy
postestimation
estimation, [SVY] subpopulation estimation,
[SVY] svy estimation, [SVY] Glossary
means, [SVY] svy estimation
proportions, [SVY] svy estimation,
ratios, [SVY] svy estimation, [SVY] svy: tabulate
oneway, [SVY] svy: tabulate twoway
standard deviations of, [SVY] estat
totals, [SVY] svy estimation, [SVY] svy: tabulate
successive difference replication, [SVY] sdr options,
[SVY] svy sdr, [SVY] variance estimation,
[SVY] Glossary
successive difference replication standard errors,
[SVY] svy sdr, [SVY] variance estimation
suest command, [SVY] svy postestimation
Sullivan, G., [SVY] svy: tabulate twoway
summarizing data, [SVY] svy: tabulate twoway
survey
data, [SVY] survey, [SVY] svydescribe,
design, [SVY] Glossary
postestimation, [SVY] svy postestimation
prefix command, [SVY] svy
[SVY] svyset
survival analysis, [SVY] svy estimation
survival models, [SVY] svy estimation
svy: biprobit command, [SVY] svy estimation
svy: clogit command, [SVY] svy estimation
svy: cloglog command, [SVY] svy estimation
svy: cnsreg command, [SVY] svy estimation
svy: etregress command, [SVY] svy estimation
svy: glm command, [SVY] svy estimation
svy: gnbreg command, [SVY] svy estimation
svy: heckman command, [SVY] svy estimation
svy: heckoprobit command, [SVY] svy estimation
svy: heckprobit command, [SVY] svy estimation
svy: hetprobit command, [SVY] svy estimation
svy: intreg command, [SVY] svy estimation
svy: ivprobit command, [SVY] svy estimation
svy: ivregress command, [SVY] svy estimation
svy: ivtobit command, [SVY] svy estimation
svy: logistic command, [SVY] svy estimation,
svy: logit command, [SVY] svy estimation
svy: mean command, [SVY] estat,
[SVY] poststratification, [SVY] subpopulation
postestimation, [SVY] svydescribe,
[SVY] svyset
svy: mlogit command, [SVY] svy estimation
svy: mprobit command, [SVY] svy estimation
svy: nbreg command, [SVY] svy estimation
svy: nl command, [SVY] svy estimation
svy: ologit command, [SVY] svy estimation,
svy: oprobit command, [SVY] svy estimation
svy: poisson command, [SVY] svy estimation
svy: probit command, [SVY] svy estimation
svy: proportion command, [SVY] svy estimation
svy: ratio command, [SVY] direct standardization,
svy: regress command, [SVY] survey, [SVY] svy,
[SVY] svy estimation, [SVY] svy jackknife,
svy: scobit command, [SVY] svy estimation
svy: sem command, [SVY] svy estimation
svy: slogit command, [SVY] svy estimation
svy: stcox command, [SVY] svy estimation
svy: streg command, [SVY] svy estimation
svy: tabulate command, [SVY] svy: tabulate
svy: tnbreg command, [SVY] svy estimation
svy: tobit command, [SVY] svy estimation
svy: total command, [SVY] svy brr, [SVY] svy
estimation
svy: tpoisson command, [SVY] svy estimation
svy: truncreg command, [SVY] svy estimation

svy: zinb command, [SVY] svy estimation
svy: zip command, [SVY] svy estimation
svy bootstrap prefix command, [SVY] svy bootstrap
svy brr prefix command, [SVY] svy brr
svy jackknife prefix command, [SVY] svy jackknife
svy prefix command, [SVY] svy
svy sdr prefix command, [SVY] svy sdr
svydescribe command, [SVY] survey,
[SVY] svydescribe
svymarkout command, [SVY] svymarkout
svyset command, [SVY] survey, [SVY] svyset
svyset, estat subcommand, [SVY] estat
Swensson, B., [SVY] variance estimation
T
tables
contingency, [SVY] svy: tabulate twoway
frequency, [SVY] svy: tabulate oneway,
tabulate
one-way, [SVY] svy: tabulate oneway
two-way, [SVY] svy: tabulate twoway
Taylor linearization, see linearized variance estimator
test,
association, see association test
differences of two means, see differences of two
means test
equality of
coefficients, see equality test of coefficients
means, see equality test of means
goodness-of-fit, see goodness of fit
independence, see independence test
model coefficients, see model coefficients test
nonlinear, see nonlinear test
Wald, see Wald test
test command, [SVY] survey, [SVY] svy
postestimation
testnl command, [SVY] svy postestimation
testparm command, [SVY] svy postestimation
Thomas, D. R., [SVY] svy: tabulate twoway
Thompson, S. K., [SVY] survey
tobit regression, [SVY] svy estimation
with endogenous regressors, [SVY] svy estimation
totals, survey data, [SVY] svy estimation
Train, G. F., [SVY] survey, [SVY] svy sdr,
treatment-effects regression, [SVY] svy estimation
truncated
regression, [SVY] svy estimation
Tu, D., [SVY] survey, [SVY] svy jackknife,
Tukey, J. W., [SVY] svy jackknife
two-stage least squares, [SVY] svy estimation
V
Valliant, R., [SVY] survey
van Doorslaer, E., [SVY] svy estimation, [SVY] svyset
variance,
Huber/White/sandwich estimator, see robust,
linearized, [SVY] variance estimation
nonconstant, [SVY] variance estimation
variance estimation, [SVY] variance estimation,
[SVY] Glossary
vce, estat subcommand, [SVY] estat
W
Wagstaff, A., [SVY] svy estimation, [SVY] svyset
Wald test, [SVY] svy postestimation
Wei, L. J., [SVY] svy estimation
weights,
probability, [SVY] survey, [SVY] svydescribe,
[SVY] svyset
[SVY] svyset
West, B. T., [SVY] estat, [SVY] subpopulation
estimation, [SVY] survey
White/Huber/sandwich estimator of variance, see robust,
Williams, B., [SVY] survey
Winter, N. J. G., [SVY] survey
Wolter, K. M., [SVY] survey, [SVY] svy brr,
Wretman, J., [SVY] variance estimation
Wu, C. F. J., [SVY] svy bootstrap, [SVY] variance
estimation
Y
Yeo, D., [SVY] svy bootstrap, [SVY] variance
estimation
Yue, K., [SVY] svy bootstrap, [SVY] variance
estimation
Yung, W., [SVY] svy bootstrap, [SVY] variance
estimation
Z
zero-inflated
