Robust estimation in the presence of deviations from linearity
in small domain models November 2018
Julie Gershunskaya, Terrance D. Savitsky1
U.S. Bureau of Labor Statistics, 2 Massachusetts Ave NE, Suite 4985, Washington, DC,
20212
Abstract:
Small domain estimation models, like the Fay-Herriot, often assume a normally distributed
latent process centered on a linear mean function. The linearity assumption may be violated
for domains that express idiosyncratic phenomena not captured by the predictors. Under a
single component normal distribution prior for the random effects, direct sample estimates
for those domains would be viewed as if they were outliers with respect to the model, when
in fact they may reflect the underlying true population value. The model interpretation is
also confounded by the variances of direct sample estimates because, while typically
treated as fixed and known, they are estimates and thus contain noise. In this paper, we
construct a joint model for the direct estimates and their variances where we replace the
normal distribution for the latent process with a nonparametric mixtures of normal
distributions with the goal to improve robustness in estimation quality for these
idiosyncratic domains. We devise a model-based screening tool that leverages the
posterior predictive distribution under the model to nominate domains where the model
may not accurately account for deviations from the linearity assumption. Our screening
tool nominates a few domains to allow for a focused investigation to determine whether a
deviation from linearity is real. The U.S. Bureau of Labor Statistics’ Current Employment
Statistics (CES) survey publishes monthly employment estimates for domains defined by
industry and geography. Model estimation is performed for smaller domains to improve
the reliability of the direct estimator. We compare fit performances for our candidate
models under data constructed to be similar to the CES and conduct a simulation study to
assess the robustness of our candidate models in the presence of deviations from
linearity. We apply our model-based screening method and quantify its ability to improve
the quality of published estimates.
Key Words: Bayesian Hierarchical Modeling, Posterior predictive distribution, False
discovery rate, Dirichlet process, Fay-Herriot, Variational Bayes, Stan
1. Introduction
Large government surveys, such as the Current Employment Statistics (CES) survey
considered in this paper, are designed to produce high quality sample-based estimates for
a number of state and national levels. More detailed geographical and industrial domains
often contain a small number of sample units (e.g., business establishments). Direct
sample-based estimates at these detailed levels are not reliable, and models are used to
1
Any opinions expressed in this paper are those of the authors and do not constitute
policy of the Bureau of Labor Statistics.
improve the quality of the estimates. One of the most popular models is the classical FayHerriot (FH) model (Fay and Herriot 1979). The FH model yields an estimator that can be
conveniently presented in the form of a weighted average of the direct sample-based
estimator and a so-called “synthetic” component. Both the synthetic component and the
mixture weights depend on specific distributional assumptions. Direct sample-based
estimates are used as the data input in the FH modeling. In the classical FH model,
variances of the direct sample point estimates are assumed to be fixed and known. In
reality, these variances are not known and sample-based variances can be plugged in as if
they were true variances. However, such sample-based estimates of variances contain
noise.
The usual practice to address noise in sampling-based variances is to smooth the noise by
using model-based estimates extracted from a generalized variance function (GVF). Such
GVF-based variances are implemented in a separate model from that for the direct point
estimates. Maiti et al. (2014) showed that co-modeling of direct point estimates and their
variances in the same model may improve estimates of both quantities as it would exploit
the relationship between the point estimates and their variances. Maiti et al. (2014)
proposed a solution within the frequentist paradigm, and Sugasawa et al. (2017) considered
a Bayesian approach.
In this paper, we extend Sugasawa et al. (2017) to include nonparametric probabilistic
clustering and apply it to estimates from the CES survey. Our clustering formulation
relaxes the assumption of normality of the random effects in the models for both the direct
point estimates and the variances as a means of addressing deviations in employment from
linearity assumption among industry domains implied in Sugasawa et al. (2017).
Employment may grow or decrease faster in some groups of domains included in the model
due to idiosyncratic effects in a particular domain that is not shared among other domains.
This phenomenon may be captured by imposing a mixture of normal distributions
assumption on the random effects.
The models considered may still fail to describe true population target in domains having
large deviations from the linearity, even under a prior formulation on random effects that
induces clustering of domains. So we devise a posterior predictive checking approach to
uncover domains that are not well described (or generated) by a particular model. We
identify such domains using a Bayesian multiple hypotheses testing approach. Each
domain’s probability of not being generated by the target model is considered in
conjunction with the overall false discovery rate (FDR) (Benjamini and Hochberg 1995),
to identify a relatively small number of “suspected” domains whose estimates are posited
as not having been generated by our joint model. The list of these domains may be sent to
analysts for review. Analysts may conclude that the deviation is due to a few outlying units
used in deriving the domain estimates, in which case the modeled estimates are accepted;
otherwise, analysts may decide that a particular domain’s deviation from the linearity
expresses real economic movement. In the case that the deviation from linearity is deemed
real, the modeled estimate for that particular domain would be replaced by the direct
estimate. We show in the sequel that making such replacements of selected model values
would be expected to notably improve estimation performance. Our testing approach is
also expected to be useful when applied to non-modeled domains.
We compare the performances of alternative models under synthetic data generated from
the Quarterly Census of Employment and Wages (QCEW), which is considered the gold
standard (because it lacks sampling error) for evaluation of CES estimates. Our simulation
results confirm that co-modeling of the direct point estimates and their variances leads to
improved estimates. We follow by conducting a simulation study to assess the robustness
of our models for capturing deviations from linearity among some domains.
We adopt the hierarchical Bayesian paradigm for development of the models. The code is
written in the Stan modeling language (Gelman et al. 2015) using a Variational Bayes
algorithm (Kucukelbir et al. 2017) implemented in RStan V2.15.1 package, which is the R
interface for the Stan modeling language (Gelman et al. 2015, Stan Development Team
2017).
The paper is organized as follows. The models considered in this paper are stated in Section
2. In Section 3, after a brief introduction of estimation procedures and the form of the
sample-based estimator used in CES, we discuss the results of application of the models to
the synthetic data generated by adding noise to the true historical series. In Section 4, we
conduct a robustness study of our candidate model formulations to assess their
performances under deviations from linearity. Section 5 introduces additional uses for our
models with large-sized domains where modeling is not traditionally performed because
the direct estimates are published; in particular, we introduce a model-based screening
procedure to identify a set of domains whose direct sample-based estimates are not
adequately described by the model. We conclude with a summary discussion in Section 6.
2. Description of the models
We start with the classical Fay-Herriot (FH) model (Fay and Herriot 1979.) Let yi be a
survey estimate of target parameter i for domain i. For each domain, i 1,..., N , assume
yi | i ~ N i , vi ,
(1)
i | , β, ~ N xTi β, u2 .
(2)
ind
2
u
ind
Sample estimated yi ’s are assumed to be normally distributed and unbiased for target
parameter i , with variances vi that are treated as known (equation (1)). Equation (2)
links true signal i to a vector of covariates xi via the linear regression by assuming the
T
normally distributed deviation of the true signal from “synthetic” part x i β (to
facilitate the ensuing description, we explicitly write the intercept term as .)
As noted, sampling variances vi in the FH model are considered fixed and known. In
practice, estimates of true variances are used. We consider two possibilities for the
treatment of vi : 1) using direct sample based estimates of true variances, which treats the
variances as fixed and known in a Fay Herriot model that we refer to as FH; 2) using a
smoothed estimator of variances that is plugged into the Fay-Herriot model. For this model
(referred to as FH-V), the estimation of the variances is performed, separately, in a first
step and then used as plug-in estimators for vi in the Fay-Herriot model. The first step of
the variance estimation is based on the same set of covariates as used in the models
described below. Note that this approach ignores any uncertainty in the estimation of the
variances, and so, is not a fully Bayesian approach (though we estimate the variance portion
of FH-V under a Bayesian construction).
In the next two models, rather than fixing the variances at the estimated value, vi , we view
direct sample-based estimates of variances as data and model them together with the vector
of point estimates yi in a fully Bayesian model specification.
Our first model that co-estimates the direct estimates and their variances is referred to as
FHS and is a modification of an approach considered by Sugasawa et al. (2017). Assume
the following hold for pair of direct survey estimates yi , vi for each domain i :
yi | i ,i2 ~ N i ,i2 ,
(3)
i | , β, u2 ~ N xTi β, u2 .
(4)
ind
ind
ind
an * an *
vi | a , i2 ~ G i , i2 ,
2 2 i
(5)
i2 | b, γ ~ IG 2, b exp ziT γ .
ind
(6)
Lines (3)-(4) are the usual FH assumptions on the point estimates yi and lines (5)-(6)
describe the variance model, where parameter i is the true sampling variance; zi is a
vector of covariates for the variance model for area i ; a , b, γ are the model parameters.
2
Note that in equation (5), estimated variances depend on the sample size ni , where for a
set of domains with unequal number of respondents, we use the standardized response size,
max n min n 0,1.
ni* ni min ni 1
i
i
i
i
i
Our assumption is somewhat
different from Maiti et al. (2014) and Sugasawa et al. (2017) as we include an additional
(unknown) parameter, a , to regulate the scale and shape of the distribution. In our
application, we found that for moderate sample sizes, using the sample size alone would
result in predicted variances that are overly similar to direct estimates of variances.
The normality assumption used in (2) and (4) may not be realistic. For example, if a single
T
or a handful of domains deviate significantly from x i β , assumption (2) of the FH
model would result in the under-shrinkage of the bulk of the observations. In the FHS
model, violation of assumption (4) would result in overestimation of sampling variances in
the domains where the signal deviates from the assumed linearity and, hence, in overshrinkage of estimates for these domains.
Our capstone model is termed FHSC and is designed to allow for deviations from linearity
T
assumption x i β for some subsets of domains. Namely, we replace the normal
distribution of assumption (4) with a finite mixture normal distributions. Specifically, we
assume the existence of K latent clusters having cluster specific intercepts k , for
k 1,..., K , and common variance u2 :
i | π, μ, β,u2 ~ k1k N k xTi β,u2
iid
K
(7)
In addition, the inverse gamma assumption in (6) can be relaxed by specifying a mixture
of the inverse gamma distributions with the cluster-specific shape parameter bk :
i2 | b, γ, π ~ k 1 k IG 2, bk exp ziT γ .
ind
K
FH
(8)
Table 1. FH, FHS, and FHSC models
FHS
yi | i ~ N i , vi
ind
i | , , u2 ~ N xi , u2
ind
FHSC
yi | i ,i2 ~ N i ,i2
yi | i ,i2 ~ N i ,i2
ind
i | , , ~ N xi ,
2
u
ind
ind
2
u
ind
an* an *
vi | a , i2 ~ G i , i2
2 2 i
ind
i2 | b, γ ~ IG 2, be z
T
i γ
i | π, μ, ,u2 ~ k1k N k xi ,u2
iid
K
ind
an * an *
vi | a , i2 ~ G i , i2
2 2 i
ind
i2 | b, γ, π ~ k 1 k IG 2, bk e z
K
| ~ N 0, 1
| ~ N 0, 1
| ~ N 0, 1
| ~ N 0, 1
| ~ N 0, 1
k | ~ N 0, 1
γ | ~ N p 0,
γ | ~ N p 0,
iid
π | ~ Dir K ,..., K
u2 , , ~ G 1,1 ,
u2 , , ~ G 1,1
log a ~ t3 0,1 ,
log b ~ t3 0,1 , prior
, u2 , , .bk ~ G 1,1 ,
log a ~ t3 0,1 ,
prior
It is reasonable to suppose that point estimates and estimates of their variances are related,
and we parameterize this assumption by assuming a common cluster structure for pairs,
k , bk . That is, each mixture / cluster component in the joint distribution for ,
share the same
.
T
i γ
The form for the Dirichlet prior, with hyperparameters set to K , induces a Dirichlet
process (DP) mixture formulation in the limit of the maximum number of allowable
mixture components, K (see Neal 2000). The larger is , the more of the K possible
mixture components (also referred to as clusters) will have k 0, so a further gamma
prior is imposed to allow the data to learn the number of mixture components.
Table 1 contains a summary of the three models considered in this paper (formulated for a
single covariate xi , for simplicity.)
3. Model fit comparison
We applied the models introduced in Section 2 to CES estimates of employment for
the period from October 2008 through September 2009. The quality of the employment
estimates can be assessed several months after their CES-based publication by comparing
the estimates to the census data, maintained by BLS’ Quarterly Census of Employment
and Wages (QCEW) program. The QCEW data become available with a lag of about 6 to
9 months, while the CES estimates provide timely snapshot of the economy on a monthly
basis.
CES domains are defined by intersections of industry and geography: industries in the CES
survey are defined by the North American Industry Classification System (NAICS); the
geographic resolution considered in this paper is the State level. Since the direct CES
survey estimates are used as input data in the proposed area-level models, we start by
briefly describing relevant details pertaining to construction of the CES estimator. A more
detailed description of the CES estimation procedures can be found in chapter 2 of Bureau
of Labor Statistics (2004).
For a given month, t , the target of the CES estimation is the change in employment from
the previous to current month. Consider a set of (geography-by-industry) domains,
i 1,..., N . The population ratio, Ri ,t , is the target employment change, defined as
Ri ,t
Yi ,t
,
Yi ,t 1
(9)
where Yi ,t is the employment level in domain i at month t.
The estimated relative change in employment level Rˆi ,t can be described as an adjusted
sample based estimator of the relative change
w y
rˆi ,t
j
jst
jt
i
w y
jst
i
j
,
j ,t 1
(10)
where y jt is the employment of business j at time t , w j is the sampling weight of unit
j , and st is a set of units sampled in domain i that provide non-zero employment inputs
i
in both previous and current months as a “matched” set of respondents. The presence of
matched sets of sampled units is typically high from one month to another but there are
also unmatched units; thus, there is an adjustment to rˆi ,t , yielding estimator Rˆi ,t of Ri ,t .
The adjustment is described in some detail, for example, in Gershunskaya and Savitsky
(2017) and is omitted here for brevity. In what follows, we assume Rˆi ,t to be an unbiased
estimator of target, Ri ,t . Monthly ratios Rˆi ,t , along with their respective sampling
variances vi ,t , constitute the domain-level data supplied for the modeling.
Estimates of employment levels for month t are obtained by multiplying estimated
previous month level, Yˆi ,t 1 , by the estimate of the relative employment change:
Yˆi ,t Yˆi ,t 1Rˆi ,t .
(11)
The corresponding estimate of the over-the-month employment change is
Yˆi ,t Yˆi ,t Yˆi ,t 1.
(12)
Every year, the estimation cycle starts at month 0 from a known QCEW-based employment
level Yi ,0 and after twelve months the CES estimated employment level Yˆi ,12 is compared
to the QCEW employment levels. Once a year, the CES estimated levels are revised to
reflect newly available QCEW levels (in a procedure commonly known as the annual
revision).
Figure 1 presents a plot of the estimation cycle. It shows monthly estimated levels for one
of the CES domains. The lines on the plot correspond to alternative (model-based)
estimates considered in the paper. The black line with solid circles is the target QCEW line.
The goal is to be closer to the QCEW line at the 12th month of the cycle. Direct samplebased estimates in small domains may be appreciably volatile. Model-based estimates
usually present various degree of smoothness compared to the direct estimates, as
exemplified in Figure 1.
The quality of CES estimates could be judged by their proximity to employment from
QCEW (which is considered a gold standard due to the absence of sampling error).
However, employment seasonal patterns in the QCEW are affected by the quarterly
submission of administrative data provided by units (business establishments). CES
estimates are unaffected by this quarterly seasonal influence due to a monthly submission
cycle. So we may not compare monthly QCEW and CES estimates. To overcome this
difficulty related to monthly comparisons of CES and QCEW and to facilitate focused
comparisons to true population figures, we provide results from the synthetic data. We
created synthetic data by adding Student’s t distributed noise to the QCEW series, thus
preserving the existing structure of the target. Our synthetic response expresses the
same seasonality as the QCEW series, facilitating month-by-month comparisons. The
relative fit performances of our candidate models are the same on both the real and
synthetic data sets because we use the QCEW to compose the synthetic data.
Figure 1: Domain #60 in Health Care and Social Assistance industry
(average number of responding units in the domain is 16.6)
In Table 2, we present mean absolute deviation (MAD) averaged over domains and months,
as follows:
MAD N 112 1 i 1 t 1 Yi ,t Yi ,t ,
N
12
where Yi ,t Yi ,t Yi ,t 1 , Yi ,t Yi ,t Yi ,t 1 , and Yi ,t , Yi ,t 1 signify estimates based on the
sample or on a model.
Results in Table 2 suggest that all model-based estimates are more efficient than the direct
sample estimates. The models that jointly model the point estimates and the variances,
FHS and FHSC, perform similarly to one another, but notably better than FH-V, which
separately models the point estimates and the variances because the point estimates and
variances are dependent.
Sampling variances fitted using our two models can be compared to the true
variances used to generate the synthetic data. Figure 2 presents an example of a
scatter plot, for all domains in one month in Health Care and Social Assistance
industry. Symbols (empty and filled circles and stars) correspond to domains and
show estimated variances versus true variances; stars represent the direct estimates
of variances; empty circles show estimated variances from the FHS model; filled
circles correspond to variance estimates from FHSC. The closer the symbols are to
the 45-degree line, the more accurate (less biased) are the estimates of the variances.
We observe that for the bulk of domains the FHSC model variance estimates lie
along the 45-degree line.
Table 2. Simulated data, over-the-months results
Ind
N Direct FH FH-V FHS FHSC
600
285 220
205 188
204
1000
795 503
507 444
448
2000 1692
328 275
279 273
277
3100 2808
275 213
185 187
192
3200 1680
396 252
234 213
221
4100 1488
3432
613
350
315
278
276
4200
456 292
300 265
266
4300 2328
996
282 206
187 188
194
5000
391 267
232 221
216
5500 1788
481 314
314 290
292
6054 1800
540
213 176
162 158
167
6055
923 689
642 601
616
6056 1380
708
751 564
528 512
519
6561
444 297
289 250
251
6562 2568
708
771 470
480 435
459
7071
960
770 578
551 506
510
7072
639 377
355 331
340
8000 1320
Overall 26796
511 344
328 302
306
The sizes of the symbols are proportional to standardized distances between the direct point
estimates and respective true values, d i yi i
vi . We can see a couple of larger
circles on the upper edge of the plot. These circles correspond to different estimators for
the same domain. The sizes of the circles suggests that the domain has an outlying value
of the direct point estimate. The location of the circles indicate that the variance is
overestimated by both models for this outlying domain. This would have the effect of overshrinkage of the estimated value to the mean (the “synthetic part”) of the model.
While we might accept over-shrinkage of an outlier, in practice we do not observe the true
value. This same over-shrinkage phenomenon would be also expected to occur in the case
where the true (but unobserved) generating value for a domain deviates from the
assumption of linearity. The two joint models provide various degrees of smoothing based
on the input data. Whenever the joint models encounter large residuals, i.e., deviations of
the observed input data from the linearity assumptions stipulated by formula (4), they may
enlarge the estimated variance, particularly for the FHS model that imposes a single
component normal distribution as the prior for random effects. Therefore, it is important to
study robustness of the models to deviations from the linearity assumption. We approach
this in the next section by introducing a Monte Carlo simulation study where the true
domain values are generated such that the global linearity assumption does not hold for
some of the domains.
Figure 2. Estimated vs true variances of the direct estimator
(Health Care and Social Assistance industry, month #1)
4. Model robustness study for deviations from linearity
The purpose of the simulation exercise described in this section is to study how the
proposed joint models behave in the case when there are domains with large deviations
from the model’s linearity assumption. To this end, we generate data using several
scenarios, as described below.
For a set of i 1,...,100 domains, we generate estimation targets i as
i xi ui ,
(13)
where auxiliary data xi ~ U (5,10), 1 and random effects are ui ~ N 0,1 .
We set 0 for the first 95 domains and 3 for the last 5 domains. Thus, the last 5
domains induce a deviation from the (overall) linearity assumption of the models.
The “observed point estimates” are
yi i ei ,
where ei ~ N 0, vi
(14)
and “true” variances are
i2 ~ IG g 1, g b .
(15)
Generating true variances i are not observed directly (or available for subsequent
modeling). We simulate observed estimates of variances as
2
1
vi ~ G 3,3 2 .
i
(16)
We consider several scenarios by varying the values of parameters b, g , thus reflecting
various schemes for the noise in the data:
1) Low average true variance b 0.5 ;
2) Medium average true variance b 1 ;
3) High average true variance b 1.5 .
For each level of b , consider three levels of variability of the true variance. The value of
g 1 induces the highest degree variability (of the variances,
),while g 4 and
g 8 induce gradually lower variability in the generated variances. The higher
variability scenarios (inversely proportional to g ) are expected to generate a heavier tailed
2
distribution for i that will induce outlying values of yi for some domains.
Table 3: Properties of the credible intervals, over 95 domains with 0
b, g
FH
FH-V
FHS
Coverage (0.95 nominal)
0.914
0.957
0.933
0.915
0.952
0.934
0.921
0.957
0.943
0.922
0.963
0.951
0.923
0.960
0.951
0.927
0.963
0.957
0.928
0.971
0.960
0.930
0.968
0.959
0.933
0.970
0.965
Length
2.259
2.392
2.259
2.208
2.384
2.234
2.004
2.371
2.186
2.933
3.102
2.881
2.869
3.080
2.846
2.599
3.060
2.725
3.445
3.659
3.366
3.364
3.625
3.316
3.027
3.595
3.135
[0.5, 8]
[0.5, 4]
[0.5, 1]
[1, 8]
[1, 4]
[1, 1]
[1.5, 8]
[1.5, 4]
[1.5, 1]
[0.5, 8]
[0.5, 4]
[0.5, 1]
[1, 8]
[1, 4]
[1, 1]
[1.5, 8]
[1.5, 4]
[1.5, 1]
FHSC
0.933
0.935
0.951
0.938
0.942
0.952
0.944
0.946
0.956
2.293
2.244
2.069
2.966
2.897
2.642
3.435
3.353
3.036
After S 100 simulations, we compute MSE for each of the above scenario b, g for
domain i as
1 S
MSEi ˆ ˆi i
S s 1
2
.
1 100
ˆ
Average MSE over all 100 domains is MSE
MSEi ˆ . We also compute
100 i 1
average
MSE
separately
over
a
set
of
domains
with
0,
95
1
MSE 0 ˆ MSEi ˆ , and over a set of domains with 3 ,
95 i 1
1 100
MSE 3 ˆ MSEi ˆ .
5 i 96
Table 4: Properties of the credible intervals, over 5 domains with 3
b, g
FH
FH-V
FHS
FHSC
Coverage (0.95 nominal)
[0.5, 8]
0.678
0.658
0.330
0.706
[0.5, 4]
0.702
0.654
0.342
0.694
[0.5, 1]
0.756
0.614
0.332
0.680
[1, 8]
0.676
0.654
0.482
0.706
[1, 4]
0.690
0.642
0.474
0.710
[1, 1]
0.734
0.632
0.468
0.692
[1.5, 8]
0.708
0.706
0.566
0.718
[1.5, 4]
0.726
0.704
0.566
0.702
[1.5, 1]
0.728
0.694
0.556
0.698
Length
[0.5, 8]
2.241
2.390
2.564
2.443
[0.5, 4]
2.231
2.388
2.575
2.448
[0.5, 1]
2.006
2.371
2.551
2.377
[1, 8]
2.927
3.118
3.094
3.071
[1, 4]
2.880
3.072
3.081
3.040
[1, 1]
2.608
3.062
2.977
2.819
[1.5, 8]
3.428
3.650
3.515
3.508
[1.5, 4]
3.378
3.611
3.489
3.467
[1.5, 1]
3.038
3.603
3.342
3.194
Coverage probabilities and interval lengths for 95% nominal credible intervals for the fitted
values based on all the models are presented in Table 3 (for 0 domains) and Table 4
(for 3 domains.) Coverages are derived for each domain over 100 simulations. The
domain results are averaged over respective groups of domains: c 0
1 95
cˆi and
95 i 1
1 100
1 S
ˆ
ˆ
c 3 ci , where ci I qˆi ,0.025 i qˆi ,0.975 and qˆi ,0.025 , qˆi ,0.975 are quantiles
5 i 96
S s 1
of the posterior distribution of the fitted values for domain i . The average length of the
1 95 ˆ
1 100 ˆ
li , where lˆi qˆi ,0.975 qˆi ,0.025
intervals are obtained as l 0
li and l 3 5 i
95 i 1
96
and S denotes the number of Monte Carlo simulation iterations.
For the 0 domains, coverages for joint models are close to nominal. The FH coverages
are somewhat low, especially for lower variances scenarios of b 0.5, g and
b 1, g . The coverages for the FH-V model are slightly higher than the nominal; their
average interval lengths are longer than in the other models.
The model coverages for 3 domains are low under all of the models, with the lowest
coverages for the FHS model. These results show that none of the models considered
provide satisfactory estimates for the domains where there are significant deviations from
the model linearity assumption. Therefore, it is important to develop a procedure that would
identify domains that do not fit the model well. In the next Section, we propose such a
procedure to create a list of “suspect” domains that are not well described by the model.
5. Improved handling of domains
Although the CES survey uses models to produce estimates for small domains, the direct
sample-based estimator is used for publication of moderately and larger sized domains.
Before these estimates are published, they have to be reviewed. In this section, we propose
a screening procedure that can be used to facilitate the analyst’s review of the direct
estimates before they go to production.
The proposed screening creates a list of domains that are not well described by the assumed
model. For the larger, direct sample-based domains, analysts may find influential reports
(that may need to be downweighted) or submission errors (that would be subsequently
repaired) among establishments that would induce outliers in the sample estimates. So,
even though models would not be used to provide estimators for large-sized domains, they
may be used to check for outliers in an efficient way.
Our screening procedure would also be expected to flag deviations from linearity among
all domains – including those which are modeled – for analyst checking. To the extent that
data submission errors and low quality data (due to small domain sizes) are ruled out, the
nominated domain may be assumed to represent a deviation from linearity, in which case
the direct estimator for that domain would replace the modeled estimate.
We earlier demonstrated that our models may poorly fit domains expressing deviations
from the linearity assumption due to over-smoothing. Ideally, we want to flag these
domains as not generated from our model, in this case, and just use the direct estimator.
Similarly, our models may be useful to flag outliers with respect to the model due to
unreliable estimators or establishment input errors. We would like to flag and correct these
points. It is time prohibitive to have a survey analyst perform manual checking of all
domains due to the tightly scheduled CES production environment. In what follows, we
formulate a hypothesis test from the posterior predictive distribution under the model to
assess whether the direct estimator for each domain was generated from our chosen model.
We nominate a few domains out of many under this procedure that allows focused, efficient
investigation by the survey analyst of whether any of the few identified domains are
outliers. If the survey analyst concludes that there are input errors in the nominated or
flagged domains, the errors will be corrected. If there are not input errors, the large
difference between modeled estimators and the direct estimators for these domains are
assumed to represent deviations from linearity.
The usual strategy for introducing of a model in the CES production is to consider a set of
candidate models 1 ,..., W and thoroughly test them on a number of historical series
over several years. Suppose researchers are satisfied with the results of such a multi-year
study, and one of the models, w , is accepted for production. The question remains, what
if the selected model w works well in general but fails for some domains in some
months?
As we earlier noted, the analysis may suggest that model-based estimates for some of these
domains are unreliable in the case of deviations from linearity; in such a case, the direct
sample estimates would be used for publication. Alternatively, the direct estimates may be
considered not trustworthy (for example, due to small sample size or extreme sample
reports). In the latter case, model estimates could be used even though they are seemingly
inconsistent with the data.
We now proceed to describe the method of creating the list of nominated suspect domains.
The method is based on the Bayesian multiple hypotheses testing and posterior predictive
checking.
For a given model w over the space of candidate models indexed by w 1,...,W , let
yil , l 1,..., L
be replicate data draws from posterior predictive distribution
p yil | yi , w for domain i (after marginalizing out the model parameters).
For each domain i , consider hypothesis H i 0 that the domain response is generated from
the model, which means that yi follows p yil | yi , w .
l
l
Define pi min P yi yi | yi , w , P yi yi | yi , w
, where P denotes the
probability of the event that yi is generated from model w . In this sense, pi denotes
the probability of erroneously rejecting H i 0 that domain i is generated from the model.
Let set D denote the set of “discoveries” (i.e., the domains that are deemed not generated
from the model according to the definition of H i 0 .) Then the expected number of “false
discoveries” is F E pi | i D, w and the estimated F is computed from the
average,
1
Fˆ
D
p .
i
(17)
D
Next, we set threshold, q , a hyperparameter setting that denotes the maximum percent of
allowable “falsely discovered domains” (Storey, 2003). The size of the list of “discoveries”
will depend on q : set D will contain the maximum number of domains such that F̂
does not exceed q .
The algorithm follows:
1. Sort pi ’s in the ascending order,
p1 ... p N and compute the cumulative
mean.
2. An example of the plot of the cumulative mean is presented in Figure 3. We may
review the plot to think of what the reasonable q -value could be. Or we may just
set the q -value once in advance.
3. Suppose we choose q 0.05 . Then D will consist of the first d domains with
smallest pi ’s: p1 ... p d , such that
1 d
p q.
d i 1 i
In other words, p d is the p -value that guarantees that the false discovery rate does not
exceed q 0.05.
Figure 3: Cumulative mean, by domain for model FHS (industry 6561, month #3)
We applied this test to the data from the simulation study considered in Section 4 and
created a list of “discoveries” to be sent for the review by analysts. Since this review is not
available in our simulations, we make a favorable assumption that analysts make correct
decisions of whether an estimate on the list is an outlier or a true (deviation from linearity)
phenomenon. Namely, we assume that all “discoveries” from the set of the 95 domains
generated with 0 were attributed to a “bad sample” cause and that the analyst’s
decision was to use modeled estimates for these domains; all “discoveries” from the set of
the domains generated with 3 were attributed to the failure of model’s linearity
assumption and the direct sample estimates were used for such domains, instead of the
model estimates.
We first focus on comparing the relative effectiveness of our models to discover deviations
form linearity on domains generated with 3 . The resulting MSEs after the
replacement, for different levels of thresholds, are given in Table 5.
Table 5: MSE for domains with 3
q 0.05
q 0.10
q 0.15
b, g
Y
FHS
FHSC
FHS*
FHSC*
FHS*
FHSC*
FHS*
FHSC*
[0.5, 8]
[0.5, 4]
[0.5, 1]
[1, 8]
[1, 4]
[1, 1]
[1.5, 8]
[1.5, 4]
[1.5, 1]
0.511
0.556
0.361
1.023
1.112
0.721
1.534
1.668
1.082
3.719
3.851
3.892
3.350
3.469
3.375
3.392
3.484
3.316
1.331
1.419
1.589
1.952
2.032
1.901
2.478
2.538
2.303
1.449
1.603
1.636
2.442
2.615
2.467
3.052
3.228
2.815
1.070
1.135
1.096
1.891
1.940
1.678
2.525
2.572
2.168
0.849
0.842
0.618
1.741
1.893
1.367
2.479
2.675
2.023
0.911
0.948
0.741
1.742
1.771
1.350
2.407
2.464
1.854
0.646
0.681
0.441
1.374
1.442
0.920
2.002
2.157
1.388
0.771
0.807
0.526
1.506
1.558
1.006
2.098
2.243
1.437
The first columns of Table 5 show MSEs of the “direct” estimates Y and the estimates
based on the FHS and FHSC models. Columns labeled FHS* and FHSC* show MSE
results after analysts correctly replace domains from group 3 nominated under each
model by the direct sample estimates. The set of nominated or discovered domains for the
FHS* and FHSC* columns were created using, respectively, the FHS- and FHSC-based
screening procedures with the same respective threshold levels. We show results for
threshold choices of 0.05, 0.10, and 0.15. As can be seen in Table 5, the correct replacement
of discoveries with the direct estimates leads to visible reductions in MSE; however, the
values of MSE are still higher than the respective MSEs for the “direct” estimates given in
the column labeled Y. The reason is that not all the deviations from linearity were captured
by the test under the chosen threshold levels, i.e., there remained domains with 3 that
were not captured by the screening and thus their respective (overly shrunken) model-based
estimates were not replaced by the direct estimates.
Table 6 displays the fraction of “true discoveries”, defined as the number of correct
discoveries divided by the total number of the 3 domains, i.e. of true deviations from
linearity, at increasing threshold levels. Table 6 shows that higher threshold levels increase
the number of domains correctly discovered. This increase in discoveries, in turn, would
reduce MSEs for those correctly replaced domains (Table 5); however, investigating the
added discoveries would also increase the workload for analysts. Table 7 shows the fraction
of “false discoveries”, defined as the number of discoveries among the domains where
there is no deviation from linearity, e.g., 0 , divided by the total number of such
domains. By construction, the false discovery rate increases with an increasing threshold,
which could result (after analyst investigation) in more domains to be mislabeled as
“deviations from linearity assumptions”, when in fact their appearance on the list could be
due to poor direct estimates (that induce outlyingness). In practice, tuning will be required
to set the threshold, taking into consideration the workload and timeline restrictions.
We note that the FHSC produces a lower discovery rate. This lower discovery rate is
expected, since FHSC is a more flexible model than FHS in the sense of adapting to
deviations from linearity. It is able to better model some of these domains by allocating
them to a cluster, which reduces the shrinking of these domain estimates. The flexible
estimation property of the FHSC model is also evidenced by the MSE results for the 3
domains shown in Table 5, where MSE values are lower for the FHSC model compared to
the FHS model; in particular, results for the q 0.05 threshold are better for FHSC*, as
compared to FHS*, even though the discovery rate is lower for the FHSC model; similarly,
results for the q 0.10 and q 0.15 threshold levels are close, even though FHSC has
much lower discovery rate. Some of the 3 domains are less shrunken under FHSC
(than FHS) and are, therefore, better predicted by the model and, hence, not “discovered”.
That our testing procedure produces fewer discoveries under FHSC for modeled domains,
but yet FHSC produces relatively lower errors is a feature of this model.
Table 6: Percent of “true discoveries”: for the 3 domains
q 0.05
b, g
[0.5, 8]
[0.5, 4]
[0.5, 1]
[1, 8]
[1, 4]
[1, 1]
[1.5, 8]
[1.5, 4]
[1.5, 1]
q 0.10
q 0.15
FHS* FHSC* FHS* FHSC* FHS* FHSC*
37%
35%
32%
18%
17%
15%
11%
11%
9%
8%
10%
8%
5%
7%
5%
4%
6%
4%
66%
68%
69%
48%
48%
49%
37%
37%
37%
20%
24%
26%
19%
23%
21%
21%
24%
21%
81%
82%
84%
66%
70%
72%
59%
61%
65%
40%
43%
47%
40%
44%
45%
42%
45%
47%
Table 7: Percent of “false discoveries”: for the 0 domains
q 0.05
b, g
[0.5, 8]
[0.5, 4]
[0.5, 1]
[1, 8]
[1, 4]
[1, 1]
[1.5, 8]
[1.5, 4]
[1.5, 1]
q 0.10
q 0.15
FHS* FHSC* FHS* FHSC* FHS* FHSC*
1%
1%
1%
1%
1%
1%
0%
0%
1%
0%
0%
0%
0%
0%
0%
0%
0%
0%
8%
7%
7%
5%
5%
5%
4%
4%
4%
0%
1%
1%
1%
1%
2%
1%
2%
2%
18%
18%
18%
14%
14%
13%
13%
13%
13%
2%
3%
3%
4%
5%
5%
6%
7%
7%
We, next, assess the effectiveness of our test procedure when it is applied to non-modeled
domains as a screening tool of sample-based, direct estimates before they are released for
publication. Here, the goal is to detect the domains where direct estimates are impacted by
poor sample or extreme sample measurements. For this purpose, we use the same
simulation and test results as in Tables 5-7 but change the focus of the evaluation to
compare the original “direct” estimates with the “corrected” estimates following analysts’
review of the list. In the actual production, we expect analysts to identify extreme
measurements and errors and update domain sample-based estimates after making
appropriate corrections. In this simulation study, we do not have the analysts review stage;
in place of the review, we use the following assumptions and approximations: assume
analysts correctly identify the flagged domains in the 0 group as being affected by
sample outliers; for those flagged domains in the 0 group, we use respective modelbased estimates to replace the original “direct” estimates as proxy of what the estimates
will look like after analysts’ treatment of outliers. We compute the MSE of the original
direct-based estimates and the revised estimates after replacements. The results are
reported in Table 8.
Table 8: MSE for domains with 0
q 0.05
q 0.10
q 0.15
b, g
Y
Y_FHS*
Y_FHSC*
Y_FHS*
Y_FHSC*
Y_FHS*
Y_FHSC*
[0.5, 8]
[0.5, 4]
[0.5, 1]
[1, 8]
[1, 4]
[1, 1]
[1.5, 8]
[1.5, 4]
[1.5, 1]
0.508
0.513
0.482
1.016
1.025
0.963
1.524
1.538
1.445
0.500
0.502
0.449
0.984
0.981
0.873
1.487
1.472
1.301
0.506
0.510
0.467
1.013
1.014
0.910
1.513
1.511
1.339
0.461
0.463
0.404
0.864
0.848
0.711
1.276
1.241
1.023
0.501
0.495
0.429
0.976
0.956
0.800
1.421
1.369
1.121
0.419
0.420
0.369
0.724
0.710
0.585
1.038
1.016
0.808
0.484
0.472
0.391
0.896
0.863
0.690
1.251
1.193
0.923
Column Y shows MSE for the original “direct” estimates. Columns Y_FHS* and
Y_FHSC* show MSEs after the original estimates for the domains on the list where
replaced by the respective model-based estimates, used as approximation of estimates
smoothed over the presumed outliers. We observe that the increasing of the threshold levels
results in more flagged domains; the replacement of respective original estimates by more
smooth model-based values results in improved estimation. We also observe that FHSbased procedure gives slightly better results, compared with the FHSC-based counterpart.
Since FHS does not cluster extreme values – be they deviations from linearity or extreme
sample measurements – this model will tend to produce more discoveries. This greater
sensitivity of FHS relative to FHSC could also lead to false discoveries. The discoveries
may also include a more equal mix deviations from linearity and extreme sample
measurements than under FHSC (which may better estimate some domains expressing
deviations from linearity). The more equal mix produces a greater reliance on the analyst
to correctly differentiate the two phenomena.
6. Summary
In this paper, we applied joint modeling of the point estimates and their variances to the
synthetic CES data and obtained more efficient results than in the case of the plugged in
“fixed and known” variances. We extended the models of Maiti et al. (2014) and Sugasawa
et al. (2017) by allowing the data to estimate a clustering structure on random effects and
variances to account for deviations from linearity and outlyingness. For the bulk of
domains, the co-clustering model provides better estimates of direct survey variances. Our
simulations show that co-clustering model is more robust to deviations from linearity
assumptions in terms of coverage. In the presence of large deviations from linearity, we
observed that although the resulting estimates from the co-clustering model are better than
without clusters, they are still not “good enough”: in the presence of large deviations from
the linearity assumption, model-based estimates may be worse than direct survey estimates.
It is a good practice to perform careful model checks before choosing a model. However,
thorough model evaluation can be an unrealistic task in a tightly scheduled production
environment. The checking task is so important, however, that estimates are thoroughly
tested based on a number of historical series before a model is accepted for implementation
in production. Therefore, we devised an automated, fast computing testing procedure based
on the Bayesian FDR to nominate a small subset of domains for analysts review on a timely
basis. Our procedure evaluates the probability that the direct estimate for a domain was
generated from our candidate model. This procedure could become a useful tool for
analysts to mark unusual estimates before they are published.
Lastly, there is indication that model fitted variances for direct survey estimates provide a
more stable alternative to the raw sample-based estimates of variances. This is a potentially
useful by-product from the joint modeling of direct estimates of point estimates and
variances.
References:
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A
practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B
57 289–300.
Bureau of Labor Statistics (2004), Employment, Hours, and Earnings from the
Establishment Survey, BLS Handbook of Methods, chap. 2, Washington, DC:
US
Department
of
Labor.
Available
at
http://www.bls.gov/opub/hom/pdf/homch2.pdf, Last accessed on May 1, 2018.
Fay, R. E., and Herriot, R. A. (1979), “Estimates of Income for Small Places: An
Application of James-Stein Procedures to Census Data,” Journal of the
American Statistical Association, 74, 269–277.
Gelman, A., Lee, D., and Guo, J. (2015) Stan: A probabilistic programming
language for Bayesian inference and optimization. In press, Journal of
Educational
and
Behavior
Science.
http://www.stat.columbia.edu/~gelman/research/published/stan_ jebs_2.pdf
Gershunskaya, J. and Savitsky, T.D. (2017) Dependent Latent Effects Modeling for
Survey Estimation with Application to the Current Employment Statistics
Survey. Journal of Survey Statistics and Methodology, Volume 5, Issue 4, 433–
453, https://doi.org/10.1093/jssam/smx021
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D.M. (2017),
Automatic differentiation variational inference. Journal of Machine Learning
Research, 18(14):1–45.
Maiti, T., Ren, H. and Sinha, A. (2014). Prediction error of small area predictors
shrinking both means and variances, Scandinavian Journal of Statistics, 41,
775-790.
Neal, R. M. (2000), “Markov Chain Sampling Methods for Dirichlet Process
Mixture Models,” Journal of Computational and Graphical Statistics, 9, 249–
265.
Stan Development Team (2017), Stan modeling Language User’s Guide and
Reference Manual, Version 2.17.0 [Computer Software Manual], available at
http://mc-stan.org/. Last accessed 04/23/2018
Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation
and the q-value. Annals of Statistics 31, 2013–2035.
Sugasawa, S., Tamae, H., and Kubokawa, T. (2017) Bayesian Estimators for Small Area
Models Shrinking Both Means and Variances. Scand J Statist, 44: 150–167.
doi: 10.1111/sjos.12246.