Proceedings of Seventh International Conference on Establishment Statistics, 2024
This paper, prepared for the Seventh International Conference on Establishment Statistics, notes various configurations and uses/techniques for cutoff sampling. The conference was held in Glasgow in June 2024. To meet space requirements, some bibliography items are abbreviated here, as of September 12, 2024. Proceedings are pending.
There are various concepts regarding the nature and application of cutoff (or cut-off) sampling. We can start consideration with a population, subpopulation, or stratum divided into take-all, take-some, and take-none strata or groups, such as that noted by Benedetti, Bee, and Espa(2010). Other concepts of 'cutoff' sampling involve the combination of any two of those three groups. All such pairings have been considered by various authors. Inference may be either by a design-based technique, perhaps using calibration, or by a model-based approach such as the use of ratio models as proposed earlier by Brewer(1963) and Royall(1970), though they both moved on from that work. They considered take-all and take-none strata or groups, with ratio model predictions. Other combinations are discussed, as used by other authors. A take-all and take-none combination with prediction was also used in Guadarrama, Molina, and Tillé(2020). Similar methodology used by Knaub has been practiced extensively at the US Energy Information Administration (EIA) starting circa 1990, as noted in Knaub(2023) and in the References and Bibliography found in Knaub(2022), modified due to the multiple-attribute nature of almost all surveys, and other considerations. Bias has been shown to be low there. There is substantial potential for expanded use. Advantages are low small-respondent burden, low cost, and high accuracy for this type of cutoff/quasi-cutoff sampling with prediction, using the same data item in a previous census as the predictor for each item. Quasi-cutoff sampling refers to the fact that establishment respondents report for multiple items, which will vary in size from one item to another. To obtain the desired estimated variance of prediction errors associated with predicted totals by item, one may include other, smaller establishment respondents, which may be large for certain items. (A few other establishments with smaller items could be included in the sample for real-time model verification, with subsequent possible action such as stratification or the use of quadratic linear regression (Royall(1992), Valliant(2009)).) Note that use of size by item in prediction is an advantage over unequal probability sampling. Accuracy and usefulness may vary greatly for different concepts of 'cutoff' sampling. More marginal cases might be reminiscent of work by other authors to provide inference from other nonprobability sampling with some degree of accuracy. An example of a type of cutoff sampling which compares somewhat to convenience sampling is noted. When design-based inference is used, calibration may be helpful, but for convenience sampling, pseudo-weights would be used first. See Elliott and Valliant(2017) and Valliant(2020). Modeling seems always to be a consideration, directly or through calibration or other means. In all cases, total survey error (TSE) should be considered.
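As a concrete illustration of the prediction step described above, here is a minimal sketch, assuming a single item, a previous-census value x known for every establishment, and current data y collected only from establishments at or above a cutoff on x. The function and the simulated data are illustrative assumptions of mine, not the paper's specification.

```python
import numpy as np

def ratio_predict_total(x_all, y_sample, in_sample):
    """Model-based ratio prediction of an item total from a cutoff sample.

    x_all:     previous-census value of the item for every establishment (N,)
    y_sample:  current reported values for the sampled (largest) establishments
    in_sample: boolean mask (N,) marking which establishments were sampled

    Under the ratio model E(y_i) = b*x_i with V(e_i) proportional to x_i,
    the WLS slope is b_hat = sum(y)/sum(x) over the sample; the predicted
    total is the observed sum plus b_hat times the out-of-sample x total.
    """
    x_s = x_all[in_sample]
    b_hat = y_sample.sum() / x_s.sum()              # classical ratio estimator slope
    t_hat = y_sample.sum() + b_hat * x_all[~in_sample].sum()
    return b_hat, t_hat

# Hypothetical example: sample the establishments largest on the previous census.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.5, size=200)    # skewed establishment 'sizes'
y = 1.1 * x + rng.normal(0.0, np.sqrt(x))           # ratio model, V(e) prop. to x
mask = x >= np.quantile(x, 0.80)                    # roughly the top 20% by size
b, total = ratio_predict_total(x, y[mask], mask)
print(f"b_hat={b:.3f}, predicted total={total:,.0f}, actual total={y.sum():,.0f}")
```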
Please Note: This simple, proven, highly effective methodology has been developed and extended over many years at the US Energy Information Administration, and has great potential for other applications. ---
Abstract: Official Statistics from establishment surveys are not only the basis for routine monitoring of markets and perhaps systems in general, they are essential for discovering problems for which innovative approaches may be needed. Statistical agencies remain the workhorses for regularly providing this information. For establishment surveys one needs to collect data efficiently, making an effort to reduce burden on small establishments, and reduce costs to the Government, while promoting accuracy of results in terms of total survey error. For over three decades, these demanding standards have been met using quasi-cutoff sampling and prediction, applied extensively to some of the repeated official US energy establishment surveys. This success may be duplicated for other applications where sample surveys occur periodically, and there is an occasional census produced for the same data items. Sometimes stratification is needed, but sometimes the borrowing of strength, as in small area estimation/prediction, may be used. References will be given to help avoid pitfalls. The idea is to encourage expanding this elegant approach to other applications. The material here is an expanded version of a poster for the 2022 Joint Statistical Meetings. This is a tutorial/guide. Appendices are written in stand-alone form.
Edits made to original. --- Some of the history of a monthly sales and revenue survey, and the current status of that survey and others, are given as background to changes now being implemented. Sampling will now take place at the State level, instead of the national level; more complete use will be made of auxiliary information; work has been done to control CVs for several variables simultaneously; and a test originally designed to consider degrees of homogeneity between several similar populations with limited sampling has been used to investigate the effectiveness of this design at an aggregated (Census division) level.
InterStat:
Ordinary least squares (OLS) regression gets most of the attention in the statistical literature, but for cases of regression through the origin, say for use with skewed establishment survey data, weighted least squares (WLS) regression is needed. Here we gather some information on properties of weighted least squares regression, particularly with regard to regression through the origin for establishment survey data, for use in periodic publications.
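To make the weighting concrete: a minimal sketch follows, assuming the variance structure commonly used in this setting, V(e_i) proportional to x_i^(2*gamma); with gamma = 1/2 this reduces to the classical ratio estimator. The function name and closed-form solution are mine, stated for a single regressor.

```python
import numpy as np

def wls_through_origin(x, y, gamma=0.5):
    """WLS slope for the no-intercept model y_i = b*x_i + e_i, using
    regression weights w_i = x_i**(-2*gamma), i.e. V(e_i) prop. to x_i**(2*gamma).
    gamma = 0.5 gives b = sum(y)/sum(x), the classical ratio estimator;
    gamma = 0 reduces to OLS through the origin."""
    w = x ** (-2.0 * gamma)
    return np.sum(w * x * y) / np.sum(w * x * x)   # b = sum(w*x*y) / sum(w*x*x)
```

The closed form drops straight out of minimizing the weighted sum of squared residuals, which is why no iterative fitting is needed in the single-regressor case.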
-----------------------------------
March 17, 2016: Note that a special approximation for variance, regarding estimated totals, was used here, for purposes of various possible needs for production of Official Statistics in a potentially changing environment. Flexibility for possible changes in modeling, data storage, aggregate levels to be published, and avoidance of future data processing errors on old data made this attractive. Simplicity of application in a production environment was emphasized.
-----------------------------------
[This is not for time series analyses.] This research concerns multiple regression for survey imputation, when correlation with a given regressor may vary radically over time, and emphasis may shift to other regressors. There may be many applications for this methodology, but here we will consider the imputation of generation and fuel consumption values for electric power producers in a monthly publication environment. When imputation is done by regression, a sufficient amount of good-quality observed data from the population of interest is required, as well as good-quality, related regressor data, for all cases. For this application, the concept of 'fuel switching' will be considered. That is, a given power producer may report using a given set of fuels for one time period, but for economic and/or other practical reasons, fuel usage may change dramatically in a subsequent time period. Testing has shown the usefulness of employing an additional regressor or regressors to represent alternative fuel sources. A performance measure found in Knaub(2002, ASA JSM CD) is used to compare results. Also, the impact of regression weights and the formulation of those weights, due to multiple regression, are considered.
-----
Jan 8, 2016: Note that this is not a time series technique. This is for cross-sectional surveys, and was designed for use on establishment surveys for official statistics. I have had some discussions on ResearchGate recently regarding the notion of bias-variance tradeoffs in modeling, and that more complicated models (tend to?) decrease (conditional?) bias and increase variance. Here, however, variance for estimated totals, under the sampling conditions here, is decreased when there is fuel switching. (Acknowledgement: Thank you to those who discussed my questions on ResearchGate.)
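A minimal sketch of this kind of multiple-regressor imputation, assuming one regressor for the same fuel from a previous census and one representing an alternative fuel source (to allow for 'fuel switching'); the weight formulation and names are illustrative choices of mine, not the production specification.

```python
import numpy as np

def wls_multi_impute(X_obs, y_obs, X_mis, size_obs, gamma=0.5):
    """Impute missing y via weighted multiple linear regression through the origin.

    X_obs, X_mis: regressor matrices for respondents / nonrespondents
                  (e.g., column 0 = same fuel, previous census;
                   column 1 = alternative-fuel regressor, for fuel switching)
    size_obs:     a measure of size per respondent, used in the weights
    Weights are size**(-2*gamma), analogous to the single-regressor case."""
    w = size_obs ** (-2.0 * gamma)
    XtWX = X_obs.T @ (w[:, None] * X_obs)   # weighted normal equations: (X'WX) b = X'Wy
    XtWy = X_obs.T @ (w * y_obs)
    b = np.linalg.solve(XtWX, XtWy)
    return X_mis @ b                        # imputed values for nonrespondents
```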
Notation: An asterisk means a WLS estimator (as in GS Maddala's notation). Also note that for the estimated variance of the prediction error, instead of V*(T*-T), I inappropriately had V*(T*). https://www.amstat.org/sections/SRMS/Proceedings/papers/1996_101.pdf Model-based inference has performed well for electric power establishment surveys at the Energy Information Administration (EIA), using cutoff sampling and weighted, simple linear regression, as pioneered by K.R.W. Brewer, R.M. Royall, and others. However, 'nonutility' generation sales for resale data have proved to be relatively difficult to estimate efficiently. Design-based inference would be even less efficient. A weighted, multiple linear regression model, using a cutoff sample, where one regressor is the data element of interest as captured in a previous census, and another regressor is the nameplate capacity of the generating entity, has proved to be extremely valuable. This is being applied to monthly sampling, where regressor data come from previous annual census information. Estimates of totals, with their corresponding estimates of variance, have been greatly improved by this methodology. This paper is an abbreviated version of an article found in the electronic journal, InterStat, located on the Internet at http://interstat.statjournals.net. Joint Statistical Meetings, Chicago, Illinois, USA; 08/1996
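In the notation of the correction above, the quantity of interest is the variance of the prediction error, not the variance of the predictor; in outline (a standard model-based identity, with s the sample):

```latex
T^* = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i ,
\qquad
V^*(T^* - T) = V^*\!\Big(\sum_{i \notin s} (\hat{y}_i - y_i)\Big) ,
```

since the observed part of T* contributes no error under the model, uncertainty comes entirely from predicting the out-of-sample values.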
Jan. 8, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read that this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation. .................... Previous notes: The classical ratio estimator (CRE) is very simple, has a long history, and has a stunningly broad range of application, especially with regard to econometrics and to survey statistics, particularly establishment survey statistics. The CRE has a number of desirable properties, one of which is that the sum of its estimated residuals is always zero. It is easily extended to multiple regression, and a property shown in Sarndal, Swensson and Wretman (1992) may be used to indicate the desirability of this zero-sum-of-estimated-residuals feature when constructing regression weights for multiple regression. In the single-regressor form, the zero sum of estimated residuals property is related to an interesting phenomenon expressed in Fox (1997). Finally, relationships of the CRE to some other statistics are also considered. -- Note added November 2014: As noted in other works I have done, and elsewhere, for this model, only the individual values of x corresponding to individual y values need to be known, as long as the sum of the remaining x (for out-of-sample cases) is known, and then one can still estimate variance. If we do not know the sum of those remaining N-n x-values (where n is sample size selected minus cases to be imputed), but we know a range for that subtotal of x's, then we know a range of estimated variances for the estimated y-totals to go with that range. --
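The zero-sum property noted above follows in one line for the single-regressor CRE (a standard derivation, in notation I am supplying):

```latex
\hat{b} = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i}
\quad\Longrightarrow\quad
\sum_{i \in s} \hat{e}_i
= \sum_{i \in s} \left( y_i - \hat{b}\, x_i \right)
= \sum_{i \in s} y_i - \hat{b} \sum_{i \in s} x_i = 0 .
```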
In repeated surveys there generally are auxiliary/regressor data available for all members of the population that are related to data collected in a current sample or census survey. With regard to modeling, these regressor data can be used to edit the current data through scatterplots, and to impute for missing data through regression. Another use for regressor data may be the study of total survey error. To do this, follow these steps: (1) stratify data by regression model application (the related scatterplots can be used for editing); (2) find predicted values for data not collected, if any; and (3) replace all data that are collected with corresponding predicted values. If model-based ratio prediction is used, with variance proportionate to a measure of 'size,' then the sum of the predicted values equals the sum of the observed values they replace. (See "Fun Facts" near the end of this article: fact #4.) The standard error of the total of the predicted values for every member of a finite population, divided by that total, and expressed as a percent, could be labeled as an estimated relative standard error under a superpopulation, or a model-based RSESP. This RSESP would be influenced by (1) the models chosen, (2) inherent variance, and (3) total survey error (sampling and nonsampling error). This article proposes this model-based RSESP as a survey performance indicator and provides background and examples using both real and artificial data.
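A minimal sketch of the RSESP computation for a single stratum, assuming the classical ratio model E(y_i) = b*x_i with V(e_i) = sigma^2 * x_i; the function and its inputs are mine, and a multi-stratum version would combine variances across the strata of step (1).

```python
import numpy as np

def rsesp_ratio(x_sample, y_sample, x_pop_total):
    """Model-based RSESP (percent) for one stratum under the classical ratio model.

    Every population value is replaced by its prediction b_hat*x_i, so the
    total of the predictions is b_hat*x_pop_total, with variance
    x_pop_total**2 * V(b_hat), where V(b_hat) = sigma^2 / sum(x_sample)."""
    n = len(y_sample)
    b_hat = y_sample.sum() / x_sample.sum()
    resid = y_sample - b_hat * x_sample
    sigma2_hat = np.sum(resid**2 / x_sample) / (n - 1)   # estimate of sigma^2
    var_total = x_pop_total**2 * sigma2_hat / x_sample.sum()
    t_hat = b_hat * x_pop_total
    return 100.0 * np.sqrt(var_total) / t_hat
```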
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw something by G. Shmueli that made me realize my language was sloppy here. But I also read that this may be fairly common. At any rate, this is a fixed value to be estimated. .................... Previous notes: This article is from the Pakistan Journal of Statistics, http://www.pakjs.com/, in a special issue in honor of Ken Brewer. The URL for this article is http://www.pakjs.com/journals//27(4)/27(4)6.pdf . ---
Here we will review some of the historical development of the use of the coefficient of heteroscedasticity for modeling survey data, particularly establishment survey data, and for inference at aggregate levels. Some of the work by Kenneth R. W. Brewer helped develop this concept. Dr. Brewer has worked to combine design-based and model-based inference. Here, however, we will concentrate on regression modeling, and particularly on some of his earlier work.
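For reference, the variance structure that the coefficient of heteroscedasticity indexes can be written as follows (standard notation in this literature, with sigma and gamma fixed model quantities to be estimated):

```latex
y_i = b\,x_i + \epsilon_i, \qquad
V(\epsilon_i) = \sigma^2 x_i^{2\gamma}, \qquad
w_i = x_i^{-2\gamma} ,
```

so gamma = 0 corresponds to OLS through the origin, gamma = 1/2 to the classical ratio estimator, and gamma = 1 to weights 1/x^2.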
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read that this may be fairly common. At any rate, this is a fixed value to be estimated. This is a standard deviation. .................... Previous notes: From InterStat, http://interstat.statjournals.net/, September 2012:
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator, which has use in quasi-cutoff sampling, balanced sampling, and in econometrics applications. Applications in other areas of statistics may also arise. Multiple regression for a given attribute can be important, but is only considered briefly here. The need for data to estimate for multiple attributes is also important, and must be considered. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, to see the relative impact of factors needed for planning, for each such case. Typically one may consider the volume coverage for an attribute of interest, or related data, say regressor data, to be important, but relative standard errors for estimated totals, or confidence bounds, are needed to have a better idea of the adequacy of a sample. For multiple attributes (variables of current interest), this may be applied iteratively, as a change in sample for one attribute impacts the sample size for another. Please see the definitions section in https://www.academia.edu/16226638/Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations
Proceedings of the American Statistical Association, Survey Research Methods Section, JSM 2013, 2013
Abstract at bottom - notes first:
Oct 16, 2017: On page 2895, I have this: "... a balanced sample ... would be somewhat comparable to a random sample...." -
However, a simple random sample for a skewed population would generally result in even less efficiency than using a balanced sample here. It would usually more heavily 'represent' small members of the population.
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation.
....................
Previous notes:
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator (CRE). This has application to quasi-cutoff sampling (simply cutoff sampling when there is only one attribute), balanced sampling, econometric applications, and perhaps others. Multiple regression for a given attribute can occasionally be important, but is only considered briefly here. Nonsampling error always has an impact. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, for resource planning at that base level. Typically one may consider the volume coverage for an attribute of interest, or related size data, say regressor data, to be important, but standard errors for estimated totals are needed to judge the adequacy of a sample. Thus the focus here is on a 'formula' for estimating sampling requirements for a model-based CRE, analogous to estimating the number of observations needed for simple random sampling. Both balanced sampling and quasi-cutoff/cutoff sampling are considered.
Key Words: Classical Ratio Estimator, Volume Coverage, Measure of Size, Model-Based Estimation, Official Statistics, Resource Allocation Planning, Sample Size Requirements, Weighted Least Squares Regression
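Under the CRE model (gamma = 1/2), the projected variance of the prediction error for a total has a simple closed form, which is the kind of planning quantity discussed above; the derivation is the standard model-based one, in notation I am supplying, with x_s the sample x-total and X_r the out-of-sample x-total:

```latex
V(T^* - T) = \sigma^2 \left( \frac{X_r^2}{x_s} + X_r \right) ,
```

so for a target relative standard error one can solve for the volume coverage x_s/(x_s + X_r) required, analogously to solving for n under simple random sampling.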
Edited from InterStat, http://interstat.statjournals.net/ -
This (renamed) 2007 article surveys the use of cutoff sampling and inference by various organizations and as described in the literature. This is a technique often used for establishment surveys. Online searches were made which included the key words "cutoff sampling" and "cut-off sampling." Both spellings are in use. Various approaches are described, but the focus is on the model-based approach, using the classical ratio estimator (CRE). Concluding remarks are made. More material was added in 2014.
....
Key Words: establishment surveys, total survey error, model-based classical ratio estimator, CRE, multiple regression, RSE, RSESP, certainty stratum, link relative estimator
This is a letter to the editor of the Journal of Official Statistics (JOS). It addresses an article in the previous issue of JOS on cutoff sampling, which referenced this author, and attempts to clarify some positions, including that with regard to multiple attributes resulting in quasi-cutoff sampling. That article by Benedetti, Bee, and Espa is openly available at: https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/20200206/jos_cut-off_benedetti-mfl.pdf. ... Note that my work has applied to multiple-attribute (multiple-item) surveys at the US Energy Information Administration (EIA), starting with one circa 1990, using quasi-cutoff sampling where there are tradeoffs such that some members of the population large for one item but not others may not be in the sample. Thus the comment on page 655 in the Benedetti, Bee, and Espa article that I was limited to a "univariate setup" is not actually the case, though apparently that was not clear. ... In their article they go into detail to pick an optimal sample, but that method may not be necessary or appropriate for smaller populations such as the many small populations at the US Energy Information Administration. They look at a "take-some" stratum using a design-based approach. Emphasis at the EIA is on the different levels of model-based variance by item that may occur for the not-sampled part of each item population. They consider that prediction for that part may be analyzed as bias to be determined after the fact. ... Historical data are used for sample selection. ... See my comments on ResearchGate (in the "Comments" area), which address accuracy for the not-sampled part of the population, for each item.
This is a book review, for the Journal of Official Statistics (Statistics Sweden), with regard to nonresponse and model-assisted design-based sampling. The reviewer was asked to compare this to his own experiences in model-based estimation. -- Disclaimer: The reviewer's opinions are his own, and this in no way indicates any endorsement in any respect by his employer. This work was done off-duty. --- JOS: http://www.jos.nu/, and this article is found at http://www.jos.nu/Articles/abstract.asp?article=222351
Errata brought to my attention:
-- Figures labeled with 'confidence bounds' here actually show curved bounds forming prediction intervals for predicted y-values. They would better be called "Prediction Bounds."
-- The "estimated standard error of the random factors of the estimated residuals," should have been designated the "estimated standard deviation of the estimated random factors of the estimated residuals." That is, the sigmas are standard deviations, as they are not reduced with increased sample size.
-- This paper is for the prediction or estimation of official statistics. A better title would be as follows: Use of Ratios for Estimation or Prediction of Official Statistics at a Statistical Agency.
----------
From InterStat, 2012, http://interstat.statjournals.net/ -
The US Energy Information Administration (EIA) has made good use of available auxiliary data over a number of years, for a variety of surveys on energy sources and consumption, for estimating the statistics that the EIA mission requires. Such use of already available data reduces data collection burden for a given level of accuracy. Many of these instances relate to a single auxiliary variable, and involve some type of ratio. The many uses of ratios at the EIA are both disparate and unifying: disparate in that the applications can appear to be fairly distinct, but unifying in that there are interrelationships between these methods that may be esthetically pleasing, and of practical importance. Better communication and future improvements may be achieved by considering what is different between these methods, and what they have in common. Here we will explore these ideas. --
Please note that the illustration near the end is a (much) cropped photo I took of a painting done by my grandfather, Karl H. (Henry) Knaub, more than half of a century before I wrote this paper.
:-)
From InterStat; interstat.statjournals.net -
Most sample surveys, especially household surveys, are design-based, meaning sampling and inference (e.g., means and standard errors) are determined by a process of randomization. That is, each member of the population has a predetermined probability of being selected for a sample. Establishment surveys generally have many smaller entities, and relatively few large ones. Still, each can be assigned a given probability of selection. However, an alternative may be to collect data from generally the largest establishments – some being larger for some attributes than others – and use regression to estimate for the remainder. For design-based sampling, or even for a census survey, such models are often needed to impute for nonresponse. When such modeling would be needed for many small respondents, generally if sample data are collected on a frequent basis, but regressor (related) data are available for all of the population, then cutoff sampling with regression used for inference may be a better alternative. Note that with regression, one can always calculate an estimate of variance for an estimated total. (For example, see Knaub(1996), and note Knaub(2007d).) Key Words: classical ratio estimator, conditionality principle, model failure, probability proportionate to size (PPS), randomization principle, regression, skewed data, superpopulation, total survey error
From InterStat, http://interstat.statjournals.net/ -
A cutoff sample may generally be considered only because it is easy to administer and relatively inexpensive. It is not often recognized that a cutoff sample may also be the option providing the smallest total survey error (TSE). Consider that model-assisted design-based sampling adjusts estimates from samples drawn at random to compensate for the fact that the mean of the random sample can vary greatly from the mean of the population. Thus the importance of regression models in survey statistics is recognized. For cutoff sampling, accuracy may be improved by predicting for many of the 'small' cases that may not be able to report accurately on a frequent basis. Survey resources may then be used to concentrate on data collection for the largest possible observations. There are considerations that may mitigate the impact of model-failure with respect to estimating for the cases where there is no chance of sample selection. This article emphasizes those mitigating conditions.
From InterStat, http://interstat.statjournals.net/
Weighted least squares regression through the origin has many uses in statistical science. An important use is for the estimation of attribute totals from establishment survey samples, where we might use quasi-cutoff sampling. Two questions in particular will be explored here, with respect to survey statistics: (1) How do we know this is performing well? and (2) What if the smallest members of the population appear to behave differently? This review article contains a summary of conclusions from experimental findings, and explanations with numerous references. Key Words: Establishment Surveys, Heteroscedasticity, Model-Based Classical Ratio Estimator, Multiple Attributes, Nonsampling Error, Prediction, Regression Through the Origin, Total Survey Error, Weighted Least Squares Regression
[This is "Small Area Estimation/Prediction" only in that we 'borrow strength.' Also, technically ... more [This is "Small Area Estimation/Prediction" only in that we 'borrow strength.' Also, technically totals are "predicted," rather than "estimated," as this is model-based here. Further, as e is an approximation for epsilon, gamma here is also not exactly the true gamma.] ...
...
From InterStat, http://interstat.statjournals.net/, May 2014:
Here, small area estimation is applied in the sense that we are "borrowing strength" from data outside of given subpopulations for which we are to publish estimated totals, or means, or ratios of totals. We will consider estimated totals for establishment surveys. A subpopulation for which we wish to estimate a total will be called a "publication group" (PG), and data that may be modeled together, using one regression, will be called an "estimation group" (EG). See Knaub(1999, 2001, 2003) regarding this for a more complex application. When a PG consists of a set of EGs, that is stratification. When an EG contains PGs, this is a simple form of small area estimation, because we are using data outside of a given publication group to help estimate statistics/parameters for the model used to estimate for each impacted PG. (In Knaub(1999, 2001), there are overlapping 'areas' as well.) Here we consider very small areas (PGs), which may fall within a 'larger' EG, and here we are only considering one regressor, but this could be generalized (Knaub(1999)). Sample sizes and population sizes considered in this paper can be very small within a given PG, say a State and economic end-use sector. In the case of n = N = 1, a single response is the total for that PG. If it is part of an EG with other data, then if there is a nonresponse in that case, an estimate in place of that observation may be obtained for contribution, for example, to a US-level aggregate number for that end-use sector, and a variance contribution to be added to the US-level variance would be found as well. Further, a scatterplot for such an estimation group, especially if a confidence band were constructed (Knaub(2009), section 4, and Knaub(2012b), Figure 1), could be used to help edit data. If that PG with n = N = 1 were looked at alone, one could not have a scatterplot that would determine whether a response were reasonable for the current circumstances. (A forecast for that one point would not be as good if some event were to cause a break in the time series, and one would have to consider a time series for every single point, many more graphs, and for some there would be no series available. But a scatterplot to accompany this regression modeling would consider every point used in the model. Data for which there are no regressor data, such as "births," are "added on" to totals outside of modeling.) Techniques here may be used for estimation ("prediction") for sample surveys, and to impute for nonresponse for sample surveys and census surveys. There may be applications to other fields of statistics as well.
Key Words: Regression, Model-Based Estimation, Weighted Least Squares, Scatterplots, Small Area Estimation, Data Editing, Establishment Surveys, Seasonality, Borrowing Strength
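A minimal sketch of the EG/PG arrangement described above, assuming one ratio model fit over the whole estimation group and predicted totals returned per publication group; the labels and the handling of nonresponse are illustrative assumptions of mine.

```python
import numpy as np

def eg_model_pg_totals(x, y, pg_labels):
    """Fit one classical ratio estimator over an estimation group (EG) and
    return a total per publication group (PG), 'borrowing strength' across PGs.

    x:         regressor (e.g., previous census value) for every unit in the EG
    y:         current value where reported, np.nan where missing/unsampled
    pg_labels: PG identifier per unit (a PG may contain as little as one unit)"""
    obs = ~np.isnan(y)
    b_hat = y[obs].sum() / x[obs].sum()          # one slope for the whole EG
    y_hat = np.where(obs, y, b_hat * x)          # keep observed, predict missing
    return {pg: float(y_hat[pg_labels == pg].sum()) for pg in np.unique(pg_labels)}

# Hypothetical: a one-unit PG ('PG3') with a nonresponse still gets a prediction,
# because the slope was estimated from the rest of the EG.
x = np.array([10.0, 40.0, 25.0, 8.0])
y = np.array([11.0, 43.0, np.nan, np.nan])
print(eg_model_pg_totals(x, y, np.array(["PG1", "PG1", "PG2", "PG3"])))
```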
-
SAE vs stratification:
Note that if you have a wide geographic region (or some other 'wide' grouping), and one model, say regression through the origin, is appropriate for all the data (checking scatterplots, confidence intervals regarding the prediction errors, and standard errors of the regression coefficients for subgroups), then small area estimation (SAE) might be helpful. But if each part - say, State - should be modeled separately, then the overall group - say a superState, multiple-State region - could benefit by stratification. As in design-based sampling, a 'larger' group benefits by stratification if there is small variance within strata and big differences between strata.
International Conference on Establishment Surveys (ICES), Buffalo, NY, USA; 06/1993
A graphical analysis of heteroscedasticity for linear regression was presented, including measurement and impact of nonlinearity. Comparison was made to the Iteratively Reweighted Least Squares approach and results.
- New note, April 23, 2019: Errata: Consider the first column of the first page. Ken Brewer pointed out to me, when we discussed the coefficient of heteroscedasticity at a later date, that I should not have been using the word "components," but "factors" instead. The key here is the factoring of estimated residuals into random and nonrandom "factors." - Also please note that the accuracy for each trial gamma (the coefficient of heteroscedasticity) in the process might sometimes be substantially improved by involving the corresponding new set of predicted y values. - Further please note that my choice of the goal value "w" for gamma was not a good one, as we do not want to confuse that with regression weight.
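One way to carry out the estimation the note above describes, factoring estimated residuals into random and nonrandom factors, is sketched below. It assumes gamma is estimated by regressing log absolute residuals on log x, re-fitting the WLS slope at each pass (one of several approaches in this literature; the implementation details are mine, not the paper's algorithm).

```python
import numpy as np

def estimate_gamma(x, y, iterations=5):
    """Iteratively estimate the coefficient of heteroscedasticity gamma under
    y_i = b*x_i + e_i, with e_i = (x_i**gamma) * e0_i, e0_i the random factor.

    Each pass re-fits the WLS slope at the current gamma, then regresses
    log|residual| on log(x); the slope approximates gamma."""
    gamma = 0.5                                        # start at the CRE value
    for _ in range(iterations):
        w = x ** (-2.0 * gamma)
        b = np.sum(w * x * y) / np.sum(w * x * x)      # WLS slope at current gamma
        resid = y - b * x
        lx = np.log(x)
        lr = np.log(np.abs(resid) + 1e-12)             # guard against zero residuals
        gamma = np.polyfit(lx, lr, 1)[0]               # slope of log|e| on log x
    return gamma
```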
Proceedings of Seventh International Conference on Establishment Statistics, 2024
This paper, prepared for the Seventh International Conference on Establishment Statistics, notes ... more This paper, prepared for the Seventh International Conference on Establishment Statistics, notes various configurations and uses/techniques for cutoff sampling. The conference was held in Glasgow in June 2024. To meet space requirements, some bibliography items are abbreviated here, as of September 12, 2024. Proceedings are pending. -
There are various concepts regarding the nature and application of cutoff (or cutoff) sampling. We can start consideration with a population, subpopulation or stratum, divided into take-all, take-some, and take-none strata or groups, such as that noted by Benedetti, Bee, and Espa(2010). Other concepts of 'cutoff' sampling involve the combination of any two of those three groups. All such pairings have been considered by various authors. Inference may either be by a design-based technique, perhaps using calibration, or a model-based approach such as the use of ratio models as proposed earlier by Brewer(1963) and Royall(1970), though they both moved on from that work. They considered takeall and take-none strata or groups, with ratio model predictions. Other combinations are discussed, as used by other authors. However, a take-all and take-none combination with prediction was also used in Guadarrama, Molina, and Tillé(2020). Similar methodology used by Knaub has been voluminously practiced at the US Energy Information Administration (EIA) starting by 1990, as noted in Knaub(2023) and in the References and Bibliography found in Knaub(2022), modified due to the multiple attribute nature of almost all surveys, and other considerations. Bias has been shown to be low there. There is substantial potential for expanded use. Advantages are low smallrespondent burden, low cost, and high accuracy for this type of cutoff/quasi-cutoff sampling with prediction, using the same data item in a previous census as the predictor for each item. Quasi-cutoff sampling refers to the fact that establishment respondents report for multiple items, which will vary in size from one item to another. To obtain desired estimated variance of prediction errors associated with predicted totals by item, one may include other smaller establishment respondents, which may be large for certain items. (A few other establishments with smaller items could be included in the sample for real-time model verification, with subsequent possible action such as stratification or the use of quadratic linear regression (Royall(1992), Valliant(2009).) Note that use of size by item in prediction is an advantage over unequal probability sampling. Accuracy and usefulness may vary greatly for different concepts of 'cutoff' sampling. More marginal cases might be reminiscent of work by other authors to provide inference from other nonprobability sampling with some degree of accuracy. An example of a type of cutoff sampling which compares somewhat to convenience sampling is noted. When design-based inference is used, calibration may be helpful, but for convenience sampling, pseudo-weights would be used first. See Elliott and Valliant(2017) and Valliant(2020). Modeling seems to always be a consideration, directly or through calibration or other means. In all cases, total survey error (TSE) should be considered.
Please Note: This simple, proven, highly effective methodology has been developed and extended ov... more Please Note: This simple, proven, highly effective methodology has been developed and extended over many years at the US Energy Information Administration, and has great potential for other applications. ---
Abstract: Official Statistics from establishment surveys are not only the basis for routine monitoring of markets and perhaps systems in general, they are essential for discovering problems for which innovative approaches may be needed. Statistical agencies remain the workhorses for regularly providing this information. For establishment surveys one needs to collect data efficiently, making an effort to reduce burden on small establishments, and reduce costs to the Government, while promoting accuracy of results in terms of total survey error. For over three decades, these demanding standards have been met using quasi-cutoff sampling and prediction, applied extensively to some of the repeated official US energy establishment surveys. This success may be duplicated for other applications where sample surveys occur periodically, and there is an occasional census produced for the same data items. Sometimes stratification is needed, but sometimes the borrowing of strength, as in small area estimation/prediction, may be used. References will be given to help avoid pitfalls. The idea is to encourage expanding this elegant approach to other applications. The material here is an expanded version of a poster for the 2022 Joint Statistical Meetings. This is a tutorial/guide. Appendices are written in stand-alone form.
Edits made to origin. --- Some of the history of a monthly sales and revenue survey, and the cu... more Edits made to origin. --- Some of the history of a monthly sales and revenue survey, and the current status of that survey and others, are given as background to changes now being implemented. Sampling will now take place at the State level, instead of the national level; more complete use will be made of auxiliary information; work has been done to control cv's for several vari- ables simultaneously; and a test originally designed to consider degrees of homogeneity between several similar populations with limited sampling, has been used to investigate the effectiveness of this design at an aggregated (Census division) level.
InterStat:
Ordinary least squares (OLS) regression gets most of the attention in the statis... more InterStat:
Ordinary least squares (OLS) regression gets most of the attention in the statistical literature, but for cases of regression through the origin, say for use with skewed establishment survey data, weighted least squares (WLS) regression is needed. Here will be gathered some information on properties of weighted least squares regression, particularly with regard to regression through the origin for establishment survey data, for use in periodic publications.
-----------------------------------
March 17, 2016: Note that a special approximation for varian... more ----------------------------------- March 17, 2016: Note that a special approximation for variance, regarding estimated totals, was used here, for purposes of various possible needs for production of Official Statistics in a potentially changing environment. Flexibility for possible changes in modeling, data storage, aggregate levels to be published, and avoidance of future data processing errors on old data made this attractive. Simplicity of application in a production environment was emphasized. ----------------------------------- [This is not for time series analyses.] This research concerns multiple regression for survey imputation, when correlation with a given regressor may vary radically over time, and emphasis may shift to other regressors. There may be many applications for this methodology, but here we will consider the imputation of generation and fuel consumption values for electric power producers in a monthly publication environment. When imputation is done by regression, a sufficient amount of good quality observed data from the population of interest is required, as well as good-quality, related regressor data, for all cases. For this application, the concept of 'fuel switching' will be considered. That is, a given power producer may report using a given set of fuels for one time period, but for economic and/or other practical reasons, fuel usage may change dramatically in a subsequent time period. Testing has shown the usefulness of employing an additional regressor or regressors to represent alternative fuel sources. A performance measure found in Knaub(2002, ASA JSM CD) is used to compare results. Also, the impact of regression weights and the formulation of those weights, due to multiple regression, are considered. ----- Jan 8, 2016: Note that this is not a time series technique. This is for cross-sectional surveys, and was designed for use on establishment surveys for official statistics. I have had some discussions on ResearchGate recently, regarding the notion of bias-variance tradeoffs in modeling, and that more complicated models (tend to?) decrease (conditional?) bias and increase variance. Here, however, variance for estimated totals, under the sampling conditions here, is decreased when there is fuel switching. (Acknowledgement: Thank you to those who discussed my questions on ResearchGate.)
Notation: An asterisk means a WLS estimator (as in GS Maddala's notation). Also note that for the... more Notation: An asterisk means a WLS estimator (as in GS Maddala's notation). Also note that for the estimated variance of the prediction error, instead of V*(T*-T), I inappropriately had V*(T*). https://www.amstat.org/sections/SRMS/Proceedings/papers/1996_101.pdf Model-based inference has performed well for electric power establishment surveys at the Energy Information Administration (EIA), using cutoff sampling and weighted, simple linear regression, as pioneered by K.R.W. Brewer, R.M. Royall, and others. However, 'nonutility' generation sales for resale data have proved to be relatively difficult to estimate efficiently. Design-based inference would be even less efficient. A weighted, multiple linear regression model, using a cutoff sample, where one regressor is the data element of interest as captured in a previous census, and another regressor is the nameplate capacity of the generating entity, has proved to be extremely valuable. This is being applied to monthly salnpling, where regressor data come from previous annual census information. Estimates of totals, with their corresponding estimates of variance, have been greatly improved by this methodology. This paper is an abbreviated version of an article found in the electronic journal, hlterStat, located on the Internet at http://interstat.statjournals.net. Joint Statistical Meetings, Chicago, Illinois, USA; 08/1996
Jan. 8, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the... more Jan. 8, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation. .................... Previous notes: The classical ratio estimator (CRE) is very simple, has a long history, and has a stunningly broad range of application, especially with regard to econometrics, and to survey statistics, particularly establishment survey statistics. The CRE has a number of desirable properties, one of which is that the sum of its estimated residuals is always zero. It is easily extended to multiple regression, and a property shown in Sarndal, Swensson and Wretman (1992) may be used to indicate the desirability of this zero sum of estimated residuals feature when constructing regression weights for multiple regression. In the single regressor form, the zero sum of estimated residuals property is related to an interesting phenomenon expressed in Fox (1997). Finally, relationships of the CRE to some other statistics are also considered. -- Note added November 2014: As noted in other works I have done, and elsewhere, for this model, only the individual values of x corresponding to individual y values need to be known, as long as the sum of the remaining x (for out-of-sample cases) is known, and then one can still estimate variance. If we do not know the sum of those remaining N-n x-values (where n is sample size selected minus cases to be imputed), but we know a range for that subtotal of x's, then we know a range of estimated variances for the estimated y-totals to go with that range. --
In repeated surveys there generally are auxiliary/regressor data available for all members of th... more In repeated surveys there generally are auxiliary/regressor data available for all members of the population, that are related to data collected in a current sample or census survey. With regard to modeling, these regressor data can be used to edit the current data through scatterplots, and to impute for missing data through regression. Another use for regressor data may be the study of total survey error. To do this, follow these steps: (1) stratify data by regression model application (the related scatterplots can be used for editing); (2) find predicted values for data not collected, if any, and (3) replace all data that are collected with corresponding predicted values. If model-based ratio prediction is used, variance proportionate to a measure of 'size,' then the sum of the predicted values equals the sum of the observed values they replace. (See "Fun Facts" near the end of this article: fact # 4.) The standard error of the total of the predicted values for every member of a finite population, divided by that total, and expressed as a percent, could be labeled as an estimated relative standard error under a superpopulation, or a model-based RSESP. This RSESP would be influenced by (1) the models chosen, (2) inherent variance, and (3) total survey error (sampling and nonsampling error). This article proposes this model-based RSESP as a survey performance indicator and provides background and examples using both real and artificial data.
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the... more Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw something by G. Shmueli that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. .................... Previous notes: This article is from the Pakistan Journal of Statistics, http://www.pakjs.com/, in a special issue in honor of Ken Brewer. The URL for this article is http://www.pakjs.com/journals//27(4)/27(4)6.pdf . ---
Here we will review some of the historical development of the use of the coefficient of heteroscedasticity for modeling survey data, particularly establishment survey data, and for inference at aggregate levels. Some of the work by Kenneth R. W. Brewer helped develop this concept. Dr. Brewer has worked to combine design-based and model-based inference. Here, however, we will concentrate on regression modeling, and particularly on some of his earlier work.
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the... more Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. This is a standard deviation. .................... Previous notes: From InterStat, http://interstat.statjournals.net/, September 2012:
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator, which has use in quasi-cutoff sampling, balanced sampling, and in econometrics applications. Other applications for this article in other areas of statistics may arise. Multiple regression for a given attribute can be important, but is only considered briefly here. The need for data to estimate for multiple attributes is also important, and must be considered. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, to see the relative impact of factors needed for planning, for each such case. Typically one may consider the volume coverage for an attribute of interest, or related data, say regressor data, to be important, but relative standard errors for estimated totals, or confidence bounds, are needed, to have a better idea of the adequacy of a sample. For multiple attributes (variables of current interest), this may be applied iteratively, as a change in sample for one attribute impacts the sample size for another. Please see the definitions section in https://www.academia.edu/16226638/Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations
Proceedings of the American Statistical Association, Survey Research Methods Section, JSM 2013, 2013
Abstract at bottom - notes first:
Oct 16, 2017: On page 2895, I have this: "... a balanced s... more Abstract at bottom - notes first:
Oct 16, 2017: On page 2895, I have this: "... a balanced sample ... would be somewhat comparable to a random sample...." -
However, a simple random sample for a skewed population would generally result in even less efficiency than using a balanced sample here. It would usually more heavily 'represent' small members of the population.
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation.
....................
Previous notes:
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator (CRE). This has application to quasi-cutoff sampling (simply cutoff sampling when there is only one attribute), balanced sampling, econometric applications, and perhaps others. Multiple regression for a given attribute can occasionally be important, but is only considered briefly here. Nonsampling error always has an impact. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, for resource planning at that base level. Typically one may consider the volume coverage for an attribute of interest, or related size data, say regressor data, to be important, but standard errors for estimated totals are needed to judge the adequacy of a sample. Thus the focus here is on a 'formula' for estimating sampling requirements for a model-based CRE, analogous to estimating the number of observations needed for simple random sampling. Both balanced sampling and quasi-cutoff/cutoff sampling are considered.
Key Words: Classical Ratio Estimator, Volume Coverage, Measure of Size, Model-Based Estimation, Official Statistics, Resource Allocation Planning, Sample Size Requirements, Weighted Least Squares Regression
Edited from InterStat, http://interstat.statjournals.net/ -
This (renamed) 2007 article su... more Edited from InterStat, http://interstat.statjournals.net/ -
This (renamed) 2007 article surveys the use of cutoff sampling and inference by various organizations and as described in the literature. This is a technique often used for establishment surveys. Online searches were made which included the key words "cutoff sampling" and "cut-off sampling." Both spellings are in use. Various approaches are described, but the focus is on the model-based approach, using the classical ratio estimator (CRE). Concluding remarks are made. More material was added in 2014.
....
Key Words: establishment surveys, total survey error, model-based classical ratio estimator, CRE, multiple regression, RSE, RSESP, certainty stratum, link relative estimator
This is a letter to the editor of the Journal of Official Statistics (JOS). It addresses an artic... more This is a letter to the editor of the Journal of Official Statistics (JOS). It addresses an article in the previous issue of JOS on cutoff sampling, which referenced this author, and attempts to clarify some positions, including that with regard to multiple attributes resulting in quasi-cutoff sampling. That article by Benedetti, Bee, and Espa is openly available at: https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/20200206/jos_cut-off_benedetti-mfl.pdf. ... Note that my work has applied to multiple attribute (multiple item) surveys at the US Energy Information Administration (EIA), starting with one circa 1990, using quasi-cutoff sampling where there are tradeoffs such that some members of the population large for one item but not others may not be in the sample. Thus the comment on page 655 in the Benedetti, Bee, and Espa article that I was limited to a "univariate setup" is not actually the case, though apparently that was not clear. ... In their article they go into detail to pick an optimal sample, but that method may not be necessary or appropriate to work with smaller populations such as the many small populations at the US Energy Information Administration. They look at a "take-some" stratum using a design-based approach. Emphasis at the EIA is with the different levels of model-based variance by item that may occur for the not sampled part of each item population. They consider that prediction for that part may be analyzed as bias to be determined after-the-fact. ... Historical data are used for sample selection. ... See my comments on ResearchGate (in the "Comments" area), which address accuracy for the not sampled part of the population, for each item.
This is a book review, for the Journal of Official Statistics (Statistics Sweden), with regard to... more This is a book review, for the Journal of Official Statistics (Statistics Sweden), with regard to nonresponse and model-assisted design-based sampling. The reviewer was asked to compare this to his own experiences in model-based estimation. -- Disclaimer: The reviewer's opinions are his own, and this in no way indicates any endorsement in any respect by his employer. This work was done off-duty. --- JOS: http://www.jos.nu/, and this article is found at http://www.jos.nu/Articles/abstract.asp?article=222351
Errata brought to my attention:
-- Figures with 'confidence bounds' here are actually curved bo... more Errata brought to my attention:
-- Figures with 'confidence bounds' here are actually curved bounds about prediction intervals for predicted y-values. They would better be called "Prediction Bounds."
-- The "estimated standard error of the random factors of the estimated residuals," should have been designated the "estimated standard deviation of the estimated random factors of the estimated residuals." That is, the sigmas are standard deviations, as they are not reduced with increased sample size.
-- This paper is for the prediction or estimation of official statistics. A better title would be as follows: Use of Ratios for Estimation or Prediction of Official Statistics at a Statistical Agency.
----------
----------
From InterStat, 2012, http://interstat.statjournals.net/ -
The US Energy Information Administration (EIA) has made good use of available auxiliary data over a number of years, for a variety of surveys on energy sources and consumption, for estimating the statistics that the EIA mission requires. Such use of already available data reduces data collection burden for a given level of accuracy. Many of these instances relate to a single auxiliary variable, and involve some type of ratio. The many uses of ratios at the EIA are both disparate and unifying: disparate in that the applications can appear to be fairly distinct, but unifying in that there are interrelationships between these methods that may be esthetically pleasing, and of practical importance. Better communication and future improvements may be achieved by considering what is different between these methods, and what they have in common. Here we will explore these ideas. --
Please note that the illustration near the end is a (much) cropped photo I took of a painting done by my grandfather, Karl H. (Henry) Knaub, more than half of a century before I wrote this paper.
:-)
From InterStat; interstat.statjournals.net -
Most sample surveys, especially household surveys, are design-based, meaning sampling and inference (e.g., means and standard errors) are determined by a process of randomization. That is, each member of the population has a predetermined probability of being selected for a sample. Establishment surveys generally have many smaller entities, and a relatively few large ones. Still, each can be assigned a given probability of selection. However, an alternative may be to collect data from generally the largest establishments – some being larger for some attributes than others – and use regression to estimate for the remainder. For design-based sampling, or even for a census survey, such models are often needed to impute for nonresponse. When such modeling would be needed for many small respondents, generally if sample data are collected on a frequent basis, but regressor (related) data are available for all of the population, then cutoff sampling with regression used for inference may be a better alternative. Note that with regression, one can always calculate an estimate of variance for an estimated total. (For example, see Knaub(1996), and note Knaub(2007d).)
Key Words: classical ratio estimator, conditionality principle, model failure, probability proportionate to size (PPS), randomization principle, regression, skewed data, superpopulation, total survey error
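As a minimal sketch of that alternative (illustrative code, not any agency's production implementation), the classical ratio estimator predicts the unsampled remainder from census regressor data:

```python
import numpy as np

def cre_total(y_sample, x_sample, x_population_total):
    """Model-based classical ratio estimator of a total from a cutoff
    sample: observe y for the sampled (largest) cases and predict the
    rest with a ratio model, using census regressor data x."""
    b = y_sample.sum() / x_sample.sum()          # CRE slope (WLS with w = 1/x)
    x_unsampled = x_population_total - x_sample.sum()
    return y_sample.sum() + b * x_unsampled      # observed part + predicted part
```

Only the sampled (x, y) pairs and the x-total for the unsampled cases are needed here, which is one reason the same item from a previous census works well as the regressor.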
From InterStat, http://interstat.statjournals.net/ -
A cutoff sample may generally be considered only because it is easy to administer and relatively inexpensive. It is not often recognized that a cutoff sample may also be the option providing the smallest total survey error (TSE). Consider that model-assisted design-based sampling adjusts for samples drawn at random to compensate for the fact that the mean of the random sample can vary greatly from the mean of the population. Thus the importance of regression models in survey statistics is recognized. For cutoff sampling, accuracy may be improved by predicting for many of the 'small' cases that may not be able to report accurately on a frequent basis. Survey resources may then be used to concentrate on data collection for the largest possible observations. There are considerations that may mitigate the impact of model-failure with respect to estimating for the cases where there is no chance of sample selection. This article emphasizes those mitigating conditions.
From InterStat, http://interstat.statjournals.net/
Weighted least squares regression through the origin has many uses in statistical science. An important use is for the estimation of attribute totals from establishment survey samples, where we might use quasi-cutoff sampling. Two questions in particular will be explored here, with respect to survey statistics: (1) How do we know this is performing well? and (2) What if the smallest members of the population appear to behave differently? This review article contains a summary of conclusions from experimental findings, and explanations with numerous references. Key Words: Establishment Surveys, Heteroscedasticity, Model-Based Classical Ratio Estimator, Multiple Attributes, Nonsampling Error, Prediction, Regression Through the Origin, Total Survey Error, Weighted Least Squares Regression
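A minimal sketch of the WLS slope for regression through the origin, under the regression weight w_i = x_i**(-2*gamma) used throughout these papers (the function name and default are illustrative):

```python
import numpy as np

def wls_origin_slope(y, x, gamma=0.5):
    """WLS slope for regression through the origin with regression
    weight w_i = x_i**(-2*gamma). gamma = 0.5 gives the classical
    ratio estimator; gamma = 1 gives the mean-of-ratios estimator."""
    w = x ** (-2.0 * gamma)
    return np.sum(w * x * y) / np.sum(w * x * x)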
[This is "Small Area Estimation/Prediction" only in that we 'borrow strength.' Also, technically ... more [This is "Small Area Estimation/Prediction" only in that we 'borrow strength.' Also, technically totals are "predicted," rather than "estimated," as this is model-based here. Further, as e is an approximation for epsilon, gamma here is also not exactly the true gamma.] ...
...
From InterStat, http://interstat.statjournals.net/, May 2014:
Here, small area estimation is applied in the sense that we are "borrowing strength" from data outside of given subpopulations for which we are to publish estimated totals, or means, or ratios of totals. We will consider estimated totals for establishment surveys. A subpopulation for which we wish to estimate a total will be called a "publication group" (PG), and data that may be modeled together, using one regression, will be called an "estimation group" (EG). See Knaub(1999, 2001, 2003) regarding this for a more complex application. When a PG consists of a set of EGs, that is stratification. When an EG contains PGs, this is a simple form of small area estimation because we are using data outside of a given publication group to help estimate statistics/parameters for that model, used to estimate for each impacted PG. (In Knaub(1999, 2001), there are overlapping 'areas' as well.) Here we consider very small areas (PGs), which may fall within a 'larger' EG, and here we are only considering one regressor, but this could be generalized (Knaub(1999)). Sample sizes and population sizes considered in this paper can be very small within a given PG, say a State and economic end-use sector. In the case of n = N = 1, a single response is the total for that PG. If it is part of an EG with other data, then if there is a nonresponse in that case, an estimate in place of that observation may be obtained for contribution, for example, to a US-level aggregate number for that end-use sector, and a variance contribution to be added to the US-level variance would be found as well. Further, a scatterplot for such an estimation group, especially if a confidence band were constructed (Knaub(2009), section 4, and Knaub(2012b), Figure 1) could be used to help edit data. If that PG with n = N = 1 were looked at alone, one could not have a scatterplot that would determine if a response were reasonable for the current circumstances. (A forecast for that one point would not be as good if some event were to cause a break in the time series, and one would have to consider a time series for every single point, many more graphs, and for some there would be no series available. But a scatterplot to accompany this regression modeling would consider every point used in the model. Data for which there are no regressor data, such as "births," are "added on" to totals outside of modeling.) Techniques here may be used for estimation ("prediction") for sample surveys, and to impute for nonresponse for sample surveys and census surveys. There may be applications to other fields of statistics as well.
Key Words: Regression, Model-Based Estimation, Weighted Least Squares, Scatterplots, Small Area Estimation, Data Editing, Establishment Surveys, Seasonality, Borrowing Strength
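A hedged sketch of the PG/EG mechanics described above, assuming one ratio model per estimation group (the names and data layout are illustrative, not from the paper):

```python
import numpy as np

def pg_totals_from_eg(records, gamma=0.5):
    """Borrowing strength: fit one WLS ratio model on the whole
    estimation group (EG), then accumulate a predicted total for each
    publication group (PG). `records` is a list of (pg, x, y) tuples,
    with y = None for cases not in the sample (or nonrespondents)."""
    xs = np.array([x for _, x, y in records if y is not None], float)
    ys = np.array([y for _, x, y in records if y is not None], float)
    w = xs ** (-2.0 * gamma)
    b = np.sum(w * xs * ys) / np.sum(w * xs * xs)   # one EG-level slope
    totals = {}
    for pg, x, y in records:
        totals[pg] = totals.get(pg, 0.0) + (y if y is not None else b * x)
    return totals
```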
-
-
SAE vs stratification:
Note that if you have a wide geographic region (or some other 'wide' grouping), and one model, say regression through the origin, is appropriate for all the data (checking scatterplots, confidence intervals regarding the prediction errors, and standard errors of the regression coefficients for subgroups), then small area estimation (SAE) might be helpful. But if each part - say, State - should be modeled separately, then the overall group - say a superState, multiple-State region - could benefit by stratification. As in design-based sampling, a 'larger' group benefits by stratification if there is small variance within strata and big differences between strata.
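One way to program the kind of check described above (a sketch under the ratio model with weight x**(-2*gamma), not a formal test): estimate the slope and its model-based standard error separately for each subgroup, and lean toward stratification when the slopes clearly differ.

```python
import numpy as np

def slope_with_se(y, x, gamma=0.5):
    """CRE-type slope and its model-based standard error for one
    subgroup, for judging whether subgroups can share one model
    (pool, as in SAE) or should be separate strata."""
    w = x ** (-2.0 * gamma)
    b = np.sum(w * x * y) / np.sum(w * x * x)
    resid = y - b * x
    # estimated variance of the random factors of the estimated residuals
    sigma2 = np.sum(w * resid**2) / (len(x) - 1)
    return b, np.sqrt(sigma2 / np.sum(w * x * x))
```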
International Conference on Establishment Surveys (ICES), Buffalo, NY, USA; 06/1993
A graphical analysis of heteroscedasticity for linear regression was presented, including measurement and impact of nonlinearity. Comparison was made to the Iteratively Reweighted Least Squares approach and results.
- New note, April 23, 2019: Errata: Consider the first column of the first page. Ken Brewer pointed out to me, when we discussed the coefficient of heteroscedasticity at a later date, that I should not have been using the word "components," but "factors" instead. The key here is the factoring of estimated residuals into random and nonrandom "factors." - Also please note that the accuracy for each trial gamma (the coefficient of heteroscedasticity) in the process might sometimes be substantially improved by involving the corresponding new set of predicted y values. - Further please note that my choice of the goal value "w" for gamma was not a good one, as we do not want to confuse that with regression weight.
Tool to accompany "Estimating the Coefficient of Heteroscedasticity" for heteroscedasticity in regression.
This is a review paper for cutoff sampling, and near-cutoff (quasi-cutoff) sampling for multiple variables of interest, using prediction (regression model-based estimation). It connects some ideas in various papers to give the reader a place to start when deciding if model-based estimation and sampling, particularly cutoff or quasi-cutoff sampling, is the best option for a given application. One might expect that with a highly skewed population, the sample size required will be much smaller than for a probability-of-selection design-based sample, thus greatly reducing resource requirements, including monetary costs, but results can also be more accurate. Circa 1990, results for an electric industry survey of electric sales volume and revenues (actually many small surveys, as there are many categories when publishing official statistics by economic end-use and State) were compared for design-based versus model-based sampling and estimation. Using a stratified random sample, stratified by size, with a certainty stratum, results had been obtained, and then compared to model-based results using only the certainty stratum data from the sample survey, and previous, less frequently collected census survey data as regressor data. Estimated totals and relative standard errors were compatible. In the years, actually decades, since then, many thousands of results for that and other energy establishment surveys with highly skewed data were obtained which passed other tests and proved invaluable. In 1999, a survey was suddenly expected to provide many times more estimated totals for data aggregations than it had previously. The data were already quite problematic considering the variety of establishments and data collection challenges. These additional requirements were extreme and needed almost immediately. I developed a flexible small area estimation methodology, using a simplified system of quasi-cutoff sampling, prediction, and 'borrowing strength' between subpopulations, to meet this nearly impossible challenge. This can perform well in situations where probability-of-selection-based methods would not be feasible. Since then, quasi-cutoff sampling with prediction has also been shown to be useful for other energy establishment surveys, and communication with other statisticians, experience, and literature has revealed that there are other good uses for this methodology. This paper provides some guidance, examples, and references to assist the reader who might apply this methodology, directly or indirectly, and/or be curious about it. Other papers on my ResearchGate page, not referenced here, might also be of assistance. (Note added page 7 plus "approximation" noted before equation - 11/23/2023.)
Tool for model selection.
Note: Terminology alert: The use of the term "test data" here is informal, and may be applied at various stages. It may sometimes conflict with formal use.
Also on ResearchGate: Erratum: These would not be 'confidence bounds,' but rather curved bounds about prediction intervals for predicted y-values. They would better be called "Prediction Bounds." - Accordingly, I will change the title. Apologies. ------ Excel spreadsheet tool for graphing prediction bounds about y-value predictions for a classical ratio estimator/linear regression through the origin. (Note that normality of estimated residuals near the origin would often be problematic.) ----- Software programmed for this purpose (using STDI from SAS PROC REG, just as an example) would be more efficient, but this should function with any spreadsheet. Further, this demonstrates an analysis of this process. ----- Note that confidence bounds on b would make a wedge-shaped appearing figure within the predicted y bounds shown.
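A rough programmed analogue of that spreadsheet (a sketch only: the t-based interval, a known gamma, and the function name are all assumptions; STDI from SAS PROC REG plays a similar role):

```python
import numpy as np
from scipy import stats

def prediction_bounds(y, x, x_new, gamma=0.5, level=0.95):
    """Curved bounds about the prediction interval for predicted
    y-values, for WLS regression through the origin with regression
    weight w = x**(-2*gamma)."""
    w = x ** (-2.0 * gamma)
    b = np.sum(w * x * y) / np.sum(w * x * x)
    resid = y - b * x
    sigma2 = np.sum(w * resid**2) / (len(x) - 1)
    # prediction-error variance at x_new: new-case noise + slope uncertainty
    var_pred = sigma2 * (x_new ** (2 * gamma) + x_new**2 / np.sum(w * x * x))
    half = stats.t.ppf(0.5 + level / 2.0, df=len(x) - 1) * np.sqrt(var_pred)
    return b * x_new - half, b * x_new + half
```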
Also on ResearchGate: This study examines various aspects of a model-based sample selection, using a spreadsheet. (See "Conclusions" sheet.) It was applied to one monthly sample. To be somewhat more confident of results, it should be applied to substantially more such monthly samples, with regressor data from various years. In this example, variances were high, but the sample used in production was larger. It is also part of a much larger survey, with restrictive burden considerations, or the sample might be even larger still. As always, uncertainty must be considered when making decisions. If all suggestions in this work are followed, this should lower overall variance. Stratification can be very helpful. Nonsampling measurement error is also discussed. If sample sizes are too large, nonsampling error, respondent burden, and statistical agency burden may become too large. However, as nonsampling error must always be considered, and sampling error may be two or more times the relative standard error (RSE), an RSE estimate over 1% may often be far too large to be consistent with data user expectations. But a census (RSE = 0%) is costly and no better, or often even worse, if nonsampling error is high. (Please see the sheet on total survey error, TSE.) – Disclaimer: The views expressed are those of the author, and are not official US Energy Information Administration (EIA) positions, unless claimed to be in an official US Government document.
Also on ResearchGate: Excel tool for anticipating variance as a part of estimating sample size requirements from a finite population, when using the classical ratio estimator.
Accompanies "Projected Variance for the Model-Based Classical Ratio Estimator: Estimating Sample Size Requirements."
Likely better to use your programming software, but this works, and demonstrates the principles.
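In programming software, the projected (planning) RSE for a given cutoff sample size might be sketched as follows; sigma2 (the anticipated variance of the random factors of the estimated residuals) and b_guess would come from prior data, and all names are illustrative:

```python
import numpy as np

def projected_rse_total(x_all, n, sigma2, b_guess, gamma=0.5):
    """Projected RSE of the CRE-predicted total when the n largest x
    are taken as the cutoff sample. Planning sketch only."""
    x_desc = np.sort(np.asarray(x_all, float))[::-1]
    x_in, x_out = x_desc[:n], x_desc[n:]
    w = x_in ** (-2.0 * gamma)
    var_t = sigma2 * (np.sum(x_out ** (2 * gamma))                 # unsampled-case noise
                      + np.sum(x_out) ** 2 / np.sum(w * x_in**2))  # slope uncertainty
    return np.sqrt(var_t) / (b_guess * x_desc.sum())
```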
......
Note re the paper:
On page 2895, I have this: "... a balanced sample ... would be somewhat comparable to a random sample...."
However, a simple random sample for a skewed population would generally result in even less efficiency than using a balanced sample here. It would usually more heavily 'represent' small members of the population.
Presentation slides for a lunch gathering of mathematical statisticians at the US Energy Information Administration (EIA) on September 27, 2017. - Topic: Quasi-Cutoff Sampling and the Classical Ratio Estimator. - Background is given and a discussion is provided as to why this is so useful for many establishment surveys. The historical development of this methodology at the EIA is reviewed, with examples and graphics displayed, taken from some of a number of papers illustrating innovative problem-solving work for the EIA done over a number of years.
EIA Seminar slides to explain 'prediction' to other statisticians and data analysts. Graphics are illustrative. The 3-D scatterplots (Joel Douglas) would be rotated to 2-D views for analyses, but are shown in partially rotated form here, for purposes of qualitative illustration. [Note that a random sample can easily be drawn which is substantially "unrepresentative" of the population, perhaps especially with continuous data that have a few outsized members of a finite population. That is why model-assisted design-based sampling can be so useful. (Sometimes the model may be more important.)] EIA Seminar, Washington, DC, USA; 06/2010
Proceedings of the Survey Research Methods Section, ASA, JSM 2017, 2017
The concepts of design-based and model-based ratio estimation, especially for the classical ratio estimator (CRE), are reviewed and compared. Ratio estimation is useful for energy and natural resource official statistics - agriculture and forestry - as well as in social science. Organization/business statistics uses may involve simple econometrics, as well as survey statistics. There are numerous applications. Notably, ratio estimation is very often useful for highly skewed establishment survey populations where, per K.R.W. Brewer, there should be at least as much implicit heteroscedasticity as for that of the CRE. Meaningfulness might be enhanced by understanding this comparison. The model-based and design-based interpretations of the CRE, their corresponding concepts of variance and bias, with relation to sampling and estimation, are reviewed, and familiar extensions of these estimators are also considered. Simple random sampling, cutoff, stratified, cluster, and unequal probability of selection methodologies, and some history, are of interest. Even if a regression model is not explicitly considered, this review considers the role it plays, regardless.
- Also, in the paper, note that if we say that the "E" in CRE refers to the R_hat, or b, then my comment about "CRP" in the paper is incorrect, as it only makes sense with regard to the predicted y values or predicted totals, which is what I was thinking about. Sorry. - Also, in section 5, "Variance," I note that Valliant, Dorfman, and Royall (2000) does "...not [explicitly] use regression weights..." but, I should have said, implicitly assumes that the coefficient of heteroscedasticity, gamma, is 0.5. -
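For reference, the point estimates under the two interpretations coincide; with b the sample ratio and X the population x-total, the model-based predicted total equals the familiar design-based ratio estimate (the variance treatments are what differ):

```latex
\[
  \hat{Y} \;=\; \sum_{i \in s} y_i \;+\; b \sum_{i \notin s} x_i
  \;=\; b\,X, \qquad
  b = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i}.
\]
```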
This is a note on best practice quasi-cutoff sampling with prediction, extremely useful for Official Statistics, and in very heavy use at the US Energy Information Administration, which was not adequately covered in a paper on US Federal Government establishment 'cutoff sampling.'
Government agency establishment surveys may be periodic sample surveys with an occasional census of the same or more variables/data items which provide predictor data. In addition to size, stratification might possibly be by specific type of establishment in the same population such as ownership category, or by geographic region, or some other characteristic which is subsumed under a larger category of interest. In an electric power plant survey in 1999, and a natural gas survey article in 2014, the author studied the data with regard to combining small areas to 'borrow strength' when modeling to predict totals. In 1999, I showed that for generation data from hydroelectric power plants, small areas from different geographic climate divisions of the National Climatic Data Center, under the National Oceanic and Atmospheric Administration (NCDC/NOAA), should not be combined, and thus would be in different strata. Similarly, upon consulting with me on stratification, another member of the US Energy Information Administration who was working on a survey on gas wells showed that shale and non-shale well natural gas production should be modeled separately. An example in a book by Ray Chambers and Robert Clark referenced in this paper shows sugarcane farms that should be separated by geographic location because of differences in soil, water and climate, which influenced the variable "area" as a predictor for farm receipts, costs, profit, or harvest yield. Thus Chambers and Clark were looking beyond having the same data item in a previous census as a size measure/predictor for that item in a current sample, which Cochran found useful, also referenced in this paper. However, for Official Statistics, we will prefer the latter case. When using a different data item for a predictor, multiple regression might be needed to avoid bias. Conversely, when the same data item is used, multiple regression may just increase variance. Evidence that the same data item in a previous census is a sufficient predictor for a ratio model might be that the expected range Ken Brewer found for the level of heteroscedasticity, in another of the references for this paper, is generally found to be the empirical range obtained when this is the case, if data quality is not problematic. (In the References, see Chambers and Clark(2012), Cochran(1953), and Brewer(2002).)
Methods are discussed here which are useful when deciding if ordinary least squares (OLS) regression is justified. One could expect that a larger predicted value would be associated with a larger estimate for the sigma of the estimated residuals. However, problems such as model misspecification, and data quality issues might impact upon that. Hypothesis tests to examine the decision "Do we have heteroscedasticity or not?" are not useful from a practical perspective. The practical question is "How much heteroscedasticity do we have, even if none?" Here we consider answering that question by selecting a coefficient of heteroscedasticity for use in regression weights, in weighted least squares (WLS) regression. Then the practical impact on estimates of regression coefficients, and predicted values, and in particular, the impact on estimated variances of prediction errors, can be evaluated. In the summary, one will see how to use an estimate of the coefficient of heteroscedasticity (a spreadsheet has been provided for obtaining this, or a default value) to estimate regression weights for use in WLS regression.
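One simple iterative sketch in the spirit of the above (this is not the author's spreadsheet procedure; the log-regression step and all names are assumptions): since the standard deviation of the residuals is roughly proportional to (predicted y)**gamma, the slope of log|residual| on log(predicted y) gives a rough gamma, and re-fitting with the new weights uses the new set of predicted y values, as suggested in the note above.

```python
import numpy as np

def estimate_gamma(y, x, iterations=5):
    """Rough estimate of the coefficient of heteroscedasticity, gamma:
    alternate between fitting the WLS slope and regressing
    log|residual| on log(predicted y). A sketch only."""
    gamma = 0.5
    for _ in range(iterations):
        w = x ** (-2.0 * gamma)
        b = np.sum(w * x * y) / np.sum(w * x * x)
        pred = b * x
        resid = y - pred
        keep = (np.abs(resid) > 0) & (pred > 0)
        # slope of log|e| on log(pred) approximates gamma
        gamma = np.polyfit(np.log(pred[keep]), np.log(np.abs(resid[keep])), 1)[0]
    return gamma
```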
Revised, May 10, 2020: Essential heteroscedasticity is due to the size measure, z (ideally, best predicted y), which is used in the regression weight, where that weight for a given member of the population is z raised to the -2γ, where γ is the coefficient of heteroscedasticity, gamma. Heteroscedasticity is a natural part of the error structure. But nonessential heteroscedasticity may be considered a faux heteroscedasticity, caused by 'problems' which may be addressed to reduce their impact. These problems are with regard to model and data related issues, such as modeling with missing independent variables, perhaps overfitting, not applying these models only to the subpopulations they reasonably represent,* and data measurement error. These situations will be described. -- March 18, 2019: Some conditions may actually reduce the impact of essential heteroscedasticity. Examples may include some cases with omitted variables and/or data quality problems. ---------- *The "groups" or "subpopulations" could be strata as discussed in Chambers, R, and Clark, R(2012), An Introduction to Model-Based Survey Sampling with Applications, Oxford Statistical Science Series, with scatterplots on page 58 in that book.
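In the notation used across these notes, the model just described can be written compactly; here ε₀ᵢ is the random factor with constant variance σ₀², and zᵢ is the size measure:

```latex
\[
  y_i = \beta x_i + \epsilon_i, \qquad
  \epsilon_i = \epsilon_{0i}\, z_i^{\gamma}, \qquad
  w_i = z_i^{-2\gamma}, \qquad
  V(\epsilon_i) = \sigma_0^{2}\, z_i^{2\gamma}.
\]
```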
This note reviews insightful observations by K.R.W. (Ken) Brewer, found in his book published in 2002, with regard to the degree of heteroscedasticity to be expected for survey populations. Details and implications are noted. -
-- -- Further please note: The first sentence of the introduction could read as follows: "Heteroscedasticity, the change in variance of Y|predicted-y, or epsilon|predicted-y (in the simplest case, this would be for V(Y|x)), is very often referred to as a 'problem' that needs to be 'solved' or 'corrected.'" But heteroscedasticity in regression is a feature, not a bug. -
InterStat - Practical Methods for Electric Power (and other) Survey Data - edits made -
Establishment surveys often create distinct circumstances, and research has been generated to solve many of the resulting statistical problems, to varying degrees. The application of the results of this research, regarding estimation, imputation and editing, may or may not be found to also be useful for household surveys or other applications. For electric power data, the circumstances may be even more unusual than for many other establishment surveys. Still, research to solve these problems, although inspired by the need to cope with given situations, may result in methods that are more generally useful. This appears to be the case for much of the work done since approximately 1988 at the Office of Coal, Nuclear, Electric and Alternate Fuels (CNEAF), within the Energy Information Administration (EIA). This work is briefly reviewed and current efforts are described. Commentary is given regarding the practical, problem solving emphasis of these methods. Both census surveys and sample surveys are considered. Sampling errors and nonsampling errors are discussed, as well as the usefulness of regression modeling for purposes of editing and/or imputation. The author's opinions are his own and not EIA policy unless designated by other documents. /////
Note: Page 5 edited on 9/5/2022. ...
Note: The argument mentioned on page 7 (where a standard error of 25 is repeated) such that a larger prediction should have larger variance is actually just true for sigma of the estimated residuals. See https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity and reference there to Ken Brewer.
Short version. (Long version on InterStat.) --
Model-based inference has performed well for electric power establishment surveys at the Energy Information Administration (EIA), using cutoff sampling and weighted, simple linear regression, as pioneered by K.R.W. Brewer, R.M. Royall, and others. However, 'nonutility' generation sales for resale data have proved to be relatively difficult to estimate efficiently. Design-based inference would be even less efficient. A weighted, multiple linear regression model, using a cutoff sample, where one regressor is the data element of interest as captured in a previous census, and another regressor is the nameplate capacity of the generating entity, has proved to be extremely valuable. This is being applied to monthly sampling, where regressor data come from previous annual census information. Estimates of totals, with their corresponding estimates of variance, have been greatly improved by this methodology. This paper is an abbreviated version of an article found in the electronic journal InterStat.
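A hedged sketch of such a two-regressor WLS model through the origin (previous-census value plus nameplate capacity); the size measure z = x1 + x2 used in the weight here is purely illustrative, since the papers favor a best-predicted-y-type size measure:

```python
import numpy as np

def two_regressor_total(y, x1, x2, x1_out_sum, x2_out_sum, gamma=0.5):
    """WLS multiple linear regression through the origin with two
    regressors; returns the observed sum plus the predicted
    out-of-sample part of the total."""
    X = np.column_stack([x1, x2]).astype(float)
    z = X.sum(axis=1)                  # size measure for the weight (an assumption)
    w = z ** (-2.0 * gamma)
    beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return np.sum(y) + np.array([x1_out_sum, x2_out_sum]) @ beta
```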
Note: This is from the American Statistical Association (ASA) Survey Research Methods Section (SRMS) Proceedings of 1992. It was presented at the Joint Statistical Meetings (JSM) circa August 1992. - The ASA SRMS Proceedings are all found at the following URL: http://www.amstat.org/sections/srms/Proceedings/ This paper, under the web pages for 1992, starts at page 876. ------------------------------------------------------------------------- This is the third in a series of papers which have dealt with adapted uses of linear model sampling and analyses for electric power industry data. Several applications are outlined, including monthly estimation of fuel costs per million BTU when neither total costs nor total BTU are known or estimated accurately and data are preliminary. Also included is the combination of two estimators, each of which uses a different regressor, for the estimation of generation expense. This latter process, unlike a single multiple regression, makes better use ...
Abstract to be found at http://www.amstat.org/sections/SRMS/Proceedings/ - Joint Statistical Meetings (JSM) 2013, Session 89. Projected Variance for the Model-based Classical Ratio Estimator: Estimating Sample Size Requirements. Sponsor: Survey Research Methods Section. Keywords: Model-based Estimation, Classical Ratio Estimator, Official Statistics, Resource Allocation Planning, Volume Coverage, Sample Size Requirements. James Knaub, U.S. Energy Information Administration. - Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator (CRE). This has application to quasi-cutoff sampling (simply cutoff sampling when there is only one attribute), balanced sampling, econometrics applications, and perhaps others. Multiple regression for a given attribute can occasionally be important, but is only considered briefly here. Nonsampling error always has an impact. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, for resource planning at that base level. Typically one may consider the volume coverage for an attribute of interest, or related size data, say regressor data, to be important, but standard errors for estimated totals are needed to judge the adequacy of a sample. Thus the focus here is on a 'formula' for estimating sampling requirements for a model-based CRE, analogous to estimating the number of observations needed for simple random sampling. Balanced and cutoff sampling are considered. - (When estimating the WLS version of MSE (random factors of residuals only), the smallest observations in previous data test sets may sometimes best be ignored due to their sometimes relatively lower data quality in highly skewed establishment survey data, when samples are frequently collected, as noted by myself and other colleagues. - JRK - October 2014.) ----- For multiple attributes (variables of current interest), this may be applied iteratively, as a change in sample for one attribute impacts the sample size for another. Please see the definitions section in https://www.researchgate.net/publication/261472614_Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations?ev=prf_pub.
Renamed - This article surveys the use of cutoff sampling and inference by various organizations and as described in the literature. This is a technique often used for establishment surveys. Online searches were made which included the key words "cutoff sampling" and "cut-off sampling." Both spellings are in use. Various approaches are described, but the focus is on the model-based approach, using the classical ratio estimator (CRE). Conclusions are drawn. - A five page addendum was added in 2014 regarding how this actually applies to quasi-cutoff sampling, and how that is defined. The addendum also addresses "representativeness," multiple regression, and the use of scatterplots. - You could think of a cutoff sample as a census where you have to impute for missing data, as if for nonresponse. Many census surveys likely impute a bigger portion of the estimated totals. Further, those nonresponses are more likely "nonignorable." The missing data for even the quasi-cutoff samples are generally nearer the origin, and considering the nature of heteroscedasticity, less likely to be a problem. In fact, for the smallest x, prediction of y may work better than trying/struggling to observe them and consequently obtaining lower data quality (more measurement error). Data near the origin often have proportionately greater measurement error. - And we have a good uncertainty measure for much of this: the variance of the prediction error. - [Note that a random sample can easily be drawn which is substantially "unrepresentative" of the population, especially with continuous data that have a few outsized members of a finite population. That is why model-assisted design-based sampling can be so useful. (Sometimes the model may be more important.)]
The classical ratio estimator (CRE) is very simple, has a long history, and has a stunningly broad range of application, especially with regard to econometrics, and to survey statistics, particularly establishment survey statistics. The CRE has a number of desirable properties, one of which is that the sum of its estimated residuals is always zero. It is easily extended to multiple regression, and a property shown in Sarndal, Swensson and Wretman (1992) may be used to indicate the desirability of this zero sum of estimated residuals feature when constructing regression weights for multiple regression. In the single regressor form, the zero sum of estimated residuals property is related to an interesting phenomenon expressed in Fox (1997). Finally, relationships of the CRE to some other statistics are also considered. -- Note added November 2014: As noted in other works I have done, and elsewhere, for this model, only the individual values of x corresponding to individual y values need to be known, as long as the sum of the remaining x (for out-of-sample cases) is known, and then one can still estimate variance. If we do not know the sum of those remaining N-n x-values (where n is sample size selected minus cases to be imputed), but we know a range for that subtotal of x's, then we know a range of estimated variances for the estimated y-totals to go with that range. --- Another possibility, when knowing all of the individual x-values for a population is not feasible, might be to work out a regression prediction model version of a double sampling approach using ratio estimation. In traditional double sampling (two-phase sampling), a larger probability sampling for x values is taken in a first sample, the first phase, either to help stratify and/or to help in regression or ratio estimation, and a smaller probability sub-sampling of y values is taken in a second sample, the second phase. This is for probability sampling and the traditional estimation depends upon that. But a prediction version would mean reliance on regression in the estimation following the second phase. But what about estimating the x-total from the first stage sample? Would that need to be a randomized sample of x? If so, that may not be very efficient for an establishment survey. Stratify? And if used in an estimator, how do you interpret a final variance estimate that is based partially on randomization (for the x-total), and partially on prediction? (This has come up in discussions with Samson Adeshiyan.) Perhaps instead, the estimated x-total could be based on a cutoff or other sampling of x, and another known variable, z, for which we have a census of z. Then something may be worked out in two phases. But if we have z, it might be more effective to just do the classical ratio estimation (single phase) for model-based estimation shown here in this paper.
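The zero-sum property is easy to verify numerically (made-up numbers, purely for illustration):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])
y = np.array([3.1, 7.2, 13.8, 20.5])
b = y.sum() / x.sum()                      # CRE slope, i.e., WLS with w = 1/x
print(np.isclose((y - b * x).sum(), 0.0))  # True: estimated residuals sum to zero
```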
Excel spreadsheet tool for graphing confidence bounds about y-value predictions for a classical ratio estimator/regression line through the origin. (Note normality near the origin would often be problematic.) ----- Software programmed for this purpose would be more efficient, but this should function with any spreadsheet. Further, this demonstrates an analysis of this process. ----- Note that confidence bounds on b would make a wedge-shaped appearing figure within the predicted y bounds shown. This could be compared to the bowed bounds found for OLS regression.
This article is from the Pakistan Journal of Statistics, http://www.pakjs.com/, in a special issue in honor of Ken Brewer. The URL for this article is http://www.pakjs.com/journals//27(4)/27(4)6.pdf . --- Here we will review some of the historical development of the use of the coefficient of heteroscedasticity for modeling survey data, particularly establishment survey data, and for inference at aggregate levels. Some of the work by Kenneth R. W. Brewer helped develop this concept. Dr. Brewer has worked to combine design-based and model-based inference. Here, however, we will concentrate on regression modeling, and particularly on some of his earlier work.
Knaub and Douglas in-house presentation, Jun 29, 2010
In-house presentation by Knaub and Douglas on use of quasi-cutoff sampling and prediction, including editing. Graphics included.
Survey Weights – a SAGE encyclopedia entry - Knaub, J. (2007). Survey weights. In N. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 981-982). Thousand Oaks, CA: SAGE Publications, Inc. doi: http://dx.doi.org/10.4135/9781412952644.n447 "Used with permission by SAGE Publications, Inc. For any further distribution or usage of the material, contact permissions@sagepub.com". After downloading, this is not to be further distributed without obtaining permission from Sage.
Journal or Conference Papers by James Knaub
Abstract: Official Statistics from establishment surveys are not only the basis for routine monitoring of markets and perhaps systems in general, they are essential for discovering problems for which innovative approaches may be needed. Statistical agencies remain the workhorses for regularly providing this information. For establishment surveys one needs to collect data efficiently, making an effort to reduce burden on small establishments, and reduce costs to the Government, while promoting accuracy of results in terms of total survey error. For over three decades, these demanding standards have been met using quasi-cutoff sampling and prediction, applied extensively to some of the repeated official US energy establishment surveys. This success may be duplicated for other applications where sample surveys occur periodically, and there is an occasional census produced for the same data items. Sometimes stratification is needed, but sometimes the borrowing of strength, as in small area estimation/prediction, may be used. References will be given to help avoid pitfalls. The idea is to encourage expanding this elegant approach to other applications. The material here is an expanded version of a poster for the 2022 Joint Statistical Meetings. This is a tutorial/guide. Appendices are written in stand-alone form.
Ordinary least squares (OLS) regression gets most of the attention in the statistical literature, but for cases of regression through the origin, say for use with skewed establishment survey data, weighted least squares (WLS) regression is needed. Here will be gathered some information on properties of weighted least squares regression, particularly with regard to regression through the origin for establishment survey data, for use in periodic publications.
March 17, 2016: Note that a special approximation for variance, regarding estimated totals, was used here, for purposes of various possible needs for production of Official Statistics in a potentially changing environment. Flexibility for possible changes in modeling, data storage, aggregate levels to be published, and avoidance of future data processing errors on old data made this attractive. Simplicity of application in a production environment was emphasized.
-----------------------------------
[This is not for time series analyses.]
This research concerns multiple regression for survey imputation, when correlation with a given regressor may vary radically over time, and emphasis may shift to other regressors. There may be many applications for this methodology, but here we will consider the imputation of generation and fuel consumption values for electric power producers in a monthly publication environment. When imputation is done by regression, a sufficient amount of good quality observed data from the population of interest is required, as well as good-quality, related regressor data, for all cases. For this application, the concept of 'fuel switching' will be considered. That is, a given power producer may report using a given set of fuels for one time period, but for economic and/or other practical reasons, fuel usage may change dramatically in a subsequent time period. Testing has shown the usefulness of employing an additional regressor or regressors to represent alternative fuel sources. A performance measure found in Knaub(2002, ASA JSM CD) is used to compare results. Also, the impact of regression weights and the formulation of those weights, due to multiple regression, are considered. ----- Jan 8, 2016: Note that this is not a time series technique. This is for cross-sectional surveys, and was designed for use on establishment surveys for official statistics. I have had some discussions on ResearchGate recently, regarding the notion of bias-variance tradeoffs in modeling, and that more complicated models (tend to?) decrease (conditional?) bias and increase variance. Here, however, variance for estimated totals, under the sampling conditions here, is decreased when there is fuel switching. (Acknowledgement: Thank you to those who discussed my questions on ResearchGate.)
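A sketch of imputation along these lines, with an added regressor representing the alternative fuel (all names, the missing-data layout, and the size measure in the weight are illustrative assumptions, not the production method):

```python
import numpy as np

def impute_with_fuel_switching(y_obs, x_primary, x_alt, mask_missing, gamma=0.5):
    """Imputation by WLS multiple regression through the origin, with
    an added regressor for an alternative fuel source. x_primary: the
    same item from a previous period; x_alt: regressor representing
    the alternative fuel; mask_missing: boolean array of cases to impute."""
    X = np.column_stack([x_primary, x_alt]).astype(float)
    z = X.sum(axis=1)                  # size measure for the weight (an assumption)
    w = z ** (-2.0 * gamma)
    obs = ~mask_missing
    Xo, yo, wo = X[obs], y_obs[obs], w[obs]
    beta = np.linalg.solve(Xo.T @ (Xo * wo[:, None]), Xo.T @ (wo * yo))
    y_filled = y_obs.copy()
    y_filled[mask_missing] = X[mask_missing] @ beta   # imputed values
    return y_filled
```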
Joint Statistical Meetings, Chicago, Illinois, USA; 08/1996
....................
Previous notes:
From InterStat, http://interstat.statjournals.net/, September 2012:
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator, which has use in quasi-cutoff sampling, balanced sampling, and in econometrics applications. Other applications for this article in other areas of statistics may arise. Multiple regression for a given attribute can be important, but is only considered briefly here. The need for data to estimate for multiple attributes is also important, and must be considered. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, to see the relative impact of factors needed for planning, for each such case. Typically one may consider the volume coverage for an attribute of interest, or related data, say regressor data, to be important, but relative standard errors for estimated totals, or confidence bounds, are needed, to have a better idea of the adequacy of a sample. For multiple attributes (variables of current interest), this may be applied iteratively, as a change in sample for one attribute impacts the sample size for another. Please see the definitions section in https://www.academia.edu/16226638/Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations
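A self-contained planning sketch of the 'formula' idea described here, inverting the projected variance to find the smallest cutoff sample size meeting a target RSE (sigma2 and b_guess would be anticipated from prior data; all names are illustrative):

```python
import numpy as np

def required_cutoff_n(x_all, sigma2, b_guess, target_rse, gamma=0.5):
    """Smallest n such that taking the n largest x as the sample gives
    a projected RSE at or below target_rse, under the ratio model with
    weight x**(-2*gamma); analogous to an SRS sample-size formula."""
    x_desc = np.sort(np.asarray(x_all, float))[::-1]
    total_guess = b_guess * x_desc.sum()
    for n in range(1, len(x_desc)):
        x_in, x_out = x_desc[:n], x_desc[n:]
        w = x_in ** (-2.0 * gamma)
        var_t = sigma2 * (np.sum(x_out ** (2 * gamma))
                          + np.sum(x_out) ** 2 / np.sum(w * x_in**2))
        if np.sqrt(var_t) / total_guess <= target_rse:
            return n
    return len(x_desc)  # a take-all (census) may be needed to meet the target
```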
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation.
Weighted least squares regression through the origin has many uses in statistical science. An important use is for the estimation of attribute totals from establishment survey samples, where we might use quasi-cutoff sampling. Two questions in particular will be explored here, with respect to survey statistics: (1) How do we know this is performing well? and (2) What if the smallest members of the population appear to behave differently? This review article contains a summary of conclusions from experimental findings, and explanations with numerous references.
Key Words: Establishment Surveys, Heteroscedasticity, Model-Based Classical Ratio Estimator, Multiple Attributes, Nonsampling Error, Prediction, Regression Through the Origin, Total Survey Error, Weighted Least Squares Regression
...
From InterStat, http://interstat.statjournals.net/, May 2014:
Here, small area estimation is applied in the sense that we are "borrowing strength" from data outside of given subpopulations for which we are to publish estimated totals, or means, or ratios of totals. We will consider estimated totals for establishment surveys. A subpopulation for which we wish to estimate a total will be called a "publication group" (PG), and data that may be modeled together, using one regression, will be called an "estimation group" (EG). See Knaub(1999, 2001, 2003) regarding this for a more complex application. When a PG consists of a set of EGs, that is stratification. When an EG contains PGs, this is a simple form of small area estimation because we are using data outside of a given publication group to help estimate statistics/parameters for that model, used to estimate for each impacted PG. (In Knaub(1999, 2001), there are overlapping 'areas' as well.) Here we consider very small areas (PGs), which may fall within a 'larger' EG, and here we are only considering one regressor, but this could be generalized (Knaub(1999)). Sample sizes and population sizes considered in this paper can be very small within a given PG, say a State and economic end-use sector. In the case of n = N = 1, a single response is the total for that PG. If it is part of an EG with other data, then if there is a nonresponse in that case, an estimate in place of that observation may be obtained for contribution, for example, to a US-level aggregate number for that end-use sector, and a variance contribution to be added to the US-level variance would be found as well. Further, a scatterplot for such an estimation group, especially if a confidence band were constructed (Knaub(2009), section 4, and Knaub(2012b), Figure 1) could be used to help edit data. If that PG with n = N = 1 were looked at alone, one could not have a scatterplot that would determine if a response were reasonable for the current circumstances. (A forecast for that one point would not be as good if some event were to cause a break in the time series, and one would have to consider a time series for every single point, many more graphs, and for some there would be no series available. But a scatterplot to accompany this regression modeling would consider every point used in the model. Data for which there are no regressor data, such as "births," are "added on" to totals outside of modeling.) Techniques here may be used for estimation ("prediction") for sample surveys, and to impute for nonresponse for sample surveys and census surveys. There may be applications to other fields of statistics as well.
Key Words: Regression, Model-Based Estimation, Weighted Least Squares, Scatterplots, Small Area Estimation, Data Editing, Establishment Surveys, Seasonality, Borrowing Strength
Quasi-Cutoff Sampling and Simple Small Area Estimation with Nonresponse. Available from: https://www.researchgate.net/publication/262066356_Quasi-Cutoff_Sampling_and_Simple_Small_Area_Estimation_with_Nonresponse [accessed Oct 2, 2015].
-
-
-
SAE vs stratification:
Note that if you have a wide geographic region (or some other 'wide' grouping), and one model, say regression through the origin, is appropriate for all the data (checking scatterplots, confidence intervals regarding the prediction errors, and standard errors of the regression coefficients for subgroups), then small area estimation (SAE) might be helpful. But if each part - say, State - should be modeled separately, then the overall group - say a superState, multiple-State region - could benefit by stratification. As in design-based sampling, a 'larger' group benefits by stratification if there is small variance within strata and big differences between strata.
A graphical analysis of heteroscedasticity for linear regression was presented, including measurement and impact of nonlinearity. Comparison was made to the Iterated Reweighted Least Squares approach and results.
- New note, April 23, 2019: Errata: Consider the first column of the first page. Ken Brewer pointed out to me, when we discussed the coefficient of heteroscedasticity at a later date, that I should not have been using the word "components," but "factors" instead. The key here is the factoring of estimated residuals into random and nonrandom "factors." - Also please note that the accuracy for each trial gamma (the coefficient of heteroscedasticity) in the process might sometimes be substantially improved by involving the corresponding new set of predicted y values. - Further please note that my choice of the goal value "w" for gamma was not a good one, as we do not want to confuse that with regression weight.
Abstract: Official Statistics from establishment surveys are not only the basis for routine monitoring of markets and perhaps systems in general, they are essential for discovering problems for which innovative approaches may be needed. Statistical agencies remain the workhorses for regularly providing this information. For establishment surveys one needs to collect data efficiently, making an effort to reduce burden on small establishments, and reduce costs to the Government, while promoting accuracy of results in terms of total survey error. For over three decades, these demanding standards have been met using quasi-cutoff sampling and prediction, applied extensively to some of the repeated official US energy establishment surveys. This success may be duplicated for other applications where sample surveys occur periodically, and there is an occasional census produced for the same data items. Sometimes stratification is needed, but sometimes the borrowing of strength, as in small area estimation/prediction, may be used. References will be given to help avoid pitfalls. The idea is to encourage expanding this elegant approach to other applications. The material here is an expanded version of a poster for the 2022 Joint Statistical Meetings. This is a tutorial/guide. Appendices are written in stand-alone form.
Ordinary least squares (OLS) regression gets most of the attention in the statistical literature, but for cases of regression through the origin, say for use with skewed establishment survey data, weighted least squares (WLS) regression is needed. Here we gather some information on the properties of weighted least squares regression, particularly with regard to regression through the origin for establishment survey data, for use in periodic publications.
March 17, 2016: Note that a special approximation for variance, regarding estimated totals, was used here, for purposes of various possible needs for production of Official Statistics in a potentially changing environment. Flexibility for possible changes in modeling, data storage, aggregate levels to be published, and avoidance of future data processing errors on old data made this attractive. Simplicity of application in a production environment was emphasized.
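- A minimal sketch of the weighted least squares regression through the origin discussed above, in Python (the data and the helper name are mine, purely illustrative): with regression weights w = x^(-2*gamma), the case gamma = 0.5 reproduces the classical ratio estimator, b = sum(y)/sum(x).

    import numpy as np

    def wls_through_origin(x, y, gamma=0.5):
        """WLS slope for the model y = b*x + e, with V(e) proportional to
        x^(2*gamma). Regression weights are w = x^(-2*gamma); gamma = 0.5
        gives the classical ratio estimator, b = sum(y)/sum(x)."""
        w = x ** (-2.0 * gamma)
        return np.sum(w * x * y) / np.sum(w * x * x)

    # Illustrative skewed establishment-type data (hypothetical numbers):
    x = np.array([2.0, 5.0, 12.0, 40.0, 150.0])   # regressor, e.g., prior census values
    y = np.array([2.3, 4.6, 13.1, 42.0, 148.0])   # current reported values
    print(wls_through_origin(x, y))               # equals y.sum()/x.sum() here

Note that the estimated residuals y - b*x then sum to zero when gamma = 0.5, a property returned to below.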
-----------------------------------
[This is not for time series analyses.]
This research concerns multiple regression for survey imputation, when correlation with a given regressor may vary radically over time, and emphasis may shift to other regressors. There may be many applications for this methodology, but here we will consider the imputation of generation and fuel consumption values for electric power producers in a monthly publication environment. When imputation is done by regression, a sufficient amount of good-quality observed data from the population of interest is required, as well as good-quality, related regressor data, for all cases. For this application, the concept of 'fuel switching' will be considered. That is, a given power producer may report using a given set of fuels for one time period, but for economic and/or other practical reasons, fuel usage may change dramatically in a subsequent time period. Testing has shown the usefulness of employing an additional regressor or regressors to represent alternative fuel sources. A performance measure found in Knaub(2002, ASA JSM CD) is used to compare results. Also, the impact of regression weights and the formulation of those weights, due to multiple regression, are considered.
Jan 8, 2016: Note that this is not a time series technique. This is for cross-sectional surveys, and was designed for use on establishment surveys for official statistics. I have had some discussions on ResearchGate recently, regarding the notion of bias-variance tradeoffs in modeling, and that more complicated models (tend to?) decrease (conditional?) bias and increase variance. Here, however, variance for estimated totals, under the sampling conditions here, is decreased when there is fuel switching. (Acknowledgement: Thank you to those who discussed my questions on ResearchGate.)
Joint Statistical Meetings, Chicago, Illinois, USA; 08/1996
....................
Previous notes:
The classical ratio estimator (CRE) is very simple, has a long history, and has a stunningly broad range of application, especially with regard to econometrics, and to survey statistics, particularly establishment survey statistics. The CRE has a number of desirable properties, one of which is that the sum of its estimated residuals is always zero. It is easily extended to multiple regression, and a property shown in Sarndal, Swensson and Wretman (1992) may be used to indicate the desirability of this zero sum of estimated residuals feature when constructing regression weights for multiple regression. In the single regressor form, the zero sum of estimated residuals property is related to an interesting phenomenon expressed in Fox (1997). Finally, relationships of the CRE to some other statistics are also considered.
-- Note added November 2014: As noted in other works I have done, and elsewhere, for this model, only the individual values of x corresponding to individual y values need to be known, as long as the sum of the remaining x (for out-of-sample cases) is known, and then one can still estimate variance. If we do not know the sum of those remaining N-n x-values (where n is sample size selected minus cases to be imputed), but we know a range for that subtotal of x's, then we know a range of estimated variances for the estimated y-totals to go with that range. --
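-- A sketch of why, in symbols (notation assumed here, for the usual model-based CRE setup with sample s and out-of-sample remainder r, under $y_i = \beta x_i + \epsilon_i$ with $V(\epsilon_i) = \sigma^2 x_i$):

$$ b = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i}, \qquad \hat{T} = \sum_{i \in s} y_i + b X_r, \qquad X_r = \sum_{i \in r} x_i, \quad X_s = \sum_{i \in s} x_i, $$

so the out-of-sample x-values enter only through their subtotal $X_r$, and the estimated variance of the prediction error,

$$ \hat{V}(\hat{T} - T) = \hat{\sigma}^2 \left( X_r + \frac{X_r^2}{X_s} \right) = \hat{\sigma}^2 \, \frac{X_r X_N}{X_s}, $$

is increasing in $X_r$, so a known range for $X_r$ translates directly into a range of estimated variances, as stated above. (The zero sum of estimated residuals follows from $\sum_{i \in s}(y_i - b x_i) = 0$ by the definition of b.) --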
....................
Previous notes:
This article is from the Pakistan Journal of Statistics, http://www.pakjs.com/, in a special issue in honor of Ken Brewer. The URL for this article is http://www.pakjs.com/journals//27(4)/27(4)6.pdf . ---
Here we will review some of the historical development of the use of the coefficient of heteroscedasticity for modeling survey data, particularly establishment survey data, and for inference at aggregate levels. Some of the work by Kenneth R. W. Brewer helped develop this concept. Dr. Brewer has worked to combine design-based and model-based inference. Here, however, we will concentrate on regression modeling, and particularly on some of his earlier work.
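- In symbols, a sketch of the factoring involved (notation assumed here): the model residuals are factored into random and nonrandom parts,

$$ \epsilon_i = \epsilon_{0i} \, x_i^{\gamma}, \qquad V(\epsilon_i) = \sigma_0^2 \, x_i^{2\gamma}, \qquad w_i = x_i^{-2\gamma}, $$

where $\gamma$ is the coefficient of heteroscedasticity, $w_i$ the corresponding WLS regression weight, and $\sigma_0$ the fixed standard deviation of the random factors $\epsilon_{0i}$ (see the Jan. 3, 2016 note above); $\gamma = 0.5$ yields the classical ratio estimator, and larger $\gamma$ corresponds to steeper heteroscedasticity.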
....................
Previous notes:
From InterStat, http://interstat.statjournals.net/, September 2012:
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator, which has use in quasi-cutoff sampling, balanced sampling, and econometrics applications. Other applications in other areas of statistics may arise. Multiple regression for a given attribute can be important, but is only considered briefly here. The need for data to estimate for multiple attributes is also important and must be considered. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, to see the relative impact of the factors needed for planning in each such case. Typically one may consider the volume coverage for an attribute of interest, or related data, say regressor data, to be important, but relative standard errors for estimated totals, or confidence bounds, are needed to better judge the adequacy of a sample. For multiple attributes (variables of current interest), this may be applied iteratively, as a change in sample for one attribute impacts the sample size for another. Please see the definitions section in https://www.academia.edu/16226638/Efficacy_of_Quasi-Cutoff_Sampling_and_Model-Based_Estimation_For_Establishment_Surveys_and_Related_Considerations
Oct 16, 2017: On page 2895, I have this: "... a balanced sample ... would be somewhat comparable to a random sample...." -
However, a simple random sample for a skewed population would generally result in even less efficiency than using a balanced sample here. It would usually more heavily 'represent' small members of the population.
Jan. 3, 2016: Perhaps wherever I noted "the estimated standard error of the random factors of the estimated residuals," I should have said "the estimated standard deviation of the random factors of the estimated residuals." I saw some things by a couple of people that made me realize my language was sloppy here. But I also read where this may be fairly common. At any rate, this is a fixed value to be estimated. It is a standard deviation.
....................
Previous notes:
Source: http://www.amstat.org/sections/SRMS/Proceedings/
Here we explore planning for the allocation of resources for use in obtaining official statistics through model-based estimation. Concentration is on the model-based variance for the classical ratio estimator (CRE). This has application to quasi-cutoff sampling (simply cutoff sampling when there is only one attribute), balanced sampling, econometric applications, and perhaps others. Multiple regression for a given attribute can occasionally be important, but is only considered briefly here. Nonsampling error always has an impact. Allocation of resources to given strata should be considered as well. Here, however, we explore the projected variance for a given attribute in a given stratum, for resource planning at that base level. Typically one may consider the volume coverage for an attribute of interest, or related size data, say regressor data, to be important, but standard errors for estimated totals are needed to judge the adequacy of a sample. Thus the focus here is on a 'formula' for estimating sampling requirements for a model-based CRE, analogous to estimating the number of observations needed for simple random sampling. Both balanced sampling and quasi-cutoff/cutoff sampling are considered.
Key Words: Classical Ratio Estimator, Volume Coverage, Measure of Size, Model-Based Estimation, Official Statistics, Resource Allocation Planning, Sample Size Requirements, Weighted Least Squares Regression
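- A minimal sketch of such a planning calculation in Python (my function names and hypothetical inputs, assuming the gamma = 0.5 variance form $\hat{\sigma}^2 X_r X_N / X_s$ noted elsewhere in these notes, not the paper's exact formula):

    import numpy as np

    def projected_rse(x_desc, n, sigma0):
        """Projected relative standard error of an estimated total when the n
        largest establishments (by regressor x) form the cutoff sample, under
        the CRE model with gamma = 0.5: V = sigma0^2 * Xr * XN / Xs. The
        x-total is used as a proxy for the estimated total (ratio near 1)."""
        xs = x_desc[:n].sum()          # sample x-total
        xr = x_desc[n:].sum()          # out-of-sample x-total
        xn = xs + xr
        return np.sqrt(sigma0**2 * xr * xn / xs) / xn

    def smallest_n_for_target(x, sigma0, target_rse):
        """Smallest cutoff-sample size whose projected RSE meets the target."""
        xd = np.sort(x)[::-1]
        for n in range(1, len(xd) + 1):
            if projected_rse(xd, n, sigma0) <= target_rse:
                return n
        return len(xd)

    # Hypothetical skewed frame; iterate per attribute as described above.
    x = np.random.default_rng(0).lognormal(mean=2.0, sigma=1.5, size=500)
    print(smallest_n_for_target(x, sigma0=0.4, target_rse=0.01))

For multiple attributes, such a search would be repeated per attribute, taking the union of the establishments required, which is the iterative aspect mentioned in the abstract above.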
This (renamed) 2007 article surveys the use of cutoff sampling and inference by various organizations, as described in the literature. This technique is often used for establishment surveys. Online searches were made using the key words "cutoff sampling" and "cut-off sampling"; both spellings are in use. Various approaches are described, but the focus is on the model-based approach, using the classical ratio estimator (CRE). Concluding remarks are made. More material was added in 2014.
....
Key Words: establishment surveys, total survey error, model-based classical ratio estimator, CRE, multiple regression, RSE, RSESP, certainty stratum, link relative estimator
http://www.jos.nu/Articles/abstract.asp?article=222351
-- Figures with 'confidence bounds' here are actually curved bounds about prediction intervals for predicted y-values. They would better be called "Prediction Bounds."
-- The "estimated standard error of the random factors of the estimated residuals," should have been designated the "estimated standard deviation of the estimated random factors of the estimated residuals." That is, the sigmas are standard deviations, as they are not reduced with increased sample size.
-- This paper is for the prediction or estimation of official statistics. A better title would be as follows: Use of Ratios for Estimation or Prediction of Official Statistics at a Statistical Agency.
----------
----------
From InterStat, 2012, http://interstat.statjournals.net/ -
The US Energy Information Administration (EIA) has made good use of available auxiliary data over a number of years, for a variety of surveys on energy sources and consumption, for estimating the statistics that the EIA mission requires. Such use of already available data reduces data collection burden for a given level of accuracy. Many of these instances relate to a single auxiliary variable, and involve some type of ratio. The many uses of ratios at the EIA are both disparate and unifying: disparate in that the applications can appear to be fairly distinct, but unifying in that there are interrelationships between these methods that may be esthetically pleasing, and of practical importance. Better communication and future improvements may be achieved by considering what is different between these methods, and what they have in common. Here we will explore these ideas. --
Please note that the illustration near the end is a (much) cropped photo I took of a painting done by my grandfather, Karl H. (Henry) Knaub, more than half of a century before I wrote this paper.
:-)
Most sample surveys, especially household surveys, are design-based, meaning sampling and inference (e.g., means and standard errors) are determined by a process of randomization. That is, each member of the population has a predetermined probability of being selected for a sample. Establishment surveys generally have many smaller entities, and a relatively few large ones. Still, each can be assigned a given probability of selection. However, an alternative may be to collect data from generally the largest establishments – some being larger for some attributes than others – and use regression to estimate for the remainder. For design-based sampling, or even for a census survey, such models are often needed to impute for nonresponse. When such modeling would be needed for many small respondents, generally if sample data are collected on a frequent basis, but regressor (related) data are available for all of the population, then cutoff sampling with regression used for inference may be a better alternative. Note that with regression, one can always calculate an estimate of variance for an estimated total. (For example, see Knaub(1996), and note Knaub(2007d).)
Key Words: classical ratio estimator, conditionality principle, model failure, probability proportionate to size (PPS), randomization principle, regression, skewed data, superpopulation, total survey error
A cutoff sample is often considered only because it is easy to administer and relatively inexpensive. It is not often recognized that a cutoff sample may also be the option providing the smallest total survey error (TSE). Consider that model-assisted design-based sampling uses models to adjust samples drawn at random, compensating for the fact that the mean of a random sample can vary greatly from the mean of the population. Thus the importance of regression models in survey statistics is recognized. For cutoff sampling, accuracy may be improved by predicting for many of the 'small' cases that may not be able to report accurately on a frequent basis. Survey resources may then be concentrated on data collection for the largest observations. There are considerations that may mitigate the impact of model failure with respect to estimating for the cases that have no chance of sample selection. This article emphasizes those mitigating conditions.
Weighted least squares regression through the origin has many uses in statistical science. An important use is for the estimation of attribute totals from establishment survey samples, where we might use quasi-cutoff sampling. Two questions in particular will be explored here, with respect to survey statistics: (1) How do we know this is performing well? and (2) What if the smallest members of the population appear to behave differently? This review article contains a summary of conclusions from experimental findings, and explanations with numerous references.
Key Words: Establishment Surveys, Heteroscedasticity, Model-Based Classical Ratio Estimator, Multiple Attributes, Nonsampling Error, Prediction, Regression Through the Origin, Total Survey Error, Weighted Least Squares Regression
...
From InterStat, http://interstat.statjournals.net/, May 2014:
Here, small area estimation is applied in the sense that we are "borrowing strength" from data outside of given subpopulations for which we are to publish estimated totals, or means, or ratios of totals. We will consider estimated totals for establishment surveys. A subpopulation for which we wish to estimate a total will be called a "publication group" (PG), and data that may be modeled together, using one regression, will be called an "estimation group" (EG). See Knaub(1999, 2001, 2003) regarding this for a more complex application. When a PG consists of a set of EGs, that is stratification. When an EG contains PGs, this is a simple form of small area estimation because we are using data outside of a given publication group to help estimate statistics/parameters for that model, used to estimate for each impacted PG. (In Knaub(1999, 2001), there are overlapping 'areas' as well.) Here we consider very small areas (PGs), which may fall within a 'larger' EG, and here we are only considering one regressor, but this could be generalized (Knaub(1999)). Sample sizes and population sizes considered in this paper can be very small within a given PG, say a State and economic end-use sector. In the case of n = N = 1, a single response is the total for that PG. If it is part of an EG with other data, then if there is a nonresponse in that case, an estimate in place of that observation may be obtained for contribution, for example, to a US-level aggregate number for that end-use sector, and a variance contribution to be added to the US-level variance would be found as well. Further, a scatterplot for such an estimation group, especially if a confidence band were constructed (Knaub(2009), section 4, and Knaub(2012b), Figure 1) could be used to help edit data. If that PG with n = N = 1 were looked at alone, one could not have a scatterplot that would determine if a response were reasonable for the current circumstances. (A forecast for that one point would not be as good if some event were to cause a break in the time series, and one would have to consider a time series for every single point, many more graphs, and for some there would be no series available. But a scatterplot to accompany this regression modeling would consider every point used in the model. Data for which there are no regressor data, such as "births," are "added on" to totals outside of modeling.) Techniques here may be used for estimation ("prediction") for sample surveys, and to impute for nonresponse for sample surveys and census surveys. There may be applications to other fields of statistics as well.
Key Words: Regression, Model-Based Estimation, Weighted Least Squares, Scatterplots, Small Area Estimation, Data Editing, Establishment Surveys, Seasonality, Borrowing Strength
Quasi-Cutoff Sampling and Simple Small Area Estimation with Nonresponse. Available from: https://www.researchgate.net/publication/262066356_Quasi-Cutoff_Sampling_and_Simple_Small_Area_Estimation_with_Nonresponse [accessed Oct 2, 2015].
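- A minimal sketch of the EG/PG mechanics in Python (hypothetical data and names): one ratio model is fit over the whole estimation group, and its predictions for unobserved units are aggregated within each publication group.

    import numpy as np

    # Hypothetical estimation group (EG) spanning two publication groups (PGs);
    # regressor x is known for all units, y is observed only for the sample.
    pg  = np.array(["A", "A", "A", "B", "B"])
    x   = np.array([50.0, 8.0, 3.0, 20.0, 2.0])
    y   = np.array([52.0, np.nan, np.nan, 21.0, np.nan])   # NaN = not observed
    obs = ~np.isnan(y)

    # One CRE fit over the entire EG sample ("borrowing strength" across PGs):
    b = y[obs].sum() / x[obs].sum()

    # PG totals: observed y plus model predictions b*x for unobserved units.
    for g in np.unique(pg):
        in_g = pg == g
        print(g, y[in_g & obs].sum() + b * x[in_g & ~obs].sum())

The same mechanics impute for a nonrespondent with n = N = 1 in its PG, as described above: its estimate b*x contributes to the higher-level aggregate, along with a variance contribution.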
-
-
-
SAE vs stratification:
Note that if you have a wide geographic region (or some other 'wide' grouping), and one model, say regression through the origin, is appropriate for all the data (checking scatterplots, confidence intervals regarding the prediction errors, and standard errors of the regression coefficients for subgroups), then small area estimation (SAE) might be helpful. But if each part - say, State - should be modeled separately, then the overall group - say a superState, multiple-State region - could benefit by stratification. As in design-based sampling, a 'larger' group benefits by stratification if there is small variance within strata and big differences between strata.
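- A rough sketch of that check in Python (hypothetical numbers; the comparison rule is illustrative only, not a formal test): fit the ratio slope and its standard error per subgroup and see whether the slopes differ by much more than their combined standard error.

    import numpy as np

    def cre_slope_and_se(x, y):
        """CRE slope with its standard error under the gamma = 0.5 model,
        where Var(b) = sigma0^2 / sum(x)."""
        b = y.sum() / x.sum()
        s2 = np.sum((y - b * x) ** 2 / x) / (len(x) - 1)
        return b, np.sqrt(s2 / x.sum())

    # Hypothetical data for two States within one candidate estimation group:
    xa, ya = np.array([3.0, 9.0, 30.0]), np.array([3.2, 9.5, 29.0])
    xb, yb = np.array([4.0, 15.0, 50.0]), np.array([6.1, 21.0, 74.0])

    (ba, sa), (bb, sb) = cre_slope_and_se(xa, ya), cre_slope_and_se(xb, yb)
    # Slopes far apart relative to their SEs suggest modeling separately
    # (stratification); otherwise one EG model may serve (SAE).
    print(ba, sa, bb, sb, abs(ba - bb) / np.hypot(sa, sb))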
A graphical analysis of heteroscedasticity for linear regression was presented, including measurement of nonlinearity and its impact. Comparison was made to the iteratively reweighted least squares (IRLS) approach and its results.
- New note, April 23, 2019: Errata: Consider the first column of the first page. Ken Brewer pointed out to me, when we discussed the coefficient of heteroscedasticity at a later date, that I should not have been using the word "components," but "factors" instead. The key here is the factoring of estimated residuals into random and nonrandom "factors." - Also please note that the accuracy for each trial gamma (the coefficient of heteroscedasticity) in the process might sometimes be substantially improved by involving the corresponding new set of predicted y values. - Further please note that my choice of the goal value "w" for gamma was not a good one, as we do not want to confuse that with regression weight.
Note: Terminology alert: The use of the term "test data" here is informal, and may be applied at various stages. It may sometimes conflict with formal use.
Erratum: These would not be 'confidence bounds,' but rather curved bounds about prediction intervals for predicted y-values. They would better be called "Prediction Bounds." - Accordingly, I will change the title. Apologies. ------
Excel spreadsheet tool for graphing prediction bounds about y-value predictions for a classical ratio estimator/linear regression through the origin. (Note that normality of estimated residuals near the origin would often be problematic.) ----- Software programmed for this purpose (using STDI from SAS PROC REG, just as an example) would be more efficient, but this should function with any spreadsheet. Further, it demonstrates the analysis involved in this process. ----- Note that confidence bounds on b would make a wedge-shaped figure within the predicted-y bounds shown.
Accompanies "Projected Variance for the Model-Based Classical Ratio Estimator: Estimating Sample Size Requirements."
Likely better to use your programming software, but this works, and demonstrates the principles.
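- A minimal sketch of those bounds in code rather than a spreadsheet (my variable names; hypothetical data; a t approximation under the gamma = 0.5 model, with STDI from SAS PROC REG being the analogous quantity):

    import numpy as np
    from scipy import stats

    # Sample data for WLS regression through the origin, gamma = 0.5:
    x = np.array([2.0, 5.0, 12.0, 40.0, 150.0])
    y = np.array([2.3, 4.6, 13.1, 42.0, 148.0])

    Xs = x.sum()
    b = y.sum() / Xs                                    # CRE slope
    s2 = np.sum((y - b * x) ** 2 / x) / (len(x) - 1)    # sigma0^2 estimate

    # Curved bounds about the prediction interval for a new y at x0:
    x0 = np.linspace(x.min(), x.max(), 100)
    stdi = np.sqrt(s2 * x0 * (1.0 + x0 / Xs))           # std. error of prediction
    t = stats.t.ppf(0.975, df=len(x) - 1)
    lower, upper = b * x0 - t * stdi, b * x0 + t * stdi
    print(lower[:3], upper[:3])

    # (As noted above, normality of estimated residuals near the origin is
    # often problematic, so the bounds there should be viewed cautiously.)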
......
Topic: Quasi-Cutoff Sampling and the Classical Ratio Estimator. -
Background is given and a discussion is provided as to why this is so useful for many establishment surveys. The historical development of this methodology at the EIA is reviewed, with examples and graphics displayed, taken from some of a number of papers illustrating innovative problem-solving work for the EIA done over a number of years.
[Note that a random sample can easily be drawn which is substantially "unrepresentative" of the population, perhaps especially with continuous data that have a few outsized members of a finite population. That is why model-assisted design-based sampling can be so useful. (Sometimes the model may be more important.)]
EIA Seminar, Washington, DC, USA; 06/2010
- Also, in the paper, note that if we say that the "E" in CRE refers to the R_hat, or b, then my comment about "CRP" in the paper is incorrect, as it only makes sense with regard to the predicted y values or predicted totals, which is what I was thinking about. Sorry. - Also, in section 5, "Variance," I note that Valliant, Dorfman, and Royall (2000) does "...not [explicitly] use regression weights..." but, I should have said, it implicitly assumes that the coefficient of heteroscedasticity, gamma, is 0.5. -
--
-- Further please note:
The first sentence of the introduction could read as follows: "Heteroscedasticity, the change in variance of Y|predicted-y, or epsilon|predicted-y (in the simplest case, this would be for V(Y|x)), is very often referred to as a 'problem' that needs to be 'solved' or 'corrected.'" But heteroscedasticity in regression is a feature, not a bug.
-
For more, see https://www.researchgate.net/publication/352134279_When_Would_Heteroscedasticity_in_Regression_Occur
...
/// Also note that delta at the top of page 5 is just a specific use of autocorrelation.
The variance and covariances only come from 'within cluster,' where here the 'cluster' is one datum of y for a given x (or for a given predicted y).
Establishment surveys often create distinct circumstances, and research has been generated to solve many of the resulting statistical problems, to varying degrees. The application of the results of this research, regarding estimation, imputation and editing, may or may not be found to also be useful for household surveys or other applications. For electric power data, the circumstances may be even more unusual than for many other establishment surveys. Still, research to solve these problems, although inspired by the need to cope with given situations, may result in methods that are more generally useful. This appears to be the case for much of the work done since approximately 1988 at the Office of Coal, Nuclear, Electric and Alternate Fuels (CNEAF), within the Energy Information Administration (EIA). This work is briefly reviewed and current efforts are described. Commentary is given regarding the practical, problem solving emphasis of these methods. Both census surveys and sample surveys are considered. Sampling errors and nonsampling errors are discussed, as well as the usefulness of regression modeling for purposes of editing and/or imputation. The author's opinions are his own and not EIA policy unless designated by other documents. /////
Note: Page 5 edited on 9/5/2022. ...
Note: The argument mentioned on page 7 (where a standard error of 25 is repeated), that a larger prediction should have a larger variance, is actually only true for the sigma of the estimated residuals. See https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity and the reference there to Ken Brewer.
Model-based inference has performed well for electric power establishment surveys at the Energy Information Administration (EIA), using cutoff sampling and weighted, simple linear regression, as pioneered by K.R.W. Brewer, R.M. Royall, and others. However, 'nonutility' generation sales for resale data have proved to be relatively difficult to estimate efficiently. Design-based inference would be even less efficient. A weighted, multiple linear regression model, using a cutoff sample, where one regressor is the data element of interest as captured in a previous census, and another regressor is the nameplate capacity of the generating entity, has proved to be extremely valuable. This is being applied to monthly sampling, where regressor data come from previous annual census information. Estimates of totals, with their corresponding estimates of variance, have been greatly improved by this methodology. This paper is an abbreviated version of an article found in the electronic j...
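- A minimal sketch of such a two-regressor weighted fit in Python (hypothetical data; the size measure used for the regression weights here is an assumption of mine, not necessarily the EIA formulation):

    import numpy as np

    # Hypothetical monthly sample: y = current value; x1 = same item from the
    # previous annual census; x2 = nameplate capacity.
    x1 = np.array([10.0, 40.0, 5.0, 90.0, 25.0])
    x2 = np.array([12.0, 35.0, 9.0, 80.0, 30.0])
    y  = np.array([11.0, 43.0, 7.0, 88.0, 28.0])

    X = np.column_stack([x1, x2])             # regression through the origin
    w = 1.0 / (x1 + x2)                       # weights from a simple size measure
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # WLS normal equations
    print(beta)

    # Predicted totals for out-of-sample units then follow by applying beta to
    # the out-of-sample subtotals of x1 and x2, analogous to the CRE case.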