STATISTICS IN TRANSITION new series, June 2023
Vol. 24, No. 3, pp. 31–37, https://doi.org/10.59170/stattrans-2023-032
Received – 25.05.2023; accepted – 31.05.2023
Discussion of “Probability vs. Nonprobability Sampling:
From the Birth of Survey Sampling to the Present Day”
by Graham Kalton
Julie Gershunskaya1, Partha Lahiri2
In this excellent overview of the history of probability and nonprobability sampling
from the end of the nineteenth century to the present day, Professor Graham Kalton outlines
the essence of past endeavors that helped to define philosophical approaches and stimulate
the development of survey sampling methodologies. From the beginning, there was an
understanding that a sample should, in some ways, resemble the population under study.
In Kiær’s ideas of “representative sampling” and Neyman’s invention of the probability-based
approach, the prime concern of survey sampling has been to properly plan for representing
characteristics of the finite population. Poststratification and other calibration methods were
developed for the same important goal of better representation.
Professor Kalton’s paper underscores the growing interest in the use of nonprobability surveys. With the recent proliferation of computers and the internet, a wealth of data has become available to researchers. However, “opportunistic” information collected with present-day capabilities is usually not purposely planned or controlled by survey statisticians. No matter how
big such a nonprobability sample may be, it can inaccurately reflect the finite population
of interest, thus presenting a substantial risk of estimation bias.
Below, we discuss several recent papers that propose ways to incorporate nonprobability
surveys to produce estimates for both large and small areas. Specifically, we will consider
two situations often encountered in practice. In the first situation, a nonprobability sample
contains the outcome variable of interest, and the main task is to reduce the selection bias
with the help of a reference probability sample that does not contain the outcome variable
of interest. In the second situation, a probability sample contains the outcome variable of
interest, but there is little or no sample available to produce granular level estimates. For
such a small area estimation problem, we consider a case when we have access to a large
nonprobability sample that does not contain the outcome variable but contains some related
auxiliary variables also present in the probability sample. In both situations, researchers
have discussed statistical data integration techniques in which a reference probability sample is combined with a nonprobability sample in an effort to overcome deficiencies associated with both probability and nonprobability samples.
1 U.S. Bureau of Labor Statistics, 2 Massachusetts Ave NE Washington, DC 20212, USA,
E-mail: Gershunskaya.Julie@bls.gov. ORCID: https://orcid.org/0000-0002-0096-186X.
2 University of Maryland, College Park, MD 20742, USA. E-mail: plahiri@umd.edu.
ORCID: https://orcid.org/0000-0002-7103-545X.
© Julie Gershunskaya, Partha Lahiri. Article available under the CC BY-SA 4.0 licence
One way to account for the selection bias of a nonprobability sample is to estimate the sample inclusion probabilities given available covariates. The inverses
of the estimated inclusion probabilities are then used, in a manner similar to the usual probability
sample selection weights, to obtain estimates of target quantities. Several approaches to the estimation of nonprobability sample inclusion probabilities (or propensity scores) have been
considered in the literature. Recent papers by Chen et al. (2020), Wang et al. (2021), and
Savitsky et al. (2022) propose ways to estimate these probabilities by combining nonprobability and probability samples. Kim and Morikawa (2023) propose an empirical-likelihood-based approach under a different setting; to save space, we do not discuss it here.
We now review three statistical data integration methods.
The approaches concern the estimation of the probabilities $\pi_{ci}(x_i) = P\{c_i = 1 \mid x_i\}$ of being
included in the nonprobability sample $S_c$, for units $i = 1, \ldots, n_c$, where $c_i$ is the inclusion
indicator of unit $i$, taking the value 1 if unit $i$ is included in the nonprobability sample
and 0 otherwise; $x_i$ is a vector of known covariates for unit $i$; and $n_c$ is the total number of units
in sample $S_c$. The problem, of course, is that we cannot estimate $\pi_{ci}$ from the set of
units in the nonprobability sample $S_c$ alone, because $c_i = 1$ for all $i$ in $S_c$. The probabilities are
instead estimated by combining the set $S_c$ with a probability sample $S_r$. Because of its role in this approach,
the probability sample is also called the “reference sample”.
Assuming both the nonprobability and probability samples are selected from the same finite
population $P$, Chen et al. (2020) write a log-likelihood, over units in $P$, for the Bernoulli
variable $c_i$:
$$\ell_1(\theta) = \sum_{i \in P} \left\{ c_i \log\left[\pi_{ci}(x_i, \theta)\right] + (1 - c_i) \log\left[1 - \pi_{ci}(x_i, \theta)\right] \right\}, \qquad (1)$$
where $\theta$ is the parameter vector in a logistic regression model for $\pi_{ci}$.
Since the finite population units are not observed, Chen et al. (2020) employ a clever trick
and re-group the sum in (1) into two parts: part 1 involves the sum
over the nonprobability sample units and part 2 is the sum over the whole finite population:
$$\ell_1(\theta) = \sum_{i \in S_c} \log\left[\frac{\pi_{ci}(x_i, \theta)}{1 - \pi_{ci}(x_i, \theta)}\right] + \sum_{i \in P} \log\left[1 - \pi_{ci}(x_i, \theta)\right]. \qquad (2)$$
The units in part 1 of the log-likelihood (2) are observed; for part 2, Chen et al. (2020)
employ the pseudo-likelihood approach, replacing the sum over the finite population with
its probability-sample-based estimate:
$$\hat{\ell}_1(\theta) = \sum_{i \in S_c} \log\left[\frac{\pi_{ci}(x_i, \theta)}{1 - \pi_{ci}(x_i, \theta)}\right] + \sum_{i \in S_r} w_{ri} \log\left[1 - \pi_{ci}(x_i, \theta)\right], \qquad (3)$$
where the weights $w_{ri} = 1/\pi_{ri}$ are the inverses of the reference sample inclusion probabilities $\pi_{ri}$. Estimates are obtained by solving the corresponding pseudo-likelihood estimating
equations.
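As a concrete illustration, the pseudo-log-likelihood (3) can be maximized numerically under a logistic model for $\pi_{ci}$. The sketch below uses simulated toy data; the sample sizes, covariates, and weights are our own assumptions, not taken from Chen et al. (2020). Note that under the logistic model, $\log[\pi_{ci}/(1 - \pi_{ci})]$ reduces to the linear predictor $x_i^{\top}\theta$, which simplifies the first sum.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def neg_pseudo_loglik(theta, X_c, X_r, w_r):
    """Negative pseudo-log-likelihood (3) under a logistic model pi_ci = expit(x @ theta).

    The first sum (over S_c) of log[pi/(1-pi)] is just the linear predictor;
    the second is the weighted sum over the reference sample S_r of log(1 - pi).
    """
    part1 = np.sum(X_c @ theta)
    part2 = np.sum(w_r * np.log1p(-expit(X_r @ theta)))
    return -(part1 + part2)


rng = np.random.default_rng(0)
# Toy data: columns are (intercept, covariate); S_c units have a shifted covariate
X_c = np.column_stack([np.ones(500), rng.normal(1.0, 1.0, 500)])  # nonprobability sample
X_r = np.column_stack([np.ones(200), rng.normal(0.0, 1.0, 200)])  # reference sample
w_r = np.full(200, 50.0)                                          # reference design weights 1/pi_ri

res = minimize(neg_pseudo_loglik, x0=np.zeros(2),
               args=(X_c, X_r, w_r), method="BFGS")
pi_c_hat = expit(X_c @ res.x)  # estimated inclusion probabilities for S_c units
```

The estimated $\hat{\pi}_{ci}$ would then be inverted to form pseudo-weights for the nonprobability sample.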
One shortcoming of the Chen et al. (2020) approach is that their Bernoulli likelihood
is formulated with respect to an unobserved indicator variable. Although the regrouping
employed in (2) helps to find a solution, results obtained by Wang et al. (2021) indicate that
it is relatively inefficient, especially when the nonprobability sample size is much larger
than the probability sample size.
Wang et al. (2021) formulate their likelihood for an observed indicator variable and thus
their method is different from the approach of Chen et al. (2020). To elaborate, Wang et al.
(2021) introduce an imaginary construct consisting of two parts: they stack together nonprobability sample Sc (part 1) and finite population P (part 2). Since nonprobability sample
units belong to the finite population, they appear in the stacked set twice. Let the indicator
variable $\delta_i = 1$ if unit $i$ belongs to part 1, and $\delta_i = 0$ if it belongs to part 2 of the stacked set;
the probabilities of being in part 1 of the stacked set are denoted by $\pi_{\delta i}(x_i) = P\{\delta_i = 1 \mid x_i\}$.
Wang et al. (2021) assume the following Bernoulli likelihood for the observed variable $\delta_i$:
$$\ell_2(\tilde{\theta}) = \sum_{i \in S_c} \log\left[\pi_{\delta i}(x_i, \tilde{\theta})\right] + \sum_{i \in P} \log\left[1 - \pi_{\delta i}(x_i, \tilde{\theta})\right], \qquad (4)$$
where $\tilde{\theta}$ is the parameter vector in a logistic regression model for $\pi_{\delta i}$. Since the finite
population is not available, they apply the following pseudo-likelihood approach:
$$\hat{\ell}_2(\tilde{\theta}) = \sum_{i \in S_c} \log\left[\pi_{\delta i}(x_i, \tilde{\theta})\right] + \sum_{i \in S_r} w_{ri} \log\left[1 - \pi_{\delta i}(x_i, \tilde{\theta})\right]. \qquad (5)$$
Existing ready-to-use software can be used to obtain estimates of $\pi_{\delta i}$. However, the actual goal is to find the probabilities $\pi_{ci}$ rather than the probabilities $\pi_{\delta i}$. Wang et al. (2021) propose
a two-step approach, where at the second step they find $\pi_{ci}$ by employing the following
identity:
$$\pi_{\delta i} = \frac{\pi_{ci}}{1 + \pi_{ci}}. \qquad (6)$$
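The two-step procedure can be sketched as follows: step 1 maximizes the pseudo-log-likelihood (5), which is an ordinary weighted logistic regression for $\delta_i$; step 2 inverts identity (6), giving $\pi_{ci} = \pi_{\delta i}/(1 - \pi_{\delta i})$. The toy data, sizes, and weights below are our own assumptions, not from Wang et al. (2021).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def neg_loglik_wang(theta, X_c, X_r, w_r):
    """Negative pseudo-log-likelihood (5): a weighted logistic regression for delta_i,
    with delta = 1 on S_c (weight 1) and delta = 0 on S_r (weight w_ri)."""
    ll = (np.sum(np.log(expit(X_c @ theta)))
          + np.sum(w_r * np.log1p(-expit(X_r @ theta))))
    return -ll


rng = np.random.default_rng(1)
X_c = np.column_stack([np.ones(400), rng.normal(1.0, 1.0, 400)])  # nonprobability sample
X_r = np.column_stack([np.ones(150), rng.normal(0.0, 1.0, 150)])  # reference sample
w_r = np.full(150, 40.0)                                          # reference design weights

# Step 1: estimate theta-tilde by maximizing (5)
theta_hat = minimize(neg_loglik_wang, np.zeros(2),
                     args=(X_c, X_r, w_r), method="BFGS").x

# Step 2: invert identity (6): pi_delta = pi_c / (1 + pi_c)  =>  pi_c = pi_delta / (1 - pi_delta)
pi_delta = expit(X_c @ theta_hat)
pi_c_hat = pi_delta / (1.0 - pi_delta)  # may exceed 1 -- a known drawback of the two-step scheme
```

The final comment anticipates a point made later in this discussion: because step 2 maps odds back to probabilities, nothing constrains $\hat{\pi}_{ci}$ to stay below 1.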
Savitsky et al. (2022) use an exact likelihood, rather than pseudo-likelihood-based estimation, for the estimation of the inclusion probabilities $\pi_{ci}$. They propose to stack together the
nonprobability sample $S_c$ and the probability sample $S_r$. In this stacked set $S$, the indicator variable $z_i$
takes the value 1 if unit $i$ belongs to the nonprobability sample (part 1), and 0 if unit $i$ belongs to the probability sample (part 2). In this construction, if there is an overlap between
the two samples $S_c$ and $S_r$, then the overlapping units are included in the stacked set $S$ twice:
once as part of the nonprobability sample (with $z_i = 1$) and once as part of the reference
probability sample (with $z_i = 0$). We do not need to know which units overlap or whether
there are any overlapping units. The authors use first principles to prove the following relationship between the probabilities $\pi_{zi}(x_i) = P\{z_i = 1 \mid x_i\}$ of being in part 1 of the stacked set
and the sample inclusion probabilities $\pi_{ci}$ and $\pi_{ri}$:
$$\pi_{zi} = \frac{\pi_{ci}}{\pi_{ri} + \pi_{ci}}. \qquad (7)$$
A similar expression to (7) was derived by Elliott (2009) and Elliott and Valliant (2017) under the assumption of non-overlapping nonprobability and probability samples. The derivation given in Savitsky et al. (2022) does not require this assumption.
To obtain estimates of $\pi_{ci}$ from the combined sample, Beresovsky (2019) proposed to
parameterize the probabilities $\pi_{ci} = \pi_{ci}(x_i, \theta)$, as in Chen et al. (2020), and to employ identity (7)
to present $\pi_{zi}$ as a composite function of $\theta$; that is, $\pi_{zi} = \pi_{zi}(\pi_{ci}(x_i, \theta)) = \pi_{ci}(x_i, \theta) / (\pi_{ri} + \pi_{ci}(x_i, \theta))$.
The log-likelihood for the observed Bernoulli variable $z_i$ is given by
$$\ell_3(\theta) = \sum_{i \in S_c} \log\left[\pi_{zi}(\pi_{ci}(x_i, \theta))\right] + \sum_{i \in S_r} \log\left[1 - \pi_{zi}(\pi_{ci}(x_i, \theta))\right]. \qquad (8)$$
Since the log-likelihood implicitly includes a logistic regression model for the
probabilities $\pi_{ci}$, Beresovsky (2019) labeled the proposed approach Implicit Logistic Regression (ILR). For maximum likelihood estimation (MLE), the score equations are
obtained from (8) by taking the derivatives, with respect to $\theta$, of the composite function
$\pi_{zi} = \pi_{zi}(\pi_{ci}(\theta))$. In this way, the estimates of $\pi_{ci}$ are obtained directly from (8) in a single
step. Savitsky et al. (2022) parameterized the likelihood as in (8) and used Bayesian
estimation to fit the model.
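A minimal sketch of the single-step ILR estimation, assuming known reference inclusion probabilities $\pi_{ri}$ and using simulated toy data (the sizes and values are our own assumptions): the composite probability $\pi_{zi} = \pi_{ci}/(\pi_{ri} + \pi_{ci})$, with $\pi_{ci}(x_i, \theta) = \mathrm{expit}(x_i^{\top}\theta)$, is plugged into the exact log-likelihood (8), which is then maximized directly over $\theta$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit


def neg_loglik_ilr(theta, X_c, X_r, pi_r_c, pi_r_r):
    """Negative exact log-likelihood (8) for the stacked indicator z_i.

    pi_z is the composite function pi_c / (pi_r + pi_c), with the implicit
    logistic regression model pi_c(x, theta) = expit(x @ theta).
    """
    pi_c_on_c = expit(X_c @ theta)             # pi_ci for units in S_c
    pi_c_on_r = expit(X_r @ theta)             # pi_ci evaluated on S_r units
    pi_z_c = pi_c_on_c / (pi_r_c + pi_c_on_c)  # P(z=1) for part 1
    pi_z_r = pi_c_on_r / (pi_r_r + pi_c_on_r)  # P(z=1) for part 2
    return -(np.sum(np.log(pi_z_c)) + np.sum(np.log1p(-pi_z_r)))


rng = np.random.default_rng(2)
X_c = np.column_stack([np.ones(400), rng.normal(1.0, 1.0, 400)])  # nonprobability sample
X_r = np.column_stack([np.ones(150), rng.normal(0.0, 1.0, 150)])  # reference sample
pi_r_c = np.full(400, 0.02)  # known reference inclusion probabilities, S_c units
pi_r_r = np.full(150, 0.02)  # known reference inclusion probabilities, S_r units

res = minimize(neg_loglik_ilr, np.zeros(2),
               args=(X_c, X_r, pi_r_c, pi_r_r), method="BFGS")
pi_c_hat = expit(X_c @ res.x)  # single-step estimates, constrained to (0, 1)
```

Because $\pi_{ci}$ is parameterized through the logistic function, the single-step estimates lie in $(0, 1)$ by construction, in contrast to the two-step inversion of identity (6).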
Note that to implement the ILR approach, the reference sample inclusion probabilities
$\pi_{ri}$ have to be known for all units in the combined set. This is not a limitation for many
probability surveys. As discussed in Elliott and Valliant (2017), if the probabilities $\pi_{ri}$ cannot
be determined exactly for units in the nonprobability sample, they can be estimated using
a regression model. Savitsky et al. (2022) used Bayesian computations to simultaneously
estimate $\pi_{ri}$ and $\pi_{ci}$ for the nonprobability sample units, given available covariates $x_i$.
It should be noted that the estimation method of Wang et al. (2021) can be similarly
modified to avoid the two-step estimation procedure: a logistic regression model could be
formulated for the inclusion probabilities $\pi_{ci}$, while the probabilities $\pi_{\delta i}$ in (6) could be viewed
as a composite function, $\pi_{\delta i} = \pi_{\delta i}(\pi_{ci}(x_i, \theta)) = \pi_{ci}(x_i, \theta) / (1 + \pi_{ci}(x_i, \theta))$. This approach is
expected to be more efficient. Moreover, it avoids estimates of $\pi_{ci}$ greater than 1 that can
occur when the estimation is performed in two steps. Once modified this way, preliminary
simulations indicate that the Wang et al. (2021) formulation would produce more efficient estimates than the Chen et al. (2020) counterpart, except in the rare situation where the whole
finite population, rather than only a reference sample, is available.
Simulations show that the exact likelihood method based on the formulation of Savitsky
et al. (2022) and Beresovsky (2019) performs better than the pseudo-likelihood-based alternatives. In the usual situation where the reference probability sample fraction is small, the
relative benefits of the exact likelihood approach are even more pronounced.
The existence of a well-designed probability reference sample plays a crucial role in
the efforts to reduce the selection bias of a nonprobability sample. Importantly, ongoing
research indicates that the quality of the estimated nonprobability sample inclusion probabilities is better when there is good overlap in the domains constructed using covariates from both
samples. This observation harks back to problems appearing in traditional poststratification
methods and to the notion of “representative sampling.” Since survey practitioners usually
do not have control over the planning or collection of the emerging multitude of nonrandom
opportunistic samples, efforts should be directed to developing and maintaining comprehensive probability samples that include sets of good quality covariates. Beaumont et al. (2023)
proposed several model selection methods for modeling the nonprobability
sample inclusion probabilities.
We now turn our attention to the second data integration situation involving small area
estimation, a topic Professor Kalton touched on. This is a problem of great interest for
public policy making, fund allocation, and regional planning. Small area estimation programs already exist in some national statistical organizations, such as the Small Area Income
and Poverty Estimates (SAIPE) program of the US Census Bureau (Bell et al., 2016) and
the Chilean government system (Casas-Cordero Valencia et al., 2016). The importance placed
on disaggregated statistics in the United Nations Sustainable Development Goals (SDGs) is expected to increase the demand for such programs in various national statistical
offices worldwide. Standard small area estimation methods generally use statistical models
(e.g., mixed models) that combine probability sample data with administrative or census
data containing auxiliary variables correlated with the outcome variable of interest. For a
review of different small area models and methods, see Jiang and Lahiri (2006), Rao and
Molina (2015), Ghosh (2020), and others.
A key to success in small area estimation is to find relevant auxiliary variables not only
in the probability sample survey but also in supplementary big databases. A big probability or nonprobability sample survey could be useful here, as such surveys typically contain a large number of auxiliary variables that are also available in the probability sample
survey. In the context of small area estimation, Sen and Lahiri (2023) considered a statistical data integration technique in which a small probability survey containing the outcome
variable of interest is statistically linked with a much bigger probability sample, which does
not contain the outcome variable but contains many auxiliary variables also present in the
smaller sample. They essentially fitted a mixed model to the smaller probability sample that
connects the outcome variable to a set of auxiliary variables and then imputed the outcome
variable for all units of the bigger probability sample using the fitted model and auxiliary
variables. Finally, they suggested producing small area estimates using the survey weights
and imputed values of the outcome variable contained in the bigger probability sample survey. As discussed in their paper, such a method can be used even if the bigger sample is
a nonprobability survey using weights constructed by methods such as the ones described
earlier.
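The impute-and-aggregate workflow just described can be caricatured in a few lines. In the sketch below, a simple fixed-effects linear model stands in for the mixed model of Sen and Lahiri (2023), and all data, area labels, and weights are simulated assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical small probability sample: outcome y, auxiliary x, and area labels
n_small = 120
area_s = rng.integers(0, 4, n_small)
x_s = rng.normal(size=n_small)
y_s = 2.0 + 1.5 * x_s + 0.5 * area_s + rng.normal(scale=0.3, size=n_small)

# Much bigger sample: auxiliary x and survey weights, but no outcome variable
n_big = 5000
area_b = rng.integers(0, 4, n_big)
x_b = rng.normal(size=n_big)
w_b = rng.uniform(10, 30, n_big)

# Step 1: fit a model linking y to the auxiliary variables on the small sample
X_s = np.column_stack([np.ones(n_small), x_s, area_s])
beta, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)

# Step 2: impute the outcome for every unit of the bigger sample
X_b = np.column_stack([np.ones(n_big), x_b, area_b])
y_imp = X_b @ beta

# Step 3: weighted small area estimates from the bigger sample's imputed outcomes
est = {a: np.average(y_imp[area_b == a], weights=w_b[area_b == a])
       for a in range(4)}
```

If the bigger sample were a nonprobability survey, the weights `w_b` would instead be the pseudo-weights $1/\hat{\pi}_{ci}$ constructed by the methods discussed earlier.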
The development of new approaches demonstrates how the methods of survey estimation continue to evolve, carrying into the future the best of the fruitful theoretical and
methodological developments of the past. As Professor Kalton highlights, we will increasingly encounter data sources that are not produced by standard probability sample designs.
Statisticians will find ways to respond to new challenges, as is reflected in the following
amusing quote:
...D.J. Finney once wrote about the statistician whose client comes in and says,
“Here is my mountain of trash. Find the gems that lie therein.” Finney’s advice
was not to throw him out of the office but to attempt to find out what he considers “gems”. After all, if the trained statistician does not help, he will find
someone who will... (source: David Salsburg, ASA Connect discussion)
Of course, nonprobability samples should not be viewed as a “mountain of trash.” Indeed, they can contain a lot of relevant information for producing necessary estimates.
It is just that one needs to explore different innovative ways to use information contained
in nonprobability samples. In the United States federal statistical system, the need to innovate in combining information from multiple sources has been emphasized in the National
Academies of Sciences, Engineering, and Medicine (2017) report on Innovations in Federal Statistics.
As discussed, statisticians are already engaged in suggesting new ideas, such as statistical data integration, to extract information from multiple non-traditional databases.
In coming years, statisticians will be increasingly occupied with finding solutions for obtaining useful information from non-traditional data sources. This is indeed an exciting time
for survey statisticians.
References
Beaumont, J.-F., K. Bosa, A. Brennan, J. Charlebois, and K. Chu (2023). Handling nonprobability samples through inverse probability weighting with an application to Statistics
Canada’s crowdsourcing data. Survey Methodology (accepted in 2023 and expected to
appear in 2024).
Bell, W. R., W. W. Basel, and J. J. Maples (2016). An overview of the US Census Bureau’s
small area income and poverty estimates program, pp. 349–378. Wiley Online Library.
Beresovsky, V. (2019). On application of a response propensity model to estimation from
web samples. ResearchGate preprint.
Casas-Cordero Valencia, C., J. Encina, and P. Lahiri (2016). Poverty mapping for the
Chilean Comunas, pp. 379–404. Wiley Online Library.
Chen, Y., P. Li, and C. Wu (2020). Doubly robust inference with nonprobability survey
samples. Journal of the American Statistical Association 115(532), 2011–2021.
Elliott, M. R. (2009). Combining data from probability and non-probability samples using
pseudo-weights. Survey Practice 2, 813–845.
Elliott, M. R. and R. Valliant (2017). Inference for nonprobability samples. Statistical
Science 32(2), 249–264.
Ghosh, M. (2020). Small area estimation: Its evolution in five decades. Statistics in Transition New Series, Special Issue on Statistical Data Integration, 1–67.
Jiang, J. and P. Lahiri (2006). Mixed model prediction and small area estimation, editor’s
invited discussion paper. Test 15, 1–96.
Kim, J. and K. Morikawa (2023). An empirical likelihood approach to reduce selection bias
in voluntary samples. Calcutta Statistical Association Bulletin 35 (to appear).
National Academies of Sciences, Engineering, and Medicine (2017). Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy. Washington, DC: The National
Academies Press.
Rao, J. N. K. and I. Molina (2015). Small Area Estimation, 2nd Edition. Wiley.
Savitsky, T. D., M. R. Williams, J. Gershunskaya, V. Beresovsky, and N. G. Johnson (2022).
Methods for combining probability and nonprobability samples under unknown overlaps.
https://doi.org/10.48550/arXiv.2208.14541.
Sen, A. and P. Lahiri (2023). Estimation of finite population proportions for small areas: a
statistical data integration approach. https://doi.org/10.48550/arXiv.2305.12336.
Wang, L., R. Valliant, and Y. Li (2021). Adjusted logistic propensity weighting methods for
population inference using nonprobability volunteer-based epidemiologic cohorts. Statistics in Medicine 40(4), 5237–5250.