Sampling Methods
Exercises and Solutions
Pascal Ardilly
Yves Tillé
Translated from French by Leon Jang
Pascal Ardilly
INSEE Direction générale
Unité des Méthodes Statistiques,
Timbre F410
18 boulevard Adolphe Pinard
75675 Paris Cedex 14
France
Email: pascal.ardilly@insee.fr
Yves Tillé
Institut de Statistique,
Université de Neuchâtel
Espace de l’Europe 4, CP 805,
2002 Neuchâtel
Switzerland
Email: yves.tille@unine.ch
Library of Congress Control Number: 2005927380
ISBN-10: 0-387-26127-3
ISBN-13: 978-0387-26127-0
Printed on acid-free paper.
© 2006 Springer Science+Business Media, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street,
New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even
if they are not identified as such, is not to be taken as an expression of opinion as to whether
or not they are subject to proprietary rights.
Printed in the United States of America. (MVY)
Preface
When we agreed to pool all of our exercise material in sampling theory
to create a book, we were not aware of the scope of the work. It was indeed
necessary to assemble the information, type out the solutions, standardise
the notation and correct the drafts. It is fortunate that we had not measured
the scale of this project at the outset, for otherwise this work probably would
never have been attempted!
In making this collection of exercises available, we hope to promote the
teaching of sampling theory, whose diversity we wanted to emphasise. The
exercises are at times purely theoretical, while others originate from real
problems, enabling us to approach the sensitive matter of passing from
theory to practice that so enriches survey statistics.
The exercises that we present were used as educational material at the
École Nationale de la Statistique et de l’Analyse de l’Information (ENSAI),
where we had successively taught sampling theory. We are not the authors of
all the exercises. In fact, some of them are due to Jean-Claude Deville and
Laurent Wilms. We thank them for allowing us to reproduce their exercises.
It is also possible that certain exercises had been initially conceived by an
author that we have not identified. Beyond the contribution of our colleagues,
and in all cases, we do not consider ourselves to be the lone authors of these
exercises: they actually form part of a common heritage from ENSAI that has
been enriched and improved due to questions from students and the work of
all the demonstrators of the sampling course at ENSAI.
We would like to thank Laurent Wilms, who played a leading role in the organisation of this practical undertaking, and Sylvie Rousseau for her many
corrections of a preliminary version of this manuscript. Inès Pasini, Yves-Alain
Gerber and Anne-Catherine Favre helped us over and over again with typing
and composition. We also thank ENSAI, who supported part of the scientific
typing. Finally, we particularly express our gratitude to Marjolaine Girin for
her meticulous work with typing, layout and composition.
Pascal Ardilly and Yves Tillé
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Population, variable and function of interest . . . . . . . . . . . . . . . . 2
1.3 Sample and sampling design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Horvitz-Thompson estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Simple random sampling without replacement . . . . . . . . . . . . . . . 5
2.2 Simple random sampling with replacement . . . . . . . . . . . . . . . . . . 6
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Exercise 2.1 Cultivated surface area . . . . . . . . . . . . . . . . . . . . . . . . 7
Exercise 2.2 Occupational sickness . . . . . . . . . . . . . . . . . . . . . . . . . 8
Exercise 2.3 Probability of inclusion and design with replacement . . 11
Exercise 2.4 Sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Exercise 2.5 Number of clerics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Exercise 2.6 Size for proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Exercise 2.7 Estimation of the population variance . . . . . . . . . . 14
Exercise 2.8 Repeated survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Exercise 2.9 Candidates in an election . . . . . . . . . . . . . . . . . . . . . . 18
Exercise 2.10 Select-reject method . . . . . . . . . . . . . . . . . . . . . . . . . 19
Exercise 2.11 Sample update method . . . . . . . . . . . . . . . . . . . . . . . 20
Exercise 2.12 Domain estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Exercise 2.13 Variance of a domain estimator . . . . . . . . . . . . . . . 23
Exercise 2.14 Complementary sampling . . . . . . . . . . . . . . . . . . . . . 27
Exercise 2.15 Capture-recapture . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Exercise 2.16 Subsample and covariance . . . . . . . . . . . . . . . . . . . . 35
Exercise 2.17 Recapture with replacement . . . . . . . . . . . . . . . . . . 38
Exercise 2.18 Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Exercise 2.19 Proportion of students . . . . . . . . . . . . . . . . . . . . . . . 42
Exercise 2.20 Sampling with replacement and estimator improvement . . 47
Exercise 2.21 Variance of the variance . . . . . . . . . . . . . . . . . . . . . . 50
3 Sampling with Unequal Probabilities . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Calculation of inclusion probabilities . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Estimation and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Exercise 3.1 Design and inclusion probabilities . . . . . . . . . . . . . . 60
Exercise 3.2 Variance of indicators and design of fixed size . . . . 61
Exercise 3.3 Variance of indicators and sampling design . . . . . . 61
Exercise 3.4 Estimation of a square root . . . . . . . . . . . . . . . . . . . . 63
Exercise 3.5 Variance and concurrent estimates of variance . . . . 65
Exercise 3.6 Unbiased estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Exercise 3.7 Concurrent estimation of the population variance . 69
Exercise 3.8 Systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Exercise 3.9 Systematic sampling of businesses . . . . . . . . . . . . . . 72
Exercise 3.10 Systematic sampling and variance . . . . . . . . . . . . . 73
Exercise 3.11 Systematic sampling and order . . . . . . . . . . . . . . . . 76
Exercise 3.12 Sunter’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Exercise 3.13 Sunter’s method and second-order probabilities . . 79
Exercise 3.14 Eliminatory method . . . . . . . . . . . . . . . . . . . . . . . . . 81
Exercise 3.15 Midzuno’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Exercise 3.16 Brewer’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Exercise 3.17 Sampling with replacement and comparison of
means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Exercise 3.18 Geometric mean and Poisson design . . . . . . . . . . . 90
Exercise 3.19 Sen-Yates-Grundy variance . . . . . . . . . . . . . . . . . . . 92
Exercise 3.20 Balanced design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Exercise 3.21 Design effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Exercise 3.22 Rao-Blackwellisation . . . . . . . . . . . . . . . . . . . . . . . . 99
Exercise 3.23 Null second-order probabilities . . . . . . . . . . . . . . . . 101
Exercise 3.24 Hájek’s ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Exercise 3.25 Weighting and estimation of the population size . 105
Exercise 3.26 Poisson sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Exercise 3.27 Quota method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Exercise 3.28 Successive balancing . . . . . . . . . . . . . . . . . . . . . . . . . 114
Exercise 3.29 Absence of a sampling frame . . . . . . . . . . . . . . . . . . 116
4 Stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.2 Estimation and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Exercise 4.1 Awkward stratification . . . . . . . . . . . . . . . . . . . . . . . . 123
Exercise 4.2 Strata according to income . . . . . . . . . . . . . . . . . . . . 124
Exercise 4.3 Strata of elephants . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Exercise 4.4 Strata according to age . . . . . . . . . . . . . . . . . . . . . . . 127
Exercise 4.5 Strata of businesses . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Exercise 4.6 Stratification and unequal probabilities . . . . . . . . . . 132
Exercise 4.7 Strata of doctors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Exercise 4.8 Estimation of the population variance . . . . . . . . . . . 140
Exercise 4.9 Expected value of the sample variance . . . . . . . . . . 143
Exercise 4.10 Stratification and difference estimator . . . . . . . . . . 146
Exercise 4.11 Optimality for a domain . . . . . . . . . . . . . . . . . . . . . . 148
Exercise 4.12 Optimality for a difference . . . . . . . . . . . . . . . . . . . . 149
Exercise 4.13 Naive estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Exercise 4.14 Comparison of regions and optimality . . . . . . . . . . 151
Exercise 4.15 Variance of a product . . . . . . . . . . . . . . . . . . . . . . . . 153
Exercise 4.16 National and regional optimality . . . . . . . . . . . . . . 154
Exercise 4.17 What is the design? . . . . . . . . . . . . . . . . . . . . . . . . . 156
5 Multi-stage Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2 Estimator, variance decomposition, and variance . . . . . . . . . . . . 159
5.3 Specific case of sampling of primary units (PU) with replacement . . 160
5.4 Cluster effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Exercise 5.1 Hard disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Exercise 5.2 Selection of blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Exercise 5.3 Inter-cluster variance . . . . . . . . . . . . . . . . . . . . . . . . . 165
Exercise 5.4 Clusters of patients . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Exercise 5.5 Clusters of households and size . . . . . . . . . . . . . . . . . 168
Exercise 5.6 Which design? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Exercise 5.7 Clusters of households . . . . . . . . . . . . . . . . . . . . . . . . 172
Exercise 5.8 Bank clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Exercise 5.9 Clusters of households and number of men . . . . . . . 179
Exercise 5.10 Variance of systematic sampling . . . . . . . . . . . . . . . 186
Exercise 5.11 Comparison of two designs with two stages . . . . . 189
Exercise 5.12 Cluster effect and variable sizes . . . . . . . . . . . . . . . 194
Exercise 5.13 Variance and list order . . . . . . . . . . . . . . . . . . . . . . . 199
6 Calibration with an Auxiliary Variable . . . . . . . . . . . . . . . . . . . . 209
6.1 Calibration with a qualitative variable . . . . . . . . . . . . . . . . . . . . . 209
6.2 Calibration with a quantitative variable . . . . . . . . . . . . . . . . . . . . 210
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Exercise 6.1 Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Exercise 6.2 Post-stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Exercise 6.3 Ratio and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Exercise 6.4 Comparison of estimators . . . . . . . . . . . . . . . . . . . . . 218
Exercise 6.5 Foot size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Exercise 6.6 Cavities and post-stratification . . . . . . . . . . . . . . . . . 221
Exercise 6.7 Votes and difference estimation . . . . . . . . . . . . . . . . 225
Exercise 6.8 Combination of ratios . . . . . . . . . . . . . . . . . . . . . . . . . 230
Exercise 6.9 Overall ratio or combined ratio . . . . . . . . . . . . . . . . . 236
Exercise 6.10 Calibration and two phases . . . . . . . . . . . . . . . . . . . 245
Exercise 6.11 Regression and repeated surveys . . . . . . . . . . . . . . 251
Exercise 6.12 Bias of a ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7 Calibration with Several Auxiliary Variables . . . . . . . . . . . . . . . 263
7.1 Calibration estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.2 Generalised regression estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.3 Marginal calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Exercise 7.1 Adjustment of a table on the margins . . . . . . . . . . . 265
Exercise 7.2 Ratio estimation and adjustment . . . . . . . . . . . . . . . 266
Exercise 7.3 Regression and unequal probabilities . . . . . . . . . . . . 272
Exercise 7.4 Possible and impossible adjustments . . . . . . . . . . . . 278
Exercise 7.5 Calibration and linear method . . . . . . . . . . . . . . . . . 279
Exercise 7.6 Regression and strata . . . . . . . . . . . . . . . . . . . . . . . . . 282
Exercise 7.7 Calibration on sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Exercise 7.8 Optimal estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Exercise 7.9 Calibration on population size . . . . . . . . . . . . . . . . . 287
Exercise 7.10 Double calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
8 Variance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
8.1 Principal techniques of variance estimation . . . . . . . . . . . . . . . . . 293
8.2 Method of estimator linearisation . . . . . . . . . . . . . . . . . . . . . . . . . . 294
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Exercise 8.1 Variances in an employment survey . . . . . . . . . . . . . 295
Exercise 8.2 Tour de France . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Exercise 8.3 Geometric mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Exercise 8.4 Poisson design and calibration on population size . 301
Exercise 8.5 Variance of a regression estimator . . . . . . . . . . . . . . 304
Exercise 8.6 Variance of the regression coefficient . . . . . . . . . . . . 306
Exercise 8.7 Variance of the coefficient of determination . . . . . . 310
Exercise 8.8 Variance of the coefficient of skewness . . . . . . . . . . . 311
Exercise 8.9 Half-samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
9 Treatment of Non-response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.1 Reweighting methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.2 Imputation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercise 9.1 Weight of an aeroplane . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercise 9.2 Weighting and non-response . . . . . . . . . . . . . . . . . . . 326
Exercise 9.3 Precision and non-response . . . . . . . . . . . . . . . . . . . . 334
Exercise 9.4 Non-response and variance . . . . . . . . . . . . . . . . . . . . 343
Exercise 9.5 Non-response and superpopulation . . . . . . . . . . . . . . 349
Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
Normal Distribution Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
1 Introduction
1.1 References
This book presents a collection of sampling exercises covering the major chapters of this branch of statistics. Our objective here is not to present the
theory needed to solve these exercises. Nevertheless, each chapter
contains a brief review that clarifies the notation used. The reader can consult
more theoretical works. Let us first of all cite the books that can be considered
as classics: Yates (1949), Deming (1950), Hansen et al. (1993a), Hansen et al.
(1993b), Deming (1960), Kish (1965), Raj (1968), Sukhatme and Sukhatme
(1970), Konijn (1973), Cochran (1977), a simple and clear work that is very
often cited as a reference, and Jessen (1978). The posthumous work of Hájek (1981) remains a masterpiece but is unfortunately difficult to understand.
Kish (1989) offered a practical and interesting work which largely transcends
the agricultural domain. The book by Thompson (1992) is an excellent presentation of spatial sampling. The work by Cassel et al. (1993), devoted to the foundations of sampling
theory, has recently been republished. The modern reference book for the past ten years remains the famous Särndal et al. (1992),
even if other interesting works have been published like Hedayat and Sinha
(1991), Krishnaiah and Rao (1994), or the book Valliant et al. (2000), dedicated to the model-based approach. The recent book by Lohr (1999) is a very
pedagogical work which largely covers the field. We recommend it for discovering
the subject. We also cite two works devoted exclusively to sampling with
unequal probabilities, Brewer and Hanif (1983) and Gabler (1990), as well as the
book by Wolter (1985), devoted to variance estimation.
In French, we can suggest in chronological order the books by Thionet
(1953) and by Zarkovich (1966) as well as that by Desabie (1966), which are
now classics. Then, we can cite the more recent books by Deroo and Dussaix
(1980), Gouriéroux (1981), Grosbras (1987), the collective work edited by
Droesbeke et al. (1987), the small book by Morin (1993) and finally the manual
of exercises published by Dussaix and Grosbras (1992). The ‘Que Sais-je?’
by Dussaix and Grosbras (1996) provides an appreciable popularisation of the
theory. Obviously, the two theoretical works by the present authors, Ardilly
(1994) and Tillé (2001), are well suited for going into the subject in depth.
Finally, a very complete work is suggested, in Italian, by Cicchitelli et al.
(1992) and, in Chinese, by Ren and Ma (1996).
1.2 Population, variable and function of interest
Consider a finite population composed of N observation units; each unit
can be identified by a label, and the set of labels is denoted
\[ U = \{1, \ldots, N\}. \]
We are interested in a variable y which takes the value $y_k$ on unit k. These
values are not random. The objective is to estimate the value of a function of
interest
\[ \theta = f(y_1, \ldots, y_k, \ldots, y_N). \]
The most frequent functions are the total
\[ Y = \sum_{k \in U} y_k, \]
the mean
\[ \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k = \frac{Y}{N}, \]
the population variance
\[ \sigma_y^2 = \frac{1}{N} \sum_{k \in U} \left( y_k - \bar{Y} \right)^2, \]
and the corrected population variance
\[ S_y^2 = \frac{1}{N-1} \sum_{k \in U} \left( y_k - \bar{Y} \right)^2. \]
The size of the population is not necessarily known and can therefore be
considered as a total to estimate. In fact, we can write
\[ N = \sum_{k \in U} 1. \]
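These definitions can be made concrete with a short numerical sketch (Python, not part of the original text; the population values are invented purely for illustration):

```python
# A minimal numerical sketch (hypothetical values y_k): the four
# functions of interest computed for a small artificial population.
y = [8.0, 3.0, 11.0, 4.0, 9.0]      # values y_k for k in U = {1, ..., N}
N = len(y)

Y_total = sum(y)                     # total Y
Y_bar = Y_total / N                  # mean
sigma2 = sum((yk - Y_bar) ** 2 for yk in y) / N        # population variance
S2 = sum((yk - Y_bar) ** 2 for yk in y) / (N - 1)      # corrected variance

print(Y_total, Y_bar, sigma2, S2)
```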
1.3 Sample and sampling design
A sample without replacement s is a subset of U. A sampling design p(·) is a
probability distribution on the set of all possible samples such that
\[ p(s) \geq 0 \text{ for all } s \subset U, \quad \text{and} \quad \sum_{s \subset U} p(s) = 1. \]
The random sample S is a random set of labels whose probability
distribution is
\[ \Pr(S = s) = p(s), \text{ for all } s \subset U. \]
The sample size n(S) can be random. If the sample is of fixed size, we denote
the size simply as n. The indicator variable for the presence of units in the
sample is defined by
\[ I_k = \begin{cases} 1 & \text{if } k \in S \\ 0 & \text{if } k \notin S. \end{cases} \]
The inclusion probability is the probability that unit k is in the sample:
\[ \pi_k = \Pr(k \in S) = \mathrm{E}(I_k) = \sum_{s \ni k} p(s). \]
This probability can (in theory) be deduced from the sampling design. The
second-order inclusion probability is
\[ \pi_{k\ell} = \Pr(k \in S \text{ and } \ell \in S) = \mathrm{E}(I_k I_\ell) = \sum_{s \ni k, \ell} p(s). \]
Finally, the covariance of the indicators is
\[ \Delta_{k\ell} = \mathrm{cov}(I_k, I_\ell) = \begin{cases} \pi_k (1 - \pi_k) & \text{if } \ell = k \\ \pi_{k\ell} - \pi_k \pi_\ell & \text{if } \ell \neq k. \end{cases} \]
If the design is of fixed size n, we have
\[ \sum_{k \in U} \pi_k = n, \quad \sum_{k \in U} \pi_{k\ell} = n \pi_\ell, \quad \text{and} \quad \sum_{k \in U} \Delta_{k\ell} = 0. \tag{1.1} \]
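The identities (1.1) are easy to verify by exhaustive enumeration on a tiny design. The following sketch (an assumed setup, not from the book: a simple design of size n = 2 drawn from N = 4 units, every sample equally likely) checks them numerically:

```python
from itertools import combinations

# Sketch: verify the fixed-size identities (1.1) on the simple design of
# size n = 2 from U = {0, 1, 2, 3}, where every sample is equally likely.
N, n = 4, 2
samples = list(combinations(range(N), n))
p = 1 / len(samples)                                   # p(s), uniform here

# First- and second-order inclusion probabilities (with pi_kk = pi_k).
pi = [sum(p for s in samples if k in s) for k in range(N)]
pi2 = [[sum(p for s in samples if k in s and l in s) for l in range(N)]
       for k in range(N)]

assert abs(sum(pi) - n) < 1e-12                        # sum_k pi_k = n
for l in range(N):
    assert abs(sum(pi2[k][l] for k in range(N)) - n * pi[l]) < 1e-12
    delta_col = sum(pi[k] * (1 - pi[k]) if k == l
                    else pi2[k][l] - pi[k] * pi[l] for k in range(N))
    assert abs(delta_col) < 1e-12                      # sum_k Delta_kl = 0
print("identities (1.1) verified")
```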
1.4 Horvitz-Thompson estimator
The Horvitz-Thompson estimator of the total is defined by
\[ \hat{Y}_\pi = \sum_{k \in S} \frac{y_k}{\pi_k}. \]
This estimator is unbiased if all the first-order inclusion probabilities are
strictly positive. If the population size is known, we can estimate the mean
with the Horvitz-Thompson estimator:
\[ \hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k}. \]
The variance of $\hat{Y}_\pi$ is
\[ \mathrm{var}(\hat{Y}_\pi) = \sum_{k \in U} \sum_{\ell \in U} \frac{y_k y_\ell}{\pi_k \pi_\ell} \Delta_{k\ell}. \]
If the sample is of fixed size (var(#S) = 0), then Sen (1953) and Yates and
Grundy (1953) showed that the variance can also be written
\[ \mathrm{var}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{k \in U} \sum_{\ell \in U} \left( \frac{y_k}{\pi_k} - \frac{y_\ell}{\pi_\ell} \right)^2 \Delta_{k\ell}. \]
The variance can be estimated by
\[ \widehat{\mathrm{var}}(\hat{Y}_\pi) = \sum_{k \in S} \sum_{\ell \in S} \frac{y_k y_\ell}{\pi_k \pi_\ell} \frac{\Delta_{k\ell}}{\pi_{k\ell}}, \]
where $\pi_{kk} = \pi_k$. If the design is of fixed size, we can construct another estimator from the Sen-Yates-Grundy expression:
\[ \widehat{\mathrm{var}}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{k \in S} \sum_{\substack{\ell \in S \\ \ell \neq k}} \left( \frac{y_k}{\pi_k} - \frac{y_\ell}{\pi_\ell} \right)^2 \frac{\Delta_{k\ell}}{\pi_{k\ell}}. \]
These two estimators are unbiased if all the second-order inclusion probabilities are strictly positive. When the sample size is 'sufficiently large' (in
practice, a few dozen most often suffices), we can construct confidence intervals with a confidence level of (1 − α) for Y according to
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{Y}_\pi - z_{1-\alpha/2} \sqrt{\mathrm{var}(\hat{Y}_\pi)},\; \hat{Y}_\pi + z_{1-\alpha/2} \sqrt{\mathrm{var}(\hat{Y}_\pi)} \right], \]
where $z_{1-\alpha/2}$ is the (1 − α/2)-quantile of a standard normal random variable
(see Tables 10.1, 10.2, and 10.3). These intervals are estimated by replacing
$\mathrm{var}(\hat{Y}_\pi)$ with $\widehat{\mathrm{var}}(\hat{Y}_\pi)$.
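As a sanity check, the unbiasedness of the Horvitz-Thompson estimator and the variance formula above can be verified by enumerating every possible sample of a small design. The sketch below assumes a simple design of fixed size (the y-values are invented for illustration):

```python
from itertools import combinations

# Sketch (hypothetical y-values): the Horvitz-Thompson estimator and its
# variance formula, checked exactly by enumerating every sample of a
# simple design of fixed size n (so pi_k = n/N for all k).
y = [8.0, 3.0, 11.0, 4.0, 9.0]
N, n = len(y), 3
pi = n / N
pi2 = n * (n - 1) / (N * (N - 1))          # second-order probability, k != l

samples = list(combinations(range(N), n))
p = 1 / len(samples)
ht = [sum(y[k] / pi for k in s) for s in samples]   # HT estimate per sample

Y = sum(y)
E = sum(p * t for t in ht)                 # exact expectation of Y_hat_pi
V = sum(p * (t - Y) ** 2 for t in ht)      # exact variance of Y_hat_pi

# var(Y_hat_pi) = sum_k sum_l (y_k y_l / (pi_k pi_l)) Delta_kl
V_formula = sum(
    (y[k] * y[l] / pi ** 2)
    * ((pi * (1 - pi)) if k == l else (pi2 - pi * pi))
    for k in range(N) for l in range(N)
)
assert abs(E - Y) < 1e-9 and abs(V - V_formula) < 1e-9
print(E, V)
```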
2 Simple Random Sampling
2.1 Simple random sampling without replacement
A design is simple without replacement of fixed size n if and only if, for all s,
\[ p(s) = \begin{cases} \binom{N}{n}^{-1} & \text{if } \#s = n \\ 0 & \text{otherwise}, \end{cases} \]
where
\[ \binom{N}{n} = \frac{N!}{n!(N-n)!}. \]
We can derive the inclusion probabilities
\[ \pi_k = \frac{n}{N}, \quad \text{and} \quad \pi_{k\ell} = \frac{n(n-1)}{N(N-1)}. \]
Finally,
\[ \Delta_{k\ell} = \frac{n(N-n)}{N^2} \times \begin{cases} 1 & \text{if } k = \ell \\ \dfrac{-1}{N-1} & \text{if } k \neq \ell. \end{cases} \]
The Horvitz-Thompson estimator of the total becomes
\[ \hat{Y}_\pi = \frac{N}{n} \sum_{k \in S} y_k. \]
That for the mean is written as
\[ \hat{\bar{Y}}_\pi = \frac{1}{n} \sum_{k \in S} y_k. \]
The variance of $\hat{Y}_\pi$ is
\[ \mathrm{var}(\hat{Y}_\pi) = N^2 \left( 1 - \frac{n}{N} \right) \frac{S_y^2}{n}, \]
and its unbiased estimator is
\[ \widehat{\mathrm{var}}(\hat{Y}_\pi) = N^2 \left( 1 - \frac{n}{N} \right) \frac{s_y^2}{n}, \]
where
\[ s_y^2 = \frac{1}{n-1} \sum_{k \in S} \left( y_k - \hat{\bar{Y}}_\pi \right)^2. \]
The Horvitz-Thompson estimator of the proportion $P_D$ that a subpopulation D represents in the total population is
\[ p = \frac{n_D}{n}, \]
where $n_D = \#(S \cap D)$; that is, p is the proportion of individuals of D in S. We
verify
\[ \mathrm{var}(p) = \left( 1 - \frac{n}{N} \right) \frac{N}{N-1} \frac{P_D (1 - P_D)}{n}, \]
and we estimate this variance without bias by
\[ \widehat{\mathrm{var}}(p) = \left( 1 - \frac{n}{N} \right) \frac{p(1-p)}{n-1}. \]
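The variance formula for an estimated proportion can likewise be checked against the exact variance obtained by enumerating all samples. The sketch below uses a hypothetical population of N = 8 units, of which 3 belong to the domain D:

```python
from itertools import combinations

# Sketch (made-up numbers): exact variance of the estimated proportion p
# under simple random sampling without replacement, compared with
# var(p) = (1 - n/N) * (N/(N-1)) * P_D (1 - P_D) / n.
N, n = 8, 3
D = {0, 1, 2}                        # hypothetical domain, so P_D = 3/8
P = len(D) / N

samples = list(combinations(range(N), n))
props = [len(D.intersection(s)) / n for s in samples]
V_exact = sum((q - P) ** 2 for q in props) / len(samples)

V_formula = (1 - n / N) * (N / (N - 1)) * P * (1 - P) / n
assert abs(V_exact - V_formula) < 1e-12
print(V_exact)
```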
2.2 Simple random sampling with replacement
If m units are selected with replacement and with equal probabilities at each
trial in the population U, then we define $\tilde{y}_i$ as the value of the variable y for
the i-th unit selected in the sample. The same unit can be selected several times.
The mean estimator
\[ \hat{\bar{Y}}_{WR} = \frac{1}{m} \sum_{i=1}^{m} \tilde{y}_i \]
is unbiased, and its variance is
\[ \mathrm{var}(\hat{\bar{Y}}_{WR}) = \frac{\sigma_y^2}{m}. \]
In a simple design with replacement, the sample variance
\[ \tilde{s}_y^2 = \frac{1}{m-1} \sum_{i=1}^{m} (\tilde{y}_i - \hat{\bar{Y}}_{WR})^2 \]
estimates $\sigma_y^2$ without bias. It is possible, however, to show that if we retain only the $n_S$ distinct units of the sample S, then the estimator
\[ \hat{\bar{Y}}_{DU} = \frac{1}{n_S} \sum_{k \in S} y_k \]
is unbiased for the mean and has a smaller variance than that of $\hat{\bar{Y}}_{WR}$. Table 2.1 presents a summary of the main results under simple designs.
Table 2.1. Simple designs: summary table

Quantity                        | Without replacement                                             | With replacement
Sample size                     | $n$                                                             | $m$
Mean estimator                  | $\hat{\bar{Y}} = \frac{1}{n} \sum_{k \in S} y_k$                | $\hat{\bar{Y}}_{WR} = \frac{1}{m} \sum_{i=1}^{m} \tilde{y}_i$
Variance of the mean estimator  | $\mathrm{var}(\hat{\bar{Y}}) = \frac{N-n}{nN} S_y^2$            | $\mathrm{var}(\hat{\bar{Y}}_{WR}) = \frac{\sigma_y^2}{m}$
Expected sample variance        | $\mathrm{E}(s_y^2) = S_y^2$                                     | $\mathrm{E}(\tilde{s}_y^2) = \sigma_y^2$
Variance estimator of the mean  | $\widehat{\mathrm{var}}(\hat{\bar{Y}}) = \frac{N-n}{nN} s_y^2$  | $\widehat{\mathrm{var}}(\hat{\bar{Y}}_{WR}) = \frac{\tilde{s}_y^2}{m}$
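The with-replacement column of Table 2.1 can be verified by enumerating all $N^m$ ordered samples. The sketch below does this for a hypothetical population of three units:

```python
from itertools import product

# Sketch: the with-replacement results checked by enumerating all N^m
# ordered samples drawn with replacement (hypothetical small population).
y = [2.0, 5.0, 9.0]
N, m = len(y), 2
mu = sum(y) / N
sigma2 = sum((v - mu) ** 2 for v in y) / N

E_mean = E_var_of_mean = E_s2 = 0.0
for draw in product(range(N), repeat=m):      # each draw has prob 1/N^m
    vals = [y[i] for i in draw]
    ybar = sum(vals) / m
    s2 = sum((v - ybar) ** 2 for v in vals) / (m - 1)
    E_mean += ybar
    E_var_of_mean += (ybar - mu) ** 2
    E_s2 += s2
E_mean /= N ** m
E_var_of_mean /= N ** m
E_s2 /= N ** m

assert abs(E_mean - mu) < 1e-12               # unbiased for the mean
assert abs(E_var_of_mean - sigma2 / m) < 1e-12
assert abs(E_s2 - sigma2) < 1e-12             # E(s~_y^2) = sigma_y^2
print(E_mean, E_var_of_mean, E_s2)
```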
EXERCISES
Exercise 2.1 Cultivated surface area
We want to estimate the surface area cultivated on the farms of a rural township. Of the N = 2010 farms that comprise the township, we select 100 using
simple random sampling. We measure $y_k$, the surface area cultivated on
farm k in hectares, and we find
\[ \sum_{k \in S} y_k = 2907 \text{ ha} \quad \text{and} \quad \sum_{k \in S} y_k^2 = 154\,593 \text{ ha}^2. \]
1. Give the value of the standard unbiased estimator of the mean
\[ \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k. \]
2. Give a 95% confidence interval for $\bar{Y}$.
Solution
In a simple design, the unbiased estimator of $\bar{Y}$ is
\[ \hat{\bar{Y}} = \frac{1}{n} \sum_{k \in S} y_k = \frac{2907}{100} = 29.07 \text{ ha}. \]
The estimator of the dispersion $S_y^2$ is
\[ s_y^2 = \frac{n}{n-1} \left( \frac{1}{n} \sum_{k \in S} y_k^2 - \hat{\bar{Y}}^2 \right) = \frac{100}{99} \left( \frac{154\,593}{100} - 29.07^2 \right) = 707.945. \]
The sample size n being 'sufficiently large', the 95% confidence interval is
estimated in hectares as follows:
\[ \hat{\bar{Y}} \pm 1.96 \sqrt{\frac{N-n}{N} \frac{s_y^2}{n}} = 29.07 \pm 1.96 \sqrt{\frac{2010-100}{2010} \times \frac{707.945}{100}} = [23.99; 34.15]. \]
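The computations of this solution can be reproduced directly (a sketch; the figures are those of the exercise):

```python
import math

# Reproducing the computations of Exercise 2.1 (figures from the book).
N, n = 2010, 100
sum_y, sum_y2 = 2907.0, 154593.0

ybar = sum_y / n                                   # 29.07 ha
s2 = n / (n - 1) * (sum_y2 / n - ybar ** 2)        # ~707.945
half = 1.96 * math.sqrt((N - n) / N * s2 / n)      # CI half-length
print(round(ybar, 2), round(s2, 3), round(ybar - half, 2), round(ybar + half, 2))
```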
Exercise 2.2 Occupational sickness
We are interested in estimating the proportion of men P affected by an occupational sickness in a business of 1500 workers. In addition, we know that
three out of 10 workers are usually affected by this sickness in businesses of
the same type. We propose to select a sample by means of a simple random
sample.
1. What sample size must be selected so that the total length of a confidence
interval with a 0.95 confidence level is less than 0.02, for simple designs
with replacement and without replacement?
2. What should we do if we do not know the proportion of men usually
affected by the sickness (in the case of a design without replacement)?
To avoid confusion in notation, we will use the subscript WR for estimators
with replacement, and the subscript WOR for estimators without replacement.
Solution
1. a) Design with replacement.
If the design is of size m, the length of the (estimated) confidence
interval at level (1 − α) for a mean is given by
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{\bar{Y}} - z_{1-\alpha/2} \sqrt{\frac{\tilde{s}_y^2}{m}},\; \hat{\bar{Y}} + z_{1-\alpha/2} \sqrt{\frac{\tilde{s}_y^2}{m}} \right], \]
where $z_{1-\alpha/2}$ is the quantile of order 1 − α/2 of a standard normal variate. If we denote $\hat{P}_{WR}$ as the estimator of the proportion
for the design with replacement, we can write
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{P}_{WR} - z_{1-\alpha/2} \sqrt{\frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}},\; \hat{P}_{WR} + z_{1-\alpha/2} \sqrt{\frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}} \right]. \]
Indeed, in this case,
\[ \widehat{\mathrm{var}}(\hat{P}_{WR}) = \frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}. \]
So that the total length of the confidence interval does not exceed
0.02, it is necessary and sufficient that
\[ 2 z_{1-\alpha/2} \sqrt{\frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}} \leq 0.02. \]
By dividing by two and squaring, we get
\[ z_{1-\alpha/2}^2 \frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1} \leq 0.0001, \]
which gives
\[ m - 1 \geq z_{1-\alpha/2}^2 \frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{0.0001}. \]
For a 95% confidence interval, and with an estimate of P of 0.3
coming from a source external to the survey, we have $z_{1-\alpha/2} = 1.96$,
and
\[ m = 1 + 1.96^2 \times \frac{0.3 \times 0.7}{0.0001} = 8068.36. \]
The sample size (m = 8069) is therefore larger than the population
size, which is possible (but not prudent) since the sampling is with
replacement.
b) Design without replacement.
If the design is of size n, the length of the (estimated) confidence
interval at level 1 − α for a mean is given by
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{\bar{Y}} - z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{s_y^2}{n}},\; \hat{\bar{Y}} + z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{s_y^2}{n}} \right]. \]
For a proportion P, and denoting $\hat{P}_{WOR}$ as the estimator of the proportion for the design without replacement, we therefore have
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{P}_{WOR} - z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1}},\; \hat{P}_{WOR} + z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1}} \right]. \]
So that the total length of the confidence interval does not surpass 0.02, it
is necessary and sufficient that
\[ 2 z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1}} \leq 0.02. \]
By dividing by two and squaring, we get
\[ z_{1-\alpha/2}^2 \frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1} \leq 0.0001, \]
which gives
\[ (n-1) \times 0.0001 - z_{1-\alpha/2}^2 \frac{N-n}{N} \hat{P}_{WOR}(1-\hat{P}_{WOR}) \geq 0, \]
or again
\[ n \left( 0.0001 + z_{1-\alpha/2}^2 \frac{1}{N} \hat{P}_{WOR}(1-\hat{P}_{WOR}) \right) \geq 0.0001 + z_{1-\alpha/2}^2 \hat{P}_{WOR}(1-\hat{P}_{WOR}), \]
or
\[ n \geq \frac{0.0001 + z_{1-\alpha/2}^2 \hat{P}_{WOR}(1-\hat{P}_{WOR})}{0.0001 + z_{1-\alpha/2}^2 \frac{1}{N} \hat{P}_{WOR}(1-\hat{P}_{WOR})}. \]
For a 95% confidence interval, and with an a priori estimate of P of
0.3 coming from a source external to the survey, we have
\[ n \geq \frac{0.0001 + 1.96^2 \times 0.30 \times 0.70}{0.0001 + 1.96^2 \times \frac{1}{1500} \times 0.30 \times 0.70} = 1264.98. \]
Here, a sample size of 1265 is sufficient. The size obtained justifies
the hypothesis of a normal distribution for $\hat{P}_{WOR}$. The impact
of the finite population correction (1 − n/N) can therefore be decisive
when the population size is small and the desired accuracy is relatively
high.
2. If the proportion of affected workers is not estimated a priori, we place
ourselves in the most unfavourable situation, that is, the one where the variance
is greatest: this leads to a likely excessive size n, but ensures that the
length of the confidence interval does not exceed the fixed threshold of
0.02. For the design without replacement, this amounts to taking a proportion of 50%. In this case, by adapting the calculations from 1(b), we
find n ≥ 1298. We thus note that a significant variation in the proportion
(from 30% to 50%) involves only a minimal variation in the sample size
(from 1265 to 1298).
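The sample-size rule derived above can be packaged as a small helper (a sketch; the function name is ours, not the book's):

```python
import math

# Sketch of the sample-size rule from part 1(b), without replacement:
# smallest n for which the 95% interval has total length at most 2*e,
# where e is the half-length (e = 0.01 in the exercise).
def sample_size_wor(N, P, e, z=1.96):
    num = e ** 2 + z ** 2 * P * (1 - P)
    den = e ** 2 + z ** 2 * P * (1 - P) / N
    return math.ceil(num / den)

print(sample_size_wor(1500, 0.30, 0.01))   # 1265, as in the exercise
print(sample_size_wor(1500, 0.50, 0.01))   # 1298, the worst case
```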
Exercise 2.3 Probability of inclusion and design with replacement
In a simple random design with replacement of fixed size m in a population
of size N:
1. Calculate the probability that an individual k is selected at least once in
the sample.
2. Show that
\[ \Pr(k \in S) = \frac{m}{N} + O\left(\frac{m^2}{N^2}\right) \]
when m/N is small. Recall that a function f(n) of n is of order of magnitude g(n) (written f(n) = O(g(n))) if and only if f(n)/g(n) is bounded,
that is to say, there exists a quantity M such that, for all n ∈ N,
|f(n)|/g(n) ≤ M.
3. What are the conclusions?
Solution
1. We obtain this probability from the complementary event:
m
1
.
Pr (k ∈ S) = 1 − Pr (k ∈
/ S) = 1 − 1 −
N
2. Then, we derive
$$
\Pr(k \in S) = 1 - \left(1 - \frac{1}{N}\right)^m
= 1 - \sum_{j=0}^{m}\binom{m}{j}\left(-\frac{1}{N}\right)^{m-j}
$$
$$
= 1 - \left\{\sum_{j=0}^{m-2}\binom{m}{j}\left(-\frac{1}{N}\right)^{m-j} - \frac{m}{N} + 1\right\}
= \frac{m}{N} - \sum_{j=0}^{m-2}\binom{m}{j}\left(-\frac{1}{N}\right)^{m-j}
$$
$$
= \frac{m}{N} + O\left(\frac{m^2}{N^2}\right).
$$
3. We conclude that if the sampling rate m/N is small, then (m/N)² is negligible compared to m/N. We thus recover the inclusion probability of a design without replacement, because the two modes of sampling become indistinguishable.
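A quick numerical check of this conclusion (our addition; the values of N and m are arbitrary):

```python
# Exact inclusion probability under SRS with replacement vs the
# first-order approximation m/N when the sampling rate is small.
N, m = 10_000, 100
exact = 1 - (1 - 1 / N) ** m  # result of question 1
approx = m / N                # first-order term of question 2
print(exact, approx)          # the gap is of order (m/N)^2
```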
Exercise 2.4 Sample size
What sample size is needed if we choose a simple random sample to find, to within two percentage points (at least) and with 95 chances out of 100, the proportion of Parisians who wear glasses?
2 Simple Random Sampling
Solution
There are two reasonable positions from which to deal with these issues:
• The population of Paris is very large: the sampling rate is therefore negligible.
• Obviously not having any a priori information on the proportion sought, we place ourselves in the situation which leads to a maximum sample size (strong 'precautionary' stance), taking P = 50 %. If the reality is different (which is almost certain), we have in fine a lesser uncertainty than was fixed at the start (2 percentage points).
We set n in a way so that
$$
1.96 \times \sqrt{\frac{P(1-P)}{n}} = 0.02, \quad \text{with } P = 0.5,
$$
hence n = 2401 people.
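The computation reduces to $n = (1.96 \times 0.5 / 0.02)^2$, which a two-line check confirms (our sketch, not part of the solution):

```python
import math

# Solve 1.96 * sqrt(P(1-P)/n) = 0.02 for n, with the precautionary P = 0.5
z, e, P = 1.96, 0.02, 0.5
n = (z * math.sqrt(P * (1 - P)) / e) ** 2
print(round(n))  # 2401 people
```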
Exercise 2.5 Number of clerics
We want to estimate the number of clerics in the French population. For that,
we choose to select n individuals using a simple random sample. If the true
proportion (unknown) of clerics in the population is 0.1 %, how many people
must be selected to obtain a coefficient of variation CV of 5 %?
Solution
By definition:
$$
CV = \frac{\sigma(Np)}{NP} = \frac{\sigma(p)}{P},
$$
where P is the true proportion to estimate (0.1 % here) and p its unbiased estimator, which is the proportion of clerics in the selected sample. A CV of 5 % corresponds to a reasonably 'average' accuracy. In fact,
$$
\operatorname{var}(p) \approx \frac{P(1-P)}{n}
$$
(f a priori negligible compared to 1). Therefore,
$$
CV = \sqrt{\frac{1-P}{nP}} \approx \frac{1}{\sqrt{nP}} = 0.05,
$$
which gives
$$
n = \frac{1}{0.001} \times \frac{1}{0.05^2} = 400\,000.
$$
This large size, impossible in practice to obtain, is a direct result of the scarcity of the sub-population studied.
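A short check of this computation (our sketch; `n_exact` keeps the (1 − P) factor that the approximation drops):

```python
# Sample size for a target coefficient of variation on a rare proportion
P, CV = 0.001, 0.05
n_approx = 1 / (P * CV ** 2)       # using CV ≈ 1/sqrt(nP)
n_exact = (1 - P) / (P * CV ** 2)  # keeping the (1 - P) factor
print(round(n_approx), round(n_exact))  # 400000 399600
```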
Exercise 2.6 Size for proportions
In a population of 4 000 people, we are interested in two proportions:
P1 = proportion of individuals owning a dishwasher,
P2 = proportion of individuals owning a laptop computer.
According to ‘reliable’ information, we know a priori that:
45 % ≤ P1 ≤ 65 %,
5 % ≤ P2 ≤ 10 %.
and
What does the sample size n have to be, within the framework of a simple random sample, if we want to know at the same time P1 to within ±2 % and P2 to within ±1 %, with a confidence level of 95 %?
Solution
We estimate $P_i$ $(i = 1, 2)$ without bias by the proportion $p_i$ calculated in the sample:
$$
\operatorname{var}(p_i) = \left(1 - \frac{n}{N}\right)\frac{1}{n}\,\frac{N}{N-1}\,P_i(1-P_i).
$$
We want
$$
1.96 \times \sqrt{\operatorname{var}(p_1)} \leq 0.02, \quad \text{and} \quad 1.96 \times \sqrt{\operatorname{var}(p_2)} \leq 0.01.
$$
In fact,
$$
\max_{45\,\% \leq P_1 \leq 65\,\%} P_1(1-P_1) = 0.5(1-0.5) = 0.25,
\quad
\max_{5\,\% \leq P_2 \leq 10\,\%} P_2(1-P_2) = 0.1(1-0.1) = 0.09.
$$
The maximum value of $P_i(1-P_i)$ is 0.25 (see Figure 2.1) and leads to a maximum n (as a security to reach at least the desired accuracy).
[Fig. 2.1. Variance according to the proportion: Exercise 2.6. The curve P(1−P) plotted against P.]
It is jointly necessary that
$$
\begin{cases}
\left(1 - \dfrac{n}{N}\right)\dfrac{1}{n}\,\dfrac{N}{N-1} \times 0.25 \leq \left(\dfrac{0.02}{1.96}\right)^2 \\[3mm]
\left(1 - \dfrac{n}{N}\right)\dfrac{1}{n}\,\dfrac{N}{N-1} \times 0.09 \leq \left(\dfrac{0.01}{1.96}\right)^2,
\end{cases}
$$
which implies that
$$
n \geq 1500.62 \quad \text{and} \quad n \geq 1854.74.
$$
The condition on the accuracy of $p_2$ being the most demanding, we conclude by choosing n = 1855.
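The two bounds can be recomputed with a small Python function (our sketch; the function name and the closed-form inversion of the constraint are ours):

```python
import math

N = 4000  # population size

def min_n(v_max, half_width, z=1.96):
    """Smallest n with z*sqrt((1/n - 1/N) * N/(N-1) * V) <= half_width,
    where V bounds P(1-P) over the a priori interval for P."""
    c = (half_width / z) ** 2
    return math.ceil(1 / (c * (N - 1) / (N * v_max) + 1 / N))

n1 = min_n(0.25, 0.02)  # constraint on P1 (±2 %)
n2 = min_n(0.09, 0.01)  # constraint on P2 (±1 %)
print(n1, n2, max(n1, n2))  # 1501 1855 1855
```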
Exercise 2.7 Estimation of the population variance
Show that
$$
\sigma_y^2 = \frac{1}{N}\sum_{k \in U}\left(y_k - \bar{Y}\right)^2
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\substack{\ell \in U \\ \ell \neq k}}\left(y_k - y_\ell\right)^2.
\qquad (2.1)
$$
Use this equality to (easily) find an unbiased estimator of the population variance $S_y^2$ in the case of simple random sampling, where $S_y^2 = N\sigma_y^2/(N-1)$.
Solution
A first manner of showing this equality is the following:
$$
\frac{1}{2N^2}\sum_{k \in U}\sum_{\substack{\ell \in U \\ \ell \neq k}}(y_k - y_\ell)^2
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\ell \in U}(y_k - y_\ell)^2
$$
$$
= \frac{1}{2N^2}\left(\sum_{k \in U}\sum_{\ell \in U} y_k^2 + \sum_{k \in U}\sum_{\ell \in U} y_\ell^2 - 2\sum_{k \in U}\sum_{\ell \in U} y_k y_\ell\right)
$$
$$
= \frac{1}{N}\sum_{k \in U} y_k^2 - \frac{1}{N^2}\sum_{k \in U}\sum_{\ell \in U} y_k y_\ell
= \frac{1}{N}\sum_{k \in U} y_k^2 - \bar{Y}^2
= \frac{1}{N}\sum_{k \in U}(y_k - \bar{Y})^2 = \sigma_y^2.
$$
A second manner is:
$$
\frac{1}{2N^2}\sum_{k \in U}\sum_{\substack{\ell \in U \\ \ell \neq k}}(y_k - y_\ell)^2
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\ell \in U}(y_k - \bar{Y} - y_\ell + \bar{Y})^2
$$
$$
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\ell \in U}\left[(y_k - \bar{Y})^2 + (y_\ell - \bar{Y})^2 - 2(y_k - \bar{Y})(y_\ell - \bar{Y})\right]
$$
$$
= \frac{1}{2N}\sum_{k \in U}(y_k - \bar{Y})^2 + \frac{1}{2N}\sum_{\ell \in U}(y_\ell - \bar{Y})^2 + 0 = \sigma_y^2.
$$
The unbiased estimator of $\sigma_y^2$ is
$$
\widehat{\sigma}_y^2 = \frac{1}{2N^2}\sum_{k \in S}\sum_{\substack{\ell \in S \\ \ell \neq k}}\frac{(y_k - y_\ell)^2}{\pi_{k\ell}},
$$
where $\pi_{k\ell}$ is the second-order inclusion probability. With a simple design without replacement of fixed sample size,
$$
\pi_{k\ell} = \frac{n(n-1)}{N(N-1)},
$$
thus
$$
\widehat{\sigma}_y^2 = \frac{N(N-1)}{n(n-1)}\,\frac{1}{2N^2}\sum_{k \in S}\sum_{\substack{\ell \in S \\ \ell \neq k}}(y_k - y_\ell)^2.
$$
By adapting (2.1) with the sample S (in place of U), we get:
$$
\frac{1}{2n^2}\sum_{k \in S}\sum_{\substack{\ell \in S \\ \ell \neq k}}(y_k - y_\ell)^2 = \frac{1}{n}\sum_{k \in S}\left(y_k - \widehat{\bar{Y}}\right)^2,
\quad \text{where} \quad \widehat{\bar{Y}} = \frac{1}{n}\sum_{k \in S} y_k.
$$
Therefore
$$
\widehat{\sigma}_y^2 = \frac{N-1}{N}\,\frac{1}{n-1}\sum_{k \in S}\left(y_k - \widehat{\bar{Y}}\right)^2 = \frac{N-1}{N}\,s_y^2,
$$
and since $S_y^2 = N\sigma_y^2/(N-1)$, we get $\widehat{S}_y^2 = s_y^2$. This result is well-known and takes longer to show if we do not use the equality (2.1).
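Both the identity (2.1) and the conclusion $\widehat{S}_y^2 = s_y^2$ can be verified by exhaustive enumeration on a toy population (our illustration; the data values are arbitrary):

```python
from itertools import combinations
from math import comb

y = [3.0, 7.0, 1.0, 9.0, 5.0]  # arbitrary toy population U
N = len(y)
Ybar = sum(y) / N
sigma2 = sum((v - Ybar) ** 2 for v in y) / N
# Identity (2.1): diagonal terms are zero, so we can sum over all pairs
double = sum((a - b) ** 2 for a in y for b in y) / (2 * N ** 2)

# Mean of s_y^2 over all SRSWOR samples of size n equals S_y^2
n = 3
S2 = N * sigma2 / (N - 1)
mean_s2 = 0.0
for s in combinations(y, n):
    m = sum(s) / n
    mean_s2 += sum((v - m) ** 2 for v in s) / (n - 1)
mean_s2 /= comb(N, n)
print(sigma2, double, S2, mean_s2)
```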
Exercise 2.8 Repeated survey
We consider a population of 10 service-stations and are interested in the price
of a litre of high-grade petrol at each station. The prices during two consecutive months, May and June, appear in Table 2.2.
1. We want to estimate the evolution of the average price per litre between
May and June. We choose as a parameter the difference in average prices.
Method 1: we sample n stations (n < 10) in May and n stations in June,
the two samples being completely independent;
Method 2: we sample n stations in May and we again question these stations in June (panel technique).
Compare the efficiency of the two concurrent methods.
Table 2.2. Price per litre of high-grade petrol: Exercise 2.8

Station   1     2     3     4     5     6     7     8     9    10
May      5.82  5.33  5.76  5.98  6.20  5.89  5.68  5.55  5.69  5.81
June     5.89  5.34  5.92  6.05  6.20  6.00  5.79  5.63  5.78  5.84
2. The same question, if this time we want to estimate an average price over the combined May-June period.
3. If we are interested in the average price of Question 2, would it not be better, instead of selecting 10 records twice with Method 1 (10 per month), to directly select 20 records without worrying about the months (Method 3)? No calculation is necessary.
N.B.: Question 3 is related to stratification.
Solution
1. We denote $p_m$ as the simple average of the recorded prices among the n stations for month m (m = May or June). We have:
$$
\operatorname{var}(p_m) = \frac{1-f}{n}\,S_m^2,
$$
where $S_m^2$ is the variance of the 10 prices relative to month m.
• Method 1. We estimate without bias the evolution of prices by $p_{June} - p_{May}$ (the two estimators are calculated on two different a priori samples) and
$$
\operatorname{var}_1(p_{June} - p_{May}) = \frac{1-f}{n}\left(S_{May}^2 + S_{June}^2\right).
$$
Indeed, the covariance is null because the two samples (and therefore the two estimators $p_{May}$ and $p_{June}$) are independent.
• Method 2. We have only one sample (the panel). Still, we estimate the evolution of prices without bias by $p_{June} - p_{May}$, and
$$
\operatorname{var}_2(p_{June} - p_{May}) = \frac{1-f}{n}\left(S_{May}^2 + S_{June}^2 - 2S_{May,June}\right).
$$
This time, there is a covariance term, with:
$$
\operatorname{cov}(p_{May}, p_{June}) = \frac{1-f}{n}\,S_{May,June},
$$
where $S_{May,June}$ represents the true empirical covariance between the 10 records in May and the 10 records in June. We therefore have:
$$
\frac{\operatorname{var}_1(p_{June} - p_{May})}{\operatorname{var}_2(p_{June} - p_{May})}
= \frac{S_{May}^2 + S_{June}^2}{S_{May}^2 + S_{June}^2 - 2S_{May,June}}.
$$
After calculating, we find:
$$
\left.
\begin{aligned}
S_{May}^2 &= 0.05601 \\
S_{June}^2 &= 0.0564711 \\
S_{May,June} &= 0.0550289
\end{aligned}
\right\}
\;\Rightarrow\;
\frac{\operatorname{var}_1(p_{June} - p_{May})}{\operatorname{var}_2(p_{June} - p_{May})} \approx (6.81)^2.
$$
The use of a panel allows for the division of the standard error by 6.81.
This enormous gain is due to the very strong correlation between the
prices of May and June (ρ ≈ 0.98): a station where high-grade petrol is
expensive in May remains expensive in June compared to other stations
(and vice versa). We easily verify this by calculating the true average
prices in May (5.77) and June (5.84): if we compare the monthly average
prices, only Station 3 changes position between May and June.
2. The average price for the two-month period is estimated without bias, with the two methods, by:
$$
p = \frac{p_{May} + p_{June}}{2}.
$$
• Method 1:
$$
\operatorname{var}_1(p) = \frac{1}{4} \times \frac{1-f}{n}\left[S_{May}^2 + S_{June}^2\right].
$$
• Method 2:
$$
\operatorname{var}_2(p) = \frac{1}{4} \times \frac{1-f}{n}\left[S_{May}^2 + S_{June}^2 + 2S_{May,June}\right].
$$
This time, the covariance is added (due to the '+' sign appearing in p). In conclusion, we have
$$
\frac{\operatorname{var}_1(p)}{\operatorname{var}_2(p)}
= \frac{S_{May}^2 + S_{June}^2}{S_{May}^2 + S_{June}^2 + 2S_{May,June}} = (0.71)^2 = 0.50.
$$
The use of a panel proves to be ineffective: with equal sample sizes, we lose 29 % of accuracy.
As the variances vary as 1/n, if we consider that the total cost of a survey is proportional to the sample size, this result amounts to saying that for a given variance, Method 1 allows a saving of 50 % of the budget in comparison to Method 2: this is obviously strongly significant.
3. Method 1 remains the best. Indeed, Method 3 amounts to selecting a simple random sample of size 2n in a population of size 2N, whereas Method 1 amounts to having two strata each of size N and selecting n individuals in each stratum: the latter thus corresponds to a proportional allocation. In fact, we know that for a fixed total sample size (2n here), to estimate a combined average, stratification with proportional allocation is always preferable to simple random sampling.
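The variance ratios of Questions 1 and 2 can be recomputed from Table 2.2 with a few lines of Python (our sketch, not part of the solution):

```python
from statistics import mean

may  = [5.82, 5.33, 5.76, 5.98, 6.20, 5.89, 5.68, 5.55, 5.69, 5.81]
june = [5.89, 5.34, 5.92, 6.05, 6.20, 6.00, 5.79, 5.63, 5.78, 5.84]
N = len(may)
m1, m2 = mean(may), mean(june)
# Empirical variances and covariance, with denominator N - 1
s2_may  = sum((x - m1) ** 2 for x in may)  / (N - 1)
s2_june = sum((x - m2) ** 2 for x in june) / (N - 1)
s_mj    = sum((x - m1) * (y - m2) for x, y in zip(may, june)) / (N - 1)

# Question 1: estimating the evolution p_June - p_May
ratio_diff = (s2_may + s2_june) / (s2_may + s2_june - 2 * s_mj)
# Question 2: estimating the combined average
ratio_mean = (s2_may + s2_june) / (s2_may + s2_june + 2 * s_mj)
print(round(ratio_diff ** 0.5, 2), round(ratio_mean ** 0.5, 2))  # 6.81 0.71
```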
Exercise 2.9 Candidates in an election
In an election, there are two candidates. The day before the election, an opinion poll (simple random sample) is taken among n voters, with n equal to at
least 100 voters (the voter population is very large compared to the sample
size). The question is to find out the necessary difference in percentage points
between the two candidates so that the poll produces the name of the winner
(known by census the next day) 95 times out of 100. Perform the numeric
application for some values of n.
Hints: Consider that the loser of the election is A and that the percentage of votes he receives on the day of the election is $P_A$; on the day of the poll, we denote $\widehat{P}_A$ as the percentage of votes obtained by this candidate A.
We will convince ourselves of the fact that the problem above, posed in 'common terms', can be clearly expressed using a statistical point of view: find the critical region so that the probability of declaring A as the winner on the day of the poll (while $P_A$ is in reality less than 50 %) is less than 5 %.
Solution
In adopting the terminology of test theory, we want a 'critical region' of the form $]c, +\infty[$, the problem being to find c, with:
$$
\Pr\left[\widehat{P}_A > c \mid P_A < 50\,\%\right] \leq 5\,\%
$$
(the event $P_A < 50\,\%$ is by definition certain; it is presented for reference). Indeed, the rule that will decide on the date of the poll who would win the following day can only be of the type '$\widehat{P}_A$ greater than a certain level'. We make the hypothesis that $\widehat{P}_A \sim \mathcal{N}(P_A, \sigma_A^2)$, with:
$$
\sigma_A^2 = \frac{P_A(1-P_A)}{n}.
$$
This approximation is justified because n is 'sufficiently large' (n ≥ 100). We try to find c such that:
$$
\Pr\left[\frac{\widehat{P}_A - P_A}{\sigma_A} > \frac{c - P_A}{\sigma_A}\,\Big|\; P_A < 50\,\%\right] \leq 5\,\%.
$$
However, $P_A$ remains unknown. In reality, it is the maximum of these probabilities that must be considered among all possible $P_A$, meaning all $P_A < 0.5$. Therefore, we try to find c such that:
$$
\max_{\{P_A\}} \Pr\left[\mathcal{N}(0,1) > \frac{c - P_A}{\sigma_A}\,\Big|\; P_A < 0.5\right] \leq 0.05.
$$
Now, the quantity
$$
\frac{c - P_A}{\sqrt{\dfrac{P_A(1-P_A)}{n}}}
$$
is clearly a decreasing function of $P_A$ (for $P_A < 0.5$). We see that the maximum of the probability is attained for the minimum $(c - P_A)/\sigma_A$, or in other words the maximum $P_A$ (subject to $P_A < 0.5$). Therefore, we have $P_A = 50\,\%$. We try to find c satisfying:
$$
\Pr\left[\mathcal{N}(0,1) > \frac{c - 0.5}{\sqrt{0.25/n}}\right] \leq 0.05.
$$
Consulting a quantile table of the normal distribution shows that it is necessary for:
$$
\frac{c - 0.5}{\sqrt{0.25/n}} = 1.65.
$$
Conclusion: The critical region is
$$
\widehat{P}_A > \frac{1}{2} + 1.65\sqrt{\frac{0.25}{n}},
$$
that is
$$
\widehat{P}_A > \frac{1}{2} + \frac{1.65}{2\sqrt{n}}.
$$
The difference in percentage points therefore must be at least the following:
$$
\widehat{P}_A - \widehat{P}_B = 2\widehat{P}_A - 1 \geq \frac{1.65}{\sqrt{n}}.
$$
If the difference in percentage points is at least equal to $1.65/\sqrt{n}$, then we
have less than a 5 % chance of declaring A the winner on the day of the
opinion poll while in reality he will lose on the day of the elections, that is, we
have at least a 95 % chance of making the right prediction. Table 2.3 contains
several numeric applications. The case n = 900 corresponds to the opinion
poll sample size traditionally used for elections.
Table 2.3. Numeric applications: Exercise 2.9

n                 100   400   900  2000  5000  10000
1.65/√n (in %)   16.5   8.3   5.5   3.7   2.3    1.7
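Table 2.3 can be reproduced with a one-line computation (our sketch; the margins are expressed in percentage points):

```python
# Minimum lead 2*P_A - 1 >= 1.65/sqrt(n), in percentage points
margins = {n: 100 * 1.65 / n ** 0.5 for n in (100, 400, 900, 2000, 5000, 10000)}
for n, m in margins.items():
    print(n, round(m, 1))
```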
Exercise 2.10 Select-reject method
Select a sample of size 4 in a population of size 10 using a simple random
design without replacement with the select-reject method. This method is
due to Fan et al. (1962) and is described in detail in Tillé (2001, p. 74). The
procedure consists of sequentially reading the frame. At each stage, we decide
whether or not to select a unit of observation with the following probability:
$$
\frac{\text{number of units remaining to select in the sample}}{\text{number of units remaining to examine in the population}}.
$$
Use the following observations of a uniform random variable over [0, 1]:
0.375489 0.624004 0.517951 0.045450 0.632912
0.246090 0.927398 0.325950 0.645951 0.178048
Solution
Noting k as the observation number and j as the number of units already
selected at the start of stage k, the algorithm is described in Table 2.4. The
sample is composed of units {1, 4, 6, 8}.
Table 2.4. Select-reject method: Exercise 2.10

 k      u_k     j   (n−j)/(N−(k−1))   I_k
 1   0.375489   0   4/10 = 0.4000      1
 2   0.624004   1   3/9  = 0.3333      0
 3   0.517951   1   3/8  = 0.3750      0
 4   0.045450   1   3/7  = 0.4286      1
 5   0.632912   2   2/6  = 0.3333      0
 6   0.246090   2   2/5  = 0.4000      1
 7   0.927398   3   1/4  = 0.2500      0
 8   0.325950   3   1/3  = 0.3333      1
 9   0.645951   4   0/2  = 0.0000      0
10   0.178048   4   0/1  = 0.0000      0
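The select-reject procedure can be written as a short Python function (our sketch; the function name is ours). Applied to the ten uniform draws above, it reproduces the sample of Table 2.4:

```python
def select_reject(N, n, uniforms):
    """Select-reject method (Fan et al., 1962): scan the frame once and
    select unit k with prob. (units left to select)/(units left to scan)."""
    sample, j = [], 0
    for k in range(1, N + 1):
        if uniforms[k - 1] < (n - j) / (N - (k - 1)):
            sample.append(k)
            j += 1
    return sample

u = [0.375489, 0.624004, 0.517951, 0.045450, 0.632912,
     0.246090, 0.927398, 0.325950, 0.645951, 0.178048]
print(select_reject(10, 4, u))  # [1, 4, 6, 8]
```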
Exercise 2.11 Sample update method
In selecting a sample according to a simple design without replacement, there
exist several algorithms. One method, proposed by McLeod and Bellhouse (1983), works in the following manner:
• We select the first n units of the list.
• We then examine the case of record (n + 1). We select unit n + 1 with a probability n/(n + 1). If unit n + 1 is selected, we remove one unit from the sample that we selected at random and with equal probabilities.
• For the units k, where n + 1 < k ≤ N, we maintain this rule. Unit k is selected with probability n/k. If unit k is selected, we remove one unit from the sample that we selected at random and with equal probabilities.
1. We denote $\pi_\ell^{(k)}$ as the probability that individual ℓ is in the sample at stage k, where (ℓ ≤ k), meaning after we have examined the case of record k (k ≥ n). Show that $\pi_\ell^{(k)} = n/k$. (It can be interesting to proceed in a recursive manner.)
2. Verify that the final probability of inclusion is indeed that which we obtain
for a design with equal probabilities of fixed size.
3. What is interesting about this method?
Solution
1. • If k = n, then $\pi_\ell^{(k)} = 1 = n/n$, for all ℓ ≤ n.
• If k = n + 1, then we have directly $\pi_{n+1}^{(n+1)} = n/(n+1)$. Furthermore, for ℓ < k,
$$
\begin{aligned}
\pi_\ell^{(n+1)} &= \Pr\left[\text{unit } \ell \text{ being in the sample at stage } (n+1)\right] \\
&= \Pr\left[\text{unit } (n+1) \text{ not being selected at stage } (n+1)\right] \\
&\quad + \Pr\left[\text{unit } (n+1) \text{ being selected at stage } (n+1)\right]
\times \Pr\left[\text{unit } \ell \text{ not being removed at stage } (n+1)\right] \\
&= \left(1 - \frac{n}{n+1}\right) + \frac{n}{n+1} \times \frac{n-1}{n} = \frac{n}{n+1}.
\end{aligned}
$$
• If k > n + 1, we use a recursive proof. We suppose that, for all ℓ ≤ k − 1,
$$
\pi_\ell^{(k-1)} = \frac{n}{k-1}, \qquad (2.2)
$$
and we are going to show that if (2.2) is true then, for all ℓ ≤ k,
$$
\pi_\ell^{(k)} = \frac{n}{k}. \qquad (2.3)
$$
The initial conditions are confirmed since we have proven (2.3) for k = n and k = n + 1. If ℓ = k, then the algorithm directly gives
$$
\pi_k^{(k)} = \frac{n}{k}.
$$
• If ℓ < k, then we calculate, using Bayes' theorem,
$$
\begin{aligned}
\pi_\ell^{(k)} &= \Pr\left[\text{unit } \ell \text{ being in the sample at stage } k\right] \\
&= \Pr\left[\text{unit } k \text{ not being selected at stage } k\right]
\times \Pr\left[\text{unit } \ell \text{ being in the sample at stage } k-1\right] \\
&\quad + \Pr\left[\text{unit } k \text{ being selected at stage } k\right]
\times \Pr\left[\text{unit } \ell \text{ being in the sample at stage } k-1\right] \\
&\qquad \times \Pr\left[\text{unit } \ell \text{ not being removed at stage } k\right] \\
&= \left(1 - \frac{n}{k}\right) \times \pi_\ell^{(k-1)} + \frac{n}{k} \times \pi_\ell^{(k-1)} \times \frac{n-1}{n}
= \pi_\ell^{(k-1)}\,\frac{k-1}{k} = \frac{n}{k}.
\end{aligned}
$$
2. At the end of the algorithm, k = N and therefore $\pi_\ell^{(N)} = n/N$, for all ℓ ∈ U.
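The update rule just analysed is easy to implement and to check empirically. A minimal Python sketch (our own; the function name and the Monte Carlo check are illustrative, not from the book):

```python
import random

def mcleod_bellhouse(N, n, rng):
    """Sample update method: keep the first n units, then, for each k > n,
    select unit k with probability n/k and, if selected, replace a sample
    member chosen at random with equal probabilities."""
    sample = list(range(1, n + 1))
    for k in range(n + 1, N + 1):
        if rng.random() < n / k:
            sample[rng.randrange(n)] = k
    return sample

# Empirical check that every unit ends with inclusion probability n/N
rng = random.Random(42)
N, n, trials = 10, 4, 20000
counts = [0] * (N + 1)
for _ in range(trials):
    for unit in mcleod_bellhouse(N, n, rng):
        counts[unit] += 1
for unit in range(1, N + 1):
    assert abs(counts[unit] / trials - n / N) < 0.02
print([round(counts[u] / trials, 3) for u in range(1, N + 1)])
```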