Sampling Methods
Exercises and Solutions
Pascal Ardilly
Yves Tillé
Translated from French by Leon Jang
Pascal Ardilly
INSEE Direction générale
Unité des Méthodes Statistiques,
Timbre F410
18 boulevard Adolphe Pinard
75675 Paris Cedex 14
France
Email: pascal.ardilly@insee.fr
Yves Tillé
Institut de Statistique,
Université de Neuchâtel
Espace de l’Europe 4, CP 805,
2002 Neuchâtel
Switzerland
Email: yves.tille@unine.ch
Library of Congress Control Number: 2005927380
ISBN-10: 0-387-26127-3
ISBN-13: 978-0387-26127-0
Printed on acid-free paper.
© 2006 Springer Science+Business Media, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street,
New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even
if they are not identified as such, is not to be taken as an expression of opinion as to whether
or not they are subject to proprietary rights.
Printed in the United States of America. (MVY)
Preface
When we agreed to pool all of our exercise material in sampling theory
to create a book, we were not aware of the scope of the work. It was indeed
necessary to assemble the information, type out the solutions, standardise
the notation and correct the drafts. It is fortunate that we had not measured
the scale of this project at the outset, for otherwise this work probably would
never have been attempted!
In making this collection of exercises available, we hope to promote the
teaching of sampling theory, whose diversity we wanted to emphasise. The
exercises are at times purely theoretical, while others originate from real
problems, enabling us to approach the sensitive matter of passing from
theory to practice that so enriches survey statistics.
The exercises that we present were used as educational material at the
École Nationale de la Statistique et de l’Analyse de l’Information (ENSAI),
where we had successively taught sampling theory. We are not the authors of
all the exercises. In fact, some of them are due to Jean-Claude Deville and
Laurent Wilms. We thank them for allowing us to reproduce their exercises.
It is also possible that certain exercises had been initially conceived by an
author that we have not identified. Beyond the contribution of our colleagues,
and in all cases, we do not consider ourselves to be the lone authors of these
exercises: they actually form part of a common heritage from ENSAI that has
been enriched and improved due to questions from students and the work of
all the demonstrators of the sampling course at ENSAI.
We would like to thank Laurent Wilms, who played a leading role in the organisation of this practical undertaking, and Sylvie Rousseau for her many
corrections of a preliminary version of this manuscript. Inès Pasini, Yves-Alain
Gerber and Anne-Catherine Favre helped us over and over again with typing
and composition. We also thank ENSAI, who supported part of the scientific
typing. Finally, we particularly express our gratitude to Marjolaine Girin for
her meticulous work with typing, layout and composition.
Pascal Ardilly and Yves Tillé
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Population, variable and function of interest . . . . . . . . . . . . . . . . 2
1.3 Sample and sampling design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Horvitz-Thompson estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Simple random sampling without replacement . . . . . . . . . . . . . . . 5
2.2 Simple random sampling with replacement . . . . . . . . . . . . . . . . . . 6
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Exercise 2.1 Cultivated surface area . . . . . . . . . . . . . . . . . . . . . . . . 7
Exercise 2.2 Occupational sickness . . . . . . . . . . . . . . . . . . . . . . . . . 8
Exercise 2.3 Probability of inclusion and design with replacement . . 11
Exercise 2.4 Sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Exercise 2.5 Number of clerics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Exercise 2.6 Size for proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Exercise 2.7 Estimation of the population variance . . . . . . . . . . 14
Exercise 2.8 Repeated survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Exercise 2.9 Candidates in an election . . . . . . . . . . . . . . . . . . . . . . 18
Exercise 2.10 Select-reject method . . . . . . . . . . . . . . . . . . . . . . . . . 19
Exercise 2.11 Sample update method . . . . . . . . . . . . . . . . . . . . . . . 20
Exercise 2.12 Domain estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Exercise 2.13 Variance of a domain estimator . . . . . . . . . . . . . . . 23
Exercise 2.14 Complementary sampling . . . . . . . . . . . . . . . . . . . . . 27
Exercise 2.15 Capture-recapture . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Exercise 2.16 Subsample and covariance . . . . . . . . . . . . . . . . . . . . 35
Exercise 2.17 Recapture with replacement . . . . . . . . . . . . . . . . . . 38
Exercise 2.18 Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Exercise 2.19 Proportion of students . . . . . . . . . . . . . . . . . . . . . . . 42
Exercise 2.20 Sampling with replacement and estimator improvement . . 47
Exercise 2.21 Variance of the variance . . . . . . . . . . . . . . . . . . . . . . 50
3 Sampling with Unequal Probabilities . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Calculation of inclusion probabilities . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Estimation and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Exercise 3.1 Design and inclusion probabilities . . . . . . . . . . . . . . 60
Exercise 3.2 Variance of indicators and design of fixed size . . . . 61
Exercise 3.3 Variance of indicators and sampling design . . . . . . 61
Exercise 3.4 Estimation of a square root . . . . . . . . . . . . . . . . . . . . 63
Exercise 3.5 Variance and concurrent estimates of variance . . . . 65
Exercise 3.6 Unbiased estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Exercise 3.7 Concurrent estimation of the population variance . 69
Exercise 3.8 Systematic sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Exercise 3.9 Systematic sampling of businesses . . . . . . . . . . . . . . 72
Exercise 3.10 Systematic sampling and variance . . . . . . . . . . . . . 73
Exercise 3.11 Systematic sampling and order . . . . . . . . . . . . . . . . 76
Exercise 3.12 Sunter’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Exercise 3.13 Sunter’s method and second-order probabilities . . 79
Exercise 3.14 Eliminatory method . . . . . . . . . . . . . . . . . . . . . . . . . 81
Exercise 3.15 Midzuno’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Exercise 3.16 Brewer’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Exercise 3.17 Sampling with replacement and comparison of
means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Exercise 3.18 Geometric mean and Poisson design . . . . . . . . . . . 90
Exercise 3.19 Sen-Yates-Grundy variance . . . . . . . . . . . . . . . . . . . 92
Exercise 3.20 Balanced design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Exercise 3.21 Design effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Exercise 3.22 Rao-Blackwellisation . . . . . . . . . . . . . . . . . . . . . . . . 99
Exercise 3.23 Null second-order probabilities . . . . . . . . . . . . . . . . 101
Exercise 3.24 Hájek’s ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Exercise 3.25 Weighting and estimation of the population size . 105
Exercise 3.26 Poisson sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Exercise 3.27 Quota method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Exercise 3.28 Successive balancing . . . . . . . . . . . . . . . . . . . . . . . . . 114
Exercise 3.29 Absence of a sampling frame . . . . . . . . . . . . . . . . . . 116
4 Stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.2 Estimation and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Exercise 4.1 Awkward stratification . . . . . . . . . . . . . . . . . . . . . . . . 123
Exercise 4.2 Strata according to income . . . . . . . . . . . . . . . . . . . . 124
Exercise 4.3 Strata of elephants . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Exercise 4.4 Strata according to age . . . . . . . . . . . . . . . . . . . . . . . 127
Exercise 4.5 Strata of businesses . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Exercise 4.6 Stratification and unequal probabilities . . . . . . . . . . 132
Exercise 4.7 Strata of doctors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Exercise 4.8 Estimation of the population variance . . . . . . . . . . . 140
Exercise 4.9 Expected value of the sample variance . . . . . . . . . . 143
Exercise 4.10 Stratification and difference estimator . . . . . . . . . . 146
Exercise 4.11 Optimality for a domain . . . . . . . . . . . . . . . . . . . . . . 148
Exercise 4.12 Optimality for a difference . . . . . . . . . . . . . . . . . . . . 149
Exercise 4.13 Naive estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Exercise 4.14 Comparison of regions and optimality . . . . . . . . . . 151
Exercise 4.15 Variance of a product . . . . . . . . . . . . . . . . . . . . . . . . 153
Exercise 4.16 National and regional optimality . . . . . . . . . . . . . . 154
Exercise 4.17 What is the design? . . . . . . . . . . . . . . . . . . . . . . . . . 156
5 Multi-stage Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2 Estimator, variance decomposition, and variance . . . . . . . . . . . . 159
5.3 Specific case of sampling of primary units (PU) with replacement . . 160
5.4 Cluster effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Exercise 5.1 Hard disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Exercise 5.2 Selection of blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Exercise 5.3 Inter-cluster variance . . . . . . . . . . . . . . . . . . . . . . . . . 165
Exercise 5.4 Clusters of patients . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Exercise 5.5 Clusters of households and size . . . . . . . . . . . . . . . . . 168
Exercise 5.6 Which design? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Exercise 5.7 Clusters of households . . . . . . . . . . . . . . . . . . . . . . . . 172
Exercise 5.8 Bank clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Exercise 5.9 Clusters of households and number of men . . . . . . . 179
Exercise 5.10 Variance of systematic sampling . . . . . . . . . . . . . . . 186
Exercise 5.11 Comparison of two designs with two stages . . . . . 189
Exercise 5.12 Cluster effect and variable sizes . . . . . . . . . . . . . . . 194
Exercise 5.13 Variance and list order . . . . . . . . . . . . . . . . . . . . . . . 199
6 Calibration with an Auxiliary Variable . . . . . . . . . . . . . . . . . . . . 209
6.1 Calibration with a qualitative variable . . . . . . . . . . . . . . . . . . . . . 209
6.2 Calibration with a quantitative variable . . . . . . . . . . . . . . . . . . . . 210
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Exercise 6.1 Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Exercise 6.2 Post-stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Exercise 6.3 Ratio and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Exercise 6.4 Comparison of estimators . . . . . . . . . . . . . . . . . . . . . 218
Exercise 6.5 Foot size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Exercise 6.6 Cavities and post-stratification . . . . . . . . . . . . . . . . . 221
Exercise 6.7 Votes and difference estimation . . . . . . . . . . . . . . . . 225
Exercise 6.8 Combination of ratios . . . . . . . . . . . . . . . . . . . . . . . . . 230
Exercise 6.9 Overall ratio or combined ratio . . . . . . . . . . . . . . . . . 236
Exercise 6.10 Calibration and two phases . . . . . . . . . . . . . . . . . . . 245
Exercise 6.11 Regression and repeated surveys . . . . . . . . . . . . . . 251
Exercise 6.12 Bias of a ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7 Calibration with Several Auxiliary Variables . . . . . . . . . . . . . . . 263
7.1 Calibration estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.2 Generalised regression estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 264
7.3 Marginal calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Exercise 7.1 Adjustment of a table on the margins . . . . . . . . . . . 265
Exercise 7.2 Ratio estimation and adjustment . . . . . . . . . . . . . . . 266
Exercise 7.3 Regression and unequal probabilities . . . . . . . . . . . . 272
Exercise 7.4 Possible and impossible adjustments . . . . . . . . . . . . 278
Exercise 7.5 Calibration and linear method . . . . . . . . . . . . . . . . . 279
Exercise 7.6 Regression and strata . . . . . . . . . . . . . . . . . . . . . . . . . 282
Exercise 7.7 Calibration on sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Exercise 7.8 Optimal estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Exercise 7.9 Calibration on population size . . . . . . . . . . . . . . . . . 287
Exercise 7.10 Double calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
8 Variance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
8.1 Principal techniques of variance estimation . . . . . . . . . . . . . . . . . 293
8.2 Method of estimator linearisation . . . . . . . . . . . . . . . . . . . . . . . . . . 294
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Exercise 8.1 Variances in an employment survey . . . . . . . . . . . . . 295
Exercise 8.2 Tour de France . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Exercise 8.3 Geometric mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Exercise 8.4 Poisson design and calibration on population size . 301
Exercise 8.5 Variance of a regression estimator . . . . . . . . . . . . . . 304
Exercise 8.6 Variance of the regression coefficient . . . . . . . . . . . . 306
Exercise 8.7 Variance of the coefficient of determination . . . . . . 310
Exercise 8.8 Variance of the coefficient of skewness . . . . . . . . . . . 311
Exercise 8.9 Half-samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
9 Treatment of Non-response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.1 Reweighting methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
9.2 Imputation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercise 9.1 Weight of an aeroplane . . . . . . . . . . . . . . . . . . . . . . . . 320
Exercise 9.2 Weighting and non-response . . . . . . . . . . . . . . . . . . . 326
Exercise 9.3 Precision and non-response . . . . . . . . . . . . . . . . . . . . 334
Exercise 9.4 Non-response and variance . . . . . . . . . . . . . . . . . . . . 343
Exercise 9.5 Non-response and superpopulation . . . . . . . . . . . . . . 349
Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
Normal Distribution Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
1 Introduction
1.1 References
This book presents a collection of sampling exercises covering the major chapters of this branch of statistics. Our objective here is not to present the
theory needed to solve these exercises. Nevertheless, each chapter
contains a brief review that clarifies the notation used. The reader can consult
more theoretical works. Let us first of all cite the books that can be considered
as classics: Yates (1949), Deming (1950), Hansen et al. (1993a), Hansen et al.
(1993b), Deming (1960), Kish (1965), Raj (1968), Sukhatme and Sukhatme
(1970), Konijn (1973), Cochran (1977), a simple and clear work that is very
often cited as a reference, and Jessen (1978). The posthumous work of Hájek (1981) remains a masterpiece but is unfortunately difficult to understand.
Kish (1989) offered a practical and interesting work which largely transcends
the agricultural domain. The book by Thompson (1992) is an excellent presentation of spatial sampling. The work by Cassel et al. (1993), devoted to the foundations of sampling
theory, has recently been republished. The modern reference book for the past ten years remains the famous Särndal et al. (1992),
even if other interesting works have been published like Hedayat and Sinha
(1991), Krishnaiah and Rao (1994), or the book Valliant et al. (2000), dedicated to the model-based approach. The recent book by Lohr (1999) is a very
pedagogical work which largely covers the field. We recommend it for discovering
the subject. We also cite two works devoted exclusively to sampling with
unequal probabilities, Brewer and Hanif (1983) and Gabler (1990), as well as the
book by Wolter (1985), devoted to variance estimation.
In French, we can suggest in chronological order the books by Thionet
(1953) and by Zarkovich (1966) as well as that by Desabie (1966), which are
now classics. Then, we can cite the more recent books by Deroo and Dussaix
(1980), Gouriéroux (1981), Grosbras (1987), the collective work edited by
Droesbeke et al. (1987), the small book by Morin (1993) and finally the manual
of exercises published by Dussaix and Grosbras (1992). The ‘Que Sais-je?’
by Dussaix and Grosbras (1996) provides an appreciable popularisation of the
theory. Obviously, the two theoretical works by the present authors, Ardilly
(1994) and Tillé (2001), are well suited for going into the subject in depth.
Finally, a very complete work is suggested, in Italian, by Cicchitelli et al.
(1992) and, in Chinese, by Ren and Ma (1996).
1.2 Population, variable and function of interest
Consider a finite population composed of N observation units; each unit
can be identified by a label, and the set of labels is denoted
\[ U = \{1, \ldots, N\}. \]
We are interested in a variable y which takes the value $y_k$ on unit k. These
values are not random. The objective is to estimate the value of a function of
interest
\[ \theta = f(y_1, \ldots, y_k, \ldots, y_N). \]
The most frequent functions are the total
\[ Y = \sum_{k \in U} y_k, \]
the mean
\[ \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k = \frac{Y}{N}, \]
the population variance
\[ \sigma_y^2 = \frac{1}{N} \sum_{k \in U} \left( y_k - \bar{Y} \right)^2, \]
and the corrected population variance
\[ S_y^2 = \frac{1}{N-1} \sum_{k \in U} \left( y_k - \bar{Y} \right)^2. \]
The size of the population is not necessarily known and can therefore be
considered as a total to estimate. In fact, we can write
\[ N = \sum_{k \in U} 1. \]
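These definitions can be made concrete with a short numerical sketch (Python, not part of the original text; the population values are invented purely for illustration):

```python
# A minimal numerical sketch (hypothetical values y_k): the four
# functions of interest computed for a small artificial population.
y = [8.0, 3.0, 11.0, 4.0, 9.0]      # values y_k for k in U = {1, ..., N}
N = len(y)

Y_total = sum(y)                     # total Y
Y_bar = Y_total / N                  # mean
sigma2 = sum((yk - Y_bar) ** 2 for yk in y) / N        # population variance
S2 = sum((yk - Y_bar) ** 2 for yk in y) / (N - 1)      # corrected variance

print(Y_total, Y_bar, sigma2, S2)
```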
1.3 Sample and sampling design
A sample without replacement s is a subset of U. A sampling design p(·) is a
probability distribution on the set of all possible samples such that
\[ p(s) \geq 0 \text{ for all } s \subset U, \quad \text{and} \quad \sum_{s \subset U} p(s) = 1. \]
The random sample S is a random set of labels whose probability
distribution is
\[ \Pr(S = s) = p(s), \text{ for all } s \subset U. \]
The sample size n(S) can be random. If the sample is of fixed size, we denote
the size simply as n. The indicator variable for the presence of units in the
sample is defined by
\[ I_k = \begin{cases} 1 & \text{if } k \in S \\ 0 & \text{if } k \notin S. \end{cases} \]
The inclusion probability is the probability that unit k is in the sample:
\[ \pi_k = \Pr(k \in S) = \mathrm{E}(I_k) = \sum_{s \ni k} p(s). \]
This probability can (in theory) be deduced from the sampling design. The
second-order inclusion probability is
\[ \pi_{k\ell} = \Pr(k \in S \text{ and } \ell \in S) = \mathrm{E}(I_k I_\ell) = \sum_{s \ni k, \ell} p(s). \]
Finally, the covariance of the indicators is
\[ \Delta_{k\ell} = \mathrm{cov}(I_k, I_\ell) = \begin{cases} \pi_k (1 - \pi_k) & \text{if } \ell = k \\ \pi_{k\ell} - \pi_k \pi_\ell & \text{if } \ell \neq k. \end{cases} \]
If the design is of fixed size n, we have
\[ \sum_{k \in U} \pi_k = n, \quad \sum_{k \in U} \pi_{k\ell} = n \pi_\ell, \quad \text{and} \quad \sum_{k \in U} \Delta_{k\ell} = 0. \tag{1.1} \]
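The identities (1.1) are easy to verify by exhaustive enumeration on a tiny design. The following sketch (an assumed setup, not from the book: a simple design of size n = 2 drawn from N = 4 units, every sample equally likely) checks them numerically:

```python
from itertools import combinations

# Sketch: verify the fixed-size identities (1.1) on the simple design of
# size n = 2 from U = {0, 1, 2, 3}, where every sample is equally likely.
N, n = 4, 2
samples = list(combinations(range(N), n))
p = 1 / len(samples)                                   # p(s), uniform here

# First- and second-order inclusion probabilities (with pi_kk = pi_k).
pi = [sum(p for s in samples if k in s) for k in range(N)]
pi2 = [[sum(p for s in samples if k in s and l in s) for l in range(N)]
       for k in range(N)]

assert abs(sum(pi) - n) < 1e-12                        # sum_k pi_k = n
for l in range(N):
    assert abs(sum(pi2[k][l] for k in range(N)) - n * pi[l]) < 1e-12
    delta_col = sum(pi[k] * (1 - pi[k]) if k == l
                    else pi2[k][l] - pi[k] * pi[l] for k in range(N))
    assert abs(delta_col) < 1e-12                      # sum_k Delta_kl = 0
print("identities (1.1) verified")
```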
1.4 Horvitz-Thompson estimator
The Horvitz-Thompson estimator of the total is defined by
\[ \hat{Y}_\pi = \sum_{k \in S} \frac{y_k}{\pi_k}. \]
This estimator is unbiased if all the first-order inclusion probabilities are
strictly positive. If the population size is known, we can estimate the mean
with the Horvitz-Thompson estimator:
\[ \hat{\bar{Y}}_\pi = \frac{1}{N} \sum_{k \in S} \frac{y_k}{\pi_k}. \]
The variance of $\hat{Y}_\pi$ is
\[ \mathrm{var}(\hat{Y}_\pi) = \sum_{k \in U} \sum_{\ell \in U} \frac{y_k y_\ell}{\pi_k \pi_\ell} \Delta_{k\ell}. \]
If the sample is of fixed size (var(#S) = 0), then Sen (1953) and Yates and
Grundy (1953) showed that the variance can also be written
\[ \mathrm{var}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{k \in U} \sum_{\ell \in U} \left( \frac{y_k}{\pi_k} - \frac{y_\ell}{\pi_\ell} \right)^2 \Delta_{k\ell}. \]
The variance can be estimated by
\[ \widehat{\mathrm{var}}(\hat{Y}_\pi) = \sum_{k \in S} \sum_{\ell \in S} \frac{y_k y_\ell}{\pi_k \pi_\ell} \frac{\Delta_{k\ell}}{\pi_{k\ell}}, \]
where $\pi_{kk} = \pi_k$. If the design is of fixed size, we can construct another estimator from the Sen-Yates-Grundy expression:
\[ \widehat{\mathrm{var}}(\hat{Y}_\pi) = -\frac{1}{2} \sum_{k \in S} \sum_{\substack{\ell \in S \\ \ell \neq k}} \left( \frac{y_k}{\pi_k} - \frac{y_\ell}{\pi_\ell} \right)^2 \frac{\Delta_{k\ell}}{\pi_{k\ell}}. \]
These two estimators are unbiased if all the second-order inclusion probabilities are strictly positive. When the sample size is 'sufficiently large' (in
practice, a few dozen most often suffices), we can construct confidence intervals with a confidence level of (1 − α) for Y according to
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{Y}_\pi - z_{1-\alpha/2} \sqrt{\mathrm{var}(\hat{Y}_\pi)},\; \hat{Y}_\pi + z_{1-\alpha/2} \sqrt{\mathrm{var}(\hat{Y}_\pi)} \right], \]
where $z_{1-\alpha/2}$ is the (1 − α/2)-quantile of a standard normal random variable
(see Tables 10.1, 10.2, and 10.3). These intervals are estimated by replacing
$\mathrm{var}(\hat{Y}_\pi)$ with $\widehat{\mathrm{var}}(\hat{Y}_\pi)$.
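As a sanity check, the unbiasedness of the Horvitz-Thompson estimator and the variance formula above can be verified by enumerating every possible sample of a small design. The sketch below assumes a simple design of fixed size (the y-values are invented for illustration):

```python
from itertools import combinations

# Sketch (hypothetical y-values): the Horvitz-Thompson estimator and its
# variance formula, checked exactly by enumerating every sample of a
# simple design of fixed size n (so pi_k = n/N for all k).
y = [8.0, 3.0, 11.0, 4.0, 9.0]
N, n = len(y), 3
pi = n / N
pi2 = n * (n - 1) / (N * (N - 1))          # second-order probability, k != l

samples = list(combinations(range(N), n))
p = 1 / len(samples)
ht = [sum(y[k] / pi for k in s) for s in samples]   # HT estimate per sample

Y = sum(y)
E = sum(p * t for t in ht)                 # exact expectation of Y_hat_pi
V = sum(p * (t - Y) ** 2 for t in ht)      # exact variance of Y_hat_pi

# var(Y_hat_pi) = sum_k sum_l (y_k y_l / (pi_k pi_l)) Delta_kl
V_formula = sum(
    (y[k] * y[l] / pi ** 2)
    * ((pi * (1 - pi)) if k == l else (pi2 - pi * pi))
    for k in range(N) for l in range(N)
)
assert abs(E - Y) < 1e-9 and abs(V - V_formula) < 1e-9
print(E, V)
```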
2 Simple Random Sampling
2.1 Simple random sampling without replacement
A design is simple without replacement of fixed size n if and only if, for all s,
\[ p(s) = \begin{cases} \binom{N}{n}^{-1} & \text{if } \#s = n \\ 0 & \text{otherwise}, \end{cases} \]
where
\[ \binom{N}{n} = \frac{N!}{n!(N-n)!}. \]
We can derive the inclusion probabilities
\[ \pi_k = \frac{n}{N}, \quad \text{and} \quad \pi_{k\ell} = \frac{n(n-1)}{N(N-1)}. \]
Finally,
\[ \Delta_{k\ell} = \frac{n(N-n)}{N^2} \times \begin{cases} 1 & \text{if } k = \ell \\ \dfrac{-1}{N-1} & \text{if } k \neq \ell. \end{cases} \]
The Horvitz-Thompson estimator of the total becomes
\[ \hat{Y}_\pi = \frac{N}{n} \sum_{k \in S} y_k. \]
That for the mean is written as
\[ \hat{\bar{Y}}_\pi = \frac{1}{n} \sum_{k \in S} y_k. \]
The variance of $\hat{Y}_\pi$ is
\[ \mathrm{var}(\hat{Y}_\pi) = N^2 \left( 1 - \frac{n}{N} \right) \frac{S_y^2}{n}, \]
and its unbiased estimator is
\[ \widehat{\mathrm{var}}(\hat{Y}_\pi) = N^2 \left( 1 - \frac{n}{N} \right) \frac{s_y^2}{n}, \]
where
\[ s_y^2 = \frac{1}{n-1} \sum_{k \in S} \left( y_k - \hat{\bar{Y}}_\pi \right)^2. \]
The Horvitz-Thompson estimator of the proportion $P_D$ that a subpopulation D represents in the total population is
\[ p = \frac{n_D}{n}, \]
where $n_D = \#(S \cap D)$; that is, p is the proportion of individuals of D in S. We
verify
\[ \mathrm{var}(p) = \left( 1 - \frac{n}{N} \right) \frac{N}{N-1} \frac{P_D (1 - P_D)}{n}, \]
and we estimate this variance without bias by
\[ \widehat{\mathrm{var}}(p) = \left( 1 - \frac{n}{N} \right) \frac{p(1-p)}{n-1}. \]
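The variance formula for an estimated proportion can likewise be checked against the exact variance obtained by enumerating all samples. The sketch below uses a hypothetical population of N = 8 units, of which 3 belong to the domain D:

```python
from itertools import combinations

# Sketch (made-up numbers): exact variance of the estimated proportion p
# under simple random sampling without replacement, compared with
# var(p) = (1 - n/N) * (N/(N-1)) * P_D (1 - P_D) / n.
N, n = 8, 3
D = {0, 1, 2}                        # hypothetical domain, so P_D = 3/8
P = len(D) / N

samples = list(combinations(range(N), n))
props = [len(D.intersection(s)) / n for s in samples]
V_exact = sum((q - P) ** 2 for q in props) / len(samples)

V_formula = (1 - n / N) * (N / (N - 1)) * P * (1 - P) / n
assert abs(V_exact - V_formula) < 1e-12
print(V_exact)
```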
2.2 Simple random sampling with replacement
If m units are selected with replacement and with equal probabilities at each
trial in the population U, then we define $\tilde{y}_i$ as the value of the variable y for
the i-th unit selected in the sample. The same unit can be selected several times.
The mean estimator
\[ \hat{\bar{Y}}_{WR} = \frac{1}{m} \sum_{i=1}^{m} \tilde{y}_i \]
is unbiased, and its variance is
\[ \mathrm{var}(\hat{\bar{Y}}_{WR}) = \frac{\sigma_y^2}{m}. \]
In a simple design with replacement, the sample variance
\[ \tilde{s}_y^2 = \frac{1}{m-1} \sum_{i=1}^{m} (\tilde{y}_i - \hat{\bar{Y}}_{WR})^2 \]
estimates $\sigma_y^2$ without bias. It is possible, however, to show that if we retain only the $n_S$ distinct units of the sample S, then the estimator
\[ \hat{\bar{Y}}_{DU} = \frac{1}{n_S} \sum_{k \in S} y_k \]
is unbiased for the mean and has a smaller variance than that of $\hat{\bar{Y}}_{WR}$. Table 2.1 presents a summary of the main results under simple designs.
Table 2.1. Simple designs: summary table

Quantity                        | Without replacement                                             | With replacement
Sample size                     | $n$                                                             | $m$
Mean estimator                  | $\hat{\bar{Y}} = \frac{1}{n} \sum_{k \in S} y_k$                | $\hat{\bar{Y}}_{WR} = \frac{1}{m} \sum_{i=1}^{m} \tilde{y}_i$
Variance of the mean estimator  | $\mathrm{var}(\hat{\bar{Y}}) = \frac{N-n}{nN} S_y^2$            | $\mathrm{var}(\hat{\bar{Y}}_{WR}) = \frac{\sigma_y^2}{m}$
Expected sample variance        | $\mathrm{E}(s_y^2) = S_y^2$                                     | $\mathrm{E}(\tilde{s}_y^2) = \sigma_y^2$
Variance estimator of the mean  | $\widehat{\mathrm{var}}(\hat{\bar{Y}}) = \frac{N-n}{nN} s_y^2$  | $\widehat{\mathrm{var}}(\hat{\bar{Y}}_{WR}) = \frac{\tilde{s}_y^2}{m}$
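The with-replacement column of Table 2.1 can be verified by enumerating all $N^m$ ordered samples. The sketch below does this for a hypothetical population of three units:

```python
from itertools import product

# Sketch: the with-replacement results checked by enumerating all N^m
# ordered samples drawn with replacement (hypothetical small population).
y = [2.0, 5.0, 9.0]
N, m = len(y), 2
mu = sum(y) / N
sigma2 = sum((v - mu) ** 2 for v in y) / N

E_mean = E_var_of_mean = E_s2 = 0.0
for draw in product(range(N), repeat=m):      # each draw has prob 1/N^m
    vals = [y[i] for i in draw]
    ybar = sum(vals) / m
    s2 = sum((v - ybar) ** 2 for v in vals) / (m - 1)
    E_mean += ybar
    E_var_of_mean += (ybar - mu) ** 2
    E_s2 += s2
E_mean /= N ** m
E_var_of_mean /= N ** m
E_s2 /= N ** m

assert abs(E_mean - mu) < 1e-12               # unbiased for the mean
assert abs(E_var_of_mean - sigma2 / m) < 1e-12
assert abs(E_s2 - sigma2) < 1e-12             # E(s~_y^2) = sigma_y^2
print(E_mean, E_var_of_mean, E_s2)
```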
EXERCISES
Exercise 2.1 Cultivated surface area
We want to estimate the surface area cultivated on the farms of a rural township. Of the N = 2010 farms that comprise the township, we select 100 using
simple random sampling. We measure $y_k$, the surface area cultivated on
farm k in hectares, and we find
\[ \sum_{k \in S} y_k = 2907 \text{ ha} \quad \text{and} \quad \sum_{k \in S} y_k^2 = 154\,593 \text{ ha}^2. \]
1. Give the value of the standard unbiased estimator of the mean
\[ \bar{Y} = \frac{1}{N} \sum_{k \in U} y_k. \]
2. Give a 95% confidence interval for $\bar{Y}$.
Solution
In a simple design, the unbiased estimator of $\bar{Y}$ is
\[ \hat{\bar{Y}} = \frac{1}{n} \sum_{k \in S} y_k = \frac{2907}{100} = 29.07 \text{ ha}. \]
The estimator of the dispersion $S_y^2$ is
\[ s_y^2 = \frac{n}{n-1} \left( \frac{1}{n} \sum_{k \in S} y_k^2 - \hat{\bar{Y}}^2 \right) = \frac{100}{99} \left( \frac{154\,593}{100} - 29.07^2 \right) = 707.945. \]
The sample size n being 'sufficiently large', the 95% confidence interval is
estimated in hectares as follows:
\[ \hat{\bar{Y}} \pm 1.96 \sqrt{\frac{N-n}{N} \frac{s_y^2}{n}} = 29.07 \pm 1.96 \sqrt{\frac{2010-100}{2010} \times \frac{707.945}{100}} = [23.99; 34.15]. \]
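The computations of this solution can be reproduced directly (a sketch; the figures are those of the exercise):

```python
import math

# Reproducing the computations of Exercise 2.1 (figures from the book).
N, n = 2010, 100
sum_y, sum_y2 = 2907.0, 154593.0

ybar = sum_y / n                                   # 29.07 ha
s2 = n / (n - 1) * (sum_y2 / n - ybar ** 2)        # ~707.945
half = 1.96 * math.sqrt((N - n) / N * s2 / n)      # CI half-length
print(round(ybar, 2), round(s2, 3), round(ybar - half, 2), round(ybar + half, 2))
```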
Exercise 2.2 Occupational sickness
We are interested in estimating the proportion of men P affected by an occupational sickness in a business of 1500 workers. In addition, we know that
three out of 10 workers are usually affected by this sickness in businesses of
the same type. We propose to select a sample by means of a simple random
sample.
1. What sample size must be selected so that the total length of a confidence
interval with a 0.95 confidence level is less than 0.02, for simple designs
with replacement and without replacement?
2. What should we do if we do not know the proportion of men usually
affected by the sickness (in the case of a design without replacement)?
To avoid confusion in notation, we will use the subscript WR for estimators
with replacement, and the subscript WOR for estimators without replacement.
Solution
1. a) Design with replacement.
If the design is of size m, the length of the (estimated) confidence
interval at level (1 − α) for a mean is given by
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{\bar{Y}} - z_{1-\alpha/2} \sqrt{\frac{\tilde{s}_y^2}{m}},\; \hat{\bar{Y}} + z_{1-\alpha/2} \sqrt{\frac{\tilde{s}_y^2}{m}} \right], \]
where $z_{1-\alpha/2}$ is the quantile of order 1 − α/2 of a standard normal variate. If we denote $\hat{P}_{WR}$ as the estimator of the proportion
for the design with replacement, we can write
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{P}_{WR} - z_{1-\alpha/2} \sqrt{\frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}},\; \hat{P}_{WR} + z_{1-\alpha/2} \sqrt{\frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}} \right]. \]
Indeed, in this case,
\[ \widehat{\mathrm{var}}(\hat{P}_{WR}) = \frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}. \]
So that the total length of the confidence interval does not exceed
0.02, it is necessary and sufficient that
\[ 2 z_{1-\alpha/2} \sqrt{\frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1}} \leq 0.02. \]
By dividing by two and squaring, we get
\[ z_{1-\alpha/2}^2 \frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{m-1} \leq 0.0001, \]
which gives
\[ m - 1 \geq z_{1-\alpha/2}^2 \frac{\hat{P}_{WR}(1-\hat{P}_{WR})}{0.0001}. \]
For a 95% confidence interval, and with an estimate of P of 0.3
coming from a source external to the survey, we have $z_{1-\alpha/2} = 1.96$,
and
\[ m = 1 + 1.96^2 \times \frac{0.3 \times 0.7}{0.0001} = 8068.36. \]
The sample size (m = 8069) is therefore larger than the population
size, which is possible (but not prudent) since the sampling is with
replacement.
b) Design without replacement.
If the design is of size n, the length of the (estimated) confidence
interval at level 1 − α for a mean is given by
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{\bar{Y}} - z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{s_y^2}{n}},\; \hat{\bar{Y}} + z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{s_y^2}{n}} \right]. \]
For a proportion P, and denoting $\hat{P}_{WOR}$ as the estimator of the proportion for the design without replacement, we therefore have
\[ \mathrm{CI}(1-\alpha) = \left[ \hat{P}_{WOR} - z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1}},\; \hat{P}_{WOR} + z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1}} \right]. \]
So that the total length of the confidence interval does not surpass 0.02, it
is necessary and sufficient that
\[ 2 z_{1-\alpha/2} \sqrt{\frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1}} \leq 0.02. \]
By dividing by two and squaring, we get
\[ z_{1-\alpha/2}^2 \frac{N-n}{N} \frac{\hat{P}_{WOR}(1-\hat{P}_{WOR})}{n-1} \leq 0.0001, \]
which gives
\[ (n-1) \times 0.0001 - z_{1-\alpha/2}^2 \frac{N-n}{N} \hat{P}_{WOR}(1-\hat{P}_{WOR}) \geq 0, \]
or again
\[ n \left( 0.0001 + z_{1-\alpha/2}^2 \frac{1}{N} \hat{P}_{WOR}(1-\hat{P}_{WOR}) \right) \geq 0.0001 + z_{1-\alpha/2}^2 \hat{P}_{WOR}(1-\hat{P}_{WOR}), \]
or
\[ n \geq \frac{0.0001 + z_{1-\alpha/2}^2 \hat{P}_{WOR}(1-\hat{P}_{WOR})}{0.0001 + z_{1-\alpha/2}^2 \frac{1}{N} \hat{P}_{WOR}(1-\hat{P}_{WOR})}. \]
For a 95% confidence interval, and with an a priori estimate of P of
0.3 coming from a source external to the survey, we have
\[ n \geq \frac{0.0001 + 1.96^2 \times 0.30 \times 0.70}{0.0001 + 1.96^2 \times \frac{1}{1500} \times 0.30 \times 0.70} = 1264.98. \]
Here, a sample size of 1265 is sufficient. The size obtained justifies
the hypothesis of a normal distribution for $\hat{P}_{WOR}$. The impact
of the finite population correction (1 − n/N) can therefore be decisive
when the population size is small and the desired accuracy is relatively
high.
2. If the proportion of affected workers is not estimated a priori, we place
ourselves in the most unfavourable situation, that is, the one where the variance
is greatest: this leads to a likely excessive size n, but ensures that the
length of the confidence interval does not exceed the fixed threshold of
0.02. For the design without replacement, this amounts to taking a proportion of 50%. In this case, by adapting the calculations from 1(b), we
find n ≥ 1298. We thus note that a significant variation in the proportion
(from 30% to 50%) involves only a minimal variation in the sample size
(from 1265 to 1298).
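The sample-size rule derived above can be packaged as a small helper (a sketch; the function name is ours, not the book's):

```python
import math

# Sketch of the sample-size rule from part 1(b), without replacement:
# smallest n for which the 95% interval has total length at most 2*e,
# where e is the half-length (e = 0.01 in the exercise).
def sample_size_wor(N, P, e, z=1.96):
    num = e ** 2 + z ** 2 * P * (1 - P)
    den = e ** 2 + z ** 2 * P * (1 - P) / N
    return math.ceil(num / den)

print(sample_size_wor(1500, 0.30, 0.01))   # 1265, as in the exercise
print(sample_size_wor(1500, 0.50, 0.01))   # 1298, the worst case
```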
Exercise 2.3 Probability of inclusion and design with replacement
In a simple random design with replacement of fixed size m in a population
of size N:
1. Calculate the probability that an individual k is selected at least once in
the sample.
2. Show that
\[ \Pr(k \in S) = \frac{m}{N} + O\left(\frac{m^2}{N^2}\right) \]
when m/N is small. Recall that a function f(n) of n is of order of magnitude g(n) (written f(n) = O(g(n))) if and only if f(n)/g(n) is bounded,
that is to say, there exists a quantity M such that, for all n ∈ N,
|f(n)|/g(n) ≤ M.
3. What are the conclusions?
Solution
1. We obtain this probability from the complementary event:
m
1
.
Pr (k ∈ S) = 1 − Pr (k ∈
/ S) = 1 − 1 −
N
2. Then, we derive
$$
\Pr(k \in S) = 1 - \left(1 - \frac{1}{N}\right)^m
= 1 - \sum_{j=0}^{m}\binom{m}{j}\left(-\frac{1}{N}\right)^{m-j}
$$
$$
= 1 - \left\{\sum_{j=0}^{m-2}\binom{m}{j}\left(-\frac{1}{N}\right)^{m-j} - \frac{m}{N} + 1\right\}
= \frac{m}{N} - \sum_{j=0}^{m-2}\binom{m}{j}\left(-\frac{1}{N}\right)^{m-j}
$$
$$
= \frac{m}{N} + O\left(\frac{m^2}{N^2}\right).
$$
3. We conclude that if the sampling rate m/N is small, then (m/N)² is negligible compared to m/N. We thus recover the inclusion probability of a design without replacement, because the two modes of sampling become indistinguishable.
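A quick numerical check of this conclusion (our addition; the values of N and m are arbitrary):

```python
# Exact inclusion probability under SRS with replacement vs the
# first-order approximation m/N when the sampling rate is small.
N, m = 10_000, 100
exact = 1 - (1 - 1 / N) ** m  # result of question 1
approx = m / N                # first-order term of question 2
print(exact, approx)          # the gap is of order (m/N)^2
```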
Exercise 2.4 Sample size
What sample size is needed if we choose a simple random sample to find, to within two percentage points (at least) and with 95 chances out of 100, the proportion of Parisians who wear glasses?
2 Simple Random Sampling
Solution
There are two reasonable positions from which to deal with these issues:
• The population of Paris is very large: the sampling rate is therefore negligible.
• Obviously not having any a priori information on the proportion sought, we place ourselves in the situation which leads to a maximum sample size (strong 'precautionary' stance), taking P = 50 %. If the reality is different (which is almost certain), we have in fine a lesser uncertainty than was fixed at the start (2 percentage points).
We set n in a way so that
$$
1.96 \times \sqrt{\frac{P(1-P)}{n}} = 0.02, \quad \text{with } P = 0.5,
$$
hence n = 2401 people.
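The computation reduces to $n = (1.96 \times 0.5 / 0.02)^2$, which a two-line check confirms (our sketch, not part of the solution):

```python
import math

# Solve 1.96 * sqrt(P(1-P)/n) = 0.02 for n, with the precautionary P = 0.5
z, e, P = 1.96, 0.02, 0.5
n = (z * math.sqrt(P * (1 - P)) / e) ** 2
print(round(n))  # 2401 people
```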
Exercise 2.5 Number of clerics
We want to estimate the number of clerics in the French population. For that,
we choose to select n individuals using a simple random sample. If the true
proportion (unknown) of clerics in the population is 0.1 %, how many people
must be selected to obtain a coefficient of variation CV of 5 %?
Solution
By definition:
$$
CV = \frac{\sigma(Np)}{NP} = \frac{\sigma(p)}{P},
$$
where P is the true proportion to estimate (0.1 % here) and p its unbiased estimator, which is the proportion of clerics in the selected sample. A CV of 5 % corresponds to a reasonably 'average' accuracy. In fact,
$$
\operatorname{var}(p) \approx \frac{P(1-P)}{n}
$$
(f a priori negligible compared to 1). Therefore,
$$
CV = \sqrt{\frac{1-P}{nP}} \approx \frac{1}{\sqrt{nP}} = 0.05,
$$
which gives
$$
n = \frac{1}{0.001} \times \frac{1}{0.05^2} = 400\,000.
$$
This large size, impossible in practice to obtain, is a direct result of the scarcity of the sub-population studied.
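A short check of this computation (our sketch; `n_exact` keeps the (1 − P) factor that the approximation drops):

```python
# Sample size for a target coefficient of variation on a rare proportion
P, CV = 0.001, 0.05
n_approx = 1 / (P * CV ** 2)       # using CV ≈ 1/sqrt(nP)
n_exact = (1 - P) / (P * CV ** 2)  # keeping the (1 - P) factor
print(round(n_approx), round(n_exact))  # 400000 399600
```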
Exercise 2.6 Size for proportions
In a population of 4 000 people, we are interested in two proportions:
P1 = proportion of individuals owning a dishwasher,
P2 = proportion of individuals owning a laptop computer.
According to ‘reliable’ information, we know a priori that:
45 % ≤ P1 ≤ 65 %,
5 % ≤ P2 ≤ 10 %.
and
What does the sample size n have to be, within the framework of a simple random sample, if we want to know at the same time P1 to within ±2 % and P2 to within ±1 %, with a confidence level of 95 %?
Solution
We estimate $P_i$ $(i = 1, 2)$ without bias by the proportion $p_i$ calculated in the sample:
$$
\operatorname{var}(p_i) = \left(1 - \frac{n}{N}\right)\frac{1}{n}\,\frac{N}{N-1}\,P_i(1-P_i).
$$
We want
$$
1.96 \times \sqrt{\operatorname{var}(p_1)} \leq 0.02, \quad \text{and} \quad 1.96 \times \sqrt{\operatorname{var}(p_2)} \leq 0.01.
$$
In fact,
$$
\max_{45\,\% \leq P_1 \leq 65\,\%} P_1(1-P_1) = 0.5(1-0.5) = 0.25,
\quad
\max_{5\,\% \leq P_2 \leq 10\,\%} P_2(1-P_2) = 0.1(1-0.1) = 0.09.
$$
The maximum value of $P_i(1-P_i)$ is 0.25 (see Figure 2.1) and leads to a maximum n (as a security to reach at least the desired accuracy).
[Fig. 2.1. Variance according to the proportion: Exercise 2.6. The curve P(1−P) plotted against P.]
It is jointly necessary that
$$
\begin{cases}
\left(1 - \dfrac{n}{N}\right)\dfrac{1}{n}\,\dfrac{N}{N-1} \times 0.25 \leq \left(\dfrac{0.02}{1.96}\right)^2 \\[3mm]
\left(1 - \dfrac{n}{N}\right)\dfrac{1}{n}\,\dfrac{N}{N-1} \times 0.09 \leq \left(\dfrac{0.01}{1.96}\right)^2,
\end{cases}
$$
which implies that
$$
n \geq 1500.62 \quad \text{and} \quad n \geq 1854.74.
$$
The condition on the accuracy of $p_2$ being the most demanding, we conclude by choosing n = 1855.
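The two bounds can be recomputed with a small Python function (our sketch; the function name and the closed-form inversion of the constraint are ours):

```python
import math

N = 4000  # population size

def min_n(v_max, half_width, z=1.96):
    """Smallest n with z*sqrt((1/n - 1/N) * N/(N-1) * V) <= half_width,
    where V bounds P(1-P) over the a priori interval for P."""
    c = (half_width / z) ** 2
    return math.ceil(1 / (c * (N - 1) / (N * v_max) + 1 / N))

n1 = min_n(0.25, 0.02)  # constraint on P1 (±2 %)
n2 = min_n(0.09, 0.01)  # constraint on P2 (±1 %)
print(n1, n2, max(n1, n2))  # 1501 1855 1855
```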
Exercise 2.7 Estimation of the population variance
Show that
$$
\sigma_y^2 = \frac{1}{N}\sum_{k \in U}\left(y_k - \bar{Y}\right)^2
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\substack{\ell \in U \\ \ell \neq k}}\left(y_k - y_\ell\right)^2.
\qquad (2.1)
$$
Use this equality to (easily) find an unbiased estimator of the population variance $S_y^2$ in the case of simple random sampling, where $S_y^2 = N\sigma_y^2/(N-1)$.
Solution
A first manner of showing this equality is the following:
$$
\frac{1}{2N^2}\sum_{k \in U}\sum_{\substack{\ell \in U \\ \ell \neq k}}(y_k - y_\ell)^2
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\ell \in U}(y_k - y_\ell)^2
$$
$$
= \frac{1}{2N^2}\left(\sum_{k \in U}\sum_{\ell \in U} y_k^2 + \sum_{k \in U}\sum_{\ell \in U} y_\ell^2 - 2\sum_{k \in U}\sum_{\ell \in U} y_k y_\ell\right)
$$
$$
= \frac{1}{N}\sum_{k \in U} y_k^2 - \frac{1}{N^2}\sum_{k \in U}\sum_{\ell \in U} y_k y_\ell
= \frac{1}{N}\sum_{k \in U} y_k^2 - \bar{Y}^2
= \frac{1}{N}\sum_{k \in U}(y_k - \bar{Y})^2 = \sigma_y^2.
$$
A second manner is:
$$
\frac{1}{2N^2}\sum_{k \in U}\sum_{\substack{\ell \in U \\ \ell \neq k}}(y_k - y_\ell)^2
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\ell \in U}(y_k - \bar{Y} - y_\ell + \bar{Y})^2
$$
$$
= \frac{1}{2N^2}\sum_{k \in U}\sum_{\ell \in U}\left[(y_k - \bar{Y})^2 + (y_\ell - \bar{Y})^2 - 2(y_k - \bar{Y})(y_\ell - \bar{Y})\right]
$$
$$
= \frac{1}{2N}\sum_{k \in U}(y_k - \bar{Y})^2 + \frac{1}{2N}\sum_{\ell \in U}(y_\ell - \bar{Y})^2 + 0 = \sigma_y^2.
$$
The unbiased estimator of $\sigma_y^2$ is
$$
\widehat{\sigma}_y^2 = \frac{1}{2N^2}\sum_{k \in S}\sum_{\substack{\ell \in S \\ \ell \neq k}}\frac{(y_k - y_\ell)^2}{\pi_{k\ell}},
$$
where $\pi_{k\ell}$ is the second-order inclusion probability. With a simple design without replacement of fixed sample size,
$$
\pi_{k\ell} = \frac{n(n-1)}{N(N-1)},
$$
thus
$$
\widehat{\sigma}_y^2 = \frac{N(N-1)}{n(n-1)}\,\frac{1}{2N^2}\sum_{k \in S}\sum_{\substack{\ell \in S \\ \ell \neq k}}(y_k - y_\ell)^2.
$$
By adapting (2.1) with the sample S (in place of U), we get:
$$
\frac{1}{2n^2}\sum_{k \in S}\sum_{\substack{\ell \in S \\ \ell \neq k}}(y_k - y_\ell)^2 = \frac{1}{n}\sum_{k \in S}\left(y_k - \widehat{\bar{Y}}\right)^2,
\quad \text{where} \quad \widehat{\bar{Y}} = \frac{1}{n}\sum_{k \in S} y_k.
$$
Therefore
$$
\widehat{\sigma}_y^2 = \frac{N-1}{N}\,\frac{1}{n-1}\sum_{k \in S}\left(y_k - \widehat{\bar{Y}}\right)^2 = \frac{N-1}{N}\,s_y^2,
$$
and since $S_y^2 = N\sigma_y^2/(N-1)$, we get $\widehat{S}_y^2 = s_y^2$. This result is well-known and takes longer to show if we do not use the equality (2.1).
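Both the identity (2.1) and the conclusion $\widehat{S}_y^2 = s_y^2$ can be verified by exhaustive enumeration on a toy population (our illustration; the data values are arbitrary):

```python
from itertools import combinations
from math import comb

y = [3.0, 7.0, 1.0, 9.0, 5.0]  # arbitrary toy population U
N = len(y)
Ybar = sum(y) / N
sigma2 = sum((v - Ybar) ** 2 for v in y) / N
# Identity (2.1): diagonal terms are zero, so we can sum over all pairs
double = sum((a - b) ** 2 for a in y for b in y) / (2 * N ** 2)

# Mean of s_y^2 over all SRSWOR samples of size n equals S_y^2
n = 3
S2 = N * sigma2 / (N - 1)
mean_s2 = 0.0
for s in combinations(y, n):
    m = sum(s) / n
    mean_s2 += sum((v - m) ** 2 for v in s) / (n - 1)
mean_s2 /= comb(N, n)
print(sigma2, double, S2, mean_s2)
```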
Exercise 2.8 Repeated survey
We consider a population of 10 service-stations and are interested in the price
of a litre of high-grade petrol at each station. The prices during two consecutive months, May and June, appear in Table 2.2.
1. We want to estimate the evolution of the average price per litre between
May and June. We choose as a parameter the difference in average prices.
Method 1: we sample n stations (n < 10) in May and n stations in June,
the two samples being completely independent;
Method 2: we sample n stations in May and we again question these stations in June (panel technique).
Compare the efficiency of the two concurrent methods.
Table 2.2. Price per litre of high-grade petrol: Exercise 2.8

Station   1     2     3     4     5     6     7     8     9    10
May      5.82  5.33  5.76  5.98  6.20  5.89  5.68  5.55  5.69  5.81
June     5.89  5.34  5.92  6.05  6.20  6.00  5.79  5.63  5.78  5.84
2. The same question, if this time we want to estimate an average price over the combined May-June period.
3. If we are interested in the average price of Question 2, would it not be better, instead of selecting 10 records twice with Method 1 (10 per month), to directly select 20 records without worrying about the months (Method 3)? No calculation is necessary.
N.B.: Question 3 is related to stratification.
Solution
1. We denote $p_m$ as the simple average of the recorded prices among the n stations for month m (m = May or June). We have:
$$
\operatorname{var}(p_m) = \frac{1-f}{n}\,S_m^2,
$$
where $S_m^2$ is the variance of the 10 prices relative to month m.
• Method 1. We estimate without bias the evolution of prices by $p_{June} - p_{May}$ (the two estimators are calculated on two different a priori samples) and
$$
\operatorname{var}_1(p_{June} - p_{May}) = \frac{1-f}{n}\left(S_{May}^2 + S_{June}^2\right).
$$
Indeed, the covariance is null because the two samples (and therefore the two estimators $p_{May}$ and $p_{June}$) are independent.
• Method 2. We have only one sample (the panel). Still, we estimate the evolution of prices without bias by $p_{June} - p_{May}$, and
$$
\operatorname{var}_2(p_{June} - p_{May}) = \frac{1-f}{n}\left(S_{May}^2 + S_{June}^2 - 2S_{May,June}\right).
$$
This time, there is a covariance term, with:
$$
\operatorname{cov}(p_{May}, p_{June}) = \frac{1-f}{n}\,S_{May,June},
$$
where $S_{May,June}$ represents the true empirical covariance between the 10 records in May and the 10 records in June. We therefore have:
$$
\frac{\operatorname{var}_1(p_{June} - p_{May})}{\operatorname{var}_2(p_{June} - p_{May})}
= \frac{S_{May}^2 + S_{June}^2}{S_{May}^2 + S_{June}^2 - 2S_{May,June}}.
$$
After calculating, we find:
$$
\left.
\begin{aligned}
S_{May}^2 &= 0.05601 \\
S_{June}^2 &= 0.0564711 \\
S_{May,June} &= 0.0550289
\end{aligned}
\right\}
\;\Rightarrow\;
\frac{\operatorname{var}_1(p_{June} - p_{May})}{\operatorname{var}_2(p_{June} - p_{May})} \approx (6.81)^2.
$$
The use of a panel allows for the division of the standard error by 6.81.
This enormous gain is due to the very strong correlation between the
prices of May and June (ρ ≈ 0.98): a station where high-grade petrol is
expensive in May remains expensive in June compared to other stations
(and vice versa). We easily verify this by calculating the true average
prices in May (5.77) and June (5.84): if we compare the monthly average
prices, only Station 3 changes position between May and June.
2. The average price for the two-month period is estimated without bias, with the two methods, by:
$$
p = \frac{p_{May} + p_{June}}{2}.
$$
• Method 1:
$$
\operatorname{var}_1(p) = \frac{1}{4} \times \frac{1-f}{n}\left[S_{May}^2 + S_{June}^2\right].
$$
• Method 2:
$$
\operatorname{var}_2(p) = \frac{1}{4} \times \frac{1-f}{n}\left[S_{May}^2 + S_{June}^2 + 2S_{May,June}\right].
$$
This time, the covariance is added (due to the '+' sign appearing in p). In conclusion, we have
$$
\frac{\operatorname{var}_1(p)}{\operatorname{var}_2(p)}
= \frac{S_{May}^2 + S_{June}^2}{S_{May}^2 + S_{June}^2 + 2S_{May,June}} = (0.71)^2 = 0.50.
$$
The use of a panel proves to be ineffective: with equal sample sizes, we lose 29 % of accuracy.
As the variances vary as 1/n, if we consider that the total cost of a survey is proportional to the sample size, this result amounts to saying that for a given variance, Method 1 allows a saving of 50 % of the budget in comparison to Method 2: this is obviously strongly significant.
3. Method 1 remains the best. Indeed, Method 3 amounts to selecting a simple random sample of size 2n in a population of size 2N, whereas Method 1 amounts to having two strata each of size N and selecting n individuals in each stratum: the latter thus corresponds to a proportional allocation. In fact, we know that for a fixed total sample size (2n here), to estimate a combined average, stratification with proportional allocation is always preferable to simple random sampling.
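The variance ratios of Questions 1 and 2 can be recomputed from Table 2.2 with a few lines of Python (our sketch, not part of the solution):

```python
from statistics import mean

may  = [5.82, 5.33, 5.76, 5.98, 6.20, 5.89, 5.68, 5.55, 5.69, 5.81]
june = [5.89, 5.34, 5.92, 6.05, 6.20, 6.00, 5.79, 5.63, 5.78, 5.84]
N = len(may)
m1, m2 = mean(may), mean(june)
# Empirical variances and covariance, with denominator N - 1
s2_may  = sum((x - m1) ** 2 for x in may)  / (N - 1)
s2_june = sum((x - m2) ** 2 for x in june) / (N - 1)
s_mj    = sum((x - m1) * (y - m2) for x, y in zip(may, june)) / (N - 1)

# Question 1: estimating the evolution p_June - p_May
ratio_diff = (s2_may + s2_june) / (s2_may + s2_june - 2 * s_mj)
# Question 2: estimating the combined average
ratio_mean = (s2_may + s2_june) / (s2_may + s2_june + 2 * s_mj)
print(round(ratio_diff ** 0.5, 2), round(ratio_mean ** 0.5, 2))  # 6.81 0.71
```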
Exercise 2.9 Candidates in an election
In an election, there are two candidates. The day before the election, an opinion poll (simple random sample) is taken among n voters, with n equal to at
least 100 voters (the voter population is very large compared to the sample
size). The question is to find out the necessary difference in percentage points
between the two candidates so that the poll produces the name of the winner
(known by census the next day) 95 times out of 100. Perform the numeric
application for some values of n.
Hints: Consider that the loser of the election is A and that the percentage of votes he receives on the day of the election is $P_A$; on the day of the poll, we denote $\widehat{P}_A$ as the percentage of votes obtained by this candidate A.
We will convince ourselves of the fact that the problem above, posed in 'common terms', can be clearly expressed using a statistical point of view: find the critical region so that the probability of declaring A as the winner on the day of the poll (while $P_A$ is in reality less than 50 %) is less than 5 %.
Solution
In adopting the terminology of test theory, we want a 'critical region' of the form $]c, +\infty[$, the problem being to find c, with:
$$
\Pr\left[\widehat{P}_A > c \mid P_A < 50\,\%\right] \leq 5\,\%
$$
(the event $P_A < 50\,\%$ is by definition certain; it is presented for reference). Indeed, the rule that will decide on the date of the poll who would win the following day can only be of the type '$\widehat{P}_A$ greater than a certain level'. We make the hypothesis that $\widehat{P}_A \sim \mathcal{N}(P_A, \sigma_A^2)$, with:
$$
\sigma_A^2 = \frac{P_A(1-P_A)}{n}.
$$
This approximation is justified because n is 'sufficiently large' (n ≥ 100). We try to find c such that:
$$
\Pr\left[\frac{\widehat{P}_A - P_A}{\sigma_A} > \frac{c - P_A}{\sigma_A}\,\Big|\; P_A < 50\,\%\right] \leq 5\,\%.
$$
However, $P_A$ remains unknown. In reality, it is the maximum of these probabilities that must be considered among all possible $P_A$, meaning all $P_A < 0.5$. Therefore, we try to find c such that:
$$
\max_{\{P_A\}} \Pr\left[\mathcal{N}(0,1) > \frac{c - P_A}{\sigma_A}\,\Big|\; P_A < 0.5\right] \leq 0.05.
$$
Now, the quantity
$$
\frac{c - P_A}{\sqrt{\dfrac{P_A(1-P_A)}{n}}}
$$
is clearly a decreasing function of $P_A$ (for $P_A < 0.5$). We see that the maximum of the probability is attained for the minimum $(c - P_A)/\sigma_A$, or in other words the maximum $P_A$ (subject to $P_A < 0.5$). Therefore, we have $P_A = 50\,\%$. We try to find c satisfying:
$$
\Pr\left[\mathcal{N}(0,1) > \frac{c - 0.5}{\sqrt{0.25/n}}\right] \leq 0.05.
$$
Consulting a quantile table of the normal distribution shows that it is necessary for:
$$
\frac{c - 0.5}{\sqrt{0.25/n}} = 1.65.
$$
Conclusion: The critical region is
$$
\widehat{P}_A > \frac{1}{2} + 1.65\sqrt{\frac{0.25}{n}},
$$
that is
$$
\widehat{P}_A > \frac{1}{2} + \frac{1.65}{2\sqrt{n}}.
$$
The difference in percentage points therefore must be at least the following:
$$
\widehat{P}_A - \widehat{P}_B = 2\widehat{P}_A - 1 \geq \frac{1.65}{\sqrt{n}}.
$$
If the difference in percentage points is at least equal to $1.65/\sqrt{n}$, then we
have less than a 5 % chance of declaring A the winner on the day of the
opinion poll while in reality he will lose on the day of the elections, that is, we
have at least a 95 % chance of making the right prediction. Table 2.3 contains
several numeric applications. The case n = 900 corresponds to the opinion
poll sample size traditionally used for elections.
Table 2.3. Numeric applications: Exercise 2.9

n                 100   400   900  2000  5000  10000
1.65/√n (in %)   16.5   8.3   5.5   3.7   2.3    1.7
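Table 2.3 can be reproduced with a one-line computation (our sketch; the margins are expressed in percentage points):

```python
# Minimum lead 2*P_A - 1 >= 1.65/sqrt(n), in percentage points
margins = {n: 100 * 1.65 / n ** 0.5 for n in (100, 400, 900, 2000, 5000, 10000)}
for n, m in margins.items():
    print(n, round(m, 1))
```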
Exercise 2.10 Select-reject method
Select a sample of size 4 in a population of size 10 using a simple random
design without replacement with the select-reject method. This method is
due to Fan et al. (1962) and is described in detail in Tillé (2001, p. 74). The
procedure consists of sequentially reading the frame. At each stage, we decide
whether or not to select a unit of observation with the following probability:
$$
\frac{\text{number of units remaining to select in the sample}}{\text{number of units remaining to examine in the population}}.
$$
Use the following observations of a uniform random variable over [0, 1]:
0.375489 0.624004 0.517951 0.045450 0.632912
0.246090 0.927398 0.325950 0.645951 0.178048
Solution
Noting k as the observation number and j as the number of units already
selected at the start of stage k, the algorithm is described in Table 2.4. The
sample is composed of units {1, 4, 6, 8}.
Table 2.4. Select-reject method: Exercise 2.10

 k      u_k     j   (n−j)/(N−(k−1))   I_k
 1   0.375489   0   4/10 = 0.4000      1
 2   0.624004   1   3/9  = 0.3333      0
 3   0.517951   1   3/8  = 0.3750      0
 4   0.045450   1   3/7  = 0.4286      1
 5   0.632912   2   2/6  = 0.3333      0
 6   0.246090   2   2/5  = 0.4000      1
 7   0.927398   3   1/4  = 0.2500      0
 8   0.325950   3   1/3  = 0.3333      1
 9   0.645951   4   0/2  = 0.0000      0
10   0.178048   4   0/1  = 0.0000      0
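The select-reject procedure can be written as a short Python function (our sketch; the function name is ours). Applied to the ten uniform draws above, it reproduces the sample of Table 2.4:

```python
def select_reject(N, n, uniforms):
    """Select-reject method (Fan et al., 1962): scan the frame once and
    select unit k with prob. (units left to select)/(units left to scan)."""
    sample, j = [], 0
    for k in range(1, N + 1):
        if uniforms[k - 1] < (n - j) / (N - (k - 1)):
            sample.append(k)
            j += 1
    return sample

u = [0.375489, 0.624004, 0.517951, 0.045450, 0.632912,
     0.246090, 0.927398, 0.325950, 0.645951, 0.178048]
print(select_reject(10, 4, u))  # [1, 4, 6, 8]
```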
Exercise 2.11 Sample update method
In selecting a sample according to a simple design without replacement, there
exist several algorithms. One method, proposed by McLeod and Bellhouse (1983), works in the following manner:
• We select the first n units of the list.
• We then examine the case of record (n + 1). We select unit n + 1 with a probability n/(n + 1). If unit n + 1 is selected, we remove one unit from the sample that we selected at random and with equal probabilities.
• For the units k, where n + 1 < k ≤ N, we maintain this rule. Unit k is selected with probability n/k. If unit k is selected, we remove one unit from the sample that we selected at random and with equal probabilities.
1. We denote $\pi_\ell^{(k)}$ as the probability that individual ℓ is in the sample at stage k, where (ℓ ≤ k), meaning after we have examined the case of record k (k ≥ n). Show that $\pi_\ell^{(k)} = n/k$. (It can be interesting to proceed in a recursive manner.)
2. Verify that the final probability of inclusion is indeed that which we obtain
for a design with equal probabilities of fixed size.
3. What is interesting about this method?
Solution
1. • If k = n, then $\pi_\ell^{(k)} = 1 = n/n$, for all ℓ ≤ n.
• If k = n + 1, then we have directly $\pi_{n+1}^{(n+1)} = n/(n+1)$. Furthermore, for ℓ < k,
$$
\begin{aligned}
\pi_\ell^{(n+1)} &= \Pr\left[\text{unit } \ell \text{ being in the sample at stage } (n+1)\right] \\
&= \Pr\left[\text{unit } (n+1) \text{ not being selected at stage } (n+1)\right] \\
&\quad + \Pr\left[\text{unit } (n+1) \text{ being selected at stage } (n+1)\right]
\times \Pr\left[\text{unit } \ell \text{ not being removed at stage } (n+1)\right] \\
&= \left(1 - \frac{n}{n+1}\right) + \frac{n}{n+1} \times \frac{n-1}{n} = \frac{n}{n+1}.
\end{aligned}
$$
• If k > n + 1, we use a recursive proof. We suppose that, for all ℓ ≤ k − 1,
$$
\pi_\ell^{(k-1)} = \frac{n}{k-1}, \qquad (2.2)
$$
and we are going to show that if (2.2) is true then, for all ℓ ≤ k,
$$
\pi_\ell^{(k)} = \frac{n}{k}. \qquad (2.3)
$$
The initial conditions are confirmed since we have proven (2.3) for k = n and k = n + 1. If ℓ = k, then the algorithm directly gives
$$
\pi_k^{(k)} = \frac{n}{k}.
$$
• If ℓ < k, then we calculate, using Bayes' theorem,
$$
\begin{aligned}
\pi_\ell^{(k)} &= \Pr\left[\text{unit } \ell \text{ being in the sample at stage } k\right] \\
&= \Pr\left[\text{unit } k \text{ not being selected at stage } k\right]
\times \Pr\left[\text{unit } \ell \text{ being in the sample at stage } k-1\right] \\
&\quad + \Pr\left[\text{unit } k \text{ being selected at stage } k\right]
\times \Pr\left[\text{unit } \ell \text{ being in the sample at stage } k-1\right] \\
&\qquad \times \Pr\left[\text{unit } \ell \text{ not being removed at stage } k\right] \\
&= \left(1 - \frac{n}{k}\right) \times \pi_\ell^{(k-1)} + \frac{n}{k} \times \pi_\ell^{(k-1)} \times \frac{n-1}{n}
= \pi_\ell^{(k-1)}\,\frac{k-1}{k} = \frac{n}{k}.
\end{aligned}
$$
2. At the end of the algorithm, k = N and therefore $\pi_\ell^{(N)} = n/N$, for all ℓ ∈ U.
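The update rule just analysed is easy to implement and to check empirically. A minimal Python sketch (our own; the function name and the Monte Carlo check are illustrative, not from the book):

```python
import random

def mcleod_bellhouse(N, n, rng):
    """Sample update method: keep the first n units, then, for each k > n,
    select unit k with probability n/k and, if selected, replace a sample
    member chosen at random with equal probabilities."""
    sample = list(range(1, n + 1))
    for k in range(n + 1, N + 1):
        if rng.random() < n / k:
            sample[rng.randrange(n)] = k
    return sample

# Empirical check that every unit ends with inclusion probability n/N
rng = random.Random(42)
N, n, trials = 10, 4, 20000
counts = [0] * (N + 1)
for _ in range(trials):
    for unit in mcleod_bellhouse(N, n, rng):
        counts[unit] += 1
for unit in range(1, N + 1):
    assert abs(counts[unit] / trials - n / N) < 0.02
print([round(counts[u] / trials, 3) for u in range(1, N + 1)])
```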