Unit 10 - Sampling
When the population is very large or inaccessible, we extract a sample, that is, a subset of the
whole population. The number of elements in the sample, n, is called the sample size. It is very
important for our study that the sample is representative of the population.
Statistics
Statistics is the science that collects data, describes them in a simple and useful way, analyzes
them and interprets them with the help of probability theory.
When analyzing data, we can use one of two statistical methodologies: descriptive
statistics or inferential statistics.
- Descriptive statistics collects data, organises them in tables or graphs and calculates
the parameters
- Inferential statistics draws conclusions from the data by using probability theory
– Deductive statistics tries to check if the information given by the sample matches the
In this unit we are going to study point estimation. To do so, we have to distinguish between
sample statistics and population parameters:
– Sample statistics, or simply statistics, are the measures of location and dispersion of the
sample, and they depend on the sample. We use the sample mean, $\bar{x}$, and the
standard deviation of the sample, $s$
Sampling
To choose samples, we use sampling and its different types:
In random or probability sampling, each individual in the population has the same probability
of being chosen for the sample.
–Simple random sampling (the base for the other types of sampling) consists in listing the
elements of the population and choosing the n elements of the sample randomly
–Systematic random sampling consists in choosing the n elements of the sample at regular
intervals, taking every k-th element of the list, where:
$$k = \frac{N}{n} = \frac{\text{number of elements of the population}}{\text{size of the sample}}$$
–Stratified random sampling, also sometimes called proportional or quota random sampling,
involves dividing your population into homogeneous subgroups and then taking a simple
random sample in each subgroup.
–Cluster random sampling. The problem with the previous random sampling methods, when the
population is scattered across a wide geographic region, is that we have to cover a lot of
ground to reach each of the sampled units. In cluster sampling, we follow these steps:
• divide population into clusters (usually along geographic boundaries)
• randomly sample clusters
• measure all units within sampled clusters
– With replacement, the number of possible samples of size n from a population of N elements is $N^n$
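The simple and systematic schemes described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the population of 100 labelled units and the seed are assumptions made for the example, and the systematic version assumes that k = N // n and that the starting point is chosen at random among the first k elements.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Choose n distinct elements; every element is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=None):
    """Choose every k-th element (k = N // n) from a random start in [0, k)."""
    k = len(population) // n
    if k < 1:
        raise ValueError("the sample cannot be larger than the population")
    rng = random.Random(seed)
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]


# Hypothetical population: N = 100 units labelled 1..100, sample size n = 10.
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
print(srs)
print(sys_sample)
```

In the systematic sample, consecutive elements are always k = 10 positions apart, so only the starting point is random.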
If an error is made in the sampling process, the results of the sample won't match the
population. This error can be:
–Random sampling error: to reduce this error, you must increase the size of the sample
–Systematic error: it is associated with the process of selecting the sample, and it is reduced by
improving the selection
Exercise: A library has 5 sections. The number of books in each section is shown in
the table:
Section   1     2     3      4     5
Books     500   860   1200   700   740
In order to try to estimate the number of books published in Spanish, we want to extract a
sample of 5% of the total books by doing a stratified random sampling with the sections as
strata. How many books in each section do we have to select?
Solution: 25, 43, 60, 35 and 37 books
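The solution can be checked with a short Python sketch of proportional allocation: each stratum contributes the same percentage of its elements. The function name and the dictionary layout are assumptions made for this example. Rounding each stratum separately works here because every result is a whole number; in general the rounded values may not add up exactly to the intended sample size.

```python
def proportional_allocation(strata_sizes, percent):
    """Return how many elements to take from each stratum for a given percentage."""
    return {name: round(size * percent / 100) for name, size in strata_sizes.items()}


# Books per section, taken from the table in the exercise.
books = {1: 500, 2: 860, 3: 1200, 4: 700, 5: 740}
allocation = proportional_allocation(books, 5)
print(allocation)                 # {1: 25, 2: 43, 3: 60, 4: 35, 5: 37}
print(sum(allocation.values()))   # 200, which is 5% of the 4000 books
```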
Normal distribution
As you know, every normal distribution with mean μ and standard deviation σ can be transformed
into a normal or Gaussian distribution with mean 0 and standard deviation 1.
This is the standard normal distribution or unit normal distribution N(0,1). The random
variable associated to this distribution, Z, is called standard normal deviate.
We use a table with the values of P(Z ≤ a), a > 0, which is the area of the shaded zone.
$$P(Z > a) = 1 - P(Z \le a)$$
We use this case to calculate other possibilities:
$$P(Z \ge -a) = P(Z \le a)$$
$$P(Z \le -a) = P(Z \ge a) = 1 - P(Z \le a)$$
$$P(a \le Z \le b) = P(Z \le b) - P(Z \le a)$$
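To use the table, we have to convert the variable X, which follows a normal distribution with
mean μ and standard deviation σ, into a standard normal variable.
To do so, we make the change:
$$Z = \frac{X - \mu}{\sigma}$$
Then the probability is calculated as:
$$P(X \le a) = P\left(\frac{X - \mu}{\sigma} \le \frac{a - \mu}{\sigma}\right) = P\left(Z \le \frac{a - \mu}{\sigma}\right)$$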
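As a rough check of this change of variable, the sketch below uses `statistics.NormalDist` from the Python standard library in place of the printed table. The values μ = 20, σ = 2 and a = 23 are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # N(0, 1)


def prob_less_equal(a, mu, sigma):
    """P(X <= a) for X ~ N(mu, sigma), computed through Z = (X - mu) / sigma."""
    z = (a - mu) / sigma
    return standard_normal.cdf(z)


# Hypothetical values for illustration only.
mu, sigma, a = 20, 2, 23
p_standardized = prob_less_equal(a, mu, sigma)   # uses Z = 1.5
p_direct = NormalDist(mu, sigma).cdf(a)          # same probability without standardizing

# Table identities for a > 0.
p_greater = 1 - standard_normal.cdf(1.5)         # P(Z > 1.5)
p_negative = standard_normal.cdf(-1.5)           # P(Z <= -1.5)
print(round(p_standardized, 4), round(p_greater, 4))
```

Both routes give the same probability, which is what makes a single table for N(0,1) sufficient.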
The binomial distribution, B(n,p), has this probability function:
$$P(X = r) = \binom{n}{r} p^r q^{n-r} \qquad (q = 1 - p)$$
With mean and standard deviation:
$$\mu = np, \qquad \sigma = \sqrt{npq}$$
If X = B(n,p) is a binomial variable, then the variable
$$Z = \frac{X - np}{\sqrt{npq}}$$
approximates to N(0,1)
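To see how close the approximation is, the following sketch compares the exact binomial probability P(X ≤ r) with the value obtained from Z = (X − np)/√(npq). The values n = 100, p = 0.5 and r = 55 are hypothetical, and the comparison follows the formula above as written, without any further correction.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(r, n, p):
    """Exact P(X <= r) for X ~ B(n, p), summing the probability function."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(r + 1))


def normal_approx_cdf(r, n, p):
    """Approximate P(X <= r) using Z = (X - np) / sqrt(npq) ~ N(0, 1)."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist().cdf((r - mu) / sigma)


# Hypothetical values for illustration only.
n, p, r = 100, 0.5, 55
exact = binomial_cdf(r, n, p)
approx = normal_approx_cdf(r, n, p)
print(round(exact, 4), round(approx, 4))
```

The two values differ by a few hundredths for these inputs, and the gap tends to shrink as n grows.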
Exercise: If X = N(20,2), calculate:
Sampling distributions
The characteristics of a population are studied through samples. The statistics of a sample
let us decide on the best approximation of the population parameter we need.
To do this, we need to know the relationship between the statistics of the sample and the
parameters of the population. Since the parameters are inferred from the statistics, we need to
know the sampling distribution of these statistics.
We can do it by using:
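Central limit theorem: if we take simple random samples of size n from a population with mean μ
and standard deviation σ (n large enough, n ≥ 30), the sampling distribution of the mean
$\bar{X}_n$ approximates to a normal distribution:
$$\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right), \qquad \mu_{\bar{X}_n} = \mu, \qquad \sigma_{\bar{X}_n} = \frac{\sigma}{\sqrt{n}}$$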
Generally the standard deviation of the population is unknown. Then, we approximate this
parameter by using the standard deviation of the sample, if n is large enough (n ≥ 100).
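The two parameters of the sampling distribution of the mean can be checked exactly on a tiny population by listing every possible sample with replacement. The sketch below is only an illustration: the population of six values and the sample size n = 3 are assumptions, and it verifies the mean μ and the standard deviation σ/√n, not the normal shape, which needs a larger n.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

# Hypothetical population and sample size, chosen small enough to enumerate.
population = [1, 2, 3, 4, 5, 6]
n = 3

mu = fmean(population)      # population mean
sigma = pstdev(population)  # population standard deviation

# All N**n samples of size n drawn with replacement, and the mean of each one.
sample_means = [fmean(sample) for sample in product(population, repeat=n)]

mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)

print(len(sample_means))                # number of possible samples, N**n
print(mean_of_means, mu)                # both equal the population mean
print(sd_of_means, sigma / sqrt(n))     # both equal sigma / sqrt(n)
```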
Example: The heights of 1200 students of a secondary school follow a normal
distribution with mean 1.72 m and SD 0.09 m. If we take 100 samples of 36 students each,
calculate:
a) What are the mean and the standard deviation of the sampling distribution of the mean?
b) How many samples are expected to have a mean between 1.68 and 1.73 m?
c) How many samples are expected to have a mean lower than 1.69 m?
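a) $$\mu_{\bar{X}_n} = \mu = 1.72, \qquad \sigma_{\bar{X}_n} = \frac{\sigma}{\sqrt{n}} = \frac{0.09}{\sqrt{36}} = 0.015$$
b) $$P(1.68 \le \bar{X} \le 1.73) = P\left(\frac{1.68 - 1.72}{0.015} \le Z \le \frac{1.73 - 1.72}{0.015}\right) = P(-2.67 \le Z \le 0.67) = P(Z \le 0.67) - (1 - P(Z \le 2.67)) = 0.7448$$
So about 0.7448 · 100 ≈ 74 of the 100 samples are expected to have a mean in this range.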
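The heights example can be reproduced with `statistics.NormalDist`, as in the sketch below. The variable names are assumptions made for illustration. The probability for b) comes out slightly different from 0.7448 because the code does not round z to two decimals as the table method does; the expected number of samples is the same. Part c) is computed the same way, by taking the probability to the left of 1.69.

```python
from math import sqrt
from statistics import NormalDist

# Data from the example: population N(1.72, 0.09), 100 samples of size 36.
mu, sigma, n, num_samples = 1.72, 0.09, 36, 100

# a) Sampling distribution of the mean: N(mu, sigma / sqrt(n)).
sampling_dist = NormalDist(mu, sigma / sqrt(n))

# b) P(1.68 <= mean <= 1.73) and the expected number of samples.
p_b = sampling_dist.cdf(1.73) - sampling_dist.cdf(1.68)
expected_b = num_samples * p_b

# c) P(mean < 1.69) and the expected number of samples.
p_c = sampling_dist.cdf(1.69)
expected_c = num_samples * p_c

print(round(p_b, 4), round(expected_b))
print(round(p_c, 4), round(expected_c))
```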
Exercise: In the population, IQ scores are normally distributed with a mean of 100 and
standard deviation of 15. If we repeatedly pulled random samples of 25 individuals from the
population and measured their IQ, how many samples are expected to have a mean between
95 and 105?
Consider a characteristic that each individual in the population either has or does not have. p
is the proportion of individuals with this characteristic in the population (the proportion of
success). The proportion of failure is q = 1 – p
Consider all the samples of size n in the population. Each sample has its own proportion,
$\hat{p}$, of individuals with this characteristic.
The distribution associated to the random variable that matches each sample with its
proportion is called sampling distribution of a proportion.
Since, for large n, the binomial distribution approximates a normal one, the sampling
distribution of a proportion follows a normal distribution, too:
$$N\left(p, \sqrt{\frac{pq}{n}}\right)$$
Because the proportions of the population are generally unknown, we approximate them by
the sample ones.
Example: A machine makes precision parts. Generally 3% of the parts it makes are defective. A
customer receives a box of 500 parts.
a) What is the probability that he will find more than 5% of the parts in the box are defective?
b) What is the probability that he will find fewer than 1% of defective parts?
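$$\mu_{\hat{p}} = p = 0.03, \qquad \sigma_{\hat{p}} = \sqrt{\frac{0.03 \cdot 0.97}{500}} = 0.0076 \quad\Rightarrow\quad \hat{p} \sim N(0.03, 0.0076)$$
a) $$P(\hat{p} > 0.05) = 1 - P(\hat{p} \le 0.05) = 1 - P\left(\frac{\hat{p} - 0.03}{0.0076} \le \frac{0.05 - 0.03}{0.0076}\right) = 1 - P(Z \le 2.63) = 0.0043$$
b) $$P(\hat{p} < 0.01) = P\left(\frac{\hat{p} - 0.03}{0.0076} < \frac{0.01 - 0.03}{0.0076}\right) = P(Z < -2.63) = 1 - P(Z \le 2.63) = 0.0043$$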
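The defective-parts example can be checked with the following sketch, which builds the sampling distribution of the proportion as N(p, √(pq/n)). The variable names are assumptions made for illustration. Because the code keeps all the decimals of the standard deviation instead of rounding it to 0.0076, the result is about 0.0044 rather than the 0.0043 read from the table.

```python
from math import sqrt
from statistics import NormalDist

# Data from the example: 3% defective parts, box of 500 parts.
p, n = 0.03, 500
q = 1 - p

sigma_p = sqrt(p * q / n)               # standard deviation of the sample proportion
proportion_dist = NormalDist(p, sigma_p)

p_more_than_5 = 1 - proportion_dist.cdf(0.05)   # a) more than 5% defective
p_fewer_than_1 = proportion_dist.cdf(0.01)      # b) fewer than 1% defective

print(round(sigma_p, 4))
print(round(p_more_than_5, 4), round(p_fewer_than_1, 4))
```

Both probabilities coincide because 0.05 and 0.01 are at the same distance from the mean 0.03.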
Exercise: Suppose the proportion of all college students who have smoked in the past 6
months is p=0.40. For a class of n=200 that is representative of the population of all students,
what is the probability that the proportion of students who have smoked in the past 6 months
is less than 0.32?
Solution: 0.0104
Example: An electrical appliance made by Company A has an average life of 2500 hours, with a
SD of 500 hours. Another company, B, makes this appliance with an average life of 2300 hours
and a SD of 800 hours. We take 300 Company A devices and 200 Company B devices. Calculate
the probability that the average life of the sample from Company A is not more than 100 hours
longer than the average life of the sample from Company B.
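$$\mu_{\bar{X}_A - \bar{X}_B} = \mu_A - \mu_B = 200; \qquad \sigma_{\bar{X}_A - \bar{X}_B} = \sqrt{\frac{500^2}{300} + \frac{800^2}{200}} = 63.5$$
$$P(\bar{X}_A - \bar{X}_B \le 100) = P\left(Z \le \frac{100 - 200}{63.5}\right) = P(Z \le -1.57) = 1 - P(Z \le 1.57) = 0.0582$$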
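The same calculation can be sketched in Python, treating the difference of the two sample means as a normal variable with mean μ_A − μ_B and standard deviation √(σ_A²/n_A + σ_B²/n_B). The variable names are assumptions made for illustration. Since z is not rounded to −1.57, the result is about 0.058, very close to the table value 0.0582.

```python
from math import sqrt
from statistics import NormalDist

# Data from the example.
mu_a, sd_a, n_a = 2500, 500, 300   # Company A
mu_b, sd_b, n_b = 2300, 800, 200   # Company B

mean_diff = mu_a - mu_b
sd_diff = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# P(mean of A - mean of B <= 100)
prob = NormalDist(mean_diff, sd_diff).cdf(100)

print(mean_diff, round(sd_diff, 1))
print(round(prob, 4))
```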
Exercise: The mean height of 15-year-old boys (in cm) is 175 and the variance is 64. For girls,
the mean is 165 and the variance is 64. If eight boys and eight girls were sampled, what is the
probability that the mean height of the sample of girls would be higher than the mean height
of the sample of boys?
Solution: 0.0062
Within inferential statistics we find inductive statistics, which estimates the population
parameters from the sample ones. This can be done by using intervals or points.
Point estimation consists in estimating the unknown population parameter by a single value.
This estimation is more precise but less reliable than the interval one.
– Biased point estimate: if the mean of the sampling distribution of a statistic doesn't
equal the corresponding population parameter
We must use an unbiased point estimate and, among those, the most efficient one, that is, the
one whose sampling distribution has the least dispersion.
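A small illustration of bias, not taken from the unit itself: the sketch below lists every sample of size 2 drawn with replacement from a hypothetical four-element population and averages two estimates of the population variance, one dividing by n and one dividing by n − 1. The choice of the variance as the parameter, the population values and the variable names are all assumptions made for this example.

```python
from itertools import product
from statistics import fmean, pvariance, variance

# Hypothetical population, small enough to list every sample.
population = [2, 4, 6, 8]
n = 2
samples = list(product(population, repeat=n))   # all samples with replacement

true_variance = pvariance(population)           # population parameter

# Mean of the sampling distribution of each estimator.
mean_dividing_by_n = fmean([pvariance(s) for s in samples])
mean_dividing_by_n_minus_1 = fmean([variance(s) for s in samples])

print(true_variance)
print(mean_dividing_by_n)           # below the true variance: biased
print(mean_dividing_by_n_minus_1)   # matches the true variance: unbiased
```

Dividing by n gives an average below the population variance, so that statistic is a biased estimate; dividing by n − 1 removes the bias in this setting.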
Exam
1.- A company has a total of 360 employees in four different categories:
Managers: 36
Drivers: 54
Administrative Staff: 90
Production Staff: 180
How many from each category should be included in a stratified random sample of size 20 ?
a) 2, 4, 4 and 10
b) 2, 3, 6 and 9
c) 2, 3, 5 and 10
d) None of them
2.- A farmer owns 120 Jersey cows and 180 Friesians. How many cows of each breed should he
include in a stratified random sample of 50 for a survey of milk quality?
a) 20 and 30
b) 22 and 28
c) 15 and 35
d) None of them
3.- If a random variable has the normal distribution with μ = 82.0 and σ = 4.8, find the
probability that it will take on a value between 83.2 and 88.0
a) 0.2957
b) 0.3944
c) 0.2734
d) None of them
4.- Scientists study a large fish population. The mean fish length is 50 cm and the SD is 26 cm. If
a sample of 169 fish is randomly selected, what is the probability that the sample's mean is
between 46 and 48 cm?
a) 0.2734
b) 0.1359
c) 0.2552
d) None of them
5.- The numerical population of grade point averages at a college has mean 2.61 and standard
deviation 0.05. If a random sample of size 100 is taken from the population, what is the
probability that the sample mean will be between 2.51 and 2.71?
a) 0.9772
b) 0.9544
c) 0.9192
d) None of them
6.- According to data from the American Cancer Society, about 3.86% of women develop
breast cancer between the ages of 40-59.
What is the probability that, in a random sample of 500 39-year-old women without breast
cancer, more than 20 will develop breast cancer by the age of 60?
a) 0.5636
b) 0.4602
c) 0.4364
d) None of them
7.- In a typical class, about 70% of students receive a C or better. Out of a random sample of
100 students, what is the probability that less than 60 receive a C or better?
a) 0.0179
b) 0.0139
c) 0.015
d) None of them
8.- For boys, the average number of absences in the first grade is 15 with a standard deviation
of 7; for girls, the average number of absences is 10 with a standard deviation of 6.
In a nationwide survey, suppose 100 boys and 50 girls are sampled. What is the probability
that the male sample will have at most three more days of absences than the female sample?
a) 0.035
b) 0.0359
c) 0.0287
d) None of them
9.- In a study to compare the average weights of children in Sixth Grade at an elementary
school, a random sample of 20 boys and one of 25 girls will be used. It is known that, both for
boys and for girls, weights follow a normal distribution. The average of the weights of all the
boys in Sixth Grade in that school is 100 pounds and its standard deviation is 14.142, while the
average of the weights of all the girls in sixth grade in that school is 85 lbs and its standard
deviation is 12.247 pounds. Find the probability that the average of the weights of 20 boys is at
least 20 pounds bigger than the one of 25 girls.
a) 0.1151
b) 0.1056
c) 0.0968
d) None of them
Vocabulary
- Sampling: muestreo
- Population: población
- Sample: muestra
- Interval estimation: estimación por intervalos
- Systematic random sampling: muestreo aleatorio sistemático
- Stratified random sampling: muestreo aleatorio estratificado
- Standard normal deviate: variable normal estándar o tipificada
The end