B.lect1

Outline
Why Statistics?
Populations, Samples, and Census
Some Sampling Concepts

Lecture 1
Chapter 1: Basic Statistical Concepts

M. George Akritas

Some Sampling Concepts:
Representative Samples
Simple Random and Stratified Sampling
Sampling With and Without Replacement
Non-representative Sampling


Why Statistics?

Example (Examples of Engineering/Scientific Studies)
Comparing the compressive strength of two or more cement
mixtures.
Comparing the effectiveness of three cleaning products in
removing four different types of stains.
Predicting failure time on the basis of stress applied.
Assessing the effectiveness of a new traffic regulatory measure
in reducing the weekly rate of accidents.
Testing a manufacturer’s claim regarding a product’s quality.
Studying the relation between salary increases and employee
productivity in a large corporation.

What makes these studies challenging (and thus in need of
Statistics) is their inherent, or intrinsic, variability:



The compressive strength of different preparations of the same
cement mixture will differ. The figure at
http://sites.stat.psu.edu/~mga/401/fig/HistComprStrCement.pdf
shows 32 compressive strength measurements, in MPa
(megapascals), of test cylinders 6 in. in diameter by 12
in. high, made with a water/cement ratio of 0.4 and measured on the
28th day after they were made.
Under the same stress, two beams will fail at different times.
The proportion of defective items of a certain product will
differ from batch to batch.

Intrinsic variability renders the objectives of the case studies, as
stated, ambiguous.



The objectives of the case studies can be made precise if stated in
terms of averages or means.

Comparing the average hardness of two different cement
mixtures.
Predicting the average failure time on the basis of stress
applied.
Estimation of the average coefficient of thermal expansion.
Estimation of the average proportion of defective items.

Moreover, because of variability, the words "average" and "mean"
have a technical meaning which can be made clear through the
concepts of population and sample.



Definition
A population is a well-defined collection of objects or subjects, of
relevance to a particular study, which are exposed to the same
treatment or method. Population members are called units.

Example (Examples of populations:)

All water samples that can be taken from a lake.
All items of a certain manufactured product.
All students enrolled in Big Ten universities during the
2007-08 academic year.
Two types of cleaning products. (Each type corresponds to a
population.)



The objective of a study is to investigate certain characteristic(s)
of the units of the population(s) of interest.

Example (Examples of characteristics:)

All water samples taken from a lake. Characteristics: Mercury
concentration; Concentration of other pollutants.
All items of a certain manufactured product (that have been, or will
be, produced). Characteristic: Proportion of defective items.
All students enrolled in Big Ten universities during the
2007-08 academic year. Characteristics: Favorite type of
music; Political affiliation.
Two types of cleaning products. Characteristic: Cleaning
effectiveness.



In the example where different beams (of the same type)
are exposed to different stress levels:
the characteristic of interest is time to failure of a beam under
each stress level, and
each stress level used in the study corresponds to a separate
population which consists of all beams that will be exposed to
that stress level.
This emphasizes that populations are defined not only by the
units they consist of, but also by the method or treatment
applied to these units.



Full (i.e. population-level) understanding of a characteristic
requires the examination of all population units, i.e. a census.

For example, full understanding of the relation between salary
and productivity of a corporation’s employees requires
obtaining these two characteristics from all employees.
However, taking a census can be time consuming and expensive: the
2000 U.S. Census cost $6.5 billion, while the 2010 Census
cost $13 billion.
Moreover, a census is not feasible if the population is
hypothetical or conceptual, i.e. not all members are
available for examination.
Because of the above, we typically settle for examining all
units in a sample, which is a subset of the population.



Due to intrinsic variability, the sample properties/attributes of
the characteristic of interest will differ from those of the
population. For example:

The average mercury concentration in 25 water samples will
differ from the overall mercury concentration in the lake.
The proportion in a sample of 100 PSU students who favor
the use of solar energy will differ from the corresponding
proportion of all PSU students.
The relation between a bear's chest girth and weight in a
sample of 10 bears will differ from the corresponding relation
in the entire population of 50 bears in a forested region.
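
The second example can be imitated with a short simulation. This is only a sketch: the population size (40,000 students) and the true proportion (62%) are made-up numbers, not PSU data, and Python is used here simply as a convenient language.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 40,000 students, 62% of whom favor solar energy.
# Both numbers are assumptions chosen for illustration.
population = [1] * 24_800 + [0] * 15_200   # 1 = favors, 0 = does not
population_proportion = sum(population) / len(population)

# A sample of 100 students, drawn at random without replacement.
sample = rng.sample(population, 100)
sample_proportion = sum(sample) / len(sample)

print(f"population proportion: {population_proportion:.3f}")
print(f"sample proportion:     {sample_proportion:.3f}")
```

The sample proportion will typically land near, but not exactly on, the population proportion.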



The GOOD NEWS is that, if the sample is suitably drawn, then
sample properties approximate the population properties.

Figure: Population and sample relationships between chest girth and ... (horizontal axis: Chest Girth; vertical axis: Weight)


Sampling Variability

Sample properties of the characteristic of interest also differ
from sample to sample. For example:
1. The number of US citizens, in a sample of size 20, who favor
expanding solar energy, will (most likely) be different from the
corresponding number in a different sample of 20 US citizens.
2. The average mercury concentration in two sets of 25 water
samples drawn from a lake will differ.
The term sampling variability is used to describe such
differences in the characteristic of interest from sample to
sample.
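
Sampling variability can be seen directly by drawing several samples from the same population. The sketch below uses a hypothetical "lake" of 10,000 possible water samples with invented mercury concentrations (arbitrary units); none of the numbers come from real measurements.

```python
import random
from statistics import mean

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Hypothetical lake: 10,000 possible water samples whose mercury
# concentrations are spread uniformly between 0.5 and 1.5 (made-up units).
lake = [rng.uniform(0.5, 1.5) for _ in range(10_000)]

def sample_mean(n):
    """Average concentration in a random set of n water samples."""
    return mean(rng.sample(lake, n))

# Five different sets of 25 water samples give five different averages.
means = [sample_mean(25) for _ in range(5)]
for i, m in enumerate(means, start=1):
    print(f"set {i}: average = {m:.3f}")
print(f"whole lake: average = {mean(lake):.3f}")
```

Each set of 25 gives a somewhat different average, and none of them needs to equal the average over the whole lake.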



Figure: Illustration of Sampling Variability (horizontal axis: Chest Girth; vertical axis: Weight).



Population-level properties/attributes of the characteristic(s) of
interest are called (population) parameters.
Examples of parameters include averages, proportions,
percentiles, and correlation coefficients.
The corresponding sample properties/attributes of
characteristics are called statistics. The term sports statistics
comes from this terminology.
Sample statistics approximate the corresponding population
parameters but are not equal to them.
Statistical inference deals with the uncertainty issues which
arise in approximating parameters by statistics.
The tools of statistical inference include point and interval
estimation, hypothesis testing and prediction.



Example (Examples of Estimation, Hypothesis Testing and
Prediction)

Estimation (point and interval) would be used in the task of
estimating the coefficient of thermal expansion of a metal, or
the air pollution level.
Hypothesis testing would be used for deciding whether to take
corrective action to bring the air pollution level down, or
whether a manufacturer’s claim regarding the quality of a
product is false.
Prediction arises in cases where we would like to predict the
failure time on the basis of the stress applied, or the age of a
tree on the basis of its trunk diameter.


Some Sampling Concepts

Representative Samples

For valid statistical inference the sample must be
representative of the population. For example, a sample of
PSU basketball players is not representative of PSU students,
if the characteristic of interest is height.
Typically it is hard to tell whether a sample is representative
of the population. So we define a sample to be representative
if it allows for valid statistical inference (a circular definition!).

The only guarantee for that comes from the method used to
select the sample (sampling method).
The good news is that there are several sampling methods that
guarantee representativeness.



Definition
A sample of size n is a simple random sample (s.r.s.) if the selection
process ensures that every sample of size n has an equal chance of
being selected.
To select a s.r.s. of size 10 from a population of 100 units, each
of the 100!/(10! 90!) possible samples of size 10 must be equally likely.
In simple random sampling every member of the population
has the same chance of being included in the sample. The
converse, however, is not true.

Example
To select a sample of 2 students from a population of 20 male and
20 female students, one selects at random one male and one
female student. Is this a s.r.s.? (Does every student have the
same chance of being included in the sample?)
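
The counting behind the definition and the example can be checked by brute force. The sketch below is illustrative only; the numbering of students (0-19 male, 20-39 female) is an arbitrary labeling introduced here.

```python
from itertools import combinations, product
from math import comb

# Number of distinct samples of size 10 from 100 units: 100!/(10! 90!).
n_samples = comb(100, 10)
print(n_samples)

# Arbitrary labels: students 0-19 are male, 20-39 are female.
males, females = range(20), range(20, 40)

# Under s.r.s. of size 2, every pair of students is a possible sample.
srs_samples = list(combinations(range(40), 2))
# Under the scheme in the example, only male/female pairs are possible.
scheme_samples = list(product(males, females))
print(len(srs_samples), len(scheme_samples))

# Chance that student 0 is included, when all possible samples of a
# scheme are equally likely.
p_srs = sum(0 in s for s in srs_samples) / len(srs_samples)
p_scheme = sum(0 in s for s in scheme_samples) / len(scheme_samples)
print(p_srs, p_scheme)
```

Both schemes give each student the same chance of inclusion (1 in 20), yet the one-male-one-female scheme can never produce a same-sex pair. Not every sample of size 2 is equally likely, so it is not a s.r.s.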


Another sampling method for obtaining a representative sample is
called stratified sampling.

Definition
A stratified sample consists of simple random samples from each
of a number of groups (which are non-overlapping and make up
the entire population) called strata.

Examples of strata include: ethnic groups, age groups, and
production facilities.
If the units in the different strata differ in terms of the
characteristic under study, stratified sampling is preferable to
s.r.s. For example, if different production facilities differ in
terms of the proportion of defective products, a stratified
sample is preferable.
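
A stratified sample can be sketched as one simple random sample per stratum. The facility names, stratum sizes, and the choice to sample the same fraction (10%) from every stratum are all assumptions made for illustration; the definition above does not prescribe how many units to take from each stratum.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical strata: item IDs grouped by production facility.
strata = {
    "facility_A": [f"A{i}" for i in range(500)],
    "facility_B": [f"B{i}" for i in range(300)],
    "facility_C": [f"C{i}" for i in range(200)],
}

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction from each stratum."""
    sample = {}
    for name, units in strata.items():
        k = round(fraction * len(units))      # units to take from this stratum
        sample[name] = rng.sample(units, k)   # s.r.s. within the stratum
    return sample

sample = stratified_sample(strata, 0.10, rng)
for name, units in sample.items():
    print(name, len(units))
```

Every facility is guaranteed to appear in the sample, which a single s.r.s. from the pooled items would not guarantee.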



How do we select a s.r.s. of size n from a population of N units?
STEP 1: Assign to each unit a number from 1 to N.
STEP 2: Write each number on a slip of paper, place the N
slips of paper in an urn, and shuffle them.
STEP 3: Select n slips of paper at random, one at a time.
Alternatively, the entire process can be performed in software like
R. We will see this in the next lab session.
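
As a rough illustration of the three steps (in Python here, not the R code used in the lab), the urn can be mimicked as follows. The function name simple_random_sample is hypothetical.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Mimic the urn procedure: number, shuffle, draw."""
    rng = random.Random(seed)
    units = list(range(1, N + 1))   # STEP 1: number the units 1..N
    rng.shuffle(units)              # STEP 2: shuffle the slips in the urn
    return units[:n]                # STEP 3: take n slips, one after another

print(simple_random_sample(100, 10, seed=42))
```

In practice the single call random.sample(range(1, N + 1), n) does the same job.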



Sampling without replacement simply means that a
population unit can be included in a sample at most once. For
example, a simple random sample is obtained by sampling
without replacement: Once a unit’s slip of paper is drawn, it
is not placed back into the urn.
Sampling with replacement means that after a unit’s slip of
paper is chosen, it is put back in the urn. Thus a population
unit could be included in the sample anywhere between 0 and
n times. Rolling a die can be thought of as sampling with
replacement from the numbers 1, 2, . . . , 6.
Though conceptually undesirable, sampling with replacement
is easier to work with from a mathematical point of view.
When a population is very large, sampling with and without
replacement are practically equivalent.
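
The two schemes, and the sense in which they agree for large populations, can be sketched as follows. The helper prob_all_distinct is introduced here for illustration: it gives the chance that n draws with replacement from N units contain no repeated unit, in which case the draws look just like a sample taken without replacement.

```python
import random
from math import prod

rng = random.Random(3)  # fixed seed so the run is repeatable
faces = [1, 2, 3, 4, 5, 6]

without = rng.sample(faces, 4)        # without replacement: no face repeats
with_repl = rng.choices(faces, k=4)   # with replacement: like 4 die rolls
print(without, with_repl)

def prob_all_distinct(N, n):
    """P(no unit is drawn twice) in n draws with replacement from N units."""
    return prod((N - i) / N for i in range(n))

# For a fixed sample size of 10, repeats become rare as N grows.
for N in (100, 10_000, 1_000_000):
    print(N, round(prob_all_distinct(N, 10), 4))
```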



Non-representative samples arise whenever the sampling plan
is such that a part, or parts, of the population of interest are
either excluded from, or systematically under-represented in,
the sample. This is called selection bias.
Two examples of non-representative samples are self-selected
and convenience samples.
A self-selected sample often occurs when people are asked to
send in their opinions in surveys or questionnaires. For
example, in a political survey, often those who feel that things
are running smoothly or who support an incumbent will
(apathetically) not respond, whereas those activists who
strongly desire change will voice their opinions.



A convenience sample is a sample made up of units that
are most easily reached. For example, randomly selecting
students from your classes will not result in a sample that is
representative of all PSU students because your classes
consist mostly of students with the same major as you.
A famous example of selection bias is the following.

Example (The Literary Digest poll of 1936)
The magazine had been extremely successful in predicting the
results in US presidential elections, but in 1936 it predicted a
3-to-2 victory for Republican Alf Landon over the Democratic
incumbent Franklin Delano Roosevelt. Worth noting is that this
prediction was based on 2.3 million responses (out of 10 million
questionnaires sent). Roosevelt won in a landslide. Gallup, on the
other hand, correctly predicted the outcome of that election by
surveying only 50,000 people.
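
Why a huge sample can lose to a small one is easy to see in a toy simulation. Everything below is hypothetical: the electorate, the "mailing list", and all percentages are invented, and the simulation models only one possible source of selection bias (a list that over-represents one side). It does not use the 1936 data.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Hypothetical electorate (1 = supports candidate A, 0 = supports B).
# The mailing list reaches only part of it, and that part leans toward B.
mailing_list = [1] * 8_000 + [0] * 12_000     # 40% support A
not_on_list = [1] * 52_000 + [0] * 28_000     # 65% support A
electorate = mailing_list + not_on_list       # 60% support A overall

true_share = sum(electorate) / len(electorate)

# Large sample, but drawn only from the unrepresentative mailing list.
big_biased = rng.sample(mailing_list, 10_000)
# Small simple random sample from the whole electorate.
small_srs = rng.sample(electorate, 1_000)

biased_estimate = sum(big_biased) / len(big_biased)
srs_estimate = sum(small_srs) / len(small_srs)

print(f"true share of A:       {true_share:.3f}")
print(f"10,000 from the list:  {biased_estimate:.3f}")
print(f"1,000 by s.r.s.:       {srs_estimate:.3f}")
```

Increasing the size of the biased sample only makes it a more precise estimate of the wrong quantity.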


Go to next lesson http://www.stat.psu.edu/~mga/401/course.info/b.lect2.pdf
Go to the Stat 401 home page
http://www.stat.psu.edu/~mga/401/course.info/
http://www.stat.psu.edu/~mga