Lecture Note On Statistical Methods With An Application
Course objectives:
• … population parameters.
• Learn to think for yourself. Knowledge of statistics will allow you to see the …
• To present and interpret the results to researchers and other decision makers.
INTRODUCTION TO
STATISTICS
A lecture note by
Dereje D. (PhD)
Department of Statistics
Hawassa University
Key Terms
• Population (universe). The entire category under consideration. This is the
group we have not completely examined but to which our conclusions
refer. The population size is usually indicated by a capital N.
• Examples: every lawyer in the United States; all single women in the United
States.
Key Terms
Parameter. A characteristic of a population. The
population mean, µ and the population standard
deviation, σ, are two examples of population
parameters. If you want to determine the population
parameters, you have to take a census of the entire
population. Taking a census is very costly.
Key Terms
• Statistical Inference. The process of using sample
statistics to draw conclusions about population parameters
is known as statistical inference.
• Note that pollsters do not call every adult who can vote for
president. This would be very expensive. What pollsters do is call a
representative sample of about 1,000 people and use the sample
statistics (the sample proportion) to estimate who is going to win
the election (population proportion).
Key Terms
Example of statistical inference from quality control: inspect a random sample of items from a production run and use the sample defect rate to draw a conclusion about the defect rate of the entire run.
Key Terms
• Descriptive Statistics. Those statistics that summarize a sample of
numerical data in terms of averages and other measures for the
purpose of description, such as the mean and standard deviation.
• This includes the presentation of data in the form of graphs, charts, and
tables.
Survey Errors
• Response error. The subject makes a mistake or may not remember the answer (e.g., “How
much money do you have invested in the stock market?”).
• Nonresponse error. If the rate of response is low, the sample may not be
representative. The people who respond may be different from the rest of the
population. Usually, respondents are more educated and more interested in
the topic of the survey. Thus, it is important to achieve a reasonably high rate
of response. (How to do this? Use follow-ups.)
• Answer: A small but representative sample can be useful in making inferences. But, a large and
probably unrepresentative sample is useless. No way to correct for it. Thus, sample 1 is better
than sample 2.
Sampling Techniques
• Nonprobability Samples – based on convenience or
judgment
• Convenience (or chunk) sample – students in a class, mall intercepts
• Quota sample – interviewers are assigned quotas; for instance, they may each be told to interview 100 subjects – 50 males and
50 females. Of the 50, say, 10 nonwhite and 40 white.
• The problem with a nonprobability sample is that we do not know how
representative our sample is of the population.
Probability Samples
• Probability Sample. A sample collected in such
a way that every element in the population has a
known chance of being selected.
Probability Samples
• Other kinds of probability samples (beyond the
scope of this course).
• systematic random sample.
• Choose the first element randomly, then every kth
observation, where k = N/n
• stratified random sample.
• The population is sub-divided based on a characteristic
and a simple random sample is conducted within each
stratum
• cluster sample
• First take a random sample of clusters from the population
of clusters. Then take a simple random sample within each
cluster. Examples of clusters: election districts, orchards.
Types of Data
• Qualitative data result in categorical responses. Also called
Nominal, or categorical data
• Example: Sex (MALE, FEMALE)
• We can say that one object has more or less of the characteristic than another object when
we rate them on an ordinal scale. Thus, a category 5 hurricane is worse than a category 4
hurricane which is worse than a category 3 hurricane, etc. Examples: social class, hardness
of minerals scale, income as categories, class standing, rankings of football teams, military
rank (general, colonel, major, lieutenant, sergeant, etc.),
• Example: Income (choose one)
  __ Under 5,000 birr – checked by, say, Yaikob
  __ 5,000 – 9,999 birr – checked by, say, Amsal
  __ 10,000 birr and over – checked by, say, Alamodin
In this example, Alamodin checks the third category even though he earns several billion birr. The distance between
Alamodin and Amsal is not the same as the distance between Amsal and Yaikob.
• Appropriate statistics: – same as those for nominal data, plus the median; but not the mean.
• Appropriate statistics for interval/ratio (numerical) data:
  • same as for nominal
  • same as for ordinal, plus
  • the mean
(A) is nominal, so the best we can get from this data are frequencies. (B) is
ratio, so we can compute: mean, median, mode, frequencies.
Descriptive Statistics
• In this lecture we discuss using descriptive statistics, as
opposed to inferential statistics.
• Here we are interested only in summarizing the data in
front of us, without assuming that it represents anything
more.
• We will look at both numerical and categorical data.
• We will look at both quantitative and graphical techniques.
• The basic overall idea is to turn data into information.
Methods of data presentation
Numerical presentation
Graphical presentation
Mathematical presentation
1- Numerical presentation
Tabular presentation (simple – complex)
Table (I): Distribution of 50 patients at the surgical department of Alexandria hospital in May 2008 according to their blood groups

Blood group (ABO)   Frequency   %
A                       12      24
B                       18      36
AB                       5      10
O                       15      30
Total                   50     100
Table (II): Distribution of 50 patients at the surgical department of HU Specialized hospital in May 2008 according to their age

Age (years)   Frequency   %
20 – <30          12      24
30 –              18      36
40 –               5      10
50+               15      30
Total             50     100
Complex frequency distribution table

                 Lung cancer
              Cases          Controls        Total
Smoking     No.    %        No.    %        No.    %
Smoker       15    75%       8     20%       23    38.33
Non-smoker    5    25%      32     80%       37    61.67
Complex frequency distribution table (row percentages)

              Cases          Controls        Total
            No.    %        No.    %        No.    %
Smoker       15    65.2      8     34.8      23    100
Non-smoker    5    13.5     32     86.5      37    100
2- Graphical presentation
Graphs drawn using Cartesian
coordinates
• Line graph
• Frequency polygon
• Frequency curve
• Histogram
• Bar graph
• Scatter plot
Pie chart
rules
Statistical maps
Line Graph
Figure (1): Maternal mortality rate (MMR per 1000) of (country), 1960–2000

Year    MMR/1000
1960      50
1970      45
1980      26
1990      15
2000      12
Frequency polygon

Age (years)   Males     Females    Mid-point of interval
20 –          3 (12%)   2 (10%)    (20+30)/2 = 25
30 –          9 (36%)   6 (30%)    (30+40)/2 = 35
40 –          7 (28%)   5 (25%)    (40+50)/2 = 45
50 –          4 (16%)   3 (15%)    (50+60)/2 = 55
60 –          2 (8%)    4 (20%)    (60+70)/2 = 65

[Figure: frequency polygon of the percentage of males and females (y-axis, 0–40%) plotted against the interval mid-points 25, 35, 45, 55, 65.]
[Figure: frequency polygon of frequency (0–9) by age group (20–, 30–, 40–, 50–, 60–69 years) for males and females.]
Histogram
Figure (2): Distribution of 100 cholera patients at (place), in (time) by age — histogram of frequency (y-axis) against age in years (x-axis).
Bar chart
[Figure: bar chart of percentage (0–45%) by marital status: single, married, divorced, widowed.]
Bar chart
[Figure: clustered bar chart of percentage (0–50%) by marital status for males and females.]
Pie chart
[Figure: pie chart — translocation 79%, inversion 18%, deletion 3%.]
Measures of Location
• Measures of location place the data set on the scale of
real numbers.
The Mean
• The sample mean is the sum of all the observations (∑X i) divided by
the number of observations (n):
∑Xi = X1 + X2 + X3 + X4 + … + Xn
Example. Data: 1, 2, 2, 4, 5, 10 (n = 6)
∑Xi = 1 + 2 + 2 + 4 + 5 + 10 = 24
X̅ = 24 / 6 = 4.0
The Mean
Example.
For the data: 1, 1, 1, 1, 51. Calculate the mean. Note: n
= 5 (five observations)
∑Xi = 1 + 1 + 1 + 1 + 51 = 55
X̅ = 55 / 5 = 11.0
Note how a single extreme value (51) pulls the mean well above the other four observations.
The Median
• The median is the middle value of the ordered data
• To get the median, we must first rearrange the data into
an ordered array (in ascending or descending order).
Generally, we order the data from the lowest value to
the highest value.
• Therefore, the median is the data value such that half of
the observations are larger and half are smaller. It is
also the 50th percentile (we will be learning about
percentiles in a bit).
• If n is odd, the median is the middle observation of the
ordered array. If n is even, it is midway between the two
central observations.
The Median
0 2 3 5 20 99 100
Example:
Note: Data has been ordered from lowest to highest. Since n is odd
(n=7), the median is the (n+1)/2 ordered observation, or the 4th
observation.
Answer: The median is 5.
The mean and the median are unique for a given set of data.
There will be exactly one mean and one median.
The Median
Example: 10 20 30 40 50 60
Since n is even (n = 6), the median is midway between the two central observations (30 and 40): the median is 35.
The Median
• The median has 3 interesting characteristics:
The Mode
• The mode is the value of the data that occurs with
the greatest frequency.
Example. 1, 1, 1, 2, 3, 4, 5
Answer. The mode is 1 since it occurs three times. The other
values each appear only once in the data set.
The Mode
• The mode is different from the mean and the median in
that those measures always exist and are always
unique. For any numeric data set there will be one
mean and one median.
• The mode may not exist.
• Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
• Here you have 10 observations and they are all different.
Quantiles
• Measures of non-central location used to
summarize a set of data
Quartiles
• Quartiles split a set of ordered data into four parts.
• Imagine cutting a chocolate bar into four equal pieces… How many cuts
would you make? (yes, 3!)
• Q1 is the First Quartile
• 25% of the observations are smaller than Q1 and 75% of the observations
are larger
• Q2 is the Second Quartile
• 50% of the observations are smaller than Q2 and 50% of the observations
are larger. Same as the Median. It is also the 50th percentile.
• Q3 is the Third Quartile
• 75% of the observations are smaller than Q3 and 25% of the observations
are larger
• Some books use a formula to determine the quartiles. We prefer a quick-and-
dirty approximation method outlined in the next slide.
Quartiles
• A quartile, like the median, either takes the value of one of the observations, or the
value halfway between two observations.
• The simple method we like to use is just to first split the data set into two equal parts
to get the median (Q2) and then get the median of each resulting subset.
• The method we are using is an approximation. If you solve this in MS Excel, which relies on
a formula, you may get an answer that is slightly different.
Exercise
Computer Sales (n = 12 salespeople)
Original Data: 3, 10, 2, 5, 9, 8, 7, 12, 10, 0, 4, 6
Compute the mean, median, mode, quartiles.
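As a way to check answers to exercises like this one, here is a minimal Python sketch (not part of the slides, which use MS Excel elsewhere). It implements the quick-and-dirty quartile method described above — split the ordered data at the median and take the median of each half — so library or Excel quartile functions may give slightly different values.

```python
from statistics import mean, median, mode

def quartiles_split(data):
    """Quick-and-dirty quartiles: split the ordered data at the median,
    then take the median of each half (middle value dropped when n is odd)."""
    x = sorted(data)
    n = len(x)
    lower, upper = x[:n // 2], x[(n + 1) // 2:]
    return median(lower), median(x), median(upper)

sales = [3, 10, 2, 5, 9, 8, 7, 12, 10, 0, 4, 6]   # n = 12 salespeople
q1, q2, q3 = quartiles_split(sales)
print(mean(sales), median(sales), mode(sales), q1, q2, q3)
```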
Other Quantiles
• Similar to what we just learned about quartiles, where 3
quartiles split the data into 4 equal parts,
• There are 9 deciles dividing the distribution into 10 equal
portions (tenths).
• There are four quintiles dividing the population into 5 equal
portions.
• … and 99 percentiles (next slide)
• In all these cases, the convention is the same. The point,
be it a quartile, decile, or percentile, takes the value of
one of the observations or it has a value halfway
between two adjacent observations. It is never necessary
to split the difference between two observations more
finely.
Percentiles
• We use 99 percentiles to divide a data set into 100 equal
portions.
Some Exercises
Data (n=16):
1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10
Compute the mean, median, mode, quartiles.
Answer.
1 1 2 2 ┋ 2 2 3 3 ┋ 4 4 5 5 ┋ 6 7 8 10
Mean = 65/16 = 4.06
Median = 3.5
Mode = 2
Q1 = 2
Q2 = Median = 3.5
Q3 = 5.5
Exercise: # absences
Data – number of absences (n=13) :
0, 5, 3, 2, 1, 2, 4, 3, 1, 0, 0, 6, 12
Compute the mean, median, mode, quartiles.
Measures of Dispersion
• Dispersion is the amount of spread, or variability, in
a set of data.
Measures of Dispersion
Supplier A chips (life in years): 11, 11, 10, 10, 11, 11
Supplier B chips (life in years): 170, 1, 1, 160, 2, 150
We see that supplier B’s chips have a longer average life. However, what if the company offers
a 3-year warranty? Then computers manufactured with supplier B’s chips would be far more
likely to fail within the warranty period, because the lives of B’s chips are much more variable.
The average alone is not enough; we also need a measure of dispersion.
Measures of Dispersion
• We will study these five measures of dispersion
• Range
• Interquartile Range
• Standard Deviation
• Variance
• Coefficient of Variation
The Range and Interquartile Range
• Range = Largest Value – Smallest Value
• Interquartile Range (IQR) = Q3 – Q1, the spread of the middle 50% of the data
• Example (n = 15):
0, 0, 2, 3, 4, 7, 9, 12, 17, 18, 20, 22, 45, 56, 98
Q1 = 3, Q3 = 22
IQR = 22 – 3 = 19 (Range = 98 – 0 = 98)
Standard Deviation
• The standard deviation, s, measures a kind of
“average” deviation about the mean. It is not really the
“average” deviation, even though we may think of it
that way.
Standard Deviation
• Instead, we use the sample standard deviation:
s = √[ Σ(Xi − X̅)² / (n − 1) ]
Standard Deviation
Example. Two data sets, X and Y. Which of the two data sets has greater variability? Calculate the standard deviation for each.

X: 1, 2, 3, 4, 5    (X̅ = 3)
Y: 0, 0, 0, 5, 10   (Y̅ = 3)

X     (X − X̅)    (X − X̅)²
1        −2          4
2        −1          1
3         0          0
4         1          1
5         2          4
        Σ = 0      Σ = 10

SX = √(10/4) = 1.58

Y     (Y − Y̅)    (Y − Y̅)²
0        −3          9
0        −3          9
0        −3          9
5         2          4
10        7         49
        Σ = 0      Σ = 80

SY = √(80/4) = √20 = 4.47
• You divide by N only when you have taken a census and therefore know
the population mean. This is rarely the case.
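A short Python sketch (illustrative only, not from the slides) confirming the two standard deviations above; `statistics.stdev` divides by n − 1, while `statistics.pstdev` divides by N as in the note about a census.

```python
import statistics

X = [1, 2, 3, 4, 5]
Y = [0, 0, 0, 5, 10]

# sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))
print(round(statistics.stdev(X), 2))   # 1.58
print(round(statistics.stdev(Y), 2))   # 4.47
# population standard deviation divides by N -- only appropriate for a census
print(round(statistics.pstdev(X), 2))  # 1.41
```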
Variance
The variance, s², is the standard deviation (s) squared. Conversely, s = √s².
Definitional formula: s² = Σ(Xi − X̅)² / (n − 1)
Computational formula: s² = [ ΣXi² − (ΣXi)²/n ] / (n − 1)
Key Terms
• Probability. The word probability is actually undefined,
but the probability of an event can be explained as the
proportion of times, under identical circumstances, that
the event can be expected to occur.
• It is the event's long-run frequency of occurrence.
• For example, the probability of getting a head on a coin toss = .5. If you
toss a coin repeatedly, for a long time, you will note that a head occurs
about one half of the time.
Key Terms
• Objective probabilities are long-run frequencies of occurrence, as
above. Probability, in its classical (or, objective) meaning refers to a
repetitive process, one which generates outcomes which are not
identical and not individually predictable with certainty but which may
be described in terms of relative frequencies. These processes are
called stochastic processes (or, chance processes). The individual
results of these processes are called events.
Key Terms
• Random variable. That which is observed as the result of a stochastic
process. A random variable takes on (usually numerical) values. Associated
with each value is a probability that the value will occur.
For example, when you toss a die:
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Simple probability. P(A). The probability that an event (say, A) will occur.
• Joint probability. P(A and B). P(A ∩ B). The probability of events A and B
occurring together.
Example: Readership
• In a small village in Ethiopia, we are looking at readership of the Ethiopian Herald (T)
and the Admas news papers (W):
• P(T) = .25
P(W) = .20
P(T and W) = .05
• Question: What is the probability of reading either the Ethiopian Herald or the Admas
newspaper?
Example: Readership
[Venn diagram: T only = 20, W only = 15, both T and W = 5, neither = 60; N = 100.]
Thus, it can easily be seen that 40 people (out of 100) read either the Ethiopian Herald (T) or
the Admas (W).
Example: Readership
• Other Probabilities: P(T′ and W′) = .60
• P(T′ and W) = .15
• P(T and W′) = .20
• P(T or W) = 1 – P(T′ and W′) = 1– .60 = .40
Note that:
P(T′ or W′) = 1- P(T and W) = 1 - .05 = .95
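A tiny Python sketch (added for illustration) that checks these identities with the given readership probabilities.

```python
# Readership example: P(T) = .25, P(W) = .20, P(T and W) = .05
p_t, p_w, p_t_and_w = 0.25, 0.20, 0.05

p_t_or_w = p_t + p_w - p_t_and_w        # addition rule: .40
p_neither = 1 - p_t_or_w                # P(T' and W') = .60
p_not_both = 1 - p_t_and_w              # P(T' or W') = .95
print(p_t_or_w, p_neither, p_not_both)
```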
Example: Readership
• Another way to solve this problem is to construct a table of joint probabilities:

           T      T′     Total
W         .05    .15      .20
W′        .20    .60      .80
Total     .25    .75     1.00
Mutually exclusive means that two things cannot occur at the same time [P (A
and B) = 0]. You cannot get a head and tail at the same time; you cannot be
dead and alive at the same time; you cannot pass and fail a course at the
same time; etc.
Independence has to do with the effect of, say B, on A. If knowing about B has
no effect on A, then they are independent. It is very much like saying that A
and B are unrelated.
Are waist size and gender independent of each other? Suppose I know that someone who is an
adult has a 24-inch waist, does that give me a hint as to whether that person is male or female?
How many adult men have a 24-inch waist? How many women? Is P (24-inch waist/adult male)
= P (24 inch waist/adult female). We suspect that the two probabilities are not the same, that
there is a relationship between gender and waist size (also hand size and height for that matter).
Thus, they are not independent.
Thus, cancer and smoking are not independent. There is a relationship between cancer
and smoking.
Since .10 is not equal to .06, we conclude that cancer and smoking are not independent.
These calculations are much easier if you set up a joint probability table. Coming up in
the next two slides.
[Joint probability table with columns S (smoker) and S′ (non-smoker).]
Notice that the marginal probabilities are the row and column totals. This is not an
accident. The marginal probabilities are totals of the joint probabilities and weighted
averages of the conditional probabilities.
The Joint Probability Table:
                    M (male)   F (female)   Total
B (beer drinker)      .225       .175        .40

• Given that an individual is Female, what is the probability that that person is a Beer
drinker?
Computed probabilities:
Joint Probability Marginal Totals
P(D and M) = .08 P(D) = .20
P(D and F) = .12 P(D′) = .80
P(D′ and M) = .32 P(M) = .40
P(D′ and F) = .48 P(F) = .60
Example:
Probability Distribution for the Toss of a Die
Xi P(Xi)
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
Example: let X be the number of heads in three tosses of a fair coin.
X = x      0     1     2     3
P(X=x)    1/8   3/8   3/8   1/8
I. Binomial Distribution
The origin of binomial experiment lies in Bernoulli trial.
Bernoulli trial is an experiment of having only two mutually
exclusive outcomes which are designated by “success(s)” and
“failure (f)”. Sample space of Bernoulli trial {s, f}
Notation: Let the probabilities of success and failure be p and q, respectively:
P(success) = P(s) = p and P(failure) = P(f) = q, where q = 1 − p
Definition: Let X be the number of successes in n repeated Bernoulli trials,
each with probability of success p; then the probability distribution of the
discrete random variable X is called the binomial distribution.
ASSUMPTIONS
A binomial experiment is a probability experiment that satisfies the following requirements:
1. There is a fixed number of trials, n.
2. Each trial has only two mutually exclusive outcomes (success or failure).
3. The probability of each outcome does not change from trial to trial.
4. The trials are independent of one another.
Note that the binomial probabilities sum to one: Σ_{x=0}^{n} P(X = x) = 1
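As an illustration of the binomial distribution just defined, here is a hedged Python sketch; the values n = 5 and p = 0.3 are arbitrary illustrative choices, not taken from the slides.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.3          # illustrative values only
probs = [binom_pmf(x, n, p) for x in range(n + 1)]
print([round(pr, 4) for pr in probs])
print(round(sum(probs), 10))   # the probabilities sum to 1
```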
COMMON CONTINUOUS
PROBABILITY DISTRIBUTIONS
The probability that a continuous random variable X takes a value that lies between a and b is
equal to the area under the density curve between a and b. Since X must assume some value,
it follows that the total area under the curve equals 1.
Uniform distribution
Normal Distributions
It is the most important distribution in describing a continuous
random variable and used as an approximation of other
distribution.
A random variable X is said to have a normal distribution if its
probability density function is given by
f(x) = [1/(σ√(2π))] · e^{ −(1/2)[(x−µ)/σ]² }
where x is any real value of X, i.e., −∞ < x < ∞, −∞ < µ < ∞ and σ > 0,
µ = E(X) and σ² = Var(X);
µ and σ² are the parameters of the normal distribution.
Examples:
1. Find the area under the standard normal distribution which lies
a) Between Z = 0 and Z = 0.96
   Solution: Area = P(0 < Z < 0.96) = 0.3315
b) Between Z = −1.45 and Z = 0
   Solution: Area = P(−1.45 < Z < 0) = P(0 < Z < 1.45) = 0.4265
c) To the right of Z = −0.35
   Solution: Area = P(Z > −0.35) = P(−0.35 < Z < 0) + P(Z > 0)
           = P(0 < Z < 0.35) + P(Z > 0) = 0.1368 + 0.50 = 0.6368
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
Example: The average grade for an exam is 74, and the standard
deviation is 7. If 12% of the class is given As, and the grades are
curved to follow a normal distribution, what is the lowest possible A
and the highest possible B?
Solution: In this example, we begin with a known area of probability,
find the z value, and then determine x from the formula x = σz+μ.
An area of 0.12, corresponding to the fraction of students receiving
As, is shaded in the figure. We require a z value that leaves 0.12 of the
area to the right and, hence, an area of 0.88 to the left. From the table,
P(Z<1.18) has the closest value to 0.88, so the desired z value is
1.18. Hence, x= (7)(1.18) + 74 = 82.26. Therefore, the lowest A is 83
and the highest B is 82.
Exercise
1. An electrical firm manufactures light bulbs that have a life, before
burn-out, that is normally distributed with mean equal to 800
hours and a standard deviation of 40 hours. Find the probability
that a bulb burns between 778 and 834 hours.(ans. 0.5111)
2. A certain type of storage battery lasts, on average, 3.0 years with a
standard deviation of 0.5 year. Assuming that battery life is
normally distributed, find the probability that a given battery will
last less than 2.3 years.(ans. 0.0808)
3. Given a normal distribution with μ=40 and σ= 6, find the value of
x that has
(a) 45% of the area to the left and (ans x= 39.22)
(b) 14% of the area to the right. (ans x= 46.48)
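If SciPy is available, the exercise answers can be checked directly with the normal distribution functions. This sketch is illustrative; small differences from the table-based answers (e.g., 39.25 vs. 39.22 in 3(a)) come from rounding z to two decimals in the table.

```python
from scipy.stats import norm

# 1. Bulb life ~ N(800, 40): P(778 < X < 834)
print(round(norm.cdf(834, 800, 40) - norm.cdf(778, 800, 40), 4))   # ~0.5111

# 2. Battery life ~ N(3.0, 0.5): P(X < 2.3)
print(round(norm.cdf(2.3, 3.0, 0.5), 4))                           # ~0.0808

# 3. N(40, 6): x with 45% of the area to the left, and 14% to the right
print(round(norm.ppf(0.45, 40, 6), 2))                             # ~39.25
print(round(norm.ppf(1 - 0.14, 40, 6), 2))                         # ~46.48
```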
Exponential Distribution
Definition: A continuous random variable X has an exponential
distribution with parameter λ > 0 if its probability density function f
is given by f(x) = λe^(−λx) for x ≥ 0 (and f(x) = 0 for x < 0).
Example: Suppose that the number of miles that a car can run
before its battery wears out is exponentially distributed with an
average value of 10,000 miles. If a person desires to take a 5,000-
mile trip, what is the probability that he or she will be able to
complete the trip without having to replace the car battery?
Solution: It follows from the memoryless property of the exponential
distribution that the remaining lifetime (in thousands of miles) of
the battery is exponential with parameter λ = 1/10. Hence the
desired probability is
P{remaining lifetime > 5} = 1 − F(5) = e^(−5λ) = e^(−1/2) ≈ 0.607
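A minimal Python check of this exponential calculation (illustrative only).

```python
from math import exp

lam = 1 / 10          # rate parameter, in units of 1/(thousand miles)
x = 5                 # trip length: 5 thousand miles

# P(remaining lifetime > x) = 1 - F(x) = e^(-lambda * x)
print(round(exp(-lam * x), 4))   # ~0.6065
```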
Variance
Alternatively, Var(X) = Σ_{i=1}^{n} (x_i − μ_X)² · P(x_i)
Properties of Variances
For any r.v X and constant C, it can be shown that
Var (CX) = C2 Var (X)
Var (X +C) = Var (X) +0 = Var (X)
If X and Y are independent random variables, then
Var (X + Y) = Var (X) + Var (Y)
More generally if X1, X2 ……, Xk are independent random
variables,
Then Var(X1 +X2 + …..+ Xk) = Var(X1) +Var(X2) +…. + var(Xk)
i.e., Var( Σ_{i=1}^{k} X_i ) = Σ_{i=1}^{k} Var(X_i)
If Z ~ N(0, 1) and X ~ χ²_n are independent, then T = Z / √(X/n) has a t distribution with n degrees of freedom.
The F Distribution
• If a random variable, F, has an F distribution with (k1,k2) df,
then it is denoted as F~Fk1,k2
• F is a function of X1 ~ χ²_{k1} and X2 ~ χ²_{k2} as follows:
F = (X1/k1) / (X2/k2)
SAMPLING
DISTRIBUTIONS
Sampling Distribution of X̅
• The sample mean, X̅ , is a random variable.
Sampling Distribution of X̅
• σ = √2 = 1.41
• [Note: divide by N, not n − 1. This is the formula for computing the population
standard deviation, σ.]
• Z = −5
Example- Bulbs
• A manufacturer claims that its bulbs have a mean life of
15,000 hours and a standard deviation of 1,000 hours. A
quality control expert takes a random sample of 100 bulbs
and finds a mean of X̅ = 14,000 hours. Should the
manufacturer revise the claim?
Example- Bulbs
• Solution (DRAW A PICTURE!)
[Sketch: sampling distribution of X̅ centered at 15,000 hours, with the observed X̅ = 14,000 far in the left tail.]
• Z = (14,000 − 15,000) / (1,000/√100) = −1,000/100 = −10
• That’s 10 standard deviations away from the mean! Yes, the claim
should definitely be revised.
Z = (Xi − μ) / σ for an individual observation; for the sample mean, Z = (X̅ − μ) / (σ/√n).
Here the computed Z for the sample mean is −2.
• Ans: The probability the sample mean will be below 98,000/year is .5 − .4772 = .0228.
Implications
• The relationship between X̅ and µ is the foundation of
statistical inference
Importance
• This concept is what the whole rest of the course is about.
DETERMINING SAMPLE SIZE
Precision
X̅ ± Z·σ/√n
• e, the half-width of the confidence interval estimator, is the
precision with which we are estimating. e is also called the
sampling error.
If e = Z·σ/√n, then √n = Z·σ/e,
and so n = Z²σ²/e².
Example
Suppose σ=20 based upon previous studies. We would
like to estimate the population mean within ±10 of its true
value, at α=.05 (i.e., 95% confidence). What sample size
should we take?
n = (1.96² × 20²) / 10² = 15.4
so we need a sample size of at least 15.4, i.e., n = 16. We round UP, not down.
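A small Python sketch (not part of the slides) of the sample-size formula n = Z²σ²/e², with the rounding-up rule built in.

```python
import math

def sample_size_mean(sigma, e, z=1.96):
    """n = Z^2 * sigma^2 / e^2, rounded UP to the next whole number."""
    return math.ceil((z**2 * sigma**2) / e**2)

print(sample_size_mean(sigma=20, e=10))   # 16, matching the example above
```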
n = Z² · P(1 − P) / e²
• Q: If we are trying to estimate the population proportion, P,
what do we use for P in this formula?
Then,
n = 1.96² × .5(1 − .5) / .01² = 9,604
That is a VERY large sample.
Homework
• Practice, practice, practice.
• As always, do lots and lots of problems. You can find these in the
online lecture notes and homework assignments.
INTRODUCTION TO
STATISTICAL INFERENCE
Estimation
Estimation
• Why don’t we just use a single number (a point estimate)
like, say, X̅ to estimate a population parameter, μ?
• For instance, say you wanted a confidence interval estimator for the mean income of a
college graduate. You might have:
100% confidence   that the mean income is between $0 and $∞
95% confidence    $35,000 and $41,000
90% confidence    $36,000 and $40,000
80% confidence    $37,500 and $38,500
…                 …
0% confidence     $38,000 (a point estimate)
• The wider the interval, the greater the confidence you will have in it
as containing the true population parameter μ.
[Sketch: standard normal curve with area α/2 in each tail, beyond the critical values −Z_{α/2} and +Z_{α/2}.]
Question
• You work for a company that makes smart TVs, and your
boss asks you to determine with certainty the exact life of
a smart TV. She tells you to take a random sample of 100
TVs.
Answer – Take 1
• Since your boss has asked for 100% confidence, the only
answer you can accurately provide is: -∞ to + ∞ years.
• After you are fired, perhaps you can get your job back by
explaining to your boss that statisticians cannot work with
100% confidence if they are working with data from a
sample. If you want 100% confidence, you must take a
census. With a sample, you can never be absolutely certain
as to the value of the population parameter.
Answer – Take 2. Suppose the sample of 100 TVs gives X̅ = 11.50 years and s = 2.50 years. Then, at 95% confidence:
11.50 ± 1.96*(2.50/√100)
11.50 ± 1.96*(.25)
11.50 ± .49
The 95% CIE is: 11.01 years ---- 11.99 years
[Note: Ideally we should be using σ but since n is large we assume that s is close to the
true population standard deviation.]
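A short Python sketch (illustrative) reproducing this 95% confidence interval.

```python
import math

xbar, s, n = 11.50, 2.50, 100
z = 1.96                               # 95% confidence

half_width = z * s / math.sqrt(n)      # 1.96 * 0.25 = 0.49
print(round(xbar - half_width, 2), round(xbar + half_width, 2))   # 11.01, 11.99
```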
Key Points
• Once you are working with a sample, not the entire
population, you cannot be 100% certain of population
parameters. If you need to know the value of a parameter with
certainty, take a census.
• The more confidence you want to have in the estimator,
the larger the interval is going to be.
• Traditionally, statisticians work with 95% confidence.
However, you should be able to use the Z-table to
construct a CIE at any level of confidence.
Hypothesis Testing
• Testing a Claim: Companies often make claims about
products. For example, a frozen yogurt company may claim
that its product has no more than 90 calories per cup. This
claim is about a parameter – i.e., the population mean number
of calories per cup (μ).
Hypothesis Testing
• A hypothesis is made about the value of a parameter, but the only
facts available to estimate the true parameter are those provided
by the sample. If the statistic differs (and of course it will) from
the hypothesis stated about the parameter, a decision must be
made as to whether or not this difference is significant. If it is, the
hypothesis is rejected. If not, it cannot be rejected.
[Sketch: standard normal curve with rejection regions of area α/2 in each tail, beyond −Z_{α/2} and +Z_{α/2}.]
• As we lower the α error, the β error goes up: reducing the error of
rejecting H0 when it is true (the error of rejection) increases the error of
“accepting” H0 when it is false (the error of acceptance).
• This is similar (in fact exactly the same) to the problem we had earlier
with confidence intervals. Ideally, we would love a very narrow
interval, with a lot of confidence. But, practically, we can never have
both: there is a tradeoff.
• This is why our legal system does not require a guilty verdict to
be “beyond a shadow of a doubt” (i.e., complete certainty) but
“beyond reasonable doubt.”
2. Specify the level of significance (α) to be used. This level of significance tells you
the probability of rejecting H0 when it is, in fact, true. (Normally, significance levels of
0.05 or 0.01 are used.)
3. Select the test statistic: e.g., Z, t, F, etc. So far, we have been using the Z
distribution. We will be learning about the t-distribution (used for small samples)
later on.
4. Establish the critical value or values of the test statistic needed to reject H 0. DRAW
A PICTURE!
• On the other hand, if a firm claims that a box of its raisin bran cereal
contains at least 100 raisins, a one-tail test has to be used. If the
sample mean is more than 100, everything is ok. The problems arise
only if the sample mean is less than 100. The question will be
whether we are looking at sampling error or perhaps the company is
lying and the true (population) mean is less than 100 raisins.
Two-Tail Tests
• A company claims that its soda vending machines deliver exactly 8 ounces of soda.
Clearly, You do not want the vending machines to deliver too much or too little soda.
How would you formulate this?
Answer:
H0: µ = 8 ounces
H1: µ ≠ 8 ounces
If you are testing at α = .01, the .01 is split into two: .005 in the left tail and .005 in the
right tail. The critical values are ±2.575.
[Sketch: standard normal curve with .005 in each tail beyond −2.575 and +2.575.]
Two-Tail Tests
• A company claims that its bolts have a circumference of exactly 12.50
inches. (If the bolts are too wide or narrow, they will not fit properly):
Answer:
H0: µ = 12.50 inches
H1: µ ≠ 12.50 inches
One-Tail Tests
• A company claims that its batteries have an average life of at least 500 hours. How
would you formulate this?
Answer:
H0: µ ≧ 500 hours
H1: µ < 500 hours
If you are testing at an α = .05, The entire .05 is in the left tail (hint: H1 points to where
the rejection region should be.) The critical value is -1.645.
One-Tail Tests
A company claims that its overpriced, bottled spring water has no more than 1 mcg of
benzene (poison). How would you formulate this:
Answer:
H0: µ ≦ 1 mcg. benzene
H1: µ > 1 mcg. benzene
If you are testing at an α = .05, The entire .05 is in the right tail (hint: H1 points to where
the rejection region should be.) The critical value is +1.645.
[Sketch: standard normal curve with .05 in the right tail beyond +1.645.]
Example: Two-Tail Test
A pharmaceutical company claims that each of its pills contains exactly 20.00 milligrams
of Cumidin (a blood thinner). You sample 64 pills and find that the sample mean X̅
=20.50 mg and s = .80 mg. Should the company’s claim be rejected? Test at α = 0.05.
• Formulate the hypotheses
H0: µ =20.00 mg
H1: µ ≠ 20.00 mg
• Choose the test statistic and find the critical values; draw region of rejection
Test statistic: Z
At α = 0.05, the critical values are ±1.96.
• Use the data to get the calculated value of the test statistic
Z = (20.50 − 20.00) / (.80/√64) = .50/.10 = 5   [ .80/√64 = .10 is the standard error of the mean. ]
Since 5 > 1.96, we reject H0.
95% CIE: 20.50 ± 1.96(.10), i.e., 20.304 mg ↔ 20.696 mg
• Note: When testing a hypothesis, we often have to perform a one-tail test if the claim requires it.
However, we will always use only two-sided confidence interval estimators when using sample
statistics to estimate population parameters.
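A hedged Python sketch of this two-tail Z test and the accompanying confidence interval; it simply reproduces the arithmetic above.

```python
import math

mu0, xbar, s, n = 20.00, 20.50, 0.80, 64
se = s / math.sqrt(n)                  # 0.10, the standard error of the mean
z = (xbar - mu0) / se                  # 5.0
print(round(z, 2), abs(z) > 1.96)      # 5.0 True -> reject H0 at alpha = .05

# Two-sided 95% CIE
print(round(xbar - 1.96 * se, 3), round(xbar + 1.96 * se, 3))  # 20.304, 20.696
```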
Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes.
STUDENT’S T-
DISTRIBUTION
Student’s t-Distribution
• In the previous lectures on statistical inference (estimation, hypothesis
testing) about the mean, we used the Z statistic even when we did not
know σ. In that case, we used s as a point estimate of σ for “large
enough” n.
• Use t for (1) Small sample AND (2) Taken from N.D. population** AND (3) Unknown σ
The t-Distribution
• The t-distribution looks like the normal distribution, except that it has
more spread.
• It is still symmetrical about the mean; mean=median=mode.
• It also goes from -∞ to +∞.
• Degrees of freedom (df) = n -1. We will shortly see why we lose the degree of
freedom (hint: remember we divided by (n-1) when computing the sample
standard deviation).
X̅ ± t · s/√n
Example 1
• A consulting firm claims that its consultants earn on
average exactly $260 an hour. You decide to test the
claim using a sample of 16 consultants. You find that X̅ =
$200 and s = $96. Test the claim at α = .05 level. Assume
that the population follows a normal distribution.
Example 1 (cont’d)
• Step 2: α = .05
• Step 3: Choose the test statistic
• We are going to use the t-distribution since n is small (only 16) and we do not know σ
The test statistic is t with n − 1 = 15 degrees of freedom (t₁₅).
[Sketch: t₁₅ distribution with 2.5% in each tail beyond −2.1315 and +2.1315.]
• Step 4: Establish the critical value or values of the test statistic needed to
reject H0.
• This is a t15 so we cannot use the critical values of ±1.96 which we used for a Z test.
If you go to the t-table (next slide) and examine the column that has .025 on the top
and the row for 15 degrees of freedom, you will find that the critical values are
±2.1315
• Please do not forget that you have (n-1) degrees of freedom (d.f.); n = 16, but d.f. =
15.
[Excerpt from the t table: the row for df = ∞ gives 1.645, 1.96, 2.33, 2.575 — the familiar Z critical values.]
Example 1 (cont’d)
• Step 5: Determine the actual value (computed value) of the test
statistic.
t₁₅ = (200 − 260) / (96/√16) = −60/24 = −2.50
Since −2.50 is beyond the critical value of −2.1315, we reject H0: the claim of exactly $260 an hour is not supported.
Example 1 (cont’d)
• Now, how do we construct a confidence interval estimator
(CIE), using t?
• Suppose there had not been a claim about the population
mean and you simply wanted to construct a 95% CIE for μ.
Using the sample evidence, what should you do?
200 ± 2.1315 × (96/√16)
$200 ± $51.16
$148.84 ↔ $251.16
Interpretation: We have 95% confidence that this interval
($148.84 ↔ $251.16) really does contain the true population
mean, µ.
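Assuming SciPy is available, the t test and the 95% CIE in Example 1 can be reproduced as follows (a sketch, not part of the original notes).

```python
import math
from scipy.stats import t

xbar, s, n, mu0 = 200, 96, 16, 260
se = s / math.sqrt(n)                       # 24
t_stat = (xbar - mu0) / se                  # -2.50
t_crit = t.ppf(0.975, df=n - 1)             # ~2.1315 for 15 df
print(round(t_stat, 2), round(t_crit, 4))

# 95% CIE for mu
print(round(xbar - t_crit * se, 2), round(xbar + t_crit * se, 2))   # ~148.84, ~251.16
```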
Example 2
• A company claims that its soup vending machines deliver
(on average) exactly 4.00 ounces of soup. The company
statistician finds:
n=25
X̅ =3.97 ounces
s=.04 ounces
(a) Test the claim. With 24 degrees of freedom, the critical values are ±2.4922 (α = .02, two-tailed).
t₂₄ = (3.97 − 4.00) / (.04/√25) = −.03/.008 = −3.75
Since −3.75 < −2.4922, REJECT H0.
Example 2 (cont’d)
•(b) 98% CIE of µ
3.97 ± 2.4922(.008)
3.97 ± .02
3.95 oz ←———→ 3.99 oz
Example 3
• A school claims that the average reading score of its
students is at least 70.
• A sample of 16 students is randomly selected (n=16) to test this
claim.
• X̅ = 68 and s = 9
(a) H0: µ ≥ 70
    H1: µ < 70
Test statistic: t₁₅; the critical value (α = .05, one-tailed) is −1.7531.
t₁₅ = (68 − 70) / (9/√16) = −2/2.25 = −0.88
Since −0.88 is not beyond −1.7531, DO NOT REJECT H0.
Example 3 (cont’d)
• (b) Two-sided 95% CIE.
• We will always construct two-sided CIEs in this course.
68 ± 2.1315(2.25)
68 ± 4.8, i.e., 63.2 ↔ 72.8
Example 4
• The Vandelay Water Company claims that at most there is 1 ppm (part per million) of
benzene in their terribly expensive, horrid-tasting bottled water.
(a) Test the claim at α=.05
(b) Supposing no claim was made. Construct a two-tailed 95%
C.I.E of μ
• Sample data:
n=25 randomly selected bottles of water
X̅ = 1.16 ppm and s = .20 ppm
Example 4 (cont’d)
Test the claim at α = .05
H0: µ ≤ 1.0 ppm
H1: µ > 1.0 ppm
t₂₄ = (1.16 − 1.0) / (.20/√25) = .16/.04 = 4.0, which is well beyond the one-tail critical value, so Reject H0.

Example 4 (cont’d)
(b) Two-sided 95% CIE (0.025 in each tail, t₂₄ = 2.0639):
1.16 ± .083, i.e., 1.077 ppm ↔ 1.243 ppm
Do Your Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes.
TWO-SAMPLE T-TEST
Inferences About Means from Two Independent Groups
About Homoskedasticity
• Technically, to use this formula one must know (or be able
to prove statistically) that the two variances are “equal” –
this property is called homoskedasticity.
• Incidentally, this is sometimes spelled with a c, “homoscedasticity.”
• Note that
• σ1,σ2 are not given
• n1+ n2 = 31
Company 1 Company 2
Sample average X̅ 1= $210 X̅ 2 = $175
Standard deviation s1 = $25 s2 = $20
Sample size n1 = 10 n2 = 20
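Assuming SciPy is available, a pooled-variance two-sample t test can be run directly from these summary statistics. This sketch is illustrative; the slides' critical value of ±2.7633 (28 df, α = .01 two-tailed) is the benchmark for the computed t.

```python
from scipy.stats import ttest_ind_from_stats

# Company 1: mean 210, s 25, n 10; Company 2: mean 175, s 20, n 20
t_stat, p_value = ttest_ind_from_stats(mean1=210, std1=25, nobs1=10,
                                       mean2=175, std2=20, nobs2=20,
                                       equal_var=True)   # pooled-variance t test
print(round(t_stat, 2), round(p_value, 4))   # t is about 4.2, far beyond 2.7633
```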
[Sketch: t distribution with 28 df, .005 in each tail beyond −2.7633 and +2.7633 (α = .01, two-tailed).]
Spending on Wine
• A marketer wants to determine whether men and women
spend different amounts on wine. (It is well known that
men spend considerably more on beer.)
• The researcher randomly samples 34 people (17 women
and 17 men) and finds that the average amount spent on
wine (in a year) by women is $437.47. The average
amount spent by men is $552.94.
• Given the Excel printout (next slide), is the difference
statistically significant?
• Digression – what was the input? This is the data input into MS Excel as two columns of
numbers showing how much money 17 women and 17 men spent on wine over the year:

Women (mean $437.47): $100, $250, $890, $765, $456, $356, $876, $740, $231, $222, $555, $666, $876, $10, $290, $98, $56
Men (mean $552.94): $107, $240, $880, $770, $409, $500, $800, $900, $1,000, $489, $800, $890, $770, $509, $100, $102, $134
Job Satisfaction
• Comparing men and women on job satisfaction
• 10 is the highest job satisfaction score; 0 the lowest
MEN WOMEN
7 1 1 4
8 7 10 3
6 2 3 5
5 4 4 6
6 6 1 4
5 7 1 2
6 8 2 5
9 9 3 1
8 7 5 4
• If men and women in the firm actually have the same job
satisfaction (i.e., the difference between the population means is
actually 0), the likelihood of getting a difference between two
sample means of men and women of 2.61 or greater is .0012.
This is why we will reject H0 that the two population means are
the same.
Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes and homework assignments. Solve using both the
formulas (with your calculator) and with MS Excel.
UNDERSTANDING
HYPOTHESIS TESTING
Going Deeper…
First we will take a careful look at how we set up the null and
alternate hypotheses, and the difference between tests that
approach a claim to refute it and those that substantiate the claim.
Setting up H0 and H1
o Suppose someone states that all geese are white. To refute this
statement, you only need to produce one black goose. To prove it is
true, you will need to check the color of every goose in the world.
This is why it is very important to know how to set up the null and
alternate hypotheses.
Setting up H0 and H1
Sometimes, getting H0 and H1 right will depend on who is doing the
testing
Substantiating a claim
• When it comes to substantiating a claim, a company wants to set
up the hypotheses, H0 and H1, in such a way that there is strong
evidence that its claim is accurate. The cost of being mistaken is
very high. The company wants to demonstrate that its claim is true.
It does not want the sample evidence to be inconclusive (i.e.,
having no evidence to reject the claim). This is why the company’s
claim belongs in the alternative hypothesis, H 1.
Sample evidence:
n = 64
X̅ = 220 lbs
s = 48 lbs
[Sketch: one-tail test with the rejection region beyond the critical value of 1.645.]
Refuting a claim
• Example. Same as above, but this time the research is done by
a competing firm, which wishes to refute Company A’s claim.
• H0: µ ≧ 8 years
  H1: µ < 8 years
• The critical value is −1.645. The computed Z value of −3.33 is in the region of rejection since it is < −1.645.
Thus, we Reject H0 at p < .05
H0: µ ≧ 8 years
H1: µ < 8 years
[Sketch: sampling distribution centered at the claimed mean of 8.0 years.]
X̅ =7.9 years
s = 1.2 years
n = 100 tablets
• Note that this is not very unlikely. In fact, our p-value of .20 is
more than the .05 we were willing to work with as the
(maximum) probability of rejecting H 0 when the claim really is
true.
• We set up the “straw man”, H0, but could not knock it down.
The company’s claim could be true. The sample evidence
was not able to reject the company’s claim.
Do Your Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes.
TWO-SAMPLE Z-TEST
Inferences About Means from Two Groups
E(X̅1 − X̅2) = µ1 − µ2
and σ(X̅1 − X̅2) = √( σ1²/n1 + σ2²/n2 )
Two-Sample Z-test
To Calculate Z:
Z = [ (X̅1 − X̅2) − (µ1 − µ2) ] / √( σ1²/n1 + σ2²/n2 ),  for known σ1 and σ2
To Calculate Z (cont’d)
• If σ1 and σ2 are unknown, we can use
Z = (X̅1 − X̅2) / √( s1²/n1 + s2²/n2 )
• The difference between the two sample means is -0.4 colds (4.4 – 4.8).
This difference could be a statistically significant difference or could just
be chance. A chance difference means that another researcher
comparing two other groups might find that the placebo group has fewer
colds, on average. This is why we need a statistical test.
Z = (4.4 − 4.8) / √( .7²/81 + .8²/64 ) = −.4 / √.01605 = −.4 / .127 = −3.15
−.40 ± 1.96(.127)
−.40 ± .25   Thus, the margin of sampling error is .25 colds
Men Women
X 80.0 76.5
s 10 16
n 100 64
Z = (80.0 − 76.5) / √( 10²/100 + 16²/64 ) = 3.5 / √5 = 3.5 / 2.24 = 1.56
3.5 ± 1.96(2.24)
3.5 ± 4.4   The margin of sampling error is 4.4
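A small Python sketch (illustrative) of the two-sample Z calculation used in these examples.

```python
import math

def two_sample_z(x1, s1, n1, x2, s2, n2):
    """Z = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2), for large samples."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) / se, se

z, se = two_sample_z(80.0, 10, 100, 76.5, 16, 64)
print(round(z, 2), round(se, 2))   # about 1.56-1.57 (the slide rounds the SE to 2.24 first)
print(round(3.5 - 1.96 * se, 2), round(3.5 + 1.96 * se, 2))   # 95% CIE for mu1 - mu2
```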
Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes.
UNDERSTANDING HOW
HYPOTHESIS TESTING
WORKS
Two-Sample Z Tests
Example [continued]
Here’s what we’ve been doing.
• H0: µ1 = µ2
H1: µ1 ≠ µ2
Computed: Z = (83.7 − 74.3) / √( 16²/64 + 18²/54 ) = 9.4 / √10 = 2.97
[Sketch: standard normal curve with the computed value 2.97 marked in the right tail (and −2.97 in the left tail).]
• This represents the likelihood of getting the sample
evidence - a difference of 9.4 or greater.
p-value vs. α
Do Your Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes.
Q: Why does the formula for a CIE use Ps and the formula for Zcalc
use P?
H0: P = .90
H1: P ≠ .90
a) Test at α = .05 (two-tailed critical values ±1.96, 2.5% in each tail); n = 100.
CIE: .10 ± .06
Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes and homework assignments.
TWO-SAMPLE Z TEST
FOR P
Inferences About Proportions of Two Groups
• Examples:
• You wish to compare the defective rates (a proportion) of two
companies that supply the computer chips needed for your tablet
computer.
• You want to compare the death rates for heart transplants at two
hospitals.
• You want to compare the graduation rates of two high schools in
the same area.
Z = (Ps1 − Ps2) / √[ p̄(1 − p̄)(1/n1 + 1/n2) ]
Where
• X1 represents the # of “successes” in sample 1
• X2 represents the # of “successes” in sample 2
• A “success” is the outcome you are interested in, e.g., a defective part.
• n1, n2 = the sample sizes for groups 1 and 2
• p̄ = (X1 + X2) / (n1 + n2), the pooled proportion
H0: P1=P2
H1: P1≠P2
Sample data: X1 = 77 out of n1 = 100 (Ps1 = .77); X2 = 120 out of n2 = 200 (Ps2 = .60).
p̄ = (77 + 120) / (100 + 200) = 197/300 = .657
Z = (.77 − .60) / √[ (.657)(.343)(1/100 + 1/200) ] = .17 / .058 = 2.93
Since 2.93 is in the rejection region, Reject H0.
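A minimal Python sketch (not from the slides) of the pooled two-proportion Z test applied to this example.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled two-sample Z test for proportions."""
    ps1, ps2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (ps1 - ps2) / se

# about 2.9 (the slide, rounding intermediate values, reports 2.93)
print(round(two_prop_z(77, 100, 120, 200), 2))
```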
Pooled P̄ = (100 + 44)/(400 + 200) = 144/600 = .24
H0: P1=P2
H1: P1≠P2
Death rates:
Ps1 = 10/34 = .294 (women)
Ps2 = 30/53 = .566 (men)
p̄ = (10 + 30) / (34 + 53) = 40/87 = .46
Z = (.294 − .566) / √[ (.46)(.54)(1/34 + 1/53) ] = −.272/.110 = −2.48
Conclusion: Reject H0. -2.48 is in the rejection region. The two population
proportions are not the same.
• Women have a statistically higher survival rate under adverse conditions than do men.
Possible reason given is that women have an additional layer of fat tissue which provides a
fetus (and, of course, women) with sufficient nourishment in case of a famine.
Homework
• Practice, practice, practice.
• Do lots and lots of problems. You can find these in the online
lecture notes and homework assignments.
INTRO TO
REGRESSION
Simple Linear Regression
Regression
• Using regression analysis, we can derive an equation by
which the dependent variable (Y) is expressed (and
estimated) in terms of its relationship with the
independent variable (X).
• In simple regression, there is only one independent variable (X) and one dependent
variable (Y). The dependent variable is the outcome we are trying to predict.
• In multiple regression, there are several independent variables (X1, X2, … ), and still
only one dependent variable, Y. We are trying to use the X variables to predict the Y
variable.
[Scatter plot: Y (0–80) plotted against X (0.5–5.5).]
• Yi = β 0 + β 1 Xi + εi
where,
β0 = true Y intercept for the population
β1 = true slope for the population
εi = random error in Y for observation i
Steps in Regression
1- For Xi (independent variable) and Yi (dependent variable),
Calculate:
ΣYi
ΣXi
ΣXiYi
ΣXi2
ΣYi2
Steps in Regression
3- Calculate the coefficient of determination: r2 = (r)2
0 ≤ r2 ≤ 1
This is the proportion of the variation in the dependent variable (Yi) explained by
the independent variable (Xi)
Note that you have already calculated the numerator and the denominator for parts
of r. Other than a single division operation, no new calculations are required.
BTW, r and b1 are related. If a correlation is negative, the slope term must be
negative; a positive slope means a positive correlation.
Steps in Regression
6- The regression equation (a straight line) is:
Yˆi = b0 + b1Xi
t_{n−2} = r √(n − 2) / √(1 − r²)
Steps in Regression
(c) F-test – we can do it in MS Excel
F = MS(Explained) / MS(Unexplained) = MS(Regression) / MS(Residual)
Yi Xi X iY i X i2 Y i2
2 1 2 1 4
5 2 10 4 25
8 3 24 9 64
10 4 40 16 100
15 5 75 25 225
40 15 151 55 418
Step 4 – b1 = 155/50 = 3.1. The slope is positive: there is a positive relationship between water and crop yield.
Step 5 – b0 = 40/5 − 3.1(15/5) = 8 − 9.3 = −1.3
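A short Python sketch (illustrative) that recomputes the slope and intercept for the water/crop-yield data from the raw observations.

```python
X = [1, 2, 3, 4, 5]            # amount of water
Y = [2, 5, 8, 10, 15]          # crop yield

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)

b1 = (n * sxy - sx * sy) / (n * sxx - sx**2)   # slope: 155/50 = 3.1
b0 = sy / n - b1 * sx / n                      # intercept: 8 - 3.1*3 = -1.3
print(b1, b0)
```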
Σ(Yi − Y̅)² = Σ(Ŷi − Y̅)² + Σ(Yi − Ŷi)²
Total variation in Y = Explained variation + Unexplained variation
[Sketch: scatter of Y against X with the fitted line, showing the total, explained, and unexplained deviations for a typical point.]
Explained variation: Σ(Ŷi − Y̅)² = b0ΣYi + b1ΣXiYi − (ΣYi)²/n
Unexplained variation: Σ(Yi − Ŷi)² = ΣYi² − b0ΣYi − b1ΣXiYi
In other words, 98% of the total variation in crop yield is explained by the linear
relationship of yield with amount of water used on the crop.
Answers to … ?
Regression Statistics
Multiple R 0.980498039
R Square 0.961376404
Adjusted R Square 0.953651685
Standard Error 0.626783171
Observations 7
ANOVA
df SS MS F Significance F
Regression 1 48.89285714 48.89285714 124.4545455 0.000100948
Residual 5 1.964285714 0.392857143
Total 6 50.85714286
On the other hand, if all the points are on a line, then the unexplained variation
(residual variation) is 0. This results in an F-ratio of infinity.
An F-value of, say, 30 means that the explained variation is 30 times greater than
the unexplained variation. This is not likely to be chance and the F-value will be
significant.
Education (X) 9 10 11 11 12 14 14 16 17 19 20 20
Income (‘000s) (Y) 20 22 24 23 30 35 30 29 50 45 43 70
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.860811139
R Square 0.740995817
Adjusted R Square 0.715095399
Standard Error 7.816452413
Observations 12
ANOVA
df SS MS F Significance F
The Mean Square Error (or, using Excel terminology, MS Residual) is 61.0969.
The square root of this number, 7.81645, is the standard error of estimate and is
used for confidence intervals.
The mean square (MS) is the sum of squares (SS) divided by its degrees of
freedom.
Linear Correlation
• The topic of this lecture involves measuring the strength of the linear
relationship between two random variables (each with at least an interval
scale level of measurement).
Linear Correlation
• We will use a simple formula to compute r, the correlation coefficient,
from sample data. This correlation coefficient, r, ranges from -1 to +1.
A Positive Relationship
• A correlation coefficient, r, of +1 indicates a perfect positive linear
relationship between the two variables. In fact, if we draw a scatter plot placing
all the paired sample data on a graph, all the points would lie on a straight line.
Of course, in real life, one almost never encounters perfect relationships
between variables. For instance, it is certainly true that there is a very strong
positive relationship between hours studied and grades. However, there are
other variables that affect grades as well. Two students can each spend 20
hours studying for an exam and one will get a 100 on the exam and the other
will get an 80. This indicates that there is also random variation and/or other
variables that explain performance on a test (e.g., IQ, previous knowledge, test
taking ability, etc.).
A Negative Relationship
• A correlation of -1 indicates a perfect negative
linear relationship (i.e., an inverse relationship).
In fact, in a scatter plot, all the points would lie on
a line with a downward slope.
No Linear Correlation (r = 0)
• A correlation of 0 indicates absolutely no relationship between X and Y. In
real life, correlations of 0 are very rare. You might, rather, get a
correlation of .10 and it will not be significant, i.e., it is not statistically
different from 0. (There are ways to test correlations for significance.)
• Examples:
• Poverty and crime are correlated. Which is the cause?
• 3 % of older singles suffer from chronic depression; does being single cause depression?
Perhaps, being depressed results in one being single. People do not want to marry unhappy
people.
• Cities with more cops also have more murders. Does ‘more cops’ cause ‘more murders’? If
so, get rid of the cops!
• There is a strong inverse correlation between the amount of clothing people wear and the
weather; people wear more clothing when the temperature is low and less clothing when it is
high. Therefore, a good way to make the temperature go up during a winter cold spell is for
everyone to wear very little clothing and go outside.
• There is a strong correlation between the number of umbrellas people are carrying and the
amount of rain. Thus, the way to make it rain is for all of us to go outside carrying umbrellas!
Coefficient of Determination, R2
• The coefficient of determination, R2 (in Excel, it is called R-squared) is also an
important measure. It ranges from 0 to 1.0 (or, 0% to 100%) and measures
the proportion of the variation in Y explained by X. R2 is actually equal to (r)2,
or in other words, the square of the correlation coefficient.
• When you do correlation, you generally do not worry about which is the X and
which is the Y variable since you have no interest in predicting Y from X. In
regression, where you want to see an equation relating X to Y, you must
specify which is the Y-variable (dependent variable) and which is the X-
variable (independent variable). The correlation coefficient is the same
regardless of which variable is X and which one is Y. The regression equation
will be different if you reverse the X and Y.
Coefficient of Determination, R2
• If all the points are on the line, r = 1 (or -1 if there is an inverse relationship),
then R2 is 100%. This means that all of the variation in Y is explained by
(variations in) X. This indicates that X does a perfect job in explaining Y and
there is no unexplained variation.
Y (Grade) 100 95 90 80 70 65 60 40 30 20
X (Height) 73 79 62 69 74 77 81 63 68 74
∑Xi = 720
∑Yi = 650
∑XiYi = 46,990
∑Xi2 = 52,210
∑Yi2 = 49,150
r = [ nΣXY − (ΣX)(ΣY) ] / √[ (nΣX² − (ΣX)²)(nΣY² − (ΣY)²) ]
• R2 = 1.4%
• The correlation coefficient is not significant (for now, you have to trust me on this;
there is a t-test we can learn later to test for significance). The correlation
coefficient, r, of .1189 is not significantly different from 0. Thus, there is no
relationship between height and grades. Correlation coefficients of less than .30 are
generally considered very weak and of little practical importance even if they turn
out to be significant. If you go back to the scatter plot, you will note that the X and Y
do not seem to be related.
• Keep in mind that r is based on a sample of 10 students. The population consists of millions
of students. There is sampling error in measuring r. Therefore, we need a statistical test to
determine whether the sample correlation coefficient is significantly different from 0.
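A small Python sketch (illustrative) that recomputes r for the height/grade data from the sums given above.

```python
import math

Y = [100, 95, 90, 80, 70, 65, 60, 40, 30, 20]   # grades
X = [73, 79, 62, 69, 74, 77, 81, 63, 68, 74]    # heights

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx, syy = sum(x * x for x in X), sum(y * y for y in Y)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4), round(r**2 * 100, 1))   # r ~ 0.1189, R^2 ~ 1.4%
```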
∑Xi = 62
∑Yi = 650
∑ XiYi = 4,750
∑Xi2 = 464
∑Yi2 = 49,150
• R2 = 94.09%
∑Xi=77
∑Yi=771
∑XiYi=4,867
∑Xi2 =649
∑Yi2 =56,667
• r = -.99; R2 = 98.01%
• R2 = 79.39%
1.1 Introduction
• For example, suppose that the effective life of a cutting
tool depends on the cutting speed and the tool angle. A
possible multiple regression model could be
Y = β0 + β1x1 + β2x2 + ε
where
Y – tool life
x1 – cutting speed
x2 – tool angle
1.1 Introduction
Figure 12-1 (a) The regression plane for the model E(Y) = 50 +
10x1 + 7x2. (b) The contour plot
Figure 4: Matrix of scatter plots (from Minitab) for the wire bond pull strength data in Table 12-2.
2. Analysis of Variance
(ANOVA)
One-Way Analysis of Variance
• Assumptions
• Populations are normally distributed
• Populations have equal variances
• Samples are randomly and independently drawn
Hypotheses of One-Way ANOVA
• H0: μ1 = μ2 = μ3 = … = μK
  • All population means are equal
  • i.e., no variation in means between groups
One-Way ANOVA (continued)
H0: μ1 = μ2 = μ3 = … = μK
H1: Not all μi are the same
  At least one mean is different: the null hypothesis is NOT true (variation is present between groups)
Variability
• The variability of the data is a key factor in testing the equality
of means
• In each case below, the means may look different, but a
large variation within groups in B makes the evidence
that the means are different weak
[Sketch: two panels, A and B, each showing data for groups A, B, C. In panel A the variation within groups is small; in panel B it is large.]
Partitioning the Variation
• Total variation can be split into two parts: SST = SSG + SSW
SST = Σ_{i=1}^{K} Σ_{j=1}^{ni} (x_ij − x̄)²,  where x̄ is the grand mean (mean of all data values)
[Sketch: response X for each group, with each observation compared with the grand mean.]
Within-Group Variation
Where:
SSW = Sum of squares within groups
K = number of groups
ni = sample size from group i
Xi = sample mean from group i
Xij = jth observation in group i
Within-Group Variation (continued)
SSW = Σ_{i=1}^{K} Σ_{j=1}^{ni} (x_ij − x̄_i)²
Summing the variation within each group and then adding over all groups.
Mean Square Within: MSW = SSW / (n − K)
Within-Group Variation (continued)
[Sketch: response X by group, with each observation compared with its own group mean x̄1, x̄2, x̄3.]
Between-Group Variation
SSG = Σ_{i=1}^{K} n_i (x̄_i − x̄)²
Where:
SSG = Sum of squares between groups
K = number of groups
ni = sample size from group i
xi = sample mean from group i
x = grand mean (mean of all data values)
Between-Group Variation (continued)
SSG = Σ_{i=1}^{K} n_i (x̄_i − x̄)²
Variation due to differences between groups.
Mean Square Between Groups: MSG = SSG / (K − 1)
Between-Group Variation (continued)
SSG = n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + … + nK(x̄K − x̄)²
[Sketch: group means x̄1, x̄2, x̄3 compared with the grand mean x̄.]
MST = SST / (n − 1)
MSW = SSW / (n − K)
MSG = SSG / (K − 1)
One-Way ANOVA Table

Source of Variation   SS              df     MS (Variance)       F ratio
Between Groups        SSG             K−1    MSG = SSG/(K−1)     F = MSG/MSW
Within Groups         SSW             n−K    MSW = SSW/(n−K)
Total                 SST = SSG+SSW   n−1

K = number of groups
n = sum of the sample sizes from all groups
df = degrees of freedom
One-Factor ANOVA
F Test Statistic
H0: μ1= μ2 = … = μK
H1: At least two population means are different
• Test statistic: F = MSG / MSW
  MSG is the mean square between groups
  MSW is the mean square within groups
• Degrees of freedom
• df1 = K – 1 (K = number of groups)
• df2 = n – K (n = sum of sample sizes from all groups)
Interpreting the F Statistic
Decision Rule: Reject H0 if the computed F exceeds the critical value F_{K−1, n−K} at α = .05.
[Sketch: comparative plot of the response for clubs 1, 2, and 3.]
One-Factor ANOVA Example
Computations
Pro 1 Pro 2 Pro 3 x1 = 249.2 n1 = 5
254 234 200 x2 = 226.0 n2 = 5
263 218 222
241 235 197 x3 = 205.8 n3 = 5
237 227 206 n = 15
251 216 204 x = 227.0
K=3
SSG = 5 (249.2 – 227)2 + 5 (226 – 227)2 + 5 (205.8 – 227)2 = 4716.4
SSW = (254 – 249.2)2 + (263 – 249.2)2 +…+ (204 – 205.8)2 = 1119.6
F2,12,.05 = 3.89
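Assuming SciPy is available, the one-way ANOVA for this example can be reproduced as follows (a sketch, not the original Excel output).

```python
from scipy.stats import f_oneway

pro1 = [254, 263, 241, 237, 251]
pro2 = [234, 218, 235, 227, 216]
pro3 = [200, 222, 197, 206, 204]

f_stat, p_value = f_oneway(pro1, pro2, pro3)
print(round(f_stat, 2), round(p_value, 6))
# F = (4716.4/2) / (1119.6/12) = 2358.2 / 93.3, about 25.3, well above F(2,12,.05) = 3.89
```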
ANOVA -- Single Factor:
Two-Way ANOVA
(continued)
• Assumptions: populations are normally distributed with equal variances; samples are randomly and independently drawn.
• Data layout: H blocks (rows) by K groups (columns).
• Denote the group sample means by x̄_j (j = 1, 2, …, K) and the block sample means by x̄_i (i = 1, 2, …, H).
Partition of Total Variation
• SST = SSG + SSB + SSE
Variation due to
Total Sum of
Squares (SST) = differences between
groups (SSG)
+
Variation due to
differences between
blocks (SSB)
+
The error terms are assumed Variation due to
to be independent, normally random sampling
distributed, and have the same (unexplained error)
variance
(SSE)
Between-Groups:  SSG = H · Σ_{j=1}^{K} (x̄_j − x̄)²,   df = K − 1
Between-Blocks:  SSB = K · Σ_{i=1}^{H} (x̄_i − x̄)²,   df = H − 1
Error:           SSE = Σ_{j=1}^{K} Σ_{i=1}^{H} (x_ji − x̄_j − x̄_i + x̄)²,   df = (K − 1)(H − 1)
Two-Way Mean Squares
• The mean squares are
MST = SST / (n − 1)
MSG = SSG / (K − 1)
MSB = SSB / (H − 1)
MSE = SSE / [(K − 1)(H − 1)]
Two-Way ANOVA: The F Test Statistic
F = MSG / MSE tests for differences between group means; F = MSB / MSE tests for differences between block means.

The sample mean as an estimator:
Y̅ = (1/n) Σ_{i=1}^{n} Y_i
What Makes a Good Estimator?
• Unbiasedness
• Efficiency
• Mean Square Error (MSE)
• Asymptotic properties (for large samples):
• Consistency
Unbiasedness of Estimator
E(Y̅) = μ_Y
Proof: Sample Mean is Unbiased
E(Y̅) = E[ (1/n) Σ_{i=1}^{n} Y_i ] = (1/n) Σ_{i=1}^{n} E(Y_i) = (1/n) · n·μ_Y = μ_Y
Efficiency of Estimator
Var(Y̅) = Var[ (1/n) Σ_{i=1}^{n} Y_i ] = (1/n²) Σ_{i=1}^{n} σ² = σ²/n
MSE of Estimator
MSE(Y̅) = E[(Y̅ − μ_Y)²] = Var(Y̅) + [E(Y̅) − μ_Y]²
Consistency of Estimator
P( |Y̅ − μ_Y| > ε ) → 0 as n → ∞, for any ε > 0
More on Consistency
• An unbiased estimator is not necessarily consistent – suppose we choose Y1 as the estimate of μ_Y; since E(Y1) = μ_Y it is unbiased, but plim(Y1) ≠ μ_Y.
• An unbiased estimator, W, is consistent if Var(W) → 0 as n → ∞.
• The Law of Large Numbers refers to the consistency of the sample average as an estimator for μ_Y, that is, to the fact that plim(Y̅) = μ_Y.
Z = (Y̅ − μ_Y) / (σ_Y/√n)  ~ᵃ  N(0, 1)   (asymptotically, by the Central Limit Theorem)
S² = [1/(n − 1)] Σ_{i=1}^{n} (Y_i − Y̅)²
Sampling distributions of statistics
• If we pull lots and lots of samples, we’ll get a distribution of sample statistics.
Hypothesis Tests for Qualitative Data
[Decision tree: qualitative data → one population (test of a proportion); two or more populations (test of independence).]
χ² Test of Independence
1. Shows if a relationship exists between 2 qualitative variables, but does not show causality
2. Assumptions
   • Multinomial experiment
   • All expected counts ≥ 5
χ² Test of Independence
Contingency Table of Qualitative Variables
χ² Test of Independence
Contingency Table
1. Shows # Observations From 1 Sample Jointly in 2
Qualitative Variables
Levels of variable 2 (Residence)
Disease Status   Urban   Rural   Total
Disease            63      49     112
No disease         15      33      48
Total              78      82     160
(Levels of variable 1: Disease Status)
χ² Test of Independence: Hypotheses & Statistic
1. Hypotheses
   H0: Variables are independent
   Ha: Variables are related (dependent)
2. Test statistic
   χ² = Σ_{all cells} (O_ij − E_ij)² / E_ij
   where O_ij = observed count and E_ij = expected count in cell (i, j)
Degrees of freedom: (r − 1)(c − 1), where r = number of rows and c = number of columns.
Marginal probability = 78/160
Expected count = (Row total × Column total) / Sample size
For example, the expected counts in the “No disease” row are 48×78/160 and 48×82/160.
χ² Test of Independence
Example on HIV
• You randomly sample 286 sexually active individuals and
collect information on their HIV status and History of STDs.
At the .05 level, is there evidence of a relationship?
HIV
STDs Hx No Yes Total
No 84 32 116
Yes 48 122 170
Total 132 154 286
χ² Test of Independence: Solution
H0: No relationship (the variables are independent)
Ha: There is a relationship (the variables are dependent)
α = .05
df = (2 − 1)(2 − 1) = 1
Critical value: χ² = 3.841; reject H0 if the computed χ² exceeds 3.841.
χ² Test of Independence: Solution
E(n_ij) ≥ 5 in all cells. Expected counts: E = (row total × column total)/286, e.g., 116×132/286 = 53.5, 116×154/286 = 62.5, 170×132/286 = 78.5, 170×154/286 = 91.5.

                 HIV: No             HIV: Yes
STDs Hx       Obs.    Exp.        Obs.    Exp.       Total
No              84    53.5          32    62.5         116
Yes             48    78.5         122    91.5         170
Total          132    132          154    154          286
χ² Test of Independence: Solution
χ² = Σ (O_ij − E_ij)² / E_ij
   = (84 − 53.5)²/53.5 + (32 − 62.5)²/62.5 + (48 − 78.5)²/78.5 + (122 − 91.5)²/91.5
   = 54.29
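Assuming SciPy is available, the same χ² statistic can be reproduced from the observed contingency table (a sketch; `correction=False` matches the hand calculation, which uses no continuity correction).

```python
from scipy.stats import chi2_contingency

observed = [[84, 32],    # STDs Hx: No  -> HIV No, HIV Yes
            [48, 122]]   # STDs Hx: Yes -> HIV No, HIV Yes

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof, round(p_value, 6))   # chi2 about 54.29 with 1 df
print(expected)                                 # expected counts, matching the table above
```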
χ² Test of Independence: Solution
H0: No relationship          Test statistic: χ² = 54.29
Ha: Relationship
α = .05
df = (2 − 1)(2 − 1) = 1
Critical value: 3.841
Decision: Since 54.29 > 3.841, reject H0 at α = .05.
Conclusion: There is evidence of a relationship between history of STDs and HIV status.