Learning Module - Statistics and Probability
CHAPTER 1
RANDOM VARIABLES AND PROBABILITY DISTRIBUTION
Objectives
The learner should be able to
1. illustrate a random variable;
2. distinguish between a discrete and a continuous random variable;
3. illustrate a probability distribution for a discrete random variable and its properties;
4. construct the probability mass function of a discrete random variable and its
corresponding histogram;
5. compute probabilities corresponding to a given random variable;
6. illustrate, calculate and interpret the mean and variance of a discrete random
variable; and
7. solve problems involving mean and variance of probability distribution.
Statistics
• a branch of applied mathematics concerned with collecting, organizing, and interpreting
data. The data are represented by means of graphs.
• the mathematical study of the likelihood and probability of events occurring based on
known quantitative data or a collection of data
• attempts to infer the properties of a large collection of data from inspection of a
sample of the collection thereby allowing educated guesses to be made with a
minimum of expense
A random variable is a variable that is subject to random variation so that it can take on
multiple different values, each with an associated probability.
Its values form the set of possible numerical outcomes of a random experiment.
In short : X = { 0, 1 }
So:
• Experiment : tossing a coin
• Event : Head or Tail
• Values to each event : Head = 0 ; Tail = 1
• Random Variable ( X ) : set of values
Note: Other values of choice can be used for Head and Tail. Ex. Head = 5 and Tail = 10.
Sample ( n ) – is a subset of a population. It refers to a set of data collected from a larger set
called population.
Example: In a population of 2500 people, what will be the actual sample size if it is 15% of
the total population ?
Answer : n = 0.15 ( N )
= 0.15 ( 2500 )
n = 375
Probability ( Pr )
• the chance or likelihood that a certain event will occur
• based on reasoning and written as the ratio of the number of favorable outcomes to the
number of possible outcomes:
$Pr(x) = \dfrac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}$
• expressed in fraction ( e.g. 3/4 ), in decimal ( 0.75 ) or in percentage ( 75% )
Example 1: Toss a coin 2 times. Present the discrete probability distribution using a notation
formula, a table and a graph.
Solution:
Possible outcomes (tree diagram): the 1st toss gives H or T, and each of these is followed
by H or T on the 2nd toss.
A. Notation formula
Sample space ( S ) = { HH, HT, TH, TT }
If x is the random variable for the number of heads, then x assumes the values
0 (no head), 1 (one head), and 2 (two heads).
Hence ,
• the probability of getting 0 Head : Pr ( x = no head ) = 1/4
• the probability of getting 1 Head : Pr ( x = 1 head ) = 2/4 or 1/2
• the probability of getting 2 Heads: Pr ( x = 2 heads ) = 1/4
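The enumeration above can also be reproduced programmatically. The Python sketch below is only an illustration: the function name `heads_distribution` is hypothetical, and it assumes a fair coin so that every outcome in the sample space is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(n_tosses):
    """Return {number of heads: probability} for n tosses of a fair coin."""
    # Sample space: every sequence of H/T of length n_tosses (HH, HT, TH, TT for n = 2).
    outcomes = list(product("HT", repeat=n_tosses))
    # Count how many outcomes give each number of heads.
    counts = Counter(outcome.count("H") for outcome in outcomes)
    total = len(outcomes)
    return {x: Fraction(c, total) for x, c in sorted(counts.items())}


dist = heads_distribution(2)
for x, p in dist.items():
    print(f"Pr(x = {x}) = {p}")
```

The probabilities are kept as exact fractions so that they can be compared directly with 1/4, 1/2, and 1/4, and so that their sum is exactly 1.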
B. Table
The probability distribution is constructed by listing the outcomes and determining the
probability value for each of the outcomes.
C. Graph ( histogram )
Shown below is the histogram of the discrete probability distribution.
(Histogram: the x-axis is x, the number of heads (0, 1, 2); the y-axis is Pr(X), with bars of
height 0.25, 0.50, and 0.25.)
Example 2 : In a class of 40 students, 30 students passed in all subjects, 5 failed in one
subject, 3 failed in two subjects and 2 failed in three subjects. Find the probability
distribution of the variable for number of subjects a student from the given class
has failed in.
Solution :
A. Notation Formula
Probability of failing in 0 subjects : P ( X = 0 ) = 30/40 = 0.75
Probability of failing in 1 subject : P ( X = 1 ) = 5/40 = 0.125
Probability of failing in 2 subjects : P ( X = 2 ) = 3/40 = 0.075
Probability of failing in 3 subjects : P ( X = 3 ) = 2/40 = 0.05
B. Table
x           0      1      2      3
P(X = x)    0.75   0.125  0.075  0.05
C. Graph
(Histogram: the x-axis is X, the number of failed subjects (0, 1, 2, 3); the y-axis is Pr(X),
with bars of height 0.75, 0.125, 0.075, and 0.05.)
Activity
Construct a probability distribution table for the values of the variables and
the corresponding probabilities when :
Lesson 1.3 Mean, Variance, and Standard Deviation for a Probability Distribution
Mean
The mean of the probability distribution is different from the mean on measures of
central tendency. The mean of the probability distribution is obtained as the sum of the
product of the possible outcomes and the probability of the outcome. In mathematical notation
it is represented as:
µ = Σ [x • Pr ( x )]
where:
µ - mean of the probability distribution
x - possible outcome
Pr(x) - probability of the outcome
Formula:
Variance: $s^2 = \sum [x^2 \cdot Pr(x)] - \mu^2$
Example 1:
When 3 coins were tossed once and simultaneously and the occurrences of the number
of heads were recorded, what will be the mean of the occurrences of the number of heads?
Compute also the variance and the standard deviation.
Solution:
Possible outcomes:
S = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
Note:
The probability of each of the sample points is 1/8.
Pr ( X ) X ( Number of Heads )
1/8 0 ( No Head )
3/8 1 Head
3/8 2 Heads
1/8 3 Heads
Σ Pr ( x ) = 1
Solution:
A. mean
µ = Σ [ x • Pr ( x )]
= 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8)
= 0 + 3/8 + 6/8 + 3/8
= 12/8
µ = 3/2 or 1.5
B. variance
$s^2 = \sum [x^2 \cdot Pr(x)] - \mu^2$
$= [0^2(1/8) + 1^2(3/8) + 2^2(3/8) + 3^2(1/8)] - 1.5^2$
$= [0(1/8) + 1(3/8) + 4(3/8) + 9(1/8)] - 2.25$
$= [0 + 3/8 + 12/8 + 9/8] - 2.25$
$= 24/8 - 2.25$
$= 3 - 2.25$
$s^2 = 0.75$
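C. standard deviation
$s = \sqrt{s^2}$
$= \sqrt{0.75}$
$s \approx 0.87$ (the standard deviation is never negative, so only the positive root is taken)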
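The three computations above follow one pattern, which can be sketched in Python. This is a minimal illustration under the formulas of this lesson; the function name `mean_variance_sd` and the dictionary layout are assumptions made for the example.

```python
from fractions import Fraction
from math import sqrt


def mean_variance_sd(pmf):
    """pmf maps each outcome x to Pr(x); returns (mean, variance, standard deviation)."""
    mean = sum(x * p for x, p in pmf.items())                    # mu = sum of x * Pr(x)
    variance = sum(x**2 * p for x, p in pmf.items()) - mean**2   # s^2 = sum of x^2 * Pr(x), minus mu^2
    return mean, variance, sqrt(variance)                        # s is the non-negative square root


# Number of heads when 3 coins are tossed (the table in Example 1).
three_coins = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
mean, variance, sd = mean_variance_sd(three_coins)
print(mean, variance, round(sd, 2))
```

With exact fractions the mean and variance come out as 3/2 and 3/4, matching 1.5 and 0.75 in the worked solution.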
Activity
A distribution of the number of students that obtain an average score of 85 and above
for 3 years and the corresponding probabilities is shown below. Find the mean , variance and
standard deviation of the probability.
III. A family has three children. Using B to stand for boy and G to stand for girl, find the
probability that :
IV. Construct a probability distribution table and draw a histogram for the canteen’s customers
who will order 1, 2, or 3 viands, with probabilities of 0.45, 0.35, and 0.20, respectively.
Table : Histogram:
V. A study was conducted to determine the total number of cars each household has in a
certain village. The results of the survey are shown below. Solve for the mean, variance, and
standard deviation of the distribution.
VI. The maximum life span of an iron is 5 years. Find the mean, variance, and standard
deviation given distribution of the life span below.
VII. Find the mean of the distribution if the probabilities that a family's children will include
0, 1, 2, or 3 boys are 1/5, 4/15, 1/6, and 11/30, respectively.
CHAPTER 2
NORMAL DISTRIBUTION
Objectives
The learner should be able to
1. illustrate a normal random variable and its characteristics,
2. construct a normal curve,
3. identify regions under the normal curve corresponding to different standard normal
values,
4. convert a normal random variable to a standard normal variable and vice versa, and
5. compute probabilities and percentages using the standard normal table.
2. The mean, median, and mode of a normal distribution coincide at its center and have
equal values. Note that the distribution is unimodal since it has only one mode.
3. The tail ends of the curve extend indefinitely in both directions and approach the x-axis
without touching it (they are asymptotic to the x-axis).
4. The area under a normal curve represents the total population. The total area under a
normal curve is 100% or 1.00. Each side of a normal curve measures 50% or 0.50.
The area under the normal curve indicates probability.
5. The shape of a normal curve depends on the values of the mean and standard deviation.
Values below the mean (negative standard scores) are located on the left side of the curve
and values above the mean (positive standard scores) are located on the right side of the
curve. The larger the standard deviation is, the more spread out the distribution.
Note:
The shape of a normal curve of a normal distribution depends on the following
conditions:
1. same means with different standard deviation
2. different means with different standard deviation
3. different means with same standard deviation
$Z = \dfrac{x - \bar{x}}{s}$
where: $x$ - value of the variable
$\bar{x}$ - the mean
$s$ - the standard deviation
Note:
1. The standard score (z) represents the number of standard deviations the variable x is away
from the mean.
2. The area under normal curve is always positive since there is no such thing as a
negative value for an area.
4. Add or subtract the areas taken from the table of normal distribution if necessary.
Note:
Add the values of the areas if and only if the standard scores ( Z ) are located on
opposite sides of the distribution.
Subtract the values of the areas if and only if the standard scores ( Z ) are located
on the same side of the distribution.
Example 1:
Solution:
Draw a sketch of the normal curve and shade the desired area.
From Table 1 ( Areas Under Normal Curve ):
Z      .00     .01
0.0    .0000
2.0    .4772
Area (Z = 2) = 0.4772
Example 2:
Area net = Area (Z = -3) - Area (Z = -1)   (subtract since both Z values lie on the same side)
= 0.4987 – 0.3413
= 0.1574 or 15.74 %
Example 3:
Area net = Area (Z = 2.0) + Area (Z = -1.0)   (add since the standard scores are located on opposite sides)
= 0.4772 + 0.3413
= 0.8185 or 81.85 %
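The add-or-subtract rule can be checked without a printed table. The sketch below is an illustration only: it computes the area between 0 and Z from the error function in Python's `math` module instead of Table 1, and the function names are hypothetical. Results may differ from the table in the fourth decimal place because the table entries are rounded.

```python
from math import erf, sqrt


def area_from_zero(z):
    """Area under the standard normal curve between 0 and z (always positive)."""
    return 0.5 * erf(abs(z) / sqrt(2))


def area_between(z1, z2):
    """Net area between two standard scores, using the add/subtract rule."""
    a1, a2 = area_from_zero(z1), area_from_zero(z2)
    if z1 * z2 < 0:
        return a1 + a2        # opposite sides of the mean: add the areas
    return abs(a1 - a2)       # same side of the mean (or one score is 0): subtract


print(round(area_between(-3, -1), 4))     # Example 2: both scores on the left side
print(round(area_between(-1.0, 2.0), 4))  # Example 3: scores on opposite sides
```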
Example 5:
Area (Z = -0.72) = 0.2642
Exercises
With the following given, find the net area under a normal distribution. Draw a sketch of
the normal curve and shade the desired area.
1. between Z = 0 and Z = 2.47
Lesson 2.3 Probability Distribution Curve
The Probability distribution is used to treat data for normally distributed variables. Note that
there are no gaps in a continuous distribution.
Example 1:
Example 2:
Area net = Area (Z = -1.28) + Area (Z = 0.50)   (add the two areas since they are located on opposite sides of the distribution curve)
= 0.3997 + 0.1915
= 0.5912 or 59.12 %
(Figure: normal curve shaded between Z = -1.28 and Z = 0.50.)
Example 1:
1600 Grade 11 students took a statistics examination and they obtained a mean score of
82% and a standard deviation of 5%. If the data are normally distributed, find the number of
students who obtained:
a. a score of 87% and above
b. a score of 80% to 90%
c. a passing score
Solution:
a. a score of 87% and above
$x = 87$ ; $s = 5$ ; $\bar{x} = 82$
Step 1: Transform the variable (x) into standard score (Z) if necessary.
$Z = \dfrac{x - \bar{x}}{s} = \dfrac{87 - 82}{5} = \dfrac{5}{5} = 1$
Hence: 87% is equivalent to Z = 1.00.
Step 2: Find the area of the variable in a normal distribution table : Area Z=1.0 = 0.3413
LYCEUM OF ALABANG
Step 3: Subtract the area taken from the table (0.3413) from 0.5000, which is the
total area of the right side of the distribution.
Area Net = 0.5 – 0.3413
= 0.1587 or 15.87 %
Therefore, the probability that a student obtained a score of 87% and above is 0.1587.
Step 4: Multiply the computed probability by the total number of students who took the
examination to get the number of students who got a score of 87% and above.
No. of students who got a score of 87% and above = 0.1587 x 1600
= 253.92 or 254 students
Note: The result is rounded off to a whole number since the number of students cannot be a
decimal.
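The same steps can be sketched in Python with the standard library's `statistics.NormalDist`, which evaluates the normal curve directly instead of reading Table 1. This is an illustration, not the module's method: it uses the mean of 82 and the standard deviation of 5 that appear in the solution, takes 75% as the passing score as in part c, and the variable names are assumptions. Small differences from the table-based answers come from rounding in the table.

```python
from statistics import NormalDist

mean_score, sd_score, n_students = 82, 5, 1600
scores = NormalDist(mu=mean_score, sigma=sd_score)

# Area to the right of 87, area between 80 and 90, and area to the right of 75.
p_87_up = 1 - scores.cdf(87)
p_80_to_90 = scores.cdf(90) - scores.cdf(80)
p_passing = 1 - scores.cdf(75)

for label, p in [
    ("87% and above", p_87_up),
    ("80% to 90%", p_80_to_90),
    ("75% and above", p_passing),
]:
    # Multiply the probability by the number of examinees and round to whole students.
    print(f"{label}: probability {p:.4f}, about {round(p * n_students)} students")
```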
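b. a score of 80% to 90%
Step 1: Transform the variables (x) into standard scores (Z).
$Z_1 = \dfrac{x_1 - \bar{x}}{s} = \dfrac{80 - 82}{5} = -0.4$
$Z_2 = \dfrac{x_2 - \bar{x}}{s} = \dfrac{90 - 82}{5} = 1.6$
Hence:
80% is equivalent to Z = -0.4
90% is equivalent to Z = 1.6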
Step 2: Find the area of the variables in the normal distribution table.
Area Z= -0.4 = 0.1554
Area Z= 1.6 = 0.4452
Therefore, the probability that a student obtained a score of 80% to 90% is 0.6006.
Step 3: Multiply the computed probability by the total number of students who took the
examination to get the number of students who got a score of 80% to 90%.
c. a passing score
Since the passing score is 75%, therefore the value of the variable x is 75%
$x = 75\%$ ; $s = 5\%$ ; $\bar{x} = 82\%$
Step 1: Transform the variable (x) into standard score (Z) if necessary.
$Z = \dfrac{x - \bar{x}}{s} = \dfrac{75 - 82}{5} = \dfrac{-7}{5} = -1.4$
Hence, 75% is equivalent to Z = -1.4.
Step 3: Multiply the computed probability to the total number of students who took the
examination to get the total students who got a passing score.
Example 2:
The weights of 1-year-old babies are approximately normally distributed, with a mean of
22.8 lbs and a standard deviation of about 2.15 lbs. If 164 1-year-old babies were randomly
selected, how many babies weigh at least 20 pounds?
Solution:
Given: $x = 20$ lbs ; $s = 2.15$ lbs ; $\bar{x} = 22.8$ lbs
$Z = \dfrac{x - \bar{x}}{s} = \dfrac{20 - 22.8}{2.15} = \dfrac{-2.8}{2.15} \approx -1.30$
Hence, 20 lbs is equivalent to Z = -1.30.
Since we were asked to find the number of babies that weigh at least 20 lbs, we need to
find the probability that a baby weighs 20 lbs and above. See figure
below.
Therefore, the probability of the babies who weigh 20 lbs. and above is 0.9032 or 90.32%
Step 3: Multiply the computed probability to the total number of babies to get the total
babies that weigh 20 lbs. and above.
Example 3:
The tests for an individual's intelligence quotient (IQ) are designed to be normally
distributed, with a mean of 100 and a standard deviation of 15. In 1916, psychologist Lewis M.
Terman set a guideline of 120 for "potential genius". Using this information, what percentage
of individuals are "potential geniuses"?
Solution:
Based on the information, a person can be considered a potential genius if he or she obtains a
score of 120 and above. See figure below.
$x = 120$ ; $s = 15$ ; $\bar{x} = 100$
$Z = \dfrac{x - \bar{x}}{s} = \dfrac{120 - 100}{15} = \dfrac{20}{15} \approx 1.33$
Hence, an IQ score of 120 is equivalent to Z = 1.33.
Note: No need to compute for the number of potential geniuses since only the percentage is
being asked.
CHAPTER TEST
I. Sketch the graph and find the net area under a normal curve distribution that lies between or
on :
1. Z = 0 to Z = 1.23
2. Z = 0 to Z = -1.72
3. Z = -0.34 to Z = 2.61
II. Using a standard normal distribution, find the probability of the following Z :
1. Pr (Z ≤ 2.56)
2. Pr (-1.25 ≤ Z ≤ 1.17)
3. Pr (0.25 ≤ Z ≤ 1.94)
4. Pr (-2.64 ≤ Z ≤ -1.72)
5. Pr (-2.58 ≤ Z ≤ 1.26)
1. 7,618 students took the entrance examination in a certain university and obtained a mean
score of 84%. If the scores were normally distributed with a standard deviation of 7%, how many
students obtained a score of:
a. 85% to 90%
Objectives
The learner should be able to
1. illustrate random sampling,
2. distinguish between parameter and statistic,
3. find the mean and variance of the sampling distribution of the sample mean,
4. illustrate the Central Limit Theorem, and
5. solve problems involving sampling distributions of the sample mean.
The field of inferential statistics enables you to make educated guesses about the
numerical characteristics of large groups. The logic of sampling gives you a way to test
conclusions about such groups using only a small portion of its members.
Often, researchers want to know things about populations but do not have data for
every person or thing in the population. If a company's customer service division wanted to
learn whether its customers were satisfied, it would not be practical (or perhaps even possible)
to contact every individual who purchased a product. Instead, the company might select a
sample of the population. A sample is a smaller group of members of a population selected to
represent the population. In order to use statistics to learn things about the population, the
sample must be random.
For example, say you want to know the mean income of the subscribers to a particular
magazine—a parameter of a population. You draw a random sample of 100 subscribers and
determine that their mean income is $27,500 (a statistic). You conclude that the population
mean income μ is likely to be close to $27,500 as well. This example is one of statistical
inference.
The mean of a discrete random variable X is a weighted average of the possible values
that the random variable can take. Unlike the sample mean of a group of observations, which
gives each observation equal weight, the mean of a random variable weights each outcome $x_i$
according to its probability, $p_i$. The common symbol for the mean (also known as the expected
value of X) is µ, formally defined by:
$\mu = \sum x_i p_i$
where:
$\mu$ - mean of the random variable
$x_i$ - the i-th possible outcome of the random variable
$p_i$ - probability of outcome $x_i$
The law of large numbers states that the observed sample mean from an increasingly
large number of observations of a random variable tends to approach the distribution mean.
That is, as the number of observations increases, the mean of these observations tends to get
closer and closer to the true mean of the random variable. This does not imply, however, that
short-term averages will reflect the mean.
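A small simulation can illustrate this behavior. The sketch below is a hypothetical example, not taken from the module: it rolls a simulated fair six-sided die, whose distribution mean is 3.5, and tracks the running average. The seed value is arbitrary and only makes the run repeatable.

```python
import random


def running_means(n_rolls, seed=0):
    """Return the running average after each of n_rolls rolls of a fair die."""
    rng = random.Random(seed)          # seeded generator so the run is repeatable
    total = 0
    means = []
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)     # one roll: an integer from 1 to 6
        means.append(total / i)        # average of the first i rolls
    return means


means = running_means(100_000)
for n in (10, 100, 1_000, 100_000):
    print(n, round(means[n - 1], 3))
```

The averages over the first few rolls can be far from 3.5, while the average over all 100,000 rolls is expected to be very close to it, which is the distinction between short-term and long-run averages.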
Variance
The variance measures the spread, or variability, of a distribution. For a sample of n
values it is computed as:
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
where
$s^2$ - variance
$x$ - value of the random variable
$\bar{x}$ - mean of the sample values
$n$ - number of sample values
The standard deviation is the square root of the variance:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{s^2}$
Example 1:
Problem : 5, 8, and 9 are the scores obtained by 3 selected students in a particular quiz.
Using samples of size r = 2, solve the following:
a. population mean
b. variance and standard deviation.
Solution
a. population mean
Step 1: Determine the number of samples of size r = 2 using the combination formula:
$_nC_r = \dfrac{n!}{(n - r)!\, r!}$
$_3C_2 = \dfrac{3!}{(3 - 2)!\, 2!} = \dfrac{6}{1(2)} = 3$
(there are 3 samples and the probability of each is 1/3 or 0.33)
Sample   Values   Sample mean
1        5 & 8    6.5
2        5 & 9    7
3        8 & 9    8.5
Mean of the sample means = 7.333…
Compute the sample variance ($s^2$) and standard deviation ($s = \sqrt{s^2}$) of each sample.
1. 5 & 8: $s^2 = 4.5$, $s = \sqrt{4.5} = 2.12$
2. 5 & 9: $s^2 = 8$, $s = \sqrt{8} = 2.83$
3. 8 & 9: $s^2 = 0.5$, $s = \sqrt{0.5} = 0.71$
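The listing of samples and their statistics can be sketched in Python. This is a minimal illustration that assumes sampling without replacement, as the combination formula implies; the variable names are chosen for the example. `statistics.variance` and `statistics.stdev` divide by n - 1, as in the variance formula of this lesson.

```python
from itertools import combinations
from math import comb
from statistics import mean, stdev, variance

scores = [5, 8, 9]
r = 2

# All samples of size r drawn without replacement; their count equals nCr.
samples = list(combinations(scores, r))
print("number of samples:", len(samples), "(nCr =", comb(len(scores), r), ")")

# Sample mean, sample variance (n - 1 divisor), and standard deviation of each sample.
rows = [(s, mean(s), variance(s), stdev(s)) for s in samples]
for s, m, v, sd in rows:
    print(s, m, v, round(sd, 2))

# The mean of the sample means equals the mean of the original scores.
mean_of_means = mean(m for _, m, _, _ in rows)
print("mean of sample means:", round(mean_of_means, 3))
```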
Example 2:
Problem : A population consists of 5 values: 11, 13, 15, 17, and 19. Compute the
following using samples of size r = 3:
a. population mean
Solution:
a. population mean
Step 1: Determine the number of samples of size r = 3 using the combination formula.
$_nC_r = \dfrac{n!}{(n - r)!\, r!}$
$_5C_3 = \dfrac{5!}{(5 - 3)!\, 3!} = \dfrac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1)(3 \times 2 \times 1)} = 10$
(there are 10 samples and the probability of each is 1/10 or 0.10)
Since some of the sample means occur more than once, the table below may be used to make the
solution of the population mean simpler.
Mean ($\bar{x}$)   Frequency (f)   Probability Pr($\bar{x}$)
13        1    0.10
13.67     1    0.10
14.33     2    0.20
15        2    0.20
15.67     2    0.20
16.33     1    0.10
17        1    0.10
Total          1.00
$\mu = \sum \bar{x}_i p_i$
$\mu = 15$
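The frequency table of sample means can be rebuilt with a short Python sketch. It is an illustration under the same assumption of sampling without replacement; exact fractions are used so that equal sample means are grouped reliably, and the variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [11, 13, 15, 17, 19]
r = 3

# Mean of every sample of size r, kept as an exact fraction.
sample_means = [Fraction(sum(s), r) for s in combinations(population, r)]
n_samples = len(sample_means)

# Frequency of each distinct sample mean and its probability f / (number of samples).
freq = Counter(sample_means)
print("mean    f   Pr")
for m in sorted(freq):
    print(f"{float(m):5.2f}   {freq[m]}   {freq[m] / n_samples:.2f}")

# mu = sum of (sample mean * probability); it equals the population mean.
mu = sum(m * Fraction(f, n_samples) for m, f in freq.items())
print("mean of the sample means:", mu)
```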
$s^2 = 4$, so $s = \sqrt{s^2} = \sqrt{4} = 2$
$s^2 = 9.33335$, so $s = \sqrt{s^2} = \sqrt{9.33335} = 3.0551$
Activity:
The central limit theorem is a statistical theory that states that, given a sufficiently large
sample size from a population with a finite variance, the mean of all samples from the
same population will be approximately equal to the mean of the population. If random samples
of size n are taken from a population with a specific mean ($\mu$) and standard deviation (s),
then as n increases without limit the sampling distribution of the sample mean ($\bar{x}$)
becomes approximately normal, with mean $\mu$ and standard deviation
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$.
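$z = \dfrac{\bar{x} - \mu}{s_{\bar{x}}}$
where:
$\bar{x}$ - sample mean
$\mu$ - population mean
$s_{\bar{x}}$ - standard deviation of the sample mean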
Note:
1. For any sample size n, the sampling distribution of a sample mean is a normal
distribution if the original variable is normally distributed.
2. For a sample size of 30 or more, a normal distribution can be used to approximate the
distribution of a sample mean even if the original variable is not normally distributed.
Example 1:
Problem : The mean raw score of Grade 11 students in a Statistics examination was 20 with a
standard deviation of 3. If 36 students are randomly selected, find the probability
that the mean score of the students is higher than 21.
Solution:
$s = 3$ ; $n = 36$
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{3}{\sqrt{36}} = 1/2$ or 0.5
$z = \dfrac{\bar{x} - \mu}{s_{\bar{x}}} = \dfrac{21 - 20}{0.5} = 2$
Step 5: Find the area of the variable in a normal distribution table ( Area z=2 = 0.4772 ).
Therefore, the probability that the sample has a mean raw score higher than 21 is
0.0228 or 2.28%.
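The steps of this example (standard deviation of the sample mean, z score, then the area in the right tail) can be sketched as follows. The function name is hypothetical, the standard deviation s = 3 is the value used in the solution, and `statistics.NormalDist` stands in for the normal distribution table.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_above(x_bar, mu, s, n):
    """Return (z, probability that the mean of a sample of size n exceeds x_bar)."""
    se = s / sqrt(n)                        # standard deviation of the sample mean
    z = (x_bar - mu) / se                   # standard score of the sample mean
    return z, 1 - NormalDist().cdf(z)       # area to the right of z


z, p = prob_sample_mean_above(x_bar=21, mu=20, s=3, n=36)
print(z, round(p, 4))
```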
Example 2:
Problem: The average amount of salt per cup of a certain instant noodle sold in the
market is 200 mg, with a standard deviation of 10 mg. Assume that the variable is
normally distributed. If a single cup of noodles is selected, find the probability that the
amount of salt in the noodles will be more than 210 mg.
Solution:
Step 1: Compute the standard deviation of the sample mean
$s = 10$ ; $n = 1$
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{10}{\sqrt{1}} = 10$
$\mu = 200$ ; $\bar{x} = 210$ ; $s_{\bar{x}} = 10$
$z = \dfrac{\bar{x} - \mu}{s_{\bar{x}}} = \dfrac{210 - 200}{10} = 1$
Step 5: Find the area of the variable in a normal distribution table ( Area z = 1 = 0.3413 )
Therefore, the probability of obtaining a cup of noodles that contains more than 210 mg of
salt is 0.1587 or 15.87 %.
Example 3:
Problem: The average consumption of rice of a rural male adult in a year is 96 kilos. If
the standard deviation is 20 kilos and the distribution is approximately normal, find
the probability that the mean of the sample will be less than 102 kilos in a year if a
sample of 49 male adults is chosen.
Step 1: Compute the standard deviation of the sample mean .
$s = 20$ ; $n = 49$
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{20}{\sqrt{49}} = 2.8571$
$\mu = 96$ ; $\bar{x} = 102$ ; $s_{\bar{x}} = 2.8571$
$z = \dfrac{\bar{x} - \mu}{s_{\bar{x}}} = \dfrac{102 - 96}{2.8571} = 2.1$
Step 5: Find the area of the variable in a normal distribution table ( Area z=2.1 = 0.4821 ).
Therefore, the probability that the mean consumption of a sample of 49 male adults is less
than 102 kilos of rice is 0.9821 or 98.21%.
Example 4:
Problem: The average life span of TV sets manufactured by company X is 10.5 years and the
standard deviation is 1.8 years. If a random sample of 50 TV sets is chosen, find
the probability that the mean life span of the TV sets is 10 to 11 years.
Solution:
$s = 1.8$ ; $n = 50$
$s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{1.8}{\sqrt{50}} = 0.2546$
Step 3: Compute the Z score. Since there were two ( 2 ) sample means, we are going to
compute two values of Z.
$z_1 = \dfrac{\bar{x}_1 - \mu}{s_{\bar{x}}} = \dfrac{10 - 10.5}{0.2546} = -1.96$
$z_2 = \dfrac{\bar{x}_2 - \mu}{s_{\bar{x}}} = \dfrac{11 - 10.5}{0.2546} = 1.96$
Therefore, the probability that the mean life span of the TV sets ranges from 10 to 11
years is 0.9500 or 95%.
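An alternative way to look at this example is to treat the sampling distribution of the sample mean as a normal distribution in its own right, with mean 10.5 and standard deviation 1.8/√50. The Python sketch below takes that view; it is an illustration only, and the variable names are assumptions. Because the z scores are not rounded to two decimals here, the result can differ slightly from the table-based 0.9500.

```python
from math import sqrt
from statistics import NormalDist

mu, s, n = 10.5, 1.8, 50

# Sampling distribution of the sample mean: normal with mean mu and sd s / sqrt(n).
sampling_dist = NormalDist(mu=mu, sigma=s / sqrt(n))

# Probability that the sample mean falls between 10 and 11 years.
p_between = sampling_dist.cdf(11) - sampling_dist.cdf(10)
print(round(sampling_dist.stdev, 4), round(p_between, 4))
```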
CHAPTER TEST
A. Random Sampling
1. To obtain a random sample of 25, a researcher selects every 20th hamburger to determine
the fat content of the hamburger a burger store sells. Will his sample have the
characteristic of a random sample? Explain why or why not?
1. The average age of public jeepneys plying in Metro Manila is 15 years. Assume that the
standard deviation is 4 years. If a random sample of 64 public jeepneys is chosen, find
the probability that the mean age of the jeepneys is:
a. between 12 and 19 years
b. over 18 years
CHAPTER 4
ESTIMATION OF PARAMETERS
Objectives
The learner should be able to
1. illustrate point and interval estimations,
2. distinguish between point and interval estimation,
3. identify the appropriate form of the confidence interval estimator for the population
mean ,
4. illustrate and construct a t – distribution,
5. identify regions under the t – distribution corresponding to different t-values,
6. identify point estimator for the population proportion,
7. compute for the point estimate of the population proportion, and
8. compute for the confidence interval estimate of the population proportion .
Parameter Estimation
It refers to the process of using sample data (in reliability engineering, usually times-to-
failure or success data) to estimate the parameters of the selected distribution.
A point estimator is used to estimate a population parameter; it does not provide
information as to how close the estimate is to the population parameter. An interval
estimate is obtained by subtracting or adding a value called the margin of error
from or to the point estimate.
In simple terms, any statistic can be a point estimate. A statistic is an estimator of some
parameters in a population. For example:
• The sample standard deviation, (s), is a point estimate of the population standard
deviation (σ).
Unbiased estimator
It is an accurate statistic that’s used to approximate a population parameter. “Accurate”
in this sense means that it’s neither an overestimate nor an underestimate. If an overestimate
or underestimate does happen, the mean of the difference is called a “bias.”
An estimator (e.g. the sample mean, $\bar{x}$) is an unbiased estimator of a parameter (e.g. the
population mean, $\mu$) when the mean of the statistic’s sampling distribution is equal to
the population parameter.
A researcher can obtain unbiased estimators by avoiding bias during sampling and data
collection.
For example, to figure out the average amount people spend on food per week, it is
impossible to survey the whole population of over 100 million. So, it is more convenient to take
a random sample of around 1,000. After the survey, it was found out that the average amount
people spend per week is Php 2000 per person. Is this an unbiased estimator? Possibly. It all
depends on how the sample was taken . For example:
• Were the questions unbiased? For example, an ambiguous question like “How
much do you spend on groceries a week?” might seem simple enough. But some
people could take this to mean “How much did you spend this week on
groceries?” (if it’s the middle of the month, people might spend less) or “How
much money did you spend on your household groceries this week?” (be clear
that you’re asking per person, not per household).
• Was the sample chosen in an unbiased way (i.e. a simple random sample)?
• Have any population members been excluded? For example, if you are performing
an internet survey, you may be excluding the poorest 25% of people who do not
have internet.
Example:
Problem: In the year 2015, the municipal registrar reported that the average age at marriage
for males is µ = 26.8 years old. Data on April 6, 2016 showed that the ages of 5
males getting married are 24, 28, 21, 31 and 27, while on May 3, 2016 the
ages of 8 males getting married are 20, 33, 30, 28, 35, 21, 27 and 24. Determine
the:
a. unbiased estimator
b. most efficient estimator
Solutions:
Since Population mean for male persons (𝝁)= 26.8 years old.
a. unbiased estimator
Step 1: Determine the sample mean nearest to the population mean.
Sample mean 2 of 27.25 years is nearer to the population mean of 26.8 years
(difference of 0.45) than sample mean 1 of 26.2 years (difference of 0.6). Hence, the
unbiased estimator is sample mean 2.
variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
$s^2$ (variance 1) = 14.7
$s^2$ (variance 2) $= \dfrac{(20 - 27.25)^2 + (33 - 27.25)^2 + (30 - 27.25)^2 + (28 - 27.25)^2 + (35 - 27.25)^2 + (21 - 27.25)^2 + (27 - 27.25)^2 + (24 - 27.25)^2}{8 - 1}$
$s^2$ (variance 2) $= \dfrac{203.5}{7} = 29.07$
Sample 1 is the more efficient estimator since its sample variance is smaller than the
sample variance of sample 2.
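The comparison of the two samples can be sketched in Python. This is an illustration under the criteria used in this example (nearness of the sample mean to µ, and size of the sample variance); the variable names are assumptions. `statistics.variance` divides by n - 1, matching the variance formula above.

```python
from statistics import mean, variance

population_mean = 26.8
sample_1 = [24, 28, 21, 31, 27]                  # ages recorded on April 6, 2016
sample_2 = [20, 33, 30, 28, 35, 21, 27, 24]      # ages recorded on May 3, 2016

for name, ages in (("sample 1", sample_1), ("sample 2", sample_2)):
    m = mean(ages)
    # statistics.variance uses the n - 1 divisor; for sample 2 the sum of squares is
    # 203.5, so dividing by n = 8 instead would give about 25.44.
    print(
        name,
        "| mean:", m,
        "| distance from mu:", round(abs(m - population_mean), 2),
        "| variance:", round(variance(ages), 2),
    )
```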
Interval Estimate
It is a range (interval) of values that is likely to contain the true value of the parameter.
An interval estimate is associated with the degree of confidence.
Critical value
It is a factor used to compute the margin of error to determine the interval estimate of
the population parameter.
The central limit theorem states that the sampling distribution of a statistic will be
nearly normal if the sample size is large enough.
Consider estimating and determining the sample size by applying proportions instead of
using means. Let us assume that the normal distribution can be used as an approximation to the
distribution. All outcomes are classified into one of two categories, typically referred to as
success or failure. We have independent trials; in each trial the probability of success is
denoted by p, and the probability of failure is denoted by q. If the conditions np ≥ 5 and nq ≥ 5
are both satisfied, we can use the normal distribution.
$p = \dfrac{x}{n}$ and $q = 1 - p$
where x is the number of successes in a sample of size n.
Margin of Error ( E )
$E = z_{\alpha/2} \sqrt{\dfrac{p(1 - p)}{n}}$
Example 1:
Problem : A survey conducted among Grade 12 students of Lyceum of Alabang found that
980 out of 1600 students will pursue their studies in
college.
a. Determine the proportion of students who will pursue their studies in college.
b. Find the interval estimate of the proportion, at the 95% level of confidence, of all Grade
12 students who will pursue college.
Solution:
a. $p = \dfrac{x}{n} = \dfrac{980}{1600} = 0.6125$ or 61.25%
Hence, 61.25% is the proportion of students who will pursue their studies in college.
The sample proportion is the best point estimate of the population proportion.
b. interval estimate of the proportion, at the 95% level of confidence, of all Grade 12 students
who will pursue college.
$q = 1 - p$
$= 1 - 0.6125$
$= 0.3875$
Hence, $q = 0.3875$ and $p = 0.6125$.
$E = z_{\alpha/2} \sqrt{\dfrac{p(1 - p)}{n}}$
$= 1.95996 \sqrt{\dfrac{0.6125(1 - 0.6125)}{1600}}$
$= 1.95996 \sqrt{\dfrac{0.6125(0.3875)}{1600}}$
$= 1.95996\,(0.0122)$
$E = 0.0239$
Therefore, the interval estimate is P = 0.6125 ± 0.0239 or 0.5886 < P < 0.6364.
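The point estimate, margin of error, and interval estimate of this example can be sketched in Python. The function name `proportion_interval` is hypothetical, and the critical value 1.95996 is the 95% value used in the solution.

```python
from math import sqrt


def proportion_interval(successes, n, z):
    """Return (p, E, (lower, upper)) for a sample proportion and critical value z."""
    p = successes / n                    # point estimate of the proportion
    e = z * sqrt(p * (1 - p) / n)        # margin of error
    return p, e, (p - e, p + e)          # interval estimate: p - E < P < p + E


p, e, (low, high) = proportion_interval(980, 1600, z=1.95996)
print(round(p, 4), round(e, 4), round(low, 4), round(high, 4))
```

Before relying on the normal approximation, the conditions np ≥ 5 and nq ≥ 5 should hold; here np = 980 and nq = 620.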
In the absence of p and q, we can assign the value of 0.5 to p. Since q = 1 - 0.5 or
q = 0.5, their product will be 0.25. The formula for sample size (n) is, thus,
$n = \dfrac{z_{\alpha/2}^2\,(0.25)}{E^2}$
Example 2:
Problem : A survey was conducted among 1600 Grade 12 high school students to find out who can
pursue college. Find the sample size (n) using a margin of error of 0.03 and
a confidence level of 95%. Compute the sample size if:
Solution:
$q = 1 - p$
$= 1 - 0.6125$
$= 0.3875$
Hence, $q = 0.3875$ and $p = 0.6125$.
$E = 0.03$
$n = \dfrac{z_{\alpha/2}^2\, p(1 - p)}{E^2}$
Hence, 1013 Grade 12 students who have prior information will be included in the survey.
$n = \dfrac{(1.95996)^2 (0.25)}{(0.03)^2}$
$n = 1067$
Therefore, the sample will be composed of 1,067 Grade 12 students, with no prior information
on the survey.
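Both sample-size computations follow the same formula and can be sketched in one Python function. The function name and its default of p = 0.5 (used when there is no prior information) are assumptions made for the illustration.

```python
def sample_size(z, margin, p=0.5):
    """Unrounded sample size n = z^2 * p * (1 - p) / E^2."""
    return z**2 * p * (1 - p) / margin**2


with_prior = sample_size(1.95996, 0.03, p=0.6125)   # prior estimate p = 0.6125
without_prior = sample_size(1.95996, 0.03)          # no prior information: p = 0.5
print(round(with_prior), round(without_prior))
```

The results are rounded to the nearest whole number here, as in the text; rounding up with `math.ceil` would be the more conservative choice, since it never yields a sample smaller than the formula requires.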
When the frequency distribution of a population is normal then the t-distribution can be solved
using :
Values of t can be obtained by locating the value of the degree of freedom, n-1. The
value of the degree of freedom is the number of scores that can vary after certain restrictions
are met.
Example 3:
Problem: The scores of 6 students in 5 quizzes are: 82, 85, 78, 75, 87, 73. For these scores:
n = 6, $\bar{x}$ = 80 and s = 5.59. Construct the 95% interval estimate for the mean of the
scores.
Solution:
Degree of freedom ( df ) = n -1
=6–1
= 5
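From Table 2, at the 95% level of confidence with df = 5: $t_{\alpha/2} = 2.57058$
$E = t_{\alpha/2} \cdot \dfrac{s}{\sqrt{n}}$
$E = 2.57058 \cdot \dfrac{5.59}{\sqrt{6}}$
$E = 2.57058 \cdot \dfrac{5.59}{2.4495}$
$E = 2.57058 \cdot 2.2821$
$E = 5.866$ or 5.9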
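The margin of error for this small sample can be sketched in Python. The critical value 2.57058 is taken from the text (Table 2, df = 5) and is entered by hand, because the Python standard library has no t-distribution function; the variable names are assumptions. The sketch uses the unrounded standard deviation, so its margin differs slightly from the 5.866 obtained with s rounded to 5.59.

```python
from math import sqrt
from statistics import mean, stdev

scores = [82, 85, 78, 75, 87, 73]
t_critical = 2.57058          # from Table 2: df = n - 1 = 5, 95% level of confidence

n = len(scores)
x_bar = mean(scores)          # sample mean
s = stdev(scores)             # sample standard deviation (n - 1 divisor)

margin = t_critical * s / sqrt(n)                 # E = t * s / sqrt(n)
interval = (x_bar - margin, x_bar + margin)       # x_bar - E < mu < x_bar + E
print(x_bar, round(s, 2), round(margin, 2), tuple(round(v, 1) for v in interval))
```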
Example 4:
Problem: Find the critical value if the degree of confidence is 95%.
Solution:
The margin of error, E, is the maximum likely difference, with probability 1 – α, between
the observed sample mean ($\bar{x}$) and the true value of the population mean ($\mu$). The
margin of error, which is also called the maximum error of the estimate, can be determined
using the formula:
$E = z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}$
where:
$z_{\alpha/2}$ - critical value
$\sigma$ - population standard deviation
$n$ - sample size
To calculate the margin of error when the population standard deviation is unknown,
replace the population standard deviation with the sample standard deviation. The
interval estimate can then be solved by using the formula:
$\bar{x} - E < \mu < \bar{x} + E$
Example 5:
Problem: Given the body temperature, n = 105, $\bar{x}$ = 98.10, and s = 0.61, for a degree of
confidence of 95%, find the:
a. Margin of Error
b. Interval Estimate
Solution:
a. Margin of Error
$E = 1.95996 \cdot \dfrac{0.61}{\sqrt{105}}$
$E = 1.95996 \cdot \dfrac{0.61}{10.247}$
$E = 1.95996 \cdot 0.0595$
$E = 0.117$
Therefore, the interval estimate is the interval 97.983 < µ < 98.217
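For large samples the critical value itself can be computed from the confidence level rather than read from a table. The Python sketch below does this with `statistics.NormalDist().inv_cdf`; the function name `z_interval` is hypothetical, and the sample standard deviation is used in place of the population value, as the text describes.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, s, n, confidence=0.95):
    """Return (critical value, margin of error, (lower, upper)) for a large sample."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_(alpha/2)
    margin = z * s / sqrt(n)                  # E = z * s / sqrt(n)
    return z, margin, (x_bar - margin, x_bar + margin)


z, margin, (low, high) = z_interval(98.10, 0.61, 105)
print(round(z, 5), round(margin, 3), round(low, 3), round(high, 3))
```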
( addendum )
Step 1: Find the number of observations ( n ), calculate their mean ( x ), and standard
deviation ( s ).
Using the example:
n = 40
x = 175
s = 20
Note:
1. We should use the standard deviation of the entire population, but in many cases
we won't know it.
2. We can use the standard deviation for the sample if we have enough observations
(at least n=30).
Step 2: Decide what Confidence Interval we want: 95% or 99% are common choices.
Then find the "Z" value for that Confidence Interval from TABLE 2.
Step 3: Use the Z value in this formula for the Confidence Interval
$\bar{x} \pm Z \dfrac{s}{\sqrt{n}}$
Where:
$\bar{x}$ is the mean
Z is the chosen Z-value from the table above
s is the standard deviation
n is the number of observations
And we have:
175 ± 1.960 × 20/√40
Which is:
175cm ± 6.20cm
CHAPTER TEST
A. Find Z if:
1. α = 0.10
2. α = 0.01
B. Find Z for the value corresponding to a confidence level of 90%.
1. The manager of a commercial bank wants to confirm his belief that the bank has very few
customers with regular savings accounts. Based on a survey of 150 randomly selected adults,
only 30 of them have a regular savings account. For a 99% confidence level, find the
interval estimate of the proportion of adults with a regular savings account.
CHAPTER 5
TESTS OF HYPOTHESIS
Objectives
The learner should be able to :
1. illustrate null and alternative hypotheses,
2. illustrate level of significance and rejection region,
3. formulate the appropriate null and alternative hypotheses on a population mean
and proportion,
4. identify the appropriate form of the test statistic,
5. compute the test statistic value, and
6. solve problems involving tests of hypothesis.
A hypothesis test is a statistical test that is used to determine whether there is enough
evidence in a sample of data to infer that a certain condition is true for the entire population.
A hypothesis test examines two opposing hypotheses about a population :
a. null hypothesis ( Ho )
b. alternative hypothesis ( Ha )
Null hypothesis
It is the statement being tested. Usually the null hypothesis is a statement of "no effect"
or "no difference".
Alternative hypothesis
It is the statement you want to be able to conclude is true. A common misconception is
that statistical hypothesis tests are designed to select the more likely of two hypotheses.
Instead, a test will remain with the null hypothesis until there is enough evidence (data) to
support the alternative hypothesis.
Example:
Problem: Is normal body temperature really 98.6 °F?
Solution:
Consider the population of many adults. A researcher hypothesized that the average
adult body temperature is lower than the often-advertised 98.6 degrees F. That is, the
researcher wants an answer to the question: "Is the average adult body temperature 98.6
degrees? Or is it lower?" To answer his research question, the researcher starts by assuming
that the average adult body temperature is 98.6 degrees F.
Then, the researcher goes out and tries to find evidence that refutes his initial assumption. In
doing so, he selects a random sample of 130 adults. The average body temperature of the 130
sampled adults is 98.25 degrees.
Then, the researcher uses the data he collected to make a decision about his initial
assumption. It is either likely or unlikely that the researcher would collect the evidence he did
given his initial assumption that the average adult body temperature is 98.6 degrees:
▪ If it is likely, then the researcher does not reject his initial assumption that the average
adult body temperature is 98.6 degrees. There is not enough evidence to do otherwise.
▪ If it is unlikely, then:
➢ either the researcher's initial assumption is correct and he experienced a very
unusual event;
➢ or the researcher's initial assumption is incorrect.
Types of Test
1. Two-tailed test
A test used to determine whether a population parameter has changed in either direction. The
null hypothesis can be rejected by observing a statistic that falls in either of the two tails of
the sampling distribution.
2. One-tailed test
It is used when the alternative hypothesis specifies one direction, that is, when either of the
following holds:
1. the sample data come from a population whose parameter is less than the hypothesized
value (left-tailed test)
2. the sample data come from a population whose parameter is greater than the
hypothesized value (right-tailed test)
Note:
➢ If we reject the null hypothesis, we do not prove that the alternative
hypothesis is true.
➢ If we do not reject the null hypothesis, we do not prove that the null
hypothesis is true.
We merely state that there is enough evidence to behave one way or the other. This is
always true in statistics; because of this, whatever the decision, there is always a chance
that we made an error.
• Type I error - when the null hypothesis is rejected even if it is true
• Type II error - when the null hypothesis is not rejected even if it is false
[Figures: rejection regions for the right-tailed test, the left-tailed test, and the two-tailed test]
The P-value is the probability of observing a sample statistic as extreme as the test
statistic. Since the test statistic is a z-score, the P-value is obtained from the standard
normal distribution.
If the standard deviation is not given, compute it from the sample data using the
standard deviation formula.
Every hypothesis test requires the analyst to state a null hypothesis and an
alternative hypothesis. The table below shows three sets of hypotheses. Each makes
a statement about the difference, d, between two population proportions, P1 and P2. (In
the table, the symbol ≠ means " not equal to ").
The first set of hypotheses (Set 1) is an example of a two-tailed test, since an extreme
value on either side of the sampling distribution would cause a researcher to reject the null
hypothesis. The other two sets of hypotheses (Sets 2 and 3) are one-tailed tests, since an
extreme value on only one side of the sampling distribution would cause a researcher to reject
the null hypothesis.
When the null hypothesis states that there is no difference between the two population
proportions (i.e., d = 0), the null and alternative hypothesis for a two-tailed test are often stated
in the following form.
H0: P1 = P2
Ha: P1 ≠ P2
The analysis plan describes how to use sample data to accept or reject the null
hypothesis. It should specify the following elements.
▪ Significance level. Often, researchers choose significance levels equal to 0.01, 0.05,
or 0.10; but any value between 0 and 1 can be used.
▪ Test method. Use the two-proportion z-test to determine whether the hypothesized
difference between population proportions differs significantly from the observed
sample difference.
Using sample data, complete the following computations to find the test statistic and its
associated P-Value.
➢ Pooled sample proportion. Since the null hypothesis states that P1 = P2, we use
a pooled sample proportion (p) to compute the standard error of the sampling
distribution.
p = ( p1 · n1 + p2 · n2 ) / ( n1 + n2 )
where p1 and p2 are the sample proportions, and n1 and n2 are the sample sizes.
➢ Standard Error. Compute the standard error (SE) of the sampling distribution of the
difference between two proportions.
SE = √{ p · ( 1 – p ) · [ ( 1/n1 ) + ( 1/n2 ) ] }
➢ Test Statistic. The test statistic is a z-score (z) defined by the following
equation.
z = ( p1 – p2 ) / SE
4. Interpret results.
If the sample findings are unlikely, given the null hypothesis, the researcher
rejects the null hypothesis. Typically, this involves comparing the P-value to the
significance level, and rejecting the null hypothesis when the P-value is less than
the significance level.
Example:
Problem : Suppose a drug company develops a new drug designed to prevent colds. The
company states that the drug is equally effective for men and women. To test this
claim, they choose a simple random sample of 100 women and 200 men from a
population of 100,000 volunteers. At the end of the study, 38% of the women caught
a cold; and 51% of the men caught a cold. Based on these findings, can we reject
the company's claim that the drug is equally effective for men and women? Use a
0.05 level of significance.
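Solution:
H0: P1 = P2
Ha: P1 ≠ P2
Note: These hypotheses constitute a two-tailed test. The null hypothesis will be rejected if
the proportion from population 1 is too big or if it is too small.
Pooled sample proportion:
p = ( p1 · n1 + p2 · n2 ) / ( n1 + n2 )
p = [ (0.38)(100) + (0.51)(200) ] / ( 100 + 200 )
p = 0.467
Standard error:
SE = √{ p · ( 1 – p ) · [ ( 1/n1 ) + ( 1/n2 ) ] }
SE = √[ (0.467)(0.533)( 1/100 + 1/200 ) ]
SE = √[ 0.2489 ∙ (0.015) ]
SE = √0.00373
SE = 0.061
Z score:
z = ( p1 – p2 ) / SE
z = ( 0.38 – 0.51 ) / 0.061
z = –2.13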
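Since we have a two-tailed test, the P-value is the probability that the z-score is less
than -2.13 or greater than 2.13. Thus, the P-value = 0.0166 + 0.0166 = 0.0332.
Since the P-value (0.0332) is less than the significance level (0.05), we reject the null
hypothesis that the drug is equally effective for men and women.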
Note: If you use this approach on an exam, you may also want to mention why
this approach is appropriate. Specifically, the approach is appropriate because
the sampling method was simple random sampling, the samples were
independent, each population was at least 10 times larger than its sample, and
each sample included at least 10 successes and 10 failures.
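The drug example can be reproduced with a short script, offered as a sketch and not as the module's own method. The function name two_proportion_z_test is hypothetical, and the P-value is computed from the standard normal distribution instead of a printed table, so the last digits may differ slightly from the hand-rounded figures.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-tailed z-test for H0: P1 = P2 (illustrative helper)."""
    # Pooled sample proportion under the null hypothesis
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference between the two sample proportions
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed P-value: area in both tails beyond |z|
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p, se, z, p_value


# 100 women (38% caught a cold) and 200 men (51% caught a cold)
p, se, z, p_value = two_proportion_z_test(0.38, 100, 0.51, 200)
reject_h0 = p_value < 0.05
print(f"p = {p:.3f}, SE = {se:.3f}, z = {z:.2f}, P-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```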
Hypothesis Test of a Mean will be conducted when the following conditions are met:
▪ The sampling method is simple random sampling.
▪ The sampling distribution is normal or nearly normal.
We can say that the sampling distribution will be approximately normally distributed if
any of the following conditions apply :
▪ The population distribution is normal.
▪ The population distribution is symmetric, unimodal, without outliers, and the sample
size is 15 or less.
▪ The population distribution is moderately skewed, unimodal, without outliers, and the
sample size is between 16 and 40.
▪ The sample size is greater than 40, without outliers.
➢ State the hypotheses. Every hypothesis test requires the analyst to state a null
hypothesis and an alternative hypothesis. The table below may be used.
➢ Formulate an Analysis Plan. The analysis plan describes how to use sample data
to accept or reject the null hypothesis. It should specify the following elements.
➢ Analyze Sample Data. Using sample data, conduct a one-sample t-test. This
involves finding the standard error, the degrees of freedom, the test statistic, and the
P-value.
Standard Error. When the population size is much larger (at least 20 times larger) than the
sample size, the standard error (SE) can be approximated by:
SE = s / √n
where s is the sample standard deviation and n is the sample size.
Degrees of Freedom. The degrees of freedom (DF) are equal to the sample
size (n) minus one.
DF = n - 1
Test Statistic. The test statistic is a t statistic (t) defined by the following
equation.
t = ( x̄ – µ ) / SE
P-value
P-value is the probability of observing a sample statistic as extreme as
the test statistic.
➢ Interpret Results. This involves comparing the P-value to the significance level, and
rejecting the null hypothesis when the P-value is less than the significance level.
Example:
Problem :
An elementary school has 1000 students. The principal of the school thinks that the average
IQ of students at Bon Air is at least 110. To prove her point, she administers an IQ test to 20
randomly selected students. Among the sampled students, the average IQ is 108 with a
standard deviation of 10. Based on these results, should the principal accept or reject her
original hypothesis? Assume a significance level of 0.01. (Assume that test scores in the
population of students are normally distributed.)
Solution:
Step 1: State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.
Ho: µ ≥ 110
Ha: µ < 110
Note: The hypotheses constitute a one-tailed test. The null hypothesis will be rejected
if the sample mean is too small.
SE = s / √n = 10 / √20 = 2.236
DF = n – 1 = 20 – 1 = 19
t - test statistic ( t )
t = ( x̄ – µ ) / SE
t = ( 108 – 110 ) / 2.236
t = –0.894
The observed sample mean produced a t - test statistic of -0.894. The P(t<-0.894) =
0.19. This means we would expect to find a sample mean of 108 or smaller in 19 percent of
our samples, if the true population IQ were 110. Thus the P-value in this analysis is 0.19
Step 4: Interpret results. Since the P-value (0.19) is greater than the significance level (0.01),
we cannot reject the null hypothesis.
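The IQ example can be checked numerically with the sketch below. Python's standard library has no t-distribution, so the left-tail probability is approximated here by integrating the t density with Simpson's rule. The helper names t_pdf and t_cdf are illustrative, and in practice the P-value would normally be read from a t table or a statistics package.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) with Simpson's rule on [0, |t|] and symmetry."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


# Values from the example: sample mean 108, hypothesized mean 110, s = 10, n = 20
x_bar, mu0, s, n = 108, 110, 10, 20
se = s / sqrt(n)
df = n - 1
t_stat = (x_bar - mu0) / se
p_value = t_cdf(t_stat, df)            # left-tailed test: P(t < t_stat)
reject_h0 = p_value < 0.01
print(f"SE = {se:.3f}, DF = {df}, t = {t_stat:.3f}, P-value = {p_value:.2f}")
```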
This lesson explains how to conduct a hypothesis test for the difference between two
means. The test procedure, called the two-sample t-test, is appropriate when the following
conditions are met:
▪ The sampling method for each sample is simple random sampling.
▪ The samples are independent.
▪ Each population is at least 20 times larger than its respective sample.
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. It is used for testing a hypothesis
when the:
1. population standard deviation is known
2. sample size is at least 30
It is also used when there is only one sample in the experiment and both the
standard deviation and the mean of the population are known. Likewise, the two-sample z-test
is used to compare the population means of two groups.
When the data are normally distributed we shall follow the following steps:
Example 1:
Problem:
A researcher reported that the mean grade of Grade 11 students in statistics was 84%.
A random sample of 100 students showed a mean of 87% with a standard deviation of 4%. Is
there a significant difference between the grades of Grade 11 students? Use α = 0.05.
This problem may be solved in two ways: with a two-tailed test or with a one-tailed test.
Ho: µ = 84
Ha: µ ≠ 84
Step 2: Specify the level of significance and decide whether two-tailed test, or one-
tailed test (right-tailed test or left-tailed test) shall be used, and decide the test
statistic to be used, and find the critical value from TABLE 4.
z = ( x̄ – µ ) √n / s
µ = 84
x̄ = 87
n = 100
s = 4
z = ( 87 – 84 ) √100 / 4
z = 7.5
Step 4: Graph the computed z value and critical value, and make a decision.
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed z is located in the rejection region, the null hypothesis
is rejected.
Ho: µ = 84
Ha: µ >84 (since the sample mean is 87)
Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, and decide the test
statistic to be used, and find the critical value from TABLE 4.
z = ( x̄ – µ ) √n / s
µ = 84
x̄ = 87
n = 100
s = 4
z = ( 87 – 84 ) √100 / 4
z = 7.5
Step 4: Graph the computed z value and critical value, and make a decision.
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed z is located in the rejection region, the null hypothesis
is rejected.
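Both versions of Example 1 can be sketched in a few lines. The critical values here are computed from the standard normal distribution as a substitute for looking them up in TABLE 4, and the function name z_statistic is only illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(x_bar, mu0, s, n):
    """One-sample z statistic: (x_bar - mu0) * sqrt(n) / s."""
    return (x_bar - mu0) * sqrt(n) / s


alpha = 0.05
z = z_statistic(87, 84, 4, 100)

std_normal = NormalDist()
crit_two_tailed = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96
crit_right_tailed = std_normal.inv_cdf(1 - alpha)     # about 1.645

# Reject Ho when the computed z falls in the rejection region
reject_two_tailed = abs(z) > crit_two_tailed
reject_right_tailed = z > crit_right_tailed
print(f"z = {z}, two-tailed reject: {reject_two_tailed}, "
      f"right-tailed reject: {reject_right_tailed}")
```

With z = 7.5 far beyond either critical value, the decision is the same under both formulations.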
Example 2:
Problem :
A supermarket owner believes that the mean family income of its customers is Php
45,000 per month. Forty-nine customers were randomly selected and asked their family income.
The sample mean was Php 42,200 per month and the standard deviation was Php 2,800. Is there
enough evidence to say that the mean family income differs from Php 45,000 per month at the
1% significance level?
Solution:
Ho: µ = 45,000
Ha: µ ≠45,000
Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, and decide the test
statistic to be used, and find the critical value from TABLE 4.
Critical value: ±2.575 (the value is taken from the Z tabular values in TABLE 4)
z = ( x̄ – µ ) √n / s
µ = 45,000
x̄ = 42,200
n = 49
s = 2,800
z = ( 42,200 – 45,000 ) √49 / 2,800
z = –7
Step 4: Graph the z value and critical value, and make a decision.
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed z is located in the rejection region, the null hypothesis is
rejected.
Example 3:
Problem :
The average lifetimes of 120 Brand X AA batteries and 120 Brand Y AA batteries were
found to be 9.1 hours and 9.6 hours respectively. Suppose the population standard deviations
of lifetimes are 1.9 hours for Brand X and 2.1 hours for Brand Y batteries. Test the hypothesis
that the two brands have the same mean lifetime using α = 0.05.
Ho: µ1 = µ2
Ha: µ1≠ µ2
Step 2: Specify the level of significance and decide whether two-tailed test or one-
tailed test (right-tailed test or left-tailed test) shall be used, and decide the test
statistic to be used, and find the critical value from Table 4.
x̄1 = 9.1
x̄2 = 9.6
n1 = 120
n2 = 120
s1 = 1.9
s2 = 2.1
z = ( x̄1 – x̄2 ) / √[ (s1)²/n1 + (s2)²/n2 ]
z = ( 9.1 – 9.6 ) / √[ (1.9)²/120 + (2.1)²/120 ]
z = –0.5 / √( 0.03 + 0.037 )
z = –1.93
Step 4: Graph the computed z value and critical value, and make a decision.
z = –1.93
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed z is located in the acceptance region, the null hypothesis
is not rejected.
Step 2: Specify the level of significance and decide whether two-tailed test or one-tailed
test (right-tailed test or left-tailed test) shall be used, and decide the test
statistic to be used, and find the critical value from Table 4.
x̄1 = 9.1
x̄2 = 9.6
n1 = 120
n2 = 120
s1 = 1.9
s2 = 2.1
z = ( x̄1 – x̄2 ) / √[ (s1)²/n1 + (s2)²/n2 ]
z = ( 9.1 – 9.6 ) / √[ (1.9)²/120 + (2.1)²/120 ]
z = –0.5 / √( 0.03 + 0.037 )
z = –1.93
Step 4: Graph the computed z value and critical value, and make a decision.
z = –1.93
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
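The battery example can be checked with the following sketch. The function name two_sample_z is hypothetical, the critical values are computed from the standard normal distribution in place of Table 4, and treating the second pass as a left-tailed test is an assumption, since the excerpt does not state its hypotheses.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x1, x2, s1, s2, n1, n2):
    """z statistic for the difference between two means."""
    return (x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


alpha = 0.05
z = two_sample_z(9.1, 9.6, 1.9, 2.1, 120, 120)

std_normal = NormalDist()
crit_two_tailed = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96
crit_left_tailed = std_normal.inv_cdf(alpha)          # about -1.645

reject_two_tailed = abs(z) > crit_two_tailed
reject_left_tailed = z < crit_left_tailed             # assumed left-tailed version
print(f"z = {z:.2f}")
print(f"two-tailed: reject Ho = {reject_two_tailed}")
print(f"left-tailed: reject Ho = {reject_left_tailed}")
```

The computed z of about -1.93 sits just inside the two-tailed critical values, which is why the decision depends on which form of the test is chosen.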
Example 1:
Problem:
A chemical company alleged that the average weight of its bag of chemical is 50 kg.
A sample of 25 bags was taken and revealed a mean weight of 48.1 kg with a standard
deviation of 0.9 kg. If the significance level is 1%, is there a significant difference between
the claimed and the observed mean weights of the chemical bags?
Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, decide the test
statistic to be used, find the degrees of freedom, and find the critical value
from Table 5.
x̄ = 48.1
µ = 50
n = 25
s = 0.9
t = ( x̄ – µ ) √n / s
t = ( 48.1 – 50 ) √25 / 0.9
t = –10.56
Step 4: Graph the computed t value and critical value, and make a decision.
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed t is located in the rejection region, the null hypothesis is
rejected.
Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, decide the test statistic
to be used, find the degrees of freedom, and find the critical value from Table 5.
x̄ = 48.1
µ = 50
n = 25
s = 0.9
t = ( x̄ – µ ) √n / s
t = ( 48.1 – 50 ) √25 / 0.9
t = –10.56
Step 4: Graph the computed t value and critical value, and make a decision.
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed t is located in the rejection region, the null hypothesis is
rejected.
Example 2:
Problem :
A study was conducted to examine the relationship between attitudes towards
mathematics and success at college level mathematics. Twenty-two men and twenty women
were identified as being at high risk of failure. The students were asked to respond to a series
of questions, and their answers were used to obtain a math anxiety score. Summary values
appear in the table below. Test the hypothesis using a 0.05 level of significance.
Gender    n     x̄      s
Male      22    40.8    9.3
Female    20    37.5    10.2
Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, decide the test
statistic to be used, find the degrees of freedom, and find the critical value
from Table 5.
df = ( n1 + n2 ) – 2
= ( 22 + 20 ) – 2
= 42 – 2
= 40
Critical value: +1.684 (taken from Table 5; the positive value
is used since it is a right-tailed test)
x̄1 = 40.8
x̄2 = 37.5
n1 = 22
n2 = 20
s1 = 9.3
s2 = 10.2
t = ( x̄1 – x̄2 ) / √{ [ ( ( n1 – 1 )( s1 )² + ( n2 – 1 )( s2 )² ) / ( n1 + n2 – 2 ) ] · [ ( n1 + n2 ) / ( n1 · n2 ) ] }
t = ( 40.8 – 37.5 ) / √{ [ ( ( 22 – 1 )( 9.3 )² + ( 20 – 1 )( 10.2 )² ) / ( 22 + 20 – 2 ) ] · [ ( 22 + 20 ) / ( ( 22 )( 20 ) ) ] }
t = 1.1
Step 4: Graph the computed t value and critical value, and make a decision.
t = 1.1
Since the computed t is located in the acceptance region, the null hypothesis
is not rejected.
Step 2: Specify the level of significance and decide whether two-tailed test, or one-tailed
test (right-tailed test or left-tailed test) shall be used, decide the test
statistic to be used, find the degrees of freedom, and find the critical value
from Table 5.
df = ( n1 + n2 ) – 2
= ( 22 + 20 ) – 2
= 42 – 2
= 40
Critical value: ±2.021 (taken from Table 5; one is positive and the other
is negative since it is two-tailed)
x̄1 = 40.8
x̄2 = 37.5
n1 = 22
n2 = 20
s1 = 9.3
s2 = 10.2
t = ( x̄1 – x̄2 ) / √{ [ ( ( n1 – 1 )( s1 )² + ( n2 – 1 )( s2 )² ) / ( n1 + n2 – 2 ) ] · [ ( n1 + n2 ) / ( n1 · n2 ) ] }
t = ( 40.8 – 37.5 ) / √{ [ ( ( 22 – 1 )( 9.3 )² + ( 20 – 1 )( 10.2 )² ) / ( 22 + 20 – 2 ) ] · [ ( 22 + 20 ) / ( ( 22 )( 20 ) ) ] }
t = 1.1
Step 4: Graph the computed t value and critical value, and make a decision.
t = 1.1
Note: The shaded part is the rejection region and the part that has no shade is the
acceptance region.
Since the computed t is located in the acceptance region, the null hypothesis
is not rejected.
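The pooled two-sample t computation in Example 2 can be sketched as follows. The function name pooled_t is illustrative, and the critical values 1.684 and 2.021 are simply the Table 5 values quoted in the example, entered by hand.

```python
from math import sqrt


def pooled_t(x1, x2, s1, s2, n1, n2):
    """Return (t, df) for the pooled-variance two-sample t-test."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = sqrt(pooled_var * (n1 + n2) / (n1 * n2))
    return (x1 - x2) / se, df


# Math anxiety scores: males (n = 22) and females (n = 20)
t, df = pooled_t(40.8, 37.5, 9.3, 10.2, 22, 20)

# Critical values quoted from Table 5 in the example (df = 40, alpha = 0.05)
reject_right_tailed = t > 1.684
reject_two_tailed = abs(t) > 2.021
print(f"t = {t:.1f}, df = {df}")
print(f"right-tailed reject: {reject_right_tailed}, two-tailed reject: {reject_two_tailed}")
```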
CHAPTER TEST
Apply the appropriate hypothesis testing steps and procedures for the following research
problems.
1. A supermarket owner believes that the mean income of its customers is Php 50,000 per
month. One hundred customers are randomly selected and asked about their monthly income.
The sample mean is Php 48,500 per month and the standard deviation is Php 3,200. Is there
sufficient evidence to indicate that the mean income of the customers of the supermarket is
Php 50,000 per month? Use α = 0.05.
2. It is reported that the average monthly salary of accounting graduates in the accounting
field is Php18,000. A dean of a certain university conducted a survey of 60 accounting
graduates and found their average salary at Php20,500 per month with standard deviation
of Php1,500 per month. Using α = 0.05, is there a significant difference between the
reported and the surveyed salaries of accounting graduates?
3. A prospective MBA student was made to estimate the difference in the monthly salaries of
professors in private and state colleges. He claimed that the difference in the starting
salaries of MBA graduates of the two colleges was relevant. An independent study of
simple random samples of the most recent MBA graduates of both colleges revealed the
following statistics:
4. A distributor claims that the average strength of the brand X rope exceeds the average
strength of the brand Y rope. To test its claim, 25 pieces of each brand are tested under
similar conditions. Brand X had an average strength of 90.7 kilograms with a standard
deviation of 7.82 kilograms, while brand Y has an average strength of 93.7 kilograms with
a standard deviation of 6.75 kilograms. Test whether the claim of the distributor is correct
at 5% level of significance.
CHAPTER 6
Objectives
The learner should be able to
1. construct a scatter plot,
2. describe shape, trend, and variation based on scatter plot,
3. estimate strength of association between the variables based on scatter plot,
4. calculate the Pearson’s sample correlation coefficient,
5. solve problem involving correlation analysis,
6. identify the independent and dependent variables,
7. draw the best-fit line on a scatter plot,
8. calculate the slope and y-intercept of the regression line,
9. predict the value of the dependent variable given the value of the independent
variable, and
10. solve problems involving regression analysis.
In some cases, the measurement scale for data is ordinal, but the variable is treated as
continuous. For example, a Likert scale that contains five values - strongly agree, agree,
neither agree nor disagree, disagree, and strongly disagree - is ordinal. However, where a
Likert scale contains seven or more values - strongly agree, moderately agree, agree, neither
agree nor disagree, disagree, moderately disagree, and strongly disagree - the underlying
scale is sometimes treated as continuous (although whether you should do this is a cause of
great dispute).
It is worth noting that how we categorize variables is somewhat of a choice. Whilst we
categorized gender as a dichotomous variable (you are either male or female), social scientists
may disagree with this, arguing that gender is a more complex variable involving more than
two distinctions, but also including measurement levels like gender queer, intersex and
transgender. At the same time, some researchers would argue that a Likert scale, even with
seven values, should never be treated as a continuous variable.
Independent Variable
Sometimes called an experimental or predictor variable, it is a variable that is being
manipulated in an experiment in order to observe the effect on a dependent variable.
Dependent Variable
It is sometimes called an outcome variable. The dependent variable is simply a
variable that is dependent on an independent variable(s).
All experiments examine some kind of variable(s). A variable is not only something that
we measure, but also something that we can manipulate and something we can control for.
The dependent variable is just like the name sounds; it depends upon some factor
that you, the researcher, control. For example:
Whatever event you are expecting to change is always the dependent variable. In
the first example above, race performance is the variable you would expect to change if you
changed your training, so that’s the dependent variable. In the second example, the dependent
variable is weight, and in the third example the dependent variable is the amount earned.
Example:
Imagine that a tutor asks 100 students to complete a math test. The tutor wants to
know why some students perform better than others. Whilst the tutor does not know the
answer to this, she thinks that it might be because of two reasons: (1) some students spend
more time revising for their test; and (2) some students are naturally more intelligent than
others. As such, the tutor decides to investigate the effect of revision time and intelligence on
the test performance of the 100 students. The dependent and independent variables for the
study are:
Dependent Variable: Test Mark (measured from 0 to 100)
Independent Variables: Revision time (measured in hours) Intelligence (measured
using IQ score)
• Nominal variables are variables that have two or more categories, but which do not
have an intrinsic order. For example, a real estate agent could classify their types of
property into distinct categories such as houses, condos, co-ops or bungalows. So
"type of property" is a nominal variable with 4 categories called houses, condos, co-ops
and bungalows. Of note, the different categories of a nominal variable can also be
referred to as groups or levels of the nominal variable. Another example of a nominal
variable would be classifying where people live in the USA by state. In this case there
will be many more levels of the nominal variable (50 in fact).
• Dichotomous variables are nominal variables which have only two categories or levels.
For example, if we were looking at gender, we would most probably categorize
somebody as either "male" or "female". This is an example of a dichotomous variable
(and also a nominal variable). Another example might be if we asked a person if they
owned a mobile phone. Here, we may categorise mobile phone ownership as either
"Yes" or "No". In the real estate agent example, if type of property had been classified
as either residential or commercial then "type of property" would be a dichotomous
variable.
• Ordinal variables are variables that have two or more categories, just like nominal
variables, except that the categories can also be ordered or ranked. So if you asked someone
if they liked the policies of the Democratic Party and they could answer either "Not very
much", "They are OK" or "Yes, a lot" then you have an ordinal variable. Why? Because
you have 3 categories, namely "Not very much", "They are OK" and "Yes, a lot" and
you can rank them from the most positive (Yes, a lot), to the middle response (They
are OK), to the least positive (Not very much). However, whilst we can rank the levels,
we cannot place a "value" to them; we cannot say that "They are OK" is twice as
positive as "Not very much" for example.
Continuous variables are also known as quantitative variables. Continuous variables can
be further categorized as either interval or ratio variables.
• Interval variables are variables for which their central characteristic is that they can be
measured along a continuum and they have a numerical value (for example,
temperature measured in degrees Celsius or Fahrenheit). So the difference between
20C and 30C is the same as 30C to 40C. However, temperature measured in degrees
Celsius or Fahrenheit is NOT a ratio variable.
• Ratio variables are interval variables, but with the added condition that 0 (zero) of the
measurement indicates that there is none of that variable. So, temperature measured
in degrees Celsius or Fahrenheit is not a ratio variable because 0C does not mean
there is no temperature. However, temperature measured in Kelvin is a ratio variable
as 0 Kelvin (often called absolute zero) indicates that there is no temperature
whatsoever. Other examples of ratio variables include height, mass, distance and
many more. The name "ratio" reflects the fact that you can use the ratio of
measurements. So, for example, a distance of ten meters is twice the distance of 5
meters.
variable is the exam mark (measured from 0 to 100), and the independent variables are
revision time (measured in hours) and intelligence (measured using IQ score). Here, it would be
possible to use an experimental design and manipulate the revision time of the students. The
tutor could divide the students into two groups, each made up of 50 students. In "group one",
the tutor could ask the students not to do any revision. Alternately, "group two" could be asked
to do 20 hours of revision in the two weeks prior to the test. The tutor could then compare the
marks that the students achieved.
Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms
of statistical analysis used to determine if there is a relationship between two sets of values. It
usually involves the variables X and Y.
The results from bivariate analysis can be stored in a two-column data table.
Example:
You might want to find out the relationship between the age of the students and their
academic achievement. The age would be your independent variable, X and the academic
achievement would be your dependent variable, Y.
Scatter Plot
It is a type of plot or mathematical diagram using Cartesian coordinates to display values
for typically two variables for a set of data. If the points are color-coded, one additional variable
can be displayed. The data is displayed as a collection of points, each having the value of one
variable determining the position on the horizontal axis and the value of the other variable
determining the position on the vertical axis.
A scatter plot can be used either when one continuous variable is under the control
of the experimenter and the other depends on it, or when both continuous variables are
independent. If a parameter exists that is systematically incremented and/or decremented by
the other, it is called the control parameter or independent variable and is customarily plotted
along the horizontal axis. The measured or dependent variable is customarily plotted along the
vertical axis.
A scatter plot can suggest various kinds of correlations between variables with a
certain confidence interval.
Example:
Plotting Weight vs. Height. Weight would be on y axis and height would be on the x
axis. Correlations may be positive (rising), negative (falling), or null (uncorrelated).
Pattern of dots
Positive correlation if the pattern of dots slopes from lower left to upper right.
Negative correlation if the pattern of dots slopes from upper left to lower right.
A line of best fit (or "trend" line) is a straight line that best represents the data on a
scatter plot. This line may pass through some of the points, none of the points, or all of the
points. A line of best fit can be drawn in order to study the relationship between the variables.
An equation for the correlation between the variables can be determined by established best-fit
procedures. For a linear correlation, the best-fit procedure is known as linear regression and is
guaranteed to generate a correct solution in a finite time. No universal best-fit procedure is
guaranteed to generate a correct solution for arbitrary relationships. A scatter plot is also very
useful when we wish to see how two comparable data sets agree with each other. In this case,
an identity line, i.e., a y = x line, or a 1:1 line, is often drawn as a reference. The more the
two data sets agree, the more the scatters tend to concentrate in the vicinity of the identity line;
if the two data sets are numerically identical, the scatters fall on the identity line exactly.
Correlation between sets of data is a measure of how well they are related. The most
common measure of correlation in statistics is the Pearson Correlation. Its full name is the
Pearson Product Moment Correlation, or PPMC. It measures the linear relationship between
two sets of data. In simple terms, it answers the question: Can I draw a straight line to
represent the data? Two letters are used to represent the Pearson correlation: the Greek letter
rho (ρ) for a population and the letter r for a sample. It tells you whether there is a linear
relationship between the variables. To compute the value of the Pearson Correlation, we have the
formulas:
Formula 1:
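Formula 2:
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} $$
where x̄ and ȳ are the means of the x values and the y values.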
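The worked examples in this section build a chart with xy, x², and y² columns. That layout corresponds to the computational (raw-score) form of the Pearson formula shown below. Treating it as the form intended for Formula 1 is an assumption; n denotes the number of paired observations.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```

This form is algebraically equivalent to Formula 2, so both give the same value of r for the same data.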
The result will always be between -1 and 1. You will rarely see exactly 0, -1, or 1 as a
result; usually you will get a number somewhere in between those values. The closer the value
of r gets to zero, the more widely the data points are scattered around the line of best fit.
To interpret the obtained result, the table below may be used.
Example 1:
Problem :
Researchers want to know if there is a significant relationship between a person's age
and glucose level. They use six (6) persons as their sample and obtained the data below.
Solution :
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98
2        25        59
3        36        83
4        45        70
5        50        90
6        61        85
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be
40 × 98 = 3920.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920
2        25        59                  1475
3        36        83                  2988
4        45        70                  3150
5        50        90                  4500
6        61        85                  5185
Step 3: Take the square of the numbers in the x column, and put the result in the x² column.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920    1600
2        25        59                  1475    625
3        36        83                  2988    1296
4        45        70                  3150    2025
5        50        90                  4500    2500
6        61        85                  5185    3721
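Step 4: Take the square of the numbers in the y column, and put the result in the y² column.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920    1600    9604
2        25        59                  1475    625     3481
3        36        83                  2988    1296    6889
4        45        70                  3150    2025    4900
5        50        90                  4500    2500    8100
6        61        85                  5185    3721    7225
Step 5: Add up all of the numbers in each column and put the result in the bottom row. The
Greek letter sigma (Σ) is a short way of saying "sum of" or "summation of".
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920    1600    9604
2        25        59                  1475    625     3481
3        36        83                  2988    1296    6889
4        45        70                  3150    2025    4900
5        50        90                  4500    2500    8100
6        61        85                  5185    3721    7225
Σ        257       485                 21218   11767   40199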
Step 6: Substitute the column totals into the formula and compute r.
The result is r = 0.5108, which means the variables have a high positive correlation.
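The chart steps of Example 1 can be checked with a short Python sketch. It assumes the computational (raw-score) form of the Pearson formula, which is what the xy, x², and y² columns suggest; the function and variable names are illustrative and not part of the module.

```python
import math

# Data from Example 1: age (x) and glucose level (y) for six persons.
ages = [40, 25, 36, 45, 50, 61]
glucose = [98, 59, 83, 70, 90, 85]


def pearson_r_from_sums(x, y):
    """Pearson r from the column totals built in Steps 1 to 5."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # total of the xy column
    sum_x2 = sum(a * a for a in x)              # total of the x^2 column
    sum_y2 = sum(b * b for b in y)              # total of the y^2 column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r_example1 = pearson_r_from_sums(ages, glucose)
print(round(r_example1, 4))  # 0.5108, matching the result stated in the example
```

Working from column totals mirrors the hand computation, so each intermediate sum can be compared against the chart.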
Example 2:
Problem :
Calculate the Pearson correlation coefficient for the scores obtained by 5 students in
algebra and trigonometry, as given below:
Algebra (x)        15   16   12   10   8
Trigonometry (y)   18   11   10   20   17
Solution:
Complete the table by following steps 1 to 5.
x      y      xy     x²     y²
15     18     270    225    324
16     11     176    256    121
12     10     120    144    100
10     20     200    100    400
8      17     136    64     289
Σx = 61   Σy = 76   Σxy = 902   Σx² = 789   Σy² = 1234
The result is r = –0.4241, which means the variables have a medium negative correlation.
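As a cross-check on Example 2, the sketch below computes r from deviations about the means, the form shown as Formula 2, instead of from the column totals. The function name is illustrative, not something defined in the module.

```python
import math

# Data from Example 2: algebra (x) and trigonometry (y) scores of 5 students.
algebra = [15, 16, 12, 10, 8]
trigonometry = [18, 11, 10, 20, 17]


def pearson_r_from_deviations(x, y):
    """Pearson r using deviations from the means (definitional form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r_example2 = pearson_r_from_deviations(algebra, trigonometry)
print(round(r_example2, 4))  # -0.4241, matching the result stated in the example
```

Both forms of the formula agree here, which is a useful sanity check when a hand computation looks suspicious.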
Activity:
x 5 6 4 2
y 7 3 9 8
2. The scores of 6 pupils in two subjects, Physics and Chemistry, are given below.
Calculate the coefficient of correlation.
Chemistry 45 53 67 40 35 50
Physics 68 76 70 64 54 66
Simple regression is used to examine the relationship between one dependent and one
independent variable. After performing an analysis, the regression statistics can be used to
predict the dependent variable when the independent variable is known. Regression goes beyond
correlation by adding prediction capabilities.
People use regression on an intuitive level every day, such as:
• in business, a well-dressed man is thought to be financially successful;
• a mother knows that more sugar in her children’s diet results in higher energy levels; and
• the ease of waking up in the morning often depends on how late you went to bed the
night before.
The regression line (known as the least squares line) is a plot of the expected value of
the dependent variable for all values of the independent variable. Technically, it is the line that
minimizes the sum of the squared residuals, where a residual is the vertical distance between a
data point and the line. The regression line is the one that best fits the data on a scatter plot.
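A minimal sketch of the least squares idea follows, using the standard closed-form slope and intercept for simple regression and, purely for illustration, the age and glucose data from the correlation example. The function names are hypothetical.

```python
# Age (x) and glucose level (y) data reused from the correlation example.
ages = [40, 25, 36, 45, 50, 61]
glucose = [98, 59, 83, 70, 90, 85]


def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


intercept, slope = least_squares_line(ages, glucose)
best = sum_squared_residuals(ages, glucose, intercept, slope)
# Any other line, for example one with a slightly larger slope, fits worse.
worse = sum_squared_residuals(ages, glucose, intercept, slope + 0.1)
print(best < worse)  # True
```

Comparing the fitted line against a nudged line is a simple way to see what "minimizes the squared residuals" means in practice.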
Using the regression equation, the dependent variable may be predicted from the
independent variable. The slope of the regression line (b) is defined as the rise divided by
the run. The y-intercept (a) is the point where the regression line crosses the y-axis. The
slope and y-intercept are incorporated into the regression equation. The intercept is usually
called the constant, and the slope is referred to as the coefficient. Since the regression
model is usually not a perfect predictor, there is also an error term in the equation.
In the regression equation, y is always the dependent variable and x is always the
independent variable. Here are three equivalent ways to mathematically describe a linear
regression model:
y = intercept + ( slope × x ) + error
y = constant + ( coefficient × x ) + error
y = a + bx + e
The significance of the slope of the regression line is determined from the t-statistic. The
associated probability (p-value) is the probability that a correlation coefficient at least as
extreme as the one observed would occur by chance if the true correlation is zero. Some
researchers prefer to report the F-ratio instead of the t-statistic. In simple regression, the
F-ratio is equal to the t-statistic squared.
The t-statistic for the significance of the slope is essentially a test to determine if the
regression model (equation) is usable. If the slope is significantly different from zero, then we
can use the regression model to predict the dependent variable for any value of the
independent variable.
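The module does not state the formula for this t-statistic. As a hedged illustration, the standard textbook form for simple regression expresses it through the correlation coefficient r and the number of paired observations n:

```latex
t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^{2}}}, \qquad df = n - 2, \qquad F = t^{2}
```

For instance, with r = 0.5108 and n = 6 from the age and glucose example, t ≈ 1.19 with 4 degrees of freedom, which would then be compared against the critical value from a t-table.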
$$ m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} $$
4. General form: Ax + By + C = 0
The slope indicates the steepness of a line and the intercept indicates the location
where it intersects an axis. The slope and the intercept define the linear relationship between
two variables, and can be used to estimate an average rate of change. The greater the
magnitude of the slope, the steeper the line and the greater the rate of change.
By examining the equation of a line, you can quickly discern its slope and y-intercept
(where the line crosses the y-axis).
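For a line given in the general form Ax + By + C = 0, solving for y gives y = (–A/B)x + (–C/B), so the slope is –A/B and the y-intercept is –C/B whenever B is not zero. The sketch below applies this; the function name and the sample coefficients are illustrative assumptions.

```python
def slope_intercept_from_general(a, b, c):
    """Slope and y-intercept of the line a*x + b*y + c = 0 (requires b != 0)."""
    if b == 0:
        # A vertical line has no defined slope and does not have a single y-intercept.
        raise ValueError("vertical line: slope and y-intercept are undefined")
    return -a / b, -c / b


# Illustrative line: 3x + 4y - 12 = 0
slope, intercept = slope_intercept_from_general(3, 4, -12)
print(slope, intercept)  # -0.75 3.0
```

The negative slope means the line falls from left to right, and the intercept of 3 is where it crosses the y-axis.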
$$ y = -\frac{3}{4}x + 3 $$
Example :
Problem:
A company determines that job performance for employees in a production department
can be predicted using the regression model y = 130 + 4.3x, where x is the hours of in-house
training they received (from 0 to 20) and y is their score on a job skills test. The value of the y-
intercept (130) indicates the average job skill score for an employee with no training. The value
of the slope (4.3) indicates that for each hour of training, the job skill score increases, on
average, by 4.3 points.
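The model in this example can be written as a small Python function. This is only a sketch: the function name is made up for illustration, and rejecting inputs outside 0 to 20 hours is an added assumption based on the range stated in the problem.

```python
def predict_job_skill_score(training_hours):
    """Predicted job skills test score from y = 130 + 4.3x."""
    # The model was stated for 0 to 20 hours of training; values outside
    # that range would be extrapolation, so they are rejected here.
    if not 0 <= training_hours <= 20:
        raise ValueError("model is stated only for 0 to 20 training hours")
    intercept = 130   # average score with no training
    slope = 4.3       # average gain in score per hour of training
    return intercept + slope * training_hours


print(predict_job_skill_score(0))   # 130.0, the y-intercept
print(predict_job_skill_score(10))  # 173.0
```

Each additional hour of training raises the predicted score by the slope, 4.3 points, which is the interpretation given in the example.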
Activity:
Can we predict the number of total calories based upon the total fat grams?
Solution:
1. Prepare a scatter plot of the data on graph paper.
2. Using a strand of spaghetti, position the spaghetti so that the plotted points are as close
to the strand as possible.
3. Find two points that you think will be on the "best-fit" line.
4. We are choosing the points (9, 260) and (30, 530). (You may choose different points.)
5. Calculate the slope of the line through your two points (rounded to three decimal places).
$$ m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{530 - 260}{30 - 9} = \frac{270}{21} \approx 12.857 $$
6. Write the equation of the line in point-slope form, using the point (9, 260):
y = 12.857 (x – 9) + 260
7. This equation can now be used to predict information that was not plotted in the scatter
plot.
Question a: Predict the total calories based upon 22 grams of fat.
y = 12.857 (x – 9) + 260
y = 12.857 (22 – 9) + 260
y = 12.857 (13) + 260
y = 167.141 + 260
y = 427.141 calories
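Question b: Predict the total calories based upon 18 grams of fat.
y = 12.857 (x – 9) + 260
y = 12.857 (18 – 9) + 260
y = 12.857 (9) + 260
y = 115.713 + 260
y = 375.713 calories
Question c: Predict the total calories based upon 26 grams of fat.
y = 12.857 (x – 9) + 260
y = 12.857 (26 – 9) + 260
y = 12.857 (17) + 260
y = 218.569 + 260
y = 478.569 calories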
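The hand computation above can be reproduced with a short Python sketch. It uses the two chosen points (9, 260) and (30, 530), rounds the slope to three decimal places as in step 5, and evaluates the point-slope equation; the helper name is illustrative.

```python
def line_through(point1, point2, decimals=3):
    """Return (slope, predict) for the line through two points.

    The slope is rounded before use so the results match a hand
    computation that carries only three decimal places.
    """
    (x1, y1), (x2, y2) = point1, point2
    slope = round((y2 - y1) / (x2 - x1), decimals)

    def predict(x):
        # Point-slope form anchored at the first point.
        return slope * (x - x1) + y1

    return slope, predict


slope, predict_calories = line_through((9, 260), (30, 530))
print(slope)  # 12.857
for fat_grams in (22, 18, 26):
    print(fat_grams, round(predict_calories(fat_grams), 3))
```

The three predictions come out as 427.141, 375.713, and 478.569 calories, matching the worked answers. Choosing different points on the "best-fit" line would give a slightly different slope and slightly different predictions.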
CHAPTER TEST
Problem solving.
1. To interpret the relationship between years of education and salary potential, 5 persons
were surveyed. The results obtained on their number of years of higher education (college
degree and higher) and their monthly salaries are shown below. Compute the Pearson
Product Moment Correlation Coefficient and interpret the relationship between the variables.
2. A financial analyst believes that the interest rate on bonds is inversely related to the interest
rate on loans. Hence, bonds perform well when lending rates are down, and vice versa. The
results of the observation are shown in the table below. Find the slope and y-intercept from the
data and predict the interest rate on bonds (%) when the interest rate on loans (%) is
a. 7
b. 11
c. 12
Interest Rate on Loan (%) Interest Rate on Bond (%)
10 6
5 9
8 7
6 8
8 6
Tables
Chapter 1 : Exercise
Solution :
Chapter 2 : Exercise
______3. x = 45 ; x̄ = 42 ; s = 5
______4. x = 250 ; x̄ = 255 ; s = 2
______5. x = 28 ; x̄ = 24 ; s = 2.5
IV. Draw the graph of the following probability.
1. Pr ( - 1.8 ≤ z ≤ 2.7 )
2. Pr ( 1.2 ≤ z ≤ 2.8 )
3. Pr ( z ≤ 1.5 )
V. Word problems.
1. Scores on a history test have an average of 80 with a standard deviation of 6. If there were
50 students who took the test on this subject, how many students got a score of at least 75 ?
2. The weight of chocolate bars from a particular chocolate factory has a mean of 8 ounces with
a standard deviation of 0.1 ounce. What is the z-score corresponding to a weight of 8.17
ounces?
3. Books in the library are found to have an average length of 350 pages with a standard
deviation of 100 pages. If there are 10,000 books in the library , how many books have a
corresponding length of at least 80 pages?
5. The mean growth of the thickness of trees in a forest is found to be 0.5 cm/year with a
standard deviation of 0.1 cm/year. What is the z-score corresponding to 1 cm/year?
Chapter 3 : Exercise
2. n = 6 ; r = 3
3. n = 8 ; r = 2
4. n = 9 ; r = 3
5. n = 10 ; r = 4
Solution:
Chapter 4 : Exercise
III. Find the value of the unknown with the following given:
4. α = _____________ if CI = 0.95
5. CI =_____________ if 2α = 0.01
2. Lyceum of Alabang P.E. department wants to calculate the proportion of students who have
attended a women’s basketball game at the college. They use student email addresses,
randomly choose 220 students, and email them. Of the 145 who responded, 22 had attended
a women’s basketball game.
a. What is the sample proportion of students who have attended a women’s basketball
game?
b. What is the sample proportion of students who have not attended a women’s basketball
game ?
c. Can a normal distribution be used to model the sampling distribution for the sample
proportion ? Explain.
Chapter 5 : Exercise
II. Determine if a z-test or a t-test is appropriate for each of the following. Write Z or T in the
blank before each number. If neither of the two tests is applicable, write X.
________1. s = 2.5 ; n = 50
________2. n = 15 ;
________3. s² = unknown ; n = 25
________4. s = 16 ; n = 20
________5. s = 36 ; n = 30
III. Compute for the degrees of freedom based on the following given.
1. df = ___________ if n1 = 16 and n2 = 20
2. df = ___________ if n = 28
3. df = ___________ if n1 = 24 and n2 = 26
V. Problem solving.
1. The average heart rate for Americans is 72 beats/minute. A group of 25 individuals
participated in an aerobics fitness program to lower their heart rate. After six months the
group was evaluated to identify if the program had significantly slowed their heart rate. The
mean heart rate for the group was 69 beats/minute with a standard deviation of 6.5. Was the
aerobics program effective in lowering heart rate?
Answer the following:
a. Ho: _______________________________
b. Ha: _______________________________
c. α = ________________________________
d. Test statistic ________________________
e. Tailed test __________________________
f. degrees of freedom ____________________
g. critical value _________________________
h. computed t-value _____________________
i. Graph
j. Conclusion: ______________________________________________________________
2. The amount of a certain trace element in blood is known to vary with a standard deviation
of 14.1 ppm (parts per million) for male blood donors and 9.5 ppm for female donors.
Random samples of 75 male and 50 female donors yield concentration means of 28 and 33
ppm, respectively. What is the likelihood that the population means of concentrations of the
element are the same for men and women?
Answer the following:
a. Ho: _______________________________
b. Ha: _______________________________
c. α = ________________________________
d. Test statistic ________________________
e. Tailed test __________________________
f. degrees of freedom ____________________
g. critical value _________________________
h. computed t-value _____________________
i. Graph
j. Conclusion: ______________________________________________________________
Chapter 6 : Exercise
II. Find the slope and y-intercept with the following given.
1. Find the Pearson's correlation coefficient ( r ) using the following data (α = 0.02 ;
two-tailed test ) and state the correlation.
Sample   x    y    xy   x²   y²
1        2    6
2        7    16
3        5    11
Σ
2. You have to examine the relationship between the age and price for used cars sold in the
last year by a car dealership company.
Note : Use points ( 5, 4500 ) and ( 7, 4200 ) as basis for the computation of the slope.
Find :
a. Predict the price when the car age is 6 years.
b. Predict the price when the car age is 9 years.
c. Predict the price when the car age is 15 years.