
Standard Normal Distribution

A normal distribution with mean 0 and standard deviation 1 is called the standard normal distribution. A random
variable with a standard normal distribution is typically denoted by the symbol z. The probability density of
z is given by

  f(z) = (1/√(2π)) e^(−z²/2),  −∞ < z < ∞
1. Use the Standard Normal Table to find the areas corresponding to the z values. If
necessary, use the symmetry of the normal distribution to find areas
corresponding to negative z values, and use the fact that the total area on each side of
the mean equals 0.5 to convert the areas to the probabilities of the event you have
shaded.
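Table lookups like these can also be reproduced programmatically. A minimal sketch using only the standard library's error function (the function name phi and the chosen z values are illustrative):

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z (the table's cumulative area)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area between the mean (z = 0) and z = 1.96, as read from a cumulative table:
print(round(phi(1.96) - 0.5, 4))      # ≈ 0.475

# Symmetry: the area below a negative z equals the area above its positive mirror image.
print(round(phi(-1.28), 4))           # ≈ 0.1003
print(round(1 - phi(1.28), 4))        # ≈ 0.1003
```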
Normal Curve Area
List of Population and Corresponding Sample Statistics
Sampling Distribution
A sampling distribution is the probability distribution of a sample statistic that is obtained when samples
of size n are repeatedly drawn from a population. It describes the range of possible outcomes for a
statistic, such as the mean or mode of a variable, computed from the sample.

If the sample statistic is the sample mean, then the distribution is called the sampling distribution of
the sample mean.

Estimating the mean useful life of automobiles, the mean monthly sales for all iPhone dealers
in a large city, and the mean breaking strength of new plastic are practical problems with
something in common. In each case, we are interested in making an inference about the mean
of some population.

But which sample statistic should we use to estimate the population mean:
the sample mean or the sample median? Which of these better represents the population?
The Central Limit Theorem

Consider a random sample of n observations selected from a population (any
probability distribution) with mean μ and standard deviation σ.

Then, when n is sufficiently large, the sampling distribution of x̄ will be
approximately a normal distribution with mean μ_x̄ = μ and standard deviation σ_x̄ = σ/√n.

This important theorem is called the Central Limit Theorem in statistics.

The larger the sample size, the better the normal approximation to the sampling
distribution of x̄. The required sample size depends on the shape of the
distribution of the sampled population: the greater the skewness of the sampled
population distribution, the larger the sample size
must be before the normal distribution is an
adequate approximation for the sampling
distribution of x̄. For most sampled populations,
sample sizes of n ≥ 30 will suffice for the
normal approximation to be reasonable.
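This behaviour is easy to check by simulation. A sketch assuming an exponential population (mean 1, standard deviation 1, strongly right-skewed); the seed, sample size and replication count are arbitrary choices:

```python
import random
import statistics

random.seed(1)
n, reps = 30, 5000

# Draw 5000 samples of size 30 from a skewed population and record each sample mean.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(means), 2))   # close to the population mean, 1
print(round(statistics.stdev(means), 2))   # close to sigma/sqrt(n) = 1/sqrt(30) ≈ 0.18
```

A histogram of `means` would look approximately normal even though the population is far from it.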
Characteristics of Estimators
1. Unbiasedness
• If the sampling distribution of a sample statistic has a mean equal to the population parameter
the statistic is intended to estimate, the statistic is said to be an unbiased estimate of the
parameter.
• If the mean of the sampling distribution is not equal to the parameter, the statistic is said to be a
biased estimate of the parameter.

a) Unbiased sample statistics b) Biased sample statistics


According to the Central Limit Theorem, the expected value of the possible sample means is
equal to the population mean. Thus, the sample mean is an unbiased estimator of the population
mean: for any estimator to be unbiased, its expected value must equal the parameter it
estimates.
The number of different samples of size n that can be drawn from a population of N units is

  NCn = N! / (n!(N − n)!)

and the expected value of the sample mean is the average of all NCn possible sample means:

  E(x̄) = (x̄₁ + x̄₂ + ⋯ + x̄_NCn) / NCn

Each population value appears in (N − 1)!/((n − 1)!(N − n)!) of these samples, so

  E(x̄) = [(N − 1)! · n!(N − n)!] / [(n − 1)!(N − n)! · n·N!] · (x₁ + x₂ + ⋯ + x_N)

  E(x̄) = (Σᵢ xᵢ) / N = μ
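The derivation above can be verified by brute force on a tiny population: averaging the sample means of all C(N, n) possible samples recovers the population mean exactly. A sketch with an arbitrary five-unit population:

```python
from itertools import combinations
from statistics import fmean

population = [2, 4, 6, 8, 10]     # N = 5 units, population mean = 6
n = 2

# All C(5, 2) = 10 possible samples and their means
sample_means = [fmean(s) for s in combinations(population, n)]

print(len(sample_means))          # 10
print(fmean(sample_means))        # 6.0, the population mean
```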
2. Consistency
Sample means, like other estimators, generally do not equal the population mean they estimate,
because they involve a sampling error.
For an estimator to be consistent, the probability that its estimate differs from the parameter
by more than some small value ε must become very small as the sample grows.

As the number of sample units increases, the difference between the sample mean and the
population mean, (x̄ − μ), decreases by the law of large numbers, so the sample
mean is a consistent estimator:

  P(|x̄ − μ| > ε) → 0 as n → ∞, for every ε > 0
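The shrinking probability can be illustrated by simulation. A sketch assuming a Uniform(0, 1) population (mean 0.5) and an arbitrary tolerance ε = 0.05:

```python
import random

random.seed(7)
mu, eps, reps = 0.5, 0.05, 2000

def prob_outside(n):
    """Monte Carlo estimate of P(|x_bar - mu| > eps) for samples of size n."""
    hits = sum(abs(sum(random.random() for _ in range(n)) / n - mu) > eps
               for _ in range(reps))
    return hits / reps

for n in (10, 100, 1000):
    print(n, prob_outside(n))      # the probability shrinks toward 0 as n grows
```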
3. Efficiency - Minimum Variance

The standard deviation of a sampling


distribution measures another important
property of statistics: the spread of these
estimates generated by repeated sampling.
Suppose two statistics, A and B, are both
unbiased estimators of the population parameter.
Since the means of the two sampling
distributions are the same, we turn to their
standard deviations in order to decide which will
provide estimates that fall closer to the unknown
population parameter we are estimating.
Naturally, we will choose the sample statistic
that has the smaller standard deviation.
The standard deviation σ_x̄ = σ/√n of the sampling distribution of a statistic is also called
the standard error of the statistic. This formula is valid when the
population is infinite.
If the population is finite, it is multiplied by the finite population correction factor:

  σ_x̄ = (σ/√n) · √((N − n)/(N − 1))

The finite population correction is used when the sampling ratio n/N > 0.05. When the sample is
very small relative to the population, the numerator and denominator of the correction factor converge,
so the factor is close to 1. Since populations are generally large, the denominator of the finite
population correction factor is approximately N, and it is expressed as

  √(1 − n/N)

As the population grows infinitely large, the value under the square root approaches 1.
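The two formulas can be wrapped in one helper. A sketch (the 0.05 threshold follows the rule above; the function name is illustrative):

```python
from math import sqrt

def std_error(sigma, n, N=None):
    """Standard error of the sample mean, applying the finite population
    correction when the sampling ratio n/N exceeds 0.05."""
    se = sigma / sqrt(n)
    if N is not None and n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se

print(std_error(20, 100))                    # infinite population: 2.0
print(round(std_error(20, 100, N=500), 3))   # with fpc: 2 * sqrt(400/499) ≈ 1.791
```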
4. Sufficiency
A sufficient estimator is one that is calculated from the values of all the units in the
sample.
Statistics such as the median, mode, and quartiles can be used as estimators, but they are not
sufficient, because unlike the arithmetic mean they are not based on the values of all the
units in the sample.
In this respect, the sample arithmetic mean is a sufficient estimator of the population
mean.

In sum, to make an inference about a population parameter, we use the sample statistic
with a sampling distribution that is unbiased, efficient (has a small standard deviation
-usually smaller than the standard deviation of other unbiased sample statistics),
consistent and sufficient.
Estimation of Population Mean
The population mean can be estimated in two ways:

1. Point estimation
2. Interval estimation
Point Estimation
The simplest type of statistic used to make inferences about a population parameter is a
point estimator.
• A point estimator of a population parameter is a rule or formula that tells us how to use
the sample data to calculate a single number that can be used as an estimate of the
population parameter. For example, the sample mean is a point estimator of the
population mean. Similarly, the sample variance is a point estimator of the population
variance.
  x̄ → μ        s² → σ²
• Often, many different point estimators can be found to estimate the same parameter. Each
will have a sampling distribution that provides information about the point estimator. By
examining the sampling distribution, we can determine how large the difference between
an estimate and the true value of the parameter (called the error of estimation) is likely
to be. We can also tell whether an estimator is more likely to overestimate or to
underestimate a parameter.
Interval Estimation
An estimator is expected to be unbiased, consistent, efficient, and accurate as well as
sufficient. The precision of an estimator is measured by its standard error. That is, the
sampling error showing the mean difference from the estimated population parameter
should be small. When this error is large, even if the estimator has the other properties, it
cannot give precise information about the true value of the population parameter. By adding
and subtracting a multiple of this sampling error to the point estimate, the lower and
upper limits for the population parameter are determined. This interval is called a
confidence interval.
For sample means:

  P(|x̄ − μ| ≤ k·σ_x̄) = 1 − α

α is the probability that the true population mean lies outside the estimated range, and k is the
t or z value of probability 1 − α, based on the number of sample units. From this inequality:

  −k·σ_x̄ ≤ x̄ − μ ≤ k·σ_x̄
  x̄ − k·σ_x̄ ≤ μ ≤ x̄ + k·σ_x̄
Confidence Interval for a Population Mean

Example
Q) In a factory, bulbs are produced with a standard deviation of 20 hours and an average durability of 550
hours. It is known that the durability of the bulbs is normally distributed. Quality is kept under control by randomly
selecting 100 units with a probability of 95%. In a selected sample, the average is calculated as 540 hours.

A) Find the confidence interval of the average durability of the bulbs with a probability of 95%. Are the
sample means up to the standard?
B) Find the interval for the sample means to be up to the standard.
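A worked sketch of both parts, using z = 1.96 for 95% confidence:

```python
from math import sqrt

sigma, n, xbar, mu0, z = 20, 100, 540, 550, 1.96
se = sigma / sqrt(n)                              # standard error = 2.0

# A) 95% confidence interval centred on the sample mean
print(round(xbar - z * se, 2), round(xbar + z * se, 2))   # 536.08 543.92

# B) interval of sample means considered up to the standard (centred on mu0 = 550)
print(round(mu0 - z * se, 2), round(mu0 + z * se, 2))     # 546.08 553.92
# The observed mean 540 lies below 546.08, so the sample is not up to the standard.
```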
Confidence Interval for a Population Proportion

In some cases it is necessary to estimate the proportion or number of units in the population that have a
particular characteristic. In such cases the population consists of two groups, those with and those without
a certain feature, or it is transformed into this form for the purposes of the research.
Researchers may sometimes want to treat multigroup populations as two groups. For example, while the
distribution of a class is multigroup according to the grades taken from a course (between 0 and 100), it can
be converted, when desired, into two groups: students who passed the course and students who failed it.
If A is the number of units with a certain feature in a population of N units, then the proportion of those
that have this feature is

  P = A/N        (for a sample: p = a/n)

and the proportion of those that do not have this feature is

  Q = 1 − P = (N − A)/N        (for a sample: q = 1 − p = (n − a)/n)
Example

A food-products company conducted a market study by randomly


sampling and interviewing 1,000 consumers to determine which
brand of breakfast cereal they prefer. Suppose 313 consumers were
found to prefer the company’s brand. How would you estimate the
true fraction of all consumers who prefer the company’s cereal
brand?
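A sketch of the usual large-sample estimate: the point estimate p̂ = 313/1000 with an (assumed) 95% confidence interval around it:

```python
from math import sqrt

n, a, z = 1000, 313, 1.96
p_hat = a / n
se = sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error of p_hat

print(p_hat)                                                # 0.313
print(round(p_hat - z * se, 3), round(p_hat + z * se, 3))   # ≈ 0.284 0.342
```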
Determining the Sample Size for μ and p̂

Sample Size for a Confidence Interval for μ

In order to estimate μ with a sampling error SE and with 100(1 − α)% confidence, the required
sample size is found from:

  z_{α/2} · (σ/√n) = SE

The solution for n is given by the equation:

  n = (z_{α/2} · σ / SE)²

Note: The value of σ is usually unknown. It can be estimated by the standard deviation, s, from a prior
sample. Alternatively, we may approximate the range R of observations in the population, and
(conservatively) estimate σ ≈ R/4. In any case, you should round the value of n obtained upward to
ensure that the sample size will be sufficient to achieve the specified reliability.

Sample Size for a Confidence Interval for p

In order to estimate a binomial proportion p with a sampling error SE and with 100(1 − α)% confidence,
the required sample size is found by solving the following equation:

  z_{α/2} · √(pq/n) = SE

The solution for n can be written as follows:

  n = (z_{α/2})² · pq / (SE)²

Note: Because the value of the product pq is unknown, it can be estimated by using the sample fraction
of successes from a prior sample. In any case, you should round the value of n obtained upward to
ensure that the sample size will be sufficient to achieve the specified reliability.
Example

In a region where 10000 families live, a firm will conduct a sampling


study to investigate whether there has been a significant change in its
market share, which has been 20% in recent years. What should the
sample size be to estimate the market share with a probability of 99%
with a margin of 0.05?
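A sketch of the sample-size formula for a proportion, using z_{α/2} ≈ 2.576 for 99% confidence and the historical share p = 0.20 as the planning value:

```python
from math import ceil

z, p, E = 2.576, 0.20, 0.05   # E is the allowed margin (sampling error)
n0 = z ** 2 * p * (1 - p) / E ** 2

print(ceil(n0))   # 425 families; n/N = 425/10000 < 0.05, so no finite correction is needed
```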
Confidence Interval for a Population Variance
Intuitively, it seems reasonable to use the sample variance, s², to estimate σ². However, unlike with sample
means and proportions, the sampling distribution of s² does not follow a normal (z) distribution or a Student's t-
distribution. Rather, when certain assumptions are satisfied, the sampling distribution of s² possesses
approximately a chi-square (χ²) distribution. The chi-square probability distribution, like the t-distribution, is
characterized by a quantity called the degrees of freedom (df) associated with the distribution. Several chi-square
distributions with different df values are shown in the figure below. You can see that unlike z- and t-distributions,
the chi-square distribution is not symmetric about 0.
Critical Values of χ²
Example
The number of supermarkets in Myanmar’s most populated cities is increasing and market competition is also high.
Kaggle has published a study regarding the growth of supermarkets. A three-month dataset has been collected based
on the historical sales of a supermarket company located at Mandalay. The product lines under consideration in the
study are electronic accessories, fashion accessories, food and beverages, health and beauty, home and lifestyle, and
sports and travel. To analyze the average unit price of all the electronic accessories, a random sample of eight
accessories’ unit prices (in Kyat) paid by cash are listed in the following table.

a) Identify the target parameter for this study.


b) Compute a point estimate of the target parameter.
c) What is the problem with using the normal (z) statistic to find a confidence interval for the target parameter?
d) Find a 95% confidence interval for the target parameter.
e) Give a practical interpretation of the interval, part d.
f) What conditions must be satisfied for the interval, part d, to be valid?
g) How many electronic accessories’ unit prices would need to be sampled in order to reduce the width of the
confidence interval to 5 Kyat?
h) The analyst wants to know the variation of the unit prices of the electronic accessories at Mandalay. Provide the
researchers with an estimate of the target parameter using a 99% confidence interval.
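The table of eight unit prices is not reproduced above, so the sketch below substitutes hypothetical values purely to show the mechanics of parts (b), (d) and (h); the t and χ² critical values are the usual table entries for df = 7.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical stand-in for the eight sampled unit prices (Kyat)
prices = [120, 95, 210, 150, 180, 130, 160, 175]
n = len(prices)
xbar, s = mean(prices), stdev(prices)
print(xbar)                                   # point estimate of the mean (part b)

# d) 95% t-interval for the mean, t(0.025, df = 7) = 2.365
t_crit = 2.365
half = t_crit * s / sqrt(n)
print(round(xbar - half, 1), round(xbar + half, 1))

# h) 99% chi-square interval for the variance, df = 7
chi_hi, chi_lo = 20.278, 0.989                # chi-square at 0.005 and 0.995
print(round((n - 1) * s ** 2 / chi_hi, 1), round((n - 1) * s ** 2 / chi_lo, 1))
```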
Tests of Hypotheses

• We define two hypotheses:


• (1) The null hypothesis, denoted H0, is that which represents the status quo to the party
performing the sampling experiment—the hypothesis that will be accepted unless the data
provide convincing evidence that it is false. This usually represents the “status quo” or some
claim about the population parameter that the researcher wants to test.

  H₀: μ = ....    H₀: μ ≤ ....    H₀: μ ≥ ....

• (2) The alternative (research) hypothesis, denoted H a or H1 is that which will be accepted
only if the data provide convincing evidence of its truth. This usually represents the values of a
population parameter for which the researcher wants to gather evidence to support.

  Hₐ or H₁: μ ≠ ....    H₁: μ > ....    H₁: μ < ....


Suppose building specifications in a certain city require that the average breaking strength of residential sewer
pipe be more than 2,400 pounds per foot of length (i.e., per linear foot). Each manufacturer who wants to sell pipe
in this city must demonstrate that its product meets the specification. We want to decide whether the mean
breaking strength of the pipe exceeds 2,400 pounds per linear foot.

  H₀: μ ≤ 2400  (i.e., the manufacturer's pipe does not meet specifications)

  Hₐ: μ > 2400  (i.e., the manufacturer's pipe meets specifications)
Type I and Type II Errors

• There are two kinds of errors in hypothesis testing. The error of rejecting the null hypothesis
although it is true, is called type I error and its probability is α .
• A different error occurs if a false null hypothesis is accepted. This error is called type II error and
its probability is denoted by β. The size of α (i.e., the significance level) is chosen by the
researcher. The size of β depends on the true mean, which we do not know, and on α. By
decreasing α, the probability β of the type II error increases. The probability (1 − β) is the
probability that a false null hypothesis is rejected, which is what we want. This is called the power of a test,
and it is an important property of a test. By decreasing α, the power of the test also decreases.
Thus, there is a tradeoff between α and β. As already mentioned, the error probability α should not
be too small; otherwise the test loses its power to reject H0 when it is false. Both α and β can only
be reduced by increasing the sample size n.
Type II error β (dependent on α and μ)

• The value of α cannot be calculated or statistically justified, it must be determined by the


researcher. For this, the researcher should take into account the consequences (risks and
opportunities) of alternative decisions. If the costs of a type I error are high, α should be small.
Alternatively, if the costs of a type II error are high, α should be larger, and thus β smaller. This
increases the power of the test.
One-tailed (or one-sided) Statistical Test Two-tailed (or two-sided) Statistical Test

The rejection region for a two-tailed test differs from that for
a one-tailed test. When we are trying to detect departure
from the null hypothesis in either direction, we must
establish a rejection region in both tails of the sampling
distribution of the test statistic.
Significance Levels: p-Values
Steps for Calculating the p-Value for a Test of Hypothesis
1. Determine the value of the test statistic z corresponding to the result of the sampling experiment.
2. a. If the test is one-tailed, the p-value is equal to the tail area beyond z in the same direction as the alternative
hypothesis. Thus, if the alternative hypothesis is of the form >, the p-value is the area to the right of, or above, the
observed z-value. Conversely, if the alternative is of the form <, the p-value is the area to the left of, or below, the
observed z-value.

b. If the test is two-tailed, the p-value is equal to twice the tail area beyond the observed z-value in the direction of the
sign of z—that is, if z is positive, the p-value is twice the area to the right of, or above, the observed z-value. Conversely,
if z is negative, the p-value is twice the area to the left of, or below, the observed z-value.
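The steps above can be sketched directly from the standard normal CDF (the function names are illustrative):

```python
from math import erf, sqrt

def phi(z):
    """P(Z < z) for a standard normal Z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(z, alternative):
    """alternative is '>', '<' or '!=', matching the form of Ha."""
    if alternative == '>':
        return 1 - phi(z)            # area to the right of z
    if alternative == '<':
        return phi(z)                # area to the left of z
    return 2 * (1 - phi(abs(z)))     # two-tailed: twice the tail area beyond |z|

print(round(p_value(2.12, '>'), 4))    # ≈ 0.017
print(round(p_value(2.12, '!='), 4))   # ≈ 0.034
```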
Test of Hypothesis About a Population Mean: Normal (z) Statistic
• When testing a hypothesis about a population mean μ, the test statistic we use will depend on whether the sample
size n is large (say, n > 30) or small, and whether or not we know the value of the population standard deviation, σ.
When the sample size is large, the Central Limit Theorem guarantees that the sampling distribution of x̄ is
approximately normal. Consequently, the test statistic for a test based on large samples will be based on the normal z-
statistic. Although the z-statistic requires that we know the true population standard deviation σ, we rarely if ever
know σ. When n is large, the sample standard deviation s provides a good approximation to σ, and the z-statistic can be
approximated as follows:

  z = (x̄ − μ₀) / (σ/√n) ≈ (x̄ − μ₀) / (s/√n)

where μ₀ represents the value of μ specified in the null hypothesis.


Test of Hypothesis About a Population Mean: Student's t-Statistic
When we are faced with making inferences about a population mean using the information in a small sample, two
problems emerge:
1. The normality of the sampling distribution for x̄ does not follow from the Central Limit Theorem when the
sample size is small. We must assume that the distribution of measurements from which the sample was selected
is approximately normally distributed in order to ensure the approximate normality of the sampling distribution
of x̄.
2. If the population standard deviation σ is unknown, as is usually the case, then we cannot assume that s will
provide a good approximation for σ when the sample size is small. Instead, we must use the t-distribution rather
than the standard normal z-distribution to make inferences about the population mean μ.

Therefore, as the test statistic of a small-sample test of a population mean, we use the t-statistic:

  t = (x̄ − μ₀) / (s/√n)

where μ₀ is the null hypothesized value of the population mean μ.


To find the rejection region, we must specify the value of α, the probability that the test will lead to rejection of the
null hypothesis when it is true, and then consult the t-table.
The technique for conducting a small-sample test of hypothesis about a population mean is summarized in the
following box.
Using the p-value
• The test procedure may be simplified by using a p-value approach instead of the critical value
approach. The p-value (probability value) for our empirical t-statistic is the probability of observing a t-
value more distant from the null hypothesis than our empirical value t_emp if H0 is true:

• The p-value is also referred to as the empirical significance level. In SPSS, the p-value is called
“significance” or “sig”. It tells us the exact significance level of a test statistic, while the classical test
only gives us a “black and white” picture for a given α. A large p-value supports the null hypothesis,
but a small p-value indicates that the probability of the test statistic is low if H0 is true. So probably
H0 is not true and we should reject it.
• We can also interpret the p-value as a measure of plausibility. If p is small, the plausibility of H0 is
small and it should be rejected. And if p is large, the plausibility of H0 is large.
• By using the p-value, the test procedure is simplified considerably. It is not necessary to start the test
by specifying an error probability (significance level) α. Furthermore, we do not need a critical value
and thus no statistical table. (Before the development of computers, these tables were necessary
because the computing effort for critical values as well as for p-values was prohibitive.)
Critical Values of t
P. 1) Cigarette advertisements are required by federal law to carry the following statement: “Warning: The surgeon
general has determined that cigarette smoking is dangerous to your health.” However, this warning is often located in
inconspicuous corners of the advertisements and printed in small type. Suppose the Federal Trade Commission (FTC)
claims that 80% of cigarette consumers fail to see the warning. A marketer for a large tobacco firm wants to gather
evidence to show that the FTC’s claim is too high, i.e., that fewer than 80% of cigarette consumers fail to see the
warning. Specify the null and alternative hypotheses for a test of the FTC’s claim.

P. 2) A metal lathe is checked periodically by quality-control inspectors to determine whether it is producing


machine bearings with a mean diameter of 0.5 inch. If the mean diameter of the bearings is larger or smaller than 0.5
inch, then the process is out of control and must be adjusted. Formulate the null and alternative hypotheses for a test
to determine whether the bearing production process is out of control.
Example

In a factory where detergents are produced in boxes of 500 gr., the standard deviation of
boxes is known as 10 gr. The average weight of 50 randomly selected boxes is calculated
as 505 gr. Could it be said that the production is in compliance with the standards at the
5% significance level? (It takes a very short time to reset the machine.)
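A sketch of the two-tailed z-test implied by the question (both over- and under-filling matter, since resetting the machine is cheap); the 5% two-tailed critical value is z = 1.96:

```python
from math import sqrt

mu0, sigma, n, xbar, z_crit = 500, 10, 50, 505, 1.96
z = (xbar - mu0) / (sigma / sqrt(n))

print(round(z, 3))        # ≈ 3.536
print(abs(z) > z_crit)    # True -> reject H0: production is not on standard
```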
Example

Steel with an average resistance of at least 2500 kg is used in a machine industry. In


order to make the product acceptance sampling of a new order, 64 bars were
randomly selected and it was determined that the average resistance was 2350 kg
and the standard deviation was 400 kg. Is this order acceptable at the 1%
significance level?
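A sketch of the lower-tailed large-sample test (n = 64, so the sample standard deviation stands in for σ); the 1% one-tailed critical value is z = -2.326:

```python
from math import sqrt

mu0, s, n, xbar, z_crit = 2500, 400, 64, 2350, -2.326
z = (xbar - mu0) / (s / sqrt(n))

print(z)             # -3.0
print(z < z_crit)    # True -> reject H0; the order is not acceptable
```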
Problem

The reputations (and hence sales) of many businesses can be severely damaged by shipments of manufactured
items that contain a large percentage of defectives. For example, a manufacturer of alkaline batteries may
want to be reasonably certain that fewer than 5% of its batteries are defective. Suppose 300 batteries are
randomly selected from a very large shipment; each is tested, and 10 defective batteries are found.
• Does this provide sufficient evidence for the manufacturer to conclude that the fraction defective in the
entire shipment is less than 0.05? Use α = 0.01.
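A sketch of the lower-tailed test for a proportion, with the standard error computed under H0 (p0 = 0.05) and the 1% critical value z = -2.326:

```python
from math import sqrt

n, defects, p0, z_crit = 300, 10, 0.05, -2.326
p_hat = defects / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print(round(z, 3))    # ≈ -1.325
print(z < z_crit)     # False -> insufficient evidence that the defect rate is below 0.05
```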
Problem

• With individual lines at its various windows, a post office finds that the standard deviation for
normally distributed waiting times for customers is 7.2 minutes. The post office experiments with a
single, main waiting line and finds that for a random sample of 25 customers the waiting times for
customers have a standard deviation of 4.5 minutes. At the 1 % significance level, determine if the
single line changed the variation among the waiting times for customers.
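A sketch of the chi-square test for a variance; the critical values are the usual table entries for df = 24 at a two-tailed 1% level:

```python
n, s, sigma0 = 25, 4.5, 7.2
chi2 = (n - 1) * s ** 2 / sigma0 ** 2
print(round(chi2, 3))                     # 9.375

chi_lo, chi_hi = 9.886, 45.559            # chi-square at 0.995 and 0.005, df = 24
print(chi2 < chi_lo or chi2 > chi_hi)     # True -> reject H0; the variation changed
```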
Problem
Below are the monthly incomes (in 100 dollars) and monthly expenditures of 40 individuals randomly selected
from a specific region.

Income:
18 29 20 61 22 55 54 21
43 29 18 37 21 32 32 20
46 35 27 49 23 39 54 46
49 31 30 29 23 49 62 44
56 41 13 49 49 44 61 63

Expenditures:
6 12 2 15 2 10 17 0
18 14 5 2 8 2 3 7
2 5 5 12 5 4 6 5
2 10 10 6 14 14 5 7
6 7 12 12 9 17 15 9
