Statistics
A normal distribution with mean 0 and standard deviation 1 is called a standard normal distribution. A random
variable with a standard normal distribution is typically denoted by the symbol z. The probability density function
of z is given by:
$$ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} $$
1. Use the Standard Normal Table to find the areas corresponding to the z values. If
necessary, use the symmetry of the normal distribution to find areas
corresponding to negative z values, and use the fact that the total area on each side of
the mean equals 0.5 to convert the areas to the probability of the event you have
shaded.
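The table lookup described above can also be checked numerically. The following is a minimal sketch using Python's standard library; the helper names `table_area` and `area_between` are illustrative and not part of the original notes.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)

def table_area(z: float) -> float:
    """Area between 0 and |z|, the quantity many printed tables list.
    Symmetry means a negative z gives the same area as its positive twin."""
    return Z.cdf(abs(z)) - 0.5

def area_between(z_low: float, z_high: float) -> float:
    """Area under the standard normal curve between z_low and z_high."""
    return Z.cdf(z_high) - Z.cdf(z_low)

print(round(table_area(1.96), 4))            # area from 0 to 1.96
print(round(area_between(-1.96, 1.96), 4))   # central area
print(round(0.5 - table_area(1.96), 4))      # upper-tail area beyond 1.96
```

The last line uses the fact that each side of the mean has total area 0.5, which is how a table area is converted into a tail probability.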
Normal Curve Area
List of Population and Corresponding Sample Statistics
Sampling Distribution
A sampling distribution is the probability distribution of a sample statistic obtained when samples
of size n are repeatedly drawn from a population. It describes the range of possible values of a
statistic, such as the mean or mode of some variable, computed from samples of that population.
[Figure: population and sampling distribution]
If the sample statistic is the sample mean, then the distribution is the sampling distribution of
the sample mean.
Estimating the mean useful life of automobiles, the mean monthly sales for all iPhone dealers
in a large city, and the mean breaking strength of new plastic are practical problems with
something in common. In each case, we are interested in making an inference about the mean
of some population.
But which sample statistic should we use to estimate the population mean: the sample mean or
the sample median? Which of these better represents the population?
The Central Limit Theorem
The larger the sample size, the better will be the normal approximation to the sampling
distribution of $\bar{x}$.
The sample size needed for a good approximation depends on the shape of the
distribution of the sampled population.
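A small simulation can illustrate the theorem. This is a sketch under assumptions that are not in the notes: the population is exponential with mean 1 (strongly right-skewed), and skewness is used as a rough measure of how far the sampling distribution is from a symmetric, normal-like shape.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Draw `reps` samples of size n from an exponential population (mean 1)
    and return the list of sample means."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

rng = random.Random(0)   # fixed seed so the run is repeatable
for n in (2, 10, 50):
    means = sample_means(n, 5000, rng)
    # Columns: sample size, mean of the sample means, skewness of the sample means
    print(n, round(statistics.fmean(means), 3), round(skewness(means), 3))
```

The skewness of the sample means should shrink toward 0 as n grows, while their average stays near the population mean of 1.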
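1. Unbiasedness
$$ \bar{x}_i = \frac{x_1 + x_2 + \dots + x_n}{n} $$
$$ E(\bar{x}_i) = \frac{\sum_{i=1}^{C_N^n} \bar{x}_i}{C_N^n}, \qquad C_N^n = \frac{N!}{n!\,(N-n)!} $$
$$ E(\bar{x}_i) = \frac{(N-1)!}{(n-1)!\,(N-n)!} \cdot \frac{n!\,(N-n)!}{n\,N!}\,(x_1 + x_2 + \dots + x_N) $$
$$ E(\bar{x}_i) = \frac{\sum_{i=1}^{N} x_i}{N} = \mu $$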
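The derivation can be verified by brute force on a tiny population. The sketch below enumerates every possible sample without replacement and averages the sample means; the population values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

population = [2, 4, 6, 8, 10]   # hypothetical population, N = 5
n = 3                            # sample size

# Every possible sample of size n drawn without replacement.
samples = list(combinations(population, n))
assert len(samples) == comb(len(population), n)   # C(N, n) samples in total

# Mean of each sample, kept exact with Fraction to avoid rounding error.
means = [Fraction(sum(s), n) for s in samples]

expected_mean = sum(means) / len(means)            # E(x-bar) over all samples
mu = Fraction(sum(population), len(population))    # population mean

print(expected_mean, mu)   # both equal 6
```

Each population value appears in C(N-1, n-1) of the samples, which is the counting step the derivation relies on.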
2. Consistency
Sample means, like other estimators, do not exactly equal the population mean they estimate,
because they involve a sampling error.
For an estimator to be consistent, the probability that the difference between its estimate and
the parameter exceeds any small value $\varepsilon > 0$ must become very small as the sample grows.
As the number of sample units increases, the difference between the sample mean and the
population mean, $|\bar{x} - \mu|$, decreases due to the law of large numbers. So the sample
mean is a consistent estimator.
$$ P\left(|\bar{x} - \mu| > \varepsilon\right) \to 0 \quad \text{as } n \to \infty, \quad \varepsilon > 0 $$
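The limit statement can be illustrated by simulation. This sketch assumes a normal population with hypothetical parameters (mean 50, standard deviation 10) and estimates the probability that the sample mean misses the population mean by more than a tolerance `eps`.

```python
import random
import statistics

def exceed_probability(n, eps, reps, rng, mu=50.0, sigma=10.0):
    """Estimate P(|x-bar - mu| > eps) by simulation for samples of size n
    drawn from a normal population with the given mu and sigma."""
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if abs(xbar - mu) > eps:
            hits += 1
    return hits / reps

rng = random.Random(1)   # fixed seed so the run is repeatable
for n in (5, 50, 500):
    # Columns: sample size, estimated probability of missing by more than eps
    print(n, exceed_probability(n, eps=1.0, reps=1000, rng=rng))
```

The estimated probability should fall steadily as n increases, which is what consistency requires.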
3. Efficiency - Minimum Variance
$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} $$
The finite population correction factor is used when the sampling ratio is $n / N > 0.05$. For
samples that are very small relative to the population, the numerator and denominator of the
factor are nearly equal, so the factor is close to 1.
Since populations are generally large, the denominator of the finite population
correction factor is approximately N, and the factor can be written as:
$$ \sqrt{1 - \frac{n}{N}} $$
When populations are infinitely large, the value in the square root approaches 1.
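As a rough illustration of the correction factor, the sketch below computes the standard error of the sample mean with and without it. The function names and the numbers are illustrative only.

```python
from math import sqrt
from typing import Optional

def standard_error(sigma: float, n: int, N: Optional[int] = None) -> float:
    """Standard error of the sample mean.

    If a finite population size N is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied; otherwise sigma / sqrt(n) is returned.
    """
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

def approx_fpc(n: int, N: int) -> float:
    """Large-N approximation of the correction factor: sqrt(1 - n/N)."""
    return sqrt(1 - n / N)

print(standard_error(20, 100))               # no correction
print(standard_error(20, 100, N=1000))       # n/N = 0.10, correction matters
print(standard_error(20, 100, N=1_000_000))  # n/N tiny, correction is about 1
print(approx_fpc(100, 1000))
```

With n/N = 0.10 the corrected standard error is noticeably smaller, while for a very large population the two values practically coincide.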
4. Sufficiency
A sufficient estimator is one that is calculated using the values of all the units in the
sample.
Values such as the median, mode, and quartiles can be used as estimators, but they do not
have this property, because, unlike the arithmetic mean, they are not statistics based on the
values of all the units in the sample.
In this respect, the sample arithmetic mean is a sufficient estimator of the population
mean.
In sum, to make an inference about a population parameter, we use the sample statistic
with a sampling distribution that is unbiased, efficient (has a small standard deviation,
usually smaller than the standard deviation of other unbiased sample statistics),
consistent and sufficient.
Estimation of Population Mean
The population mean can be estimated in two ways:
1. Point Estimation,
2. Interval Estimation
Point Estimation
The simplest type of statistic used to make inferences about a population parameter is a
point estimator.
• A point estimator of a population parameter is a rule or formula that tells us how to use
the sample data to calculate a single number that can be used as an estimate of the
population parameter. For example, the sample mean is a point estimator of the
population mean. Similarly, the sample variance is a point estimator of the population
variance.
$\bar{x} \rightarrow \mu$
$s^2 \rightarrow \sigma^2$
• Often, many different point estimators can be found to estimate the same parameter. Each
will have a sampling distribution that provides information about the point estimator. By
examining the sampling distribution, we can determine how large the difference between
an estimate and the true value of the parameter (called the error of estimation) is likely
to be. We can also tell whether an estimator is more likely to overestimate or to
underestimate a parameter.
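The two point estimators named above can be computed directly. This is a minimal sketch with hypothetical sample data.

```python
import statistics

sample = [12.1, 9.8, 11.4, 10.7, 10.0]   # hypothetical sample data

x_bar = statistics.fmean(sample)     # point estimate of the population mean
s2 = statistics.variance(sample)     # point estimate of the population variance
                                     # (sample variance, divides by n - 1)

print(round(x_bar, 3), round(s2, 3))
```

Note that `statistics.variance` uses the n - 1 divisor, while `statistics.pvariance` would divide by n and treat the data as a whole population.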
Interval Estimation
An estimator is expected to be unbiased, consistent, efficient, and sufficient, and it should
also be precise. The precision of an estimator is measured by its standard error. That is, the
sampling error, which shows the average deviation of the estimate from the population parameter,
should be small. When this error is large, the estimator cannot give precise information about
the true value of the population parameter, even if it has the other properties. By adding
the sampling error to the point estimate and subtracting it from the point estimate, the lower and
upper limits for the population parameter are determined. This interval is called the
confidence interval.
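For sample means:
$$ P\left(|\bar{x} - \mu| \le k\,\sigma_{\bar{x}}\right) = 1 - \alpha $$
Here $\alpha$ is the probability that the true population mean is outside the estimated range, and $k$ is the
t or Z value for probability $1 - \alpha$, chosen according to the number of sample units. From this inequality:
$$ -k\,\sigma_{\bar{x}} \le \bar{x} - \mu \le k\,\sigma_{\bar{x}} $$
$$ \bar{x} - k\,\sigma_{\bar{x}} \le \mu \le \bar{x} + k\,\sigma_{\bar{x}} $$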
Confidence Interval for a Population Mean
Example
Q) In a factory, bulbs are produced with a standard deviation of 20 hours and an average durability of 550
hours. The durability of the bulbs is known to be normally distributed. Quality is controlled by randomly
selecting 100 units and working at a 95% confidence level. In a selected sample, the average is calculated as 540 hours.
A) Find the 95% confidence interval for the average durability of the bulbs. Is the
sample mean up to the standard?
B) Find the interval within which sample means must fall to be up to the standard.
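One possible way to work this example numerically is sketched below. It assumes the Z value applies because the population standard deviation is known, and it reads part B as asking for limits centred on the target mean of 550 hours; both readings are assumptions rather than a stated solution.

```python
from math import sqrt
from statistics import NormalDist

sigma, mu_0 = 20.0, 550.0    # process standard deviation and target mean (hours)
n, x_bar = 100, 540.0        # sample size and observed sample mean
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
margin = z * sigma / sqrt(n)                         # k * sigma_xbar

# A) Confidence interval for the population mean, centred on the sample mean.
ci_mu = (x_bar - margin, x_bar + margin)

# B) Interval in which sample means should fall if the process mean is 550.
control = (mu_0 - margin, mu_0 + margin)

print([round(v, 2) for v in ci_mu])        # about 536.08 to 543.92
print([round(v, 2) for v in control])      # about 546.08 to 553.92
print(control[0] <= x_bar <= control[1])   # False: 540 lies outside the limits
```

Under these assumptions the target mean of 550 is not inside the interval from part A, and the sample mean of 540 is outside the limits from part B, so the sample does not meet the standard.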
Confidence Interval for a Population Proportion
In some cases it is necessary to estimate the proportion or number of units in the population that have a
particular characteristic. In this case, the population consists of two groups, those with and those without
a certain feature, or it is recast into this form for the purposes of the research.
Researchers may sometimes want to treat multigroup populations as two-group populations. For example, while the
distribution of a class is multigroup according to the grades received in a course (from 0 to 100), it can
be converted into two groups, successful and unsuccessful students in that course, when desired.
If A is the number of units with a certain feature in a population of N units, then the proportion of those
that have this feature is:
$$ P = \frac{A}{N} \qquad \text{For a sample: } p = \frac{a}{n} $$
and the proportion of those that do not have this feature is:
$$ Q = \frac{N - A}{N} = 1 - P \qquad \text{For a sample: } q = \frac{n - a}{n} = 1 - p $$
Example
In order to estimate $\mu$ with a sampling error SE and with $100(1-\alpha)\%$ confidence, the required sample size is
found as follows:
$$ Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = SE $$
The solution for n is given by the equation:
$$ n = \frac{(Z_{\alpha/2})^2\,\sigma^2}{(SE)^2} $$
Note: The value of $\sigma$ is usually unknown. It can be estimated by the standard deviation, s, from a prior
sample. Alternatively, we may approximate the range R of observations in the population, and (conservatively)
estimate $\sigma \approx R/4$. In any case, you should round the value of n obtained upward to ensure that the sample
size will be sufficient to achieve the specified reliability.
In order to estimate a binomial probability p with sampling error SE and with $100(1-\alpha)\%$ confidence, the required
sample size is found by solving the following equation:
$$ Z_{\alpha/2}\,\sqrt{\frac{pq}{n}} = SE $$
The solution for n can be written as follows:
$$ n = \frac{(Z_{\alpha/2})^2\,(pq)}{(SE)^2} $$
Note: Because the value of the product pq is unknown, it can be estimated by using the sample fraction of successes,
$\hat{p}$, from a prior sample. In any case, you should round the value of n obtained upward to ensure that the sample size
will be sufficient to achieve the specified reliability.
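The two sample-size formulas translate directly into code. The sketch below rounds upward as the notes advise; the function names and the input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def z_value(confidence: float) -> float:
    """Z_{alpha/2} for a two-sided 100(1 - alpha)% confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma: float, se: float, confidence: float) -> int:
    """n = (Z_{alpha/2})^2 * sigma^2 / SE^2, rounded up."""
    return ceil((z_value(confidence) * sigma / se) ** 2)

def n_for_proportion(p: float, se: float, confidence: float) -> int:
    """n = (Z_{alpha/2})^2 * p * q / SE^2 with q = 1 - p, rounded up."""
    return ceil(z_value(confidence) ** 2 * p * (1 - p) / se ** 2)

print(n_for_mean(sigma=20, se=3, confidence=0.95))        # 171
print(n_for_proportion(p=0.5, se=0.03, confidence=0.95))  # 1068
```

Using p = 0.5 maximises the product pq, so it gives the most conservative (largest) sample size when no prior estimate of p is available.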
Example
The chi-square ($\chi^2$) distribution is characterized by a quantity called the degrees of freedom (df) associated with the distribution. Several chi-square
distributions with different df values are shown in the figure below. You can see that, unlike the z- and t-distributions,
the chi-square distribution is not symmetric about 0.
Critical Values of $\chi^2$
Example
The number of supermarkets in Myanmar’s most populated cities is increasing and market competition is also high.
Kaggle has published a study regarding the growth of supermarkets. A three-month dataset has been collected based
on the historical sales of a supermarket company located at Mandalay. The product lines under consideration in the
study are electronic accessories, fashion accessories, food and beverages, health and beauty, home and lifestyle, and
sports and travel. To analyze the average unit price of all the electronic accessories, a random sample of eight
accessories’ unit prices (in Kyat) paid by cash are listed in the following table.
• (2) The alternative (research) hypothesis, denoted Ha or H1, is the hypothesis that will be accepted
only if the data provide convincing evidence of its truth. It usually represents the values of a
population parameter that the researcher wants to gather evidence to support.
• There are two kinds of errors in hypothesis testing. The error of rejecting the null hypothesis
although it is true is called a type I error, and its probability is α.
• A different error occurs if a false null hypothesis is accepted. This error is called a type II error, and
its probability is denoted by β. The size of α (i.e., the significance level) is chosen by the
researcher. The size of β depends on the true mean, which we do not know, and on α.
Decreasing α increases the probability β of a type II error. The probability (1 − β) is the
probability that a false null hypothesis is rejected, which is what we want. This is called the power of a test,
and it is an important property of a test. Decreasing α therefore also decreases the power of the test.
Thus, there is a tradeoff between α and β. As already mentioned, the error probability α should not
be too small; otherwise the test loses its power to reject H0 if it is false. Both α and β can only
be reduced together by increasing the sample size N.
Type II error β (dependent on α and μ)
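The tradeoff between α and β can be made concrete for one simple case. The sketch below assumes an upper-tailed z-test with known standard deviation; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_upper_tailed(mu_0, mu_true, sigma, n, alpha):
    """Power (1 - beta) of the upper-tailed z-test of H0: mu = mu_0
    when the true mean is mu_true and sigma is known."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)         # critical value
    shift = (mu_true - mu_0) / (sigma / sqrt(n))    # true shift in standard-error units
    return 1 - STD_NORMAL.cdf(z_alpha - shift)

for alpha in (0.10, 0.05, 0.01):
    p = power_upper_tailed(mu_0=500, mu_true=503, sigma=10, n=50, alpha=alpha)
    # Columns: alpha, power, beta
    print(alpha, round(p, 3), round(1 - p, 3))

# A larger sample raises the power for the same alpha.
print(round(power_upper_tailed(500, 503, 10, 200, 0.01), 3))
```

The loop shows β growing as α shrinks, and the last line shows that a larger sample lets both error probabilities be small at once.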
The rejection region for a two-tailed test differs from that for
a one-tailed test. When we are trying to detect departure
from the null hypothesis in either direction, we must
establish a rejection region in both tails of the sampling
distribution of the test statistic.
Significance Levels: p-Values
Steps for Calculating the p-Value for a Test of Hypothesis
1. Determine the value of the test statistic z corresponding to the result of the sampling experiment.
2. a. If the test is one-tailed, the p-value is equal to the tail area beyond z in the same direction as the alternative
hypothesis. Thus, if the alternative hypothesis is of the form >, the p-value is the area to the right of, or above, the
observed z-value. Conversely, if the alternative is of the form <, the p-value is the area to the left of, or below, the
observed z-value.
b. If the test is two-tailed, the p-value is equal to twice the tail area beyond the observed z-value in the direction of the
sign of z—that is, if z is positive, the p-value is twice the area to the right of, or above, the observed z-value. Conversely,
if z is negative, the p-value is twice the area to the left of, or below, the observed z-value.
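The steps above can be sketched as a small function. The labels "greater", "less" and "two-sided" for the form of the alternative hypothesis are naming choices made here, not terms from the notes.

```python
from statistics import NormalDist

def p_value(z: float, alternative: str) -> float:
    """p-value for an observed z statistic.

    alternative: "greater" (Ha of the form >), "less" (Ha of the form <)
    or "two-sided" (Ha of the form not-equal).
    """
    cdf = NormalDist().cdf
    if alternative == "greater":
        return 1 - cdf(z)              # area to the right of z
    if alternative == "less":
        return cdf(z)                  # area to the left of z
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))   # twice the tail area beyond |z|
    raise ValueError(f"unknown alternative: {alternative!r}")

print(round(p_value(2.12, "greater"), 4))
print(round(p_value(-2.12, "two-sided"), 4))
```

Taking the absolute value in the two-sided case handles positive and negative z in one expression, since the two tails are symmetric.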
Test of Hypothesis About a Population Mean: Normal (z) Statistic
• When testing a hypothesis about a population mean μ, the test statistic we use will depend on whether the sample
size n is large (say, n > 30) or small, and whether or not we know the value of the population standard deviation, σ.
When the sample size is large, the Central Limit Theorem guarantees that the sampling distribution of $\bar{x}$ is
approximately normal. Consequently, the test statistic for a test based on large samples is the normal
z-statistic. Although the z-statistic requires that we know the true population standard deviation σ, we rarely, if ever,
know σ. When n is large, the sample standard deviation s provides a good approximation to σ, and the z-statistic can be
approximated as follows, where $\mu_0$ is the value of the mean under H0:
$$ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \approx \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$
Therefore, as the test statistic of a small-sample test of a population mean, we use the t-statistic:
$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$
• The p-value is also referred to as the empirical significance level. In SPSS, the p-value is called
“significance” or “sig”. It tells us the exact significance level of a test statistic, while the classical test
only gives us a “black and white” picture for a given α. A large p-value supports the null hypothesis,
but a small p-value indicates that a test statistic this extreme is improbable if H0 is true. So
H0 is probably not true and we should reject it.
• We can also interpret the p-value as a measure of plausibility. If p is small, the plausibility of H0 is
small and it should be rejected. And if p is large, the plausibility of H0 is large.
• By using the p-value, the test procedure is simplified considerably. It is not necessary to start the test
by specifying an error probability (significance level) α. Furthermore, we do not need a critical value
and thus no statistical table. (Before the development of computers, these tables were necessary
because the computing effort for critical values as well as for p-values was prohibitive.)
Critical Values of t
P. 1) Cigarette advertisements are required by federal law to carry the following statement: “Warning: The surgeon
general has determined that cigarette smoking is dangerous to your health.” However, this warning is often located in
inconspicuous corners of the advertisements and printed in small type. Suppose the Federal Trade Commission (FTC)
claims that 80% of cigarette consumers fail to see the warning. A marketer for a large tobacco firm wants to gather
evidence to show that the FTC’s claim is too high, i.e., that fewer than 80% of cigarette consumers fail to see the
warning. Specify the null and alternative hypotheses for a test of the FTC’s claim.
In a factory where detergent is packed in 500 g boxes, the standard deviation of
box weights is known to be 10 g. The average weight of 50 randomly selected boxes is calculated
as 505 g. Can it be said that production complies with the standard at the
5% significance level? (It takes a very short time to reset the machine.)
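One way to set up this problem is sketched below. It assumes a two-tailed test (H0: μ = 500 against Ha: μ ≠ 500), since both overfilling and underfilling would violate the standard; that choice is an assumption, not something the problem states.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma = 500.0, 10.0   # nominal box weight and known standard deviation (g)
n, x_bar = 50, 505.0        # sample size and sample mean
alpha = 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))          # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # two-tailed critical value
p = 2 * (1 - NormalDist().cdf(abs(z)))          # two-tailed p-value

reject_h0 = abs(z) > z_crit
print(round(z, 3), round(z_crit, 3), round(p, 5), reject_h0)
```

Under these assumptions z is about 3.54, well beyond the critical value of about 1.96, so H0 is rejected and the production does not appear to comply with the standard.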
Example
The reputations (and hence sales) of many businesses can be severely damaged by shipments of manufactured
items that contain a large percentage of defectives. For example, a manufacturer of alkaline batteries may
want to be reasonably certain that fewer than 5% of its batteries are defective. Suppose 300 batteries are
randomly selected from a very large shipment; each is tested, and 10 defective batteries are found.
• Does this provide sufficient evidence for the manufacturer to conclude that the fraction defective in the
entire shipment is less than 0.05? Use α = 0.01.
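A sketch of the calculation follows. It assumes the usual large-sample z-test for a proportion with H0: p = 0.05 against Ha: p < 0.05; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p_0 = 0.05              # hypothesised fraction defective
n, defects = 300, 10    # batteries tested and defectives found
alpha = 0.01

p_hat = defects / n
z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)   # large-sample z statistic
z_crit = NormalDist().inv_cdf(alpha)            # lower-tail critical value (negative)
p = NormalDist().cdf(z)                         # lower-tail p-value

reject_h0 = z < z_crit
print(round(z, 3), round(z_crit, 3), round(p, 3), reject_h0)
```

Under these assumptions z is about -1.32, which does not fall below the critical value of about -2.33, so at the 0.01 level the sample does not provide sufficient evidence that the fraction defective is below 0.05.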
Problem
• With individual lines at its various windows, a post office finds that the standard deviation for
normally distributed waiting times for customers is 7.2 minutes. The post office experiments with a
single, main waiting line and finds that for a random sample of 25 customers the waiting times for
customers have a standard deviation of 4.5 minutes. At the 1 % significance level, determine if the
single line changed the variation among the waiting times for customers.
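This problem can be approached with the chi-square statistic for a variance, (n − 1)s²/σ0², which has n − 1 degrees of freedom. The sketch below assumes a two-tailed test (H0: σ = 7.2 against Ha: σ ≠ 7.2) and takes the critical values from a printed chi-square table instead of computing them.

```python
n = 25
s = 4.5          # sample standard deviation (minutes)
sigma_0 = 7.2    # standard deviation under H0 (minutes)

chi2 = (n - 1) * s**2 / sigma_0**2   # test statistic with n - 1 = 24 df

# Two-tailed critical values for alpha = 0.01 and 24 df, read from a
# chi-square table (the 0.995 and 0.005 upper-tail entries).
lower_crit, upper_crit = 9.88623, 45.5585

reject_h0 = chi2 < lower_crit or chi2 > upper_crit
print(round(chi2, 3), reject_h0)
```

Under these assumptions the statistic is 9.375, just below the lower critical value, so H0 is rejected: the single line appears to have changed the variation in waiting times.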
Problem
Below are the monthly incomes (in hundreds of dollars) and monthly expenditures of 40 individuals randomly selected
from a specific region.
Income
18 29 20 61 22 55 54 21
43 29 18 37 21 32 32 20
46 35 27 49 23 39 54 46
49 31 30 29 23 49 62 44
56 41 13 49 49 44 61 63
Expenditures
 6 12  2 15  2 10 17  0
18 14  5  2  8  2  3  7
 2  5  5 12  5  4  6  5
 2 10 10  6 14 14  5  7
 6  7 12 12  9 17 15  9