Statistics MyNotes
A simple random sample is when the sample is chosen from the population
randomly so that every possible sample of n elements has an equal chance of
being selected.
Data are collected by various methods. Sometimes our data set consists of the
entire population; in other cases the data constitute a sample from some
population.
The Pth percentile of a group of numbers is the value below which lie P percent of
the numbers in the group. The position of the Pth percentile is given by
(n+1)P/100 where n is the number of data points.
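The (n+1)P/100 position rule can be sketched in Python. This is a minimal sketch with hypothetical data; the interpolation between ranks is one common convention when the position is not a whole number.

```python
def percentile_position(n, p):
    """Position of the Pth percentile among n sorted data points: (n+1)P/100."""
    return (n + 1) * p / 100

def percentile(data, p):
    """Pth percentile via the (n+1)P/100 rule, interpolating between ranks."""
    s = sorted(data)
    pos = percentile_position(len(s), p)   # 1-based position
    lo = int(pos)                          # whole rank below the position
    frac = pos - lo                        # fractional part for interpolation
    if lo >= len(s):
        return s[-1]
    if lo < 1:
        return s[0]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

marks = [45, 50, 55, 60, 65, 70, 75, 80, 85]   # hypothetical data, n = 9
print(percentile_position(9, 50))               # position 5.0
print(percentile(marks, 50))                    # 65, the median (50th percentile)
```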
1st Quartile -> It is the 25th percentile. Similarly 2nd and 3rd quartile.
Median is the point below which lie half the data. It is the 50th percentile.
Three common measures of central tendency are Mean, Median and Mode.
The mode of a data set is the value that occurs most frequently.
The mean summarizes all the information in the data. It is the single point
where all the mass (the weight) of the observations can be viewed as
concentrated.
The Median, on the other hand, is an observation in the center of the data set. ½
the data lies above this observation and the other ½ below it.
The mode tells us our data set’s most frequently occurring value. There can be
several modes.
MEASURE OF VARIABILITY
Two data sets can have the same central tendency but different
variability/dispersion.
The range of a set of observations is the difference between the largest and the
smallest observation.
Two other more commonly used measures of dispersion are variance and
standard deviation. These 2 are more useful as compared to the range and
interquartile range because they use the information contained in all the
observations in the data set or the population.
The STANDARD DEVIATION of a sample is the square root of the sample variance.
Why would we use the S.D. when we already have its square, the variance?
The S.D. is a more meaningful measure. The variance is the average squared
deviation from the mean. It is squared because if we just compute the deviations
from the mean and then average them we will get Zero. Therefore, when seeking
a measure of the variance we square the observation which removes the negative
signs and thus the measure is not equal to Zero. But the variance obtained is still a
squared quantity. By taking the root we un-square the units and get a quantity
denoted in the original units of the problem. Statisticians work with variance
because its mathematical properties simplify computations whereas people
applying statistics use S.D. since it is more easily interpreted.
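The reasoning above can be checked directly: the deviations from the mean sum to zero, squaring removes the signs, and taking the root returns to the original units. A minimal sketch with hypothetical observations:

```python
def variance(data, sample=True):
    """Average squared deviation from the mean (n-1 divisor for a sample)."""
    n = len(data)
    m = sum(data) / n
    ss = sum((x - m) ** 2 for x in data)   # squaring removes the negative signs
    return ss / (n - 1) if sample else ss / n

def std_dev(data, sample=True):
    """Square root of the variance, back in the original units."""
    return variance(data, sample) ** 0.5

obs = [2, 4, 4, 4, 5, 5, 7, 9]             # hypothetical observations
m = sum(obs) / len(obs)                    # mean = 5.0
print(sum(x - m for x in obs))             # raw deviations sum to 0
print(variance(obs, sample=False))         # population variance = 4.0
print(std_dev(obs, sample=False))          # population S.D. = 2.0
```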
When data are grouped into classes, we may also plot a frequency distribution of
the data. Such a frequency plot is called a histogram.
The Relative Frequency of a class is the count of data points in the class divided
by total number of points.
The larger the kurtosis the more peaked will be the distribution. Kurtosis is
calculated and reported either as an absolute or relative value. Absolute kurtosis
is always a positive number. The absolute kurtosis of the Normal Distribution is 3.
This value 3 is taken as the datum to calculate the relative kurtosis.
Relative Kurtosis= Absolute kurtosis−3
Relative kurtosis can be negative and we will always work with relative kurtosis.
A negative kurtosis implies a flatter distribution than the normal distribution, and
is called platykurtic. A positive kurtosis means the vice-versa and is called
Leptokurtic.
The mean is the measure of centrality of a set of observations whereas the S.D. is
the measure of their spread. Two rules establish a relation between these two.
1) CHEBYSHEV’S THEOREM:
*At least ¾ of the observations will lie within 2 S.D. of the mean
*At least 8/9th of the observations in a set will lie within 3 S.D. of the mean.
In general the rule states that at least (1 − 1/k²) of the observations will lie
within k S.D. of the mean.
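Chebyshev's bound can be verified numerically. A sketch with a hypothetical, deliberately skewed data set; the bound holds for any distribution:

```python
def chebyshev_bound(k):
    """At least this fraction of observations lies within k S.D. of the mean."""
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Actual fraction of the data within k S.D. of the mean."""
    n = len(data)
    m = sum(data) / n
    sd = (sum((x - m) ** 2 for x in data) / n) ** 0.5
    return sum(1 for x in data if abs(x - m) <= k * sd) / n

data = [1, 5, 5, 6, 6, 6, 7, 7, 9, 18]     # hypothetical, skewed data
print(chebyshev_bound(2))                   # 0.75 (at least 3/4)
print(fraction_within(data, 2))             # actual fraction, always >= 0.75
```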
PROBABILITY
Probability is the Quantitative measure of Uncertainty – A number that conveys
the strength of our belief in the occurrence of an uncertain event.
Universal Set is a set containing everything in the given context and Is denoted by
S.
The sample space is the universal set S pertinent to the given experiment which is
the set of all possible outcomes of an experiment.
P(A′) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Mutually Exclusive events
When the sets corresponding to 2 events are disjoint (i.e. have no intersection),
the 2 events are called mutually exclusive.
P(A ∪ B) = P(A) + P(B)
Conditional Probability
P(A|B) = P(A ∩ B) / P(B), assuming P(B) ≠ 0
Independence of Events
Two events are said to be independent of each other if and only if the following
three conditions are met:-
P(A|B)=P(A)
P(B|A)=P(B)
P(A ∩ B) = P(A)P(B)
The 3rd equation tells us that when A and B are independent of each other then
we can obtain the probability of the joint occurrence of A and B simply by
multiplying both the probabilities together. This rule is thus called the Product
rule of independent events.
Product Rules for Independent Events
The rules for the union and the intersection of 2 independent events extend
nicely to sequences of more than 2 events and are quite useful in random
sampling.
Union Rule: P(A1 ∪ A2 ∪ A3 ∪ … ∪ An) = 1 − P(A1′)P(A2′)P(A3′) … P(An′)
Intersection Rule: P(A1 ∩ A2 ∩ A3 ∩ … ∩ An) = P(A1) · P(A2) · P(A3) … P(An)
Note:- When 2 events are mutually exclusive, they are not independent. In fact
they are dependent events in the sense that if one happens the other cannot
happen. The probability of the intersection of 2 mutually exclusive events is 0
whereas the probability of the intersection of two independent events is not 0, it
is equal to the product of the probabilities of separate events.
Bayes’ Theorem
P(A|B) = P(A ∩ B) / P(B)
P ( A ∩ B )=P ( A|B ) . P ( B )
P ( A )=P ( A ∩B )+ P ( A ∩ B' )
From the law of total probability using conditional probability, we get a new
equation:-
P(A) = P(A|B) · P(B) + P(A|B′) · P(B′) …………(1)
Now, by Bayes’ theorem, P(B|A) = P(A|B) · P(B) / P(A)
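Equation (1) and Bayes' theorem combine into one computation. A minimal sketch with hypothetical numbers (a screening-test style example; the 0.95, 0.01 and 0.05 values are assumptions, not from the notes):

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """P(B|A) via Bayes' theorem; P(A) comes from the law of total probability."""
    p_not_b = 1 - p_b
    p_a = p_a_given_b * p_b + p_a_given_not_b * p_not_b   # equation (1)
    return p_a_given_b * p_b / p_a

# Hypothetical numbers: B = "has condition" (P(B) = 0.01),
# A = "test positive" (P(A|B) = 0.95, P(A|B') = 0.05).
posterior = bayes(p_a_given_b=0.95, p_b=0.01, p_a_given_not_b=0.05)
print(round(posterior, 4))   # ≈ 0.161
```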
RANDOM VARIABLE
A random variable is an uncertain quantity whose value depends on chance.
A random variable has a probability law – a rule that assigns probabilities to the
different values of the random variable. The probability law, the probability
assignment, is called the probability distribution of the random variable.
Example: the sixteen outcomes of four births (B = boy, G = girl), where X = number of girls:
X=0: BBBB
X=1: GBBB, BGBB, BBGB, BBBG
X=2: GGBB, GBGB, GBBG, BGGB, BGBG, BBGG
X=3: BGGG, GBGG, GGBG, GGGB
X=4: GGGG
The mean of the probability distribution of the random variable is called the
Expected Value of the R.V. We use both µ and E(X) as the symbol for expected
value.
E(X) = μ = Σ xP(x), for all x
X     P(X)     X·P(X)
0     0.1      0.0
1     0.2      0.2
2     0.3      0.6
3     0.2      0.6
4     0.1      0.4
5     0.1      0.5
      1.00     2.3 = Mean, E(X)
E(H(X)) = Σ H(X)P(X), for all x
σ² = V(X) = E(X²) − [E(X)]²
S.D. = (V(X))^0.5
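The table above can be pushed through these formulas directly. A sketch reproducing E(X) = 2.3 and then V(X) and the S.D. via E(X²) − [E(X)]²:

```python
# Distribution from the table above
dist = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.2, 4: 0.1, 5: 0.1}

mean = sum(x * p for x, p in dist.items())        # E(X)
e_x2 = sum(x ** 2 * p for x, p in dist.items())   # E(X^2)
var = e_x2 - mean ** 2                            # V(X) = E(X^2) - [E(X)]^2
sd = var ** 0.5                                   # back in the original units

print(round(mean, 1))    # 2.3
print(round(var, 2))     # 2.01
print(round(sd, 3))      # ≈ 1.418
```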
If the outcome of a trial can only be either success or failure, then the trial is a
Bernoulli trial. The number of successes X in one Bernoulli trial, which can be 1 or
0 , is a BERNOULLI RANDOM VARIABLE.
The trials must be Bernoulli trials in that the outcomes can only be either
success or failure.
The outcomes of the trials must be independent.
The probability of success in each trial must be a CONSTANT.
If X ~ B(n,p), then
P(X = x) = [n! / (x!(n − x)!)] · p^x · (1 − p)^(n − x)
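The binomial formula maps straight to code; `math.comb` gives the n!/(x!(n−x)!) term. A sketch using the pin example's p = 0.6 with an assumed n = 5:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): n!/(x!(n-x)!) * p^x * (1-p)^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical: probability of exactly 3 good pins out of 5, with p = 0.6
print(round(binomial_pmf(3, 5, 0.6), 4))   # 0.3456

# Sanity check: the probabilities over all x sum to 1
print(sum(binomial_pmf(x, 5, 0.6) for x in range(6)))
```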
If we need 2 pins and the probability of making a pin successfully is 0.6, then
we would stop as soon as we get 2 good pins. The 2nd good pin may come on the
2nd attempt, or on the 3rd, 4th, 5th, … attempt; the number of trials could be
2, 3, 4, 5, etc. (In contrast, in the binomial distribution the number of trials
is fixed.)
If X~NB(s,p), then
P(X = x) = [(x − 1)! / ((s − 1)!(x − s)!)] · p^s · (1 − p)^(x − s)
where x = s, s+1, s+2, …
E(X) = s/p    V(X) = s(1 − p)/p²
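The pin example can be evaluated with this formula; X is the trial on which the s-th success arrives. A sketch with s = 2 and p = 0.6 from the example:

```python
from math import comb

def neg_binomial_pmf(x, s, p):
    """P(X = x) for X ~ NB(s, p): the s-th success occurs on trial x."""
    return comb(x - 1, s - 1) * p ** s * (1 - p) ** (x - s)

# The pin example: s = 2 good pins needed, p = 0.6 per attempt
for x in (2, 3, 4):
    print(x, round(neg_binomial_pmf(x, 2, 0.6), 4))

print(round(2 / 0.6, 2))   # E(X) = s/p ≈ 3.33 attempts on average
```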
If X ~ G(p), then
E(X) = 1/p    V(X) = (1 − p)/p²
As long as the expected value µ=np is neither too large nor too small i.e. between
0.01 and 50, the binomial formula for P(X=x) can be approximated as,
P(X = x) = e^(−μ) · μ^x / x!, where x = 0, 1, 2, 3, …
E(X)=np
V(X)=np(1-p) but since p is very small 1-p~1 hence, V(X)= np
If we count the number of times a rare event will occur during a fixed interval,
then the number would follow a Poisson distribution.
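The quality of the Poisson approximation to the binomial can be seen numerically. A sketch with hypothetical values n = 1000, p = 0.002 (so μ = np = 2):

```python
from math import comb, exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) = e^(-mu) * mu^x / x!"""
    return exp(-mu) * mu ** x / factorial(x)

# Hypothetical: n = 1000 trials, tiny p = 0.002, so mu = np = 2
n, p = 1000, 0.002
mu = n * p
binom = comb(n, 3) * p ** 3 * (1 - p) ** (n - 3)   # exact binomial P(X = 3)

# The two values are nearly identical
print(round(binom, 4), round(poisson_pmf(3, mu), 4))
```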
A continuous random variable is a random variable that can take on any value in
an interval of numbers.
For a normal distribution with mean µ and S.D. σ the probability density
function f(x) is given by the formula:
f(x) = (1 / (σ√(2π))) · e^(−(1/2)·((x − μ)/σ)²)
for −∞ < x < +∞
X ~ N(μ, σ²)
p̂ (“p hat”, the sample proportion, written with a caret) estimates p.
Types of Sampling
There are 2 methods of selecting samples from populations: nonrandom or
judgment sampling, and random or probability sampling. In judgment
sampling, personal knowledge and opinion are used to identify the items
from the population, whereas in random sampling all the items in the
population have a chance of being chosen.
RANDOM SAMPLING
NOTE: With both stratified and cluster sampling, the population is divided
into well-defined groups. We use stratified sampling when each group has a
small variation within itself but there is a wide variation between these
groups. We use cluster sampling when there is considerable variation
within groups but the groups are essentially similar to each other.
NON RANDOM SAMPLING
SAMPLING DISTRIBUTIONS
A probability distribution of all the possible means of the samples is a
distribution of the sample means. Statisticians call this a Sampling
distribution of the mean.
The Population Distribution:
If somehow we were able to take all the possible samples of a given size from this
population distribution, each sample would have its own distribution. Now if we
were able to take the means from all these sample distributions and produce a
distribution of these sample means, we would get the sampling distribution of the
mean, which is the distribution of all the sample means. Its mean is called
‘mu sub x bar’.
If we increase our sample size from 5 to 20, it would not change the standard
deviation of the items in the original population. But with samples of size 20 we
would have increased the effect of averaging in each sample and would expect
even less dispersion among the sample means.
Mu sub x bar: μ_x̄ = μ
The sampling distribution has a S.D.(standard error) equal to the population S.D.
divided by the square root of the sample size:
Sigma sub x bar: σ_x̄ = σ / √n
EXAMPLE: A bank calculates that its individual savings accounts are normally
distributed with a mean of 2000$ and a S.D. of 600$. If the bank takes a random
sample of 100 accounts, what is the probability that the sample mean will lie
between $1900 and $2050?
Solution:
σ_x̄ = 600 / √100 = 600 / 10 = 60
Standardizing the sample mean: z = (x̄ − μ) / σ_x̄
From the table we get the combined area as 0.7492, which is the probability
that the sample mean will lie between $1900 and $2050.
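This example can be checked with Python's `statistics.NormalDist`. Note that the table area 0.7492 corresponds to the interval $1900 to $2050 (z = −1.67 to 0.83); the exact value differs slightly because the table rounds z to two places:

```python
from statistics import NormalDist

# Sampling distribution of the mean: mu = 2000, sigma_xbar = 600/sqrt(100) = 60
sampling_dist = NormalDist(mu=2000, sigma=60)

p = sampling_dist.cdf(2050) - sampling_dist.cdf(1900)
print(round(p, 4))   # ≈ 0.7499; the table, with rounded z values, gives 0.7492
```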
The mean of the sampling distribution of the mean will be equal to the
population mean regardless of the sample size, even if the population is not
normal.
As the sample size increases, the sampling distribution of the mean will
approach normality, regardless of the shape of the population distribution.
CLT assures us that the sampling distribution of the mean approaches
normal as the sample size increases. Statisticians use the normal
distribution as an approximation to the sampling distribution whenever the
sample size is at least 30.
The Significance of the CLT is that it permits us to use sample statistics to
make inferences about the population parameters without knowing
anything about the shape of the frequency distribution of that population
other than what we can get from the sample.
AN OPERATIONAL CONSIDERATION IN SAMPLING: THE RELATIONSHIP
BETWEEN SAMPLE SIZE AND ERROR
Finite population multiplier = √((N − n) / (N − 1))
Note: It is the absolute size of the sample that determines sampling precision, not
the fraction of the population sampled.
ESTIMATION
This chapter introduces methods that enable us to estimate with reasonable
accuracy the population proportion (the proportion of the population that
possesses a given characteristic) and the population mean.
Types of Estimate:
POINT ESTIMATES
The sample mean x bar is the best estimator of the population mean µ. It is
unbiased, consistent, the most efficient estimator, and, as long as the sample is
sufficiently large, its sampling distribution can be approximated by the normal
distribution. If we know x bar, we can make statements about any estimate we
make from sampling information.
The most frequently used estimator of the population standard deviation sigma is
the sample standard deviation ‘s’(refer Levin and Rubin book to see manual
method of calculation. Pg333).
To find the range we need to find the standard error of the mean by applying CLT
and then the range can be calculated.
Our sample of 200 users has a mean battery life of 36 months. The director wants
to know the range within which the battery life will lie. S.D. = 10.
Standard error of the mean = 10 / √200 = 10 / 14.14 = 0.707 month
Hence we are 68.3% sure that the population mean lies within 36 ± 0.707 months.
CONFIDENCE LEVEL:-68.3%
High confidence levels produce large confidence intervals which are not precise
and give fuzzy estimates.
Note: Suppose we calculate from one sample the following confidence interval
and level.
“We are 95% sure that the mean battery life of the population lies within 30 to 42
months”
“This statement does not mean that the chance is .95 that the mean life of the
population will fall within the interval established in one sample. Instead, it
means that if we select many random samples of the same size and calculate a
confidence interval for each of these samples, then in about 95% of the cases,
the population mean will lie within that interval.”
Use of the T-distribution is required whenever the sample size is less than 30 and
the population S.D. is not known. Furthermore in using the T-distribution we
assume that the population is normal or approximately normal.
Example: If the Hypothesized population mean is 0.04inch and the population S.D.
is .004 inch, what are the chances of getting a sample mean (0.0408) that differs
from 0.04 inch by 0.0008 inch?
Sigma sub x bar: σ_x̄ = σ / √n = 0.004 in. / √100 = 0.0004 inch
Now we find the z value to see how far our sample mean lies from the
hypothesized mean: z = (0.0408 − 0.04) / 0.0004 = 2 standard errors. From the
table, 4.5% is the total chance that our sample mean would differ from the
population mean by 2 or more standard errors. With this low a chance we cannot
conclude that a population with a true mean of 0.04 inch would be likely to
produce a sample like this.
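The z value and the two-tail chance in this example can be verified with the standard normal distribution:

```python
from statistics import NormalDist

sigma_xbar = 0.004 / 100 ** 0.5              # 0.0004 inch
z = (0.0408 - 0.04) / sigma_xbar             # 2 standard errors
p_two_tail = 2 * (1 - NormalDist().cdf(z))   # chance of a mean this far out

print(round(z, 2), round(p_two_tail, 3))     # 2.0, ≈ 0.046 (the 4.5% in the text)
```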
The minimum standard for an acceptable probability, 4.5 %, is also the risk we
take of rejecting a hypothesis that could have been true; there can be no risk free
trade off.
TESTING HYPOTHESIS
Whenever we reject the null hypothesis, the conclusion we do accept is called the
alternative hypothesis and is symbolized by H1 or “H sub-one”.
H0: μ = 200
H1: μ ≠ 200
H1: μ > 200
H1: μ < 200
The purpose of hypothesis testing is not to question the computed value of the
sample statistic but to make a judgment about the difference between the sample
statistic and a hypothesized population parameter.
Whenever we say that we accept the null hypothesis, we actually mean that
there is not sufficient statistical evidence to reject it. Use of the term ‘accept’,
instead of ‘do not reject’, has become a standard. It means simply that when
sample data do not cause us to reject a null hypothesis, we behave as if the
hypothesis is true.
Note: The higher the significance level, the higher the probability of rejecting a
null hypothesis.
The probability of making one type of error can be reduced only if we are willing
to increase the probability of making the other type of error.
A two-tailed test of hypothesis will reject the null hypothesis if the sample is
significantly higher than or lower than the hypothesized population mean. Thus in
a two-tailed test there are two rejection regions.
However there are situations in which a two-tailed test is not appropriate and we
must use a one-tailed test.
STEP  ACTION
1. Decide whether this is a two-tailed test or a one-tailed test. State your
hypotheses. Select a level of significance appropriate for this decision.
2. Decide which distribution (t or z) is appropriate and find the critical
value(s) for the chosen level of significance.
3. Calculate the standard error of the sample statistic. Use the standard
error to convert the observed value of the sample statistic to the
standardized value.
4. Sketch the distribution and mark the position of the standardized
sample value and the critical value(s) for the test.
5. Compare the value of the standardized sample statistic with the
critical value(s) for this test and interpret the result.
If the sample size is 30 or less and sigma is not known, we should use the t-
distribution with n-1 degrees of freedom.
TESTING HYPOTHESIS: TWO SAMPLE TESTS
Do female employees earn less than male employees? Did one group of
experimental animals react differently from another?
In each of these examples, decision makers are concerned with the parameters of
two populations. In these situations, they are not as interested in the actual value
of the parameters as they are in the relation between the values of the 2
parameters i.e. how these parameters differ.
Because we now wish to study 2 populations, not just one, the sampling
distribution of interest is the sampling distribution of the difference between
sample means.
Suppose population 1 and population 2 have standard deviations σ1 and σ2 and
means μ1 and μ2. Now we take a sample from each population and compute the
difference between the sample means, x̄1 − x̄2. The sampling distribution of
this difference has mean (μ1 − μ2). The standard deviation of the distribution
of the difference between the sample means is called the standard error of the
difference between the 2 means and is calculated by using the formula below:
σ_(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2)
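The standard-error formula is a one-liner in code. A minimal sketch; the σ and n values below are hypothetical, not from the notes:

```python
def se_difference(s1, n1, s2, n2):
    """Standard error of the difference between two sample means."""
    return (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5

# Hypothetical: sigma1 = 40, n1 = 100; sigma2 = 60, n2 = 90
print(round(se_difference(40, 100, 60, 90), 2))   # 7.48
```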
NOTE: SEE EXAMPLES FROM LEVIN AND RUBIN BOOK (Pg No:445)
TESTING DIFFERENCES BETWEEN MEANS WITH DEPENDENT SAMPLES
Ex: A health spa has a program which guarantees more than 17 kg of weight loss.
The spa provides records of 10 participants with before and after weight. So
conceptually what we have is not 2 samples of before and after weights, rather
one sample of weight losses.
CHI-SQUARE AND ANOVA
Suppose we have proportions from 5 populations instead of only two. In this case,
the methods for comparing proportions described earlier do not apply. Chi-
square tests enable us to test whether more than 2 population proportions can
be considered equal. Actually it allows us to do a lot more than just test for the
equality of several proportions. If we classify a population into several
categories with respect to two attributes, we can use a chi-square test to
determine whether the 2 attributes are independent of each other.
The ANOVA enables us to test whether more than 2 population means can
be considered equal.
Many times, managers need to know whether the differences they observe
among several sample proportions are significant or only due to chance.
CONTINGENCY TABLES
Using these symbols we can calculate the null and alternate hypothesis as
follows:
H0: pN = pS = pC = pW ← Null hypothesis
H1: pN, pS, pC, pW are not all equal ← Alternative hypothesis
If the null hypothesis is true then we can combine the data from the 4
samples and then estimate the proportion of the total workforce that
prefers the present method:
Combined proportion = (68+75+57+79)/(100+120+90+110)
= 279/420=0.6643
Chi-square → χ² = Σ (f_o − f_e)² / f_e
f_o = observed frequency
f_e = expected frequency
Now when we calculate our value for chi-square (Levin and Rubin book,
page no. 536) we get the value 2.764. If this value was as large as 20, it
would indicate a substantial difference between our observed and
expected values. A chi-square of 0, on the other hand, indicates that the
observed frequencies exactly match the expected frequencies. The value of
chi-square can never be negative because the differences between the
expected and observed values are always squared.
f_e = (RT × CT) / n
RT = row total for the row containing that cell
CT = column total for the column containing that cell
n = total number of observations
f_e = expected frequency in a given cell
ANALYSIS OF VARIANCE(ANOVA)
STATEMENT OF HYPOTHESIS:
In this case, our reason for using ANOVA is to decide whether these 3
samples were drawn from populations having the same means. Because we
are testing the effectiveness of the three training methods, we must
determine whether the 3 samples, represented by the sample means could
have been drawn from population having the same mean.
In order to use ANOVA we must assume that each of the samples is drawn
from a normal population and that each of these populations has the same
variance, σ². However, if the sample sizes are large enough we do not need
the assumption of normality.
ANOVA is based on a comparison of two different estimates of the
variance, σ 2, of overall population. In this case, we can calculate one of
these estimates by examining the variance among the three sample means,
which are 17, 21 and 19. The other estimate of the population variance is
determined by the variation within the 3 samples themselves. Then we
compare these 2 estimates of the population variance, because both the
estimates are of variance, they should be approximately equal in value
when the null hypothesis is true. If the null hypothesis is not true, these 2
estimates will differ considerably.
Determine one estimate of the population variance from the variance
among the sample means.
Determine a 2nd estimate of the variance from the variance within the
samples
Compare these 2 estimates. If they are approx. equal, accept the null
hypothesis.
s_x̄² = Σ(x̄ − x̿)² / (k − 1)
Now we know the variance among the sample means, s_x̄², so we substitute it in
place of σ_x̄². The above equation becomes:
σ̂² = s_x̄² × n = Σ n(x̄ − x̿)² / (k − 1)
With unequal sample sizes n_j, the between-column variance is:
σ̂_b² = Σ n_j(x̄_j − x̿)² / (k − 1)
where:
x̿ = grand mean
k = number of samples
n_j = size of the jth sample
n_T = Σ n_j = total sample size
F = (Between-column variance) / (Within-column variance) = σ̂_b² / σ̂_w²
When populations are not the same, the between-column variance tends to be
larger than the within-column variance, and the F value tends to be large. This
leads to the rejection of null hypothesis.
THE F DISTRIBUTION
Like other statistics we have studied, if the null hypothesis is true, then the
F statistic has a particular sampling distribution. The F distribution is
identified by a pair of DOF, unlike the t and chi-square distributions, which
have just 1 DOF. The first number is the DOF in the numerator of the F ratio
and the 2nd is the DOF in the denominator.
DOF of numerator=3-1=2
From F table we find the value of 3.81. Because the calculated value is within the
acceptance region we would accept the null hypothesis.
TYPES OF RELATIONSHIP
Equation of a line: Y = a + bX
Y: dependent variable
a: Y intercept
b: slope
X: independent variable
Slope of a straight line: b = (Y2 − Y1) / (X2 − X1)
Now that we have seen how to determine the equation of a straight line, let us see
how we can calculate the equation of a line that is drawn through the middle of a
set of points in a scatter diagram. To a statistician, the line will have a “good fit” if
it minimizes the error between the estimated points on the line and the actual
observed points that were used to draw it.
We want to find a way to penalize large absolute errors so that we can avoid
them. We can accomplish this if we square the individual errors before we add
them. Squaring each term accomplishes 2 goals: it magnifies (penalizes) the
larger errors, and it removes the negative signs so that positive and negative
errors do not cancel.
b = (ΣXY − n·X̄·Ȳ) / (ΣX² − n·X̄²)
a = Ȳ − b·X̄
Now,
b = (ΣXY − n·X̄·Ȳ) / (ΣX² − n·X̄²)
  = (78 − (4)(3)(6)) / (44 − (4)(3)²)
  = 6 / 8
  = 0.75 ← slope
a = Ȳ − b·X̄
  = 6 − (0.75)(3)
  = 3.75 ← Y intercept
Ŷ = 3.75 + 0.75(4)
  = 6.75
Thus the manager might expect to spend about $675 annually in repairs on a
4-year-old truck.
Ŷ (“Y cap”): estimated values from the estimating equation that correspond to
each value of X.
X    Y    Ŷ (3.75 + 0.75X)    Y − Ŷ    (Y − Ŷ)²
5    7    7.5                 −0.5     0.25
3    7    6.0                  1.0     1.00
3    6    6.0                  0.0     0.00
1    4    4.5                 −0.5     0.25
                        TOTAL →        1.50
Se = √( Σ(Y − Ŷ)² / (n − 2) )
   = √( 1.5 / (4 − 2) )
   = √0.75 ≈ 0.866
Equivalently, without computing each Ŷ:
Se = √( (ΣY² − aΣY − bΣXY) / (n − 2) )
Assuming that the observed points are normally distributed around the regression
line, we can expect to find 68% of the points within +/- 1 Se , 95.5% within +/- 2 Se
and 99.7% within +/- 3 Se .
CORRELATION ANALYSIS
One point that must be emphasized strongly is that r² measures only the
strength of a linear relationship between 2 variables.
r² = (aΣY + bΣXY − nȲ²) / (ΣY² − nȲ²)
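Both Se and r² can be checked against the truck example's four data points and the fitted line a = 3.75, b = 0.75:

```python
xs = [5, 3, 3, 1]
ys = [7, 7, 6, 4]
n, a, b = 4, 3.75, 0.75                             # fitted line from the example

# Standard error of estimate: Se = sqrt(sum of squared errors / (n - 2))
errors_sq = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = (errors_sq / (n - 2)) ** 0.5
print(errors_sq, round(se, 3))                      # 1.5, ≈ 0.866

# Coefficient of determination r^2
sum_y2 = sum(y ** 2 for y in ys)                    # 150
sum_xy = sum(x * y for x, y in zip(xs, ys))         # 78
sum_y = sum(ys)                                     # 24
y_bar = sum_y / n
r2 = (a * sum_y + b * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)
print(r2)                                           # 0.75
```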