Statistics
The arithmetic mean of Virat Kohli’s batting scores, also called his
batting average, is:
Sum of runs scored / Number of innings = 661/10 = 66.1
So the arithmetic mean of his scores in the last 10 innings is 66.1.
Harmonic Mean
A Harmonic Progression is a sequence in which the reciprocals of its terms are in
Arithmetic Progression, and the harmonic mean (HM) can be
calculated by dividing the number of terms by the sum of the reciprocals of the terms.
In particular cases, especially those involving rates and ratios, the harmonic
mean gives the most appropriate value of the mean. For example, if a vehicle
travels a specified distance at speed x (e.g. 60 km/h) and then travels the same
distance again at speed y (e.g. 40 km/h), the average speed is the harmonic mean
of x and y (i.e., 48 km/h).
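▶ As a quick check, here is a minimal Python sketch (standard library only) of
the speed example above, comparing the arithmetic and harmonic means of the
two speeds:

from statistics import harmonic_mean, mean

speeds = [60, 40]                      # km/h over two equal-distance legs
print(mean(speeds))                    # arithmetic mean: 50.0 (misleading here)
print(harmonic_mean(speeds))           # harmonic mean: 48.0 (the true average speed)

# Equivalent by hand: number of terms divided by the sum of the reciprocals
n = len(speeds)
print(n / sum(1 / s for s in speeds))  # 48.0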
Geometric Mean
▶ The Geometric Mean (GM) is the average value or mean which
signifies the central tendency of the set of numbers by finding the
product of their values.
▶ Basically, we multiply the numbers altogether and take out the nth
root of the multiplied numbers, where n is the total number of
values.
▶ For example: for a given set of two numbers such as 4 and 1, the
geometric mean is equal to √(4 × 1) = √4 = 2.
Use of Geometric Mean
▶ For example, suppose you have an investment which earns 10% the first
year, 60% the second year, and 20% the third year. What is its average
rate of return?
▶ It is not the arithmetic mean, because what these numbers mean is that
on the first year your investment was multiplied (not added to) by 1.10, on
the second year it was multiplied by 1.60, and the third year it was
multiplied by 1.20. The relevant quantity is the geometric mean of these
three numbers.
▶ The question about finding the average rate of return can be rephrased
as: "by what constant factor would your investment need to be multiplied
by each year in order to achieve the same effect as multiplying by 1.10
one year, 1.60 the next, and 1.20 the third?"
▶ If you calculate this geometric mean, (1.10 × 1.60 × 1.20)^(1/3),
you get approximately 1.283, so the average rate of return is about 28%
(not 30%, which is what the arithmetic mean of 10%, 60%, and 20% would
give you).
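▶ A minimal Python sketch of this calculation (values taken from the example
above; standard library only):

from statistics import geometric_mean

growth_factors = [1.10, 1.60, 1.20]          # +10%, +60%, +20% in successive years
gm = geometric_mean(growth_factors)          # (1.10 * 1.60 * 1.20) ** (1/3)
print(round(gm, 3))                          # ~1.283, i.e. about a 28% average return

# Compare with the misleading arithmetic-mean answer of about 30%
print(sum(growth_factors) / len(growth_factors) - 1)   # ~0.30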
Median
▶ The median is the middle value of a dataset when the dataset is
arranged in ascending or descending order.
▶ When the dataset contains an even number of
values, then the median value of the dataset can
be found by taking the mean of the middle two
values.
▶ If you have a skewed distribution, the best measure of
central tendency is the median.
▶ The median is less sensitive to outliers (extreme scores)
than the mean and thus a better measure than the
mean for highly skewed distributions, e.g. family
income. For example, the mean of 20, 30, 40, and 990 is
(20 + 30 + 40 + 990)/4 = 270. The median of these four
observations is (30 + 40)/2 = 35. Here 3 observations
out of 4 lie between 20 and 40, so the mean of 270
fails to give a realistic picture of the major part of
the data; it is heavily influenced by the extreme value 990.
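▶ A minimal Python sketch of the income example above, showing how the outlier
pulls the mean but barely affects the median:

from statistics import mean, median

incomes = [20, 30, 40, 990]      # one extreme value (990) skews the data
print(mean(incomes))             # 270.0 -- pulled up by the outlier
print(median(incomes))           # 35.0  -- (30 + 40) / 2, close to the bulk of the data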
Mode
▶ The mode is the value that occurs most frequently in a dataset.
Measures of Dispersion
▶ Range: It is simply the difference between the maximum value and the minimum
value in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
▶ Variance: Subtract the mean from each value in the set, square each of
these differences, add the squares, and divide by the total number of values
in the data set. Variance: σ² = Σ(X − μ)²/N
▶ Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. = √(σ²) = σ.
▶ Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
▶ Mean and Mean Deviation: The average of numbers is known as the mean and
the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).
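▶ The sketch below (Python standard library, using the small data set from the
Range bullet above) computes several of these dispersion measures; quartile
deviation is illustrated separately on a later slide:

from statistics import mean, pvariance, pstdev

data = [1, 3, 5, 6, 7]                      # example set from the Range bullet
data_range = max(data) - min(data)          # 7 - 1 = 6
variance = pvariance(data)                  # population variance: sum((x - mu)^2) / N
std_dev = pstdev(data)                      # square root of the variance
mu = mean(data)
mean_abs_dev = sum(abs(x - mu) for x in data) / len(data)   # mean absolute deviation
print(data_range, variance, std_dev, mean_abs_dev)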
Range
▶ It is the simplest method of measurement of dispersion.
▶ It is defined as the difference between the largest and the
smallest item in a given distribution.
▶ Range = Largest item (L) – Smallest item (S)
Interquartile Range
▶ It is defined as the difference between the Upper Quartile
and Lower Quartile of a given distribution.
▶ Interquartile Range = Upper Quartile (Q3) – Lower
Quartile (Q1)
Variance
▶ Variance is a measure of how data points differ from the mean.
▶ A variance is a measure of how far a set of data (numbers) are spread
out from their mean (average) value.
▶ The larger the variance, the more scattered the data are from the mean;
if the variance is low, the data are less scattered around the mean.
Therefore, it is called a measure of the spread of the data from the
mean.
▶ The formula for variance is
Var(X) = E[(X − μ)²]
▶ The variance is the square of the standard deviation, i.e.,
Variance = (Standard deviation)² = σ²
Relative Measures of Dispersion
• Relative measures of dispersion are especially valuable when you need to compare the
variability of different datasets where absolute values could mislead due to differing
scales.
▶ Coefficient of Range: The difference between the maximum and minimum
values divided by the sum of the maximum and minimum values, i.e. (L − S)/(L + S).
▶ Coefficient of Variation: This expresses the standard deviation as a percentage of
the mean. It is useful for comparing the degree of variability between datasets
with different units or vastly different means.
▶ Coefficient of Standard Deviation: This expresses the standard deviation as a
ratio of the mean (not as a percentage), i.e. σ / mean.
▶ Coefficient of Quartiles and Coefficient of Quartile Deviation: This measure is based
on the interquartile range (IQR), which is the difference between the third
quartile (Q3) and the first quartile (Q1). It is useful when the data distribution is
skewed, as it focuses on the middle 50% of the data.
Coefficient of Variation
▶ The coefficient of variation (CV) is a relative measure of variability that
indicates the size of the standard deviation in relation to its mean.
▶ It is a standardized, unitless measure that
allows you to compare variability between disparate
groups and characteristics.
▶ It is also known as the relative standard deviation (RSD).
▶ The coefficient of variation facilitates meaningful comparisons in
scenarios where absolute measures cannot.
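▶ A minimal Python sketch (hypothetical scores, standard library only) showing
how the CV lets you compare variability across different scales:

from statistics import mean, pstdev

scores_a = [55, 60, 65, 70, 75]      # marks out of 100
scores_b = [11, 12, 13, 14, 15]      # marks out of 20

def cv(values):
    # Coefficient of variation: standard deviation as a percentage of the mean
    return pstdev(values) / mean(values) * 100

print(cv(scores_a), cv(scores_b))    # comparable despite the different scales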
Quartile Deviation
▶ The Quartile Deviation (QD) is half of the difference
between the upper and lower quartiles.
▶ Mathematically, we can define it as: Quartile Deviation = (Q3 – Q1) / 2
▶ Quartile Deviation is an absolute measure of dispersion.
The corresponding relative measure is known as
the coefficient of QD, which is obtained from the
formula: Coefficient of Quartile Deviation = (Q3 – Q1)
/ (Q3 + Q1)
▶ The coefficient of QD is used to study and compare the degree of
variation in different situations.
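▶ A minimal Python sketch (hypothetical data, standard library only; note that
quartile conventions vary slightly between tools):

from statistics import quantiles

data = [15, 18, 20, 22, 25, 28, 30, 35, 40, 45]
q1, q2, q3 = quantiles(data, n=4)            # quartile cut points
quartile_deviation = (q3 - q1) / 2           # QD = (Q3 - Q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)             # coefficient of quartile deviation
print(q1, q3, quartile_deviation, coeff_qd)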
Skewness
▶ Skewness is a measure of the degree of asymmetry of a distribution.
▶ If the left tail (tail at small end of the distribution) is more
pronounced than the right tail (tail at the large end of the
distribution), the function is said to have negative skewness.
▶ If the reverse is true, it has positive skewness. If the two are equal, it
has zero skewness.
Kurtosis
▶ Kurtosis is a measure of whether the data are heavy-tailed or light-
tailed relative to a normal distribution.
▶ That is, data sets with high kurtosis tend to have heavy tails, or
outliers. Data sets with low kurtosis tend to have light tails, or lack of
outliers.
▶ Significant skewness and kurtosis clearly indicate that data are not
normal.
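▶ A minimal sketch of how these can be computed in Python (assumes SciPy is
available; the data values are hypothetical):

from scipy.stats import skew, kurtosis

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]   # one large value produces a long right tail
print(skew(data))                        # positive value -> positive (right) skew
print(kurtosis(data))                    # excess kurtosis; > 0 suggests heavy tails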
Types of Distributions
Normal Distribution
▶ In probability theory and statistics, the Normal Distribution, also
called the Gaussian Distribution, is the most significant
continuous probability distribution.
▶ A large number of random variables, across the physical sciences and
economics, are either exactly or approximately described by the
normal distribution.
▶ In a normal distribution, the mean, median and mode are equal
(i.e., Mean = Median = Mode). The normally distributed curve
is symmetric about the centre.
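▶ A minimal Python sketch (standard library only, hypothetical parameters)
illustrating that the mean and median of normally distributed data nearly
coincide:

import random
from statistics import mean, median

random.seed(0)
sample = [random.gauss(mu=50, sigma=10) for _ in range(100_000)]
print(mean(sample), median(sample))   # nearly equal, as expected for a symmetric distribution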
Hypothesis Testing
Introduction
The purpose of hypothesis testing is to determine whether there is
enough statistical evidence in favor of a certain belief about a
parameter.
A hypothesis is a preliminary or tentative explanation or postulate
by the researcher of what the researcher considers the outcome of an
investigation will be. It is an informed/educated guess.
It indicates the expectations of the researcher regarding certain
variables. It is the most specific way in which an answer to a
problem can be stated.
What is a Hypothesis?
1. Descriptive Hypotheses:
These are propositions that describe the characteristics
(such as size, form or distribution) of a variable. The variable
may be an object, person, organization, etc.,
e.g., The rate of unemployment among arts graduates is
higher than that of commerce graduates. The
educational system is not oriented to human resource
needs of a country.
2. Relational Hypotheses.
These are propositions which describe the relationship
between two variables.
3. Causal Hypotheses
These state that the existence of, or a change in, one variable causes or
leads to an effect on another variable.
The first variable is called the independent variable, and the latter
the dependent variable.
When dealing with causal relationships between variables, the
researcher must consider the direction in which such relationships
flow, e.g., which variable is the cause and which is the effect.
4. Working Hypotheses
While planning the study of a problem, hypotheses are formed.
Initially they may not be very specific. In such cases, they are
referred to as ‘working hypotheses’ which are subject to
modification as the investigation proceeds.
Cont…
5. Null Hypotheses
These hypotheses are formulated for testing statistical
significance, since this form is a convenient approach to
statistical analysis; the test seeks to nullify (reject) the null hypothesis.
6. Statistical Hypotheses
These are statements about a statistical population.
These are derived from a sample. These are
quantitative in nature in that they are numerically
measurable
e.g., ‘Group A is older than Group B’.
Cont…
7. Common Sense Hypotheses
8. Complex Hypotheses
These aim at testing the existence of logically
derived relationships between empirical uniformities.
e.g., In the early stage human ecology described empirical
uniformities in the distribution of land values, industrial
concentrations, types of business and other phenomena.
9. Analytical Hypotheses:
These are concerned with the relationship of analytical variables.
These hypotheses occur at the highest level of
abstraction.
These specify relationship between changes in one
property and changes in another.
E.g., the study of human fertility might show empirical
regularities by wealth, education, region, and religion.
Characteristics of a Good Hypothesis
Conceptual Clarity
Specificity
Testability
Availability of Techniques
Theoretical relevance
Consistency
Objectivity
Simplicity
Sources of Hypotheses
Theory
Observation
Analogies
Intuition and personal experience
Findings of studies
State of Knowledge
Culture
Continuity of Research
Steps for Hypothesis Testing
Step 1: Formulate H0 and H1
H0: p = 0.40
H1: p ≠ 0.40
Step 2: Select an Appropriate Test
The test statistic measures how close the
sample has come to the null hypothesis.
The test statistic often follows a well-known
distribution (e.g., normal, t, or chi-square).
In our example, the z statistic, which follows
the standard normal distribution, would be
appropriate.
z = (p̂ − p) / σp
where σp is the standard deviation of the sample proportion.
Step 3: Choose the Level of Significance
Type I Error
Occurs if the null hypothesis is rejected when it is in
fact true.
The probability of type I error ( α ) is also called the
level of significance.
Type II Error
Occurs if the null hypothesis is not rejected when it is in
fact false.
The probability of type II error is denoted by β .
Unlike α, which is specified by the researcher, the
magnitude of β depends on the actual value of the
population parameter (proportion).
[Figure: standard normal curve with shaded area 0.9699 to the left of
zCAL = 1.88 and unshaded area 0.0301 in the right tail]
Step 4: Collect Data and Calculate the Test Statistic
The required data are collected and the
value of the test statistic computed.
In our example, 30 people were surveyed
and 17 shopped on the internet. The
value of the sample proportion is
p̂ = 17/30 = 0.567.
The value of σp is:
σp = √(p(1 − p)/n) = √(0.40 × 0.60 / 30) ≈ 0.089
Step 4: Collect Data and Calculate the Test Statistic (cont.)
The test statistic z can be calculated as follows:
zCAL = (p̂ − p) / σp = (0.567 − 0.40) / 0.089 = 1.88
Step 5: Determine the Probability Value / Critical Value
Using standard normal tables (Table 2 of the Statistical
Appendix), the area to the right of zCAL = 1.88 is 0.0301.
The shaded area between 0 and 1.88 is 0.4699; therefore,
the area to the right of 1.88 is 0.5 − 0.4699 = 0.0301.
Thus, the p-value is 0.0301.
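▶ A minimal Python sketch of this calculation (standard library only). Using the
unrounded sample proportion gives a z value slightly below the rounded 1.88
shown above, and a right-tail area of roughly 0.03:

from math import sqrt, erf

p0, n = 0.40, 30                         # hypothesised proportion and sample size
p_hat = 17 / 30                          # sample proportion, ~0.567

sigma_p = sqrt(p0 * (1 - p0) / n)        # ~0.089
z = (p_hat - p0) / sigma_p               # ~1.88 (about 1.86 without intermediate rounding)

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

print(z, 1 - normal_cdf(z))              # area to the right of z, roughly 0.03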
t = (X̄ − μ) / sX̄
The t statistic is t-distributed with n − 1 degrees of freedom,
where sX̄ = s/√n is the estimated standard error of the mean.
[Figure: classification of hypothesis tests into Tests of Association and
Tests of Differences]
For a test of differences in means:
H0: μ1 = μ2
H1: μ1 ≠ μ2
[Table: t test output reporting the number of cases, mean, and standard
deviation for each group, under both “Equal Variances Assumed” and
“Equal Variances Not Assumed”; reported test value 15.507, Sig. 0.000]
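▶ A minimal sketch (assumes SciPy is available; the group data are hypothetical)
showing the two-sample t test computed both with equal variances assumed and
with equal variances not assumed (Welch’s test):

from scipy import stats

group_1 = [23, 25, 28, 30, 32, 35, 36, 40]
group_2 = [20, 21, 22, 24, 25, 26, 27, 28]

# Equal variances assumed (classic pooled t test)
t_pooled, p_pooled = stats.ttest_ind(group_1, group_2, equal_var=True)

# Equal variances not assumed (Welch's t test)
t_welch, p_welch = stats.ttest_ind(group_1, group_2, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)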