STATISTICS
What is Statistics?
▪ the study of the collection, analysis,
interpretation, presentation, and
organization of data.
▪ a mathematical discipline for collecting and
summarizing data
▪ a branch of applied mathematics
According to Merriam-Webster dictionary
➢ statistics is defined as “classified facts representing
the conditions of a people in a state – especially the
facts that can be stated in numbers or any other
tabular or classified arrangement”.
▪ Descriptive Statistics
▪ Inferential Statistics
❖ Collection of data
❖ Organization of data
❖ Presentation of the data
❖ Data summarization
❖ Statistical analysis
❖ Inference of data
Collection of data
- the process of measuring, gathering, and assembling the raw data upon
which the statistical investigation is to be based.
Organization of data
- summarization of data in some meaningful way, e.g., in table form.
Presentation of the data
- the process of re-organization, classification, compilation, and
summarization of data to present it in a meaningful
form.
Analysis of data
- the process of extracting relevant information from the
summarized data, mainly through the use of elementary
mathematical operations.
Inference of data
- the interpretation of the statistical measures obtained from the
analysis of the data, using methods by which conclusions are formed
and inferences are made.
Distance
- the property of distance is concerned with the
relationship of differences between objects. If a measurement
system possesses the property of distance, the unit
of measurement means the same thing throughout the scale
of numbers. That is, an inch is an inch, no matter where it falls:
immediately ahead or a mile down the road.
Fixed Zero
A measurement system possesses a rational zero (fixed
zero) if an object that has none of the attribute in question is
assigned the number zero by the system of rules.
The object does not need to really exist in the "real world",
as it is somewhat difficult to visualize a "man with no height".
The requirement for a rational zero is this: if objects with
none of the attribute did exist, would they be given the value zero?
Defining O0 as the object with none of the attribute in
question, the definition of a rational zero becomes:
Examples:
• Gender (male or female)
• Marital status (married, single, widowed, divorced)
• Nationality
• Blood type
• Zip code
• Hair color
Ordinal Scales
- are measurement systems that possess the property of order, but
not the property of distance. The property of fixed zero is not important if the
property of distance is not satisfied.
▪ Level of measurement which classifies data into categories that can be
ranked. Differences between the ranks are not meaningful.
▪ Arithmetic operations are not applicable but relational operations are
applicable.
▪ Ordering is the sole property of ordinal scale.
Examples:
• Rank (1st place, 2nd place, 3rd place, etc.)
• Agreement level (always, oftentimes, sometimes, seldom, and never)
• Educational level (primary, secondary, and tertiary)
• Income (low, medium, and high)
Interval Scales
- are measurement systems that possess the properties of order and
distance, but not the property of fixed zero.
• Level of measurement which classifies data that can be ranked and
differences are meaningful. However, there is no meaningful zero, so
ratios are meaningless.
• All arithmetic operations except division and multiplication are
applicable.
• Relational operations are also possible.
Examples:
• IQ
• Temperature (°C or °F): 0 °C doesn't mean that there is no temperature;
0 °C is the freezing point of water.
Ratio Scales
- are measurement systems that possess all three properties: order,
distance, and fixed zero. The added power of a fixed zero allows ratios of
numbers to be meaningfully interpreted.
▪ Level of measurement which classifies data that can be ranked,
differences are meaningful, and there is a true zero. True ratios exist
between the different units of measure.
▪ All arithmetic and relational operations are applicable.
Examples:
• Time (minutes or hours)
• Age (years)
• Physical measures (height, weight, BP, etc.)
• Income
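The difference between an interval scale and a ratio scale can be checked numerically. The sketch below is only an illustration: the two readings (10 °C and 20 °C) are hypothetical, and the helper names `c_to_f` and `c_to_k` are made up for this example. It shows that a difference of 10 degrees keeps its size on the Celsius and Kelvin scales, while the ratio "twice as hot" changes with the choice of zero point.

```python
# Interval vs. ratio scale: the same two temperatures on three scales.
# The readings below are hypothetical values chosen for illustration.

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0

# Differences are meaningful: same size in Celsius and in kelvins.
diff_c = warm_c - cold_c
diff_k = c_to_k(warm_c) - c_to_k(cold_c)

# Ratios depend on where zero is placed, so they are not meaningful
# on the Celsius or Fahrenheit scales.
ratio_c = warm_c / cold_c
ratio_f = c_to_f(warm_c) / c_to_f(cold_c)
ratio_k = c_to_k(warm_c) / c_to_k(cold_c)

print(f"difference: {diff_c:.2f} C, {diff_k:.2f} K")
print(f"ratio in C: {ratio_c:.3f}, in F: {ratio_f:.3f}, in K: {ratio_k:.3f}")
```

Only the Kelvin ratio refers to a scale with a fixed zero, which is why it is the only one of the three that can be interpreted.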
Methods of Data Collection and Presentation
There are two sources of data
1. Primary Data
• Data measured or collected by the investigator or the user
directly from the source.
Two activities involved: planning and measuring.
➢ Planning:
• Identify source and elements of the data.
• Decide whether to consider sample or census.
• If sampling is preferred, decide on sample size, selection
method,… etc.
• Decide measurement procedure.
• Set up the necessary organizational structure.
➢ Measuring: some of the options for collecting the primary data are:
• Focus Group
• Telephone Interview
• Mail Questionnaires
• Door-to-Door Survey
• Mall Intercept
• New Product Registration
• Personal Interview
• Experiments
2. Secondary Data
▪ Data gathered or compiled from published and unpublished
sources or files.
▪ When our source is secondary data, check that:
• The type and objective of the situation in which the data were
collected are known.
• The purpose for which the data were collected is
compatible with the present problem.
• The nature and classification of the data are appropriate to our
problem.
• There are no biases or misreporting in the published data.
Note: Data which are primary for one user may be secondary for
another.
Sampling Techniques
In statistics, there are different sampling techniques available to get relevant
results from the population. The two types of sampling methods are:
• Probability Sampling
• Non-probability Sampling
Probability Sampling Method
Systematic Sampling
Clustered Sampling
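As a rough illustration of the two probability sampling methods named above, the sketch below draws a systematic sample (every k-th unit after a random start) and a cluster sample (whole groups chosen at random). The population, the cluster names, the sample sizes, and the function names are all hypothetical and are not taken from the slides.

```python
import random

def systematic_sample(population, n, rng):
    """Pick every k-th unit after a random start, where k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)          # random start within the first interval
    return population[start::k][:n]

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(42)               # fixed seed so the run is repeatable

# Hypothetical population of 100 numbered units.
population = list(range(1, 101))
sys_sample = systematic_sample(population, 10, rng)

# Hypothetical clusters, e.g. units grouped by neighbourhood.
clusters = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}
clu_sample = cluster_sample(clusters, 2, rng)

print("systematic:", sys_sample)
print("cluster:   ", clu_sample)
```

In both cases every unit has a known chance of selection, which is what separates these methods from convenience or snowball sampling.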
Convenience Sampling
Snowball Sampling
Snowball sampling is also known as a chain-referral sampling technique. In this method,
the samples have traits that are difficult to find. So, each identified member of a
population is asked to find the other sampling units. Those sampling units also belong to
the same targeted population.
Probability sampling vs Non-probability Sampling Methods
The below table shows a few differences between probability sampling
methods and non-probability sampling methods.
Representation of Data
• Bar Graph
• Pie Chart
• Line Graph
• Pictograph
• Histogram
• Frequency Distribution
Bar Graph
represents grouped data with rectangular bars
with lengths proportional to the values that they
represent. The bars can be plotted vertically or
horizontally.
Pie Chart
A type of graph in which a circle is divided into sectors. Each of
these sectors represents a proportion of the whole.
Line Chart
The line chart is represented by a series of data points
connected with straight lines.
The data points are called ‘markers.’
Pictograph
A pictorial symbol for a word or phrase, i.e., showing data with the
help of pictures. For example, Apple, Banana, and Cherry can each have
a different count, and pictures are used to represent those counts.
Histogram
A diagram consisting of rectangles whose areas are
proportional to the frequencies of a variable and whose widths
are equal to the class intervals.
Frequency Distribution
The frequency of a data value is often represented by “f.” A
frequency table is constructed by arranging collected data
values in ascending order of magnitude with their
corresponding frequencies.
There are three basic types of frequency distributions
• Categorical frequency distribution
• Ungrouped frequency distribution
• Grouped frequency distribution
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Solution:
Since the data are categorical, discrete classes can be used. There are four
types of marital status: M, S, D, and W. These types will be used as the classes for the
distribution. We follow the procedure to construct the frequency distribution.
Step 3: Count the tally and place the result in column (3).
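The tally and count for the marital status data above can be sketched in a few lines of Python. This is only one possible way to do the counting; the variable names and the printed layout (class, tally, frequency) are assumptions made for this illustration.

```python
from collections import Counter

# The 25 marital-status observations from the example, row by row.
rows = ["MSDWD", "SSMMM", "WDSMM", "WDDSS", "SWWDD"]
observations = [code for row in rows for code in row]

# Count how many times each class (M, S, D, W) occurs.
freq = Counter(observations)
n = len(observations)

print("Class  Tally     Frequency")
for status in "MSDW":
    count = freq[status]
    print(f"{status:<6} {'|' * count:<9} {count}")
print(f"Total            {n}")
```

The frequencies come out as M = 6, S = 7, D = 7, and W = 5, which add up to the 25 observations.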
• First find the smallest and largest raw score in the collected
data.
• Arrange the data in order of magnitude and count the
frequency.
• To facilitate counting one may include a column of tallies.
Example:
80 76 90 85 80
70 60 62 70 85
65 60 63 74 75
76 70 70 80 85
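The steps above (find the smallest and largest value, arrange the data in order, count the frequency) can be sketched for these 20 values as follows. The variable names and the printed layout are assumptions made for illustration.

```python
from collections import Counter

# The 20 raw scores from the example.
scores = [80, 76, 90, 85, 80,
          70, 60, 62, 70, 85,
          65, 60, 63, 74, 75,
          76, 70, 70, 80, 85]

smallest, largest = min(scores), max(scores)
freq = Counter(scores)

print(f"smallest = {smallest}, largest = {largest}")
print("Value  Tally  Frequency")
# Arrange the distinct values in ascending order of magnitude.
for value in sorted(freq):
    print(f"{value:<6} {'|' * freq[value]:<6} {freq[value]}")
print(f"Total         {sum(freq.values())}")
```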
Definitions:
$K = 1 + 3.32 \log_{10} n$
$K = 1 + 3.32 \log_{10}(20)$
$K = 5.32$, which gives 6 classes (rounding up).
Step 4: Find the class width:
$w = \frac{R}{K} = \frac{33}{6} = 5.5$, which gives $w = 6$ (rounding up)
Step 6: Find the upper class limits; e.g. the first upper class limit = 12 - U = 12 - 1 = 11,
where U is the unit of measurement. So 11, 17, 23, 29, 35, 41 are the upper class limits.
So combining step 5 and step 6, one can construct the following classes.
Class limits
6 – 11
12 – 17
18 – 23
24 – 29
30 – 35
36 – 41
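The class construction can be reproduced with a short script. This is a sketch that uses only the numbers quoted in the steps above (n = 20, R = 33, first lower limit 6, U = 1); the raw data for this example are not shown in the text, so the script starts from those quoted values, and the variable names are assumptions.

```python
import math

n = 20            # number of observations
data_range = 33   # R = largest value - smallest value
lowest = 6        # first lower class limit
unit = 1          # U, the unit of measurement

# Number of classes: K = 1 + 3.32 log10(n), rounded up.
k = math.ceil(1 + 3.32 * math.log10(n))

# Class width: w = R / K, rounded up.
w = math.ceil(data_range / k)

# Lower limits go up in steps of w; each upper limit is the next
# lower limit minus one unit of measurement.
lower_limits = [lowest + j * w for j in range(k)]
classes = [(lo, lo + w - unit) for lo in lower_limits]

print(f"K = {k}, w = {w}")
for lo, hi in classes:
    print(f"{lo} - {hi}")
```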
The symbol $\sum_{i=1}^{n} x_i$ is a mathematical shorthand for $x_1 + x_2 + x_3 + \dots + x_n$.
For example, with $n = 5$:
$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 7 + 7 + 6 + 8 = 33$
PROPERTIES OF SUMMATION
Example:
Considering the following data
determine
X Y
5 6
7 7
7 8
6 7
8 8
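The quantities to be determined are not listed in the text, so the sketch below simply evaluates a few typical summations for the X and Y columns as an illustration. The choice of these particular sums is an assumption.

```python
# The paired data from the table above.
x = [5, 7, 7, 6, 8]
y = [6, 7, 8, 7, 8]

sum_x = sum(x)                                # sum of the X values
sum_y = sum(y)                                # sum of the Y values
sum_xy = sum(xi * yi for xi, yi in zip(x, y)) # sum of the products X*Y
sum_x_plus_y = sum(xi + yi for xi, yi in zip(x, y))

print("sum of X       =", sum_x)
print("sum of Y       =", sum_y)
print("sum of X*Y     =", sum_xy)
print("sum of (X + Y) =", sum_x_plus_y)

# The sum of a sum equals the sum of the sums, but the sum of
# products is generally not the product of the sums.
print(sum_x_plus_y == sum_x + sum_y)
print(sum_xy == sum_x * sum_y)
```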
Measures of central tendency help you find the middle, or the
average, of a dataset.
The 3 most common measures of central tendency are the mode,
median, and mean.
• Mean
• Median
• Mode
In statistics, the notation of a sample mean and a population mean and their
formulas are different. But the procedures for calculating the population and
sample means are the same.
Outlier effect on the mean
• Outliers can significantly increase or decrease the mean when they are
included in the calculation. Since all values are used to calculate the
mean, it can be affected by extreme outliers.
• An outlier is a value that differs significantly from the others in a dataset.
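A small numerical check makes the outlier effect visible. The data set below is hypothetical and chosen only for illustration; the median is computed alongside the mean for comparison.

```python
import statistics

# Hypothetical data set, first without and then with one extreme value.
values = [10, 12, 11, 13, 12]
values_with_outlier = values + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(values_with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(values_with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

In this sketch the single outlier more than doubles the mean, while the median stays at 12.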
Mean of Grouped Data Formula
• The mean formula is defined as the sum of the observations divided by the
total number of observations.
• There are two different formulas for calculating the mean for ungrouped
data and the mean for grouped data.
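• Let us look at the formula to calculate the mean of grouped data. The
formula is:
$\bar{x} = \frac{\sum f_i x_i}{N}$
Where $N = \sum f_i$ is the total number of observations, so that
$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \dots + f_n x_n}{f_1 + f_2 + f_3 + \dots + f_n} = \frac{\sum x_i f_i}{\sum f_i}$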
x̄ = 30.86
Assumed Mean Method
• A technique used to calculate the arithmetic mean for grouped data.
• In this method, an assumed mean (a value within the range of the data)
is chosen, and the deviations of the data points from this assumed mean are
determined.
• By using these deviations, the arithmetic mean is then computed,
providing an estimate of the central tendency of the grouped data.
$AM = a + \frac{\sum f_i d_i}{\sum f_i}$
Where,
a = assumed mean
fi = frequency of ith class
di = xi – a = deviation of ith class
Σfi = n = Total number of observations
xi = class mark = (upper class limit + lower class limit)/2
Example:
The following table gives information about the marks obtained by
110 students in an examination.
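Class        1 - 10   11 - 20   21 - 30   31 - 40   41 - 50
Frequency      12        28        32        25        13
Taking a = 25.5 (the class mark of the class 21 - 30), the class marks are
5.5, 15.5, 25.5, 35.5, 45.5 and the deviations d_i = x_i - a are -20, -10, 0, 10, 20, so
$\sum f_i d_i = -240 - 280 + 0 + 250 + 260 = -10$
$AM = a + \frac{\sum f_i d_i}{\sum f_i} = 25.5 + \frac{-10}{110} = 25.5 + (-0.091) = 25.409$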
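The calculation can be checked with a short script. This sketch takes the last class as 41 - 50 (so that its class mark is 45.5) and uses a = 25.5 as the assumed mean; it also computes the mean directly from the grouped-data formula to confirm that both routes agree. The variable names are assumptions.

```python
# Marks of 110 students, grouped into classes (last class taken as 41 - 50).
classes = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50)]
freqs = [12, 28, 32, 25, 13]

# Class mark = (upper class limit + lower class limit) / 2
marks = [(lower + upper) / 2 for lower, upper in classes]

a = 25.5                               # assumed mean (class mark of 21 - 30)
devs = [x - a for x in marks]          # d_i = x_i - a
n = sum(freqs)                         # total number of observations
sum_fd = sum(f * d for f, d in zip(freqs, devs))

assumed_mean_result = a + sum_fd / n

# Direct method for comparison: sum(f_i * x_i) / sum(f_i)
direct_mean = sum(f * x for f, x in zip(freqs, marks)) / n

print(f"sum of f*d = {sum_fd}")
print(f"assumed mean method: {assumed_mean_result:.3f}")   # 25.409
print(f"direct method:       {direct_mean:.3f}")           # 25.409
```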
Mean Deviation
• is defined as a statistical measure that is used to calculate the
average deviation from the mean value of the given data set.
• The mean deviation of the data values can be easily calculated using
the below procedure.
Step 1: Find the mean value for the given data values.
Step 2: Subtract the mean value from each of the data values and take the
absolute value of each difference (that is, ignore the minus sign).
Step 3: Find the mean of the values obtained in step 2.
$\text{Mean Deviation} = \frac{\sum |X - \mu|}{N}$
Where,
Σ represents the addition of values
X represents each value in the data set
µ represents the mean of the data set
N represents the number of data values
Example: Determine the mean deviation for the data values
5, 3, 7, 8, 4, 9.
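The three steps can be followed directly in code. The sketch below works through this example; the variable names are assumptions.

```python
data = [5, 3, 7, 8, 4, 9]
n = len(data)

# Step 1: find the mean of the data values.
mean = sum(data) / n

# Step 2: absolute deviation of each value from the mean.
abs_devs = [abs(x - mean) for x in data]

# Step 3: mean of those absolute deviations.
mean_deviation = sum(abs_devs) / n

print("mean =", mean)
print("absolute deviations =", abs_devs)
print("mean deviation =", mean_deviation)
```

Here the mean is 36 / 6 = 6, the absolute deviations are 1, 3, 1, 2, 2, 3 (sum 12), and the mean deviation is 12 / 6 = 2.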
Median
- the median of a dataset is the value that’s exactly in the middle when it is
ordered from low to high.
Mode
• is the most frequently occurring value in the dataset. It’s possible to have no
mode, one mode, or more than one mode.
• To find the mode, sort your dataset numerically or categorically and select the
response that occurs most frequently.
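As an illustration, the median and mode can be found with Python's standard `statistics` module. The sketch reuses the 20 scores from the ungrouped frequency distribution example; the second, smaller list is hypothetical and is there only to show a data set with more than one mode.

```python
import statistics

scores = [80, 76, 90, 85, 80,
          70, 60, 62, 70, 85,
          65, 60, 63, 74, 75,
          76, 70, 70, 80, 85]

# Median: order the data; with an even number of values the median
# is the average of the two middle values.
ordered = sorted(scores)
median = statistics.median(scores)

# Mode: the most frequently occurring value(s).
modes = statistics.multimode(scores)

# A hypothetical data set with two modes.
two_modes = statistics.multimode([1, 1, 2, 2, 3])

print("ordered:", ordered)
print("median:", median)
print("mode(s):", modes)
print("two modes:", two_modes)
```

For these scores the two middle values are 74 and 75, so the median is 74.5, and 70 is the only mode because it occurs four times.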
Quantiles
• When a distribution is arranged in order of magnitude of items, the median is the
value of the middle term.
• Measures that depend on their position in the distribution, namely quartiles,
deciles, and percentiles, are collectively called quantiles.
Quartiles
• are measures that divide the frequency distribution into four equal
parts.
• The values of the variable corresponding to these divisions are
denoted $Q_1$, $Q_2$, and $Q_3$, often called the first, the second, and the
third quartile respectively.
• $Q_1$ is a value which has 25% of the items less than or equal to it.
• Similarly, $Q_2$ has 50% of the items with values less than or equal to it and $Q_3$
has 75% of the items with values less than or equal to it.
• To find $Q_i$ (i = 1, 2, 3) we count $\frac{iN}{4}$ observations through the classes,
beginning from the lowest class.
• For grouped data we have the following formula:
$Q_i = L_{Q_i} + \left( \frac{\frac{iN}{4} - f_c}{f_Q} \right) \cdot w$
Where:
$L_{Q_i}$ = lower class boundary of the quartile class
$f_c$ = less-than cumulative frequency of the class preceding the quartile class
$f_Q$ = frequency of the quartile class
$N$ = total number of observations
$w$ = class size/width
Remark:
The quartile class (the class containing $Q_i$) is the class with the smallest
less-than cumulative frequency greater than or equal to $\frac{iN}{4}$.
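The quartile formula and the remark about locating the quartile class can be sketched as follows. The example reuses the marks of 110 students from the assumed mean example, under two assumptions that are not stated in the text: the last class is 41 - 50, and the class boundaries lie 0.5 below each lower class limit (0.5, 10.5, ...), giving a class width of 10. The function name `grouped_quartile` is made up for this illustration.

```python
def grouped_quartile(i, lower_boundaries, freqs, width):
    """Return Q_i (i = 1, 2, 3) for grouped data.

    Finds the first class whose less-than cumulative frequency is
    greater than or equal to i*N/4, then interpolates inside it:
    Q_i = L + ((i*N/4 - f_c) / f_Q) * w
    """
    n = sum(freqs)
    target = i * n / 4
    cum_before = 0                      # f_c: cumulative frequency so far
    for lower, f in zip(lower_boundaries, freqs):
        if cum_before + f >= target:    # this is the quartile class
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("i must be 1, 2 or 3")

# Classes 1-10, 11-20, 21-30, 31-40, 41-50 (assumed boundaries).
lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]
freqs = [12, 28, 32, 25, 13]
width = 10

q1 = grouped_quartile(1, lower_boundaries, freqs, width)
q2 = grouped_quartile(2, lower_boundaries, freqs, width)
q3 = grouped_quartile(3, lower_boundaries, freqs, width)

print(f"Q1 = {q1:.4f}, Q2 = {q2:.4f}, Q3 = {q3:.4f}")
```

For Q1, for instance, iN/4 = 27.5 falls in the class 11 - 20 (cumulative frequency 40), so Q1 = 10.5 + ((27.5 - 12) / 28) · 10.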
Deciles
• are measures that divide the frequency distribution into ten equal parts.
• The values of the variable corresponding to these divisions are denoted
$D_1$, $D_2$, ..., $D_9$, often called the first, the second, ..., the ninth decile
respectively.
• To find $D_i$ (i = 1, 2, ..., 9) we count $\frac{iN}{10}$ observations through the classes,
beginning from the lowest class.
• For grouped data we have the following formula:
$D_i = L_{D_i} + \left( \frac{\frac{iN}{10} - f_c}{f_Q} \right) \cdot w$
Where:
$L_{D_i}$ = lower class boundary of the decile class
$f_c$ = less-than cumulative frequency of the class preceding the decile class
$f_Q$ = frequency of the decile class
$N$ = total number of observations
$w$ = class size/width
Percentiles
• are measures that divide the frequency distribution into one hundred equal
parts.
• The values of the variable corresponding to these divisions are denoted
$P_1$, $P_2$, ..., $P_{99}$, often called the first, the second, ..., the ninety-ninth
percentile respectively.
• To find $P_i$ (i = 1, 2, ..., 99) we count $\frac{iN}{100}$ observations through the classes,
beginning from the lowest class.
• For grouped data we have the following formula:
$P_i = L_{P_i} + \left( \frac{\frac{iN}{100} - f_c}{f_Q} \right) \cdot w$
Where:
$L_{P_i}$ = lower class boundary of the percentile class
$f_c$ = less-than cumulative frequency of the class preceding the percentile class
$f_Q$ = frequency of the percentile class
$N$ = total number of observations
$w$ = class size/width
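Quartiles, deciles, and percentiles share one formula and differ only in the divisor (4, 10, or 100). The sketch below makes that explicit with a single hypothetical helper, `grouped_quantile`, whose `parts` argument is the divisor. It again uses the marks of 110 students, assuming the last class is 41 - 50 and the class boundaries are 0.5, 10.5, ..., with width 10.

```python
def grouped_quantile(i, parts, lower_boundaries, freqs, width):
    """Return the i-th quantile when the distribution is split into
    `parts` equal parts (4 = quartiles, 10 = deciles, 100 = percentiles).

    value = L + ((i*N/parts - f_c) / f) * w
    """
    if not 0 < i < parts:
        raise ValueError("i must satisfy 0 < i < parts")
    n = sum(freqs)
    target = i * n / parts
    cum_before = 0                      # f_c of the preceding classes
    for lower, f in zip(lower_boundaries, freqs):
        if cum_before + f >= target:    # class containing the quantile
            return lower + (target - cum_before) / f * width
        cum_before += f

lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]   # assumed boundaries
freqs = [12, 28, 32, 25, 13]
width = 10

d1 = grouped_quantile(1, 10, lower_boundaries, freqs, width)
d5 = grouped_quantile(5, 10, lower_boundaries, freqs, width)
p50 = grouped_quantile(50, 100, lower_boundaries, freqs, width)
p90 = grouped_quantile(90, 100, lower_boundaries, freqs, width)
q2 = grouped_quantile(2, 4, lower_boundaries, freqs, width)

print(f"D1 = {d1:.4f}, D5 = {d5:.4f}")
print(f"P50 = {p50:.4f}, P90 = {p90:.4f}")
print(f"Q2 = {q2:.4f}")
```

The fifth decile, the fiftieth percentile, and the second quartile all coincide, since each of them is the median of the grouped data.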