Topic 2- Descriptive_statistics
• Pictorially, we may use any graph or chart suited to the nature or type of the data:
bar charts, pie charts, dot plots, histograms, etc.
• In summary form, we may use measures of central tendency, variation, position and shape to
give insights into our population or sample.
• With descriptive statistics, the results are limited to the sample or population data
provided and are not used to generalize to the entire population or to other populations.
Outline
• Raw data
• Summary measures
➢Measures of central tendency
➢Measures of variation
➢Measures of position
➢Measures of shape
Raw Data
• Table 1: Ages of 15 respondents in a survey
29 27 35 24 40
33 28 29 32 45
25 47 23 30 38
Organizing and Graphing Data
In this section we learn how to organize and display data using
tables and graphs. We will learn how to prepare frequency
distribution tables for qualitative and quantitative data; how to
construct bar graphs, pie charts, histograms, and polygons for
such data.
Frequency distribution
A frequency distribution exhibits how the frequencies are distributed over various categories.

• Table 3: Frequency distribution of marital status of respondents
Status          Tally      Frequency   Relative frequency   Percentage
Single (S)      ////       5           0.33                 33.33
Married (M)     //////     7           0.47                 46.67
Divorced (D)    /          1           0.07                 6.67
Widowed (W)     //         2           0.13                 13.33
Total                      15          1.0                  100.0

• Table 4: Frequency distribution of ages of respondents
Ages            Tally      Frequency   Relative frequency   Percentage
23 - 31         ///////    8           0.53                 53.33
32 - 40         ////       5           0.33                 33.33
41 - 49         //         2           0.13                 13.33
Total                      15          1.0                  100.0
$$\text{Relative frequency of category } i = \frac{\text{frequency of category } i \; (f_i)}{\text{sum of all frequencies } (n)}$$
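The relative frequency formula can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` is hypothetical, the ages are those of Table 1, and the class limits are read from Table 4 and treated as inclusive at both ends.

```python
from collections import Counter

# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]

# Class limits taken from Table 4 (assumed inclusive on both ends).
classes = [(23, 31), (32, 40), (41, 49)]


def frequency_table(values, class_limits):
    """Return rows of (class label, frequency, relative frequency, percentage)."""
    n = len(values)
    counts = Counter()
    for v in values:
        for lo, hi in class_limits:
            if lo <= v <= hi:
                counts[(lo, hi)] += 1
                break
    rows = []
    for lo, hi in class_limits:
        f = counts[(lo, hi)]
        # relative frequency = f_i / n; percentage = relative frequency * 100
        rows.append((f"{lo} - {hi}", f, f / n, 100 * f / n))
    return rows


table = frequency_table(ages, classes)
for label, f, rel, pct in table:
    print(f"{label:8} {f:3d} {rel:6.2f} {pct:7.2f}")
```

The three frequencies it produces (8, 5 and 2) match Table 4, and the relative frequencies sum to 1.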
Graphical Presentation of Data
Data can be displayed graphically using any of numerous charts and graphs, such as line graphs,
bar graphs, pie charts, histograms, polygons, and stem-and-leaf displays.
Graphical presentation of qualitative data
For qualitative data the two most commonly used graphical displays are the bar graph and the pie chart.
[Figure: Bar chart, "Frequency distribution of marital status of respondents". Horizontal axis: Status (Single, Married, Divorced, Widowed); vertical axis: Frequency.]
[Figure: Pie chart of the marital status percentages.]
Summary Measures
In this section we learn how to compute and use numerical summary
measures, often referred to as “typical values”, that identify the
center and spread of a distribution and reveal many of its important
features. The summary measures considered include measures of
location (central tendency), spread (variation), position and shape.
Measures of central tendency
These are measures that describe the center of a distribution; that is, they give
the center of a histogram or a frequency distribution curve. This section
discusses the three measures of central tendency (mean, median, and mode) for
both grouped and ungrouped data. Types of mean other than the arithmetic mean
will also be mentioned. We will also identify the situations in which each
measure is most appropriate to use.
Mean
• Also known as the arithmetic mean, it is the most
frequently used measure of central tendency.
• It is also sometimes referred to as the ‘average’ or
‘expectation’.
• It is generally obtained as
$$\text{mean} = \frac{\text{sum of all values}}{\text{number of values}}$$
Mean
• When the mean of a variable, x, is computed for sample data of size n, it is called the sample mean and
denoted by $\bar{x}$.
• When the mean of a variable, x, is computed for a population, it is called the population mean and denoted by
$\mu$.
• For ungrouped data, $\bar{x} = \frac{\sum x}{n}$
• For grouped data, $\bar{x} = \frac{\sum f x}{\sum f}$, where x is the class midpoint and f is the class frequency.
• Note: For any data, the sum of all values is equal to the product of the sample size and the mean; that
is, $\sum x = n\bar{x}$.
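As a minimal sketch of both formulas, the snippet below computes the mean of the ages in Table 1 directly and again from the grouped classes of Table 4. The function names are illustrative, and the class midpoints are computed as (lower limit + upper limit) / 2, which is the usual convention rather than something stated on the slides.

```python
# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def mean_ungrouped(values):
    """Sample mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)


def mean_grouped(midpoints, frequencies):
    """Grouped mean: sum(f * x) / sum(f), with x the class midpoints."""
    total = sum(f * x for x, f in zip(midpoints, frequencies))
    return total / sum(frequencies)


# Classes 23-31, 32-40, 41-49 from Table 4; midpoints assumed as (lower + upper) / 2.
midpoints = [27, 36, 45]
frequencies = [8, 5, 2]

x_bar = mean_ungrouped(ages)
x_bar_grouped = mean_grouped(midpoints, frequencies)
print(round(x_bar, 2), round(x_bar_grouped, 2))  # 32.33 32.4
```

The grouped mean (32.4) differs slightly from the ungrouped mean (32.33) because grouping replaces each observation with its class midpoint.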
Combined mean
• This is used to obtain the mean of two or more data sets. Once the means and sample
sizes of the data sets are known, the combined mean can be computed as follows:
$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k}$$
• where $n_1, n_2, \dots, n_k$ are the sample sizes of the data sets and $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$ are
the corresponding means of the data sets.
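The combined mean is a weighted average of the individual means, with the sample sizes as weights. A small sketch follows, using two samples of sizes 10 and 8 with means 140 and 160 (the figures of the first application in the next section); `combined_mean` is a hypothetical helper name.

```python
def combined_mean(sizes, means):
    """Combined mean of several data sets from their sizes and means."""
    # Each product n_i * mean_i recovers the sum of values in data set i.
    total = sum(n * m for n, m in zip(sizes, means))
    return total / sum(sizes)


result = combined_mean([10, 8], [140, 160])
print(round(result, 2))  # 148.89
```

Note that the result is not the simple average of 140 and 160, because the larger sample carries more weight.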
Applications
• Suppose a sample of 10 statistics books gave a mean price of ₵140 and a sample of 8 mathematics books gave a
mean price of ₵160. Find the combined mean.
• Twenty business majors and 18 economics majors go bowling. Each student bowls one game. The scorekeeper
announces that the mean score for the 18 economics majors is 144 and the mean score for the entire group of 38
students is 150. Find the mean score for the 20 business majors.
• Suppose the average amount of money spent on shopping by 10 persons during a given week is ₵105.50. Find the
total amount of money spent on shopping by these 10 persons.
• The mean 2009 income for five families was ₵ 99,520. What was the total 2009 income of these five families?
• The mean age of six persons is 46 years. The ages of five of these six persons are 57, 39, 44, 51, and 37 years,
respectively. Find the age of the sixth person.
The SAT scores of 12 students who sat for the exam are as follows:
1113 2009 1374 1137 2110 1086 1166 1039 1673 2300 1139 5490
a. Calculate the mean and median for these data.
b. Identify the outlier in this data set. Drop the outlier and recalculate the mean and median. Which of these two summary
measures changes by a larger amount when you drop the outlier?
c. Which is the better summary measure for these data, the mean or the median? Explain.
The yearly salaries of all employees who work for a company have a mean of $62,350 and a standard deviation
of $6820. The years of experience for the same employees have a mean of 15 years and a standard
deviation of 2 years. Is the relative variation in the salaries larger or smaller than that in years of experience
for these employees?
The following table gives information on the amounts (in Ghana cedis) of electric bills for August 2022 for a sample of 50
families.
Amount of Electric Bill (in Ghana cedis)    Number of Families
0 to less than 40                           5
40 to less than 80                          16
80 to less than 120                         11
120 to less than 160                        10
160 to less than 200                        8
Find the mean, variance, and standard deviation.
The following data give the weights (in pounds) lost by 15 teachers who are members of a health club, at the end of
2 months after joining the club.
5 10 8 7 25 12 5 14 11 10 21 9 8 11 18
a. Compute the values of the three quartiles and the interquartile range.
b. Calculate the (approximate) value of the 82nd percentile.
Median
• It is the value of the middle term of a ranked dataset.
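The definition can be illustrated with the ages from Table 1. The sketch below ranks the data and picks the middle term; for an even number of observations it averages the two middle terms, which is the usual convention and an assumption here, since the slides do not state it.

```python
# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def median(values):
    """Value of the middle term of the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]  # odd n: the single middle term
    # even n (assumed convention): average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


print(median(ages))  # 30
```

With 15 ranked ages, the middle term is the 8th one, which is 30.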
Median
• For grouped data,
$$\text{median} = l_m + \frac{w}{f_m}\left(\frac{n}{2} - F\right)$$
where
• $l_m$ is the lower class boundary of the median class
• $w$ is the class width of the median class
• $f_m$ is the frequency of the median class
• $F$ is the cumulative frequency of the pre-median class
• $n$ is the sample size
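A possible sketch of the grouped-data formula, applied to the age classes of Table 4. Two assumptions are made that the slides do not spell out: the median class is taken as the first class whose cumulative frequency reaches n/2, and the class boundaries are obtained by extending the class limits by 0.5 on each side (22.5 to 31.5, and so on). The function name `grouped_median` is hypothetical.

```python
def grouped_median(boundaries, frequencies):
    """Median for grouped data: l_m + (w / f_m) * (n / 2 - F).

    boundaries: list of (lower, upper) class boundaries, in increasing order.
    frequencies: class frequencies in the same order.
    """
    n = sum(frequencies)
    cumulative = 0  # F: cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(boundaries, frequencies):
        # Assumed rule: the median class is the first one reaching n / 2.
        if f > 0 and cumulative + f >= n / 2:
            width = upper - lower
            return lower + (width / f) * (n / 2 - cumulative)
        cumulative += f
    raise ValueError("frequencies must contain at least one positive count")


# Classes 23-31, 32-40, 41-49 with assumed boundaries at +/- 0.5.
boundaries = [(22.5, 31.5), (31.5, 40.5), (40.5, 49.5)]
frequencies = [8, 5, 2]

print(grouped_median(boundaries, frequencies))  # 30.9375
```

Here n/2 = 7.5 falls in the first class, so the median is 22.5 + (9/8)(7.5 - 0) = 30.9375, close to the ungrouped median of 30.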
Mode
• It is the most frequently occurring number or observation in the dataset.
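A short sketch of finding the mode, using the ages of Table 1 and the marital status frequencies of Table 3. The helper `modes` is illustrative; it returns a list because a data set can have more than one mode.

```python
from collections import Counter

# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]

# Marital status codes rebuilt from the frequencies in Table 3.
status = ["S"] * 5 + ["M"] * 7 + ["D"] * 1 + ["W"] * 2


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(ages))    # [29]
print(modes(status))  # ['M']
```

The mode is the only one of the three measures of central tendency that also applies to qualitative data such as marital status.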
Measures of dispersion
• These are measures that give the spread of a
distribution.
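• The variance of sample data, $S^2$, and of population data, $\sigma^2$, is obtained as
$$S^2 = \frac{1}{n-1}\sum (x - \bar{x})^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]$$
and
$$\sigma^2 = \frac{1}{N}\sum (x - \mu)^2 = \frac{1}{N}\left[\sum x^2 - \frac{(\sum x)^2}{N}\right].$$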
• The standard deviation, denoted by $S$ for sample data and $\sigma$ for population data,
is obtained by taking the positive square root of the variance.
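The variance formulas can be checked numerically. The sketch below, with illustrative function names, computes the sample variance of the ages in Table 1 from the definition and from the shortcut form, and also the population variance that would apply if the 15 ages were treated as a whole population.

```python
import math

# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def sample_variance(values):
    """S^2 from the definition: sum of squared deviations over (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """S^2 from the shortcut: [sum(x^2) - (sum x)^2 / n] / (n - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


def population_variance(values):
    """sigma^2: sum of squared deviations from the mean over N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n


s2 = sample_variance(ages)
s = math.sqrt(s2)  # standard deviation: positive square root of the variance
print(round(s2, 2), round(s, 2))  # 54.24 7.36
```

Both forms of the sample variance give the same value; the shortcut form only needs the sum of the values and the sum of their squares.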
Measures of position
Quartiles
• Quartiles are three summary measures that divide a ranked data set into
four equal parts.
• The first quartile is the value of the middle term among the observations
that are less than the median.
• The second quartile is the same as the median of the data set.
• The third quartile is the value of the middle term among the observations
that are greater than the median.
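A minimal sketch of this definition, applied to the ages of Table 1. It splits the ranked data at the median position and takes the middle term of each half; the interquartile range is computed as the third quartile minus the first, which is the standard definition rather than one given on this slide. The function names are illustrative, and the sketch assumes at least two observations.

```python
# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def middle_term(ranked):
    """Median of an already ranked list."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the halves below and above the median."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]  # terms ranked below the median position
    # For odd n the median term itself is excluded from both halves.
    upper = ranked[half + 1:] if n % 2 == 1 else ranked[half:]
    return middle_term(lower), middle_term(ranked), middle_term(upper)


q1, q2, q3 = quartiles(ages)
print(q1, q2, q3, q3 - q1)  # 27 30 38 11
```

Statistical software often uses interpolation rules that can give slightly different quartiles for the same data.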
Percentiles
• These are summary measures that divide a ranked data set into 100 equal parts.
• The $k$th percentile, $P_k$, can be defined as a value in a data set such that about k% of the
measurements are smaller than the value of $P_k$ and about (100 - k)% of the
measurements are greater than the value of $P_k$.
• $P_k = \text{value of the } \left(\frac{kn}{100}\right)\text{th term in a ranked data set}$
• The percentile rank of a value, $x_i$, gives the percentage of values in the data set that are
less than $x_i$.
• $PR(x_i) = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$
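Both formulas can be sketched with the ages of Table 1. One assumption is needed that the slides do not state: when kn/100 is not a whole number, the position is rounded to the nearest term, which gives an approximate percentile. The function names are illustrative.

```python
import math

# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def percentile(values, k):
    """Approximate k-th percentile: value of the (k * n / 100)-th ranked term."""
    ranked = sorted(values)
    position = k * len(ranked) / 100
    # Assumption: a fractional position is rounded to the nearest whole term,
    # and positions below 1 fall back to the first term.
    term = max(1, math.floor(position + 0.5))
    return ranked[term - 1]


def percentile_rank(values, x):
    """Percentage of values in the data set that are less than x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


print(percentile(ages, 60))                 # 32
print(round(percentile_rank(ages, 35), 2))  # 66.67
```

For k = 60 and n = 15 the position is the 9th ranked term, which is 32; ten of the fifteen ages are below 35, giving a percentile rank of about 66.67.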
Measures of shape
• A distribution is symmetric if its right side mirrors its left side; the normal distribution is one example.
For a symmetric distribution with a single peak, the mean, median and mode are equal and the coefficient of skewness is 0.
• If the coefficient of skewness is greater than 0, then the distribution is right-skewed and the right tail is longer than the left tail. If the
coefficient of skewness is less than 0, then it is left-skewed and the left tail is longer than the right tail.
• Skewness can be measured using the moment coefficient of skewness,
Bowley’s coefficient of skewness (based on quantiles) and Karl Pearson’s measure of
skewness, among others.
Coefficient of skewness
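The formulas for this slide did not survive in the text, so the sketch below uses common textbook forms of the three measures named earlier, and these are assumptions rather than the slide's own definitions: the moment coefficient m3 / m2^(3/2) with moments divided by n, Pearson's measure 3(mean - median) / S, and Bowley's coefficient (Q3 + Q1 - 2 Q2) / (Q3 - Q1). The data are the ages of Table 1, and the quartiles 27, 30 and 38 follow from the quartile definition in the Quartiles section.

```python
import math

# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def central_moment(values, k):
    """k-th central moment with divisor n (assumed form)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** k for x in values) / n


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def pearson_skewness(values):
    """Pearson's measure (assumed form): 3 * (mean - median) / S."""
    n = len(values)
    ranked = sorted(values)
    mid = n // 2
    med = ranked[mid] if n % 2 == 1 else (ranked[mid - 1] + ranked[mid]) / 2
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return 3 * (x_bar - med) / s


def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient (assumed form): (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


print(moment_skewness(ages) > 0)                # True: right-skewed
print(round(pearson_skewness(ages), 2))         # 0.95
print(round(bowley_skewness(27, 30, 38), 2))    # 0.45
```

All three measures are positive for these data, which agrees with the mean (32.33) lying above the median (30).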
Kurtosis
• There are three forms of kurtosis: mesokurtic (kurtosis = 3, as for the normal distribution),
leptokurtic (kurtosis > 3, thicker tails) and platykurtic (kurtosis < 3, thinner tails).
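As a rough illustration of the threshold of 3, the sketch below computes the moment coefficient of kurtosis, m4 / m2^2 with moments divided by n, for the ages of Table 1 and classifies the result. The formula is a common textbook form assumed here, since the slide's own formula is not in the text, and the function names are hypothetical.

```python
import math

# Ages of the 15 respondents (Table 1).
ages = [29, 27, 35, 24, 40, 33, 28, 29, 32, 45, 25, 47, 23, 30, 38]


def kurtosis(values):
    """Moment coefficient of kurtosis (assumed form): m4 / m2 ** 2."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify_kurtosis(value):
    """Name the form of kurtosis relative to the normal value of 3."""
    if math.isclose(value, 3.0):
        return "mesokurtic"
    return "leptokurtic" if value > 3 else "platykurtic"


k = kurtosis(ages)
print(classify_kurtosis(k))  # platykurtic
```

Some software reports excess kurtosis (this value minus 3), in which case the comparison is against 0 instead of 3.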
Types of Variables and their associated Descriptive statistics
Variable type    Examples    Possible plots    Measures of location    Measures of variability    Measures of position