Chapter 2 Descriptive Statistics
CHAPTER DISCUSSIONS
e.g.
Contingency table / Two-way table
e.g.
Distribution of students according to sex and programme at College XYZ in 2004

                         Sex
Programme            Male   Female    Total
Business Studies      270      180      450
Banking Studies       125      125      250
Art and Design         10       90      100
Accountancy            80      120      200
Total                 485      515    1,000
Source: College XYZ
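The marginal totals of a two-way table can be checked by summing across rows and down columns. The sketch below does this for the College XYZ counts; the variable names (`counts`, `row_totals`, `col_totals`, `grand_total`) are illustrative choices, not part of the original notes.

```python
# Cell counts copied from the College XYZ table (2004).
counts = {
    "Business Studies": {"Male": 270, "Female": 180},
    "Banking Studies": {"Male": 125, "Female": 125},
    "Art and Design": {"Male": 10, "Female": 90},
    "Accountancy": {"Male": 80, "Female": 120},
}

# Row totals: number of students in each programme.
row_totals = {prog: sum(by_sex.values()) for prog, by_sex in counts.items()}

# Column totals: number of students of each sex.
col_totals = {
    sex: sum(by_sex[sex] for by_sex in counts.values())
    for sex in ("Male", "Female")
}

# Grand total: every student counted once.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals must both add up to the same grand total, which is a quick consistency check on any contingency table.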
Bar charts
ii) Multiple bar chart
Pie charts
e.g. Use the data in example 5 and draw a pie chart for courses attended in 2001.
Answer
Histogram
A histogram is a bar graph of a frequency distribution. It consists of a set of vertical bars
placed contiguously, side by side.
Stem-and-leaf
The stem-and-leaf display is a valuable tool for organizing a set of data and understanding
how the values distribute and cluster over the range of observations in the data set. Because
it keeps the values of the original observations, it shows the shape of the distribution
without losing any data. The display separates data entries into stems (leading digits on the
left) and leaves (trailing digits on the right).
12, 15, 16, 20, 25, 25, 26, 27, 30, 31, 33, 38, 42, 42, 43, 45, 46, 49, 50, 50.
Answer
Unit = 10
1 | 2 represents 12
1 | 2 5 6
2 | 0 5 5 6 7
3 | 0 1 3 8
4 | 2 2 3 5 6 9
5 | 0 0
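The display for the twenty observations listed above can be rebuilt programmatically. This is a minimal sketch that assumes non-negative two-digit integers, so the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is a hypothetical helper.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits).

    Assumes non-negative two-digit integers (stem unit = 10).
    """
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return dict(groups)

data = [12, 15, 16, 20, 25, 25, 26, 27, 30, 31, 33, 38,
        42, 42, 43, 45, 46, 49, 50, 50]

display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Sorting the data first guarantees that the leaves on each row appear in ascending order, as in the hand-drawn display.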
Answer
Unit = 100
4 | 9 represents 490
4 | 9
5 | 5 7 9
6 | 2 4
7 | 1
8 | 3
A measure of central tendency is a number (or a character) that is used to represent a data
set. It is also known as the average. There are 3 measures: the arithmetic mean, the median,
and the mode.
The arithmetic mean, or simply the mean, is the sum of all the observations divided by the
number of observations. It is also known as the simple average. The data used to compute the
mean must be at least at the interval level of measurement.
Ungrouped data
\[ \bar{X} = \frac{\sum X}{n} \]
Answer
\[ \bar{X} = \frac{\sum X}{n} = \frac{110 + 112 + 98 + 100 + 115 + 95 + 100}{7} = \frac{730}{7} = 104.3 \]
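The same arithmetic can be verified in a few lines. This is only a sketch of the definition (sum divided by count) applied to the seven observations in the answer; the variable names are illustrative.

```python
# The seven observations used in the worked answer.
values = [110, 112, 98, 100, 115, 95, 100]

total = sum(values)   # sum of all observations
n = len(values)       # number of observations
mean = total / n      # arithmetic mean

print(total, n, round(mean, 1))
```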
Median ($\tilde{X}$)
The median is the value located at the center of an ordered distribution. As such, 50% of
the observations are below the median and 50% are above the median. The data must be at
least at the ordinal level of measurement.
Answer
Array X : 95, 98, 100, 100, 110, 112, 115
\[ \text{Location of } \tilde{X} = \frac{n+1}{2} = \frac{7+1}{2} = 4 \]
∴ $\tilde{X} = 100$
Answer
\[ \text{Location of } \tilde{X} = \frac{n+1}{2} = \frac{8+1}{2} = 4.5 \]
\[ \therefore \tilde{X} = \frac{90 + 100}{2} = 95 \]
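The location rule used in both answers, (n + 1)/2, can be sketched as a small function. When the location is a whole number the median is the value at that position; when it ends in .5 the median is the average of the two neighbouring values. The function name `median_by_location` and the four-value list are hypothetical, chosen only to illustrate the even case.

```python
def median_by_location(values):
    """Return (location, median) using the (n + 1) / 2 location rule."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2               # 1-based position in the array
    lower = ordered[int(location) - 1]   # value at, or just below, the location
    if location.is_integer():
        return location, lower
    upper = ordered[int(location)]       # value just above the location
    return location, (lower + upper) / 2

# Odd n: the seven observations from the first answer.
print(median_by_location([95, 98, 100, 100, 110, 112, 115]))

# Even n: a made-up list, not taken from the notes.
print(median_by_location([1, 2, 3, 4]))
```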
The Mode ($\hat{X}$)
The mode is the value (or character) of the observation that appears most frequently. It is
the most popular value (or character). It can be computed for all levels of measurement of
data: nominal, ordinal, interval, and ratio.
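Because the mode only requires counting, it works for numbers and for categories alike. The sketch below counts occurrences with `collections.Counter` and returns every value tied for the highest frequency; the helper name `modes` and the list of category labels are hypothetical illustrations.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Interval-level data: the array used in the median example.
print(modes([95, 98, 100, 100, 110, 112, 115]))

# Nominal-level data: made-up category labels, for illustration only.
print(modes(["Male", "Female", "Female"]))
```

Returning a list rather than a single value makes it visible when a data set has more than one mode.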
Answer
Since $\bar{X} > \tilde{X} > \hat{X}$, the distribution is said to be positively skewed. This means
that most students have weights less than 60 kg.
MEASURES OF VARIATION
Variations of data in each set can be seen clearly. The variations can be measured by:
(i) Range
(ii) Standard deviation and Variance
(iii) Coefficient of Variation
Range (R)
\[ R = X_{\text{largest}} - X_{\text{smallest}} \]
e.g.
For the above data sets, compute range for each of them.
Answer
Data set A : R = 100 – 100 = 0
Data set B : R = 105 – 90 = 15
Data set C : R = 140 – 70 = 70
Standard deviation ($S$) and Variance ($S^2$)
The standard deviation is the square root of the variance; equivalently, the variance is the
square of the standard deviation, i.e.
\[ S = \sqrt{\frac{1}{n-1}\left[\sum X^2 - \frac{\left(\sum X\right)^2}{n}\right]} \]
e.g.
X : 17, 17, 18, 19, 19, 18, 20, 23, 19
Calculate the standard deviation.
Answer
\[ S = \sqrt{\frac{1}{n-1}\left[\sum X^2 - \frac{\left(\sum X\right)^2}{n}\right]} = \sqrt{\frac{1}{9-1}\left[3238 - \frac{(170)^2}{9}\right]} = 1.83 \]
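The computational formula can be checked step by step. The sketch below computes the sum of X, the sum of X squared, and then the sample variance and standard deviation for the nine observations; the helper name `sample_std` is hypothetical. The result is compared with `statistics.stdev` from the Python standard library, which also uses the n − 1 divisor.

```python
import math
import statistics

def sample_std(values):
    """Return (sum_x, sum_x2, variance, std) using the computational formula."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    return sum_x, sum_x2, variance, math.sqrt(variance)

x = [17, 17, 18, 19, 19, 18, 20, 23, 19]
sum_x, sum_x2, variance, s = sample_std(x)

print(sum_x, sum_x2, round(s, 2))
print(round(statistics.stdev(x), 2))  # cross-check with the standard library
```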
Whenever two samples have the same units of measure, the variance and standard
deviation for each can be compared directly. For example, suppose an automobile dealer
wanted to compare the standard deviation of miles driven for the cars she received as trade-
ins on new cars. She found that for a specific year, the standard deviation for Buicks was
422 miles and the standard deviation for Cadillacs was 350 miles. She could say that the
variation in mileage was greater in the Buicks. But what if a manager wanted to compare the
standard deviations of two different variables, such as the number of sales per salesperson
over a 3-month period and the commissions made by these salespeople?
A statistic that allows you to compare standard deviations when the units are different, as in
this example, is called the coefficient of variation. Coefficient of variation is used to compare
dispersion between data sets.
\[ CV = \frac{S}{\bar{X}} \times 100\% \]
Interpretation of CV:
(i) A data set with a larger CV is more dispersed than a data set with a smaller CV.
(ii) A data set with a smaller CV is more consistent (or more uniform) than a data set with
a larger CV.
e.g. Compare the consistency of number of hours spent watching television among the
3 locations below.
Residential
\[ \bar{X} = \frac{678}{20} = 33.9 \]
\[ S = \sqrt{\frac{1}{20-1}\left[23656 - \frac{(678)^2}{20}\right]} = 5.95 \]
\[ CV = \frac{5.95}{33.9} \times 100\% = 17.6\% \]
Industrial
\[ \bar{X} = \frac{201}{8} = 25.13 \]
\[ S = \sqrt{\frac{1}{8-1}\left[6677 - \frac{(201)^2}{8}\right]} = 15.25 \]
\[ CV = \frac{15.25}{25.13} \times 100\% = 60.7\% \]
Town
\[ \bar{X} = \frac{228}{12} = 19 \]
\[ S = \sqrt{\frac{1}{12-1}\left[5296 - \frac{(228)^2}{12}\right]} = 9.36 \]
\[ CV = \frac{9.36}{19} \times 100\% = 49.3\% \]
The distribution of time spent watching television in the Residential area is the most
consistent. The least consistent is the Industrial area.
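The three comparisons can be reproduced from the summary figures that appear in the answers (n, the sum of X, and the sum of X squared for each location). This is a sketch only; the helper name `cv_from_sums` is hypothetical. Because the code does not round the mean and standard deviation before dividing, a CV may differ from the hand calculation in the last decimal place.

```python
import math

def cv_from_sums(n, sum_x, sum_x2):
    """Return (mean, std, cv_percent) from n, sum of X and sum of X squared."""
    mean = sum_x / n
    std = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
    return mean, std, std / mean * 100

# (n, sum of X, sum of X squared) as they appear in the worked answers.
summaries = {
    "Residential": (20, 678, 23656),
    "Industrial": (8, 201, 6677),
    "Town": (12, 228, 5296),
}

results = {name: cv_from_sums(*sums) for name, sums in summaries.items()}
for name, (mean, std, cv) in results.items():
    print(f"{name}: mean={mean:.2f}, S={std:.2f}, CV={cv:.1f}%")

# Smallest CV = most consistent; largest CV = least consistent.
most_consistent = min(results, key=lambda name: results[name][2])
least_consistent = max(results, key=lambda name: results[name][2])
print(most_consistent, least_consistent)
```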
MEASURES OF SKEWNESS
\[ \text{Coefficient of skewness 1} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]
\[ \text{Coefficient of skewness 2} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \]
e.g.
The following information was calculated for the distribution of money received by 50
children.
$\bar{X} = 115.60$; $\tilde{X} = 115.33$; $\hat{X} = 114.55$; $S = 14.49$
Determine the shape of the distribution by computing the coefficient of skewness.
Answer
\[ \text{Coefficient of skewness 1} = \frac{\bar{X} - \hat{X}}{S} = \frac{115.60 - 114.55}{14.49} = 0.07 \approx 0 \]
The distribution is said to be approximately symmetrical.
p.s. The computation can also be done using the second coefficient of skewness.
\[ \text{Coefficient of skewness 2} = \frac{3(\bar{X} - \tilde{X})}{S} = \frac{3(115.60 - 115.33)}{14.49} = 0.06 \approx 0 \]
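Both coefficients can be computed together from the four summary figures. A minimal sketch, with `pearson_skewness` as a hypothetical helper name:

```python
def pearson_skewness(mean, median, mode, std):
    """Return (coefficient 1, coefficient 2) of skewness."""
    sk1 = (mean - mode) / std          # uses the mode
    sk2 = 3 * (mean - median) / std    # uses the median
    return sk1, sk2

# Summary figures for the money received by 50 children.
sk1, sk2 = pearson_skewness(mean=115.60, median=115.33, mode=114.55, std=14.49)
print(round(sk1, 2), round(sk2, 2))
```

Both values are close to zero, which supports the conclusion that the distribution is approximately symmetrical. A clearly positive value would indicate positive skew and a clearly negative value negative skew.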
MEASURES OF POSITION
Quartiles
Quartiles divide an array into 4 equal parts. Thus there are 3 quartiles: the first quartile ($Q_1$),
the second quartile ($Q_2$), and the third quartile ($Q_3$).
[Diagram: an array divided into four equal parts at Q1, Q2 and Q3]
e.g.
Find the first, second and third quartiles for the following data.
X: 95, 98, 100, 100, 110, 112, 115
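Answer
\[ \text{Location of } Q_1 = \frac{n+1}{4} = \frac{7+1}{4} = 2 \text{ (2nd value)} \]
∴ $Q_1 = 98$
\[ \text{Location of } Q_2 = \frac{n+1}{2} = \frac{7+1}{2} = 4 \text{ (4th value)} \]
∴ $Q_2 = 100$
\[ \text{Location of } Q_3 = \frac{3(n+1)}{4} = \frac{3(7+1)}{4} = 6 \text{ (6th value)} \]
∴ $Q_3 = 112$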
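The three location formulas can be sketched as a function. The example in the notes only covers the case where every location is a whole number, so this sketch deliberately raises an error otherwise rather than guessing an interpolation rule; the name `quartiles_by_location` is hypothetical.

```python
def quartiles_by_location(values):
    """Return (Q1, Q2, Q3) using locations (n+1)/4, (n+1)/2 and 3(n+1)/4.

    Only whole-number locations are handled; other cases raise ValueError
    because the worked example does not show how to treat them.
    """
    ordered = sorted(values)
    n = len(ordered)
    locations = [(n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4]
    result = []
    for loc in locations:
        if not loc.is_integer():
            raise ValueError(f"location {loc} is not a whole number")
        result.append(ordered[int(loc) - 1])  # convert 1-based location to index
    return tuple(result)

data = [95, 98, 100, 100, 110, 112, 115]
q1, q2, q3 = quartiles_by_location(data)
print(q1, q2, q3)
```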
Box-and-Whisker Plot
The Box-and-Whisker Plot is used to display some information on measures of position and
to determine the shape of the distribution. These plots involve five specific values: the
minimum data value, Q1, the median (Q2), Q3, and the maximum data value.
A boxplot is a graph of a data set obtained by drawing a horizontal line from the minimum
data value to Q1, drawing a horizontal line from Q3 to the maximum data value, and drawing
a box whose vertical sides pass through Q1 and Q3 with a vertical line inside the box
passing through the median or Q2.
1. Find the five-number summary for the data values, that is, the maximum and minimum
data values, Q1 and Q3, and the median.
2. Draw a horizontal axis with a scale such that it includes the maximum and minimum data
values.
3. Draw a box whose vertical sides go through Q1 and Q3, and draw a vertical line through
the median.
4. Draw a line from the minimum data value to the left side of the box and a line from the
maximum data value to the right side of the box.
Information obtained from a Box Plot.
1. a. If the median is near the center of the box, the distribution is approximately symmetric.
b. If the median falls to the left of the center of the box, the distribution is positively
skewed.
c. If the median falls to the right of the center, the distribution is negatively skewed.
2. a. If the lines are about the same length, the distribution is approximately symmetric.
b. If the right line is longer than the left line, the distribution is positively skewed.
c. If the left line is longer than the right line, the distribution is negatively skewed.
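The two sets of rules above can be sketched as a function that reads the shape from a five-number summary. As a simplifying assumption, "near the center" and "about the same length" are treated here as exact equality, which is stricter than a visual judgement; the name `box_shape` is hypothetical. The figures used are the minimum, quartiles and maximum of the array 95, 98, 100, 100, 110, 112, 115 from the quartile example.

```python
def box_shape(minimum, q1, median, q3, maximum):
    """Return (shape by median position, shape by whisker lengths).

    Simplification: 'near' and 'about the same' are tested as exact equality.
    """
    def classify(left, right):
        # left < right means the longer part is on the right: positive skew.
        if left < right:
            return "positively skewed"
        if left > right:
            return "negatively skewed"
        return "approximately symmetric"

    # Rule 1: position of the median inside the box.
    by_median = classify(median - q1, q3 - median)
    # Rule 2: length of the left whisker versus the right whisker.
    by_whiskers = classify(q1 - minimum, maximum - q3)
    return by_median, by_whiskers

print(box_shape(minimum=95, q1=98, median=100, q3=112, maximum=115))
```

For this small data set the two rules do not agree: the median sits to the left of the box center, while the whiskers have equal length. With real data the rules are usually read together rather than in isolation.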