Chapter 1 and Introduction Notes
Step 1: Arrange the data in ascending order
Step 2: Choose the stem and leaf units (make a key)
Step 3: Plot
Example: Consider the following data set
3, 3, 11, 12, 16, 16, 17, 21, 22, 23, 25, 36, 37, 42, 61
Step 2: Determine what the stem and leaf will be.
Usually the leaf contains the last digit in the number and the stem contains all
the other digits.
Notice that the largest value is made up of “tens” and “ones”
61 = 6 tens and 1 one
So make the stem “tens” and the leaves “ones”
3, 3, 11, 12, 16, 16, 17, 21, 22, 23, 25, 36, 37, 42, 61
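The three steps can be sketched in a few lines of Python for the data set above. This is only an illustrative sketch: the function name `stem_and_leaf` is hypothetical, and printing empty stems (here the stem 5) is an assumed convention rather than something the notes prescribe.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with stem = tens and leaf = ones."""
    plot = defaultdict(list)
    for v in sorted(values):           # Step 1: ascending order
        stem, leaf = divmod(v, 10)     # Step 2: stem = tens, leaf = ones
        plot[stem].append(leaf)
    return dict(plot)

data = [3, 3, 11, 12, 16, 16, 17, 21, 22, 23, 25, 36, 37, 42, 61]
plot = stem_and_leaf(data)

# Step 3: plot, with a key. Empty stems are printed so that gaps
# in the data stay visible (an assumed convention).
print("Key: 6 | 1 = 61")
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```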
What?
A table in which the data is grouped into classes and the number of values (frequencies) which fall in each class is recorded.
Why?
• Summarizes the values in the data set.
• Shows patterns in the data.
Total: 36, i.e. $\sum f_i = n$
Construction of a frequency distribution
Step 1: Find the range
Step 2: Determine the number of classes
Step 3: Determine class width
Step 4: Choose the lower value for the first class
Step 5: Calculate lower values for remaining classes
Step 6: Determine the upper values
Step 7: Record the frequencies
E.g.
The class width/length is 6 − 1 = 5.
This frequency distribution has 6 classes.
• The next few slides show the steps for obtaining the number of classes a frequency distribution should have for a specific data set, and what the class width should be (e.g. how the 6 and 5 in the above example were obtained).
• Sometimes there is a natural way to group data into a frequency distribution. However, the first 3 steps of constructing a frequency distribution ensure that an appropriate frequency distribution is obtained for the data set.
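As a rough sketch of how the seven construction steps fit together, the function below builds a frequency distribution for integer-valued data. The rules used here are assumptions made for illustration, not the rules from the slides: the number of classes is simply passed in by the caller, the class width is the smallest whole number that lets the classes cover the range, and the first lower value is the minimum of the data. The function name `frequency_distribution` is hypothetical.

```python
def frequency_distribution(data, num_classes):
    """Build (lower, upper, frequency) rows for integer-valued data."""
    data_range = max(data) - min(data)              # Step 1: range
    # Step 2: num_classes is supplied by the caller.
    # Step 3 (assumed rule): smallest whole width for which
    # num_classes * width exceeds the range, so the maximum is covered.
    width = data_range // num_classes + 1
    first_lower = min(data)                         # Step 4 (assumed choice)
    lowers = [first_lower + i * width for i in range(num_classes)]  # Step 5
    uppers = [low + width - 1 for low in lowers]    # Step 6 (integer data)
    return [(low, up, sum(low <= x <= up for x in data))  # Step 7
            for low, up in zip(lowers, uppers)]

data = [3, 3, 11, 12, 16, 16, 17, 21, 22, 23, 25, 36, 37, 42, 61]
table = frequency_distribution(data, num_classes=6)
for low, up, freq in table:
    print(f"{low:>2} - {up:>2}: {freq}")
```

Here the class width is the difference between consecutive lower values (13 − 3 = 10), in the same way that the width in the example above is 6 − 1 = 5.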
Example:
The table below gives the weights of 50 parts made in a factory.
n = 50 (sample size)
Example:
With class limits:
• Data points can equal both the upper and lower values in the class.
• There is a ‘gap’ between upper limit of one class and lower limit of the next.
With class boundaries:
• Data points cannot equal the upper class value.
• There is no ‘gap’ between classes.
• A value of 77.5 will fall into this class.
Obtaining class boundaries from class limits…
Cumulative Frequencies: The number of values in the sample that are less
than or equal to the upper class boundary/limit of that class.
Relative Frequency Percentage: The percentage of the values in the sample that fall into a class.
Example 2
Boundaries     Frequency
37.5 - 41.5    4
41.5 - 45.5    10
45.5 - 49.5    8
49.5 - 53.5    15
53.5 - 57.5    9
57.5 - 61.5    3
61.5 - 65.5    1
We can now observe the general pattern in the data.
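The cumulative frequencies and relative frequency percentages defined earlier can be computed directly from the frequencies in Example 2. The sketch below uses only the values in that table; the variable names are illustrative.

```python
from itertools import accumulate

boundaries = [(37.5, 41.5), (41.5, 45.5), (45.5, 49.5), (49.5, 53.5),
              (53.5, 57.5), (57.5, 61.5), (61.5, 65.5)]
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)                          # sample size
cumulative = list(accumulate(frequencies))    # running total per class
relative_pct = [100 * f / n for f in frequencies]

for (low, up), f, cf, pct in zip(boundaries, frequencies, cumulative, relative_pct):
    print(f"{low} - {up}: f = {f:>2}, cum. f = {cf:>2}, rel. f = {pct:.0f}%")
```

The last cumulative frequency always equals the sample size, which is a quick check on the arithmetic.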
Frequency Polygon
• Class Midpoints – plotted on x-axis
• Frequencies – plotted on y-axis
Midpoints    Frequency
39.5         4
43.5         10
47.5         8
51.5         15
55.5         9
59.5         3
63.5         1
(Class width $l = 4$)
Don’t want the graph to “float”. Anchor the graph to the x-axis one class width below the minimum midpoint and one class width above the maximum midpoint:
Min midpoint: 39.5 – 4 = 35.5
Max midpoint: 63.5 + 4 = 67.5
[Figure: frequency polygon; x-axis: Class Midpoints (35.5 to 67.5), y-axis: Frequency (0 to 16)]
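The points of the frequency polygon, including the two anchor points, can be generated from the class boundaries. This is a small sketch using the Example 2 table; giving the anchor points a frequency of 0 is what ties the graph to the x-axis.

```python
boundaries = [37.5, 41.5, 45.5, 49.5, 53.5, 57.5, 61.5, 65.5]
frequencies = [4, 10, 8, 15, 9, 3, 1]

width = boundaries[1] - boundaries[0]          # class width l = 4
midpoints = [(low + up) / 2 for low, up in zip(boundaries, boundaries[1:])]

# Anchor points: one class width beyond the first and last midpoints,
# each with frequency 0 so the polygon does not "float".
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

for x, y in zip(xs, ys):
    print(x, y)
```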
Ogive
• Class Boundaries – plotted on x-axis
• Cumulative Frequencies – plotted on y-axis
Often done in percentage form.
Boundaries     Cumulative freq.
37.5 - 41.5    4
…
61.5 - 65.5    50
[Figure: ogive; x-axis: class boundaries, y-axis: cumulative frequency (0 to 60)]
Shapes of Distributions
One of the main purposes of drawing a histogram or frequency polygon is to determine the
clustering pattern of the values in the data set, or how the values in the data set are
distributed.
Symmetric Bell-shape:
This shape occurs when the
majority of the values fall in
the central portion of the scale
with fewer values further
away from the centre on
either end.
Examples would be heights
and weights.
Positively skewed shape:
This shape occurs when the
majority of the data (the peak)
occurs at the lower end of the
scale, with fewer values
occurring at the upper end.
An example of this shape would
be the time it takes to serve a
customer at a supermarket. For
most customers the service time
is quite short. The longer the
service time, the fewer the
customers.
Negatively skewed shape:
This shape occurs when the
majority of the data (the peak)
occurs at the upper end of
the scale with fewer at the
lower end.
For example, a statistics test
where the majority of the
students did well and achieved a
high score, with few students
doing badly and achieving a low
score.
Quantiles
“Rank”
$P_{50}$: 50% of the values smaller than, or equal to, it.
$P_{75}$: 75% of the values smaller than or equal to it.
$P_{80}$: 80% of the values smaller than, or equal to, it.
40 data points = scores for 40 students. 80 was the max mark. 10 scores in each quartile.
Remember to find the position of the quantile using the y-axis. Read off the x-axis for the actual value of the quantile.
25% of the students had a score of more than 52 marks.
$\bar{x} = \frac{16 + 14 + \dots + 14 + 13}{12} = 18$
Median
• As mentioned in the previous section, the median is a quantile that divides a data set into two.
• It is the middle value of the data set if $n$ is odd and the average of the two middle values in the data set if $n$ is even when all the values are arranged in ascending order.
Step 2: Find the position of the median, Me. $\text{Position} = \frac{n + 1}{2} = \frac{13}{2} = 6.5$
Step 3: Count the number of points until you get to the position.
13, 14, 14, 15, 16, 16, 17, 17, 17, 17, 18, 42
Position 6.5 lies halfway between the 6th value (16) and the 7th value (17), so Me = (16 + 17)/2 = 16.5.
Mode: Most common value. For the above example, the mode is 17.
Mean Vs Median
13, 14, 14, 15, 16, 16, 17, 17, 17, 17, 18, 42
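The three measures can be checked with Python's standard `statistics` module. The second half of the sketch removes the value 42 to show how much a single extreme value moves the mean compared with the median; that comparison is presumably the point of this slide, but it is an added illustration rather than something stated in the notes.

```python
import statistics

data = [13, 14, 14, 15, 16, 16, 17, 17, 17, 17, 18, 42]

mean = statistics.mean(data)       # 216 / 12
median = statistics.median(data)   # average of the 6th and 7th values
mode = statistics.mode(data)       # most common value
print(mean, median, mode)

# Drop the extreme value 42 and recompute (added illustration).
trimmed = data[:-1]
print(round(statistics.mean(trimmed), 2), statistics.median(trimmed))
```

Without the 42, the mean drops from 18 to about 15.82 while the median only moves from 16.5 to 16.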
$\bar{x} = \frac{\sum_{i=1}^{k} x_{mid(i)} f_i}{n}$, where $\sum_{i=1}^{k} f_i = n$ (the sample size)
Can be represented as this, where $x_i \approx x_{mid(i)}$:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} x_i f_i$ (on formula sheet)
Example
This means there are 3 values between 15 and 20. Since we don’t know what they are, we can estimate them using the class midpoint of 17.5.
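A short sketch of the grouped-data mean. The table for this example is not reproduced in these notes, so the midpoints and frequencies from the frequency polygon example are used instead; that substitution is an assumption made purely for illustration.

```python
# Midpoints and frequencies taken from the frequency polygon example.
midpoints = [39.5, 43.5, 47.5, 51.5, 55.5, 59.5, 63.5]
frequencies = [4, 10, 8, 15, 9, 3, 1]

n = sum(frequencies)                                   # sum of f_i = n
sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))
grouped_mean = sum_xf / n                              # (1/n) * sum of x_i f_i

print(n, sum_xf, grouped_mean)
```

Each class contributes its midpoint once for every value it contains, which is exactly the approximation $x_i \approx x_{mid(i)}$.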
How spread out the data is
Range: The distance between the max. and min. values.
Standard deviation: Shows the extent to which a typical value in the data set differs from the mean.
Standard deviation “s”
Shows the variability in relation to the mean.
First find the “Sample Variance”:
$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n - 1} \left( \sum x_i^2 - n\bar{x}^2 \right)$ (on formula sheet)
$s = \sqrt{s^2}$
(Know how to use STAT mode on the calculator to obtain this)
Example: 20 40 25 35
First need the sample mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{20 + 40 + 25 + 35}{4} = 30$
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(20 - 30)^2 + (40 - 30)^2 + (25 - 30)^2 + (35 - 30)^2}{3} = 83.3$
So: $s = \sqrt{s^2} = 9.1$
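The worked example can be reproduced in Python. The sketch computes the sample variance with both forms of the formula and compares the result with `statistics.variance`, which also divides by $n - 1$.

```python
import math
import statistics

data = [20, 40, 25, 35]
n = len(data)
mean = sum(data) / n

# Definition form: sum of squared deviations over n - 1
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
# Shortcut form: (sum of x^2 - n * mean^2) over n - 1
s2_shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

s = math.sqrt(s2_definition)
print(round(s2_definition, 1), round(s, 1))
print(statistics.variance(data))   # library check, also uses n - 1
```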
What happens if we are given grouped data?
$s^2 = \frac{1}{n - 1} \left( \sum f_i x_i^2 - n\bar{x}^2 \right)$
With $\sum x_{mid(i)} f_i = 160$ and $\sum f_i x_{mid(i)}^2 = 1787.5$:
$s^2 = \frac{1}{21} \left( 1787.5 - 22 (7.27)^2 \right) = 29.7\ldots$
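The grouped-data calculation can be checked from the two totals alone. The sketch below assumes $n = 22$, which is read off the worked line (22 in the formula and 21 in the denominator); the underlying table is not reproduced in these notes.

```python
import math

n = 22             # assumed from the worked line (n - 1 = 21)
sum_xf = 160       # sum of midpoint * frequency
sum_x2f = 1787.5   # sum of frequency * midpoint^2

mean = sum_xf / n
s2 = (sum_x2f - n * mean ** 2) / (n - 1)
s = math.sqrt(s2)

print(round(mean, 2), round(s2, 1), round(s, 2))
```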
Chebychev’s theorem:
States that the minimum proportion of values in the data set that lie within a distance $d = k \times \text{standard deviation}$ of the mean is given by $1 - \frac{1}{k^2}$ (for $k > 1$).
i.e: A proportion of at least $1 - \frac{1}{k^2}$ lie within/around $k$ standard deviations of the mean (before and after the mean) for any shaped distribution of the data.
The distribution of the values in the data set has no particular shape (e.g: not symmetric bell-shaped).
Use a diagram:
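Step 1: First calculate $d$, the distance between the mean and the values:
$d = 5.5 - 2.5 = 3$ and $d = 8.5 - 5.5 = 3$
Then calculate $k$: $k = \frac{d}{s} = \frac{3}{2} = 1.5$
Apply the theorem: $1 - \frac{1}{k^2} = 1 - \frac{1}{1.5^2} \approx 0.56$
Solution: At least 56% of the phone calls will last between 2.5 and 8.5 minutes.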
Example 2:
Using the same information from the previous example, what proportion of phone calls will be less than 2.5 minutes or longer than 8.5 minutes?
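Both phone-call examples can be sketched with one small function. The mean of 5.5 and standard deviation of 2 are taken from the worked calculation in the previous example ($k = 3/2$); the function name `chebyshev_min_proportion` is hypothetical.

```python
import math

def chebyshev_min_proportion(mean, s, lower, upper):
    """Minimum proportion of values in [lower, upper], symmetric about the mean."""
    d = mean - lower
    if not math.isclose(d, upper - mean):
        raise ValueError("interval must be symmetric about the mean")
    k = d / s
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

inside = chebyshev_min_proportion(mean=5.5, s=2, lower=2.5, upper=8.5)
outside = 1 - inside   # complement: at MOST this proportion lies outside

print(f"at least {inside:.0%} inside, at most {outside:.0%} outside")
```

Because the theorem gives a lower bound for the interval, the complement is an upper bound: at most about 44% of calls are shorter than 2.5 minutes or longer than 8.5 minutes.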
Recall: Chebychev’s theorem states nothing about the specific shape of the
distribution of the data in order for the theorem to be used.
If it is known that the data set of interest has a symmetric bell-shaped distribution,
results better than that of Chebychev’s theorem can be obtained.
For data that is symmetric bell-shaped, the Empirical rule is as follows:
i. Approximately 68% of data values are within 1 standard deviation of the mean.
ii. Approximately 95% of data values are within 2 standard deviations of the mean.
iii. Approximately 99.7% of data values are within 3 standard deviations of the
mean.
Example:
A financial services institution does a study of the credit card debt of 2500 people who have this kind of debt. The average amount was R2450 ($\bar{x}$) with a standard deviation of R234 ($s$). They also found that the data was bell-shaped. This should start you thinking “Empirical rule”.
a) Approx. what percentage of people have debt between R2216 and R2684?
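b) Approx. what percent of people have debt between R1748 and R2450?
Solution to a): $\bar{x} = 2450$ and $s = 234$, so R2216 and R2684 are each 1 standard deviation from the mean (the red part of the line in the diagram).
Empirical rule: approximately 68% of the data lies within 1 std. dev. of the mean.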
Step 2: Divide distance by $s$ to determine how many standard deviations fit into the interval.
b) Approx. what percent of people have debt between R1748 and R2450?
Distance = 2450 − 1748 = 702
$\therefore \frac{702}{s} = \frac{702}{234} = 3$, so Distance = $3s$
The Empirical rule can be used as an “in between” step.
Using the Empirical Rule: approximately 99.7% of the values fall within 3 standard deviations around the mean (the red line in the diagram; go 3 standard deviations in the other direction too).
Use symmetry: $\frac{99.7}{2} = 49.85$, so approximately 49.85% of people have debt between R1748 and R2450.
[Diagram: number line from 1748 to 2450]
[Diagram: number line from 2450 to 2918]
Distance = 2918 − 2450 = 468
$\therefore \frac{468}{s} = \frac{468}{234} = 2$, so Distance = $2s$
This means we can use the Empirical rule.
$\bar{x} = 2450$, $s = 234$
Using the Empirical rule: approximately 95% are between 1982 (2450 – 2 × 234) and 2918 (2 standard deviations around the mean; go $2s$ in the other direction too).
Remaining 5% “outside”. Due to symmetry, $\frac{5}{2} = 2.5\%$ on each side.
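The empirical-rule reasoning in this example follows one pattern: measure the distance from the mean in units of $s$, look up the percentage, then use symmetry. A minimal sketch, assuming every endpoint lies exactly 1, 2 or 3 standard deviations from the mean; the helper name `std_devs_from_mean` is hypothetical, and the last line reproduces the 2.5% figure, which appears to be the share beyond R2918.

```python
import math

# Approximate percentage of values within k standard deviations of the mean
EMPIRICAL_PCT = {1: 68.0, 2: 95.0, 3: 99.7}

def std_devs_from_mean(value, mean, s):
    """Return k if value is exactly k = 1, 2 or 3 standard deviations from the mean."""
    k = abs(value - mean) / s
    if not math.isclose(k, round(k)) or round(k) not in EMPIRICAL_PCT:
        raise ValueError("value is not 1, 2 or 3 standard deviations from the mean")
    return round(k)

mean, s = 2450, 234

# a) R2216 to R2684: both endpoints are 1s from the mean
pct_a = EMPIRICAL_PCT[std_devs_from_mean(2216, mean, s)]
# b) R1748 to R2450: half of the 3s interval, by symmetry
pct_b = EMPIRICAL_PCT[std_devs_from_mean(1748, mean, s)] / 2
# Beyond R2918: half of what lies outside the 2s interval
pct_tail = (100 - EMPIRICAL_PCT[std_devs_from_mean(2918, mean, s)]) / 2

print(pct_a, pct_b, pct_tail)
```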
The coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$ (on formula sheet)
What does it do?
Allows the comparison of the variability of 2 data sets, even when expressed in different units.
Example: Samples of 15 adults and 15 3-year olds
$CV = \frac{6}{171} \times 100 = 3.5 \approx 4\%$ and $CV = \frac{6}{50} \times 100 = 12\%$
$CV = \frac{100}{1300} \times 100 = 7.7\%$ and $CV = \frac{10}{230} \times 100 = 4.3\%$
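The four CV values in the example can be reproduced with a one-line function. Only the $(s, \bar{x})$ pairs shown in the calculations are used; the notes do not say which variable each pair measures, so none is assumed here.

```python
def coefficient_of_variation(s, mean):
    """CV = s / mean * 100, expressed as a percentage."""
    return s / mean * 100

pairs = [(6, 171), (6, 50), (100, 1300), (10, 230)]   # (s, mean) pairs from the example
cvs = [round(coefficient_of_variation(s, mean), 1) for s, mean in pairs]
print(cvs)
```

Because the CV is a ratio, the units cancel, which is why two data sets measured in different units can be compared.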
No linear relationship, 𝑟𝑟 ≈ 0
Calculating the correlation coefficient:
For a sample of $n$ pairs of coordinates $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient of correlation is calculated as follows:
$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[ n \sum x^2 - (\sum x)^2 \right] \times \left[ n \sum y^2 - (\sum y)^2 \right]}}$
OR $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ (on the formula sheet)
where
$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$, $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$, $S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$
Or this can be calculated using STAT mode on the calculator (see ‘2variable stat mode’
notes for either a Casio or Sharp calculator in the Calculator Notes folder on Moodle).
Example:
The number of copies sold of a new book (measured in thousands of units) is dependent on the advertising budget the publisher commits in a pre-publication campaign (measured in thousands of Rands).
$x$ 8 9.5 7.2 6.5 10 12 11.5 14.8 17.3 27 30 25
$y$ 12.5 18.6 25.3 24.8 35.7 45.4 44.4 45.8 65.3 75.7 72.3 79.2
Scatter plot of the bivariate data set on the previous slide:
[Figure: scatter plot; x-axis: advertising budget (0 to 35), y-axis: number of copies sold]
The number of copies of the book sold can be fairly well explained by a straight line.
x       y       xy        x²        y²
8       12.5    100       64        156.25
9.5     18.6    176.7     90.25     345.96
7.2     25.3    182.16    51.84     640.09
6.5     24.8    161.2     42.25     615.04
10      35.7    357       100       1274.49
12      45.4    544.8     144       2061.16
11.5    44.4    510.6     132.25    1971.36
14.8    45.8    677.84    219.04    2097.64
17.3    65.3    1129.69   299.29    4264.09
27      75.7    2043.9    729       5730.49
30      72.3    2169      900       5227.29
25      79.2    1980      625       6272.64
Sum:    178.8   545   10032.89   3396.92   30656.5
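$n = 12$, $\sum x = 178.8$, $\sum y = 545$, $\sum xy = 10032.89$, $\sum x^2 = 3396.92$, $\sum y^2 = 30656.5$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 3396.92 - \frac{178.8^2}{12} = 732.8$
$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 30656.5 - \frac{545^2}{12} = 5904.4167$
$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 10032.89 - \frac{(178.8)(545)}{12} = 1912.39$
$\therefore r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1912.39}{\sqrt{732.8 \times 5904.4167}} = 0.9194$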
Therefore, there is a strong, positive linear relationship between the advertising budget and
the number of copies sold.
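The whole calculation can be reproduced from the raw data pairs in the table. The sketch below computes $r$ with both forms of the formula; only the standard `math` module is used and the variable names are illustrative.

```python
import math

x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Formula-sheet form
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
r = s_xy / math.sqrt(s_xx * s_yy)

# Equivalent long form of the same formula
r_long = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

print(round(s_xy, 2), round(s_xx, 1), round(s_yy, 4), round(r, 4))
```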
Other useful graphical methods
Step 1: Find the min and max values
Step 2: Draw a suitable section of the number line
Step 3: Plot a dot for each value in the data set at their positions on the scale
Step 4: Each time a value in the data set is repeated, stack their dots above one another
Example
The data set below gives the weights (kg) for 14 adult dogs treated yesterday
at a vet.
9, 8, 7, 4, 9, 8, 6, 9, 7, 9, 10, 8, 10, 5
[Figure: dots stacked above a number line from 3 to 11, with min = 4 and max = 10 marked]
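A text version of this plot can be produced by counting how often each weight occurs and printing the stacks row by row. This is only a sketch: extending the number line by one unit on each side of the min and max is an assumption chosen to match the 3 to 11 scale used in the example.

```python
from collections import Counter

weights = [9, 8, 7, 4, 9, 8, 6, 9, 7, 9, 10, 8, 10, 5]
counts = Counter(weights)                # how many dots to stack per value

lo, hi = min(weights), max(weights)      # find the min and max values
axis = range(lo - 1, hi + 2)             # suitable section of the number line

# Print the tallest level first so the stacks sit on the axis.
for level in range(max(counts.values()), 0, -1):
    row = ["*" if counts[v] >= level else " " for v in axis]
    print(" ".join(f"{mark:>2}" for mark in row))
print(" ".join(f"{v:>2}" for v in axis))
```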
Box and Whisker Plot
Graphical representation of some important values of a data set (the five number summary)
• Mark each of the values of the five number summary on a suitable section of the number line.
• Draw a box extending from Q1 to Q3.
A Box-and-Whisker plot can also be used to assess the skewness of the distribution
of the values in the data set.
For positively skewed data, most of the values are at the lower end of the scale.
The median is closer to Q1
The right side of the ‘box’ is larger than the left side
For negatively skewed data, most of the values are at the upper end of the scale.
The median is closer to Q3
The left side of the ‘box’ is larger than the right side
For symmetric, bell-shaped data, most of the values are clustered around the centre.
Example
Types of complaints received by a company
In this way the chart visually depicts which situations/categories occur more frequently.