
Chapter 1 and Introduction Notes

The document provides an introduction to statistics, focusing on methods for obtaining, organizing, summarizing, presenting, analyzing, and interpreting data. It discusses descriptive and inferential statistics, sampling methods, and the construction of frequency distributions, including class limits and boundaries. Additionally, it covers graphical representations of data, such as histograms and frequency polygons, to visualize patterns and distributions.

Uploaded by

Khulekani Cele
Copyright: © All Rights Reserved

Statistics - Introduction

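A collection of methods for
• Obtaining,
• Organizing,
• Summarizing,
• Presenting,
• Analysing and
• Interpreting the data, and then drawing conclusions from it.

Descriptive Statistics (Chapter 1): organizing, summarizing and presenting the data.
Inferential Statistics (Chapter 5): analysing and interpreting the data, and then drawing conclusions from it.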
Population – all items with a characteristic of interest (size N).
e.g. All STAT370 students this semester.

Census – a study of all items in the population.
Problems:
• Expensive
• Not practical
• Time consuming

Sample – a subset of the population (size n).
e.g. 100 STAT370 students this semester.

Take a sample → study only the items in the sample → generalize the findings from the sample to the population.
A Variable
A characteristic that can assume different values.

DISCRETE Variable
Assumes a finite, or countable, number of possible values.
Example: The number of STAT370 students in a lecture.

CONTINUOUS Variable
Assumes an infinite number of possible values (in an interval between two values).
Example: The height of the STAT370 students.

Data Set: A set of measurements, counts, or descriptions of the characteristic of the members being studied.
More examples

• The number of car accidents at an intersection: discrete
• The weight of fish in a tank: continuous
• The time it takes to complete a marathon: continuous
• The flavours of milkshakes at McDonalds: discrete
Chapter 1
Summarizing Data

The first part of the definition of statistics, obtaining data, refers to sampling methods which will be discussed in chapter 5.

Chapter 1 deals with organizing, summarizing and presenting the data obtained.

Assume all data sets are from samples. Aim – get info about a population.
1.1 Stem and leaf plot
• Used for summarizing numerical data
• Shows patterns/shape of data sets

Arrange the data in ascending order → choose the stem and leaf units (make a key) → plot
Example: Consider the following data set
3, 3, 11, 12, 16, 16, 17, 21, 22, 23, 25, 36, 37, 42, 61

Step 1: Order data in ascending order (done)

Step 2: Determine what the stem and leaf will be.
Usually the leaf contains the last digit in the number and the stem contains all the other digits.
Notice that the largest value is made up of “tens” and “ones”:
61 = 6 tens and 1 one
So make the stem “tens” and the leaves “ones”.

Step 3: Draw the plot.
Min = 3 = 03, Max = 61
∴ the stem will contain all the ‘tens’ values from 0 to 6. No values must be left out!

Stem | Leaf
0    | 3 3
1    | 1 2 6 6 7
2    | 1 2 3 5
3    | 6 7
4    | 2
5    |
6    | 1

Key: 3|6 = 36 (any arbitrary number can be used in the key)

There are no values in the fifties, therefore that row is left blank (but the stem 5, representing 50, is still included).

Now we can see the general pattern of how the values are distributed.
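The three steps above can be sketched in Python. This is only an illustration, assuming non-negative whole numbers with the stem as the tens and the leaf as the ones; the function name `stem_and_leaf` is made up for this sketch.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows (stem = tens, leaf = ones) as strings."""
    leaves = defaultdict(list)
    for value in sorted(values):          # Step 1: ascending order
        leaves[value // 10].append(value % 10)  # Step 2: split stem and leaf
    rows = []
    # Step 3: include every stem from smallest to largest, even empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}".rstrip())
    return rows


data = [3, 3, 11, 12, 16, 16, 17, 21, 22, 23, 25, 36, 37, 42, 61]
rows = stem_and_leaf(data)
print("\n".join(rows))
```

The loop over `range(min, max + 1)` is what keeps the empty stem 5 in the plot.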
Section 1.2
Frequency Distributions and Related Graphs

What?
A table in which the data is grouped into classes and the number of values (frequencies) which fall in each class is recorded.

Why?
• Summarizes the values in the data set.
• Shows patterns in the data.

Total: 36 ($\sum f_i = n$)
Construction of a frequency distribution
Step 1: Find the range
Step 2: Determine the number of classes
Step 3: Determine class width
Step 4: Choose the lower value for the first class
Step 5: Calculate lower values for remaining classes
Step 6: Determine the upper values
Step 7: Record the frequencies
E.g.
This frequency distribution has 6 classes.
The class width/length is 6 − 1 = 5.

• The next few slides show the steps for obtaining how many classes a frequency distribution should have for a specific data set, and what the class width should be (e.g. how the 6 and 5 in the above example were obtained).
• Sometimes there is a natural way to group data into a frequency distribution; however, the first 3 steps of constructing a frequency distribution ensure that an appropriate frequency distribution is obtained for the data set.
Example:
The table below gives the weights of 50 parts made in a factory.
n = 50 (sample size); the smallest value is 7 and the largest value is 88.

Step 1: range (R) = max – min = 88 – 7 = 81

Step 2: Determine the number of classes (denoted by k)
Use Sturges' rule to find k (on formula sheet):
k = rounded up value of {1 + 1.44 ln(n)}
  = rounded up value of {1 + 1.44 ln(50)}
  = rounded up value of 6.63…
  = 7   (always round up!) → use 7 classes

E.g. if k = 5.21, then round k up to 6 (we cannot have 5 and a bit classes, only 6 whole classes!).

Step 3: Determine the class width/length, denoted by $l$ (not on formula sheet):
We must have $l = \dfrac{\text{Range}}{k}$, rounded up again!
$\therefore\ l = \dfrac{81}{7} = 11.57 \approx 12$, so the class width = 12.
Step 4: 7 is the smallest value so let the lower value of the 1st class be 6.

Step 5: Add the class width $l = 12$ to find each lower value of the remaining classes.

Step 6: The upper class values will be one unit less than the lower class limit of the following class.

Step 7: Determine the frequencies for each class.

Class Limits   Frequency
6 − 17         5
18 − 29        9
30 − 41        14
42 − 53        7
54 − 65        7
66 − 77        5
78 − 89        3

E.g. there are 5 values in the data set between 6 and 17 (inclusive).
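Steps 1 to 6 can be checked with a short Python sketch. It assumes whole-number data (so each upper limit is one unit below the next lower limit) and takes the first lower value as an input, since Step 4 is a judgment call; `class_limits` is a name invented for this illustration.

```python
import math


def class_limits(n, minimum, maximum, first_lower):
    """Return (k, width, limits) following Steps 1-6 for whole-number data."""
    data_range = maximum - minimum              # Step 1: range
    k = math.ceil(1 + 1.44 * math.log(n))       # Step 2: Sturges' rule, rounded up
    width = math.ceil(data_range / k)           # Step 3: class width, rounded up
    limits = []
    for i in range(k):
        lower = first_lower + i * width         # Steps 4-5: lower values
        upper = lower + width - 1               # Step 6: one unit below next lower
        limits.append((lower, upper))
    return k, width, limits


k, width, limits = class_limits(n=50, minimum=7, maximum=88, first_lower=6)
print(k, width, limits)
```

Step 7 (counting the frequencies) needs the raw data, which is not reproduced in these notes.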
Selecting Classes

Some classes form naturally.
Example: Class marks
0 – 9
10 – 19
20 – 29
…
90 – 100

Sturges' rule: convenient when no natural classes are present.

There are some basic rules:
• Classes should not be too large or too small (otherwise you might not see a pattern).
• Each data point must be in (only) one class.
• The classes must not lead to ambiguity. E.g. where will 18 go? And 29.3?
Class Limits vs Class Boundaries

Class limits:
• Data points can equal both the upper and lower values in the class.
• There is a ‘gap’ between the upper limit of one class and the lower limit of the next.

Class boundaries:
• Data points cannot equal the upper value in the class.
• There is no ‘gap’ between classes.

Example: a value of 77.5 will fall into the class whose lower boundary is 77.5.
Obtaining class boundaries from class limits…

To determine the class boundaries of a class, we add half of the ‘gap’ between the class limits to the upper class limits and subtract it from the lower class limits:
• ‘gap’ = 18 – 17 = 1
• halving this value: 1 ÷ 2 = ½ = 0.5
Therefore, we add 0.5 to all the upper class limits and subtract it from all the lower class limits to obtain class boundaries.

Class Limits   Class Boundaries
6 − 17         5.5 − 17.5
18 − 29        17.5 − 29.5
30 − 41        29.5 − 41.5
42 − 53        41.5 − 53.5
54 − 65        53.5 − 65.5
66 − 77        65.5 − 77.5
78 − 89        77.5 − 89.5

The classes are now continuous with no gap between them.
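As a quick check, the same rule can be written in a few lines of Python. The function name `class_boundaries` is made up here, and the sketch assumes the gap is the same between every pair of neighbouring classes.

```python
def class_boundaries(limits):
    """Convert (lower, upper) class limits into class boundaries."""
    gap = limits[1][0] - limits[0][1]   # e.g. 18 - 17 = 1
    half_gap = gap / 2                  # e.g. 0.5
    return [(lower - half_gap, upper + half_gap) for lower, upper in limits]


limits = [(6, 17), (18, 29), (30, 41), (42, 53), (54, 65), (66, 77), (78, 89)]
boundaries = class_boundaries(limits)
print(boundaries)
```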
From a frequency distribution, a number of things can be calculated:

• Class Midpoints: The middle value of a class interval.
• Cumulative Frequencies: The number of values in the sample that are less than or equal to the upper class boundary/limit of that class.
• Relative Frequency: The frequency of a class relative to the sample size.
• Relative Frequency Percentage: The percentage of the values that fall into a class.

Using the above, the frequency distribution can be represented graphically in a number of different ways…
Frequency Distribution Example

Example 2

Consider the following data of low temperatures (in degrees Fahrenheit to the nearest degree) for 50 days. The highest temperature is 64 and the lowest temperature is 39.
$n = 50$

Step 1: range = max – min = 64 – 39 = 25

Step 2: Determine the number of classes k
k = rounded up value of {1 + 1.44 ln(n)}
  = rounded up value of {1 + 1.44 ln(50)}
  = rounded up value of 6.63…
  = 7   (always round up!)

Step 3: Determine the class width, denoted by $l$:
$l = \dfrac{\text{Range}}{k} = \dfrac{25}{7} = 3.57 \approx 4$ (rounded up)

Step 4: 39 is the smallest value so let the lower value of the 1st class be 38.
Step 5: Add $l = 4$ to each lower value to obtain the lower values of the remaining classes (e.g. 38 + 4 = 42):
38, 42, 46, 50, 54, 58, 62

Step 6: Determine the upper class limits. Each upper limit is one unit less than the lower limit of the next class (e.g. 42 – 1 = 41). If there was a next class after the last one, it would start at 66, therefore the last class ends at 65.

Class Limits
38 – 41
42 – 45
46 – 49
50 – 53
54 – 57
58 – 61
62 – 65

Step 7: Determine the frequencies for each class
Class limits   Frequency   Boundaries     Midpoints   Relative freq.   Cumulative freq.
38 - 41        4           37.5 - 41.5    39.5        0.08             4
42 - 45        10          41.5 - 45.5    43.5        0.20             14
46 - 49        8           45.5 - 49.5    47.5        0.16             22
50 - 53        15          49.5 - 53.5    51.5        0.30             37
54 - 57        9           53.5 - 57.5    55.5        0.18             46
58 - 61        3           57.5 - 61.5    59.5        0.06             49
62 - 65        1           61.5 - 65.5    63.5        0.02             50

For example:
• Midpoint of the first class: $\dfrac{37.5 + 41.5}{2} = 39.5$
• Relative frequency of the first class: $\dfrac{4}{50} = 0.08$
• Cumulative frequency of the second class: 4 + 10 = 14
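The derived columns of this table can be reproduced from the class limits and frequencies alone. The sketch below is illustrative only; the variable names are not from the notes, and the 0.5 adjustment assumes a gap of 1 between classes, as in this example.

```python
from itertools import accumulate

limits = [(38, 41), (42, 45), (46, 49), (50, 53), (54, 57), (58, 61), (62, 65)]
freqs = [4, 10, 8, 15, 9, 3, 1]

n = sum(freqs)                                          # sample size
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]    # middle of each class
relative = [round(f / n, 2) for f in freqs]             # frequency / sample size
cumulative = list(accumulate(freqs))                    # running total

for row in zip(limits, freqs, boundaries, midpoints, relative, cumulative):
    print(row)
```

The last cumulative frequency should always equal the sample size, which is a handy check when building the table by hand.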
Histogram
• Class Boundaries – plotted on x-axis
• Frequencies – plotted on y-axis

Boundaries     Frequency
37.5 - 41.5    4
41.5 - 45.5    10
45.5 - 49.5    8
49.5 - 53.5    15
53.5 - 57.5    9
57.5 - 61.5    3
61.5 - 65.5    1

We can now observe the general pattern in the data.
Frequency Polygon
• Class Midpoints – plotted on x-axis
• Frequencies – plotted on y-axis

Midpoints   Frequency
39.5        4
43.5        10
47.5        8
51.5        15
55.5        9
59.5        3
63.5        1

We don’t want the graph to “float”, so anchor the graph to the x-axis by adding a point with frequency 0 one class width ($l = 4$) below the smallest midpoint and one class width above the largest midpoint:
• Min midpoint: 39.5 – 4 = 35.5
• Max midpoint: 63.5 + 4 = 67.5

[Figure: frequency polygon; x-axis: Class Midpoints (35.5 to 67.5), y-axis: Frequency (0 to 16)]
Ogive
• Class Boundaries – plotted on x-axis
• Cumulative Frequencies – plotted on y-axis (often done in percentage form)

Boundaries     Cumulative freq.
37.5 - 41.5    4
41.5 - 45.5    14
45.5 - 49.5    22
49.5 - 53.5    37
53.5 - 57.5    46
57.5 - 61.5    49
61.5 - 65.5    50

• 37.5 will be on the x-axis (cumulative frequency 0), as nothing is less than or equal to it.
• 41.5 will correspond to a cumulative freq. of 4.

[Figure: ogive; x-axis: class boundary, y-axis: cumulative frequency]
Shapes of Distributions
One of the main purposes of drawing a histogram or frequency polygon is to determine the
clustering pattern of the values in the data set, or how the values in the data set are
distributed.

Symmetric Bell-shape:
This shape occurs when the
majority of the values fall in
the central portion of the scale
with fewer values further
away from the centre on
either end.
Examples would be heights
and weights.
Positively skewed shape:
This shape occurs when the
majority of the data (the peak)
occurs at the lower end of the
scale, with fewer values
occurring at the upper end.
An example of this shape would
be the time it takes to serve a
customer at a supermarket. For
most customers the service time
is quite short. The longer the
service time, the fewer the
customers.
Negatively skewed shape:
This shape occurs when the
majority of the data (the peak)
occurs at the upper end of
the scale, with fewer values
occurring at the lower end.
For example a statistics test
where the majority of the
students did well and achieved a
high score, with few students
doing badly and achieving a low
score.
Quantiles

Suppose you get 65% in a test. Is this a good mark?
• If it was an easy test this could be a bad mark.
• If it was a hard test this could be a good mark.
Knowing the mark alone isn’t really useful.

BUT knowing how many marks were smaller might be useful (“rank”): what percent of the marks were smaller than yours?

The $i$th percentile, $P_i$: $i\%$ of the other values are smaller than, or equal to, that value.
Quantiles…
Percentiles, $P_1, P_2, \ldots, P_{99}$, divide the data set into 100 equal parts.
Deciles, $D_1, D_2, \ldots, D_9$, divide the data up into 10 equal parts.
Quartiles, $Q_1, Q_2, Q_3$, divide the data up into 4 equal parts.
Median, $M_e$, divides the data up into two equal parts.

Note: $M_e = Q_2 = P_{50}$, $Q_1 = P_{25}$, $Q_3 = P_{75}$, $D_1 = P_{10}$
More examples

• $P_{50}$ (the median): 50% of the values are smaller than, or equal to, it.
• $P_{75}$ (the 3rd quartile): 75% of the values are smaller than, or equal to, it.
• $P_{80}$ (the 8th decile): 80% of the values are smaller than, or equal to, it.
Percentiles from Ogives

The ogive can be used to find specific quantiles. The position of the quantile is found on the y-axis and the corresponding x-value will give you the quantile.

Example: the ogive shows the results for a test out of 80 marks (80 was the max mark).

• What is the sample size? 40 data points = scores for 40 students, so there are 10 scores in each quartile.
• Remember to find the position of the quantile using the y-axis. Read off the x-axis for the actual value of the quantile.
• Half the class (20 students) got a score of at most 44 marks.
• 25% of the students had a score of more than 52 marks.
• Deciles split the data into 10 parts. Here, each part will be 4 students (40 ÷ 10). The first decile is approx. 29 marks.
• A pass is 50%, i.e. 40 marks out of 80. Only 14 students failed.
Sec 1.3 – Summarizing Statistics

A STATISTIC is a measure of description from a sample.
It is calculated based on sample values $x_1, x_2, \ldots, x_n$ and describes some quantifiable characteristic of a population.

Examples:
1) The average weight of 30 elephants selected from the national parks in RSA.
2) The proportion of 60 UKZN students that have bank accounts.
Measures of Location
A measure of location or central tendency is a value that shows the location on the scale where a data set is centrally located (where most values are clustered): a value around which the data is centred. “Centre” can mean many things:

• Mean: average value.
• Median: “middle data point”. 50% of the values in the data set are smaller than or equal to it.
• Mode: most common value.
Mean
Average value. Notation: mean = $\bar{x}$ (pronounced “x-bar”)

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i = \dfrac{\sum x_i}{n}$ (on formula sheet)

Example: 16, 14, 15, 17, 17, 17, 16, 18, 42, 17, 14, 13 ($n = 12$)

$\bar{x} = \dfrac{16 + 14 + \cdots + 14 + 13}{12} = \dfrac{216}{12} = 18$
Median
• As mentioned in the previous section, the median is a quantile that divides a data set into two.
• When all the values are arranged in ascending order, it is the middle value of the data set if $n$ is odd and the average of the two middle values in the data set if $n$ is even.

Example: 16, 14, 15, 17, 17, 17, 16, 18, 42, 17, 14, 13 ($n = 12$)

Step 1: Put the data in ascending order:
13, 14, 14, 15, 16, 16, 17, 17, 17, 17, 18, 42

Step 2: Find the position of the median, $M_e$: Position = $\dfrac{n + 1}{2} = \dfrac{13}{2} = 6.5$

Step 3: Count the number of points until you get to the position. The 6.5th position lies between the 6th value (16) and the 7th value (17):
Median: $\dfrac{16 + 17}{2} = 16.5$

When $n$ is even the median will be between two of the data points.

What happens if the number 13 is removed? $n$ becomes odd…
Position: $\dfrac{n + 1}{2} = \dfrac{11 + 1}{2} = 6$
14, 14, 15, 16, 16, 17, 17, 17, 17, 18, 42
The value in the 6th position is 17, so $M_e = 17$.

When $n$ is odd the median will be exactly one of the data points.
Mode Most common value. For the above example, the mode is 17
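The three measures for this example can be confirmed with Python's standard `statistics` module. This is just a cross-check of the hand calculations above, not a replacement for the calculator's STAT mode.

```python
import statistics

data = [16, 14, 15, 17, 17, 17, 16, 18, 42, 17, 14, 13]

mean = statistics.mean(data)       # sum of the values divided by n
median = statistics.median(data)   # average of the two middle values (n is even)
mode = statistics.mode(data)       # most common value

# Removing 13 makes n odd, so the median is one of the data points.
median_without_13 = statistics.median([x for x in data if x != 13])

print(mean, median, mode, median_without_13)
```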
Mean vs Median

The mean is good when…
• Data is symmetric
• No outliers (no extreme values that are “too big” or “too small” when compared to the rest of the values)

The median is good when…
• Data is not symmetric
• Has outliers
Comparing the mean and median:

13, 14, 14, 15, 16, 16, 17, 17, 17, 17, 18, 42

Mean = 18 Median = 16.5

The values are all very similar.


• But notice the outlier: 42.
• There is only one value greater
than the mean of 18. It doesn’t
seem very central.
• The outlier is pulling the mean
up.
• When there are outliers, the
median is more suitable.
Mean for grouped data
Recall: In a frequency distribution we do not have the exact values of the data. Therefore, we use the class midpoints ($x_{mid(i)}$) to estimate the values.

$\bar{x} = \dfrac{\sum_{i=1}^{k} x_{mid(i)} f_i}{n}$

where
• $k$ = number of classes
• $x_{mid(i)}$ = midpoint of class $i$
• $f_i$ = frequency of class $i$
• $\sum_{i=1}^{k} f_i = n$ (the sample size)

This can be represented as $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{k} x_i f_i$ where $x_i \approx x_{mid(i)}$ (on formula sheet).
Example
A frequency of 3 in the class from 15 to 20 means there are 3 values between 15 and 20. Since we don’t know what they are, we can estimate them using the class midpoint of 17.5.

$n = \sum f_i = 53$

$\bar{x} = \dfrac{\sum_{i=1}^{k} x_{mid(i)} f_i}{n} = \dfrac{(17.5)(3) + \cdots + (41.5)(10)}{53} = 31.877$
Measures of variability
How spread out the data is.

Range
The distance between the max. and min. values.

Standard deviation
Shows the extent to which a typical value in the data set differs from the mean.

Standard deviation “s”
Shows the variability in relation to the mean. First find the “Sample Variance” (on formula sheet):

$s^2 = \dfrac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \dfrac{1}{n - 1}\left(\sum x_i^2 - n\bar{x}^2\right)$

$s = \sqrt{s^2}$

(Know how to use STAT mode on the calculator to obtain this.)
Example: 20, 40, 25, 35

First need the sample mean: $\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{20 + 40 + 25 + 35}{4} = 30$

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{(20 - 30)^2 + (40 - 30)^2 + (25 - 30)^2 + (35 - 30)^2}{3} = 83.3$

So: $s = \sqrt{s^2} = 9.1$
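Both forms of the sample variance formula give the same answer, which the following Python sketch checks for this example. The variable names are chosen only for this illustration.

```python
import math

data = [20, 40, 25, 35]
n = len(data)
mean = sum(data) / n

# Definition form: squared deviations from the mean, divided by n - 1.
variance_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Shortcut form: (sum of squares - n * mean^2) / (n - 1).
variance_shortcut = (sum(x ** 2 for x in data) - n * mean ** 2) / (n - 1)

s = math.sqrt(variance_definition)
print(round(variance_definition, 1), round(s, 1))
```

Note the divisor is n − 1 (the sample variance), not n.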
What happens if we are given grouped data?
Approximate the values with the midpoints $x_{mid(i)}$.

Sample variance for a frequency distribution with $k$ classes (on formula sheet), using the notation $x_{mid(i)} \approx x_i$:

$s^2 = \dfrac{1}{n - 1}\sum_{i=1}^{k} f_i \left(x_{mid(i)} - \bar{x}\right)^2 = \dfrac{1}{n - 1}\left(\sum_{i=1}^{k} f_i x_i^2 - n\bar{x}^2\right)$

(Know how to use STAT mode on the calculator to obtain this.)
Example

Boundaries   $f_i$   $x_{mid(i)}$   $x_{mid(i)} f_i$   $x_{mid(i)}^2 f_i$
0 – 5        10      2.5            25                 62.5
5 – 10       6       7.5            45                 337.5
10 – 15      3       12.5           37.5               468.75
15 – 20      3       17.5           52.5               918.75
Total        22                     160                1787.5

$n = \sum f_i = 22$, $\sum x_{mid(i)} f_i = 160$, $\sum x_{mid(i)}^2 f_i = 1787.5$

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{k} x_{mid(i)} f_i = \dfrac{160}{22} = 7.27$

$s^2 = \dfrac{1}{n - 1}\left(\sum f_i x_i^2 - n\bar{x}^2\right) = \dfrac{1}{21}\left(1787.5 - 22(7.27)^2\right) = 29.7\ldots$

So: $s = \sqrt{s^2} = 5.45$ (confirm this using STAT mode on the calculator).
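The grouped-data calculation can be verified with a few lines of Python. This is a sketch of the midpoint approximation used above; the variable names are illustrative.

```python
import math

midpoints = [2.5, 7.5, 12.5, 17.5]   # class midpoints
freqs = [10, 6, 3, 3]                # class frequencies

n = sum(freqs)
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))        # sum of x_mid * f
sum_x2f = sum(x ** 2 * f for x, f in zip(midpoints, freqs))  # sum of x_mid^2 * f

mean = sum_xf / n
variance = (sum_x2f - n * mean ** 2) / (n - 1)   # shortcut formula
s = math.sqrt(variance)

print(n, sum_xf, sum_x2f, round(mean, 2), round(variance, 1), round(s, 2))
```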
Uses of a standard deviation:

1) A measure of spread of a data set.
2) It can be used to determine the minimum proportion of values in a data set that lies within a certain distance of the mean, using Chebychev’s theorem.

Chebychev’s theorem:
States that the minimum proportion of values in the data set that lie within a distance $d = k \times (\text{standard deviation})$ of the mean is given by

$1 - \dfrac{1}{k^2}$

i.e. a proportion of at least $1 - \dfrac{1}{k^2}$ of the values lie within $k$ standard deviations of the mean (before and after the mean) for any shaped distribution of the data. The distribution of the values in the data set need not have any particular shape (e.g. it need not be symmetric bell-shaped). The result is informative for $k > 1$.

A distance of $d = k \times s$ is measured to the left and to the right of the mean ($s$ = standard deviation). According to Chebychev’s theorem, at least a proportion of $1 - \dfrac{1}{k^2}$ of the values in the data set will lie in the interval between the values $\bar{x} - k \times s$ and $\bar{x} + k \times s$.
Example 1:
Consider randomly selected phone calls that have a mean time of 5.5 minutes with a standard deviation of 2 minutes ($\bar{x} = 5.5$, $s = 2$). What proportion of phone calls will last between 2.5 minutes and 8.5 minutes?
Nothing has been mentioned about the shape of the distribution of the length of the phone calls. Therefore, Chebychev’s theorem must be applied.

Use a diagram: mark 2.5, $\bar{x} = 5.5$ and 8.5 on a number line.

Step 1: First calculate $d$, the distance between the mean and the values:
$d = 5.5 - 2.5 = 3$ and $d = 8.5 - 5.5 = 3$
Therefore, there is a distance of 3 on either side of the mean.

Step 2: Determine how many standard deviations ($s$) go into this distance; this will help us calculate $k$ ($s = 2$, $d = 3$):
$k = \dfrac{d}{s} = \dfrac{3}{2} = 1.5$
Therefore, the interval from 2.5 to 8.5 extends 1.5 standard deviations on either side of the mean.

Step 3: Use Chebychev’s theorem to calculate the minimum proportion of values in the interval (on formula sheet):
$1 - \dfrac{1}{k^2} = 1 - \dfrac{1}{1.5^2} = 0.56$

Solution: At least 56% of the phone calls will last between 2.5 and 8.5 minutes.
Example 2:
Using the same information from the previous example, what proportion of phone calls will be less than 2.5 minutes or longer than 8.5 minutes?

The diagram shows how 100% of the lengths of phone calls are distributed. Since at least 56% are between 2.5 and 8.5 minutes, at most 44% will lie outside of the interval, i.e. less than 2.5 or greater than 8.5 minutes.
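Both phone call examples follow the same three steps, which can be sketched in Python. The function name `chebyshev_min_proportion` is made up for this illustration, and it assumes the interval is symmetric around the mean, as in the examples.

```python
def chebyshev_min_proportion(mean, s, lower, upper):
    """Minimum proportion of values in [lower, upper], symmetric about the mean."""
    d = mean - lower                       # Step 1: distance from the mean
    if abs((upper - mean) - d) > 1e-9:
        raise ValueError("interval must be symmetric around the mean")
    k = d / s                              # Step 2: number of standard deviations
    return 1 - 1 / k ** 2                  # Step 3: Chebychev's theorem


inside_at_least = chebyshev_min_proportion(mean=5.5, s=2, lower=2.5, upper=8.5)
outside_at_most = 1 - inside_at_least

print(round(inside_at_least, 2), round(outside_at_most, 2))
```

The first value is a lower bound ("at least") and the second an upper bound ("at most"); neither is an exact proportion.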


Uses of a standard deviation…

3) Empirical Rule Learn this rule!!

Recall: Chebychev’s theorem states nothing about the specific shape of the
distribution of the data in order for the theorem to be used.

If it is known that the data set of interest has a symmetric bell-shaped distribution,
results better than that of Chebychev’s theorem can be obtained.
For data that is symmetric bell-shaped, the Empirical rule is as follows:
i. Approximately 68% of data values are within 1 standard deviation of the mean.
ii. Approximately 95% of data values are within 2 standard deviations of the mean.
iii. Approximately 99.7% of data values are within 3 standard deviations of the
mean.
Example:
A financial services institution does a study of the credit card debt of 2500 people who have this kind of debt. The average amount was R2450 ($\bar{x}$) with a standard deviation of R234 ($s$). They also found that the data was bell-shaped. This should start you thinking “Empirical rule”.

a) Approx. what percentage of people have debt between R2216 and R2684?
b) Approx. what percentage of people have debt between R1748 and R2450?
c) Approx. what percentage of people have debt greater than R2918?

a) Approx. what percentage of people have debt between R2216 and R2684?

$\bar{x} = 2450$, $s = 234$

Step 1: Calculate the distance around the mean: 2450 − 2216 = 234 and 2684 − 2450 = 234.

Step 2: Divide the distance by $s$ to determine how many standard deviations fit into the interval: 234 ÷ 234 = 1, so the interval extends 1 standard deviation on either side of the mean.

Empirical rule: approximately 68% of the data lies within 1 std. dev. of the mean.

Solution: Approximately 68% of the people have debt between R2216 and R2684.
b) Approx. what percentage of people have debt between R1748 and R2450?

$\bar{x} = 2450$, $s = 234$

Distance = 2450 − 1748 = 702, and $\dfrac{702}{s} = \dfrac{702}{234} = 3$, so R1748 is 3 standard deviations below the mean.

The Empirical rule can be used as an “in between” step. Going 3 standard deviations in the other direction too gives 2450 + 702 = 3152. Using the Empirical rule, approximately 99.7% of the values fall between 1748 and 3152 (3 standard deviations around the mean).

But we only want half! The data is symmetric, so use symmetry:
$\dfrac{99.7}{2} = 49.85$

Solution: Approximately 49.85% of the people have debt between R1748 and R2450.
c) Approx. what percentage of people have debt greater than R2918?

$\bar{x} = 2450$, $s = 234$

Distance = 2918 − 2450 = 468, and $\dfrac{468}{s} = \dfrac{468}{234} = 2$, so R2918 is 2 standard deviations above the mean. This means we can use the Empirical rule.

Going 2 standard deviations in the other direction too gives 2450 − 2 × 234 = 1982. Using the Empirical rule, approximately 95% are between 1982 and 2918 (2 standard deviations around the mean).

The remaining 5% lie “outside” this interval. Due to symmetry, $\dfrac{5}{2} = 2.5\%$ lie on each side.

Solution: Approximately 2.5% have debt greater than R2918.
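The three parts of this example can be reproduced with a small lookup of the Empirical rule percentages. This is only a sketch for bell-shaped data; the dictionary and function names are invented for the illustration, and it handles only whole numbers of standard deviations (1, 2 or 3).

```python
# Approximate % of values within k standard deviations of the mean (bell-shaped data).
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}

mean, s = 2450, 234


def sd_units(value):
    """Number of standard deviations between a value and the mean."""
    return abs(value - mean) / s


# a) R2216 to R2684: the whole interval, 1 standard deviation each side.
a = EMPIRICAL_RULE[round(sd_units(2216))]

# b) R1748 to R2450: half of the 3 standard deviation interval (symmetry).
b = EMPIRICAL_RULE[round(sd_units(1748))] / 2

# c) Greater than R2918: one of the two tails outside 2 standard deviations.
c = (100 - EMPIRICAL_RULE[round(sd_units(2918))]) / 2

print(a, b, c)
```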
Relative Dispersion: Coefficient of Variation

What is the coefficient of variation?
A calculation that expresses the standard deviation of the data set as a percentage of the mean.

How is it calculated?
$CV = \dfrac{s}{\bar{x}} \times 100\%$ (on formula sheet)

What does it do?
Allows the comparison of the variability of 2 data sets, even when expressed in different units.
Example: Samples of 15 adults and 15 3-year-olds

Adults, height (cm): $\bar{x} = 171$, $s = 6$
3-year-olds, height (cm): $\bar{x} = 50$, $s = 6$

The standard deviations are the same. Does that mean the data is equally variable?

Adults: $CV = \dfrac{6}{171} \times 100 = 3.5 \approx 4\%$
3-year-olds: $CV = \dfrac{6}{50} \times 100 = 12\%$

The height data for children is approx. 3 times more variable than the adult height data.
Example: A sample of 15 BMWs

Weight (kg): $\bar{x} = 1300$, $s = 100$
Speed (km/hr): $\bar{x} = 230$, $s = 10$

The data for the weight and speed of the cars are expressed in different units, therefore the standard deviations cannot be directly compared.

Weight: $CV = \dfrac{100}{1300} \times 100 = 7.7\%$
Speed: $CV = \dfrac{10}{230} \times 100 = 4.3\%$

The weight data is roughly twice as variable as the speed data.
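Both coefficient of variation examples can be checked with one small Python function. The function name is made up for this sketch.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return s / mean * 100


cv_adults = coefficient_of_variation(6, 171)     # heights of adults
cv_children = coefficient_of_variation(6, 50)    # heights of 3-year-olds
cv_weight = coefficient_of_variation(100, 1300)  # car weights (kg)
cv_speed = coefficient_of_variation(10, 230)     # car speeds (km/hr)

print(round(cv_adults, 1), round(cv_children, 1))
print(round(cv_weight, 1), round(cv_speed, 1))
```

Because the units cancel in the ratio, the two car results can be compared even though one is in kg and the other in km/hr.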


Correlation
• Often two variables are measured simultaneously and relationships between these variables are then explored.
• Data sets involving two variables are known as bivariate data sets.
• We have an independent variable ($x$) and dependent variable ($y$).
• $x$ is also known as an explanatory variable, and $y$ a response variable.
• The best way to begin exploring relationships between two variables is to use a scatter plot.
• The strength of the linear relationship between the two variables can be determined using a correlation coefficient, denoted by $r$.
Properties of the correlation coefficient $r$:
• It is based on the closeness of the plotted points on a scatter diagram to a line fitted through them.
• $-1 \le r \le 1$
• If the plotted points are closely clustered around the fitted line, $r$ will lie close to either 1 or –1.
• The further the plotted points are away from the fitted line, the closer the value of $r$ will be to 0.
• The sign of $r$ (either positive or negative) determines the type of linear relationship between the two variables:
  • When $r > 0$, there is a positive linear relationship.
  • When $r < 0$, there is a negative linear relationship.

Examples:
• Strong, positive linear relationship: $r \approx 1$
• Strong, negative linear relationship: $r \approx -1$
• No linear relationship: $r \approx 0$
Calculating the correlation coefficient:

For a sample of $n$ pairs of coordinates $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient of correlation is calculated as follows:

$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right] \times \left[n\sum y^2 - \left(\sum y\right)^2\right]}}$

OR $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ (on the formula sheet)

where

$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$, $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$, $S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n}$

Or this can be calculated using STAT mode on the calculator (see ‘2variable stat mode’ notes for either a Casio or Sharp calculator in the Calculator Notes folder on Moodle).
Example:
The number of copies sold of a new book (measured in thousands of units) is dependent on the advertising budget the publisher commits in a pre-publication campaign (measured in thousands of Rands).

$y$ = number of copies sold (dependent variable)
$x$ = the advertising budget (independent variable)

The results of 12 recently published books are shown below:

$x$   8      9.5    7.2    6.5    10     12     11.5   14.8   17.3   27     30     25
$y$   12.5   18.6   25.3   24.8   35.7   45.4   44.4   45.8   65.3   75.7   72.3   79.2
Scatter plot of the bivariate data set on the previous slide:

[Figure: scatter plot “Advertising budget and copies sold”; x-axis: advertising budget (0 to 35), y-axis: copies sold (0 to 90)]

From the scatter plot, it can be seen that the relationship between the advertising budget and number of copies of the book sold can be fairly well explained by a straight line.
$x$      $y$      $xy$        $x^2$      $y^2$
8        12.5     100         64         156.25
9.5      18.6     176.7       90.25      345.96
7.2      25.3     182.16      51.84      640.09
6.5      24.8     161.2       42.25      615.04
10       35.7     357         100        1274.49
12       45.4     544.8       144        2061.16
11.5     44.4     510.6       132.25     1971.36
14.8     45.8     677.84      219.04     2097.64
17.3     65.3     1129.69     299.29     4264.09
27       75.7     2043.9      729        5730.49
30       72.3     2169        900        5227.29
25       79.2     1980        625        6272.64
Sum:
178.8    545      10032.89    3396.92    30656.5

$n = 12$, $\sum x = 178.8$, $\sum y = 545$, $\sum xy = 10032.89$, $\sum x^2 = 3396.92$, $\sum y^2 = 30656.5$
$n = 12$, $\sum x = 178.8$, $\sum y = 545$, $\sum xy = 10032.89$, $\sum x^2 = 3396.92$, $\sum y^2 = 30656.5$

$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 10032.89 - \dfrac{(178.8)(545)}{12} = 1912.39$

$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 3396.92 - \dfrac{(178.8)^2}{12} = 732.8$

$S_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 30656.5 - \dfrac{(545)^2}{12} = 5904.4167$

$\therefore r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{1912.39}{\sqrt{(732.8)(5904.4167)}} = 0.9194$

Therefore, there is a strong, positive linear relationship between the advertising budget and the number of copies sold.
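The hand calculation can be cross-checked in Python using the same $S_{xy}$, $S_{xx}$ and $S_{yy}$ quantities. This is a minimal sketch with illustrative variable names; it is not a substitute for knowing the calculator's STAT mode.

```python
import math

x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]              # advertising budget
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]  # copies sold

n = len(x)
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(a ** 2 for a in x) - sum(x) ** 2 / n
s_yy = sum(b ** 2 for b in y) - sum(y) ** 2 / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xy, 2), round(s_xx, 1), round(s_yy, 4), round(r, 4))
```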
Other useful graphical methods

Dot Plot
• Graphs numerical data
• Shows patterns in data sets
• Good for small data sets

Step 1: Find the min and max values.
Step 2: Draw a suitable section of the number line.
Step 3: Plot a dot for each value in the data set at their positions on the scale.
Step 4: Each time a value in the data set is repeated, stack their dots above one another.
Example
The data set below gives the weights (kg) for 14 adult dogs treated yesterday at a vet.
9, 8, 7, 4, 9, 8, 6, 9, 7, 9, 10, 8, 10, 5
(min = 4, max = 10)

[Figure: dot plot on a number line from 3 to 11]

We can now get an idea of the shape of the distribution of the values… This data is negatively skewed.
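A text version of this dot plot can be produced with a short Python sketch. The function name `dot_plot` and the use of `*` for each dot are choices made for this illustration only.

```python
from collections import Counter


def dot_plot(values, low, high):
    """Return the lines of a text dot plot on the number line low..high."""
    counts = Counter(values)
    tallest = max(counts.values())
    lines = []
    # Build the plot from the top row down; stack one '*' per repeated value.
    for level in range(tallest, 0, -1):
        cells = ["*" if counts[v] >= level else " " for v in range(low, high + 1)]
        lines.append(" ".join(cell.rjust(2) for cell in cells).rstrip())
    lines.append(" ".join(str(v).rjust(2) for v in range(low, high + 1)))
    return lines


weights = [9, 8, 7, 4, 9, 8, 6, 9, 7, 9, 10, 8, 10, 5]
lines = dot_plot(weights, low=3, high=11)
print("\n".join(lines))
```

The stacks get taller towards the upper end of the scale, with a thin tail on the left, which is the negatively skewed pattern described above.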
Box and Whisker Plot
Graphical representation of some important values of a data set (the five number summary).

• The median – a measure of central location.
• Q1 and Q3 are shown so the interquartile range and the quartile deviation – both measures of variation – can be found.
• The minimum and maximum values are also shown so the range – a measure of variation – can be calculated.
• Outliers can be included. Outliers are values that are unusually small or unusually large relative to the other values in the data set.

Drawing a box-and-whisker plot:
Example five number summary: Min = 162, $Q_1 = 170.5$, $Q_2 = 179.5$, $Q_3 = 184$, Max = 190

• Mark each of the values of the five number summary on a suitable section of the number line.
• Draw a box extending from Q1 to Q3.
• Draw the “whiskers” by drawing a horizontal line connecting Q1 to the minimum value and Q3 to the maximum value.
• Positions of the outliers (if any) are marked with a star or some other bold mark (not for the purpose of this course).

[Figure: box-and-whisker plot on a number line from 160 to 190]
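The measures of variation mentioned above can be read straight off the five number summary. The sketch below uses the example values; it assumes the quartile deviation is defined as half the interquartile range, which is the usual definition but is not stated explicitly in these notes.

```python
# Five number summary from the example.
minimum, q1, median, q3, maximum = 162, 170.5, 179.5, 184, 190

data_range = maximum - minimum          # range
iqr = q3 - q1                           # interquartile range
quartile_deviation = iqr / 2            # assumed: half the interquartile range

# Widths of the two sides of the box, used to judge skewness.
left_side = median - q1
right_side = q3 - median

print(data_range, iqr, quartile_deviation, left_side, right_side)
```

Here the left side of the box is wider than the right side, so the median sits closer to Q3.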


Shapes of distributions for box-and-whisker plots:

A Box-and-Whisker plot can also be used to assess the skewness of the distribution of the values in the data set.

Skewed to the right or positively skewed:
For positively skewed data, most of the values are at the lower end of the scale. The median is closer to Q1, so the right side of the ‘box’ is larger than the left side.

Skewed to the left or negatively skewed:
For negatively skewed data, most of the values are at the upper end of the scale. The median is closer to Q3, so the left side of the ‘box’ is larger than the right side.

Symmetric:
For symmetric, bell-shaped data, most of the values are clustered around the centre. The median is approximately halfway between Q1 and Q3.

Pareto Chart
Combines a bar graph and a line graph.
• The bars are drawn in descending size order.
• The line shows the cumulative totals (e.g. 40 + 35).

In this way the chart visually depicts which situations/categories occur more frequently.

Example: Types of complaints received by a company.
