Data Analysis
Advanced Level
MATHEMATICS
Prepared by
Paul Grinder, Okanagan University College
with
Pat Corbett-Labatt, North Island College
Bob Darling, Malaspina University-College
Peter Robbins, Kwantlen University College
Ada Sarsiat, Northwest Community College
for the
Province of British Columbia
Ministry of Advanced Education, Training and Technology
and the
Centre for Curriculum, Transfer and Technology
© 2000-2020 Province of British Columbia, Ministry of
Advanced Education, Skills & Training
Download for free from the B.C. Open Textbook Collection: https://open.bccampus.ca
Learning outcomes
The word statistics is derived from the Latin word status which means “state”. Governments
were the first to use statistics. They used statistics to collect and interpret data about their
countries. Today, statistics are used in almost every major field of study.
3. Study the terminology in the Glossary to become familiar with the definitions.
Glossary
Bar graph
A graph that uses side by side bars of different lengths to represent ranked data.
Confidence interval
The interval in which a statistic will likely fall, a certain percent of the time, after repeated
experimentation.
Data
The information collected for statistical analysis.
Deviation
The difference between one data value and the mean.
Frequency
The number of times that a particular value occurs in a set of data.
Frequency graph
Sometimes called a broken line graph. A graph with a horizontal axis representing data
values and a vertical axis representing frequency values.
Frequency histogram
Also known as a bar graph.
Measures of position
Statistics that describe how one data value compares to another. Percentiles, quartiles and z
scores are measures of position.
Measures of variation
Statistics that describe how the data is spread out or dispersed. The range, deviation and
standard deviation are measures of variation.
Mean
The average. The mean is obtained by finding the sum of the data values and dividing by the
number of data values.
Median
The middle value, or the average of the two middle values, of a set of ranked data.
Mode
The data value that occurs most frequently.
Normal curve
Also called a bell curve. Data that is distributed symmetrically about the mean so that most
of the data is close to the mean.
Normal distribution
A distribution that takes the shape of a normal curve when graphed. Approximately 68% of
the data values will fall within one standard deviation of the mean, 95.5% will fall within two
standard deviations of the mean and 99.7% of the data will fall within three standard
deviations of the mean.
Percentile
One of the 100 values that divide a set of ranked data into 100 equal intervals. The 48th
percentile is a value that is higher than 48% of all the data values.
Population
A large group from which samples are taken for statistical analysis.
Quartile
One of four values that divide a set of ranked data into four equal intervals. The first quartile
is equal to the 25th percentile.
Random
A value is random if it has an equal chance of occurring as any other value from the same set.
Random sample
A sample that has the same probability of being chosen as any other sample of the same size.
Range
The difference between the largest data value and the smallest data value.
Ranked data
Data that is listed from highest to lowest or lowest to highest.
Sample
A small set of data chosen from a larger set of data.
Sampling error
The amount of error associated with a calculated value as determined by the size of the
sample.
Standard deviation
The square root of the average squared deviation of a set of data.
Statistic
A value calculated from a set of data. The mean and z scores are statistics.
Statistics
A branch of mathematics that collects, organizes and analyzes data.
Stem and leaf plot
A table of data values where the last digits of data values (leaves) are strung out behind their
first digits (or stem values).
Survey
Information derived from a sampling of a certain population.
Tally
A method of counting data using “tic” marks.
Yes population
A 40% yes population is one that has responded yes to a particular question 40% of the time.
z score
Also known as a standard score. The value obtained by dividing the deviation by the standard
deviation.
Unit 1: The uses and abuses of statistics
The word statistics has two meanings. A statistic is a numerical measurement describing
some characteristic of a set of data. For example, a statistic like 290 pounds could be used to
describe the average or mean weight of a football team. Statistics is also a collection of
methods for planning experiments, collecting data, analyzing the data and drawing
conclusions.
Statistics can be used to misrepresent a situation. Suppose a small store employs 6 people
who earn an average, or mean, wage of $8.50 per hour as calculated below,
\[ \frac{\$8 + \$8 + \$8 + \$8 + \$9 + \$10}{6} = \$8.50 \]
Now suppose the store owner, who earns $40 per hour, includes his or her own wage in the
calculation,
\[ \frac{\$8 + \$8 + \$8 + \$8 + \$9 + \$10 + \$40}{7} = \$13 \]
If the store owner reports that the average wage earned at the store is $13, he or she is
misrepresenting the situation, since the store owner is the only person making $13 per hour or
more.
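The two averages above can be checked with a short Python sketch. The variable names are illustrative only, and the median (defined in the Glossary) is included as an assumption about which statistic would describe the typical wage more fairly.

```python
from statistics import mean, median

# Hourly wages of the six employees in the example above.
employee_wages = [8, 8, 8, 8, 9, 10]
owner_wage = 40

# Mean wage of the employees alone, then with the owner included.
mean_without_owner = mean(employee_wages)
mean_with_owner = mean(employee_wages + [owner_wage])

# The median is barely affected by the single large value.
median_with_owner = median(employee_wages + [owner_wage])

print(mean_without_owner)   # 8.5
print(mean_with_owner)      # 13
print(median_with_owner)    # 8
```

One very large value pulls the mean from $8.50 up to $13, while the middle value of the seven wages stays at $8.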
Another source of deceptive statistics results from the faulty collection of data. Companies
that conduct public opinion polls have to be extremely careful that they survey a large
enough sample of the population and also an unbiased segment of the population. For
example, suppose a poll was conducted in BC to determine whether a luxury tax should be
imposed on buyers of new pick-up trucks. The citizens of Prince George might respond quite
differently to the poll than the residents of Victoria. The poll could be quite biased if it was
only conducted in Victoria, or only conducted in Prince George.
Statistical graphs can be presented in a deceptive manner. Consider the two bar graphs
below depicting the same data.
[Figure: two bar graphs of hours spent watching TV (vertical axis) for men and women (horizontal axis), drawn with different vertical scales.]
Without a close inspection of the vertical scale, the first bar graph creates the impression that
men watch twice as much TV as women do. In the second graph, the vertical scale starts at 0,
and the lengths of the bars are proportional to the actual hours of TV watching.
The above examples illustrate only a few of the abuses of statistics. To avoid the “lies and
damned lies”, every step of the statistical process must be scrupulously carried out; from the
collection of the data, to the calculation of a statistic, to the presentation of conclusions.
Exercise 1
[Figure: graph comparing women and men on a scale marked from 75 to 80.]
b. A questionnaire asking family members to list the number of books they read in
the last year is mailed to 1000 homes in the city of Vancouver.
c. To determine how many college students are smokers, Butler asks the first 20
students he sees standing outside the main entrance to the college, “Are you a
smoker?”
Activity 1: Watching TV
Ask every student in the room to write down, on a small piece of paper, an estimate of the
number of minutes they spent watching TV yesterday. Collect the data (pieces of paper) in
some sort of container.
b. Do you think that this one piece of data is a good representation of the actual (yet
to be calculated) average?
2. Replace the first piece of paper and draw two pieces of data. Find the mean of these
two.
3. Replace the two pieces of data and now draw four pieces of paper. What is the
average time spent watching TV based on just these four pieces of data?
4. Replace the four pieces of paper and draw one half (or one half plus one) of the data.
Find the average time for one half the data.
a. How do the previous calculations of the mean, using smaller samples of the total
data, compare to the actual mean?
b. Some of the students may have recorded 0 minutes for the time they spent
watching TV yesterday. How did these zeros affect the mean time?
c. Now calculate the mean for only those students who actually watched some TV
yesterday.
Unit 2: Introduction: Mean, median, mode, range
and graphs
Statistics is the science of collecting, classifying, presenting and interpreting numerical data.
The data are numbers or measurements collected by a statistician. For example, the data
below are scores obtained by 12 students on a math quiz out of 40 marks.
32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36
In order to statistically describe the above data, we might ask the following questions.
4. What is the difference between the highest and lowest score, or what is the range?
6. How can the data be represented graphically, with a line graph, bar graph or stem
and leaf plot?
The mean, median, mode, and range are four statistics which can be used to describe a set of
data. The mean, median, and mode are called measures of central tendency because they tell
us where the data is centered. The range is a measure of variation because it tells us how
much the data is spread out.
The mean is the most important measure of central tendency. It is calculated as follows:
The mean is the sum of all the data values divided by the number of data values, or
\[ \bar{x} = \frac{\Sigma x}{n} \]
where \(\bar{x}\) is the mean, x is a data value, and n is the number of data values.
The symbol “Σ” is the Greek letter “sigma” and means “the sum of all”. Here “Σx” means
the sum of all x (or data) values.
Example 1
Find the mean score of the following 12 math test scores;
32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32 and 36.
Solution
\[ \bar{x} = \frac{\Sigma x}{n} = \frac{32 + 39 + 32 + 27 + 30 + 34 + 32 + 35 + 40 + 36 + 32 + 36}{12} = \frac{405}{12} = 33.75 \]
The median is the middle value when the data is arranged from highest to
lowest. If there are two middle values, then the median is the mean of these two
values.
Example 2
Find the median score of the above 12 math scores.
Solution
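Ranked from highest to lowest, the scores are:
40, 39, 36, 36, 35, 34, 32, 32, 32, 32, 30, 27
The middle values are the 6th and 7th scores, 34 and 32.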
Note that because there are an even number (twelve) of data values, we have two middle
values. The mean of these two values is
\[ \frac{34 + 32}{2} = \frac{66}{2} = 33 \]
The mode is the data value which occurs most often (or with the greatest
frequency). The mode may not exist.
Example 3
Determine the mode of the above twelve math scores.
Solution
40, 39, 36, 36, 35, 34, 32, 32, 32, 32, 30, 27
Notice that the score of 32 occurs most often. Hence the modal score is 32.
The range is the difference between the largest data value and the smallest data
value.
Example 4
Determine the range of the 12 math scores above.
Solution
40, 39, 36, 36, 35, 34, 32, 32, 32, 32, 30, 27
The highest value is 40. The lowest is 27. The range is,
40 – 27 = 13.
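The four statistics found in Examples 1 to 4 can be verified with Python's standard `statistics` module. This is only a checking sketch; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# The 12 math quiz scores used in Examples 1 to 4.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

mean_score = mean(scores)                 # sum of the scores divided by 12
median_score = median(scores)             # mean of the two middle ranked values
modes = multimode(scores)                 # list of the most frequent value(s)
score_range = max(scores) - min(scores)   # largest value minus smallest value

print(mean_score, median_score, modes, score_range)
```

`multimode` returns a list because a data set can have more than one mode. For these scores the list holds the single value 32.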
The line graph, also known as a frequency distribution, plots data values
(horizontal axis) against the frequency of those data values (vertical axis).
Example 5
Plot the frequency distribution of the 12 math scores.
Solution
Label each axis. Plot the points and connect them with straight lines.
[Figure: line graph of Frequency (vertical axis, 0 to 4) against Math Scores (horizontal axis, 25 to 39).]
Bar graphs, or frequency histograms, use bars to represent frequencies for
certain data intervals.
Example 6
Prepare a frequency histogram for the 12 math scores.
27, 30, 32, 32, 32, 32, 34, 35, 36, 36, 39, and 40.
Solution
The range of scores is 13. If we want 5 bars, each data interval should be 3 scores wide.
Interval    Scores                  Frequency
38 – 40     39, 40                  2
35 – 37     35, 36, 36              3
32 – 34     32, 32, 32, 32, 34      5
29 – 31     30                      1
26 – 28     27                      1
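The tally above can be reproduced with a short Python sketch that sorts the 12 scores into the same five intervals. The starting value of 26 and the width of 3 are taken from the table; the variable names are illustrative.

```python
# The 12 math quiz scores from this unit.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

interval_width = 3   # each interval covers 3 consecutive scores
first_lower = 26     # lower limit of the lowest interval, as in the table

# Build the intervals (26, 28), (29, 31), ..., (38, 40).
intervals = [(low, low + interval_width - 1)
             for low in range(first_lower, max(scores) + 1, interval_width)]

# Count how many scores fall inside each interval (limits included).
frequency = {
    (low, high): sum(low <= s <= high for s in scores)
    for low, high in intervals
}

for (low, high), count in sorted(frequency.items(), reverse=True):
    print(f"{low} - {high}: {count}")
```

The five frequencies add up to 12, which is a quick check that every score was counted exactly once.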
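[Figure: frequency histogram titled “Frequency of math scores”, with Frequency on the vertical axis and Math scores on the horizontal axis in the intervals 26–28, 29–31, 32–34, 35–37 and 38–40.]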
A stem and leaf plot is similar to a bar graph, except the bars are replaced by
digits. These leaf digits are the last digits of data having the same first digit(s),
called the stem.
Example 7
Construct a stem and leaf plot for the following data (minutes taken to run 15 km).
49, 52, 53, 53, 57, 58, 60, 63, 64, 66, 66, 66, 69, 70, 70, 72, 75, 77, 77, 79, 83, 88, 89,
94, 106
Solution
The stems are the first digits of the data numbers (4, 5, 6, 7, 8, 9 and 10). The leaf digits are
strung out beside their stems as displayed below.
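As a sketch of the construction just described, the following Python code splits each running time into a stem (all digits except the last) and a leaf (the last digit) and prints one row per stem. The function name `stem_and_leaf` is illustrative, not part of any library.

```python
from collections import defaultdict

# Minutes taken to run 15 km (the data from Example 7).
times = [49, 52, 53, 53, 57, 58, 60, 63, 64, 66, 66, 66, 69, 70, 70,
         72, 75, 77, 77, 79, 83, 88, 89, 94, 106]


def stem_and_leaf(values):
    """Group the last digit (leaf) of each value under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(times)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>2} | {leaves}")
```

The printed rows run from stem 4 (the single value 49) to stem 10 (the single value 106), with the longest rows at stems 6 and 7.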
Exercise 2
a. Determine the mean, median, mode, and range of the above data.
range = _________
b. Which statistic (mean, median, mode, or range) best describes the annual salary
most of the workers receive?
_______________________________
c. Which statistic best describes the gap that exists between annual salaries?
_______________________________
2. a. Find the daily mean temperature and daily range for the temperatures below.
DAY   HIGH (°C)   LOW (°C)   DAILY MEAN   DAILY RANGE
1     6           0
2     3           2
3     12          4
4     13          10
5     15          7
6     13          10
7     12          8
d. Determine the mean range over this 7 day period.
_______________
3. Below are the statistics for a final exam given to two different math classes.
(The exam was worth 100 marks.)
b. Which class probably had the student with the highest mark?
_______________
c. Which class probably had the student with the lowest mark?
_______________
4. The following data represents the average monthly temperature for Vancouver over a
one year period.
Month                      J  F  M   A   M   J   J   A   S   O  N  D
Average temperature (°C)   4  7  10  12  15  20  22  19  14  8  3  0
[Figure: two blank graph grids, each titled “Average monthly temperatures for Vancouver over a one-year period”, with Temperature (degrees C) from 0 to 24 on the vertical axis and Month (J to D) on the horizontal axis.]
5. The following data represents final grades for a computer course. Construct a stem and
leaf plot for the data.
40%, 42%, 42%, 50%, 54%, 56%, 58%, 66%, 66%, 68%, 69%, 70%, 73%, 80%,
84%, 85%, 88%, 89%, 93%
6. The following stem and leaf plot represents the time in seconds taken to type 60 words
by a class of business students.
4 4 6 8
5 0 1 6 8 8 9
6 0 0 2 3 3 5 7 7 7
7 1 3 3 4 4
8 4
9 5
\(\bar{x}\) =                    median =
7. Jody wants to receive a grade of 80% for the laboratory part of her Chemistry course.
So far, she has a 78% average on her last 5 labs. What grade does she need on her
sixth lab to earn an 80% average?
8. Neil has 182 out of 260 marks thus far in his math course. If the final exam is worth
100 marks, what mark does Neil need on the final exam to earn a final grade of 75%
for the course? How do you feel about Neil’s chances, and why?
Activity 2: Shoe size
Ask at least 15 male students and 15 female students to write their shoe size on a piece of
paper. Include M for male and F for female as well, on the slip of paper.
MALE FEMALE
MALE FEMALE
mean
median
mode
range
a. Why would it be important to know the range of shoe sizes for males and
females?
b. Why would it be important to know the modal shoe size for males and females?
c. Why would knowing the mean and median shoe sizes be important?
[Figure: two blank graph grids with Frequency from 0 to 9 on the vertical axis.]
Unit 3: Measures of position: quartiles and
percentiles
The mean, median and mode are measures of central tendency. These statistics tell us where
the data is centered. Quartiles and percentiles are measures of position. Quartiles and
percentiles can be used to compare one particular data value to all the rest of the data. These
statistics enable us to answer questions such as, “Is a certain value unusually high, unusually
low or just average?”
Quartiles divide ranked data into four equal parts. Ranked data is data arranged
from highest to lowest, or lowest to highest. The quartile values are denoted as
Q1, Q2 and Q3. Q1 separates the bottom 25% of the data from the top 75%. Q2 is
the same as the median and separates the top 50% from the bottom 50% of the
data. Q3 separates the top 25% of the data from the bottom 75%.
Example 1
Find Q1, Q2 and Q3 for the following set of data (resting heart rates of 30 college males).
55 94 80 68 78 61
60 55 88 60 70 70
70 60 86 42 65 74
72 68 80 100 58 84
81 72 71 85 57 96
Solution
42 60 68 71 80 86
55 60 68 72 80 88
55 60 70 72 81 94
57 61 70 74 84 96
58 65 70 78 85 100
Find Q2 or the median, first. Since there are 30 values, there are two middle values, 70 and
71.
\[ Q_2 = \frac{70 + 71}{2} = 70.5 \]
To find Q1, find the middle value of the bottom 50% of the data. The bottom half of the data
has 15 values and ranges from 42 to 70. Counting to the 8th value,
Q1 = 60
Q3 can be found in a similar fashion. Q3 is the middle value of the upper 50% of the data or
the 8th value from 71 to 100.
Q3 = 81
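The ranking and counting in this example can be sketched in Python. The sketch assumes an even number of data values, as in this example, so that the ranked data splits cleanly into a lower half and an upper half. Variable names are illustrative.

```python
from statistics import median

# Resting heart rates of 30 college males (Example 1).
heart_rates = [55, 94, 80, 68, 78, 61,
               60, 55, 88, 60, 70, 70,
               70, 60, 86, 42, 65, 74,
               72, 68, 80, 100, 58, 84,
               81, 72, 71, 85, 57, 96]

ranked = sorted(heart_rates)   # lowest to highest
half = len(ranked) // 2        # 15 values in each half

q2 = median(ranked)            # mean of the two middle values
q1 = median(ranked[:half])     # middle (8th) value of the bottom half
q3 = median(ranked[half:])     # middle (8th) value of the top half

print(q1, q2, q3)
```

Other quartile conventions exist (for example, the default method of `statistics.quantiles`), and they can give slightly different values for the same data.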
Finding Q1, Q2 and Q3 in the above example was simply a matter of ranking and counting.
But there was an even number (n = 30) of data values. When there is an odd number of data
values, a different method of calculating Q1 and Q3 will have to be used.
The three quartiles Q1, Q2 and Q3 divide the ranked data into 4 equal parts. There are 99
percentiles P1, P2, P3, …, P99 that divide the ranked data into 100 equal parts. For example,
P80 is called the “80th percentile” and P80 is a value that is higher than 80% of the rest of the
data values. P10 is a value that is higher than 10% of the data values.
Example 2
Below are the number of chin ups completed in one minute by 70 male college students. The
data is ranked lowest to highest.
0 3 7 8 10 13 20
0 3 7 9 10 13 20
1 4 7 9 10 13 22
1 4 7 9 10 14 22
2 6 7 9 11 14 23
2 6 7 9 11 15 25
2 6 7 9 11 15 28
3 6 8 10 11 15 30
3 6 8 10 12 18 30
3 7 8 10 13 20 33
Find the percentiles associated with the data values 0, 7 and 25.
Solution
a. No data values are less than the value 0.
For data value 0, \[ k = \frac{0}{70} \times 100\% = 0\% \]
So, 0 is the 0th percentile or P0 = 0. This makes sense because 0 is not higher than any
other value.
b. There are 19 data values that are less than the value 7.
For data value 7, \[ k = \frac{19}{70} \times 100\% = 27\% \text{ (rounded)} \]
So, P27 = 7.
c. There are 65 data values that are less than the value 25.
For data value 25, \[ k = \frac{65}{70} \times 100\% = 93\% \text{ (rounded)} \]
So, P93 = 25. The person who did 25 chin ups in one minute did better than 93% of the
other college men.
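The calculation used in this example (count the values below the given value, divide by n, and multiply by 100%) can be sketched in Python. The function name `percentile_of` is illustrative, and rounding to the nearest whole percent follows the example.

```python
# Chin ups completed in one minute by 70 male college students (Example 2),
# listed from lowest to highest.
chin_ups = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3,
            3, 3, 4, 4, 6, 6, 6, 6, 6, 7,
            7, 7, 7, 7, 7, 7, 7, 8, 8, 8,
            8, 9, 9, 9, 9, 9, 9, 10, 10, 10,
            10, 10, 10, 10, 11, 11, 11, 11, 12, 13,
            13, 13, 13, 14, 14, 15, 15, 15, 18, 20,
            20, 20, 22, 22, 23, 25, 28, 30, 30, 33]


def percentile_of(value, data):
    """Percent of the data values that are less than `value`, rounded."""
    below = sum(x < value for x in data)
    return round(100 * below / len(data))


for value in (0, 7, 25):
    print(value, percentile_of(value, chin_ups))
```
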
The reverse procedure, finding what data value corresponds to a given percentile, is rather
involved.
To find the data value associated with a certain percentile, Pk, follow the steps
below:
Step 2 Calculate \( C = \frac{k}{100} \, n \), where k is the percentile in question and n is the
number of data values.
Step 4 If C is not a whole number, round C up to the next larger whole number
and Pk = the Cth (rounded up) data value, counting from the lowest
value.
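A minimal Python sketch of this procedure follows, applied to the chin up data from Example 2. The rounding-up branch is Step 4 as stated above. The whole-number branch is an assumption based on the convention used in Example 3 of this unit, where the Cth and (C + 1)th values are averaged. The function name `value_at_percentile` is illustrative.

```python
# Chin ups completed in one minute by 70 male college students (Example 2).
chin_ups = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3,
            3, 3, 4, 4, 6, 6, 6, 6, 6, 7,
            7, 7, 7, 7, 7, 7, 7, 8, 8, 8,
            8, 9, 9, 9, 9, 9, 9, 10, 10, 10,
            10, 10, 10, 10, 11, 11, 11, 11, 12, 13,
            13, 13, 13, 14, 14, 15, 15, 15, 18, 20,
            20, 20, 22, 22, 23, 25, 28, 30, 30, 33]


def value_at_percentile(data, k):
    """Return the data value for percentile P_k, where k is a whole number from 1 to 99."""
    ranked = sorted(data)                  # rank the data from lowest to highest
    n = len(ranked)
    # C = (k / 100) * n, kept as whole part and remainder to avoid rounding error.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # Assumed convention (as in Example 3): C is a whole number, so
        # average the Cth and (C + 1)th values.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # Step 4: C is not a whole number, so round up and take that value.
    return ranked[whole]


print(value_at_percentile(chin_ups, 75))   # C = 52.5, so use the 53rd value
print(value_at_percentile(chin_ups, 50))   # C = 35, so average the 35th and 36th values
```

Python lists are indexed from 0, so the Cth ranked value is stored at index C - 1. That is why `ranked[whole]` is the value one position past the whole part of C.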
Example 3
The following stem and leaf plot depicts the number of words typed in 2 minutes by 125
office administration students.
7 3 7 8 9 9
8 0 1 2 3 5 5 5 8 8 9
9 0 0 0 2 3 3 7 8 8 8 9 9
10 0 0 0 1 1 1 1 1 3 3 7 8 8 8 8 8
11 0 1 1 2 2 2 2 3 4 4 4 5 5 7 7 7 7 8 8 9 9 9
12 0 0 1 1 1 1 1 5 5 5 6 6 6 6 6 7 7 8 9
13 0 0 0 0 0 2 3 3 3 6 6 6 6 7 7 8 8 9
14 0 0 1 1 1 1 3 3 4 4 6 7 8 8 8
15 0 0 3 8 9 9
16 1 7
Solution
a. To find P40, calculate \( C = \frac{k}{100} \, n \) where k = 40 and n = 125,
\[ C = \frac{40}{100} \times 125 = 50 \]
C is a whole number, so P40 is the mean of the 50th data value (from the stem and leaf plot),
which is 112, and the 51st data value, which is 113.
\[ P_{40} = \frac{112 + 113}{2} = 112.5 \]
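b. To find P25, calculate \( C = \frac{k}{100} \, n \) where k = 25 and n = 125,
\[ C = \frac{25}{100} \times 125 = 31.25 \]
Round C = 31.25 up to 32, so P25 is the 32nd data value from the lowest value. Stems 7, 8 and 9
hold 5 + 10 + 12 = 27 values, so the 32nd value is the 5th leaf on stem 10.
P25 = 101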
\[ C = \frac{95}{100} \times 125 = 118.75 \]
Round C up to 119.
Exercise 3
1. The following stem and leaf plot depicts the number of words typed in 2 minutes by
125 office administration students.
7 3 7 8 9 9
8 0 1 2 3 5 5 5 8 8 9
9 0 0 0 2 3 3 7 8 8 8 9 9
10 0 0 0 1 1 1 1 1 3 3 7 8 8 8 8 8
11 0 1 1 2 2 2 2 3 4 4 4 5 5 7 7 7 7 8 8 9 9 9
12 0 0 1 1 1 1 1 5 5 5 6 6 6 6 6 7 7 8 9
13 0 0 0 0 0 2 3 3 3 6 6 6 6 7 7 8 8 9
14 0 0 1 1 1 1 3 3 4 4 6 7 8 8 8
15 0 0 3 8 9 9
16 1 7
b. Jill can type 140 words in 2 minutes. What percentile score is this? Jill can type
faster than what percent of the other 124 students?
c. A student needs to type 90 words in 2 minutes in order to pass the test. What
percentile is associated with this value? How many students failed the test?
d. The instructor decided that those students who achieved less than the 20th
percentile would have to retake the test. What data value is represented by P20?
2. Below are the number of chin ups completed in one minute by 70 male college
students. The data is ranked lowest to highest.
0 3 7 8 10 13 20
0 3 7 9 10 13 20
1 4 7 9 10 13 22
1 4 7 9 10 14 22
2 6 7 9 11 14 23
2 6 7 9 11 15 25
2 6 7 9 11 15 28
3 6 8 10 11 15 30
3 6 8 10 12 18 30
3 7 8 10 13 20 33
c. In a. you found that P75 = 13 but when the process was reversed, in b. you found
that 13 = P70. Explain this discrepancy.
Activity 3: Mutual funds
Work in groups of 3 or 4.
1. On the following page, thirteen Canadian “Asia ex-Japan” mutual funds are listed
with their current value and percentage gains over 1 day, 1 week, 30 days and 1 year.
On page 66 in the spaces provided, rank each fund from best to worst based on their
percentage gain for the given time interval. For example, for the “1 day %” ranking
column, Fund B would rank 1st and Fund H would rank last or 13th.
2. a) Which, if any, funds always ranked in the top half of the group (above Q2) in all
four categories? (These would be the best funds with low ranks.)
b) Are there any fund(s) that ranked above Q3 in all categories? Which one(s)?
3. a) Find the sum of the four rankings for each fund in the last column.
4. a) Is there any fund that has a consistently high ranking? Which one?
5. a) Rank what your group thinks are the three best performing “Asia ex-Japan”
mutual funds.
Activity 3: Asia ex-Japan Canadian mutual funds
As of December 23, 1999
Fund letter code | Fund name | Price $ | 1 day $ chg | 1 day % | Rank | 1 week % | Rank | 30 day % | Rank | YTD % | Rank | Sum of ranks
A | AGF Asian Growth Class | 13.880 | .620 | 4.68 | | 10.86 | | 16.35 | | 56.31 | |
Source: http://globefund.com
Unit 4: The standard deviation
Notice that the mean and median for both sets are exactly the same. Their measures of
central tendency are the same, but the range is very different for both sets. The range for Set
A is 40 and for Set B it is 20. The range is a very simple measure of variation. Set A and Set
B are dispersed, or spread out, quite differently.
One way of measuring the variation or dispersion of specific data values is to calculate the
deviation, \(x - \bar{x}\), where x is a data value and \(\bar{x}\) is the mean. For example, in Set A above, the
amount that 70 deviates from the mean is 20.
A very precise way of measuring the variation of an entire set of data is to calculate the
standard deviation. The standard deviation is a statistic that indicates a kind of average
deviation from the mean of the data. The larger the standard deviation number, the more
spread out the data is. The Greek letter sigma, σ, is the symbol for standard deviation.
\[ \sigma = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n}} \]
where σ is the standard deviation, x is a data value, \(\bar{x}\) is the mean of the data,
and n is the number of data values.¹
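¹ The formula given above is for finding the standard deviation of a given population. The formula
\[ \sigma_{n-1} = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}} \]
is used when the data are a sample drawn from a larger population.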
Example 1
Find the mean and standard deviation of the 12 science quiz scores below.
Solution
\[ \bar{x} = \frac{\Sigma x}{n} = \frac{5 + 6 + 8 + 8 + 9 + 10 + 10 + 11 + 11 + 12 + 15 + 15}{12} = \frac{120}{12} = 10 \]
x      x − x̄            (x − x̄)²
5      5 − 10 = −5      (−5)² = 25
6      −4               16
8      −2               4
8      −2               4
9      −1               1
10     0                0
10     0                0
11     1                1
11     1                1
12     2                4
15     5                25
15     5                25
                        Σ(x − x̄)² = 106
\[ \sigma = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n}} = \sqrt{\frac{106}{12}} = \sqrt{8.83} \approx 2.97 \]
The reason we square the deviations before finding the average is that positive and negative
deviations would otherwise cancel each other out. Notice above that the sum of the deviations,
\(x - \bar{x}\), is zero.
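The paper and pencil calculation above can be checked with a short Python sketch. It follows the same steps as the table (deviation, squared deviation, average, square root) and then compares the result with `statistics.pstdev`, the standard library function for the population standard deviation. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

# The 12 science quiz scores from Example 1.
scores = [5, 6, 8, 8, 9, 10, 10, 11, 11, 12, 15, 15]

x_bar = mean(scores)                                 # the mean
deviations = [x - x_bar for x in scores]             # x minus the mean
squared_deviations = [d ** 2 for d in deviations]    # each deviation squared

# Population standard deviation: square root of the average squared deviation.
sigma = sqrt(sum(squared_deviations) / len(scores))

print(sum(deviations))           # the deviations cancel out
print(sum(squared_deviations))   # the total used in the formula
print(round(sigma, 2))           # standard deviation to two decimal places
print(round(pstdev(scores), 2))  # same result from the library function
```

For the n − 1 (sample) version of the formula, the standard library provides `statistics.stdev` instead.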
The paper and pencil method of calculating the standard deviation can be extremely lengthy,
especially when n, the population size, is large.
Example 2
Calculate the mean and standard deviation of the data in Example 1 using a calculator with a
statistics mode. (As many calculators function differently, please bring your calculator
manual to class.)
Solution
Put your calculator in the statistics mode (if necessary), enter the data and find \(\bar{x}\) and σ.
2. Enter the data (Find your data button. FRQ can be used to input repeated data
values.):
5 DATA (or Σ+ )
6 DATA
8 DATA
8 DATA
9 DATA
10 DATA
10 DATA
11 DATA
11 DATA
12 DATA
15 DATA
15 DATA
3. Find \(\bar{x}\) and σ (you may have to press SHIFT or 2nd or some other key).
To find \(\bar{x}\), press:
To find σ, press:
Now complete Exercise 4 and check your answers.
Exercise 4
1. The following data represents quiz scores on a test out of 10 by 21 math students.
0, 0, 1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10
[Figure: blank graph grid with Frequency (0 to 3) on the vertical axis and Scores (0 to 10) on the horizontal axis.]
h. How many scores are there between 2.9 and 9.1, or, how many scores lie
within one standard deviation of the mean?
_____________________
2. a. In Example 1 of this section, the mean of the 12 science quiz scores was 10 and
the standard deviation was 2.97. What percent of the scores lie within one
standard deviation of the mean? The scores were 5, 6, 8, 8, 9, 10, 10, 11, 11, 12,
15, and 15.
___________________
b. What percent of the scores lie within two standard deviations of the mean?
___________________
3. Suppose the standard deviation of a set of numbers is 0. What does this tell you about
the data?
___________________
4. Two classes, each with 100 students, wrote an examination with a possible maximum
score of 100. In the first class the mean score was 75 and the standard deviation was
5. In the second class, the mean score was 70 and the standard deviation was 15.
Which of the two classes do you think had more scores of 85 or better? Why?
__________________
5. The following data represents the weights (in kg) of a small class of students:
78, 42, 72, 88, 86, 97, 91, 79, 82, 86, 91, and 74
6. It is found that the time taken by a bank teller to serve 7 people is 3, 3, 4, 5, 6, 6, and
7 minutes.
Activity 4: Pop quiz
1. Your instructor will ask you to write the quiz in Appendix B (page 67) along with
the rest of the class. You will only have 5 minutes to complete the test. Do not look at
it until your instructor says “Go!”
2. After all the tests are marked out of 20, record the marks below, ranked from highest
to lowest. (See page 75 for the answers.)
mode =
\(\bar{x}\) =
σ=
Q1 =
Q2 =
Q3 =
Unit 5: The normal distribution
Data can be distributed in quite a variety of ways. Consider the frequency histograms below,
Some examples of the above distributions follow. Imagine that a group of math students were
given a math test out of 40. The difficulty of the test can have quite an influence on the shape
of a frequency distribution.
[Figure: frequency histogram of Test Scores (0 to 40).]
Case 2 If the test was quite difficult and most of the students
received a mark of less than 50%, this distribution
would be considered skewed to the right.
[Figure: frequency histogram of Test Scores (0 to 40) for Case 2.]
Case 3 If the test was given to two different math classes, one
of which had not been taught half the material, this
distribution would be bimodal.
[Figure: frequency histogram of Test Scores (0 to 40) for Case 3.]
[Figures: two further frequency histograms of Test Scores (0 to 40).]
When a population is measured for some attribute or ability, the most frequently occurring
distribution is the normal distribution. When a product is tested for some characteristic the
result is most often a normal distribution.² For example, if a sample population of men (or
women) is tested for physical strength (or blood pressure or intelligence or shoe size), most
of the people will be close to average strength with a small minority either much stronger
than or much weaker than the majority.
In the previous assignment you were often asked, “what percent of the data lie within one
standard deviation of the mean?” Knowing how a population bunches around its mean value
can be quite useful.
Chebyshev’s Theorem states that the proportion of any distribution that lies
within k standard deviations of the mean is at least \( 1 - \frac{1}{k^2} \), where k > 1. This
applies to any distribution of data.
For example, when k = 2, Chebyshev’s Theorem states that \( 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} \) or more of
the data will lie within 2 standard deviations of the mean.
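The bound in Chebyshev’s Theorem is easy to tabulate. The sketch below uses exact fractions so that the result for k = 2 matches the worked value above. The function name `chebyshev_lower_bound` is illustrative, and the sketch assumes k is a whole number greater than 1.

```python
from fractions import Fraction


def chebyshev_lower_bound(k):
    """Least proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - Fraction(1, k * k)


for k in (2, 3, 4):
    bound = chebyshev_lower_bound(k)
    print(k, bound, f"{float(bound):.1%}")
```

The bound is a guaranteed minimum for any distribution, so an actual data set will often have a larger proportion within k standard deviations.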
When data is distributed normally, (see top left histogram) Chebyshev’s proportion can be
“improved” on dramatically.
The vast majority of statistical analysis is done on normally distributed data; from biology to
psychology to economics to medicine to sports.
² The normal distribution is not appropriate for all kinds of data. For example, small samples or biased
populations would not necessarily be normally distributed and could be statistically analyzed by different
methods.
The histogram below is a representation of an “ideal” normal population, where the mean is 0
and the standard deviation is 1.
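[Figure: histogram of an ideal normal population, with the horizontal axis marked from -3 to 3 standard deviations. On each side of the mean, 34% of the data lies within one standard deviation, 13.5% lies between one and two standard deviations, and 2.5% lies beyond two standard deviations. In total, 68% of the data lies within one standard deviation of the mean, 95% within two, and 99.7% within three.]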
Exercise 5
1. Sixty college students were asked for the total number of children in their family. The
data collected follows:
1 6 3 5 5 3 4 1 2 7 3 2
3 4 5 3 1 3 2 1 4 4 2 2
3 9 4 3 3 5 3 5 7 3 1 1
3 5 2 6 4 3 3 3 3 3 2 3
4 3 5 7 3 2 1 2 3 2 4 3
[Figure: blank graph grid with Frequency (0 to 24) on the vertical axis and Number of children (1 to 9) on the horizontal axis.]
_______________
and _______________
h. What percent of the data lies within two standard deviations of the mean?
_______________
j. What percent of the data lies within three standard deviations of the mean?
_______________
k. Compare your answers for f., h., and j. to the results predicted by the Empirical
Rule. Does the result suggest an approximately normal distribution?
_______________
2. The following table tallies the number of hours of TV watched in one day by 75 high
school students. Only those who watched some TV yesterday were included in the
tally.
Hours Tally Frequency
0.5
1.0
1.5
2.0
2.5
3.0
3.5
d. What percent of the data lies within one standard deviation of the mean?
e. What percent of the data lies within two standard deviations of the mean?
[Figure: frequency histogram of IQ scores in the intervals 60–69, 70–79, 80–89, 90–99, 100–109, 110–119, 120–129 and 130–139, with Frequency on the vertical axis.]
e. What percent of the IQ’s lie within one standard deviation of the mean?
__________________
__________________
__________________
h. Does IQ seem to be normally distributed?
__________________
__________________
Activity 5: Brothers and sisters
Duplicate the survey conducted in Question 1 of Exercise 5. Ask 40 people how many
siblings (brothers and sisters) they have. Record the answers. Add 1 to each number so that
the person asked is included.
[Figure: blank graph grid with Frequency (0 to 24) on the vertical axis and Number of children (1 to 9) on the horizontal axis.]
h. What percent of the data lies within two standard deviations of the mean?
_______________
j. What percent of the data lies within three standard deviations of the mean?
_______________
k. Compare your answers for f., h., and j. to the results predicted by the Empirical
Rule. Does the result suggest an approximately normal distribution?
_______________
Unit 6: The normal curve
[Figure: the normal curve, with the horizontal axis marked from -3 to 3.]
The area under the curve represents 100% (or 1.00) of the data (or population). By the
empirical rule, the area under the curve and within one standard deviation of the mean is
68.26%, within two standard deviations is 95.44%, and within three standard deviations is
99.74%, as shown below.
[Figure: three normal curves, each with the horizontal axis marked from -3 to 3. The shaded area under the curve is 0.6826 within one standard deviation of the mean, 0.9544 within two standard deviations, and 0.9974 within three standard deviations.]
\[ y = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2}, \quad \text{where } e \approx 2.718 \]
By using z scores, the area of any region under the curve can be determined.
The z score or standard score represents the number of standard deviations a
data value is from the mean value. The formula for z is,
\[ z = \frac{x - \bar{x}}{\sigma} \]
Example 1
The mean IQ is 100 and the standard deviation is 15. If Frank has an IQ of 127, find his z
score.
Solution
\[ z = \frac{127 - 100}{15} = 1.8 \]
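The z score formula translates directly into a small Python function. The name `z_score` and its parameter names are illustrative.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / standard_deviation


# Frank's IQ of 127, with a mean IQ of 100 and a standard deviation of 15.
frank_z = z_score(127, 100, 15)
print(round(frank_z, 2))
```

A data value below the mean gives a negative z score. For example, an IQ of 85 would give a z score of -1.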
Example 2
Frank has an IQ of 127, or a z score of 1.8. What percent of the population have IQ scores
less than (or equal to) 127 and what percent have IQ scores higher than 127?
Solution
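[Figure: normal curve with the horizontal axis marked from -3 to 3 and z = 1.8 marked. The area to the left of the mean is 0.5000.]
The percent of the population that have IQ scores less than (or equal to) 127 is
0.5000 + 0.4641 = 0.9641, or 96.41%.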
A z score of 1.8 can be considered as equivalent to a percentile of 96, since it is higher than
96.41% of the population. In other words, an IQ of 127 has a 96th percentile ranking.
Areas under the normal curve can also be associated with probabilities. In the above example
we could say that the probability that some person would have an IQ score less than 127 is
0.9641 out of 1.0000, or about 96% = 96/100 = 24/25.
The probability that someone would have an IQ higher than 127 is about 4%, or 1/25.
Or, in a group of 25 people, chosen randomly, probably only one would have an IQ score of
more than 127.
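For readers with a computer at hand, the areas in Example 2 can be checked without the table. This is only a sketch: it uses `NormalDist` from Python's standard library, and the variable names are illustrative.

```python
from statistics import NormalDist

# IQ scores as described in Example 1: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Area to the left of 127 (IQ less than or equal to 127).
at_or_below = iq.cdf(127)

# Area to the right of 127 (IQ higher than 127).
above = 1 - at_or_below

print(f"at or below 127: {at_or_below:.4f}")
print(f"above 127:       {above:.4f}")
```

The first value agrees with the 96.41% found from the table, and the second is the remaining area of roughly 4%.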
Example 3
The waiting-in-line time at a certain grocery store is normally distributed with a mean
of 3.5 minutes and a standard deviation of 1.4 minutes.
a. What percent of the customers wait in line less than one minute?
b. What percent of the customers wait in line more than 5 minutes?
c. What is the probability that a customer would have to wait in line for more than 7 minutes?
Solution
\[ z = \frac{x - \bar{x}}{\sigma} = \frac{1 - 3.5}{1.4} = -1.79 \]
[Figure: normal curve with z = -1.79 marked. The area to the left of z is 0.0367.]
From the table (Appendix A, see page 66), a z score of −1.79 yields the same area as 1.79.
The area between 0 and −1.79 is 0.4633. The area to the left of −1.79 represents the
proportion of the population that waits in line less than a minute: 0.5000 − 0.4633 = 0.0367.
This means that 3.67% of the customers wait in line less than one minute.
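For a wait of 5 minutes,
\[ z = \frac{5 - 3.5}{1.4} = 1.07 \]
The area between 0 and 1.07 is 0.3577, so the area to the right of 1.07 is 0.5000 − 0.3577 = 0.1423.
This means that 14.23% of the customers have to wait in line for more than 5
minutes.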
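For a wait of 7 minutes,
\[ z = \frac{7 - 3.5}{1.4} = 2.5 \]
The area between 0 and 2.5 is 0.4938, so the area to the right of 2.5 is 0.5000 − 0.4938 = 0.0062.
This means that 0.62%, or less than 1%, of the customers would have to stand in line for
more than 7 minutes. The probability that someone would have to stand in line for more than
7 minutes is 62 in 10000, or 31/5000.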
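The three parts of Example 3 can be reproduced with a short script. This is a sketch under one stated assumption: each z score is rounded to two decimal places before the area is computed, to mimic reading a printed table. The helper name `table_z` is hypothetical.

```python
from statistics import NormalDist

MEAN_WAIT = 3.5  # minutes
SD_WAIT = 1.4    # minutes
standard_normal = NormalDist()  # mean 0, standard deviation 1


def table_z(x):
    """z score rounded to two decimals, as when using a printed table."""
    return round((x - MEAN_WAIT) / SD_WAIT, 2)


# Area to the left of z for a wait of less than 1 minute.
less_than_1 = standard_normal.cdf(table_z(1))
# Areas to the right of z for waits of more than 5 and more than 7 minutes.
more_than_5 = 1 - standard_normal.cdf(table_z(5))
more_than_7 = 1 - standard_normal.cdf(table_z(7))

print(f"less than 1 minute:  {less_than_1:.4f}")
print(f"more than 5 minutes: {more_than_5:.4f}")
print(f"more than 7 minutes: {more_than_7:.4f}")
```

Without the rounding step the answers shift slightly in the third or fourth decimal place, since the unrounded z scores are about −1.786 and 1.071.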
Example 4
A certain tire company tested their new Treadmasters and found that the tires’ tread life
averaged 60000 km with a standard deviation of 7000 km. The company wants to sell
the Treadmaster with a guarantee that they will last a certain number of kilometres.
The company is willing to give a money back guarantee on 10% of its worst tires. At
how many kilometres will 10% of the tires be worn out?
Solution
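We need to find the z score that marks off the lowest 10% of the area under the curve.
Since 50% of the area lies below the mean, this leaves 40% between z and 0. In the table,
the closest value to 0.4000 is 0.3997, and this corresponds to a z score of 1.28. Because
the worst tires lie below the mean, we use z = −1.28.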
[Figure: normal curve with z = -1.28 marked. 10% of the area lies to the left of z and 40% lies between z and 0.]
To determine the kilometre value that is associated with a z score of –1.28, solve the z score
formula for x.
\[ z = \frac{x - \bar{x}}{\sigma} \]
\[ -1.28 = \frac{x - 60000}{7000} \]
\[ x = 60000 - 1.28(7000) = 51040 \text{ km} \]
The company should guarantee the tires for 51040 kilometres, since about 10% of the tires will wear out before then.
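Working backwards from an area to a data value, as in Example 4, can also be sketched in Python. The `inv_cdf` method of the standard library's `NormalDist` returns the value below which a given proportion of the data lies. The variable names are illustrative.

```python
from statistics import NormalDist

# Tread life from Example 4: mean 60000 km, standard deviation 7000 km.
tread_life = NormalDist(mu=60000, sigma=7000)

# Distance below which the worst 10% of tires wear out,
# using the unrounded z score.
exact_km = tread_life.inv_cdf(0.10)

# The same calculation with the table value z = -1.28.
table_km = 60000 + (-1.28) * 7000

print(f"unrounded z:   {exact_km:.0f} km")
print(f"table z -1.28: {table_km:.0f} km")
```

The two results differ by about 11 km because the table z score is rounded to two decimal places. Either value is a reasonable basis for the guarantee.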
Exercise 6
Use Appendix A (see page 66) for the questions that follow.
1. Find the area under the normal curve between the following z scores.
2. The average resting heartrate for a normally distributed population of men was found
to be 62 beats per minute with a standard deviation of 11 beats per minute.
a. What percent of men have resting heartrates under 70 beats per minute?
_______________
b. What percent of men have resting heartrates over 70 beats per minute?
_______________
c. What percent of men have resting heartrates between 40 and 80 beats per minute?
_______________
3. In a group of normally distributed women, the average height is 5 feet 4 inches (64
inches) with a standard deviation of 2.8 inches.
a. What percent of the women are between 5 feet and 6 feet tall?
_______________
_______________
c. What is the probability that a woman would be shorter than 5 feet tall?
_______________
a. What percent of the students spent more than 40 hours per week studying?
_______________
_______________
5. Larry’s lightbulb factory manufactures bulbs with an average life of 1000 hours and a
standard deviation of 100 hours. To sell more light bulbs Larry wishes to give a
guarantee, but he is only willing to replace 5% of the lightbulbs sold. For how many
hours should the lightbulbs be guaranteed?
_____________
6. Workers in a certain factory are given a bonus every time they assemble more than
300 toy cars in one eight hour day. The number of toy cars assembled each day by a
worker is normally distributed with a mean of 270 cars and a standard deviation of 16
cars. What percent of the workers receive a bonus each day?
__________________________
7. A radar unit measures the speed of passing cars on a highway. The speeds of the cars
are normally distributed with a mean speed of 104 km/h.
a. Find the standard deviation of the speeds if 3% of the cars are travelling faster than
115 km/h.
_________________________
b. Using the standard deviation found above, what percent of the cars are travelling at
less than 90 km/h?
_________________________
____________________________
d. If there is a no tolerance rule in effect, and the posted speed is 100 km/h how many
cars would be considered to be speeding?
___________________________
Activity 6: Rolling dice
With a partner, roll a pair of dice 150 times. Your partner should tally each roll of the dice,
while you keep count of the number of rolls. Complete the tally sheet below and draw a
histogram for this data.
a.
Dice sum Tally Frequency
2
3
4
5
6
7
8
9
10
11
12
b. [Blank histogram grid. Horizontal axis: dice sum (2 to 12). Vertical axis: frequency (0 to 50).]
e. Find the interval x̄ − σ to x̄ + σ: ______________ to _______________
f. What percent of the rolls lie within one standard deviation of the mean?
__________________
g. What are x̄ − 2σ and x̄ + 2σ, and what percent of the data lies within 2 standard
deviations of the mean?
_________________________________________________________________
h. What are x̄ − 3σ and x̄ + 3σ, and what percent of the data lies within 3 standard
deviations of the mean?
__________________________________________________________________
__________________________________________________________________
Test this by rolling the dice ten times. How many times out of ten did a roll of 10, 11, or 12
occur?
Repeat: Roll ten more times and count how many times a 10, 11, or 12 was rolled.
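If real dice are not available, the activity can be imitated with a short simulation. This is a sketch only and is not a substitute for the hands-on activity. The function name `roll_pair_of_dice` and the seed value are arbitrary choices made here. The last lines count the theoretical chance of rolling a 10, 11 or 12, which is 6 of the 36 equally likely outcomes.

```python
import random
from collections import Counter


def roll_pair_of_dice(rolls, seed=None):
    """Simulate rolling two dice `rolls` times and tally the sums."""
    rng = random.Random(seed)
    sums = (rng.randint(1, 6) + rng.randint(1, 6) for _ in range(rolls))
    return Counter(sums)


# Tally sheet for 150 simulated rolls (the seed only makes the run repeatable).
tally = roll_pair_of_dice(150, seed=1)
for total in range(2, 13):
    print(f"{total:2d} {'#' * tally[total]} {tally[total]}")

# Theoretical probability of a sum of 10, 11 or 12.
outcomes = [a + b for a in range(1, 7) for b in range(1, 7)]
p_high = sum(1 for s in outcomes if s >= 10) / len(outcomes)
print(f"chance of 10, 11 or 12: {p_high:.3f}")
```

Changing the seed gives a different tally each time, in the same way that two pairs of students will not record identical results.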
Unit 7: Analysing survey data
Hardly a day goes by without the media reporting the results of some survey. Surveys are
conducted to determine what people like or dislike, what their opinions are on various issues
and what factors affect their lives.
Governments and businesses often use surveys in order to make decisions and to monitor the
effectiveness of previous decisions.
We will restrict our analysis of survey data to YES-NO population surveys only. In a YES-
NO survey, every member of the population answers a question with a YES or a NO. For
example, “Do you smoke?” is a YES or NO type question. “How many cigarettes do you
smoke every day?” is not a YES or NO question. If we were to ask every member of a
population the YES or NO question we would be taking a census of the population. If 20% of
the population answered YES to the question, this would be called a 20% yes population.
It is often very expensive and very time consuming to take a population census. By using a
smaller sample of the population, we can estimate the percentage of YES answers in the
population.
For example, in a recent survey of 1000 Canadians, 55% responded YES to the question, “Is
having a happy life the thing that matters most to you?” when compared to other things like
health and freedom. Even though the survey only represents a small portion of the total
population of Canada, its margin of error is calculated to be plus or minus 3.2% 19 times out
of 20.
In other words, if that survey were repeated 20 times, using a different 1000 Canadians each
time, then 19 times out of 20 the percentage of YES responses to the above question
would be between 51.8% (55% − 3.2%) and 58.2% (55% + 3.2%).
If the sample size is n, then the sampling error (in percent) of the percentage of YES
answers in the population is approximately,
\[ \frac{100}{\sqrt{n}} \]
When n > 100, the accuracy of 100/√n is quite good.
Since 19 out of 20 is equivalent to 95%, we can be “confident” that the percentage of YES
answers will be within the sampling error interval 95% of the time.
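The sampling error formula is easy to evaluate on a computer. The sketch below uses a hypothetical function name, `sampling_error`, and applies it to the survey of 1000 Canadians described earlier in this unit.

```python
import math


def sampling_error(n):
    """Approximate 95% sampling error, in percentage points, for a
    YES-NO survey with sample size n."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 100 / math.sqrt(n)


# Survey of 1000 Canadians.
print(f"{sampling_error(1000):.1f}%")
```

The result rounds to the margin of error of 3.2% quoted for that survey. Notice that quadrupling the sample size only halves the sampling error.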
Example 1
The business association in Grissville surveyed 384 people and asked each if they had eaten
dinner in a local restaurant at least once in the last week. 223 people responded YES to the
survey. Find the 95% confidence interval (and sampling error) for this sample.
Solution
The proportion of YES answers in the sample is 223/384 = 0.581, or 58%.
The sampling error, 100/√384, is about 5.1%.
The 95% confidence interval is 58% − 5.1% to 58% + 5.1%, or about 53% to 63%.
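The steps of Example 1 can be combined into one small function. This is a sketch, and the name `confidence_interval_95` is an illustrative choice. Unlike the worked solution, it does not round the percentage to 58% before adding and subtracting the sampling error, so the endpoints differ slightly in the first decimal place.

```python
import math


def confidence_interval_95(yes_count, n):
    """Return (low, high) in percent, using the 100 / sqrt(n) sampling error."""
    percent_yes = 100 * yes_count / n
    error = 100 / math.sqrt(n)
    return percent_yes - error, percent_yes + error


# Grissville survey: 223 YES answers out of 384 people.
low, high = confidence_interval_95(223, 384)
print(f"{low:.1f}% to {high:.1f}%")
```

Rounded to whole percents, the interval is again about 53% to 63%.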
Exercise 7
1. Read the following newspaper clipping.
_____________________________________________________________________
b. Suppose this poll was accurate to plus or minus 3 percentage points 19 times out of
20.
i. What is the least support Mr. Martin could have 19 times out of 20?
_________________________
ii. What is the most support Ms. Wilson could have 19 times out of 20?
_________________________
iii. Considering the above and the percent of undecided voters, does Mr. Martin
have a majority of the potential vote?
________________________
2. A survey was conducted to determine whether bicycle riders should have to pay for a
licence to ride on the city streets. 183 people were asked and 57 said yes.
3. An opinion poll reported that support for the Liberals stood at 58% with a sampling
error of 2.5% 19 times out of 20. Use the sampling error formula to determine the
sample size.
4. A certain poll found that 833 people out of 1240 thought that capital punishment
should be reinstated for first degree murder. What is the 95% confidence interval for
this sample?
5. If the poll in Question 4 was conducted in 1998, are the results still valid today?
6. Count the first one hundred letters in this sentence and then count how many times
the letter “e” occurred and count every “e” as a YES response.
b. What is the 95% confidence interval for the occurrence of the letter “e” in the
English language?
c. Repeat the above process with a different set of words and record the percent of
“e’s” found in the passage. Does this percent fall in the confidence interval range
found in part b. above?
7. See if your calculator has a random number function (RAN# button). It should produce
three digit decimal numbers randomly. In other words, every number has an equal
chance of showing up on your screen. Assume that every time a 3, 6 or 9 appears as
the last digit of the random number, it is the same as receiving a YES response. Then,
theoretically a YES response should occur 30% of the time, since 3, 6 and 9 are three
out of ten possible last digits in each random number.
a. Find the sampling error for this 30% yes population if the sample size is 100
random numbers.
b. Generate 100 random numbers and tally the number of times a 3, 6 or 9 occurred as
the last digit. What percent of the time did a 3, 6 or 9 occur?
d. Check with the other students. Did their samples produce percentages within the 20
to 40 percent interval 19 times out of 20 times?
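A computer can stand in for the calculator and for the other students in Question 7. The sketch below draws 20 samples of 100 random last digits and counts how many samples give a YES percentage between 20% and 40%. The function name `percent_yes` and the seed are arbitrary choices, and the count will vary from run to run if the seed is changed.

```python
import random


def percent_yes(sample_size, rng):
    """Percent of random last digits that are 3, 6 or 9
    (a 30% yes population)."""
    yes = sum(1 for _ in range(sample_size) if rng.randrange(10) in (3, 6, 9))
    return 100 * yes / sample_size


rng = random.Random(2024)  # seed chosen only to make the run repeatable

# Twenty "students", each generating 100 random numbers.
samples = [percent_yes(100, rng) for _ in range(20)]
inside = sum(1 for p in samples if 20 <= p <= 40)

print(samples)
print(f"{inside} of 20 samples fell between 20% and 40%")
```

Most runs should place at least 19 of the 20 samples inside the interval, which is what "19 times out of 20" describes.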
Activity 7: Smoking
When conducting a survey it is important to ask simple unambiguous questions. It is also
important to select a sample that is representative of the population being surveyed. The
following exercise should demonstrate the importance of sample size.
Ask various students the question, “Do you smoke?” (If there is any confusion about what
you are asking, you could say, “Have you smoked a cigarette in the last 48 hours?”) Record
the number of YES responses and then calculate the percentage of YES responses.
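[Table for recording the number and percentage of YES responses as the sample size increases. The entries that survive in this copy are 16, 25, 30 and 35.]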
a. As the sample size increased, did the variation in percentages increase or decrease?
______________________________
b. According to Statistics Canada, 1996-1997, about 20% of the BC population are
smokers. Are your results close to 20%?
______________________________
c. You sampled 35 college students. Are these students representative of the total
college population? What problems might there be with your sample in terms of it
being representative of the whole population?
__________________________________________________________________
__________________________________________________________________
Unit 8: A statistics project
Now it is your turn. You or your group will select a topic of interest, collect data regarding
the topic, and statistically analyse the results. Choose a topic and then let your instructor
know what it is before presenting your results in Exercise 8.
A few possible topics are listed below. However, feel free to identify one of your own.
2. Number of keys (or credit cards) carried by a person while at the college.
9. Age difference (in months) between oldest and youngest siblings (or between
spouses).
10. Waiting time in line (grocery store or bank) during busy times.
Exercise 8
2. Before you collect the data, try to predict the mean and range of the data.
3. You must collect 30 or more data values. List your data below.
4. Present your data graphically below. Use either a line graph, bar graph or stem and
leaf plot.
5. Describe the shape of your graph in 4. Is it normal, J-shaped, skewed or bimodal?
x̄ =
median =
mode =
range =
σ=
Q1 =
Q3 =
8. Which data value is higher than 90% of the rest of the data?
10. How do the actual mean and range compare with your predicted mean and range?
11. Was the sample you chose biased in any way? Were you biased in the way you
collected the data?
12. If you were to do this project over again, how could you improve it?
Appendix A
The entries in this table represent the area under the normal curve
from 0 to z. Areas for negative values of z are obtained by
symmetry.
\[ z = \frac{x - \bar{x}}{\sigma} \]
[Figure: normal curve with the area from 0 to z shaded.]
Appendix B
Activity 4: Quiz
Speed counts. Work as quickly as possible and do not use a calculator.
1. Add 1 + 3 + 5 + 7 + 9 + 11 + 13 + 15 + 17 + 19
3. Divide 8 by 0.
5. Write down the 5th, 25th and middle two letters of the alphabet.
7. Name four countries in Africa.
9. Use the letters given to form three words across and three words down. (One letter
per square.)
A B E E I R
T N
Answers
Exercise 1
1. The lengths of the bars suggest that men only have a life expectancy of one half or
50% that of women. This is a result of starting the years’ scale at 75 years rather than
at zero years. Actually, men have a life expectancy of 78/81, or 96%, that of women.
2. a. Only people with very strong opinions would bother to phone in.
b. Most people do not respond to mail surveys, so the sample would be quite small.
People who do not read books would most likely not respond. The survey might
not be representative of all the geographical, cultural and economic sectors of
Vancouver.
c. Butler is probably at the very place where the smokers are hanging out.
Exercise 2
1. a. x̄ = $25 866.67, median = $25 000, mode = $25 000, range = $50 000
2. a.
Day    Daily mean    Daily range
1      3             6
2      2.5           1
3      8             8
4      11.5          3
5      11            8
6      11.5          3
7      10            4
b. 8.2°C
c. 15°C
d. 4.7°C
3. a. A b. A c. A d. B
4. Average monthly temperatures for Vancouver over a one-year period (frequency graph)
[Two figures with the same axes. Horizontal axis: month (J to D). Vertical axis: temperature in °C (0 to 24).]
5. Computer course final grades.
4 | 0 2 2
5 | 0 4 6 8
6 | 6 6 8 9
7 | 0 3
8 | 0 4 5 8 9
9 | 3
6. a. 25 students
b. x̄ = 63.5 seconds
median = 63 seconds
Exercise 3
1. a. P50 = 119 and P80 = 138
b. Jill scored at the 82nd percentile since 140 = P82. Jill can type faster than 82% of the
other students.
c. 90 = P12 or 90 is the 12th percentile. 12% or 15 out of 125 failed the test.
d. P20 = 98.5
2. a. Q3 or P75 = 13
b. 13 = P70
c. The discrepancy is a result of the relatively small (n = 70) sample size. As the sample
size increases, these discrepancies become smaller.
Exercise 4
1. a. 10 b. 8 c. 6 d. 6 e. 3.1
f. [Figure: frequency histogram. Horizontal axis: scores (0 to 10). Vertical axis: frequency.]
g. x̄ − σ = 6 − 3.1 = 2.9
x̄ + σ = 6 + 3.1 = 9.1
h. There are 15 scores between 2.9 and 9.1 or, 71% of all the scores lie within one
standard deviation of the mean.
2. a. x̄ + σ = 12.97 and x̄ − σ = 7.03. 8/12, or 66.7%, lie within one standard deviation of the
mean.
b. x̄ + 2σ = 15.94 and x̄ − 2σ = 4.06. All of the scores, or 100%, lie within two standard
deviations of the mean.
Exercise 5
1. a. 9-1 = 8
b. [Figure: frequency histogram with the frequency printed above each bar. Horizontal axis: children in family (0 to 9). Vertical axis: frequency (0 to 24).]
c. x̄ = 3.4
d. σ = 1.7
e. 1.7 and 5.1
f. 47/60 = 78.3%
g. 0.0 and 6.8
h. 56/60 = 93.3%
i. -1.7 and 8.4
j. 59/60 = 98.3%
k. yes
2. a.
Hours    Frequency
0.5      3
1.0      11
1.5      12
2.0      17
2.5      15
3.0      13
3.5      4
b. x̄ = 2.07
c. σ = 0.78
d. 58.7%
e. 96.0%
f. 100%
3. a. [Figure: frequency histogram with the frequency printed above each bar. Vertical axis: frequency (scale up to 26).]
Exercise 6
1. a. 0.4207 b. 0.2257 c. 0.5926 d. 0.3124 e. 0.0668
7. a. σ = (x − x̄)/z = (115 − 104)/1.88 = 5.85 b. 0.84% c. 0.31%
d. 75.17%
Exercise 7
1. a. There are “undecided” voters.
iii. Yes. Even at Martin’s worst and Wilson’s best, Wilson cannot overtake Martin.
2. a. 57/183 = 31.1% b. 100/√183 = 7.4% c. between 23.7% and 38.5%
3. 2.5 = 100/√n or n = 1600
4. 833/1240 = 67.2% and 100/√1240 = 2.8%. 67.2% + 2.8% = 70% and 67.2% − 2.8% = 64.4%.
Rounding, the 95% confidence interval is 64% to 70%.
5. No. The results of polls are usually only valid for the day they are taken.
7. a. 100/√100 = 10% c. 20% to 40% b. and d. Check with your instructor.
Appendix B - Quiz
1. 100
2. all the months have 28 days
3. meaningless – 8 can’t be divided by 0
4. 6.23
5. E, Y, M and N
6. 6.5952
7. Northern Africa: Algeria, Egypt, Libya, Morocco, Sudan, Tunisia
Western Africa: Benin, Burkina Faso, Cape Verde, Cote d’Ivoire, Gambia, Ghana, Guinea, Guinea-Bissau, Liberia, Mali, Mauritania, Niger, Nigeria, Senegal, Sierra Leone, Togo
Eastern Africa: Burundi, Comoros, Djibouti, Ethiopia, Kenya, Madagascar, Malawi, Mauritius, Mozambique, Reunion, Rwanda, Somalia, Tanzania, Uganda, Zambia, Zimbabwe
Middle Africa: Angola, Cameroon, Central African Republic, Chad, Congo, Equatorial Guinea, Gabon, Zaire
Southern Africa: Botswana, Lesotho, Namibia, South Africa, Swaziland
8. 2.5
9.
T I N
A R E
B E T