Adult Basic Education

Advanced Level

MATHEMATICS

Data Analysis

Ministry of Advanced Education, Training and Technology

Prepared by
Paul Grinder, Okanagan University College
with
Pat Corbett-Labatt, North Island College
Bob Darling, Malaspina University-College
Peter Robbins, Kwantlen University College
Ada Sarsiat, Northwest Community College
for the
Province of British Columbia
Ministry of Advanced Education, Training and Technology
and the
Centre for Curriculum, Transfer and Technology
© 2000-2020 Province of British Columbia, Ministry of
Advanced Education, Skills & Training

Republished by BCcampus with permission.


Victoria, B.C.

Data Analysis by Paul Grinder is released under a Creative Commons Attribution 4.0
International Licence, except where otherwise noted.

The CC licence permits you to retain, reuse, copy, redistribute, and revise this book, in
whole or in part, for free, provided the authors are attributed as follows:

Data Analysis by Paul Grinder is under a CC BY 4.0 Licence.

If you redistribute all or part of this book, it is recommended the following statement be
added to the copyright page so readers can access the original book at no cost:

Download for free from the B.C. Open Textbook Collection: https://open.bccampus.ca

This textbook can be referenced. In APA citation style, it would appear as follows:

Grinder, P. (2020). Data analysis. BCcampus.

Visit BCcampus Open Education to learn about open education in British Columbia.
Contents
Learning outcomes                                             ii
Glossary                                                      iii
Unit 1: The uses and abuses of statistics                     1
Unit 2: Introduction: Mean, median, mode, range and graphs    5
Unit 3: Measures of position: quartiles and percentiles       16
Unit 4: The standard deviation                                26
Unit 5: The normal distribution                               33
Unit 6: The normal curve                                      44
Unit 7: Analysing survey data                                 55
Unit 8: A statistics project                                  62
Appendix A                                                    66
Appendix B                                                    67
Answers                                                       69
Learning outcomes

The word statistics is derived from the Latin word status which means “state”. Governments
were the first to use statistics. They used statistics to collect and interpret data about their
countries. Today, statistics are used in almost every major field of study.

Upon completion of this Module, you should be able to:

• explain the uses and misuses of statistics


• demonstrate an understanding of mean, median, mode, range, quartiles, percentiles,
standard deviation, the normal curve, z scores, sampling error and confidence intervals
• graphically present data in the form of frequency tables, line graphs, bar graphs and stem
and leaf plots
• design and conduct a statistics project, analyze the data and communicate your
observations about the data

Procedure for independent study


1. Read each of the units in order and complete all of the exercises. If you need
assistance, contact your instructor.

2. Complete the Activity Exercises wherever possible.

3. Study the terminology in the Glossary to become familiar with the definitions.

4. If recommended by your instructor, complete additional problem sets.

5. Complete the Project for this Module.

Glossary
Bar graph
A graph that uses side by side bars of different lengths to represent ranked data.

Confidence interval
The interval in which a statistic will likely fall, a certain percent of the time, after repeated
experimentation.

Data
The information collected for statistical analysis.

Deviation
The difference between one data value and the mean.

Frequency
The number of times that a particular value occurs in a set of data.

Frequency graph
Sometimes called a broken line graph. A graph with a horizontal axis representing data
values and a vertical axis representing frequency values.

Frequency histogram
Also known as a bar graph.

Measures of central tendency
Statistics that describe where the data is centred. The mean, median and mode are measures
of central tendency.

Measures of position
Statistics that describe how one data value compares to another. Percentiles, quartiles and z
scores are measures of position.

Measures of variation
Statistics that describe how the data is spread out or dispersed. The range, deviation and
standard deviation are measures of variation.

Mean
The average. The mean is obtained by finding the sum of the data values and dividing by the
number of data values.

Median
The middle value, or the average of the two middle values, of a set of ranked data.

Mode
The data value that occurs most frequently.

Normal curve
Also called a bell curve. Data that is distributed symmetrically about the mean so that most
of the data is close to the mean.

Normal distribution
A distribution that takes the shape of a normal curve when graphed. Approximately 68% of
the data values will fall within one standard deviation of the mean, 95.5% will fall within two
standard deviations of the mean and 99.7% of the data will fall within three standard
deviations of the mean.

Percentile
One of the 100 values that divide a set of ranked data into 100 equal intervals. The 48th
percentile is a value that is higher than 48% of all the data values.

Population
A large group from which samples are taken for statistical analysis.

Quartile
One of four values that divide a set of ranked data into four equal intervals. The first quartile
is equal to the 25th percentile.

Random
A value is random if it has the same chance of occurring as any other value from the same set.

Random sample
A sample that has the same probability of being chosen as any other sample of the same size.

Range
The difference between the largest data value and the smallest data value.

Ranked data
Data that is listed from highest to lowest or lowest to highest.

Sample
A small set of data chosen from a larger set of data.

Sampling error
The amount of error associated with a calculated value as determined by the size of the
sample.

Standard deviation
The square root of the average squared deviation of a set of data.

Statistic
A value calculated from a set of data. The mean and z scores are statistics.

Statistics
A branch of mathematics that collects, organizes and analyzes data.

Stem and leaf plot
A table of data values where the last digits of data values (leaves) are strung out behind their
first digits (or stem values).

Survey
Information derived from a sampling of a certain population.

Tally
A method of counting data using “tic” marks.

Yes population
A 40% yes population is one that has responded yes to a particular question 40% of the time.

z score
Also known as a standard score. The value obtained by dividing the deviation by the standard
deviation.

Unit 1: The uses and abuses of statistics

The word statistics has two meanings. A statistic is a numerical measurement describing
some characteristic of a set of data. For example, a statistic like 290 pounds could be used to
describe the average or mean weight of a football team. Statistics is also a collection of
methods for planning experiments, collecting data, analyzing the data and drawing
conclusions.

The uses of statistics


It is hard to read a magazine or newspaper without coming across some statistical survey or
analysis. Sportscasts, TV documentaries and newscasts also have their share of statistics. The
uses of statistics include applications in business, sports, medicine, agriculture, psychology,
sociology, education and political science. Governments use statistics to monitor everything
from life style preferences to crime rates. New drugs are statistically analyzed to determine
their effectiveness on patients. The statistical technique of random selection is employed to
help ensure that a small sample of a larger population group is an unbiased
representation of the whole population. Statistics, such as plus-minus records, can even be
used to determine whether a certain hockey player should be given more or less ice time.

The abuses of statistics


Just as statistics can be used to provide a solid quantitative analysis of a set of data, statistics
can be misused to distort data. The abuse of statistics is what Benjamin Disraeli (nineteenth
century British prime minister) was referring to when he made the famous comment, “There
are three kinds of lies – lies, damned lies and statistics.”

Statistics can be used to misrepresent a situation. Suppose a small store employs 6 people
who earn an average, or mean, wage of $8.50 per hour, as calculated below:

\[ \frac{\$8 + \$8 + \$8 + \$8 + \$9 + \$10}{6} = \$8.50 \]

Now suppose the store owner, who earns $40 per hour, includes his or her own wage in the
calculation:

\[ \frac{\$8 + \$8 + \$8 + \$8 + \$9 + \$10 + \$40}{7} = \$13 \]

If the store owner reports that the average wage earned at the store is $13, he or she is
misrepresenting the situation since the store owner is the only person making $13 per hour or
more.
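
The wage calculation can also be checked with a short program. The following is only a sketch in Python, which the module itself does not use; the variable names are illustrative assumptions. It also prints the median (defined in the Glossary) to show that the middle wage barely moves when the owner is included.

```python
from statistics import mean, median

# Hourly wages of the six employees, taken from the example above.
employee_wages = [8, 8, 8, 8, 9, 10]
owner_wage = 40  # the store owner's hourly wage

mean_without_owner = mean(employee_wages)
mean_with_owner = mean(employee_wages + [owner_wage])
median_with_owner = median(employee_wages + [owner_wage])

print("Mean wage, employees only:", mean_without_owner)
print("Mean wage, owner included:", mean_with_owner)
print("Median wage, owner included:", median_with_owner)
```

One very large value pulls the mean from $8.50 up to $13, while the median stays at $8.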

Another source of deceptive statistics results from the faulty collection of data. Companies
that conduct public opinion polls have to be extremely careful that they survey a large
enough sample of the population and also an unbiased segment of the population. For
example, suppose a poll was conducted in BC to determine whether a luxury tax should be
imposed on buyers of new pick-up trucks. The citizens of Prince George might respond quite
differently to the poll than the residents of Victoria. The poll could be quite biased if it was
only conducted in Victoria, or only conducted in Prince George.

Statistical graphs can be presented in a deceptive manner. Consider the two bar graphs
below depicting the same data.

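[Figure: two bar graphs of the same data, both titled "Hours spent per week watching TV".
Vertical axis: hours watching TV. Horizontal axis: Men, Women. The vertical scale of the
first graph does not start at 0; the vertical scale of the second graph does.]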

Without a close inspection of the vertical scale, the first bar graph creates the impression that
men watch twice as much TV as women do. In the second graph, the vertical scale starts at 0,
and the lengths of the bars are proportional to the actual hours of TV watching.

The above examples illustrate only a few of the abuses of statistics. To avoid the “lies and
damned lies”, every step of the statistical process must be scrupulously carried out, from the
collection of the data, to the calculation of a statistic, to the presentation of conclusions.

Now complete Exercise 1 and check your answers.

Exercise 1

1. Why is the following bar graph misleading?


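[Figure: horizontal bar graph titled "Life expectancy from birth" with one bar for Women
and one bar for Men. The only values marked on the scale are 75 and 80.]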

2. What factor or factors might cause the following surveys to be biased?

a. TV news watchers are asked to phone in their opinion on whether marijuana
smoking should be legalized.

b. A questionnaire asking family members to list the number of books they read in
the last year is mailed to 1000 homes in the city of Vancouver.

c. To determine how many college students are smokers, Butler asks the first 20
students he sees standing outside the main entrance to the college, “Are you a
smoker?”

Answers are on page 69.

Activity 1: Watching TV

Ask every student in the room to write down, on a small piece of paper, an estimate of the
number of minutes they spent watching TV yesterday. Collect the data (pieces of paper) in
some sort of container.

1. a. Draw one piece of paper and record the number.

b. Do you think that this one piece of data is a good representation of the actual (yet
to be calculated) average?

2. Replace the first piece of paper and draw two pieces of data. Find the mean of these
two.

3. Replace the two pieces of data and now draw four pieces of paper. What is the
average time spent watching TV based on just these four pieces of data?

4. Replace the four pieces of paper and draw one half (or one half plus one) of the data.
Find the average time for one half the data.

5. Now find the mean using all the data.

a. How do the previous calculations of the mean, using smaller samples of the total
data, compare to the actual mean?

b. Some of the students may have recorded 0 minutes for the time they spent
watching TV yesterday. How did these zeros affect the mean time?

c. Now calculate the mean for only those students who actually watched some TV
yesterday.

Unit 2: Introduction: Mean, median, mode, range
and graphs

Statistics is the science of collecting, classifying, presenting and interpreting numerical data.
The data are numbers or measurements collected by a statistician. For example, the data
below are scores obtained by 12 students on a math quiz out of 40 marks.

32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36

In order to statistically describe the above data, we might ask the following questions.

1. What is the average, or mean, score?

2. What is the middle, or median, score?

3. What score occurs most often, or what is the mode?

4. What is the difference between the highest and lowest score, or what is the range?

5. How can the data be represented graphically, with a line graph, bar graph or stem
and leaf plot?

The mean, median, mode, and range are four statistics which can be used to describe a set of
data. The mean, median, and mode are called measures of central tendency because they tell
us where the data is centered. The range is a measure of variation because it tells us how
much the data is spread out.

The mean is the most important measure of central tendency. It is calculated as follows:

The mean is the sum of all the data values divided by the number of data values, or

\[ \bar{x} = \frac{\Sigma x}{n} \]

where x̄ is the mean, x is a data value, and n is the number of data values.

The symbol “Σ” is the Greek letter “sigma” and means “the sum of all”. Here “Σx” means
the sum of all x (or data) values.

Example 1

Find the mean score of the following 12 math test scores;

32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32 and 36.

Solution

Using the formula, (note that n = 12),

\[ \bar{x} = \frac{\Sigma x}{n} = \frac{32 + 39 + 32 + 27 + 30 + 34 + 32 + 35 + 40 + 36 + 32 + 36}{12} = \frac{405}{12} = 33.75 \]

The mean score is 33.75.

The median is the middle value when the data is arranged from highest to
lowest. If there are two middle values, then the median is the mean of these two
values.

Example 2
Find the median score of the above 12 math scores.

Solution

Arrange the data from the highest to lowest.

40, 39, 36, 36, 35, 34, 32, 32, 32, 32, 30, 27

Note that because there are an even number (twelve) of data values, we have two middle
values, 34 and 32 (the 6th and 7th values). The mean of these two values is

\[ \frac{34 + 32}{2} = \frac{66}{2} = 33 \]

Hence the median math score is 33.

The mode is the data value which occurs most often (or with the greatest
frequency). The mode may not exist.

Example 3
Determine the mode of the above twelve math scores.

Solution

Again, arrange the data from highest to lowest.

40 35 32
39 34 32
36 32 30
36 32 27

Notice that the score of 32 occurs most often. Hence the modal score is 32.

The range is the difference between the largest data value and the smallest data
value.

Example 4
Determine the range of the 12 math scores above.

Solution

The data is,

40 35 32
39 34 32
36 32 30
36 32 27

The highest value is 40. The lowest is 27. The range is,

40 – 27 = 13.
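
The four statistics from Examples 1 to 4 can be reproduced in a few lines. This is a minimal sketch in Python using the standard `statistics` module, not a method taken from the module; the variable names are illustrative assumptions.

```python
from statistics import mean, median, multimode

# The 12 math quiz scores from Examples 1 to 4.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

mean_score = mean(scores)                 # sum of the values divided by n
median_score = median(scores)             # mean of the two middle ranked values
modes = multimode(scores)                 # list of the most frequent value(s)
score_range = max(scores) - min(scores)   # largest value minus smallest value

print("mean:", mean_score)
print("median:", median_score)
print("mode(s):", modes)
print("range:", score_range)
```

`multimode` returns a list because a data set can have more than one most frequent value.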

The line graph, also known as a frequency distribution, plots data values
(horizontal axis) against the frequency of those data values (vertical axis).

Example 5
Plot the frequency distribution of the 12 math scores.

Solution

Label each axis. Plot the points and connect them with straight lines.

[Figure: line graph titled "Frequency distribution of math scores". Horizontal axis: Math
Scores, labelled 25 to 39. Vertical axis: Frequency, labelled 0 to 4.]
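
The points plotted in a frequency distribution come from a tally of how often each score occurs. The sketch below, written in Python as an illustration only (the names are assumptions, not part of the module), builds that tally for the 12 math scores.

```python
from collections import Counter

# The 12 math quiz scores used throughout this unit.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

# Counter maps each distinct score to the number of times it occurs.
frequency = Counter(scores)

# Print the (score, frequency) pairs in order along the horizontal axis.
for score in sorted(frequency):
    print(score, frequency[score])
```

Each printed pair is one point on the line graph: the score on the horizontal axis and its frequency on the vertical axis.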

Bar graphs, or frequency histograms, use bars to represent frequencies for
certain data intervals.

Example 6
Prepare a frequency histogram for the 12 math scores.

27, 30, 32, 32, 32, 32, 34, 35, 36, 36, 39 and 40.

Solution

The range of scores is 13. If we want 5 bars, each data interval should be 3 scores wide.

Interval    Scores                 Frequency
38 – 40     39, 40                 2 scores
35 – 37     35, 36, 36             3 scores
32 – 34     32, 32, 32, 32, 34     5 scores
29 – 31     30                     1 score
26 – 28     27                     1 score
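
Sorting scores into intervals can be sketched as follows. This Python example is an illustration only; the interval width of 3 and the starting value of 26 come from the table above, while the variable names are assumptions.

```python
# The 12 math quiz scores used throughout this unit.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

interval_width = 3   # each bar covers 3 scores
lowest_start = 26    # the first interval is 26 to 28

# Count how many scores fall inside each interval (both ends included).
interval_counts = {}
for start in range(lowest_start, max(scores) + 1, interval_width):
    end = start + interval_width - 1
    interval_counts[(start, end)] = sum(start <= s <= end for s in scores)

# Print a rough text version of the histogram, one '#' per score.
for (start, end), count in interval_counts.items():
    print(f"{start}-{end}: {'#' * count} ({count})")
```

The counts match the Frequency column of the table, and the row of `#` marks plays the role of each bar.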

[Figure: bar graph titled "Frequency of math scores". Horizontal axis: Math scores, in the
intervals 26-28, 29-31, 32-34, 35-37 and 38-40. Vertical axis: Frequency.]

A stem and leaf plot is similar to a bar graph, except the bars are replaced by
digits. These leaf digits are the last digits of data having the same first digit(s),
called the stem.

Example 7
Construct a stem and leaf plot for the following data (minutes taken to run 15 km).

49, 52, 53, 53, 57, 58, 60, 63, 64, 66, 66, 66, 69, 70, 70, 72, 75, 77, 77, 79, 83, 88, 89,
94, 106

Solution

The stems are the first digits of the data numbers (4, 5, 6, 7, 8, 9 and 10). The leaf digits are
strung out beside their stems as displayed below.

Time (min.) taken to run 15 km


4 9
5 2 3 3 7 8
6 0 3 4 6 6 6 9
7 0 0 2 5 7 7 9
8 3 8 9
9 4
10 6
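
A stem and leaf plot can be built by splitting each value into its last digit (the leaf) and the digits before it (the stem). The following Python sketch is an illustration under that assumption; the names are hypothetical and not part of the module.

```python
from collections import defaultdict

# Minutes taken to run 15 km, from Example 7.
times = [49, 52, 53, 53, 57, 58, 60, 63, 64, 66, 66, 66, 69, 70, 70, 72, 75,
         77, 77, 79, 83, 88, 89, 94, 106]

# Group the leaves under their stems; divmod(106, 10) gives stem 10, leaf 6.
stems = defaultdict(list)
for value in sorted(times):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

# Print one row per stem with its leaves strung out beside it.
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>2} | {leaves}")
```

Sorting the data first keeps the leaves on each row in increasing order, as in the plot above.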

Now complete Exercise 2 and check your answers.

Exercise 2

1. A small company employs 15 workers. Their annual salaries are as follows:

$15 000    $25 000    $25 000
 17 000     25 000     25 000
 17 000     25 000     25 000
 22 000     25 000     30 000
 22 000     25 000     65 000

a. Determine the mean, median, mode, and range of the above data.

x̄ = _________ median = __________ mode = __________

range = _________

b. Which statistic (mean, median, mode, or range) best describes the annual salary
most of the workers receive?
_______________________________

c. Which statistic best describes the gap that exists between annual salaries?

_______________________________

2. a. Find the daily mean temperature and daily range for the temperatures below.

DAY    HIGH (°C)    LOW (°C)    DAILY MEAN    DAILY RANGE
1      6            0
2      3            2
3      12           4
4      13           10
5      15           7
6      13           10
7      12           8

b. Determine the mean of the daily means.


_______________

c. Determine the total range of temperature over this 7 day period.


_______________

d. Determine the mean range over this 7 day period.
_______________

3. Below are the statistics for a final exam given to two different math classes.
(The exam was worth 100 marks.)

CLASS A (n = 25)    CLASS B (n = 25)
x̄ = 80              x̄ = 72
range = 40          range = 10

a. Which class (overall) seemed to do better on the exam?


_______________

b. Which class probably had the student with the highest mark?
_______________

c. Which class probably had the student with the lowest mark?
_______________

d. Which class had students with similar abilities? Explain.


_______________

4. The following data represents the average monthly temperature for Vancouver over a
one year period.

Month                       J   F   M   A   M   J   J   A   S   O   N   D
Average temperature (°C)    4   7   10  12  15  20  22  19  14  8   3   0

Draw a frequency distribution graph and a histogram on the following page.

[Two blank graph grids for your answers, each titled "Average monthly temperatures for
Vancouver over a one-year period". Vertical axis: Temperature (degrees C), marked from 0 to
24 in steps of 2. Horizontal axis: Month, J to D.]

5. The following data represents final grades for a computer course. Construct a stem and
leaf plot for the data.

40%, 42%, 42%, 50%, 54%, 56%, 58%, 66%, 66%, 68%, 69%, 70%, 73%, 80%,
84%, 85%, 88%, 89%, 93%

6. The following stem and leaf plot represents the time in seconds taken to type 60 words
by a class of business students.

4 4 6 8
5 0 1 6 8 8 9
6 0 0 2 3 3 5 7 7 7
7 1 3 3 4 4
8 4
9 5

a. How many students were tested?

b. Find the mean and median for this data.

x̄ = _________ median = _________

7. Jody wants to receive a grade of 80% for the laboratory part of her Chemistry course.
So far, she has a 78% average on her last 5 labs. What grade does she need on her
sixth lab to earn an 80% average?

8. Neil has 182 out of 260 marks thus far in his math course. If the final exam is worth
100 marks, what mark does Neil need on the final exam to earn a final grade of 75%
for the course? How do you feel about Neil’s chances, and why?

Answers are on page 69.

Activity 2: Shoe size

Ask at least 15 male students and 15 female students to write their shoe size on a piece of
paper. Include M for male and F for female as well, on the slip of paper.

1. Below, arrange the shoe sizes in order from smallest to largest.

MALE FEMALE

2. Determine the following.

MALE FEMALE
mean
median
mode
range

3. Imagine you are a shoe manufacturer.

a. Why would it be important to know the range of shoe sizes for males and
females?

b. Why would it be important to know the modal shoe size for males and females?

c. Why would knowing the mean and median shoe sizes be important?

4. Organize the data into bar graphs.

[Two blank bar graph grids for your answers, titled "Men’s shoe sizes" and "Women’s shoe
sizes". Vertical axis: Frequency, marked from 0 to 10. Horizontal axis: Shoe size.]

Unit 3: Measures of position: quartiles and
percentiles

The mean, median and mode are measures of central tendency. These statistics tell us where
the data is centered. Quartiles and percentiles are measures of position. Quartiles and
percentiles can be used to compare one particular data value to all the rest of the data. These
statistics enable us to answer questions such as, “Is a certain value unusually high, unusually
low or just average?”

Quartiles divide ranked data into four equal parts. Ranked data is data arranged
from highest to lowest, or lowest to highest. The quartile values are denoted as
Q1, Q2 and Q3. Q1 separates the bottom 25% of the data from the top 75%. Q2 is
the same as the median and separates the top 50% from the bottom 50% of the
data. Q3 separates the top 25% of the data from the bottom 75%.

Example 1
Find Q1, Q2 and Q3 for the following set of data (resting heart rates of 30 college males).

55 94 80 68 78 61
60 55 88 60 70 70
70 60 86 42 65 74
72 68 80 100 58 84
81 72 71 85 57 96

Solution

Rank the data from lowest to highest.

42 60 68 71 80 86
55 60 68 72 80 88
55 60 70 72 81 94
57 61 70 74 84 96
58 65 70 78 85 100

Find Q2 or the median, first. Since there are 30 values, there are two middle values, 70 and
71.

\[ Q_2 = \frac{70 + 71}{2} = 70.5 \]

To find Q1, find the middle value of the bottom 50% of the data. The bottom half of the data
has 15 values and ranges from 42 to 70. Counting to the 8th value,

Q1 = 60

Q3 can be found in a similar fashion. Q3 is the middle value of the upper 50% of the data or
the 8th value from 71 to 100.

Q3 = 81

Finding Q1, Q2 and Q3 in the above example was simply a matter of ranking and counting.
But there was an even number (n = 30) of data values. When there is an odd number of data
values, a different method of calculating Q1 and Q3 will have to be used.
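
The ranking and counting in Example 1 can be sketched in Python as follows. This is an illustration only and assumes, as in the example, an even number of data values; it is not the different method mentioned above for an odd number of values. The variable names are assumptions.

```python
from statistics import median

# Resting heart rates of 30 college males, as listed in Example 1.
heart_rates = [55, 94, 80, 68, 78, 61,
               60, 55, 88, 60, 70, 70,
               70, 60, 86, 42, 65, 74,
               72, 68, 80, 100, 58, 84,
               81, 72, 71, 85, 57, 96]

ranked = sorted(heart_rates)      # rank the data from lowest to highest
half = len(ranked) // 2           # 15 values in each half when n = 30

q2 = median(ranked)               # the median of all the data
q1 = median(ranked[:half])        # middle value of the bottom half
q3 = median(ranked[half:])        # middle value of the top half

print("Q1 =", q1)
print("Q2 =", q2)
print("Q3 =", q3)
```

Each half has 15 values, so its median is simply the 8th value of that half, which is the counting step used in the solution.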

The three quartiles Q1, Q2 and Q3 divide the ranked data into 4 equal parts. There are 99
percentiles P1, P2, P3, …, P99 that divide the ranked data into 100 equal parts. For example,
P80 is called the “80th percentile” and P80 is a value that is higher than 80% of the rest of the
data values. P10 is a value that is higher than 10% of the data values.

Each data value, x, corresponds to a particular percentile, Pk, where k is given
by the expression,

\[ k = \frac{\text{number of data values less than } x}{n} \times 100\% \]

where n is the total number of data values.

Example 2
Below are the number of chin ups completed in one minute by 70 male college students. The
data is ranked lowest to highest.

0 3 7 8 10 13 20
0 3 7 9 10 13 20
1 4 7 9 10 13 22
1 4 7 9 10 14 22
2 6 7 9 11 14 23
2 6 7 9 11 15 25
2 6 7 9 11 15 28
3 6 8 10 11 15 30
3 6 8 10 12 18 30
3 7 8 10 13 20 33

Find the percentiles associated with the data values 0, 7 and 25.

Solution

a. For data value 0,

\[ k = \frac{0}{70} \times 100\% = 0\% \]

So, 0 is the 0th percentile or P0 = 0. This makes sense because 0 is not higher than any
other value.

b. There are 19 data values that are less than the value 7. For data value 7,

\[ k = \frac{19}{70} \times 100\% \approx 27\% \text{ (rounded)} \]

So, 7 is the 27th percentile or P27 = 7.

c. There are 65 data values less than 25. For data value 25,

\[ k = \frac{65}{70} \times 100\% \approx 93\% \text{ (rounded)} \]

So, P93 = 25. The person who did 25 chin ups in one minute did better than 93% of the
other college men.
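
The percentile calculation in Example 2 amounts to counting the values below x and dividing by n. The following Python sketch illustrates this; the function name `percentile_rank` is a hypothetical helper, not something defined in the module.

```python
# Chin ups completed in one minute by 70 students, entered row by row
# from the table in Example 2 (the order does not matter for counting).
chin_ups = [0, 3, 7, 8, 10, 13, 20,
            0, 3, 7, 9, 10, 13, 20,
            1, 4, 7, 9, 10, 13, 22,
            1, 4, 7, 9, 10, 14, 22,
            2, 6, 7, 9, 11, 14, 23,
            2, 6, 7, 9, 11, 15, 25,
            2, 6, 7, 9, 11, 15, 28,
            3, 6, 8, 10, 11, 15, 30,
            3, 6, 8, 10, 12, 18, 30,
            3, 7, 8, 10, 13, 20, 33]


def percentile_rank(data, x):
    """Return k, the rounded percent of data values that are less than x."""
    values_below = sum(1 for value in data if value < x)
    return round(values_below / len(data) * 100)


for x in (0, 7, 25):
    print(f"{x} chin ups is at percentile P{percentile_rank(chin_ups, x)}")
```

The comparison is strictly "less than", so repeated values such as the many 7s are not counted as being below 7.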

The reverse procedure, finding what data value corresponds to a given percentile, is rather
involved.

To find the data value associated with a certain percentile, Pk, follow the steps
below:

Step 1 Rank the data from lowest to highest.

Step 2 Calculate

\[ C = \left( \frac{k}{100} \right) n \]

where k is the percentile in question and n is the number of data values.

Step 3 If C is a whole number, then

\[ P_k = \frac{C\text{th data value} + (C + 1)\text{th data value}}{2} \]

Step 4 If C is not a whole number, round C up to the next larger whole number
and Pk = the Cth (rounded up) data value, counting from the lowest
value.

Example 3
The following stem and leaf plot depicts the number of words typed in 2 minutes by 125
office administration students.

7 3 7 8 9 9
8 0 1 2 3 5 5 5 8 8 9
9 0 0 0 2 3 3 7 8 8 8 9 9
10 0 0 0 1 1 1 1 1 3 3 7 8 8 8 8 8
11 0 1 1 2 2 2 2 3 4 4 4 5 5 7 7 7 7 8 8 9 9 9
12 0 0 1 1 1 1 1 5 5 5 6 6 6 6 6 7 7 8 9
13 0 0 0 0 0 2 3 3 3 6 6 6 6 7 7 8 8 9
14 0 0 1 1 1 1 3 3 4 4 6 7 8 8 8
15 0 0 3 8 9 9
16 1 7

Find P40, P25 and P95.

Solution

The data is already ranked.

a. To find P40, calculate C with k = 40 and n = 125,

\[ C = \left( \frac{40}{100} \right) 125 = 50 \]

Since C = 50 is a whole number, calculate

\[ P_{40} = \frac{C\text{th data value} + (C + 1)\text{th data value}}{2} \]

where the C = 50th data value (from the stem and leaf plot) is 112 and the C + 1 = 51st data
value is 113.

\[ P_{40} = \frac{112 + 113}{2} = 112.5 \]

b. Finding P25 is the same as finding Q1, the first quartile.

Calculate C with k = 25 and n = 125,

\[ C = \left( \frac{25}{100} \right) 125 = 31.25 \]

C = 31.25 is not a whole number.

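Round C = 31.25 up to 32, so P25 is the 32nd data value counting from the lowest value. The
first three stems hold 5 + 10 + 12 = 27 values, so the 32nd value is the 5th leaf on stem 10.

P25 = 101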

c. To find P95, calculate

\[ C = \left( \frac{95}{100} \right) 125 = 118.75 \]

Round C up to 119.

The 119th score is 150.

So, P95 = 150.
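
Steps 1 to 4 can be sketched in Python as shown below. This is an illustration only: `percentile_value` is a hypothetical helper, the stem and leaf plot is typed in as strings of leaf digits, and k is assumed to be a whole number from 1 to 99. The sketch checks one whole-number case (P40) and one rounded-up case (P95).

```python
# Stem and leaf plot from Example 3: each stem maps to its leaf digits.
stem_and_leaf = {
    7: "37899",
    8: "0123555889",
    9: "000233788899",
    10: "0001111133788888",
    11: "0112222344455777788999",
    12: "0011111555666667789",
    13: "000002333666677889",
    14: "001111334467888",
    15: "003899",
    16: "17",
}

# Step 1: rebuild the data values and rank them from lowest to highest.
ranked = sorted(stem * 10 + int(leaf)
                for stem, leaves in stem_and_leaf.items()
                for leaf in leaves)


def percentile_value(ranked_data, k):
    """Return the data value at percentile Pk, following Steps 2 to 4."""
    # Step 2: C = (k / 100) * n, kept as whole part and remainder so that
    # the whole-number test does not depend on decimal rounding.
    whole, remainder = divmod(k * len(ranked_data), 100)
    if remainder == 0:
        # Step 3: average the Cth and (C + 1)th values (lists count from 0).
        return (ranked_data[whole - 1] + ranked_data[whole]) / 2
    # Step 4: round C up and take that value, counting from the lowest.
    return ranked_data[whole]


print("n =", len(ranked))
print("P40 =", percentile_value(ranked, 40))
print("P95 =", percentile_value(ranked, 95))
```

In Step 4 the rounded-up position is `whole + 1`, which is index `whole` in a list that counts from zero.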

Now complete Exercise 3 and check your answers.

Exercise 3

1. The following stem and leaf plot depicts the number of words typed in 2 minutes by
125 office administration students.

7 3 7 8 9 9
8 0 1 2 3 5 5 5 8 8 9
9 0 0 0 2 3 3 7 8 8 8 9 9
10 0 0 0 1 1 1 1 1 3 3 7 8 8 8 8 8
11 0 1 1 2 2 2 2 3 4 4 4 5 5 7 7 7 7 8 8 9 9 9
12 0 0 1 1 1 1 1 5 5 5 6 6 6 6 6 7 7 8 9
13 0 0 0 0 0 2 3 3 3 6 6 6 6 7 7 8 8 9
14 0 0 1 1 1 1 3 3 4 4 6 7 8 8 8
15 0 0 3 8 9 9
16 1 7

a. Find P50 (the median) and P80.

b. Jill can type 140 words in 2 minutes. What percentile score is this? Jill can type
faster than what percent of the other 124 students?

c. A student needs to type 90 words in 2 minutes in order to pass the test. What
percentile is associated with this value? How many students failed the test?

d. The instructor decided that those students who achieved less than the 20th
percentile would have to retake the test. What data value is represented by P20?

2. Below are the number of chin ups completed in one minute by 70 male college
students. The data is ranked lowest to highest.

0 3 7 8 10 13 20
0 3 7 9 10 13 20
1 4 7 9 10 13 22
1 4 7 9 10 14 22
2 6 7 9 11 14 23
2 6 7 9 11 15 25
2 6 7 9 11 15 28
3 6 8 10 11 15 30
3 6 8 10 12 18 30
3 7 8 10 13 20 33

a. Find Q3 (or P75) for the above data.

b. Now find the percentile associated with 13 chin ups.

c. In a. you found that P75 = 13 but when the process was reversed, in b. you found
that 13 = P70. Explain this discrepancy.

Answers are on page 69.

Activity 3: Mutual funds

Work in groups of 3 or 4.

1. On the following page, thirteen Canadian “Asia ex-Japan” mutual funds are listed
with their current value and percentage gains over 1 day, 1 week, 30 days and 1 year.
On page 66 in the spaces provided, rank each fund from best to worst based on their
percentage gain for the given time interval. For example, for the “1 day %” ranking
column, Fund B would rank 1st and Fund H would rank last or 13th.

2. a) Which, if any, funds always ranked in the top half of the group (above Q2) in all
four categories? (These would be the best funds with low ranks.)

b) Are there any fund(s) that ranked above Q3 in all categories? Which one(s)?

3. a) Find the sum of the four rankings for each fund in the last column.

b) What does a “low” sum indicate?

c) What does a “high” sum indicate?

4. a) Is there any fund that has a consistently high ranking? Which one?

b) Which fund has the worst performance based on overall ranking?

5. a) Rank what your group thinks are the three best performing “Asia ex-Japan”
mutual funds.

b) Also rank the three worst performing funds.

Activity 3: Asia ex-Japan Canadian mutual funds
As of December 23, 1999

Code  Fund name                         Price $   1 day $ chg  1 day %  Rank  1 week %  Rank  30 day %  Rank  YTD %  Rank  Sum of ranks
A     AGF Asian Growth Class            13.880    .620         4.68     ____  10.86     ____  16.35     ____  56.31  ____  ____
B     AGF Asian Growth Class (US$)      9.420     .440         4.90     ____  11.22     ____  15.72     ____  62.98  ____  ____
C     Clarington Asia Pacific           16.548    .214         1.31     ____  4.88      ____  16.72     ____  67.37  ____  ____
D     Fidelity Far East                 39.630    .340         .87      ____  4.92      ____  11.20     ____  53.72  ____  ____
E     Fidelity Far East (US$)           26.970    .360         1.35     ____  5.39      ____  10.99     ____  60.15  ____  ____
F     First Canadian Far East           10.412    .021         .20      ____  2.21      ____  8.52      ____  33.86  ____  ____
G     Green Line Asian Growth           12.450    .500         4.18     ____  12.77     ____  19.37     ____  82.28  ____  ____
H     Investors Pacific International   8.970     .050         -.55     ____  1.24      ____  4.30      ____  25.88  ____  ____
I     National Bank Far East Equity     10.540    .040         .38      ____  2.73      ____  6.90      ____  27.76  ____  ____
J     Navigator Asia Pacific            10.044    .003         .03      ____  -.21      ____  -.39      ____  19.49  ____  ____
K     Royal Asian Growth                12.463    .088         .71      ____  4.51      ____  5.80      ____  58.24  ____  ____
L     Universal Far East                4.735     .062         1.34     ____  6.45      ____  10.84     ____  50.60  ____  ____
M     Universal Far East (US$)          3.214     .049         1.54     ____  6.81      ____  10.18     ____  56.43  ____  ____

Source: http://globefund.com
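
Ranking one column of the table can be sketched in Python as follows. This is an illustration only, with hypothetical names; it ranks the "1 day %" column and prints just the two ranks already given in the activity instructions, so the rest of the activity is left for you to complete.

```python
# "1 day %" gains from the table, keyed by fund letter code.
one_day_pct = {
    "A": 4.68, "B": 4.90, "C": 1.31, "D": 0.87, "E": 1.35,
    "F": 0.20, "G": 4.18, "H": -0.55, "I": 0.38, "J": 0.03,
    "K": 0.71, "L": 1.34, "M": 1.54,
}

# Sort the fund codes from the largest gain to the smallest gain.
best_to_worst = sorted(one_day_pct, key=one_day_pct.get, reverse=True)

# Rank 1 is the best fund; rank 13 is the worst.
ranks = {fund: position for position, fund in enumerate(best_to_worst, start=1)}

print("Fund B rank:", ranks["B"])
print("Fund H rank:", ranks["H"])
```

Repeating the same steps for the other three percentage columns and adding the four ranks for each fund gives the "Sum of ranks" column.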

Unit 4: The standard deviation

Consider the data in set A and B below.

Set A: 30, 50, 70 Set B: 40, 50, 60

Notice that the mean and median for both sets are exactly the same. Their measures of
central tendency are the same, but the range is very different for both sets. The range for Set
A is 40 and for Set B it is 20. The range is a very simple measure of variation. Set A and Set
B are dispersed, or spread out, quite differently.

One way of measuring the variation or dispersion of specific data values is to calculate the
deviation, x − x̄, where x is a data value and x̄ is the mean. For example, in Set A above, the
amount that 70 deviates from the mean is 20.

A very precise way of measuring the variation of an entire set of data is to calculate the
standard deviation. The standard deviation is a statistic that indicates a kind of average
deviation from the mean of the data. The larger the standard deviation number, the more
spread out the data is. The Greek letter sigma, σ, is the symbol for standard deviation.
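
The comparison of Set A and Set B can be checked with a short sketch. It is written in Python as an illustration only; `pstdev` from the standard `statistics` module computes the population standard deviation (dividing by n), and the variable names are assumptions.

```python
from statistics import mean, median, pstdev

set_a = [30, 50, 70]
set_b = [40, 50, 60]

for name, data in (("Set A", set_a), ("Set B", set_b)):
    data_range = max(data) - min(data)
    print(name,
          "mean:", mean(data),
          "median:", median(data),
          "range:", data_range,
          "standard deviation:", round(pstdev(data), 2))
```

Both sets share the same mean and median, but Set A has the larger range and the larger standard deviation, which matches the statement that a larger standard deviation means more spread out data.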

The standard deviation formula¹ is as follows,

\[ \sigma = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n}} \]

where σ is the standard deviation, x is a data value, x̄ is the mean of the data,
and n is the number of data values.

¹ The formula given above is for finding the standard deviation of a given population. The formula

\[ \sigma_{n-1} = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}} \]

is used to find the standard deviation of a smaller sample of a given population.
Throughout this module, we will use the formula for σ (not σ n−1).

Example 1
Find the mean and standard deviation of the 12 science quiz scores below.

5, 6, 8, 8, 9, 10, 10, 11, 11, 12, 15 and 15

Solution

Find the mean, or $\bar{x}$.

$$\bar{x} = \frac{\sum x}{n} = \frac{5 + 6 + 8 + 8 + 9 + 10 + 10 + 11 + 11 + 12 + 15 + 15}{12} = \frac{120}{12} = 10$$

The following table can be used to determine the standard deviation.

x       $x - \bar{x}$      $(x - \bar{x})^2$

5       5 - 10 = -5        (-5)^2 = 25
6       -4                 16
8       -2                 4
8       -2                 4
9       -1                 1
10      0                  0
10      0                  0
11      1                  1
11      1                  1
12      2                  4
15      5                  25
15      5                  25

                           $\sum (x - \bar{x})^2$ = 106

The standard deviation, or σ, is

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{106}{12}} \approx \sqrt{8.83} \approx 2.97$$

The reason we square the deviations before finding the average is so that positive and
negative deviations do not cancel each other out. Notice in the table above that the sum of
the deviations, $x - \bar{x}$, is zero.
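
The paper-and-pencil steps of Example 1 can also be checked with a short program. The following is a minimal Python sketch, not part of the original module; the function name `population_std` and the variable names are illustrative assumptions. It uses the population formula for σ and, for comparison only, the n - 1 formula from the note, which is what Python's `statistics.stdev` computes.

```python
import math
import statistics

# The 12 science quiz scores from Example 1
scores = [5, 6, 8, 8, 9, 10, 10, 11, 11, 12, 15, 15]


def population_std(values):
    """Population standard deviation: sqrt(sum((x - mean)^2) / n)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)


mean_score = sum(scores) / len(scores)
# The raw deviations cancel out, which is why they are squared first
deviation_sum = sum(x - mean_score for x in scores)
sigma = population_std(scores)
# statistics.stdev divides by n - 1 (the sample formula in the note)
sigma_sample = statistics.stdev(scores)

print(f"mean = {mean_score}, sum of deviations = {deviation_sum}")
print(f"sigma = {sigma:.2f}, sigma (n - 1) = {sigma_sample:.2f}")
```

The sample value is always a little larger than σ because it divides the same sum by n - 1 instead of n.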

The paper and pencil method of calculating the standard deviation can be extremely lengthy,
especially when n, the population size, is large.

Example 2
Calculate the mean and standard deviation of the data in Example 1 using a calculator with a
statistics mode. (As many calculators function differently, please bring your calculator
manual to class.)

Solution

Put your calculator in the statistics mode (if necessary), enter the data and find $\bar{x}$ and σ.

1. To operate in statistics mode, press: .

2. Enter the data (Find your data button. FRQ can be used to input repeated data
values.):

5 DATA (or Σ+ )
6 DATA
8 DATA
8 DATA
9 DATA
10 DATA
10 DATA
11 DATA
11 DATA
12 DATA
15 DATA
15 DATA

3. Find $\bar{x}$ and σ (you may have to press SHIFT or 2nd or some other key).

To find $\bar{x}$, press:
To find σ, press:

4. To return to calculation mode, press:

Now complete Exercise 4 and check your answers.

Exercise 4
1. The following data represents the scores of 21 math students on a quiz marked out of 10.

0, 0, 1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10

a. What is the range? ________

b. What is the mode? ________

c. What is the median? ________

d. What is the mean? ________

e. Calculate the standard deviation. ________

f. Plot a frequency distribution graph for the data on the grid below.

[Blank grid: Frequency (vertical axis) versus Scores (horizontal axis, 0 to 10)]

g. Calculate $\bar{x} - \sigma$ = ________ and $\bar{x} + \sigma$ = __________.

h. How many scores are there between 2.9 and 9.1, or, how many scores lie
within one standard deviation of the mean?

_____________________

i. What percent of the scores is this?

2. a. In Example 1 of this section, the mean of the 12 science quiz scores was 10 and
the standard deviation was 2.97. What percent of the scores lie within one
standard deviation of the mean? The scores were 5, 6, 8, 8, 9, 10, 10, 11, 11, 12,
15, and 15.
___________________

b. What percent of the scores lie within two standard deviations of the mean?

___________________

3. Suppose the standard deviation of a set of numbers is 0. What does this tell you about
the data?
___________________

4. Two classes, each with 100 students, wrote an examination with a possible maximum
score of 100. In the first class the mean score was 75 and the standard deviation was
5. In the second class, the mean score was 70 and the standard deviation was 15.
Which of the two classes do you think had more scores of 85 or better? Why?

__________________

5. The following data represents the weights (in kg) of a small class of students:

78, 42, 72, 88, 86, 97, 91, 79, 82, 86, 91, and 74

a. Calculate the range. __________________

b. Calculate the mean weight. __________________

c. Calculate the standard deviation __________________

d. What percentage of the weights fall within
one standard deviation of the mean weight? __________________

6. It is found that the time taken by a bank teller to serve 7 people is 3, 3, 4, 5, 6, 6, and
7 minutes.

a. Find the mean time. __________________

b. Find the standard deviation. __________________

Answers are on page 69.

Activity 4: Pop quiz

1. Your instructor will ask you to write the quiz in Appendix B (page 67) along with
the rest of the class. You will only have 5 minutes to complete the test. Do not look at
it until your instructor says “Go!”

2. After all the tests are marked out of 20, record the marks below, ranked from highest
to lowest. (See page 75 for the answers.)

3. Calculate the following.

mode =
$\bar{x}$ =
σ=
Q1 =
Q2 =
Q3 =

4. What percent of the scores lie within

a. one standard deviation of the mean?


b. two standard deviations of the mean?
c. three standard deviations of the mean?

Unit 5: The normal distribution

Data can be distributed in quite a variety of ways. Consider the frequency histograms below.

[Six histograms. Top row: Normal or triangular (distribution is symmetrical); Uniform or
rectangular (distribution is symmetrical); Skewed to right. Bottom row: Skewed to left;
J-shaped; Bimodal.]

• Symmetrical: Both sides of this distribution are identical.


• Uniform (rectangular): Every value appears with equal frequency.
• Skewed: One tail is stretched out longer than the other. The direction of skewness is
on the side of the longer tail.
• J-shaped: There is no tail on the side of the class with the highest frequency.
• Bimodal: The two most populous classes are separated by one or more classes. This
situation often implies that two populations are being sampled.

Some examples of the above distributions follow. Imagine that a group of math students were
given a math test out of 40. The difficulty of the test can have quite an influence on the shape
of a frequency distribution.

Case 1 If the test was so easy that most of the students
received a perfect mark, this distribution would be
considered J-shaped.

[Histogram: Frequency (0 to 20) versus Test Scores (0 to 40)]

Case 2 If the test was quite difficult and most of the students
received a mark of less than 50%, this distribution
would be considered skewed to the right.

[Histogram: Frequency (0 to 20) versus Test Scores (0 to 40)]

Case 3 If the test was given to two different math classes, one
of which had not been taught half the material, this
distribution would be bimodal.

[Histogram: Frequency (0 to 20) versus Test Scores (0 to 40)]

Case 4 If the test was “fair”, at least from the instructor’s
point of view, this distribution would be normal.

[Histogram: Frequency (0 to 20) versus Test Scores (0 to 40)]

Case 5 What kind of test would produce a uniform
distribution?

[Blank histogram grid: Frequency (0 to 20) versus Test Scores (0 to 40)]

When a population is measured for some attribute or ability, the most frequently occurring
distribution is the normal distribution. When a product is tested for some characteristic, the
result is most often a normal distribution (see note 2). For example, if a sample population of men (or
women) is tested for physical strength (or blood pressure or intelligence or shoe size), most
of the people will be close to average strength with a small minority either much stronger
than or much weaker than the majority.

In the previous assignment you were often asked, “what percent of the data lie within one
standard deviation of the mean?” Knowing how a population bunches around its mean value
can be quite useful.

Pafnuty Chebyshev (1821-1894) was a Russian mathematician who worked on probability,
the theory of prime numbers, and problems in mechanics.

Chebyshev’s Theorem states that the proportion of any distribution that lies
within k standard deviations of the mean is at least $1 - \left(\frac{1}{k}\right)^2$, where k > 1. This
applies to any distribution of data.

For example, when k = 2, Chebyshev’s Theorem states that
$1 - \left(\frac{1}{2}\right)^2 = 1 - \frac{1}{4} = \frac{3}{4}$ or more of
the data will lie within 2 standard deviations of the mean.

Chebyshev’s Theorem applies to any set of data.

When data is distributed normally, (see top left histogram) Chebyshev’s proportion can be
“improved” on dramatically.

When data is distributed normally, then approximately 68% of the data is
within one standard deviation of the mean, 95% of the data is within 2 standard
deviations of the mean and 99.7% is within 3 standard deviations of the mean.
This is the Empirical Rule.
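
To see how much the Empirical Rule improves on Chebyshev's Theorem, the two can be compared side by side. The following is a minimal Python sketch, not part of the original module; the function names are illustrative assumptions, and it relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist


def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations (any distribution, k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - (1 / k) ** 2


def normal_proportion(k):
    """Proportion of a normally distributed population within k standard deviations."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Chebyshev guarantees "at least"; the normal figures are (near) exact
for k in (2, 3):
    print(f"k = {k}: Chebyshev at least {chebyshev_lower_bound(k):.1%}, "
          f"normal about {normal_proportion(k):.1%}")
print(f"k = 1: normal about {normal_proportion(1):.1%}")
```

Chebyshev's bound is lower because it has to hold for every possible distribution, including very lopsided ones.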

A great deal of statistical analysis, in fields from biology and psychology to economics,
medicine, and sports, is done on normally distributed data.

Note 2: The normal distribution is not appropriate for all kinds of data. For example, small samples or biased
populations would not necessarily be normally distributed and could be statistically analyzed by different
methods.

The histogram below is a representation of an “ideal” normal population, where the mean is 0
and the standard deviation is 1.
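
[Histogram on a horizontal axis from -3 to 3. Labelled regions: 34% on each side of the mean
within 1 standard deviation; 13.5% on each side between 1 and 2 standard deviations; 2.5% at
each end. Brackets mark 68% within 1, 95% within 2, and 99.7% within 3 standard deviations
of the mean.]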

Now complete Exercise 5 and check your answers.

Exercise 5
1. Sixty college students were asked for the total number of children in their family. The
data collected follows:

1 6 3 5 5 3 4 1 2 7 3 2
3 4 5 3 1 3 2 1 4 4 2 2
3 9 4 3 3 5 3 5 7 3 1 1
3 5 2 6 4 3 3 3 3 3 2 3
4 3 5 7 3 2 1 2 3 2 4 3

a. What is the range for this data?


__________________________________

b. Draw a histogram for the data on the grid below.

[Blank grid: Frequency (vertical axis, 3 to 24 in steps of 3) versus Number of children
(horizontal axis, 1 to 9)]

c. Find the mean for this data. $\bar{x}$ = ___________

d. Find the standard deviation. σ = ____________

e. Find the values $\bar{x} - \sigma$ and $\bar{x} + \sigma$. __________ and ___________

f. What percent of the data lies between $\bar{x} - \sigma$ and $\bar{x} + \sigma$?

_______________

g. What are the values $\bar{x} - 2\sigma$ and $\bar{x} + 2\sigma$?

_______________ and _______________

h. What percent of the data lies within two standard deviations of the mean?

_______________

i. What are the values $\bar{x} - 3\sigma$ and $\bar{x} + 3\sigma$?

____________ and _____________

j. What percent of the data lies within three standard deviations of the mean?

_______________

k. Compare your answers for f., h., and j. to the results predicted by the Empirical
Rule. Does the result suggest an approximately normal distribution?

_______________

2. The following table tallies the number of hours of TV watched in one day by 75 high
school students. Only those who watched some TV yesterday were included in the
tally.

Hours Tally Frequency
0.5
1.0
1.5
2.0
2.5
3.0
3.5

a. Convert the tally counts to frequency numbers above.

b. Calculate the mean. $\bar{x}$ =

c. Calculate the standard deviation. σ =

d. What percent of the data lies within one standard deviation of the mean?

e. What percent of the data lies within two standard deviations of the mean?

f. What percent lies within three standard deviations of the mean?

3. The following is a collection of IQ scores. (IQ stands for “intelligence quotient”.)

66 81 88 93 97 100 102 106 112 119


71 83 89 93 98 100 102 107 112 119
71 83 89 95 98 100 102 107 113 121
72 84 89 95 98 100 102 107 113 122
73 85 90 96 98 100 103 108 114 123
74 85 91 96 98 100 103 110 114 126
76 85 92 96 99 100 103 110 115 126
77 86 92 97 99 101 104 111 117 127
80 86 92 97 99 101 105 111 118 130
81 88 92 97 99 101 106 112 118 136

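a. Complete the histogram for the data below.

[Partially completed histogram: Frequency (vertical axis, up to 26 in steps of 2) versus IQ
(horizontal axis) in classes 60-69, 70-79, 80-89, 90-99, 100-109, 110-119, 120-129, and
130-139. The bars already drawn are labelled 1 (60-69), 7 (70-79), 16 (80-89), 6 (120-129),
and 2 (130-139).]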

b. Find the range. __________________

c. Find the mean. $\bar{x}$ = __________________

d. Find the standard deviation. σ = __________________

e. What percent of the IQ’s lie within one standard deviation of the mean?

__________________

f. What percent lie within 2 standard deviations of the mean?

__________________

g. What percent lie within 3 standard deviations of the mean?

__________________

h. Does IQ seem to be normally distributed?

__________________

i. What IQ score would be three standard deviations above the mean?

__________________

Answers are on page 69.

Activity 5: Brothers and sisters
Duplicate the survey conducted in Question 1 of Exercise 5. Ask 40 people how many
siblings (brothers and sisters) they have. Record the answers. Add 1 to each number so that
the person asked is included.

a. What is the range for this data? __________________________________

b. Draw a histogram for the data on the grid below.

[Blank grid: Frequency (vertical axis, 3 to 24 in steps of 3) versus Number of children
(horizontal axis, 1 to 9)]

c. Find the mean for this data. $\bar{x}$ = ___________

d. Find the standard deviation. σ = ____________

e. Find the values $\bar{x} - \sigma$ and $\bar{x} + \sigma$. __________ and ___________

f. What percent of the data lies between $\bar{x} - \sigma$ and $\bar{x} + \sigma$? _______________

g. What are the values $\bar{x} - 2\sigma$ and $\bar{x} + 2\sigma$? _______________ and _______________

h. What percent of the data lies within two standard deviations of the mean?

_______________

i. What are the values $\bar{x} - 3\sigma$ and $\bar{x} + 3\sigma$? ____________ and _____________

j. What percent of the data lies within three standard deviations of the mean?

_______________

k. Compare your answers for f., h., and j. to the results predicted by the Empirical
Rule. Does the result suggest an approximately normal distribution?

_______________

Unit 6: The normal curve

The normal curve is an idealized representation of a normally distributed population. The
normal curve, also called a bell-shaped curve, is drawn below, where the mean score is 0 and
the standard deviation is 1.

[Figure: normal (bell-shaped) curve on a horizontal axis from -3 to 3]

The area under the curve represents 100% (or 1.00) of the data (or population). In keeping
with the Empirical Rule, the area under the curve within one standard deviation of the mean is
68.26%, within two standard deviations is 95.44%, and within three standard deviations is
99.74%, as shown below. (These figures are twice the Appendix A areas for z = 1, 2, and 3.)

[Three normal curves, each on a horizontal axis from -3 to 3, with the area within one, two,
and three standard deviations of the mean shaded. The areas under the curve are 0.6826
(68.26%), 0.9544 (95.44%), and 0.9974 (99.74%) respectively.]

The normal curve is the graph of the exponential equation

$$y = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$

where e ≈ 2.718.

By using z scores, the area of any region under the curve can be determined.

The z score or standard score represents the number of standard deviations a
data value is from the mean value. The formula for z is

$$z = \frac{x - \bar{x}}{\sigma}$$

where x is a data value, $\bar{x}$ is the mean, and σ is the standard deviation.

Example 1
The mean IQ is 100 and the standard deviation is 15. If Frank has an IQ of 127, find his z
score.

Solution

Here, $\bar{x}$ = 100, σ = 15, and x = 127.

$$z = \frac{127 - 100}{15} = 1.8$$

A z score is similar to a percentile in that it is a measure of position. As a rule, z scores above
2.0 (or below –2.0) are considered “unusual” values. In a normal population such scores
would occur less than 5% of the time. Z scores between –2.0 and 2.0 are considered
“ordinary” values.
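
The z score formula and the "unusual" rule of thumb can be expressed in a few lines of code. The following is a minimal Python sketch, not part of the original module; the function names `z_score` and `describe` are illustrative assumptions, and a z score of exactly 2.0 or -2.0 is treated here as ordinary.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


def describe(z):
    """Label a z score using the rule of thumb: beyond 2.0 either way is unusual."""
    return "unusual" if abs(z) > 2.0 else "ordinary"


# Frank's IQ from Example 1: mean 100, standard deviation 15
frank_z = z_score(127, mean=100, std_dev=15)
print(f"z = {frank_z:.1f}, which is an {describe(frank_z)} value")
```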

Example 2
Frank has an IQ of 127, or a z score of 1.8. What percent of the population have IQ scores
less than (or equal to) 127 and what percent have IQ scores higher than 127?

Solution

Refer to Appendix A (see page 66).

Locate 1.8 under the z column and read across to the value under the 0.00 column.

A z score of 1.8 relates to the area under the curve from 0 to 1.80. The area is 0.4641.
The area under the curve from –∞ to 0 is 0.5000.

[Figure: normal curve on a horizontal axis from -3 to 3, marked at z = 1.8; the area to the
left of 0 is 0.5000 and the area between 0 and 1.8 is 0.4641.]
The percent of the population that have IQ scores less than (or equal to) 127 is,

0.5000 + 0.4641 = 0.9641 or 96.41%.

A z score of 1.8 can be considered as equivalent to a percentile of 96, since it is higher than
96.41% of the population. In other words, an IQ of 127 has a 96th percentile ranking.

The percent of the population with IQ scores above 127 is,

1.0000 – 0.9641 = 0.0359 or 3.59%

Areas under the normal curve can also be associated with probabilities. In the above example
we could say that the probability that some person would have an IQ score less than 127 is
0.9641 out of 1.0000, or about $\frac{96}{100}$, or 96% = $\frac{24}{25}$.

The probability that someone would have an IQ higher than 127 is about 4% or $\frac{1}{25}$.

Or, in a group of 25 people chosen randomly, we would expect only about one to have an IQ
score of more than 127.
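
The table lookup in Example 2 can be cross-checked by computer. The following is a minimal Python sketch, not part of the original module; it assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`, and the variable names are illustrative.

```python
from statistics import NormalDist

# IQ scores modelled as normal with mean 100 and standard deviation 15
iq = NormalDist(mu=100, sigma=15)

below_127 = iq.cdf(127)      # proportion with IQ less than or equal to 127
above_127 = 1 - below_127    # proportion with IQ higher than 127

# The Appendix A entry is the area from 0 to z, i.e. cdf(z) minus the left half
area_0_to_z = NormalDist().cdf(1.8) - 0.5

print(f"area from 0 to 1.8: {area_0_to_z:.4f}")
print(f"below 127: {below_127:.2%}, above 127: {above_127:.2%}")
```

The computed areas agree with the table values to four decimal places, since 127 converts to a z score of exactly 1.8.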

Example 3
The waiting-in-line time at a certain grocery store is normally distributed with a mean
of 3.5 minutes and a standard deviation of 1.4 minutes.

a. What percent of the customers wait in line less than one minute?

b. What percent of the customers wait in line more than 5 minutes?

c. What is the probability that a customer would have to wait in line for more than 7 minutes?

Solution

a. Convert 1 minute to a z score.

$$z = \frac{x - \bar{x}}{\sigma} = \frac{1 - 3.5}{1.4} \approx -1.79$$

[Figure: normal curve on a horizontal axis from -3 to 3, marked at z = -1.79; the area between
-1.79 and 0 is 0.4633 and the area to the left of -1.79 is 0.0367.]

From the table (Appendix A, see page 66), a z score of –1.79 yields the same area as 1.79.
The area between 0 and –1.79 is 0.4633. The area to the left of –1.79 represents the
proportion of the population that waits in line less than a minute,

0.5000 – 0.4633 = 0.0367 or 3.67%

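b. Convert 5 minutes to a z score.

$$z = \frac{5 - 3.5}{1.4} \approx 1.07$$

The area under the curve between 0.0 and 1.07 is 0.3577 and the area
beyond 1.07 is

0.5000 – 0.3577 = 0.1423

[Figure: normal curve on a horizontal axis from -3 to 3, marked at z = 1.07; the area between
0 and 1.07 is 0.3577 and the area to the right of 1.07 is 0.1423.]

This means that 14.23% of the customers have to wait in line for more than 5
minutes.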

c. Convert 7 minutes to a z score.

$$z = \frac{7 - 3.5}{1.4} = 2.5$$

The area under the curve between 0 and 2.5 is 0.4938. The area beyond
2.5 is 0.5000 – 0.4938 = 0.0062.

[Figure: normal curve on a horizontal axis from -3 to 3, marked at z = 2.5; the area between
0 and 2.5 is 0.4938 and the area to the right of 2.5 is 0.0062.]

This means that 0.62%, or less than 1%, of the customers would have to stand in line for
more than 7 minutes. The probability that someone would have to stand in line for more than
7 minutes is 62 in 10000 or $\frac{31}{5000}$.

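
The three parts of Example 3 can be cross-checked without the table. The following is a minimal Python sketch, not part of the original module; it assumes Python 3.8 or later for `statistics.NormalDist`, and the variable names are illustrative. Because the program does not round z to two decimal places before looking up the area, its answers for parts a and b differ slightly in the third or fourth decimal place from the table-based answers.

```python
from statistics import NormalDist

# Waiting time in line: mean 3.5 minutes, standard deviation 1.4 minutes
wait = NormalDist(mu=3.5, sigma=1.4)

under_1_minute = wait.cdf(1)       # part a: area to the left of 1 minute
over_5_minutes = 1 - wait.cdf(5)   # part b: area to the right of 5 minutes
over_7_minutes = 1 - wait.cdf(7)   # part c: area to the right of 7 minutes

print(f"a. less than 1 minute:  {under_1_minute:.2%}")
print(f"b. more than 5 minutes: {over_5_minutes:.2%}")
print(f"c. more than 7 minutes: {over_7_minutes:.2%}")
```
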
Example 4
A certain tire company tested their new Treadmasters and found that the tires’ tread life
averaged 60000 km with a standard deviation of 7000 km. The company wants to sell
the Treadmaster with a guarantee that they will last a certain number of kilometres.
The company is willing to give a money back guarantee on 10% of its worst tires. At
how many kilometres will 10% of the tires be worn out?

Solution

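We need to find the z score that cuts off the lowest 10% of the area under the curve, which
leaves an area of 40% (0.4000) between that z score and 0. In the table, the closest value to
0.4000 is 0.3997, and this corresponds to a z score of 1.28. Because the worst tires lie below
the mean, we use z = –1.28.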

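[Figure: normal curve on a horizontal axis from -3 to 3, marked at z = -1.28; 10% of the area
lies to the left of -1.28 and 40% lies between -1.28 and 0.]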

To determine the kilometre value that is associated with a z score of –1.28, solve the z score
formula for x.

$$z = \frac{x - \bar{x}}{\sigma}$$

$$-1.28 = \frac{x - 60000}{7000}$$

$$x = 60000 - 1.28(7000)$$

$$x = 51040 \text{ km}$$

The company should guarantee the tires for 51040 kilometres; about 10% of the tires will
wear out before that distance.
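
Example 4 works backwards, from an area to a data value. The following is a minimal Python sketch of that reverse lookup, not part of the original module; it assumes Python 3.8 or later for `statistics.NormalDist` and its `inv_cdf` method, and the variable names are illustrative. The program uses a more precise z score than the two-decimal table value, so its answer is a few kilometres lower than 51040.

```python
from statistics import NormalDist

# Tread life: mean 60000 km, standard deviation 7000 km
tread_life = NormalDist(mu=60000, sigma=7000)

# Distance below which the worst 10% of tires wear out
guarantee_km = tread_life.inv_cdf(0.10)

# The same calculation with the table's rounded z score of -1.28
table_km = 60000 + (-1.28) * 7000

print(f"exact z score:  guarantee about {guarantee_km:.0f} km")
print(f"table z = -1.28: guarantee about {table_km:.0f} km")
```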

Now complete Exercise 6 and check your answers.

Exercise 6
Use Appendix A (see page 66) for the questions that follow.

1. Find the area under the normal curve between the following z scores.

[Figure: normal curve on a horizontal axis from -3 to 3]

a. z = 0 and z = 1.41 _____________________

b. z = -0.6 and z = 0 _____________________

c. z = -1.23 and z = 0.53 _____________________

d. z = 0.46 and z = 2.31 _____________________

e. z > 1.5 _____________________

2. The average resting heart rate for a normally distributed population of men was found
to be 62 beats per minute with a standard deviation of 11 beats per minute.

a. What percent of men have resting heart rates under 70 beats per minute?

_______________

b. What percent of men have resting heart rates over 70 beats per minute?

_______________

c. What percent of men have resting heart rates between 40 and 80 beats per minute?

_______________

3. In a group of women whose heights are normally distributed, the average height is
5 feet 4 inches (64 inches) with a standard deviation of 2.8 inches.

a. What percent of the women are between 5 feet and 6 feet tall?

_______________

b. What is the probability that a woman would be taller than 6 feet?

_______________

c. What is the probability that a woman would be shorter than 5 feet tall?

_______________

4. A survey of college students enrolled in technology programs indicated that they
spent an average of 29 hours a week outside of class time studying for their courses.
The data was normally distributed with a standard deviation of 9 hours per week.

a. What percent of the students spent more than 40 hours per week studying?

_______________

b. What percent spent fewer than 10 hours per week studying?

_______________

c. What percent spent between 20 and 50 hours per week studying?

5. Larry’s lightbulb factory manufactures bulbs with an average life of 1000 hours and a
standard deviation of 100 hours. To sell more lightbulbs, Larry wishes to give a
guarantee, but he is only willing to replace 5% of the lightbulbs sold. For how many
hours should the lightbulbs be guaranteed?

_____________

6. Workers in a certain factory are given a bonus every time they assemble more than
300 toy cars in one eight hour day. The number of toy cars assembled each day by a
worker is normally distributed with a mean of 270 cars and a standard deviation of 16
cars. What percent of the workers receive a bonus each day?

__________________________

7. A radar unit measures the speed of passing cars on a highway. The speeds of the cars
are normally distributed with a mean speed of 104 km/h.

a. Find the standard deviation of the speeds if 3% of the cars are travelling faster than
115 km/h.

_________________________

b. Using the standard deviation found above, what percent of the cars are travelling at
less than 90 km/h?

_________________________

c. What percent are travelling faster than 120 km/h?

____________________________

d. If there is a no-tolerance rule in effect and the posted speed is 100 km/h, what percent
of the cars would be considered to be speeding?

___________________________

Answers are on page 69.

Activity 6: Rolling dice
With a partner, roll a pair of dice 150 times. Your partner should tally each roll of the dice,
while you keep count of the number of rolls. Complete the tally sheet below and draw a
histogram for this data.

a.
Dice sum Tally Frequency
2
3
4
5
6
7
8
9
10
11
12

b.
[Blank grid: Frequency (vertical axis, 0 to 50 in steps of 5) versus Dice sum (horizontal
axis, 2 to 12)]

c. Find the mean. $\bar{x}$ = _____________

d. Find the standard deviation. σ = __________________

e. Find the interval $\bar{x} - \sigma$ to $\bar{x} + \sigma$. ______________ to _______________

f. What percent of the rolls lie within one standard deviation of the mean?

__________________

g. What are $\bar{x} - 2\sigma$ and $\bar{x} + 2\sigma$, and what percent of the data lies within 2 standard
deviations of the mean?

_________________________________________________________________

h. What are $\bar{x} - 3\sigma$ and $\bar{x} + 3\sigma$, and what percent of the data lies within 3 standard
deviations of the mean?

__________________________________________________________________

__________________________________________________________________

i. Does the data appear to be normally distributed?

j. What z score is associated with a roll of 9?

k. What percent of the rolls would be greater than 9?

Test this by rolling the dice ten times. How many times out of ten did a roll of 10, 11, or 12
occur?

What percent is this?

Repeat: Roll ten more times and count how many times a 10, 11, or 12 was rolled.
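
If dice are not available, or as a check after completing the activity by hand, the rolls can be simulated. The following is a minimal Python sketch, not part of the original module; the function name `roll_pairs` and the optional `seed` parameter are illustrative assumptions. Each run without a seed gives different results, just as each pair of students will get different tallies.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def roll_pairs(number_of_rolls, seed=None):
    """Simulate rolling a pair of dice and return the list of dice sums."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(number_of_rolls)]


sums = roll_pairs(150)
frequency = Counter(sums)

# Text version of the tally sheet and histogram
for total in range(2, 13):
    print(f"{total:2d} | {'*' * frequency[total]} ({frequency[total]})")

x_bar = mean(sums)
sigma = pstdev(sums)  # population formula, as used throughout this module
within_one = sum(1 for s in sums if x_bar - sigma <= s <= x_bar + sigma)
print(f"mean = {x_bar:.2f}, sigma = {sigma:.2f}")
print(f"within one standard deviation: {100 * within_one / len(sums):.1f}%")
```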

Unit 7: Analysing survey data
Hardly a day goes by without the media reporting the results of some survey. Surveys are
conducted to determine what people like or dislike, what their opinions are on various issues
and what factors affect their lives.

Governments and businesses often use surveys in order to make decisions and to monitor the
effectiveness of previous decisions.

We will restrict our analysis of survey data to YES-NO population surveys only. In a YES-
NO survey, every member of the population answers a question with a YES or a NO. For
example, “Do you smoke?” is a YES or NO type question. “How many cigarettes do you
smoke every day?” is not a YES or NO question. If we were to ask every member of a
population the YES or NO question we would be taking a census of the population. If 20% of
the population answered YES to the question, this would be called a 20% yes population.

It is often very expensive and very time consuming to take a population census. By using a
smaller sample of the population, we can estimate the percentage of YES answers in the
population.

For example, in a recent survey of 1000 Canadians, 55% responded YES to the question, “Is
having a happy life the thing that matters most to you?” when compared to other things like
health and freedom. Even though the survey only represents a small portion of the total
population of Canada, its margin of error is calculated to be plus or minus 3.2% 19 times out
of 20.

In other words, if that survey was repeated 20 times, using a different 1000 Canadians each
time, then 19 times out of 20 the percentage of YES responses to the above question
would be between 51.8% (55% - 3.2%) and 58.2% (55% + 3.2%).

Sampling error for a 95% confidence interval.

If the sample size is n, then the sampling error of the percentage of YES
answers in the population is approximately

$$\frac{100}{\sqrt{n}}$$

When n > 100, the accuracy of $\frac{100}{\sqrt{n}}$ is quite good.

Since 19 out of 20 is equivalent to 95%, we can be “confident” that the percentage of YES
answers will be within the sampling error interval 95% of the time.
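
The sampling error formula shows why larger samples give tighter estimates. The following is a minimal Python sketch, not part of the original module; the function name `sampling_error` is an illustrative assumption. For a sample of 1000 it reproduces the margin of about 3.2% quoted for the survey of 1000 Canadians.

```python
import math


def sampling_error(sample_size):
    """Approximate 95% sampling error, in percentage points: 100 / sqrt(n)."""
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    return 100 / math.sqrt(sample_size)


# Quadrupling the sample size only halves the sampling error
for n in (100, 400, 1000, 1600):
    print(f"n = {n:4d}: plus or minus {sampling_error(n):.1f} percentage points")
```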

Example 1
The business association in Grissville surveyed 384 people and asked each if they had eaten
dinner in a local restaurant at least once in the last week. 223 people responded YES to the
survey. Find the 95% confidence interval (and sampling error) for this sample.

Solution

The proportion of YES answers in the sample is $\frac{223}{384}$ = 0.581 or 58%.

The sampling error, $\frac{100}{\sqrt{384}}$, is about 5.1%.

The 95% confidence interval is from 58% - 5.1% to 58% + 5.1%, or about 53% to 63%.
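
The steps of Example 1 can be combined into one small function. The following is a minimal Python sketch, not part of the original module; the function name `confidence_interval_95` is an illustrative assumption, and it uses the approximate 100/√n sampling error from this unit rather than a more exact statistical formula.

```python
import math


def confidence_interval_95(yes_count, sample_size):
    """Return (low, high) in percent, using the 100 / sqrt(n) sampling error."""
    percent_yes = 100 * yes_count / sample_size
    error = 100 / math.sqrt(sample_size)
    return percent_yes - error, percent_yes + error


# Grissville restaurant survey: 223 YES answers out of 384 people
low, high = confidence_interval_95(223, 384)
print(f"95% confidence interval: about {low:.0f}% to {high:.0f}%")
```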

Now complete Exercise 7 and check your answers.

Exercise 7
1. Read the following newspaper clipping.

The latest poll, conducted for
Victoria radio station CBC shows Mr.
Martin with 39.9 percent support, Ms.
Wilson with 33.9 percent, and Mr.
Yeung with 8.5 percent.

a. Why do the three percentages not add to 100%?

_____________________________________________________________________

b. Suppose this poll was accurate to plus or minus 3 percentage points 19 times out of
20.

i. What is the least support Mr. Martin could have 19 times out of 20?

_________________________

ii. What is the most support Ms. Wilson could have 19 times out of 20?

_________________________

iii. Considering the above and the percent of undecided voters, does Mr. Martin
have a majority of the potential vote?

________________________

2. A survey was conducted to determine whether bicycle riders should have to pay for a
licence to ride on the city streets. 183 people were asked and 57 said yes.

a. What percent of the people responded “yes” to the survey?

b. What is the sampling error for this survey?

c. What is the 95% confidence interval for this survey?

3. An opinion poll reported that support for the Liberals stood at 58% with a sampling
error of 2.5% 19 times out of 20. Use the sampling error formula to determine the
sample size.

4. A certain poll found that 833 people out of 1240 thought that capital punishment
should be reinstated for first degree murder. What is the 95% confidence interval for
this sample?

5. If the poll in Question 4 was conducted in 1998, are the results still valid today?

6. Count the first one hundred letters in this sentence and then count how many times
the letter “e” occurred and count every “e” as a YES response.

a. What percent of the 100 letters were “e’s”?

b. What is the 95% confidence interval for the occurrence of the letter “e” in the
English language?

c. Repeat the above process with a different set of words and record the percent of
“e’s” found in the passage. Does this percent fall in the confidence interval range
found in part b. above?

7. See if your calculator has a random number function (RAN# button). It should produce
three-digit decimal numbers randomly. In other words, every number has an equal
chance of showing up on your screen. Assume that every time a 3, 6 or 9 appears as
the last digit of the random number, it is the same as receiving a YES response. Then,
theoretically a YES response should occur 30% of the time, since 3, 6 and 9 are three
out of ten possible last digits in each random number.

a. Find the sampling error for this 30% yes population if the sample size is 100
random numbers.

b. Generate 100 random numbers and tally the number of times a 3, 6 or 9 occurred as
the last digit. What percent of the time did a 3, 6 or 9 occur?

c. What is the 95% confidence interval for this sample?

d. Check with the other students. Did their samples produce percentages within the 20
to 40 percent interval 19 times out of 20 times?

Answers are on page 69.

Activity 7: Smoking
When conducting a survey it is important to ask simple unambiguous questions. It is also
important to select a sample that is representative of the population being surveyed. The
following exercise should demonstrate the importance of sample size.

Ask various students the question, “Do you smoke?” (If there is any confusion about what
you are asking, you could say, “Have you smoked a cigarette in the last 48 hours?”) Record
the number of YES responses and then calculate the percentage of YES responses.

Number of students asked    Number of YES responses    Percentage of YES responses

16

25

30

35

a. As the sample size increased, did the variation in percentages increase or decrease?

______________________________

b. According to Statistics Canada, 1996-1997, about 20% of the BC population are
smokers. Are your results close to 20%?

______________________________

c. You sampled 35 college students. Are these students representative of the total
college population? What problems might there be with your sample in terms of it
being representative of the whole population?

__________________________________________________________________

__________________________________________________________________

Unit 8: A statistics project
Now it is your turn. You or your group will select a topic of interest, collect data regarding
the topic, and statistically analyse the results. Choose a topic and then let your instructor
know what it is before presenting your results in Exercise 8.

A few possible topics are listed below. However, feel free to identify one of your own.

1. Number of cups of coffee (cans of pop) consumed by a student in one day.

2. Number of keys (or credit cards) carried by a person while at the college.

3. Number of cars passing a certain intersection every minute.

4. Number of cigarettes smoked per day by smokers.

5. Minutes spent studying last night.

6. Initial copyright dates on books in the library.

7. Resting heart rates of males (or females).

8. Total minutes per week spent on exercise by college students.

9. Age difference (in months) between oldest and youngest siblings (or between
spouses).

10. Waiting time in line (grocery store or bank) during busy times.

Now complete Exercise 8 and check your answers.

Exercise 8

1. What topic have you chosen?

2. Before you collect the data, try to predict the mean and range of the data.

3. You must collect 30 or more data values. List your data below.

4. Present your data graphically below. Use either a line graph, bar graph or stem and
leaf plot.

5. Describe the shape of your graph in 4. Is it normal, J-shaped, skewed or bimodal?

6. Calculate the following statistics for your data.

$\bar{x}$ =

median =

mode =

range =

σ=

Q1 =

Q3 =

7. What measure of central tendency best describes this set of data?

8. Which data value is higher than 90% of the rest of the data?

9. What percent of the data lies within,

a. one standard deviation of the mean?

b. two standard deviations of the mean?

c. three standard deviations of the mean?

10. How do the actual mean and range compare with your predicted mean and range?

11. Was the sample you chose biased in any way? Were you biased in the way you
collected the data?

12. If you were to do this project over again, how could you improve it?

13. Write a concluding statement about the data.

Appendix A
The entries in this table represent the area under the normal curve
from 0 to z, where $z = \frac{x - \bar{x}}{\sigma}$. Areas for negative values of z are obtained by
symmetry.

[Figure: normal curve with the area between 0 and z shaded]
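
Any entry in the table can be reproduced by computer. The following is a minimal Python sketch, not part of the original module; it assumes Python 3.8 or later for `statistics.NormalDist`, and the function name `area_from_0_to_z` is an illustrative assumption. Computed values may occasionally differ from a printed table by 1 in the last decimal place because of rounding.

```python
from statistics import NormalDist


def area_from_0_to_z(z):
    """Area under the standard normal curve between 0 and z.

    Negative z values use symmetry: the area is the same as for |z|.
    """
    return NormalDist().cdf(abs(z)) - 0.5


# Rebuild the row for z = 1.80, 1.81, ..., 1.89
row = [round(area_from_0_to_z(1.8 + hundredths / 100), 4) for hundredths in range(10)]
print("1.8", " ".join(f"{area:.4f}" for area in row))
```
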

Second decimal place in z


z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.5 0.4998
4.0 0.49997
4.5 0.499997
5.0 0.4999997
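
Entries in this table can also be computed directly. The following is a minimal Python sketch, not part of the original appendix. It uses the standard identity that the area under the standard normal curve from 0 to z equals one half of erf(z/√2). The function name `area_0_to_z` is chosen here for illustration.

```python
import math


def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2))


# Print a few entries to four decimal places, as in the table.
for z in (1.00, 1.96, 2.50):
    print(f"z = {z:.2f}  area = {area_0_to_z(z):.4f}")
```

For a negative z, use the area for |z|, as the note on symmetry above the table describes.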

Appendix B
Activity 4: Quiz
Speed counts. Work as quickly as possible and do not use a calculator.

1. Add 1 + 3 + 5 + 7 + 9 + 11 + 13 + 15 + 17 + 19

2. How many months have 28 days?

3. Divide 8 by 0.

4. Subtract 23.79 from 30.02.

5. Write down the 5th, 25th and middle two letters of the alphabet.

6. Multiply 6.87 by 0.96.

7. Name four countries in Africa.

8. Divide 1.3 by 0.52.

9. Use the letters given to form three words across and three words down. (One letter
per square.)

A B E E I R

T N

Answers are in the Answers section under "Appendix B - Quiz" (page 75).

Answers

Exercise 1
1. The lengths of the bars suggest that men only have a life expectancy of one half or
50% that of women. This is a result of starting the years’ scale at 75 years rather than
at zero years. Actually, men have a life expectancy of 78/81 or 96% that of women.

2. a. Only people with very strong opinions would bother to phone in.

b. Most people do not respond to mail surveys, so the sample would be quite small.
People who do not read books would most likely not respond. The survey might
not be representative of all the geographical, cultural and economic sectors of
Vancouver.

c. Butler is probably at the very place where the smokers are hanging out.

Exercise 2
1. a. x̄ = $25 866.67, median = $25 000, mode = $25 000, range = $50 000

b. the mode c. the range

2. a.

Day   Daily mean   Daily range
1     3            6
2     2.5          1
3     8            8
4     11.5         3
5     11           8
6     11.5         3
7     10           4

b. 8.2 °C c. 15 °C d. 4.7 °C

3. a. A b. A c. A d. B

4. Average monthly temperatures for Vancouver over a one-year period (frequency graph)

[Graph: temperature in °C (vertical axis, 0 to 24) against month, J to D (horizontal axis)]

Average monthly temperatures for Vancouver over a one-year period (histogram)

[Graph: temperature in °C (vertical axis, 0 to 24) against month, J to D (horizontal axis)]

5. Computer course final grades.

4 | 0 2 2
5 | 0 4 6 8
6 | 6 6 8 9
7 | 0 3
8 | 0 4 5 8 9
9 | 3

6. a. 25 students
b. x̄ = 63.5 seconds
median = 63 seconds

7. (78% + 78% + 78% + 78% + 78% + x) / 6 = 80%, x = 90%

8. (182 + x) / (260 + 100) = 0.75, x = 88. Neil’s chances are poor.
He has only averaged 70% on his tests thus far.
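
Questions 7 and 8 both solve "marks so far plus x, divided by total possible marks, equals the target" for x. The following is a minimal Python sketch of that rearrangement. The function name `needed_score` and its parameter names are chosen here for illustration and do not come from the text.

```python
def needed_score(earned, possible_so_far, remaining_possible, target_fraction):
    """Marks needed on the remaining work to finish with the target overall fraction."""
    total_possible = possible_so_far + remaining_possible
    return target_fraction * total_possible - earned


# Question 7: five tests at 78% each, one test left, target average of 80%.
q7 = needed_score(5 * 78, 500, 100, 0.80)

# Question 8: 182 of 260 marks so far, 100 marks left, target of 75% overall.
q8 = needed_score(182, 260, 100, 0.75)

print(f"Question 7: {q7:.0f} marks needed out of 100")
print(f"Question 8: {q8:.0f} marks needed out of 100")
```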

Exercise 3
1. a. P50 = 119 and P80 = 138
b. Jill scored at the 82nd percentile since 140 = P82. Jill can type faster than 82% of the
other students.
c. 90 = P12 or 90 is the 12th percentile. 12% or 15 out of 125 failed the test.
d. P20 = 98.5

2. a. Q3 or P75 = 13
b. 13 = P70
c. The discrepancy is a result of the relatively small (n = 70) sample size. As the sample
size increases, these discrepancies become smaller.

Exercise 4
1. a. 10 b. 8 c. 6 d. 6 e. 3.1

f. [Histogram: frequency (vertical axis) against scores from 0 to 10 (horizontal axis)]

g. x̄ - σ = 6 - 3.1 = 2.9
x̄ + σ = 6 + 3.1 = 9.1
h. There are 15 scores between 2.9 and 9.1, so 71% of all the scores lie within one
standard deviation of the mean.

2. a. x̄ + σ = 12.97 and x̄ − σ = 7.03. 8/12, or 66.7%, of the scores lie within one
standard deviation of the mean.
b. x̄ + 2σ = 15.94 and x̄ − 2σ = 4.06. All of the scores, or 100%, lie within two standard
deviations of the mean.

3. All the data values are equal.

4. In the first class most of the scores are between x̄ ± σ or 75 ± 5 = 70 to 80.
In the second class, x̄ ± σ = 70 ± 15 = 55 to 85.
It is more likely that the second class would have more scores of 85 or better.

5. a. 55 kg b. 80.5 kg c. 13.6 kg d. 83.3%

6. a. 4.9 minutes b. 1.46 minutes

Exercise 5
1. a. 9-1 = 8
b. [Histogram: frequency (vertical axis, 0 to 24) against number of children in family (horizontal axis, 1 to 9)]

c. x̄ = 3.4 d. σ = 1.7

e. 1.7 and 5.1 f. 47/60 = 78.3%

g. 0.0 and 6.8 h. 56/60 = 93.3% i. -1.7 and 8.4

j. 59/60 = 98.3% k. yes

2. a.

Hours   Frequency
0.5     3
1.0     11
1.5     12
2.0     17
2.5     15
3.0     13
3.5     4

b. x̄ = 2.07 c. σ = 0.78 d. 58.7% e. 96.0% f. 100%

3. a. [Histogram: frequency (vertical axis) against IQ from 60 to 130 (horizontal axis)]

b. 70 c. 99.5 d. 14.2 e. 67%

f. 95% g. 100% h. yes i. 142.1
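
The mean and standard deviation for a frequency table, such as the Hours and Frequency table in question 2, can be checked as follows. This is a minimal Python sketch, not part of the original answer key. It assumes σ is the population standard deviation (dividing by n), which reproduces the listed values of 2.07 and 0.78.

```python
import math

# Hours and Frequency table from Exercise 5, question 2a.
table = {0.5: 3, 1.0: 11, 1.5: 12, 2.0: 17, 2.5: 15, 3.0: 13, 3.5: 4}

n = sum(table.values())

# Weighted mean: each value counts as many times as its frequency.
mean = sum(x * f for x, f in table.items()) / n

# Population variance: frequency-weighted mean of squared deviations.
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / n
sigma = math.sqrt(variance)

# Count the data values within one standard deviation of the mean.
within_one = sum(f for x, f in table.items() if mean - sigma <= x <= mean + sigma)

print(f"n = {n}, mean = {mean:.2f}, sigma = {sigma:.2f}")
print(f"within one standard deviation: {100 * within_one / n:.1f}%")
```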

Exercise 6
1. a. 0.4207 b. 0.2257 c. 0.5926 d. 0.3124 e. 0.0668

2. a. 76.73% b. 23.27% c. 92.67%

3. a. 92.15% b. 0.21% or about 1 in 500 c. 7.64% or about 3 in 40

4. a. 11.12% b. 1.74% c. 83.14%

5. x = zσ + x̄ = -1.65(100) + 1000 = 835 hours

6. About 3% of the workers.

7. a. σ = (x − x̄)/z = (115 − 104)/1.88 = 5.85 b. 0.84% c. 0.31%

d. 75.17%
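
Questions 5 and 7a both rearrange the z-score formula z = (x − x̄)/σ, once for x and once for σ. The following is a minimal Python sketch of the two rearrangements. The function names are chosen here for illustration and do not come from the text.

```python
def value_from_z(z, mean, sigma):
    """Solve z = (x - mean) / sigma for x."""
    return z * sigma + mean


def sigma_from_z(x, mean, z):
    """Solve z = (x - mean) / sigma for sigma."""
    return (x - mean) / z


# Question 5: z = -1.65, mean = 1000 hours, sigma = 100 hours.
print(f"x = {value_from_z(-1.65, 1000, 100):.0f} hours")

# Question 7a: x = 115, mean = 104, z = 1.88.
print(f"sigma = {sigma_from_z(115, 104, 1.88):.2f}")
```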

Exercise 7
1. a. There are “undecided” voters.

b. i. 39.9% - 3% = 36.9% ii. 33.9% + 3% = 36.9%

iii. Yes. Even at Martin’s worst and Wilson’s best, Wilson cannot overtake Martin.
2. a. 57/183 = 31.1% b. 100/√183 = 7.4% c. between 23.7% and 38.5%

3. 2.5 = 100/√n or n = 1600

4. 833/1240 = 67.2% and 100/√1240 = 2.8%. 67.2% + 2.8% = 70% and 67.2% - 2.8% = 64.4%.
Rounding, the 95% confidence interval is 64% to 70%.

5. No. The results of polls are usually only valid for the day they are taken.

6. a. There are 18 “e’s” or 18%.


b. 8% to 28%.
c. Check with your instructor.

7. a. 100/√100 = 10% c. 20% to 40% b. and d. Check with your instructor.
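
The numerical answers in this exercise are consistent with a margin of error of 100/√n percent for a sample of size n. The following is a minimal Python sketch under that assumption. The function names are chosen here for illustration and do not come from the text.

```python
import math


def margin_of_error(n):
    """Approximate 95% margin of error, in percentage points, for sample size n."""
    return 100 / math.sqrt(n)


def confidence_interval(count, n):
    """Sample percentage minus and plus the margin of error."""
    percent = 100 * count / n
    margin = margin_of_error(n)
    return percent - margin, percent + margin


def sample_size_for(margin):
    """Sample size needed for a given margin of error in percentage points."""
    return (100 / margin) ** 2


print(f"Question 2b: {margin_of_error(183):.1f}%")
print(f"Question 3:  n = {sample_size_for(2.5):.0f}")
low, high = confidence_interval(833, 1240)
print(f"Question 4:  {low:.0f}% to {high:.0f}%")
```

The answer key rounds the percentage and the margin to one decimal place before adding and subtracting, so interval endpoints computed without intermediate rounding can differ from it in the last digit.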

Appendix B - Quiz
1. 100
2. all the months have 28 days
3. meaningless – 8 can’t be divided by 0
4. 6.23
5. E, Y, M and N
6. 6.5952
7. Northern Africa: Algeria, Egypt, Libya, Morocco, Sudan, Tunisia
Western Africa: Benin, Burkina Faso, Cape Verde, Cote d’Ivoire, Gambia, Ghana, Guinea,
Guinea-Bissau, Liberia, Mali, Mauritania, Niger, Nigeria, Senegal, Sierra Leone, Togo
Eastern Africa: Burundi, Comoros, Djibouti, Ethiopia, Kenya, Madagascar, Malawi,
Mauritius, Mozambique, Reunion, Rwanda, Somalia, Tanzania, Uganda, Zambia, Zimbabwe
Middle Africa: Angola, Cameroon, Central African Republic, Chad, Congo,
Equatorial Guinea, Gabon, Zaire
Southern Africa: Botswana, Lesotho, Namibia, South Africa, Swaziland
8. 2.5
9.

T I N

A R E

B E T
