Name: ________________________________ Period: _________ Date: _________________

Describing Data Practice


1. Classify the following variables as categorical (C) or quantitative (Q). If the data is quantitative, also state
whether it is discrete (D) or continuous (CN).
C______ Zip code
Q, D___ Test scores
Q, CN__ Average miles per gallon of a car
C______ Favorite ice cream flavor

2. The following partially complete two-way table shows the favorite food of a random sample of 30 Dallas
Cowboys football fans. Example calculation: Male eggplant: 4/15 = 27%
          Eggplant    Quinoa     Kale       Total
Male      4 (27%)     8 (53%)    3 (20%)    15
Female    10 (67%)    1 (7%)     4 (27%)    15
Total     14          9          7          30

a) Fill in the missing conditional values in the table above.


b) Create a segmented bar graph on the right to display the relationship
between gender and food preference for these football fans.
c) In the space below, write a few sentences comparing the distribution
of food preference across gender.
A higher percentage of males (53%) prefer quinoa than
females (7%). A higher percentage of females (67%) prefer eggplant than males (27%).
Males and females have about the same percentage who prefer kale.
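The conditional percentages in the football table can be checked with a short script. This is a minimal sketch: the dictionary layout and the helper name `row_percents` are illustrative choices, not part of the worksheet.

```python
# Counts from the football-fan table (rows: gender, columns: food).
counts = {
    "Male":   {"Eggplant": 4,  "Quinoa": 8, "Kale": 3},
    "Female": {"Eggplant": 10, "Quinoa": 1, "Kale": 4},
}

def row_percents(row):
    """Conditional distribution of food given gender, in whole percents."""
    total = sum(row.values())
    return {food: round(100 * n / total) for food, n in row.items()}

conditional = {gender: row_percents(row) for gender, row in counts.items()}
for gender, dist in conditional.items():
    print(gender, dist)
```

Each count is divided by its own row total (15 for each gender here), which is what makes the percentages conditional on gender.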

The following partially complete two-way table shows the favorite food
of a random sample of 30 Dallas Stars hockey fans.
          Eggplant    Quinoa     Kale       Total
Male      5 (36%)     4 (29%)    5 (36%)    14
Female    6 (38%)     5 (31%)    5 (31%)    16
Total     11          9          10         30

d) Fill in the missing marginal values in the table above.


e) Create a segmented bar graph on the right showing the breakdown of food preference by gender for
these hockey fans.
f) Is the association between gender and food preference stronger or weaker among football or hockey
fans? Justify your answer.
There is stronger association for football fans. Knowing the value of one variable for football
fans helps predict the value of the other variable. Among football fans, a higher percentage
of males (53%) prefer quinoa than females (7%). Also, a higher percentage of females (67%)
prefer eggplant than males (27%).

There is only a weak association between gender and favorite food among the hockey fans.
For hockey fans, the conditional distributions (and segment heights) for favorite food are
about the same for each gender and about the same within each gender.
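One informal way to put a number on "stronger" versus "weaker" is to look at the largest gap between the male and female conditional percentages in each table. This is only a rough sketch: the largest-gap measure and the names `proportions` and `largest_gap` are assumptions made for illustration, not a statistic the worksheet defines.

```python
# Counts in the order Eggplant, Quinoa, Kale.
football = {"Male": [4, 8, 3], "Female": [10, 1, 4]}
hockey = {"Male": [5, 4, 5], "Female": [6, 5, 5]}

def proportions(row):
    """Conditional distribution of food preference within one gender."""
    total = sum(row)
    return [n / total for n in row]

def largest_gap(table):
    """Largest male-vs-female difference, in percentage points."""
    male = proportions(table["Male"])
    female = proportions(table["Female"])
    return 100 * max(abs(m - f) for m, f in zip(male, female))

football_gap = largest_gap(football)
hockey_gap = largest_gap(hockey)
print(f"football: {football_gap:.1f} points, hockey: {hockey_gap:.1f} points")
```

A large gap means that knowing a fan's gender changes the predicted food preference a lot; a small gap means it barely helps.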

3. The following table is a set of starting salaries (in $1000s) of a sample of 80 recent college graduates.
a) Complete the column cumulative frequency. Then draw a histogram.
Class interval    Frequency    Cumulative frequency
30 < x ≤ 40       12           12
40 < x ≤ 50       32           32 + 12 = 44
50 < x ≤ 60       16           60
60 < x ≤ 70       8            68
70 < x ≤ 80       4            72
80 < x ≤ 90       4            76
90 < x ≤ 100      4            80
b) Is the mean of this data greater or less than the median? Explain.
Mean > median. Because the distribution is skewed right, the average will be “pulled”
towards the larger values; the mean is not resistant. The median is resistant.
c) Is the median starting salary more than $50,000? Justify your answer.
No. The median starting salary is the average of the 40th and 41st values in the ordered list of
values. From the cumulative frequency table, there are 44 values at or below 50. Therefore,
the median must be $50K or less.
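The cumulative-frequency argument can be sketched in a few lines. The variable names are illustrative, and the position formula assumes an even number of values, as is the case here (n = 80).

```python
from itertools import accumulate

upper_bounds = [40, 50, 60, 70, 80, 90, 100]   # upper edge of each class, in $1000s
frequencies = [12, 32, 16, 8, 4, 4, 4]

# Running totals reproduce the cumulative frequency column.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]

# With n = 80, the median is the average of the 40th and 41st ordered values,
# so find the first class whose cumulative frequency reaches position 41.
upper_position = n // 2 + 1
median_class_upper = next(
    ub for ub, cf in zip(upper_bounds, cumulative) if cf >= upper_position
)
print(cumulative, median_class_upper)
```

Both the 40th and 41st values fall in the class ending at 50, so the median cannot exceed $50K.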
d) One possible statistic to measure skewness is the ratio of the mean to the median, i.e., mean/median.
What value(s) of this statistic might indicate that the distribution of values is symmetric? Explain.
If the mean ≈ median, then the statistic is close to 1 and the distribution is approx.
symmetric.
e) What values of that statistic might indicate that the distribution is skewed to the right? Explain.
When the data is skewed to the right, the median < mean. So, assuming all the data values
are greater than zero, the statistic will be greater than 1 when the data is skewed right.
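To see the ratio behave as described in parts d) and e), here is a small sketch using two made-up data sets. Both lists are hypothetical and chosen only for illustration; they are not the salary data.

```python
import statistics

def skew_ratio(values):
    """Ratio of mean to median; assumes all values are positive."""
    return statistics.mean(values) / statistics.median(values)

symmetric = [30, 40, 50, 60, 70]       # hypothetical, symmetric about 50
skewed_right = [30, 35, 40, 45, 100]   # hypothetical, one large value pulls the mean up

print(skew_ratio(symmetric))      # close to 1
print(skew_ratio(skewed_right))   # greater than 1
```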
4. A consumer group surveyed the prices for a certain item in five different stores and reported the average
price as $15. We visited four of the five stores and found the prices to be $10, $15, $15, and $25.
Assuming that the consumer group is correct, what is the price of the item at the store that we did not
visit?
(a) $10 (b) $15 (c) $20 (d)$25
Mean = (10 + 15 + 15 + 25 + X)/5 = 15. Solving, X = $10.
5. The mean of five numbers is 36. If one of those numbers is 28, then what is the mean of the other four
numbers?
Mean = (a + b + c + d + 28)/5 = 36. Algebraic manipulation yields
a + b + c + d = 180 – 28 = 152, so (a + b + c + d)/4 = 38
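Problems 4 and 5 both rest on the same idea: the total equals the mean times the number of values. A minimal sketch, with the helper name `missing_value` chosen here for illustration:

```python
def missing_value(mean, count, known):
    """Value that makes `count` numbers average to `mean`, given the others."""
    return mean * count - sum(known)

# Problem 4: five stores average $15; four prices are known.
price = missing_value(15, 5, [10, 15, 15, 25])

# Problem 5: five numbers average 36 and one of them is 28.
other_four_mean = (36 * 5 - 28) / 4

print(price, other_four_mean)
```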
6. A distribution of 6 scores has a median of 21. If the highest score increases by 3 points, what will be the
value of the median?
(A) 21 (B) 21.5 (C) 24 (D) 27 (E) cannot be determined with given information
(A) is correct
7. A distribution of 6 scores has a mean of 20. If the highest score increases by 3 points, what will be the
value of the mean?
(A) 20 (B) 20.5 (C) 23 (D) 27 (E) cannot be determined with given information
(B) is correct (Original data: mean = 120/6 = 20; new data: mean = 123/6 = 20.5)
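Problems 6 and 7 can be checked together on a concrete example. The six scores below are hypothetical, picked only because they have a median of 21 and a mean of 20; any such set behaves the same way.

```python
import statistics

scores = [12, 16, 20, 22, 24, 26]   # hypothetical: median 21, mean 20
raised = sorted(scores)
raised[-1] += 3                     # highest score increases by 3 points

before = (statistics.median(scores), statistics.mean(scores))
after = (statistics.median(raised), statistics.mean(raised))
print(before, after)
```

The median depends only on the two middle values, so it stays at 21, while the mean absorbs the extra 3 points spread over 6 scores.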
8. The circumferences of 14 randomly selected trees in the California Redwood Forest were measured. The
results, to the nearest inch, are given below.
80 62 38 81 77 55 77 55 94 10 70 85 40 53
Rearranging to put the numbers in order,
10 38 40 53 55 55 62 70 77 77 80 81 85 94
a) Calculate the following measures of center for this data (and show your work):
Mean: x̄ = Σxᵢ/n = 877/14 = 62.6 inches
Median (calculate by hand): (62 + 70)/2 = 66 inches
b) Calculate the following measures of spread for this data (and show your work):
Sample variance: s² = Σ(xᵢ – x̄)²/(n – 1) = 514.6 inches²
Sample standard deviation: s = √514.6 = 22.7 inches
Interquartile range: IQR = Q3 – Q1 = 80 – 53 = 27 inches
c) Interpret the standard deviation in context.
The circumferences of the trees typically vary by 22.7 inches from the mean of 62.6 inches.
d) What is the five-number summary for this data? Identify each number.
Min = 10 in. Q1 = 53 in. Median = 66 in. Q3 = 80 in. Max = 94 inches
e) Determine whether there are any outliers in this data. Show your work.
IQR = 80 – 53 = 27
Q1 – 1.5·IQR = 53 – 1.5·(27) = 12.5
Q3 + 1.5·IQR = 80 + 1.5·(27) = 120.5
10 < 12.5, so 10 is a low outlier. 94 < 120.5, so there is no high outlier.
f) What do the statistics that you have calculated suggest about the skewness of the distribution? Explain.
Mean (62.6) < Median (66) suggests that the distribution is skewed left.
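The hand calculations for the tree data can be cross-checked with Python's standard `statistics` module. This is a sketch, not the worksheet's method: the helper `quartiles_by_halves` is written here to match the by-hand rule (Q1 and Q3 as medians of the lower and upper halves), because library quartile functions may use a different interpolation rule.

```python
import statistics

trees = [80, 62, 38, 81, 77, 55, 77, 55, 94, 10, 70, 85, 40, 53]

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves
    (the middle value is left out of both halves when n is odd)."""
    s = sorted(values)
    half = len(s) // 2
    return statistics.median(s[:half]), statistics.median(s[-half:])

mean = statistics.mean(trees)
median = statistics.median(trees)
variance = statistics.variance(trees)   # sample variance, divides by n - 1
sd = statistics.stdev(trees)            # sample standard deviation

q1, q3 = quartiles_by_halves(trees)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in trees if x < low_fence or x > high_fence]

print(round(mean, 1), median, round(sd, 1), (q1, q3), outliers)
```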

g) Which summary statistics (the median, mean, standard deviation, interquartile range) should you use
to report the overall circumference of these trees? Explain.
Because the data is skewed to the left and/or there is an outlier, we should use the median
and the interquartile range (“IQR”) as the measures of center and spread, respectively. The
median and IQR are resistant to the influence of outliers; the mean and the standard
deviation are not.

h) Construct a (modified) boxplot of the data using the scale below:

i) Write a few sentences to describe the distribution of circumferences of these trees.


Center: The median tree circumference is 66 inches.
Unusual features: There is one low outlier at 10 inches. (OR There is a gap between 10
inches and the next highest at 38 inches). There are no high outliers.
Shape: The shape of the distribution is skewed left (mean (62.6) < median (66)).
Spread: The interquartile range is Q3–Q1 = 80-53 = 27 inches.
j) Using the data on tree circumference above, (i) calculate the following statistics and decide (ii) whether it
could be used to measure center or spread and (iii) whether it is resistant to the influence of outliers.
(Q1 + Q3)/2 = (53 + 80)/2 = 66.5 in. Center, resistant
(Max – Min)/2 = (94 – 10)/2 = 42 in. Spread, not resistant
(Max + Min)/2 = (94 + 10)/2 = 52 in. Center, not resistant
k) One of the circumferences was 10 inches. If the circumference of this tree was 30 inches, what effect
would the increase have had on the following statistics? Justify your answer.
The mean: If the 10-inch circumference was 30 inches, the mean would
increase by 20 divided by 14 or 1.4 inches to 64 inches.
The median: Here, the median would not change. The current median is
between 62 and 70, and both 10 and 30 are less than those values.
9. Here is the data on the number of goals scored per season by two hockey players.

Tarzan: 12, 12, 17, 22, 25, 27, 28, 29, 31, 35, 36, 42, 47
Jane: 17, 17, 18, 19, 20, 20, 22, 23, 24, 25, 28, 29, 30, 35

a) Display the data in a back-to-back stemplot.

Tarzan         Jane
    22 | 10L |
     7 | 10H | 7789
     2 | 20L | 00234
  9875 | 20H | 589
     1 | 30L | 0
    65 | 30H | 5
     2 | 40L |
     7 | 40H |
Key: 2|20L|0 means 22 goals for Tarzan and 20 goals for Jane.

b) Write a few sentences comparing the distributions.
Center: Tarzan tended to score more goals/season than Jane. I will use
median and IQR to compare the centers and spreads, because Jane’s data
is skewed. The median for Tarzan (28 goals/season) is greater than that for Jane (22.5).
Unusual features: Neither distribution appears to have any outliers or significant gaps.
(Student must show work to show no outliers in fact.)
Spread: Tarzan’s data shows more variability. The IQR for Tarzan (35-22 = 13 goals/season)
is greater than that for Jane (27.25 – 19.25 = 8).
Shape: Tarzan’s distribution is approx. symmetric, while Jane’s distribution is skewed to the
right (median (22.5) < mean (23.4)).

c) Based on the stemplot, give one reason you might choose Tarzan to be on your hockey team.
Tarzan is likely to score more goals than Jane. Tarzan has a higher first quartile, median,
mean, third quartile and maximum.
d) In any given year, which hockey player can you count on to consistently score goals close to his or her
season average? Explain.
Jane. Jane’s results vary less about her average; she has a lower standard deviation and IQR.
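The summary statistics quoted for the two players can be reproduced as follows. One caveat, stated as an observation rather than the worksheet's stated method: the IQRs quoted above (13 and 8) match quartiles computed by linear interpolation, which is what `statistics.quantiles(..., method="inclusive")` does. Taking medians of the lower and upper halves by hand gives slightly different quartiles, but the comparison between the players comes out the same.

```python
import statistics

tarzan = [12, 12, 17, 22, 25, 27, 28, 29, 31, 35, 36, 42, 47]
jane = [17, 17, 18, 19, 20, 20, 22, 23, 24, 25, 28, 29, 30, 35]

def summary(goals):
    """Median, quartiles (linear interpolation), IQR and sample SD."""
    q1, _, q3 = statistics.quantiles(goals, n=4, method="inclusive")
    return {
        "median": statistics.median(goals),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "sd": statistics.stdev(goals),
    }

tarzan_summary = summary(tarzan)
jane_summary = summary(jane)
print(tarzan_summary)
print(jane_summary)
```

Jane's smaller IQR and standard deviation are what support calling her the more consistent scorer.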

10. A distribution of 6 scores has a mean of 20 and a standard deviation of 3. Suppose we added a score of 15
to the distribution. What effect would including this observation have on the mean and standard
deviation? Explain.
The mean would decrease because 15 is less than the current mean.
The standard deviation would increase because 15 is farther from the mean (5 points) than the
current “typical” distance (3) from the mean to the observations in the distribution.
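The effect can also be worked out from the summary statistics alone, without knowing the six scores. This sketch assumes the stated standard deviation of 3 is the sample standard deviation (dividing by n – 1); with the population version the numbers shift slightly but the direction of the change is the same.

```python
import math

n, mean, sd = 6, 20, 3   # summary from the problem; sd assumed to be the sample SD
new_score = 15

# New mean: old total plus the new score, over one more observation.
new_mean = (n * mean + new_score) / (n + 1)

# Sum of squared deviations before and after adding the new score.
ss_old = (n - 1) * sd ** 2
ss_new = ss_old + n / (n + 1) * (new_score - mean) ** 2

# Sample SD of the 7 scores divides by (n + 1) - 1 = n.
new_sd = math.sqrt(ss_new / n)

print(round(new_mean, 2), round(new_sd, 2))
```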

Warmup Questions

1. Given the set of data { 8 6 5 5 4 2 }, demonstrate that adding a constant (e.g., 5) to every score
increases the mean and median by that amount.
Original data rearranged: {2 4 5 5 6 8}
Mean = ∑x/n = 30/6 = 5
Median = 5
New data after adding 5 to each score: {7 9 10 10 11 13}
Mean = ∑x/n = 60/6 = 10
Median = 10

2. Same facts as stated in the previous problem. What will happen to the first quartile and third quartile?
(a) Be unchanged
(b) Be multiplied by 5
(c) Increase by 5 (correct)
(d) Increase by √5

3. Same facts as stated in the previous problem. What will happen to the interquartile range?
(a) Be unchanged (correct)
(b) Be multiplied by 5
(c) Increase by 5
(d) Increase by √5

4. Given the same set of data { 8 6 5 5 4 2 }, show that multiplying each score by a constant multiplies the
mean and median by that constant.
Original data rearranged: {2 4 5 5 6 8}
Mean = ∑x/n = 30/6 = 5
Median = 5
New data after multiplying each score by 5: {10 20 25 25 30 40}
Mean = ∑x/n = 150/6 = 25
Median = 25
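Warmup questions 1 through 4 can be verified together. In this sketch the helper `describe` is an illustrative name, and quartiles are taken as medians of the lower and upper halves of the sorted data.

```python
import statistics

scores = [8, 6, 5, 5, 4, 2]

def describe(values):
    """Mean, median, quartiles (medians of the halves) and IQR."""
    s = sorted(values)
    half = len(s) // 2
    q1 = statistics.median(s[:half])
    q3 = statistics.median(s[-half:])
    return {
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

original = describe(scores)
shifted = describe([x + 5 for x in scores])   # add 5 to every score
scaled = describe([x * 5 for x in scores])    # multiply every score by 5

for label, stats in (("original", original), ("shifted", shifted), ("scaled", scaled)):
    print(label, stats)
```

Adding a constant moves the mean, median and quartiles by that constant and leaves the IQR alone; multiplying by a constant multiplies all of them, including the IQR.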

5. The figure on the right is the density curve of a distribution. Which of the
following statements is true?
(A) The mean will be higher than the median because the distribution is skewed left.
(B) The median will be higher than the mean because the distribution is skewed left.
(C) The mean will be higher than the median because the distribution is skewed right.
(D) The median will be higher than the mean because the distribution is skewed right.
(E) The mean and median will be approximately the same because the distribution is symmetrical.

6. The scores on a statistics exam are strongly skewed to the left and have a very wide range. Which of the
following would be the best method for describing the distribution? Explain your answer choice.
(a) Median and interquartile range
(b) Mean and standard deviation

Notes on old AP exam FRQs
2000 3 I Graphing and comparing frequency distributions Flexibility

Display the data graphically so that blank and blank can be compared.

Based on an examination of your graphical display, write a few sentences comparing the blank with the blank.

2001 1 I Outliers Rain

Are there any outliers in this data? Justify your answer.


Student/F
2002 Form B 5 I making graphs and comparing distributions course tim

Use the same scale to draw boxplots for the blank and blank.

Write a few sentences comparing the variability of the two distributions.

You have been asked to report on this for a school newspaper. Write a few sentences describing the student and faculty
performance in this competition for the newspaper.

2004 1 I Boxplots, outliers, properties of boxplots, shape/center Gasoline a

On the grid below, draw parallel boxplots (showing outliers if any) of the differences of the two additives.

Which additive, A or B, would you recommend if the goal is to increase gas mileage in the highest proportion of cars?
Explain your choice.

…. If your goal is to have the highest mean increase in gas mileage? Explain your choice.

2005 Form B 1 I Shape, center, spread of a distribution Test score

Based on the stemplot, describe the shape of distribution.

Which summary statistic, the mean or the median, should the instructor use to report that overall exam performance
was high? Explain.

The midrange is defined as …. Compute this value using the data on the previous page.

Is the midrange considered a measure of center or a measure of spread? Explain.

2006 1 I Comparing distributions, variability, center Catapults


Comment on any similarities and any differences in the two distributions.

If a parent wants to maximize the probability of having a ping-pong ball land within a band, which of the two catapults
would be better to use? Justify your choice.

2007 Form B 1 I Stemplots, describing distributions Economic


Display these data in a stemplot.

Use your stemplot to describe the main features of this score distribution.

Why would it be misleading to report only a measure of center for this score distribution?
2010 Form B 1 I Comparing boxplots, constructing a stemplot, stemplot vs. boxplot Polluted ri

(Parallel boxplots shown) Compare the distributions of the concentration of aldrin among the three rivers.

(Give data) Construct a stemplot that displays the concentrations of aldrin for River X.

Describe a characteristic of the distribution of aldrin concentrations in River X that can be seen in the stemplot but cannot
be seen in the boxplot.

2011 Form B 1 I Estimating median from hist., comparing hist., mean vs. median Pupil-teac

(a) Describe how you would use the histograms to estimate the median P-T ratio for each group (west and east) of states.
Then use this procedure to estimate the median of the west group and the median of the east group.

(b) Write a few sentences comparing the distributions of P-T ratios for states in the two groups (west and east) during the
2001–2002 school year.

(c) Using your answers in parts (a) and (b), explain how you think the mean P-T ratio during the 2001–2002 school year
will compare for the two groups (west and east).

2015 1 I Comparing boxplots, making decisions using boxplots Accountan


Two boxplots shown. Write a few sentences comparing the distributions of the yearly salaries at the two corporations.

(b) Suppose both corporations offered you a job for $36,000 a year as an entry-level accountant.

(i) Based on the boxplots, give one reason why you might choose to accept the job at corporation A.

(ii) Based on the boxplots, give one reason why you might choose to accept the job at corporation B.

2016 1 I Describing a distribution; effect of changing a value on mean, median. Robin's tip
Histogram shown.

Write a few sentences to describe the distribution of tip amounts for the day shown.

One of the tip amounts was $8. If the $8 tip had been $18, what effect would the increase have had on the following
statistics? Justify your answer.

The mean:

The median:

2017 4 I Comparing boxplots, using boxplots to classify Pottery ch


Given a set of parallel boxplots, describe how the percents found in the pieces of pottery are similar and how they differ
among the three cities.

Consider a piece of pottery known to have originated at one of the three sites, but the actual site is not known.

Suppose an analysis of the clay reveals that the sum of the percents of the three chemicals X, Y, and Z is 20.5%. Based on
the boxplots, which site (I, II, or III) is the most likely site where the piece of pottery originated? Justify your choice.

Suppose only one chemical could be analyzed in the piece of pottery. Which chemical—X, Y or Z—would be the most
useful in identifying the site where the piece of pottery originated? Justify your choice.

2018 5 I Medians from histograms; mean from combined sample; mean +/- SD on histogram Teachers
Teaching year is recorded as an integer, with first-year teachers recorded as 1, second-year teachers recorded as 2, and so
on. Both sets of data have a mean teaching year of 8.2, with data recorded from 200 teachers at High School A and 221
teachers at High School B. On the histograms, each interval represents possible integer values from the left endpoint up
to but not including the right endpoint.

(a) The median teaching year for one high school is 6, and the median teaching year for the other high school is 7.
Identify which high school has each median and justify your answer.
(b) An additional 18 teachers were not included with the data recorded from the 200 teachers at High School A. The
mean teaching year of the 18 teachers is 2.5. What is the mean teaching year for all 218 teachers at High School
A?
(c) The standard deviation of the teaching year for the 221 teachers at High School B is 7.2. If one teacher is selected at
random from High School B, what is the probability that the teaching year for the selected teacher will be within 1
standard deviation of the mean of 8.2? Justify your answer.
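Part (b) of the 2018 question is a combined-sample mean, where each group's mean is weighted by its size. A minimal sketch using only the figures quoted in the prompt; the helper name `combined_mean` is illustrative.

```python
def combined_mean(groups):
    """Mean of pooled groups, given (size, mean) pairs."""
    total = sum(size * mean for size, mean in groups)
    count = sum(size for size, _ in groups)
    return total / count

# 200 teachers with mean 8.2, plus 18 additional teachers with mean 2.5.
school_a_mean = combined_mean([(200, 8.2), (18, 2.5)])
print(round(school_a_mean, 2))
```

Because the 18 added teachers are a small group, the pooled mean moves only modestly below 8.2 rather than landing halfway between the two means.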
