DS Assignment 2
Grade Breakdown:
Part 1: Written Answer
Question 1: 12
Question 2: 13
Question 3: 10
Question 4: 10
Part 1 Total Points = 45
Part 2: Python
Question 5: 8
Question 6: 11
Question 7: 16
Part 2 Total Points = 35
Overall Total Points: 80
Part 1 – Written Answer (round answers to 2 decimal places)
Question 1 [12 Points]
The height of adult males in a certain population is approximately Normal, with a mean of 1.75
metres and a standard deviation of 0.1 metres. For each part below, sketch a Normal curve and
shade in the area representing the proportion being calculated.
a. [4 Points] What proportion of adult males have heights greater than 1.75 metres?
c. [4 Points] How tall must a male be to be in the top 2.5 % of adult males?
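Although Part 1 is meant to be worked by hand, the Normal calculations in this question can be checked numerically. The sketch below is one possible check, assuming Python's standard-library `statistics.NormalDist`; the assignment does not prescribe this tool, and the variable names are illustrative.

```python
from statistics import NormalDist

# Heights of adult males: Normal with mean 1.75 m and standard deviation 0.1 m.
heights = NormalDist(mu=1.75, sigma=0.1)

# Part a: proportion taller than 1.75 m (the upper-tail area at the mean).
prop_above_mean = 1 - heights.cdf(1.75)

# Part c: height that cuts off the top 2.5 %, i.e. the 97.5th percentile.
top_cutoff = heights.inv_cdf(0.975)

print(round(prop_above_mean, 2))
print(round(top_cutoff, 2))
```

A check like this does not replace the sketched and shaded Normal curve that each part asks for.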
Question 2 [13 Points]
The table below gives the self-reported ages of 10 first-born children (Child’s age) along with
the ages of their mothers (Mother’s age), both in years.
a. [6 Points] Draw, by hand, a scatterplot for this dataset. Comment on the direction,
form, and strength of this relationship.
b. [5 Points] Find the correlation between the mother’s and child’s age. Show all your
work.
c. [2 Points] Do the value and the sign (positive or negative) of the correlation in part (b)
make sense based on the scatterplot from part (a)? Briefly explain.
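The hand calculation in part (b) follows the usual sample correlation formula, which can be sketched in code as a way to verify the arithmetic. The ages below are hypothetical placeholders, not the values from the assignment's table, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt

# Hypothetical ages in years (NOT the assignment's data).
child_age = [1, 3, 5, 7, 9]
mother_age = [24, 29, 30, 35, 37]

def pearson_r(x, y):
    """Sample correlation: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]   # deviations from the mean of x
    dy = [yi - mean_y for yi in y]   # deviations from the mean of y
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / sqrt(sxx * syy)

r = pearson_r(child_age, mother_age)
print(round(r, 2))
```

The intermediate lists `dx` and `dy` correspond to the columns of deviations one would write out when showing the work by hand.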
Question 3 [10 Points]
A study investigates the relationship between students' daily study hours and final exam
scores. The students in the study have an average daily study time (x) of approximately 6
hours, with a standard deviation of about 1 hour. The average final exam score (y) is around 75
points, with a standard deviation of about 10 points. Suppose the correlation between daily
study hours and final exam scores is r=0.6.
a. [2 Points] Calculate the slope and intercept of the regression line of final exam scores on
daily study hours. Explain the meaning of the slope in the context of this study.
b. [1 Point] Using the calculated regression line equation, predict the final exam score for a
student who studies 6 hours daily.
c. [4 Points] Draw, by hand, the regression line representing the relationship between daily
study hours and final exam scores for study times ranging from 2 to 6 hours. On the graph,
mark the predicted exam score for a student who studies 6 hours daily.
e. [1 Point] Calculate the percentage of the final exam score variation explained by the
straight-line relationship with daily study hours.
f. [2 Points] Briefly discuss whether you expect the prediction made in part c to be accurate
and explain your reasoning.
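The regression quantities in this question depend only on the summary statistics given, so they can be checked with a few lines of arithmetic. The sketch below assumes the standard least-squares relations (slope = r·s_y/s_x, with the line passing through the point of means); the names `predict` and `percent_explained` are illustrative.

```python
# Summary statistics quoted in the question.
mean_x, sd_x = 6.0, 1.0     # daily study hours
mean_y, sd_y = 75.0, 10.0   # final exam score
r = 0.6

# Least-squares line of y on x: slope = r * s_y / s_x, and the line
# passes through the point of means (mean_x, mean_y).
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

def predict(hours):
    """Predicted exam score for a given number of daily study hours."""
    return intercept + slope * hours

# Share of the variation in y explained by the line, as a percentage.
percent_explained = 100 * r ** 2

print(round(slope, 2), round(intercept, 2))
print(round(predict(6), 2))
print(round(percent_explained, 2))
```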
Question 4 [10 Points]
A study investigates the relationship between the number of hours of sleep a person gets per
night and their job performance score. Based on the observations in the table below, assume
that you have already calculated the regression line to be: y = 5.17x + 42.06.
b. [4 Points] Draw a residual plot. What does this plot tell us about the regression
model?
c. [2 Points] What is an influential observation? Do you think this dataset contains any?
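A residual is the observed value minus the value predicted by the regression line, and a residual plot charts these residuals against x. The sketch below computes residuals for the line given in the question; the (sleep hours, performance score) pairs are hypothetical placeholders, not the assignment's table.

```python
# Regression line given in the question: y-hat = 5.17x + 42.06.
SLOPE, INTERCEPT = 5.17, 42.06

# Hypothetical (sleep hours, performance score) pairs, NOT the assignment's data.
observations = [(5, 70), (6, 72), (7, 80), (8, 83), (9, 88)]

def residual(x, y):
    """Observed value minus the value predicted by the given line."""
    return y - (SLOPE * x + INTERCEPT)

residuals = [residual(x, y) for x, y in observations]

# A residual plot would chart these values against x; here they are listed.
for (x, _), e in zip(observations, residuals):
    print(x, round(e, 2))
```

Residuals scattered around zero with no visible pattern are what a well-fitting straight-line model would be expected to produce.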
Part 2 – Python
Question 5 [8 points]
The length of the fin in a population of a specific type of fish is approximately normal, with
mean 50 cm and standard deviation 29 cm.
a. [2 points] What proportion of fish have fin length less than 2.9 cm?
b. [2 points] What proportion of fish have fin length greater than 150 cm?
c. [2 points] What proportion of fish have fin length between 79 cm and 89 cm?
d. [2 points] What value of fin length gives a 15% proportion of fish above it?
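The four parts above map onto three kinds of Normal calculation: a lower-tail area, an upper-tail area or difference of areas, and an inverse lookup. A minimal sketch follows, assuming the standard-library `statistics.NormalDist`; the course labs may use a different library (for example, `scipy.stats.norm` offers `cdf`, `sf`, and `ppf` for the same purposes).

```python
from statistics import NormalDist

# Fin length: approximately Normal with mean 50 cm and standard deviation 29 cm.
fin = NormalDist(mu=50, sigma=29)

# Part a: lower-tail area below 2.9 cm.
prop_below = fin.cdf(2.9)

# Part b: upper-tail area above 150 cm.
prop_above = 1 - fin.cdf(150)

# Part c: area between 79 cm and 89 cm.
prop_between = fin.cdf(89) - fin.cdf(79)

# Part d: 15% of fish lie above this value, so it is the 85th percentile.
cutoff = fin.inv_cdf(0.85)

for value in (prop_below, prop_above, prop_between, cutoff):
    print(round(value, 4))
```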
Question 6 [11 points]
Simulation studies play an important role in statistics and data science. They provide a
fundamental tool for studying the properties of statistical estimators and models under various
situations because, in a simulation study, we know the “ground truth” of the data, which we
seldom have in real-world data analysis. In this question, we are going to perform a basic
simulation study related to the standard normal distribution.
a. [2 points] Compute the proportion of values smaller than two standard deviations above
the mean for a standard normal distribution.
b. [5 points] Write a function called `stats_normal` with argument `n` to perform the
following task: (Hint: review what we have learned in Lab 2)
1. Generate a sample from standard normal with a sample size equal to n.
2. Compute the mean of the sample.
3. Compute the standard deviation of the sample.
4. Compute the proportion of values in the sample that are smaller than two sample
standard deviations above the sample mean.
5. Return a dictionary (Recall Lab 1) containing the three numerical quantities
above.
c. [2 points] Set the random seed as 5. Run the function `stats_normal` with n = 100, 1000,
10000. (Hint: you can simply write three lines of code. It is optional to run a loop.)
d. [2 points] What kind of pattern can you see from the results in part c?
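The steps in parts a to c can be sketched as follows. This is one possible shape for `stats_normal`, not the expected solution: it assumes the standard-library `random` and `statistics` modules (Lab 2 may use a different library, in which case the seeded values will differ), and the dictionary keys are illustrative choices.

```python
import random
from statistics import NormalDist, fmean, stdev

# Part a: theoretical proportion below (mean + 2 SD) for a standard normal.
theoretical = NormalDist().cdf(2)

def stats_normal(n):
    """Draw n standard-normal values and summarise them in a dictionary."""
    sample = [random.gauss(0, 1) for _ in range(n)]
    mean = fmean(sample)
    sd = stdev(sample)                      # sample SD (n - 1 denominator)
    threshold = mean + 2 * sd
    proportion = sum(x < threshold for x in sample) / n
    return {"mean": mean, "sd": sd, "proportion": proportion}

# Part c: seed once, then run the three sample sizes.
random.seed(5)
results = {n: stats_normal(n) for n in (100, 1000, 10000)}
for n, summary in results.items():
    print(n, summary)

print("theoretical proportion:", round(theoretical, 4))
```

Printing the theoretical value next to the simulated ones makes it easier to compare the three sample sizes when answering part d.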
Question 7 [16 points]
The dataset (waterlev.csv) records water level measurements from January 1, 2000, to August
27, 2019. It includes two main fields: the date of the measurement and the mean water gauge
height in feet. Each row represents a measurement taken on a specific date.
a. [3 points] Create a histogram of the mean water gauge heights with the density curve
estimate on the same plot. Comment on the shape of the distribution.
b. [2 points] Calculate the mean (x̄) and standard deviation (s) of the mean water gauge
heights.
c. [4 points] Using the results of part b, follow the steps below:
1. Find the number of data points with mean water gauge heights between x̄ − s
and x̄ + s.
2. Divide this number by the total number of data points in the dataset to obtain a
proportion.
3. Compare this proportion to the proportion between μ − σ and μ + σ given by
the normal density curve. Which one is larger?
4. Based on your results, does there appear to be a departure from the normal
distribution? (Hint: for the tail part of the distribution, is it heavier/lighter than
normal?)
d. [7 points] Follow the steps below:
1. Make a boxplot of the mean water gauge heights.
2. Generate a sample from a normal distribution with mean and standard deviation
equal to the ones computed in part b. The sample size should be equal to the
total number of points in the dataset. Set the random seed as 2023. (Hint: use
`numpy.random.normal`, the Python counterpart of R's `rnorm`.)
3. Make a boxplot of this normal sample.
4. Set the random seed as 2022. Redo steps 2 and 3.
5. By comparing these three boxplots, do they support your findings in part c
(regarding whether there is a departure from the normal distribution)? Why?
(Hint: pay attention to the outliers. We draw a second boxplot for a normal
sample to reduce the effect of randomness, so that you can see the general
outlier pattern of a normal distribution.)
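The numerical side of parts c and d can be sketched without plotting. The helpers below are illustrative names, and `MEAN`, `SD`, and `N` are placeholder values standing in for the results of part b (they are not computed from waterlev.csv). The outlier count assumes the common boxplot convention of flagging points more than 1.5 × IQR beyond the quartiles; the quartile method used here may differ slightly from the one a plotting library applies.

```python
import random
from statistics import NormalDist, fmean, quantiles, stdev

def proportion_within_one_sd(values):
    """Part c: share of values between (mean - s) and (mean + s)."""
    m, s = fmean(values), stdev(values)
    return sum(m - s <= v <= m + s for v in values) / len(values)

def count_boxplot_outliers(values):
    """Part d: number of points beyond the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(v < low or v > high for v in values)

# Benchmark: proportion within one SD under the normal density curve.
normal_benchmark = NormalDist().cdf(1) - NormalDist().cdf(-1)
print("normal benchmark:", round(normal_benchmark, 4))

# Placeholder values (assumptions), NOT the real part b results.
MEAN, SD, N = 10.0, 2.0, 5000

# Two normal samples with different seeds, as in steps 2 to 4 of part d.
for seed in (2023, 2022):
    random.seed(seed)
    sample = [random.gauss(MEAN, SD) for _ in range(N)]
    print(seed,
          round(proportion_within_one_sd(sample), 3),
          count_boxplot_outliers(sample))
```

Applying the same two helpers to the water gauge heights gives figures that can be set against those of the simulated normal samples.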