Bacs HW2
112078516
2024-02-27
Question 1) Let’s have a look at how the mean and median behave.
[Figure: “Distribution 1” density plot; x from 0 to 60, y (Density) up to 0.04; right-skewed.]
(a) Create and visualize a new “Distribution 2”: a combined dataset (n=800) that is negatively
skewed (tail stretches to the left). Change the mean and standard deviation of d1, d2, and d3
to achieve this new distribution. Compute the mean and median, and draw lines showing the
mean (solid) and median (dashed).
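A sketch of one way to build Distribution 2, assuming d1, d2, and d3 are rnorm() components as in Distribution 1; the seed and the particular means/sds below are illustrative choices that produce a left tail:

```r
# Illustrative parameters: most mass on the right, smaller clusters
# stretching to the left to create negative skew.
set.seed(42)
d1 <- rnorm(n = 500, mean = 45, sd = 5)   # bulk of the data, on the right
d2 <- rnorm(n = 200, mean = 30, sd = 5)   # shifted left
d3 <- rnorm(n = 100, mean = 15, sd = 5)   # far left: stretches the tail
d123 <- c(d1, d2, d3)                     # combined dataset, n = 800

plot(density(d123), main = "Distribution 2")
abline(v = mean(d123), lty = "solid")     # mean
abline(v = median(d123), lty = "dashed")  # median
```

In a negatively skewed distribution the mean is pulled toward the left tail, so it lands below the median.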
[Figure: “Distribution 2” density plot; x from 0 to 60, y (Density) up to 0.04; mean (solid) and median (dashed) lines; left-skewed.]
(b) Create a “Distribution 3”: a single dataset that is normally distributed (bell-shaped,
symmetric) – you do not need to combine datasets, just use the rnorm() function to create a
single large dataset (n=800). Show your code, compute the mean and median, and draw lines
showing the mean (solid) and median (dashed).
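A sketch for Distribution 3: one normal sample of n = 800. The mean (35), sd (5), and seed are illustrative choices; any values give the symmetric bell shape the question asks for.

```r
set.seed(42)
d3_single <- rnorm(n = 800, mean = 35, sd = 5)

mean(d3_single)    # mean and median nearly coincide
median(d3_single)  # in a symmetric distribution

plot(density(d3_single), main = "Distribution 3")
abline(v = mean(d3_single), lty = "solid")     # mean
abline(v = median(d3_single), lty = "dashed")  # median
```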
[Figure: “Distribution 3” density plot; x from 20 to 40, y (Density) up to 0.08; mean (solid) and median (dashed) lines; symmetric.]
(c) In general, which measure of central tendency (mean or median) do you think will be more
sensitive (will change more) to outliers being added to your data?
## 75%
## 10.73876
1.5 × IQR is approximately 10. Therefore, values drawn with a mean far greater than 10 (and standard deviation
5) are reasonable outliers to add to the original distribution.
# added outlier to d1
outlier <- rnorm(n=10, mean=100, sd=5)
d1_with_out <- c(d1, outlier)
# calculate the mean and median when outliers not being added to data
origin_mean <- mean(d1)
origin_median <- median(d1)
# calculate the mean and median after outliers are added
mean_with_outlier <- mean(d1_with_out)
median_with_outlier <- median(d1_with_out)
mean_dif <- abs(mean_with_outlier - origin_mean)
median_dif <- abs(median_with_outlier - origin_median)
mean_dif
## [1] 0.87166
median_dif
## [1] 0.05931095
First, after adding the smaller outliers (mean = 100, far greater than 10), the difference between the means
is clearly larger than the difference between the medians.
# added outlier to d1
outlier <- rnorm(n=10, mean=1000, sd=5)
d1_with_out <- c(d1, outlier)
# calculate the mean and median when outliers not being added to data
origin_mean <- mean(d1)
origin_median <- median(d1)
# calculate the mean and median after outliers are added
mean_with_outlier <- mean(d1_with_out)
median_with_outlier <- median(d1_with_out)
mean_dif <- abs(mean_with_outlier - origin_mean)
median_dif <- abs(median_with_outlier - origin_median)
mean_dif
## [1] 11.96454
median_dif
## [1] 0.05931095
After adding the larger outliers (mean = 1000, far greater than 10), the difference between the means is far
larger than the difference between the medians.
This shows that the mean is more sensitive to outliers (changes more) than the median. Moreover,
the larger the outliers, the greater their impact on the mean of the data.
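The same conclusion can be seen in a tiny hand-checkable example (values chosen for illustration): one extreme value moves the mean of a small sample dramatically, while the median barely moves.

```r
x <- c(10, 11, 12, 13, 14)
c(mean = mean(x), median = median(x))          # both are 12

x_out <- c(x, 1000)                            # add one extreme value
c(mean = mean(x_out), median = median(x_out))  # mean jumps to ~176.67, median only to 12.5
```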
Question 2) Let’s try to get some more insight about what standard deviations
are.
(a) Create a random dataset (call it rdata) that is normally distributed with: n=2000, mean=0,
sd=1. Draw a density plot and put a solid vertical line on the mean, and dashed vertical lines
at the 1st, 2nd, and 3rd standard deviations to the left and right of the mean. You should
have a total of 7 vertical lines (one solid, six dashed).
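A sketch of part (a) in base R graphics (the seed is an illustrative choice): a density plot of rdata with a solid line at the mean and dashed lines at ±1, ±2, and ±3 standard deviations.

```r
set.seed(42)
rdata <- rnorm(n = 2000, mean = 0, sd = 1)
m <- mean(rdata)
s <- sd(rdata)

plot(density(rdata), main = "rdata")
abline(v = m, lty = "solid")                          # the mean
abline(v = m + c(-3, -2, -1, 1, 2, 3) * s,            # six dashed lines
       lty = "dashed")
```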
[Figure: density plot of rdata; x from −4 to 4, y (Density) up to 0.3, with the seven vertical lines.]
(b) Using the quantile() function, which data points correspond to the 1st, 2nd, and 3rd
quartiles (i.e., 25th, 50th, 75th percentiles) of rdata? How many standard deviations away
from the mean (divide by standard-deviation; keep positive or negative sign) are those points
corresponding to the 1st, 2nd, and 3rd quartiles?
quantile(rdata, c(0.25,0.5,0.75))
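The standardized distances can then be computed by subtracting the mean and dividing by the standard deviation, keeping the sign (a self-contained sketch, re-creating rdata with an illustrative seed). For a standard normal sample, Q1 and Q3 land near ∓0.67.

```r
set.seed(42)
rdata <- rnorm(n = 2000, mean = 0, sd = 1)

q <- quantile(rdata, c(0.25, 0.5, 0.75))  # the three quartiles
(q - mean(rdata)) / sd(rdata)             # distances in sd units, signed
```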
(c) Now create a new random dataset that is normally distributed with: n=2000, mean=35,
sd=3.5. In this distribution, how many standard deviations away from the mean (use positive
or negative) are those points corresponding to the 1st and 3rd quartiles? Compare your answer
to (b)
rdata2 <- rnorm(n=2000, mean=35, sd=3.5)
rdata2Q1 <- unname(quantile(rdata2, 0.25))
rdata2Q3 <- unname(quantile(rdata2, 0.75))
(rdata2Q1-35)/3.5
## [1] -0.6723922
(rdata2Q3-35)/3.5
## [1] 0.6206003
Compared to the answer in (b), there is approximately no change. Our objective is to determine the distance
of Q1 and Q3 from the center of the distribution. Calculating “how many standard deviations away from
the mean” involves normalization.
Since both (b) and (c) originate from a normal distribution, the distances of Q1 and Q3 from the center
should be approximately the same.
(d) Finally, recall the dataset d123 shown in the description of question 1. In that distribution,
how many standard deviations away from the mean (use positive or negative) are those data
points corresponding to the 1st and 3rd quartiles? Compare your answer to (b)
# compute all three quartiles at once
q <- unname(quantile(d123, c(0.25, 0.5, 0.75)))
d123Q1 <- q[1]  # 1st quartile
d123Q2 <- q[2]  # 2nd quartile (median)
d123Q3 <- q[3]  # 3rd quartile
(d123Q1-mean(d123))/sd(d123)
## [1] -0.7331399
(d123Q3-mean(d123))/sd(d123)
## [1] 0.643898
Q1 is approximately 0.73 standard deviations below the mean, and Q3 is approximately 0.64 standard
deviations above it.
Compared with the result in (b), the result has changed. This is because d123 is positively skewed (tail
stretches to the right), which pulls the mean toward the tail; as a result, Q1 lies farther from the mean (in
standard-deviation units) than Q3 does.
Number of bins from bin width:
k = ceiling((max(d) − min(d)) / h)
and bin width from number of bins:
h = (max(d) − min(d)) / k
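The two conversions can be written as one-line helpers (the function names are my own, for illustration):

```r
# number of bins from a chosen bin width
bins_from_width <- function(d, h) ceiling((max(d) - min(d)) / h)

# bin width from a chosen number of bins
width_from_bins <- function(d, k) (max(d) - min(d)) / k
```

For example, for d = 0:10 (range 10), a bin width of 3 needs ceiling(10/3) = 4 bins, and 5 bins give a width of 2.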
Now, read this discussion on the Q&A forum “Cross Validated” about choosing the number of bins.
(a) From the question on the forum, which formula does Rob Hyndman’s answer (1st answer)
suggest to use for bin widths/number? Also, what does the Wikipedia article say is the benefit
of that formula?
Rob Hyndman’s answer suggests using the Freedman–Diaconis rule, whose bin-width formula is
h = 2 × IQR(x) / n^(1/3)
The Wikipedia article says the benefit of the Freedman–Diaconis rule is that the IQR is less sensitive than
the standard deviation to outliers in the data.
The Freedman–Diaconis rule is designed to roughly minimize the integral of the squared difference between
the histogram and the density of the theoretical probability distribution.
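As an aside, base R already ships these three rules as nclass.Sturges(), nclass.scott(), and nclass.FD() (and hist() accepts them via breaks = "Sturges", "Scott", or "FD"). A quick check on an illustrative sample:

```r
set.seed(42)
x <- rnorm(800)

# number of bins suggested by each rule
c(sturges = nclass.Sturges(x),
  scott   = nclass.scott(x),
  fd      = nclass.FD(x))
```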
i. Sturges’ formula
k = ceiling(log2(n)) + 1
k1 = log2(800) +1
h1 = (max(rand_data)-min(rand_data))/k1
data.frame(
k = k1,
h = h1
)
## k h
## 1 10.64386 3.160486
ii. Scott’s normal reference rule
h2 = 3.49*sd(rand_data)/(800^(1/3))
k2 = ceiling((max(rand_data) - min(rand_data))/h2)
data.frame(
k = k2,
h = h2
)
## k h
## 1 18 1.90383
iii. Freedman–Diaconis’ choice
h3 = 2*IQR(rand_data)/(800)^(1/3)
k3 = ceiling((max(rand_data) - min(rand_data))/h3)
data.frame(
k = k3,
h = h3
)
## k h
## 1 24 1.433148
(c) Repeat part (b) but let’s extend rand_data dataset with some outliers (creating a new
dataset out_data):
out_data <- c(rand_data, runif(10, min=40, max=60))
From your answers above, in which of the three methods does the bin width (h) change the least when outliers
are added (i.e., which is least sensitive to outliers), and (briefly) WHY do you think that is?
i. Sturges’ formula
k1_o = log2(800) +1
h1_o = (max(out_data)-min(out_data))/k1_o
data.frame(
k = k1_o,
h = h1_o
)
## k h
## 1 10.64386 5.376344
ii. Scott’s normal reference rule
h2_o = 3.49*sd(out_data)/(800^(1/3))
k2_o = ceiling((max(out_data) - min(out_data))/h2_o)
data.frame(
k = k2_o,
h = h2_o
)
## k h
## 1 25 2.292732
iii. Freedman–Diaconis’ choice
h3_o = 2*IQR(out_data)/(800)^(1/3)
k3_o = ceiling((max(out_data) - min(out_data))/h3_o)
data.frame(
k = k3_o,
h = h3_o
)
## k h
## 1 40 1.463922
Calculate the difference between the original bin widths (h) and the bin widths with outliers (h_o):
abs(h1-h1_o)
## [1] 2.215859
abs(h2-h2_o)
## [1] 0.3889017
abs(h3-h3_o)
## [1] 0.03077374
The Freedman–Diaconis choice changes the least when outliers are added.
For Sturges’ formula, the range (max(d) − min(d)) increases when outliers are added, so h increases.
For Scott’s normal reference rule, the standard deviation σ̂ increases when outliers are added, so h increases.
The Freedman–Diaconis choice uses the IQR, which is far less sensitive to outliers than either the range or
the standard deviation.