Bacs HW2

HW2
112078516
2024-02-27
Question 1) Let’s have a look at how the mean and median behave.
# Three normally distributed data sets

d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
# Combining them into a composite dataset

d123 <- c(d1, d2, d3)
# Let’s plot the density function of d123

plot(density(d123), col="blue", lwd=2,
main = "Distribution 1")
# Add vertical lines showing mean and median

abline(v=mean(d123))
abline(v=median(d123), lty="dashed")
1
Distribution 1
0.04
0.03
Density
0.02
0.01
0.00
0 20 40 60
N = 800 Bandwidth = 2.736
(a) Create and visualize a new “Distribution 2”: a combined dataset (n=800) that is negatively
skewed (tail stretches to the left). Change the mean and standard deviation of d1, d2, and d3
to achieve this new distribution. Compute the mean and median, and draw lines showing the
mean (solid) and median (dashed).
# Three normally distributed data sets

d1 <- rnorm(n=500, mean=45, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=15, sd=5)

Distribution_2 <- c(d1, d2, d3)
# Let’s plot the density function of Distribution_2

plot(density(Distribution_2), col="blue", lwd=2,

abline(v=mean(Distribution_2))
# tly="dashed" stands for drawing dotted/dashed line
abline(v=median(Distribution_2), lty="dashed")
2
Distribution 2
0.04
0.03
Density
0.02
0.01
0.00
0 10 20 30 40 50 60
It shows up a left-skewed distribution.
(b) Create a “Distribution 3”: a single dataset that is normally distributed (bell-shaped,
symmetric) – you do not need to combine datasets, just use the rnorm() function to create a
single large dataset (n=800). Show your code, compute the mean and median, and draw lines
showing the mean (solid) and median (dashed).
d <- rnorm(n=800, mean=30, sd=5)

Distribution_3 <- c(d)

plot(density(Distribution_3), col="blue", lwd=2,

abline(v=mean(Distribution_3))
abline(v=median(Distribution_3), lty="dashed")
3
Distribution 3
0.08
0.06
Density
0.04
0.02
0.00
20 30 40
(c) In general, which measure of central tendency (mean or median) do you think will be more
sensitive (will change more) to outliers being added to your data?
d1 <- rnorm(n=800, mean=30, sd=5)
# calculate 1.5 IQR

IQR_1.5 <- 1.5*(quantile(d1,0.75) - quantile(d1,0.25))
IQR_1.5
## 75%
## 10.73876
1.5 IQR is approximately equal to 10. Therfore, a distribution with a value » 10 with standard deviation =
5 is a reasonable outlier added to the original distribution.
# added outlier to d1
outlier <- rnorm(n=10, mean=100, sd=5)
d1_with_out <- c(d1, outlier)
# calculate the mean and median when outliers not being added to data
origin_mean <- mean(d1)
origin_median <- median(d1)
4
# calculate the mean and median when outliers being added to data
mean_with_outlier <- mean(d1_with_out)
median_with_outlier <- median(d1_with_out)
# calculate the difference

mean_dif <- abs(origin_mean - mean_with_outlier)
median_dif <- abs(origin_median - median_with_outlier)
mean_dif
## [1] 0.87166
median_dif
## [1] 0.05931095
First, after adding smaller outliers while their mean equals to 100 (»10), we can tell that the difference
between means is bigger than the difference between the medians.
# added outlier to d1
outlier <- rnorm(n=10, mean=1000, sd=5)
d1_with_out <- c(d1, outlier)
# calculate the mean and median when outliers not being added to data
origin_mean <- mean(d1)
origin_median <- median(d1)
# calculate the mean and median when outliers being added to data
mean_with_outlier <- mean(d1_with_out)
median_with_outlier <- median(d1_with_out)
# calculate the difference

mean_dif <- abs(origin_mean - mean_with_outlier)
median_dif <- abs(origin_median - median_with_outlier)
mean_dif
## [1] 11.96454
median_dif
## [1] 0.05931095
After adding bigger outliers while their mean equals to 1000 (»10), we can tell that the difference between
means is way more bigger than the difference between the medians.
The consequence proves that the mean value will be more sensitive (will change more) to
outliers than the median. Moreover, the larger the outliers, the greater the impact they will
have on the mean value from the data.
5
Question 2) Let’s try to get some more insight about what standard deviations
are.
(a) Create a random dataset (call it rdata) that is normally distributed with: n=2000, mean=0,
sd=1. Draw a density plot and put a solid vertical line on the mean, and dashed vertical lines
at the 1st, 2nd, and 3rd standard deviations to the left and right of the mean. You should
have a total of 7 vertical lines (one solid, six dashed).
rdata <- rnorm(n=2000, mean=0, sd=1)

plot(density(rdata), col="blue", lwd=2,
main = "rdata", xlim = c(-4,4))

abline(v=mean(rdata))
abline(v=1, lty="dashed")
abline(v=-1, lty="dashed")
rdata
0.3
Density
0.2
0.1
0.0
−4 −2 0 2 4
N = 2000 Bandwidth = 0.1933
6
(b) Using the quantile() function, which data points correspond to the 1st, 2nd, and 3rd
quartiles (i.e., 25th, 50th, 75th percentiles) of rdata? How many standard deviations away
from the mean (divide by standard-deviation; keep positive or negative sign) are those points
corresponding to the 1st, 2nd, and 3rd quartiles?
quantile(rdata, c(0.25,0.5,0.75))
## 25% 50% 75%

## -0.67395681 -0.03350167 0.65059354
(c) Now create a new random dataset that is normally distributed with: n=2000, mean=35,
sd=3.5. In this distribution, how many standard deviations away from the mean (use positive
or negative) are those points corresponding to the 1st and 3rd quartiles? Compare your answer
to (b)
rdata2 <- rnorm(n=2000, mean=35, sd=3.5)

# 1st quantile
rdata2Q1 = unname(quantile(rdata2, c(0.25,0.5,0.75)))[1]
# 2nd quantile
# 3rd quantile
c(rdata2Q1, rdata2Q2, rdata2Q3)
## [1] 32.64663 35.00550 37.17210
rdata2Q1 stands for the value of 1st quartile.

rdata2Q2 stands for the value of 2nd quartile.
rdata2Q3 stands for the value of 3rd quartile.
(rdata2Q1-35)/3.5
## [1] -0.6723922
(rdata2Q3-35)/3.5
## [1] 0.6206003
Compared to the answer in (b), there is approximately no change. Our objective is to determine the distance
of Q1 and Q3 from the center of the distribution. Calculating “how many standard deviations away from
the mean” involves normalization.
Since both (b) and (c) originate from a normal distribution, the distances of Q1 and Q3 from the center
should be approximately the same.
(d) Finally, recall the dataset d123 shown in the description of question 1. In that distribution,
how many standard deviations away from the mean (use positive or negative) are those data
points corresponding to the 1st and 3rd quartiles? Compare your answer to (b)
7
# 1st quantile
d123Q1 <- unname(quantile(d123, c(0.25,0.5,0.75)))[1]
# 2nd quantile
# 3rd quantile
c(d123Q1, d123Q2, d123Q3)
## [1] 14.09086 19.05329 30.03047
d123Q1 stands for the value of 1st quartile.

d123Q2 stands for the value of 2nd quartile.
d123Q3 stands for the value of 3rd quartile.
(d123Q1-mean(d123))/sd(d123)
## [1] -0.7331399
(d123Q3-mean(d123))/sd(d123)
## [1] 0.643898
Q1 is approximately 0.74 standard deviations below the mean, and Q3 is approximately 0.63 standard
deviations above the mean.
Compared to the result in (b), the result has changed. This is because in d123, the distribution exhibits
a positive skew (skew to the right). As a result, Q1 is further from the mean compared to the standard
deviations away from the mean for Q3.
Question 3) We mentioned in class that there might be some objective ways

of determining the bin size of histograms. Take a quick look at the Wikipedia
article on Histograms (“Number of bins and width”) to see the different ways
to calculate bin width (h) and number of bins (k).
Note that, for any dataset d, we can calculate number of bins (k) from the bin width (h):
𝑘 = 𝑐𝑒𝑖𝑙𝑖𝑛𝑔((𝑚𝑎𝑥(𝑑) − 𝑚𝑖𝑛(𝑑))/ℎ)
and bin width from number of bins:
ℎ = (𝑚𝑎𝑥(𝑑) − 𝑚𝑖𝑛(𝑑))/𝑘
Now, read this discussion on the Q&A forum called “Cross Validated” about choosing the number of bins
(a) From the question on the forum, which formula does Rob Hyndman’s answer (1st answer)
suggest to use for bin widths/number? Also, what does the Wikipedia article say is the benefit
of that formula?
The Rob Hyndman suggest to use Freedman–Diaconis rule, the bin width formula is
3.49𝜎̂
ℎ= √3
𝑛
8
The Wikipedia article said the benefit of Freedman-Diaconis rule is that it is less sensitive than the standard
deviation to outliers in data.
Freedman-Diaconis rule is designed to roughly minimize the integral of the squared difference between the
histogram and the density of the theoretical probability distribution.
(b) Given a random normal distribution:
rand_data <- rnorm(800, mean=20, sd = 5)

Compute the bin widths (h) and number of bins (k) according to each of the following formula:
i. Sturges’ formula
ii. Scott’s normal reference rule (uses standard deviation)
iii. Freedman-Diaconis’ choice (uses IQR)
Create a random normal distribution:
rand_data <- rnorm(800, mean=20, sd = 5)
𝑘 = [log2 (𝑛)] + 1
k1 = log2(800) +1
h1 = (max(rand_data)-min(rand_data))/k1
data.frame(
k = k1,
h = h1
)
## k h
## 1 10.64386 3.160486

3.49𝜎̂
ℎ= √3
𝑛
h2 = 3.49*sd(rand_data)/(800^(1/3))
k2 = ceiling((max(rand_data) - min(rand_data))/h2)
data.frame(
k = k2,
h = h2
)
## k h
## 1 18 1.90383

IQR(x)
ℎ=2 √
3
𝑛
9
h3 = 2*IQR(rand_data)/(800)^(1/3)
k3 = ceiling((max(rand_data) - min(rand_data))/h3)
data.frame(
k = k3,
h = h3
)
## k h
## 1 24 1.433148
(c) Repeat part (b) but let’s extend rand_data dataset with some outliers (creating a new
dataset out_data):
out_data <- c(rand_data, runif(10, min=40, max=60)) From your answers above, in which of the three
methods does the bin width (h) change the least when outliers are added (i.e., which is least sensitive to
outliers), and (briefly) WHY do you think that is?
Create a random normal distribution with outliers:
out_data <- c(rand_data, runif(10, min=40, max=60))
k1_o = log2(800) +1
h1_o = (max(out_data)-min(out_data))/k1_o
data.frame(
k = k1_o,
h = h1_o
)
## k h
## 1 10.64386 5.376344
h2_o = 3.49*sd(out_data)/(800^(1/3))
k2_o = ceiling((max(out_data) - min(rand_data))/h2_o)
data.frame(
k = k2_o,
h = h2_o
)
## k h
## 1 25 2.292732
10
h3_o = 2*IQR(out_data)/(800)^(1/3)
k3_o = ceiling((max(out_data) - min(out_data))/h3_o)
data.frame(
k = k3_o,
h = h3_o
)
## k h
## 1 40 1.463922
Calculate the difference between the original bin width (h) and the bin width
with outliers (h_o)
abs(h1-h1_o)
## [1] 2.215859
abs(h2-h2_o)
## [1] 0.3889017
abs(h3-h3_o)
## [1] 0.03077374
Freedman-Diaconis’ choice method changes the least when outliers are added.
For the Sturges’ formula, the range (max(d) - min(d)) increases with the addition of outliers, resulting in an
increase in �.
For the Scott’s normal reference rule, standard deviation �ˆ increase when outliers are added, resulting in an
increase in �.
For the Freedman-Diaconis’choice, it uses IQR, which are less sensitive to outliers compared with range and
standard deviation methods.
11

Bacs HW2

Uploaded by

Copyright:

Available Formats

Bacs HW2

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bacs HW2

Uploaded by

Copyright:

Available Formats

HW2

# Three normally distributed data sets

# Combining them into a composite dataset

# Let’s plot the density function of d123

# Add vertical lines showing mean and median

N = 800 Bandwidth = 2.736

# Three normally distributed data sets

# Combining them into a composite dataset

# Let’s plot the density function of Distribution_2

# Add vertical lines showing mean and median

N = 800 Bandwidth = 2.74

It shows up a left-skewed distribution.

d <- rnorm(n=800, mean=30, sd=5)

# Combining them into a composite dataset

# Let’s plot the density function of Distribution_2

# Add vertical lines showing mean and median

N = 800 Bandwidth = 1.097

d1 <- rnorm(n=800, mean=30, sd=5)

# calculate 1.5 IQR

# calculate the difference

# calculate the difference

rdata <- rnorm(n=2000, mean=0, sd=1)

# Let’s plot the density function of Distribution_2

# Add vertical lines showing mean and median

N = 2000 Bandwidth = 0.1933

## 25% 50% 75%

rdata2 <- rnorm(n=2000, mean=35, sd=3.5)

c(rdata2Q1, rdata2Q2, rdata2Q3)

## [1] 32.64663 35.00550 37.17210

rdata2Q1 stands for the value of 1st quartile.

c(d123Q1, d123Q2, d123Q3)

## [1] 14.09086 19.05329 30.03047

d123Q1 stands for the value of 1st quartile.

Question 3) We mentioned in class that there might be some objective ways

(b) Given a random normal distribution:

rand_data <- rnorm(800, mean=20, sd = 5)

Create a random normal distribution:

rand_data <- rnorm(800, mean=20, sd = 5)

ii. Scott’s normal reference rule (uses standard deviation)

iii. Freedman-Diaconis’ choice (uses IQR)

Create a random normal distribution with outliers:

out_data <- c(rand_data, runif(10, min=40, max=60))

ii. Scott’s normal reference rule (uses standard deviation)

iii. Freedman-Diaconis’ choice (uses IQR)

You might also like