STA2050 Assignment 2
STA2050 Assignment 2
STA2050 Assignment 2
2023-11-28
library(imputeTS)
library(VIM)
##
## Attaching package: ’VIM’
library(nnet)
library(readxl)
library(csv)
library(dplyr)
##
## Attaching package: ’dplyr’
1
## The following objects are masked from ’package:stats’:
##
## filter, lag
library(tidyverse)
library(finalfit)
library(lubridate)
library(tidyr)
library(ggplot2)
library(psych)
##
## Attaching package: ’psych’
##
## The following objects are masked from ’package:ggplot2’:
##
## %+%, alpha
library(lme4)
2
library(mice)
##
## Attaching package: ’mice’
##
## The following object is masked from ’package:stats’:
##
## filter
##
## The following objects are masked from ’package:base’:
##
## cbind, rbind
library(tibble)
Question 1
#View(Treedata)
str(Treedata)
a) Draw a scatterplot of y vs x
3
Scatterplot of Y (Age) against X (Diameter)
150
125
Tree Age
100
75
6 8 10 12
Tree Diameter
"\n"
## [1] "\n"
b) Estimate the Population Mean age of trees in the stand using ratio estimation and give an approximate
standard error for your estimate
Diameter = Auxilliary = x
Age = Dependent = y
n
X
tx = xi
i=1
n
X
ty = yi
i=1
tx µx
=
ty µy
ty
P opulationM eanAgeµy = µx ∗
tx
StanDev
StandardError = √
nof age
4
#Estimating Population Mean Age
meanx = 10.3
totalx = sum(Diameter_X)
cat("Diameter Sample Total =", totalx, "\n")
totaly = sum(Age_Y)
cat("Age Sample Total =", totaly, "\n")
cat("Therefore, the estimated population mean age of trees is", meany, "while the standard error for sai
## Therefore, the estimated population mean age of trees is 117.6204 while the standard error for said e
P P P
n x∗y− x∗ y
bi = P 2 P 2
n ∗ x − ( x)
y = b0 + b1 x
#Creating a Subset of the Treedata
Regdata = Treedata
Regdata[4] = Regdata[2] * Regdata[3]
Regdata[5] = Regdata[2]ˆ2
Regdata[6] = Regdata[3]ˆ2
colnames(Regdata) = c("Tree_no", 'Diameter_X', "Age_Y", "XY", "Xsqr", "Ysqr")
Regdata = Regdata %>%
mutate_at(vars(1), as.character)
Totals = Regdata %>%
summarize(Tree_no = "Totals",
Diameter_X = sum(Regdata$Diameter_X),
Age_Y = sum(Regdata$Age_Y),
XY = sum(Regdata$XY),
Xsqr = sum(Regdata$Xsqr),
Ysqr = sum(Regdata$Ysqr))
Regdata = bind_rows(Regdata, Totals)
#Regdata = Regdata %>%
# mutate(Regdata$Tree_no[21], "Totals")
str(Regdata)
5
## ’data.frame’: 21 obs. of 6 variables:
## $ Tree_no : chr "1" "2" "3" "4" ...
## $ Diameter_X: num 12 11.4 7.9 10.5 7.9 9 7.3 10.2 11.7 11.3 ...
## $ Age_Y : num 125 119 83 85 99 117 69 133 154 168 ...
## $ XY : num 1500 1357 656 892 782 ...
## $ Xsqr : num 144 130 62.4 110.2 62.4 ...
## $ Ysqr : num 15625 14161 6889 7225 9801 ...
head(Regdata)
##
## Call:
## lm(formula = Age_Y ~ Diameter_X, data = Regdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.892 -11.171 -0.288 9.977 38.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03031 4.33635 -0.007 0.994
## Diameter_X 11.42115 0.10301 110.874 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 17.98 on 19 degrees of freedom
## Multiple R-squared: 0.9985, Adjusted R-squared: 0.9984
## F-statistic: 1.229e+04 on 1 and 19 DF, p-value: < 2.2e-16
Intercept = coef(Treemodel)[1]
Slope = coef(Treemodel)[2]
cat("The slope of the model is", Slope, "while the intercept is", Intercept, "\n")
6
#Approximating Standard Error for the above Estimation
Residuals = resid(Treemodel)
ResStanDev = sd(Residuals)
SE = ResStanDev * sqrt(1 + 1/length(Age_Y) + ((meanx - mean(Diameter_X))ˆ2 / sum((Diameter_X - mean(Diam
cat("The Regression Estimate of the Population Mean Age is", Estimate, "while the Standard Error for the
## The Regression Estimate of the Population Mean Age is 117.6075 while the Standard Error for the above
## V1 X.Intercept.
## Ratio 10.3 117.6204
## Regression 10.3 117.6075
150
125
Tree Age
100
75
6 8 10 12
Tree Diameter
7
"\n"
## [1] "\n"
From the above graph it is clear to see that both the Ratio Estimation & The Regression Estimation of the
Population Mean Age are very close in their value however the Standard Error for the methods vary greatly
Question 2
The new candy Green Goobules is being test-marketed in an area of upstate NY. The market research firm
decided to sample 6 cities from the 45 in the area & then to sample supermarkets within the cities, wanting
to know the no. of Green Gobules cases sold
#Cluster 1
summary(cluster_1)
sd1 = sd(cluster_1)
var1 = var(cluster_1)
cat("The Standard Deviation of Cluster 1 is", sd1, "while the Variance is", var1, "\n", "\n")
#Cluster 2
summary(cluster_2)
sd2 = sd(cluster_2)
var2 = var(cluster_2)
cat("The Standard Deviation of Cluster 2 is", sd2, "while the Variance is", var2, "\n", "\n")
8
#Cluster 3
summary(cluster_3)
sd3 = sd(cluster_3)
var3 = var(cluster_3)
cat("The Standard Deviation of Cluster 3 is", sd3, "while the Variance is", var3, "\n", "\n")
#Cluster 4
summary(cluster_4)
sd4 = sd(cluster_4)
var4 = var(cluster_4)
cat("The Standard Deviation of Cluster 4 is", sd4, "while the Variance is", var4, "\n", "\n")
#Cluster 5
summary(cluster_5)
sd5 = sd(cluster_5)
var5 = var(cluster_5)
cat("The Standard Deviation of Cluster 5 is", sd5, "while the Variance is", var5, "\n", "\n")
#Cluster 6
summary(cluster_6)
sd6 = sd(cluster_6)
var6 = var(cluster_6)
cat("The Standard Deviation of Cluster 1 is", sd6, "while the Variance is", var6, "\n", "\n")
9
## The Standard Deviation of Cluster 1 is 22.27106 while the Variance is 496
##
b) Estimate the total number of cases sold, and the average number sold per supermarket, along with the
standard errors of your Estimates
## [1] "cluster_1 = I(list(c(146, 180, 251, 152, 72, 181, 171, 361, 73, 186)))\ncluster_2 = I(list(c(99,
## [1] NA
10