Assignment_STAT5002
1. We want to test the difference between two drugs, A and B, for treating high blood pressure. 20 pairs
of patients are matched according to age. One patient of each pair is chosen at random to receive drug A
and the other receives drug B. The resulting drops in blood pressure are set out below:
Pair         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
Drug A       5   4   2   6   9   1   1   5   6   3   7  14   8   3   4  10   7  12   6   9
Drug B       2   5   1   2   6   3   0   2   5   2   7  12   5   2   3   6   5  10   4   7
Difference   3  -1   1   4   3  -2   1   3   1   1   0   2   3   1   1   4   2   2   2   2
(a) Introduce appropriate notation to state the null and alternative hypotheses.
(b) What distribution of test statistic should be used here? Explain your answer.
(c) Compute the observed test statistic.
(d) What values of the test statistic will argue against the null hypothesis? Is this a one-sided or two-sided
test? Explain your answer.
(e) Applying the 68%-95%-99.7% rule, what is the smallest p-value you can find? Explain your answer.
(f) With a significance level α = 0.05, do we observe sufficient evidence to reject the null hypothesis?
Explain your answer.
(g) Now we want to find out if Drug A is more effective than Drug B (a larger drop in blood pressure
means the drug is more effective).
(1) What is your new alternative hypothesis?
(2) What is your new p-value? (Give the smallest you can estimate based on the 68%-95%-99.7% rule.)
(3) With a significance level α = 0.05, do we observe sufficient evidence to reject the null hypothesis?
Explain your answer.
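The data table in Question 1 can be entered in R as sketched below. This is only a minimal starting point: it rebuilds the "Difference" row and a few summary statistics, and leaves the choice of test statistic to the question. The names `diff_AB`, `n_pairs`, `mean_diff`, `sd_diff`, `se_diff` and `sign_counts` are illustrative assumptions, not part of the assignment.

```r
# Drops in blood pressure for the 20 age-matched pairs (table in Question 1)
drug_A <- c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)
drug_B <- c(2, 5, 1, 2, 6, 3, 0, 2, 5, 2, 7, 12, 5, 2, 3, 6, 5, 10, 4, 7)

# Within-pair differences (A minus B); this reproduces the "Difference" row
diff_AB <- drug_A - drug_B

# Summary statistics of the differences (names are illustrative)
n_pairs   <- length(diff_AB)
mean_diff <- mean(diff_AB)
sd_diff   <- sd(diff_AB)
se_diff   <- sd_diff / sqrt(n_pairs)   # standard error of the mean difference

# Counts of negative, zero and positive differences,
# useful if a sign-based approach is considered instead
sign_counts <- table(sign(diff_AB))

print(c(n = n_pairs, mean = mean_diff, sd = sd_diff, se = se_diff))
print(sign_counts)
```

Working with the within-pair differences, rather than the two columns separately, reflects the paired design described in the question.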
2. Consider the same data table as in Question 1. The company producing the drugs claims that Drug A
is expected to reduce blood pressure by 7 units. Suppose the drop in blood pressure under Drug A follows
a normal distribution with unknown standard deviation. We want to test whether the company's claim is true.
We want to carry out the following steps of a hypothesis test. You can use the following R outputs.
> # Drug A data as a vector
> drug_A = c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)
> mean(drug_A)
[1] 6.1
> round(sd(drug_A),1)
[1] 3.5
> qchisq(0.95, 4)
[1] 9.487729
> qchisq(0.95, 3)
[1] 7.814728
> qchisq(0.95, 2)
[1] 5.991465
> qchisq(0.95, 1)
[1] 3.841459
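For the test of the company's claim about Drug A in Question 2, the sample mean and standard deviation above can be combined into a one-sample t statistic. The sketch below assumes a one-sample t-test is the intended procedure (the question states normality with unknown standard deviation); `mu0`, `t_stat` and `check` are illustrative names, not part of the assignment.

```r
# Drug A data, as in the R output above
drug_A <- c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)

mu0 <- 7                  # mean drop claimed by the company
n   <- length(drug_A)

# One-sample t statistic: (sample mean - claimed mean) / standard error
t_stat <- (mean(drug_A) - mu0) / (sd(drug_A) / sqrt(n))

# The built-in t.test() computes the same statistic, with n - 1 degrees of freedom
check <- t.test(drug_A, mu = mu0)

print(t_stat)
print(check)
```

Computing the statistic by hand and comparing it with `t.test()` is a quick way to confirm that the formula was applied to the right quantities.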
4. A marketing company wants to understand the purchasing behaviour of customers for a new product
they have launched. To do this, they conducted a survey to collect data on customer demographics
and whether or not they purchased the product after receiving a promotional offer.
One key variable they are interested in is Age. The company hypothesises that older customers may
be more likely to purchase the product. The binary outcome of the study is whether the customer
purchased the product (Purchased = 1) or did not purchase the product (Purchased = 0).
You are given a dataset that includes the ages of 100 customers and whether they purchased the product.
The company asks you to perform logistic regression to analyse the relationship between Age and
the likelihood of purchasing the product.
A simulated data set will be used in this question. You should use the code below to simulate the data
in R to complete the following tasks.
> # Simulating a dataset in R
> set.seed(123)
> n <- 100
> # Age centered around 40 years with standard deviation 10
> Age <- rnorm(n, mean = 40, sd = 10)
> # Logistic function with Age as predictor
> Purchased <- rbinom(n, 1, prob = 1 / (1 + exp(-(0.1 * Age - 4))))
> data <- data.frame(Age, Purchased)
>
> # Display first few rows of the dataset
> head(data)
(a) Fit a logistic regression model to the data using Age as the predictor. Write down the logistic
regression equation using the estimated coefficients from your model.
(b) Interpret the estimated coefficient for Age. What does it tell you about how age affects the
probability of purchasing the product? Describe it in terms of the odds ratio as well.
(c) Based on your model, calculate the predicted probability that a 35-year-old person will purchase
the product. What does it tell you?
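The tasks in Question 4 can be approached along the following lines. This is a hedged sketch rather than a model answer: it repeats the simulation code so that it runs on its own, and the names `fit`, `odds_ratio` and `p35` are assumptions introduced for illustration.

```r
# Simulate the data exactly as specified in the question
set.seed(123)
n <- 100
Age <- rnorm(n, mean = 40, sd = 10)
Purchased <- rbinom(n, 1, prob = 1 / (1 + exp(-(0.1 * Age - 4))))
data <- data.frame(Age, Purchased)

# (a) Logistic regression of Purchased on Age
fit <- glm(Purchased ~ Age, family = binomial, data = data)
print(coef(fit))   # intercept and slope on the log-odds scale

# (b) exp(slope) is the multiplicative change in the odds of purchase
#     for a one-year increase in Age
odds_ratio <- exp(coef(fit)[["Age"]])
print(odds_ratio)

# (c) Predicted probability of purchase for a 35-year-old customer
p35 <- predict(fit, newdata = data.frame(Age = 35), type = "response")
print(p35)
```

Using `type = "response"` returns the prediction on the probability scale; the default for `predict()` on a `glm` object is the log-odds scale.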
5. A real estate company is analysing the factors that influence the price of houses in a specific city. They
collected data on several features of 100 houses, including the number of bedrooms, size of the house
(in square feet), lot size, number of bathrooms, and age of the house.
The company is interested in using multiple linear regression to predict house prices based on these
features. However, they are unsure which variables to include in the final model. You are asked to help
them select the best model using forward and backward variable selection.
A simulated data set will be used in this question. You should use the code below to simulate the data
in R to complete the following tasks.
> # Set seed for reproducibility
> set.seed(123)
> # Simulating data for the 100 houses
> n <- 100
> Bedrooms <- sample(2:5, n, replace = TRUE)
> HouseSize <- rnorm(n, mean = 2000, sd = 500)
> LotSize <- rnorm(n, mean = 6000, sd = 2000)
> Bathrooms <- sample(1:4, n, replace = TRUE)
> Age <- sample(1:20, n, replace = TRUE)
>
> # Generate house prices based on a true underlying model
> Price <- 50000 + 30000 * Bedrooms + 120 * HouseSize + 50 * LotSize +
+   25000 * Bathrooms - 500 * Age + rnorm(n, mean = 0, sd = 50000)
>
> # Create a data frame
> house_data <- data.frame(Bedrooms, HouseSize, LotSize, Bathrooms, Age, Price)
(a) Perform forward variable selection using AIC to choose the best model for predicting house price
based on the features. Report the variables that were included in the final model, and if any
variables were removed, specify them. Additionally, provide the regression equation for the final
model, including the estimated coefficients.
(b) Perform backward variable selection using AIC to choose the best model for predicting house price
based on the features. Report the variables that were included in the final model, and if any
variables were removed, specify them. Additionally, provide the regression equation for the final
model, including the estimated coefficients.
(c) Compare the models obtained from forward and backward selection. Are they the same? Explain
any differences and suggest which model you would recommend to the real estate company.