Assignment_STAT5002

The document outlines an individual assignment for the STAT5002 Introduction to Statistics course at the University of Sydney, due on November 1, 2024. It consists of five questions covering hypothesis testing, chi-squared tests, logistic regression, and multiple linear regression, with specific tasks and R code provided for each question. Students are instructed to submit their solutions as a single PDF file, ensuring anonymity by including only their student ID.

The University of Sydney

STAT5002 Introduction to Statistics


Copyright: University of Sydney

Semester 2, 2024 (18 Oct 2024)


Lecturers: Tiangang Cui, Mohammad Javad Davoudabadi
This individual assignment is due by 11:59pm on Friday 1 Nov 2024, via Canvas. There are five questions
in this assignment, each worth 10 points. Your solution should be submitted as a single PDF file that includes
your written answers and screenshots of your code (and relevant outputs if necessary).
Your submitted file should include your SID. To ensure compliance with our anonymous marking obligations,
please do not under any circumstances include your name in any area of your assignment; only your
SID should be present. Please make sure you review your submission carefully. What you see is exactly
how the marker will see your assignment. Submissions can be overwritten until the due date.

1. We want to test the difference between two drugs, A and B, for treating high blood pressure. Patients
are paired according to age, giving 20 pairs. One patient of each pair is chosen at random to receive
drug A and the other receives drug B. The resulting drops in blood pressure are set out below:

Pair         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
Drug A       5   4   2   6   9   1   1   5   6   3   7  14   8   3   4  10   7  12   6   9
Drug B       2   5   1   2   6   3   0   2   5   2   7  12   5   2   3   6   5  10   4   7
Difference   3  -1   1   4   3  -2   1   3   1   1   0   2   3   1   1   4   2   2   2   2

The following R outputs will be used.


> # Drug A data as a vector
> drug_A = c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)
> # Drug B data as a vector
> drug_B = c(2, 5, 1, 2, 6, 3, 0, 2, 5, 2, 7, 12, 5, 2, 3, 6, 5, 10, 4, 7)
> diff = drug_A - drug_B
>
> mean(diff)
[1] 1.65
Assuming the differences have a known standard deviation of σ = 1.5, and using the mean of the differences,
we want to carry out the following steps of a hypothesis test.

(a) Introduce appropriate notation to state the null and alternative hypotheses.
(b) What distribution of the test statistic should be used here? Explain your answer.
(c) Compute the observed test statistic.
(d) Which values of the test statistic argue against the null hypothesis? Is this a one-sided or two-sided
test? Explain your answer.
(e) Applying the 68%-95%-99.7% rule, what is the smallest p-value you can find? Explain your answer.
(f) With a significance level of α = 0.05, do we observe sufficient evidence to reject the null hypothesis?
Explain your answer.
(g) Now we want to find out if Drug A is more effective than Drug B (a larger drop in blood pressure
means the drug is more effective).
    (1) What is your new alternative hypothesis?
    (2) What is your new p-value? (the smallest you can estimate based on the 68%-95%-99.7% rule)
    (3) With a significance level of α = 0.05, do we observe sufficient evidence to reject the null
    hypothesis? Explain your answer.
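
The calculation behind Question 1 can be sketched in R as follows. This is only an illustrative sketch, not a model answer: it assumes the null hypothesis is a mean difference of zero, and the names d, sigma, mu0, z, p_two_sided and p_one_sided are introduced here and do not appear in the assignment. The exact normal tail areas are included only so they can be compared with the bounds obtained from the 68%-95%-99.7% rule.

```r
# Paired data copied from the table in Question 1
drug_A <- c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)
drug_B <- c(2, 5, 1, 2, 6, 3, 0, 2, 5, 2, 7, 12, 5, 2, 3, 6, 5, 10, 4, 7)
d <- drug_A - drug_B   # differences, Drug A minus Drug B

sigma <- 1.5           # known standard deviation, as stated in the question
n <- length(d)         # number of pairs
mu0 <- 0               # assumed null value: no mean difference

# z statistic: (sample mean - null mean) / standard error of the mean
z <- (mean(d) - mu0) / (sigma / sqrt(n))

# Exact normal tail areas, for comparison with the empirical-rule bounds
p_two_sided <- 2 * pnorm(-abs(z))            # alternative: mean difference != 0
p_one_sided <- pnorm(z, lower.tail = FALSE)  # alternative: mean difference > 0

print(c(z = z, p_two_sided = p_two_sided, p_one_sided = p_one_sided))
```

Because σ is treated as known, the statistic is compared with the standard normal distribution rather than a t distribution.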
2. We consider the same data table as in Question 1. The company producing the drugs claims that Drug A
is expected to reduce blood pressure by 7 units. Suppose the drop in blood pressure under Drug A follows
a normal distribution with unknown standard deviation. We want to test whether the company’s claim is true.

We want to carry out the following steps of a hypothesis test. You can use the following R outputs.
> # Drug A data as a vector
> drug_A = c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)
> mean(drug_A)
[1] 6.1
> round(sd(drug_A),1)
[1] 3.5

(a) State the null and alternative hypotheses.
(b) What distribution of the test statistic should be used here? Explain your answer.
(c) Compute the observed test statistic.
(d) Which values of the test statistic argue against the null hypothesis? Is this a one-sided or two-sided
test? Explain your answer.
(e) What is the p-value? Write down the R code (in one line) that you used for calculating the p-value.
(f) With a significance level of α = 0.05, do we observe sufficient evidence to reject the null hypothesis?
Explain your answer.
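
A minimal R sketch of the one-sample test in Question 2 is given below. It assumes a two-sided alternative (mean drop different from 7) and uses the unrounded sample standard deviation, so the result may differ slightly from a hand calculation based on the rounded value 3.5. The names mu0, t_stat, p_value and check are illustrative only; t.test() from base R is used purely as a cross-check.

```r
# Drug A data copied from Question 2
drug_A <- c(5, 4, 2, 6, 9, 1, 1, 5, 6, 3, 7, 14, 8, 3, 4, 10, 7, 12, 6, 9)

mu0 <- 7               # claimed mean drop in blood pressure
n <- length(drug_A)    # sample size

# t statistic: (sample mean - claimed mean) / estimated standard error
t_stat <- (mean(drug_A) - mu0) / (sd(drug_A) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Cross-check against the built-in one-sample t-test
check <- t.test(drug_A, mu = mu0)

print(c(t = t_stat, p = p_value))
print(check)
```

The t distribution replaces the normal distribution here because the standard deviation is estimated from the sample.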
3. Consider the table below, which counts the number of left- and right-handers in each of three random
samples of individuals, one sample from each of three populations:

               Left-handed   Right-handed   Total
Population 1             4             36      40
Population 2             4             26      30
Population 3             2             28      30
Total                   10             90     100

It is of interest to test whether the proportions of hand dominance (left-handed and right-handed) are
the same across all three populations. Follow the steps below to perform a chi-squared test to
investigate this.

(a) State the null and alternative hypotheses.
(b) Set up the table of expected counts (you may also set up the table of expected probabilities for this).
(c) Compute the observed test statistic.
(d) What are the degrees of freedom of the distribution of the test statistic? Explain your answer.
(e) Which values of the test statistic argue against the null hypothesis? Explain your answer.
(f) With a significance level of α = 0.05, do we observe sufficient evidence to reject the null hypothesis?
Explain your answer.

> qchisq(0.95, 4)
[1] 9.487729
> qchisq(0.95, 3)
[1] 7.814728
> qchisq(0.95, 2)
[1] 5.991465
> qchisq(0.95, 1)
[1] 3.841459
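
The steps of Question 3 can be sketched in R as follows. This is an illustrative sketch rather than a model answer; the names observed, expected, chi_sq, dof, critical and check are introduced here. Expected counts are computed as (row total × column total) / grand total, and chisq.test() from base R is used only as a cross-check. Some expected counts in this table are below 5, so chisq.test() warns that the chi-squared approximation may be inaccurate; the sketch suppresses that warning.

```r
# Observed counts copied from the table in Question 3
observed <- matrix(c(4, 36,
                     4, 26,
                     2, 28),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Population 1", "Population 2", "Population 3"),
                                   c("Left-handed", "Right-handed")))

# Expected counts under equal proportions: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical value at the 5% level; large values of chi_sq argue against the null
critical <- qchisq(0.95, dof)

# Cross-check with the built-in test (warning about small expected counts suppressed)
check <- suppressWarnings(chisq.test(observed))

print(expected)
print(c(chi_sq = chi_sq, dof = dof, critical = critical))
```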
4. A marketing company wants to understand the purchasing behaviour of customers for a new product
they have launched. To do this, they conducted a survey to collect data on customer demographics
and whether or not they purchased the product after receiving a promotional offer.

One key variable they are interested in is Age. The company hypothesises that older customers may
be more likely to purchase the product. The binary outcome of the study is whether the customer
purchased the product (Purchased = 1) or did not purchase the product (Purchased = 0).

You are given a dataset that includes the ages of 100 customers and whether they purchased the product.
The company asks you to perform logistic regression to analyse the relationship between Age and
the likelihood of purchasing the product.

A simulated data set will be used in this question. You should use the code below to simulate the data
in R to complete the following tasks.
> # Simulating a dataset in R
> set.seed(123)
> n <- 100
> # Age centered around 40 years with standard deviation 10
> Age <- rnorm(n, mean = 40, sd = 10)
> # Logistic function with Age as predictor
> Purchased <- rbinom(n, 1, prob = 1 / (1 + exp(-(0.1 * Age - 4))))
> data <- data.frame(Age, Purchased)
>
> # Display first few rows of the dataset
> head(data)
(a) Fit a logistic regression model to the data using Age as the predictor. Write down the logistic
regression equation using the estimated coefficients from your model.
(b) Interpret the estimated coefficient for Age. What does it tell you about how age affects the
probability of purchasing the product? Describe it in terms of the odds ratio as well.
(c) Based on your model, calculate the predicted probability that a 35-year-old person will purchase
the product. What does it tell you?
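
One way to approach Question 4 in R is sketched below. This is a minimal sketch under the assumption that glm() with family = binomial is the intended fitting method; the names fit, odds_ratio, p35 and p35_manual are illustrative and not part of the assignment. The simulation lines are repeated from the question so the block runs on its own.

```r
# Simulate the dataset exactly as in the question
set.seed(123)
n <- 100
Age <- rnorm(n, mean = 40, sd = 10)
Purchased <- rbinom(n, 1, prob = 1 / (1 + exp(-(0.1 * Age - 4))))
data <- data.frame(Age, Purchased)

# Logistic regression: log-odds of purchase as a linear function of Age
fit <- glm(Purchased ~ Age, data = data, family = binomial)
b <- coef(fit)   # b[1] is the intercept, b[2] is the Age coefficient

# Odds ratio for a one-year increase in Age: exp(Age coefficient)
odds_ratio <- exp(b["Age"])

# Predicted purchase probability for a 35-year-old customer
p35 <- predict(fit, newdata = data.frame(Age = 35), type = "response")

# The same prediction written out from the fitted equation
p35_manual <- 1 / (1 + exp(-(b[1] + b[2] * 35)))

print(summary(fit)$coefficients)
print(c(odds_ratio = unname(odds_ratio), p35 = unname(p35)))
```

An odds ratio above 1 would indicate that the odds of purchasing increase with each additional year of age, which is the direction the company hypothesises.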
5. A real estate company is analysing the factors that influence the price of houses in a specific city. They
collected data on several features of 100 houses, including the number of bedrooms, size of the house
(in square feet), lot size, number of bathrooms, and age of the house.

The company is interested in using multiple linear regression to predict house prices based on these
features. However, they are unsure which variables to include in the final model. You are asked to help
them select the best model using forward and backward variable selection.

The dataset includes 100 observations with the following variables:


- Bedrooms: The number of bedrooms in the house.
- HouseSize: The size of the house in square feet.
- LotSize: The size of the lot in square feet.
- Bathrooms: The number of bathrooms in the house.
- Age: The age of the house (in years).
- Price: The price of the house (in dollars).

A simulated data set will be used in this question. You should use the code below to simulate the data
in R to complete the following tasks.
> # Set seed for reproducibility
> set.seed(123)
> # Simulating data for the 100 houses
> n <- 100
> Bedrooms <- sample(2:5, n, replace = TRUE)
> HouseSize <- rnorm(n, mean = 2000, sd = 500)
> LotSize <- rnorm(n, mean = 6000, sd = 2000)
> Bathrooms <- sample(1:4, n, replace = TRUE)
> Age <- sample(1:20, n, replace = TRUE)
>
> # Generate house prices based on a true underlying model
> # (the trailing "+" keeps the formula as one expression across both lines)
> Price <- 50000 + 30000 * Bedrooms + 120 * HouseSize + 50 * LotSize +
>   25000 * Bathrooms - 500 * Age + rnorm(n, mean = 0, sd = 50000)
>
> # Create a data frame
> house_data <- data.frame(Bedrooms, HouseSize, LotSize, Bathrooms, Age, Price)

(a) Perform forward variable selection using AIC to choose the best model for predicting house price
based on the features. Report the variables that were included in the final model, and if any
variables were removed, specify them. Additionally, provide the regression equation for the final
model, including the estimated coefficients.
(b) Perform backward variable selection using AIC to choose the best model for predicting house price
based on the features. Report the variables that were included in the final model, and if any
variables were removed, specify them. Additionally, provide the regression equation for the final
model, including the estimated coefficients.
(c) Compare the models obtained from forward and backward selection. Are they the same? Explain
any differences and suggest which model you would recommend to the real estate company.
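
Forward and backward selection for Question 5 can be sketched with step() from base R, which uses AIC by default. This is an illustrative sketch, not a model answer. It assumes the two lines of the Price formula are meant to be a single expression (written here with a trailing "+"), and the names null_model, full_model, forward_fit, backward_fit and same_terms are introduced only for this example.

```r
# Simulate the data as in the question
set.seed(123)
n <- 100
Bedrooms <- sample(2:5, n, replace = TRUE)
HouseSize <- rnorm(n, mean = 2000, sd = 500)
LotSize <- rnorm(n, mean = 6000, sd = 2000)
Bathrooms <- sample(1:4, n, replace = TRUE)
Age <- sample(1:20, n, replace = TRUE)
# Assumption: the Price formula is one expression spanning two lines
Price <- 50000 + 30000 * Bedrooms + 120 * HouseSize + 50 * LotSize +
  25000 * Bathrooms - 500 * Age + rnorm(n, mean = 0, sd = 50000)
house_data <- data.frame(Bedrooms, HouseSize, LotSize, Bathrooms, Age, Price)

# Smallest and largest candidate models
null_model <- lm(Price ~ 1, data = house_data)
full_model <- lm(Price ~ Bedrooms + HouseSize + LotSize + Bathrooms + Age,
                 data = house_data)

# Forward selection: start from the intercept-only model and add terms by AIC
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1,
                                 upper = ~ Bedrooms + HouseSize + LotSize + Bathrooms + Age),
                    direction = "forward", trace = 0)

# Backward selection: start from the full model and drop terms by AIC
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Do both procedures end with the same set of predictors?
same_terms <- setequal(attr(terms(forward_fit), "term.labels"),
                       attr(terms(backward_fit), "term.labels"))

print(coef(forward_fit))
print(coef(backward_fit))
print(same_terms)
```

Setting trace = 1 instead of trace = 0 prints the AIC at every step, which shows the order in which variables enter or leave the model.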
