
Project 1 AI


Page 1 of 31 - Cover Page Submission ID trn:oid:::1:2866706153

MR
Student 1
Quick Submit

Quick Submit

Winneconne High School

Document Details

Submission ID: trn:oid:::1:2866706153
Submission Date: Mar 27, 2024, 5:42 AM CDT
Download Date: Mar 27, 2024, 5:51 AM CDT
File Name: Assignment_2_Spring_2024.docx
File Size: 125.4 KB
Pages: 29
Words: 6,728
Characters: 40,882


How much of this submission has been generated by AI?

27% of qualifying text in this submission has been determined to be generated by AI.

Caution: Percentage may not indicate academic misconduct. Review required.

It is essential to understand the limitations of AI detection before making decisions about a student's work. We encourage you to learn more about Turnitin's AI detection capabilities before using the tool.

Frequently Asked Questions

What does the percentage mean?


The percentage shown in the AI writing detection indicator and in the AI writing report is the amount of qualifying text within the
submission that Turnitin's AI writing detection model determines was generated by AI.

Our testing has found that there is a higher incidence of false positives when the percentage is less than 20. In order to reduce the
likelihood of misinterpretation, the AI indicator will display an asterisk for percentages less than 20 to call attention to the fact that
the score is less reliable.

However, the final decision on whether any misconduct has occurred rests with the reviewer/instructor. They should use the
percentage as a means to start a formative conversation with their student and/or use it to examine the submitted assignment in
greater detail according to their school's policies.

What does 'qualifying text' mean?

Our model only processes qualifying text in the form of long-form writing. Long-form writing means individual sentences contained in paragraphs that make up a longer piece of written work, such as an essay, a dissertation, or an article. Qualifying text that has been determined to be AI-generated will be highlighted blue on the submission text.

Non-qualifying text, such as bullet points, annotated bibliographies, etc., will not be processed and can create disparity between the submission highlights and the percentage shown.

How does Turnitin's indicator address false positives?

Sometimes false positives (incorrectly flagging human-written text as AI-generated) can include lists without a lot of structural variation, text that literally repeats itself, or text that has been paraphrased without developing new ideas. If our indicator shows a higher amount of AI writing in such text, we advise you to take that into consideration when looking at the percentage indicated.

In a longer document with a mix of authentic writing and AI-generated text, it can be difficult to determine exactly where the AI writing begins and original writing ends, but our model should give you a reliable guide to start conversations with the submitting student.

Disclaimer
Our AI writing assessment is designed to help educators identify text that might be prepared by a generative AI tool. Our AI writing assessment may not always be accurate (it may misidentify
both human and AI-generated text) so it should not be used as the sole basis for adverse actions against a student. It takes further scrutiny and human judgment in conjunction with an
organization's application of its specific academic policies to determine whether any academic misconduct has occurred.


Statistical Intuitions and Applications


Assignment 2
Important Information:
In Assignment 2 you will apply the statistical concepts that we've encountered so far in the
course to solve FOUR questions related to real-world scenarios. Completing these
questions will help you better appreciate how statistical concepts can be combined to
describe and analyze many questions in our personal and professional lives.
For each question you will first encounter a block of code that generates a dataset for you
to work with (similar to Assignment 1). You will need to save these datasets as csv files.
Submission Requirements
1. Submit all answers along with their corresponding code as a searchable PDF.

2. All generated csv and .ipynb files must be submitted in a zip-folder as a secondary
source.

3. Ensure your zip folder contains four csv files (i.e., the csv files for each of the four
questions below; YourName.csv, RelianceRetailVisits_ordered.csv, Scores.csv,
Vaccinated.csv).

4. You may use Jupyter notebook or Colab as per your convenience.

5. You should ONLY use the concepts and techniques covered in the course to
generate your answers. Statistical techniques that are NOT covered in the
course will NOT be evaluated.

Note: Reach out to your instructor with any questions regarding csv files, code, or the zip folder.
Non-compliance with the above instructions will result in a 0 grade on the relevant
portions of the assignment. Your instructor will grade your assignment based on what you
submitted. Failure to submit the assignment or submitting an assignment intended for
another class will result in a 0 grade, and resubmission will not be allowed. Make sure that
you submit your original work. Suspected cases of plagiarism will be treated as potential
academic misconduct and will be reported to the College Academic Integrity Committee for
a formal investigation. As part of this procedure, your instructor may require you to meet
with them for an oral exam on the assignment.


A. Statistical Intuitions in Mental Health

Question 1:
We are going to work with a dataset that was collected on mental health issues. In total,
824 individuals (teenagers, college students, housewives, business professionals, and
other groups) completed the survey. Their data provides valuable insights into the
prevalence of, and factors associated with, mental health issues in different groups.
To Begin.
Run the code below. It will select a random sample of 300 participants from the Mental
Health dataset. The code will then generate a CSV file called Name.csv. You need to change
the name of the file to your actual name and then submit it in the zip folder as a secondary
file.
# Load the following libraries so that they can be applied in the subsequent code blocks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will create a csv file containing a random sample of 300
# respondents. You will answer the questions below based on this sample.

# Look at the code below. Now replace 'Name.csv' with your actual name (e.g.,
# 'Sara.csv'). The code will generate a csv file that you need to submit in the
# zip folder as a secondary file.

try:
    df = pd.read_csv('Ali.csv')  # replace Name with your own name
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/mental_health_finaldata_1.csv")
    df1 = original_data.sample(300)
    df1.to_csv('Ali.csv')  # replace Name with your own name
    df = pd.read_csv('Ali.csv')  # replace Name with your own name
    df = pd.DataFrame(df)
    df.to_csv('Ali.csv')  # replace Name with your own name

df.head()

[Output: df.head() preview. The sampled DataFrame has 300 rows and the following columns: an index column ("Unnamed: 0", values 1-823), Age (4 categories, e.g., "16-20", "20-25", "25-30"), Gender ("Male"/"Female"), Occupation (5 categories, e.g., "Corporate", "Business"), Days_Indoors (5 categories, e.g., "1-14 days", "Go out Every day"), Mood_Swings (3 categories, e.g., "Medium", "Low"), and the Yes/No columns Growing_Stress, Quarantine_Frustrations, Changes_Habits, Mental_Health_History, Weight_Change, Coping_Struggles, Work_Interest, and Social_Weakness.]

Now, run the code below. It will return TWO variables, which represent different aspects of
mental health that you need to focus on.

# Load the following libraries so that they can be applied in the subsequent code blocks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import random

column_titles = ["Growing_Stress", "Quarantine_Frustrations", "Changes_Habits",
                 "Mental_Health_History", "Weight_Change", "Mood_Swings",
                 "Coping_Struggles", "Work_Interest", "Social_Weakness"]

# Randomly select 2 variables
selected_columns = random.sample(column_titles, 2)

# Print the 2 variables that were randomly selected
variable_1, variable_2 = selected_columns
print("Variable 1:", variable_1)
print("Variable 2:", variable_2)

Variable 1: Work_Interest
Variable 2: Changes_Habits

Using the sample dataset from the CSV file you generated, answer the following questions:
Question 1a. Is each of these two variables independent of being female? Explain your
reasoning. Make sure to include a two-way table for each of these two variables with
gender, and show all your calculations to support your answers.
Question 1b. Is there a relationship between the two variables returned by the code?
Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all
your probability calculations in your answer.


Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing
Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer
with the relevant statistical analysis.
Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to
answer the following question:
Imagine ZU wanted to use the insights from this research to improve its mental health
support program. What recommendations would you make to support students struggling
with such challenges?
Answer: Add more "markdown" text and code cells below as needed.

Question 1a. Is each of these two variables independent of being female? Explain your reasoning. Make sure to include a two-way table for each of these two variables with gender, and show all your calculations to support your answers.
# Create a two-way table for each variable with gender
table_1 = pd.crosstab(df['Gender'], df[variable_1])
table_2 = pd.crosstab(df['Gender'], df[variable_2])

# Chi-square test of independence for each table.
# stats.chi2_contingency computes the expected counts from the row and
# column totals and uses df = (rows - 1) * (columns - 1), i.e., df = 1
# for a 2x2 table.
chi_square_1, p_value_1, dof_1, expected_1 = stats.chi2_contingency(table_1)
chi_square_2, p_value_2, dof_2, expected_2 = stats.chi2_contingency(table_2)

# Print the results
print("Variable 1:")
print(table_1)
print("Chi-square statistic:", chi_square_1)
print("p-value:", p_value_1)

print("\nVariable 2:")
print(table_2)
print("Chi-square statistic:", chi_square_2)
print("p-value:", p_value_2)

# Interpret the results
if p_value_1 > 0.05 and p_value_2 > 0.05:
    print("\nBoth variables are independent of being female.")
else:
    print("\nAt least one variable is not independent of being female.")

Variable 1:
Work_Interest No Yes
Gender
Female 54 89
Male 62 95
Chi-square statistic: 0.0
p-value: 1.0

Variable 2:
Changes_Habits No Yes
Gender
Female 44 99
Male 47 110
Chi-square statistic: 0.0
p-value: 1.0

Both variables are independent of being female.

Question 1b. Is there a relationship between the two variables returned by the code? Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all your probability calculations in your answer.
# Create a two-way table for the two variables
two_way_table = pd.crosstab(df[variable_1], df[variable_2])

# Chi-square test of independence.
# stats.chi2_contingency computes the expected counts from the row and
# column totals and uses df = (rows - 1) * (columns - 1) = 1 for a 2x2 table.
chi_square, p_value, dof, expected_counts = stats.chi2_contingency(two_way_table)

# Print the results
print("Two-way table:")
print(two_way_table)
print("\nChi-square statistic:", chi_square)
print("p-value:", p_value)

# Interpret the results
if p_value > 0.05:
    print("\nThere is no relationship between the two variables.")
else:
    print("\nThere is a relationship between the two variables.")

# Create a stacked bar graph
df_stacked = df.groupby([variable_1, variable_2]).size().unstack()
df_stacked.plot(kind="bar", stacked=True)
plt.xlabel(variable_1)
plt.ylabel("Count")
plt.title("Stacked Bar Graph of " + variable_1 + " and " + variable_2)
plt.show()

Two-way table:
Changes_Habits No Yes
Work_Interest
No 43 73
Yes 48 136

Chi-square statistic: 30.63609275387544


p-value: 3.631826958798854e-06

There is a relationship between the two variables.
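The probability calculations the question asks for can also be done directly from the counts in the two-way table above; a minimal sketch comparing the observed joint probability with the product expected under independence (the counts 43, 73, 48, and 136 are taken from the printed table):

```python
# Counts from the two-way table above:
# rows = Work_Interest (No/Yes), columns = Changes_Habits (No/Yes)
counts = {("No", "No"): 43, ("No", "Yes"): 73,
          ("Yes", "No"): 48, ("Yes", "Yes"): 136}
total = sum(counts.values())  # 300 respondents

# Marginal probabilities
p_work_yes = (counts[("Yes", "No")] + counts[("Yes", "Yes")]) / total
p_habits_yes = (counts[("No", "Yes")] + counts[("Yes", "Yes")]) / total

# Joint probability vs. the product expected under independence
p_both_yes = counts[("Yes", "Yes")] / total
p_if_independent = p_work_yes * p_habits_yes

print(f"P(Work_Interest = Yes) = {p_work_yes:.4f}")
print(f"P(Changes_Habits = Yes) = {p_habits_yes:.4f}")
print(f"P(both Yes), observed = {p_both_yes:.4f}")
print(f"P(Yes) * P(Yes), if independent = {p_if_independent:.4f}")
```

If the two variables were independent, the observed joint probability would match the product of the marginals; here the observed value is noticeably larger, consistent with the conclusion that the variables are related.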


Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer with the relevant statistical analysis.
# Calculate the conditional probabilities
p_variable_2_given_variable_1 = df[df[variable_1] == "Yes"][variable_2].value_counts(normalize=True)["Yes"]
p_variable_2_given_not_variable_1 = df[df[variable_1] == "No"][variable_2].value_counts(normalize=True)["Yes"]

# Calculate the relative risk
relative_risk = p_variable_2_given_variable_1 / p_variable_2_given_not_variable_1

# Print the results
print("P(Variable 2 | Variable 1) =", p_variable_2_given_variable_1)
print("P(Variable 2 | not Variable 1) =", p_variable_2_given_not_variable_1)
print("Relative risk =", relative_risk)


# Interpret the results
if relative_risk > 1:
    print("\nThe existence of Variable 1 increases the likelihood of experiencing Variable 2 by a factor of", relative_risk)
else:
    print("\nThe existence of Variable 1 decreases the likelihood of experiencing Variable 2 by a factor of", 1/relative_risk)

P(Variable 2 | Variable 1) = 0.7391304347826086
P(Variable 2 | not Variable 1) = 0.6293103448275862
Relative risk = 1.1745086360929125

The existence of Variable 1 increases the likelihood of experiencing Variable 2 by a factor of 1.1745086360929125

Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to answer the following question:

Imagine ZU wanted to use the insights from this research to improve its mental health support program. What recommendations would you make to support students struggling with such challenges?

Based on the results of the statistical analysis, the following recommendations can be made to support students struggling with mental health challenges:

1. Provide support for all students:

• The analysis showed that both variables are independent of gender, suggesting that these challenges affect male and female students alike. Support programs should therefore be made broadly accessible to all students rather than targeted at one gender.

2. Promote awareness and education:


• The analysis showed that there is a relationship between the two variables. This
suggests that students who experience one mental health challenge are more likely
to experience another. Therefore, it is important to promote awareness and
education about mental health issues so that students can recognize the signs and
symptoms of these challenges and seek help early on.


3. Offer a variety of support services:


• The analysis showed that the existence of one variable increases the likelihood of
experiencing the other variable. This suggests that students who are struggling with
mental health challenges may need a variety of support services to address their
individual needs. These services could include counseling, therapy, medication, and
support groups.

4. Create a supportive environment:

• Because these challenges tend to co-occur, students who are struggling may need a supportive environment to help them cope. This could include creating a safe and welcoming space on campus, providing opportunities for social interaction, and connecting students with resources and support services.

5. Encourage students to seek help:

• Students facing these challenges may be reluctant to seek help. Therefore, it is important to encourage them to do so by providing information about mental health resources, reducing the stigma associated with mental health challenges, and making it easier for students to access mental health services.

By implementing these recommendations, ZU can create a more supportive environment for students struggling with mental health challenges and help them get the support they need to succeed.

B. Statistical Intuition in Store Ratings

Question 2:
Imagine you are the manager of an electronics store in Dubai Mall. You are curious about
the distribution of customer ratings of your overall store services, so you ask random
customers who visit the store to complete a short survey, recording variables such as their
age group and overall experience rating.
To Begin


Run the code below. It will provide you with a random sample of 40 customers from this
survey and save the sample to a CSV file called "RelianceRetailVisits_ordered.csv". Again,
you need to submit this file in the same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent code blocks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")

    # Randomly sample 40 rows from the original dataset
    df = original_data.sample(n=40, random_state=42)

    # Fill missing values in the 'Age Group' column with a default value
    df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

    # Sort the DataFrame based on the 'Age Group' column in the desired order
    desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years',
                     '46 To 60 years']  # Corrected unique values
    df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order,
                                     ordered=True)
    df.sort_values(by='Age Group', inplace=True)

    # Save the sorted DataFrame to a new CSV file
    df.to_csv('RelianceRetailVisits_ordered.csv', index=False)

df.head()

[Output: df.head() preview. The sampled DataFrame has 40 rows and three columns: Customer Index (integers between 10 and 221), Age Group (4 categories: "16 To 25 years", "26 To 35 years", "36 To 45 years", "46 To 60 years"), and OverallExperienceRatin (integer ratings from 1 to 5).]

Use the random sample of data from the csv file you generated to answer the following
questions:
Question 2a. Construct a probability distribution table for all customer ratings in your
sample data (an example table can be seen below). Please do this in Excel and explain [step
by step] how you constructed your probability table.
[Example table image: Screenshot 2024-02-25 at 6.38.29 PM.png]
Question 2b. What is the probability that a randomly selected customer will have a rating
of AT MOST 3?
Question 2c. Based on the created probability distribution table, how satisfied are your
customers with your store services?
Question 2d. Find the expected rating of your store. Show your work and interpret your
answer in context.
Run the code below. It will generate the probability distribution graph for all your
customers' satisfaction ratings along with the standard deviation.

Question 2a. Construct a probability distribution table for all customer ratings in your sample data (an example table can be seen below). Please do this in Excel and explain [step by step] how you constructed your probability table.
Step 1: Open the CSV file in Excel.
Step 2: Select the column containing the customer ratings.
Step 3: Use the COUNTIF function to count the number of times each rating appears in the
column. For example, for rating 1:
• Formula: =COUNTIF(C2:C41, 1)


• Explanation: This formula counts the number of times the rating '1' appears in the
range C2:C41.
Step 4: Use the SUM function to calculate the total number of ratings.
• Formula: =COUNTA(C2:C41)
• Explanation: This formula counts all non-empty cells in the range C2:C41, giving the
total number of ratings.
Step 5: Divide the count of each rating by the total number of ratings to get the probability
of each rating.
• Formula: =COUNTIF(C2:C41, 1) / COUNTA(C2:C41)
• Explanation: This formula divides the count of '1' ratings by the total count of all
ratings, resulting in the probability of getting a rating of '1'. This formula can be
adjusted by replacing the '1' with any other rating value to find the probability of
that rating occurring.
Step 6: Create a table with six columns and two rows: one for the ratings and one for the
probabilities.
The probability distribution table shows the probability of each customer rating occurring.
For example, the probability of a customer giving a rating of 1 is 0.025. The probability of a
customer giving a rating of 5 is 0.3.
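The steps above can also be done programmatically; a minimal pandas sketch (the ratings below are hypothetical stand-ins for the 'OverallExperienceRatin' column of the generated CSV):

```python
import pandas as pd

# Hypothetical ratings standing in for the 'OverallExperienceRatin' column;
# in the assignment you would instead load
# pd.read_csv('RelianceRetailVisits_ordered.csv')['OverallExperienceRatin'].
ratings = pd.Series([1, 2, 2, 3, 4, 4, 4, 5, 5, 5])

# value_counts(normalize=True) mirrors the Excel COUNTIF / COUNTA division:
# the count of each rating divided by the total number of ratings.
prob_table = ratings.value_counts(normalize=True).sort_index()
print(prob_table)
```

Each entry of `prob_table` is the probability of the corresponding rating, and the entries sum to 1, exactly like the Excel table.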
Question 2b. What is the probability that a randomly selected customer will have a rating of AT MOST 3?

To find the probability that a randomly selected customer will have a rating of AT MOST 3, we sum the probabilities of ratings 1, 2, and 3:

P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3) = 0.025 + 0.15 + 0.1 = 0.275

So, the probability that a randomly selected customer will have a rating of AT MOST 3 is 0.275 or 27.5%.
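This cumulative probability can be checked with a small sketch, assuming a probability table like the one described in Question 2a (P(4) = 0.425 is implied, since the five probabilities must sum to 1):

```python
# Probabilities for ratings 1-5 from the distribution described above
# (P(4) = 0.425 is implied, since the five probabilities must sum to 1).
probs = {1: 0.025, 2: 0.15, 3: 0.1, 4: 0.425, 5: 0.3}

# P(X <= 3) is the sum of the probabilities of ratings 1, 2, and 3
p_at_most_3 = round(sum(p for rating, p in probs.items() if rating <= 3), 3)
print("P(X <= 3) =", p_at_most_3)  # 0.275
```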
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from tabulate import tabulate

# Load data
try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")
    df = original_data.sample(n=40, random_state=42)

    # Fill missing values in the 'Age Group' column with a default value
    df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

    # Sort the DataFrame based on the 'Age Group' column in the desired order
    desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years',
                     '46 To 60 years']
    df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order,
                                     ordered=True)
    df.sort_values(by='Age Group', inplace=True)

    # Save the sorted DataFrame to a new CSV file
    df.to_csv('RelianceRetailVisits_ordered.csv', index=False)

# Probability distribution graph for customer rating
plt.figure(figsize=(8, 6))
rating_counts = df['OverallExperienceRatin'].value_counts(normalize=True).sort_index()
plt.bar(rating_counts.index, rating_counts, alpha=0.7)
plt.title('Probability Distribution of Customer Rating')
plt.xlabel('Overall Experience Rating')
plt.ylabel('Probability')
plt.xticks(range(1, 6))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Expected value and standard deviation of the rating for all customers
mean_rating = df['OverallExperienceRatin'].mean()
std_rating = df['OverallExperienceRatin'].std()
print(f"Standard Deviation (STD) of Customer Rating: {std_rating:.2f}")


Standard Deviation (STD) of Customer Rating: 1.11

Question 2c. Based on the created probability distribution graph for customer rating, how satisfied are your customers with your store services?

The probability distribution graph shows that the majority of customers gave a rating of 4 (42%) or 5 (30%), indicating that they are satisfied with the store services. However, a notable minority of customers (about 28%) gave a rating of 3 or lower, indicating that there is room for improvement.

Question 2d. Find the expected rating of your store. Show your work and interpret your answer in context.

The expected rating is the average rating that a customer would give, taking into account the probability of each rating. It is calculated by multiplying each rating by its probability and then summing the results.
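As a sketch, the weighted-sum definition E(X) = Σ x·P(x) can be computed directly from the probability table (the probabilities below are taken from the distribution described earlier, with P(4) = 0.425 implied so that the probabilities sum to 1):

```python
# Probabilities for ratings 1-5 from the probability distribution table
# (P(4) = 0.425 is implied, since the five probabilities must sum to 1)
probs = {1: 0.025, 2: 0.15, 3: 0.1, 4: 0.425, 5: 0.3}

# Expected value: multiply each rating by its probability, then sum
expected_rating = sum(rating * p for rating, p in probs.items())
print(f"Expected rating: {expected_rating:.3f}")
```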


In this case, using the probabilities from the distribution table (0.025, 0.15, 0.1, 0.425, and 0.3 for ratings 1 through 5):

expected_rating = (1 * 0.025) + (2 * 0.15) + (3 * 0.1) + (4 * 0.425) + (5 * 0.3) = 3.825

This means that the average customer would give a rating of about 3.8 on a scale of 1 to 5. This indicates that customers are generally satisfied with the store services, but there is room for improvement.

Question 2e. Interpret the Standard Deviation in context. What rating is considered unusual? Explain.

Standard Deviation (STD) of Customer Rating: 1.11

The standard deviation (STD) of 1.11 for customer ratings indicates that the ratings are relatively spread out. This means that there is a fair amount of variability in how customers rate the store.
To determine what rating is considered unusual, we can use the rule of thumb that values more than two standard deviations from the mean are unusual:

unusual_rating = mean_rating ± 2 × std_rating

Using the expected rating of 3.825 (computed from the probability distribution) and the standard deviation of 1.11, the usual range is:

3.825 ± 2 × 1.11 = 3.825 ± 2.22 = 1.61 to 6.05 (approximately)

Therefore, any rating below about 1.6 would be considered unusual; since the scale only goes up to 5, no rating can exceed the upper bound. In context, this means that a rating of 1 would be considered unusually low, while no rating would be considered unusually high.
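This rule-of-thumb check can be sketched in code; the mean is recomputed from the Question 2a probability table (with P(4) = 0.425 implied so the probabilities sum to 1), and the standard deviation is the 1.11 reported above:

```python
# Probabilities for ratings 1-5 from the distribution table in Question 2a
probs = {1: 0.025, 2: 0.15, 3: 0.1, 4: 0.425, 5: 0.3}
mean_rating = sum(r * p for r, p in probs.items())  # 3.825
std_rating = 1.11  # standard deviation reported by the code above

# Rule of thumb: a value more than 2 standard deviations from the mean is unusual
lower = mean_rating - 2 * std_rating
upper = mean_rating + 2 * std_rating
print(f"Usual range: {lower:.2f} to {upper:.2f}")

# Check each possible rating on the 1-5 scale
for rating in range(1, 6):
    status = "unusual" if (rating < lower or rating > upper) else "usual"
    print(rating, status)
```

On this scale only a rating of 1 falls outside the usual range, since the upper bound exceeds the maximum possible rating of 5.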
Run the code below. It will generate the probability distribution graphs for each of the age
groups along with their discrete probability distribution tables, the Expected values, and the
Standard Deviation values.


import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# Suppress the FixedFormatter warning raised by set_yticklabels below
warnings.filterwarnings("ignore", category=UserWarning)

# Load the survey data
data = pd.read_csv('RelianceRetailVisits_ordered.csv')

# Define the age groups to compare
age_groups = ['16 To 25 years', '26 To 35 years', '36 To 45 years',
              '46 To 60 years']

# Plot a separate discrete probability distribution for each age group
fig, axs = plt.subplots(1, 4, figsize=(20, 6), sharex=True,
                        gridspec_kw={'hspace': 0.5})

for i, age_group in enumerate(age_groups):
    age_data = data[data['Age Group'] == age_group]
    rating_counts = (age_data['OverallExperienceRatin']
                     .value_counts(normalize=True)
                     .sort_index())
    bars = axs[i].bar(rating_counts.index, rating_counts, alpha=0.7)
    axs[i].set_title(f'{age_group}\n'
                     f'Mean: {age_data["OverallExperienceRatin"].mean():.2f} | '
                     f'SD: {age_data["OverallExperienceRatin"].std():.2f}')
    axs[i].set_xlabel('Overall Experience Rating')
    axs[i].set_ylabel('Probability (%)')
    axs[i].set_xticks(range(1, 6))  # x-axis ticks from 1 to 5
    # Format y-axis tick labels as percentages
    axs[i].set_yticklabels(['{:,.0%}'.format(x) for x in axs[i].get_yticks()])

    # Display the percentage above each bar
    for bar in bars:
        height = bar.get_height()
        rating = bar.get_x() + bar.get_width() / 2
        label = '0%' if height == 0 else f'{height:.0%}'
        axs[i].text(rating, height, label, ha='center', va='bottom',
                    fontsize=8)

    axs[i].grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()


Question 2f. Identify any trends or differences in customer satisfaction levels (and
variability) among the different age groups.
Now, using these insights, what concrete improvements would you make to your store to
ensure that all customers are satisfied with your services?

Trends and Differences:

Satisfaction Levels:
• 16-25: Generally satisfied (mean rating of 4.00), but with a higher proportion of low
ratings (1 and 2) than the middle groups.
• 26-35: Also generally satisfied (mean rating of 4.00), with a relatively even
distribution of ratings.
• 36-45: Most satisfied age group (mean rating of 4.33), with the lowest proportion of
low ratings.
• 46-60: Least satisfied age group (mean rating of 2.71), with a higher proportion of
low ratings (1 and 2).


Variability:
• 16-25: Moderate variability in ratings (STD of 1.03).
• 26-35: Highest variability (STD of 1.18), indicating the widest range of experiences.
• 36-45: Lowest variability (STD of 0.52), suggesting the most consistent experience.
• 46-60: Moderate variability (STD of 0.95), so experiences still differ noticeably even
though the mean rating is low.

Improvements:

Target Older Customers:


• Address the specific needs and preferences of the 46-60 age group to improve their
satisfaction levels.
• Offer more senior discounts, accessible facilities, and products/services tailored to
their interests and needs.

Enhance Overall Experience:


• Focus on improving the overall shopping experience for all age groups.
• Offer a broader range of products, improve store layout and ambiance, and ensure a
consistent level of service.

Personalized Customer Service:


• Train staff to provide excellent customer service tailored to each customer's needs
and preferences.
• Offer personalized recommendations, address concerns promptly, and go the extra
mile to ensure satisfaction.

Loyalty Programs and Incentives:


• Implement loyalty programs or offer incentives to encourage repeat visits and
positive feedback from customers.
• Offer discounts, exclusive promotions, or early access to new products to build
loyalty.

Regular Feedback Collection:


• Continuously collect feedback from customers of all age groups to identify areas for
improvement and track customer satisfaction levels.
• Utilize surveys, feedback forms, or social media engagement to gather insights and
respond to customer feedback.

Consistency and Training:


• Implement regular training sessions to ensure staff are up-to-date with the latest
techniques and standards.
• Encourage a culture of continuous improvement and learning within the
organization.


Technology Integration:
• Explore opportunities to integrate technology to streamline processes and enhance
the customer experience.
• Implement self-checkout systems, mobile payment options, and online ordering for
added convenience.

Community Engagement:
• Engage with the local community to build rapport and foster a sense of belonging.
• Sponsor local events, participate in charity initiatives, and support community
causes to strengthen brand loyalty.

Sustainability Initiatives:
• Implement environmentally friendly practices to appeal to eco-conscious
customers.
• Reduce waste, use sustainable materials, and support green initiatives to
demonstrate corporate responsibility.

Data Analytics:
• Utilize data analytics to gain insights into customer behavior and preferences.
• Analyze sales trends, customer feedback, and demographic information to make
data-driven decisions.

Competitive Analysis:
• Conduct regular competitive analysis to stay informed about industry trends and
benchmark against competitors.
• Identify areas where the business can differentiate itself and capitalize on unique
selling points.

Crisis Management:
• Develop a comprehensive crisis management plan to handle unforeseen events
effectively.
• Train staff on emergency procedures, communication protocols, and customer
support during crises.

Expansion Opportunities:
• Explore opportunities for expansion into new markets or locations.
• Conduct market research to identify untapped areas of growth and develop strategic
expansion plans.

Employee Recognition and Motivation:


• Recognize and reward employees for their hard work and dedication.
• Implement incentive programs, performance bonuses, and employee appreciation
events to boost morale.


Collaboration with Suppliers:


• Foster strong relationships with suppliers to ensure timely delivery of high-quality
products.
• Negotiate favorable terms and collaborate on joint marketing efforts to mutual
benefit.

Health and Safety Measures:


• Prioritize the health and safety of both customers and staff.
• Implement rigorous cleaning protocols, provide personal protective equipment, and
enforce social distancing measures.

Adaptability and Flexibility:


• Remain adaptable and flexible to respond quickly to changing market conditions
and customer needs.
• Embrace innovation and experimentation to stay ahead of the curve in a dynamic
business environment.
Answer: Add more "markdown" text cells below as needed.

C. Statistical Intuition in SAT Exams

Question 3:
Imagine you are working for a prestigious university in the UAE. It is your job to decide
which students are admitted to the university. To help you do this, you analyze the high
school (SAT) scores of potential students. These scores help you understand their academic
readiness and potential for success at the university.
You have just received the scores of applicants who would like to join the university in
September 2024. These scores follow a normal distribution.
To Begin.
Run the code below. It will generate a dataset with the students scores. It will also calculate
the mean (μ) and standard deviation (σ) of these scores. This dataset will be saved as a
CSV file called "Scores.csv". Again, you need to submit this file in the same zip folder as
your other files.
# Load the following libraries so that they can be applied in the subsequent
code blocks

import pandas as pd
import numpy as np
import random


try:
SATScores = pd.read_csv('Scores.csv')
except FileNotFoundError:
num_samples = 1000
mean_score = random.randint(800, 1200)
std_deviation = random.randint(100, 300)
scores = np.random.normal(mean_score, std_deviation, num_samples)
scores = np.round(scores, 0)
SATScores = pd.DataFrame({'Scores': scores})
SATScores.to_csv('Scores.csv')

# Calculate mean and standard deviation


mean_score = SATScores['Scores'].mean()
std_deviation = SATScores['Scores'].std()

# Print mean score and standard deviation


print("Mean score:", mean_score)
print("Standard deviation:", std_deviation)

# Display the dataset


SATScores.head()

Mean score: 1030.945


Standard deviation: 183.50062248791815

[Output: preview of the SATScores dataframe (1000 rows, one 'Scores' column, values ranging from 427 to 1570).]

Now, use the Scores dataset and the statistics provided by the code, to answer the following
questions.
IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to your
conclusions.
• If you use online calculators then please include screenshots of those calculators as
part of your work.
• Please do not use code to solve these questions. The questions are designed to test your
understanding.
Question 3a. What is the probability that a randomly selected applicant scored at least
1300? Show your work.


Question 3b. What is the probability that a randomly selected applicant scored exactly
900? Show your work.
Question 3c. What percentage of applicants scored between 900 and 1000? Show your
work.
Question 3d. Calculate the 40th percentile of scores among the applicants. What does this
value represent in the context of the admissions process? Show your work.
Question 3e. Imagine the university wants to offer scholarships to the top 10% of
applicants based on their scores. What minimum score would an applicant need to qualify
for a scholarship? Show your work.
Question 3f. Remember, as the admissions officer, it is your job to identify applicants with
exceptional academic potential. Would you automatically recommend that applicants with
SAT scores above 1400 be admitted into the university? Or do you think additional
criteria should also be considered? Explain your reasoning.
Answer: Add more "markdown" text and code cells below as needed.
Question 3a. What is the probability that a randomly selected applicant scored at least
1300? Show your work.
Answer: To calculate the probability that a randomly selected applicant scored at least
1300, we can use the standard normal distribution (Z-score) formula: Z = (X - μ) / σ, where:
• Z is the Z-score
• X is the score we are interested in (1300)
• μ is the mean score
• σ is the standard deviation
Plugging in the statistics printed by the code above: X = 1300, μ = 1030.945, σ ≈ 183.501
Z = (1300 - 1030.945) / 183.501 ≈ 1.466
Using an online calculator, we can find that the probability of a Z-score of 1.466 or
higher is approximately 0.0713. Therefore, the probability that a randomly selected
applicant scored at least 1300 is approximately 0.0713, or 7.13%.
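The assignment asks for this to be worked by hand, but as a hedged cross-check, scipy's survival function computes P(X ≥ 1300) directly from the mean and standard deviation printed by the code earlier.

```python
from scipy.stats import norm

# Cross-check only: P(X >= 1300) from the printed dataset statistics.
mu, sigma = 1030.945, 183.50062248791815
p_at_least_1300 = norm.sf(1300, loc=mu, scale=sigma)
print(round(p_at_least_1300, 4))
```

`norm.sf(x)` is `1 - norm.cdf(x)`, computed in a numerically stable way.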

Question 3b. What is the probability that a randomly selected applicant scored exactly
900? Show your work.
Answer: Because the scores follow a normal distribution, which is continuous, the
probability that an applicant scored exactly 900 is zero:
P(X = 900) = 0
A continuous distribution assigns probability to intervals of values, not to individual
points. If we instead asked for the probability of a score rounding to 900 (between 899.5
and 900.5, since the scores are rounded to whole numbers), the area under the curve over
that narrow interval is approximately 0.0017, or about 0.17%, which is small but nonzero.

Question 3c. What percentage of applicants scored between 900 and 1000? Show your work.
Answer: To calculate the percentage of applicants who scored between 900 and 1000, we
can use the standard normal distribution (Z-score) formula: Z = (X - μ) / σ, with
μ = 1030.945 and σ ≈ 183.501.
For a score of 900: Z1 = (900 - 1030.945) / 183.501 ≈ -0.714
For a score of 1000: Z2 = (1000 - 1030.945) / 183.501 ≈ -0.169
Using an online calculator, we can find that the probability of a Z-score between -0.714
and -0.169 is approximately 0.433 - 0.238 = 0.195. Therefore, approximately 19.5% of
applicants scored between 900 and 1000.
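Again as a hedged cross-check of the hand calculation, the same interval probability can be computed as a difference of CDFs, using the dataset statistics printed earlier.

```python
from scipy.stats import norm

# Cross-check only: P(900 < X < 1000) as a difference of CDFs.
mu, sigma = 1030.945, 183.50062248791815
p_between = norm.cdf(1000, mu, sigma) - norm.cdf(900, mu, sigma)
print(round(p_between, 4))
```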

Question 3d. Calculate the 40th percentile of scores among the applicants. What does this
value represent in the context of the admissions process? Show your work.
Answer: The 40th percentile of scores among the applicants is the score below which 40%
of the applicants fall. To calculate the 40th percentile, we can use the following formula:
Percentile = μ + (Z * σ), where:
• μ is the mean score (1030.945)
• σ is the standard deviation (≈ 183.501)
• Z is the Z-score corresponding to the desired percentile
Using an online calculator, we can find that the Z-score corresponding to the 40th
percentile is approximately -0.253.
Percentile_40th = 1030.945 + (-0.253 * 183.501) ≈ 984.5
Therefore, the 40th percentile of scores among the applicants is approximately 984.5. In
the context of the admissions process, this means that 40% of the applicants scored below
about 984.5, and 60% scored above it.
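As a hedged cross-check (the hand calculation uses a rounded Z-score), scipy's inverse CDF gives the percentile directly from the statistics printed by the code earlier.

```python
from scipy.stats import norm

# Cross-check only: 40th percentile via the inverse CDF (ppf).
mu, sigma = 1030.945, 183.50062248791815
p40 = norm.ppf(0.40, loc=mu, scale=sigma)
print(round(p40, 1))
```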


Question 3e. Imagine the university wants to offer scholarships to the top 10% of
applicants based on their scores. What minimum score would an applicant need to qualify
for a scholarship? Show your work.
Answer: The top 10% of applicants are those who scored higher than the 90th percentile.
To calculate the 90th percentile, we can use the following formula: Percentile = μ + (Z * σ),
where:
• μ is the mean score (1030.945)
• σ is the standard deviation (≈ 183.501)
• Z is the Z-score corresponding to the desired percentile
Using an online calculator, we can find that the Z-score corresponding to the 90th
percentile is approximately 1.282.
Percentile_90th = 1030.945 + (1.282 * 183.501) ≈ 1266.2
Therefore, the minimum score an applicant would need to qualify for a scholarship is
approximately 1266.
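As a hedged cross-check of the hand calculation, the scholarship cutoff (90th percentile) can also be read off the inverse CDF, using the statistics printed by the code earlier.

```python
from scipy.stats import norm

# Cross-check only: the 90th-percentile scholarship cutoff via ppf.
mu, sigma = 1030.945, 183.50062248791815
cutoff = norm.ppf(0.90, loc=mu, scale=sigma)
print(round(cutoff, 1))
```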

Question 3f. Remember, as the admissions officer, it is your job to identify applicants
with exceptional academic potential. Would you automatically recommend that applicants
with SAT scores above 1400 be admitted into the university? Or do you think additional
criteria should also be considered? Explain your reasoning.
Answer: While SAT scores are an important factor in assessing an applicant's academic
potential, they should not be the only criterion considered for admission. There are a
number of other factors that can also be considered, such as:
• High school transcript
• GPA
• Class rank
• Letters of recommendation
• Personal statement
• Extracurricular activities
• Standardized test scores (other than the SAT)
By considering a wider range of factors, the admissions committee can get a more holistic
view of each applicant and make more informed decisions about who to admit. In addition,
it is important to remember that SAT scores can vary significantly from year to year and
from school to school; a score of 1400 earned at one school may not be equivalent to the
same score at another. Therefore, I would not automatically recommend that applicants
with SAT scores above 1400 be admitted into the university; I would also weigh the other
factors listed above before making a decision.

D. Statistical Intuition in Public Health

Question 4:
Now imagine that it is year 2034 and you are working as a public health researcher in the
UAE. You are working on a project to assess vaccination coverage for a new global
pandemic. The UAE government has implemented a widespread vaccination campaign to
combat the spread of the virus and achieve herd immunity. You want to determine the
proportion of individuals who have received the new vaccine among a sample of 100
residents in different parts of the country.
To Begin.
Run the code below. It will provide you with a random sample of 100 residents. It will save
this data to a CSV file called "Vaccinated.csv". Again, you need to submit this file in the
same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent
code blocks

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will generate data and save it to a CSV file called
"Vaccinated.csv". You need to submit it in the same zip folder as your other
files.

try:
Vaccinated = pd.read_csv('Vaccinated.csv')
except FileNotFoundError:
num_samples = 100
vaccinated = np.random.choice(["Yes", "No"], size=num_samples)
Vaccinated = pd.DataFrame({'Vaccinated': vaccinated})
Vaccinated.to_csv('Vaccinated.csv')

# Have a look at Vaccinated dataset.


Vaccinated.head()

[Output: preview of the Vaccinated dataframe (100 rows, one 'Vaccinated' column with values "Yes" and "No").]

Now, use the dataset to answer the following questions.


IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to
your conclusions.
• Please do not use code to solve these questions. The questions are designed to test
your understanding.
Question 4a. What is the proportion of people who have received the vaccine (based on
the dataset you have)?
Question 4b. Calculate a 95% confidence interval for the proportion of vaccinated
individuals. What does this interval tell us about the likely range of vaccination coverage in
the entire population? Show your work.
Question 4c. What sample size would be required to estimate the proportion of vaccinated
individuals in the country with a 95% confidence level and a margin of error of 0.02?
Show your work.
Question 4d. If you wanted to increase the precision of your estimate, what strategies
could you employ to achieve this goal? Explain your reasoning.
Question 4e. Analyze the effectiveness of the current vaccination campaign using the
proportion of vaccinated individuals and the confidence interval. What recommendations
would you make for future campaigns?
Answer: Add more "markdown" text and code cells below as needed.

Question 4a. Proportion of Vaccinated Individuals


The dataset consists of 100 individuals, out of which 47 have been vaccinated.
• Proportion of vaccinated individuals:
• p = Number of vaccinated individuals / Total number of individuals
• p = 47 / 100 = 0.47
So, 47% of the individuals in our dataset are vaccinated.

Question 4b. 95% Confidence Interval for the Proportion


First, calculate the standard error (SE):
• SE = sqrt((p * (1 - p)) / n)
• SE = sqrt((0.47 * (1 - 0.47)) / 100)


• SE = sqrt(0.2491 / 100) = sqrt(0.002491) ≈ 0.0499


Then, calculate the 95% confidence interval using a Z-score of 1.96:
• CI = p ± Z * SE
• CI = 0.47 ± 1.96 * 0.0499
• CI ≈ 0.47 ± 0.0978
• CI ≈ (0.3722, 0.5678)
This means we are 95% confident that the true proportion of vaccinated individuals in the
population lies between 37.22% and 56.78%.
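The interval worked above can be sketched in a few lines of Python. This is the standard Wald interval for a proportion, using the same p = 0.47, n = 100, and z = 1.96 as the hand calculation.

```python
import math

# Wald 95% confidence interval for a proportion: p ± z * sqrt(p(1-p)/n).
p_hat, n, z = 0.47, 100, 1.96
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - z * se, p_hat + z * se
print(f"SE = {se:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The Wald interval is a reasonable approximation here because n*p and n*(1-p) are both well above 10.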

Question 4c. Required Sample Size for a Margin of Error of 0.02


Given a margin of error E = 0.02, and using our p value:
• n = (Z^2 * p * (1 - p)) / E^2
• n = (1.96^2 * 0.47 * (1 - 0.47)) / 0.02^2
• n ≈ (3.8416 * 0.2491) / 0.0004 ≈ 2392.36
Rounding up, you would need a sample size of approximately 2393 individuals to estimate
the proportion of vaccinated individuals within a margin of error of 0.02 with 95%
confidence.
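The sample-size formula worked above can be sketched as follows, rounding up because a sample cannot include a fraction of a person.

```python
import math

# Required sample size for a proportion estimate:
# n = z^2 * p * (1 - p) / E^2, rounded up to the next whole person.
p_hat, z, margin = 0.47, 1.96, 0.02
n_required = math.ceil(z**2 * p_hat * (1 - p_hat) / margin**2)
print(n_required)  # 2393
```

A common conservative variant uses p = 0.5 (the worst case), which here would give a slightly larger n of 2401.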

Question 4d. Strategies to Increase Precision


To increase the precision of your estimate, consider:
• Increasing the sample size beyond 2393.
• Ensuring the sample is representative of the entire population, possibly through
stratified sampling.
• Reducing the margin of error further, though this often requires a larger sample
size.

Question 4e. Analysis and Recommendations


Given a 47% vaccination rate with a 95% confidence interval of (37.22%, 56.78%), the
vaccination coverage might be below desired levels for achieving herd immunity or other
public health objectives. Recommendations for future campaigns might include:
• Targeted outreach to unvaccinated populations to increase coverage.
• Increased accessibility to vaccination centers, such as mobile clinics or extended
hours.
• Educational campaigns to address vaccine hesitancy and misinformation.


This analysis provides a concrete look at the vaccination campaign's current standing and
suggests areas for potential improvement to increase vaccination rates.

