Project 1 AI
MR
Student 1
Quick Submit
Document Details
Submission ID: trn:oid:::1:2866706153
Pages: 29
Download Date
File Name: Assignment_2_Spring_2024.docx
File Size: 125.4 KB
27% detected as AI
Caution: Percentage may not indicate academic misconduct. Review required.
Our testing has found that there is a higher incidence of false positives when the percentage is less than 20. In order to reduce the
likelihood of misinterpretation, the AI indicator will display an asterisk for percentages less than 20 to call attention to the fact that
the score is less reliable.
However, the final decision on whether any misconduct has occurred rests with the reviewer/instructor. They should use the
percentage as a means to start a formative conversation with their student and/or use it to examine the submitted assignment in
greater detail according to their school's policies.
Non-qualifying text, such as bullet points, annotated bibliographies, etc., will not be processed and can create disparity between the submission highlights and the
percentage shown.
In a longer document with a mix of authentic writing and AI generated text, it can be difficult to exactly determine where the AI writing begins and original writing
ends, but our model should give you a reliable guide to start conversations with the submitting student.
Disclaimer
Our AI writing assessment is designed to help educators identify text that might be prepared by a generative AI tool. Our AI writing assessment may not always be accurate (it may misidentify
both human and AI-generated text) so it should not be used as the sole basis for adverse actions against a student. It takes further scrutiny and human judgment in conjunction with an
organization's application of its specific academic policies to determine whether any academic misconduct has occurred.
2. All generated csv and .ipynb files must be submitted in a zip-folder as a secondary
source.
3. Ensure your zip folder contains four csv files (i.e., the csv files for each of the four
questions below; YourName.csv, RelianceRetailVisits_ordered.csv, Scores.csv,
Vaccinated.csv).
5. You should ONLY use the concepts and techniques covered in the course to
generate your answers. Statistical techniques that are NOT covered in the
course will NOT be evaluated.
Note: Reach out to your instructor with any questions regarding the csv files, code, or the zip folder.
Non-compliance with the above instructions will result in a 0 grade on the relevant
portions of the assignment. Your instructor will grade your assignment based on what you
submitted. Failure to submit the assignment or submitting an assignment intended for
another class will result in a 0 grade, and resubmission will not be allowed. Make sure that
you submit your original work. Suspected cases of plagiarism will be treated as potential
academic misconduct and will be reported to the College Academic Integrity Committee for
a formal investigation. As part of this procedure, your instructor may require you to meet
with them for an oral exam on the assignment.
Question 1:
We are going to work with a dataset that was collected on mental health issues. In total, 824 individuals (teenagers, college students, housewives, business professionals, and other groups) completed the survey. Their data provides valuable insights into the prevalence of, and factors associated with, mental health issues across different groups.
To Begin.
Run the code below. It will select a random sample of 300 participants from the Mental Health dataset and generate a CSV file called Name.csv. Change the name of the file to your actual name and then submit it in the zip folder as a secondary file.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will create a csv file containing a random sample of 300
# respondents. You will answer the questions below based on this sample.
# Replace 'Ali.csv' with your actual name (e.g., 'Sara.csv'). The code will
# generate a csv file that you need to submit in the zip folder as a secondary file.
try:
    df = pd.read_csv('Ali.csv')  # replace Ali with your own name
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/mental_health_finaldata_1.csv")
    df1 = original_data.sample(300)
    df1.to_csv('Ali.csv', index=False)  # replace Ali with your own name
    df = pd.read_csv('Ali.csv')  # replace Ali with your own name

df.head()
Now, run the code below to return TWO variables, which represent different aspects of mental health that you need to focus on.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats
Variable 1: Work_Interest
Variable 2: Changes_Habits
Using the sample dataset from the CSV file you generated, answer the following questions:
Question 1a. Is each of these two variables independent of being female? Explain your
reasoning. Make sure to include a two-way table for each of these two variables with
gender, and show all your calculations to support your answers.
Question 1b. Is there a relationship between the two variables returned by the code?
Explain your reasoning. Make sure you include a two-way table, a stacked bar graph, and all
your probability calculations in your answer.
Question 1c. Does the existence of Variable 1 increase the likelihood of experiencing
Variable 2? If so, by how much? Explain your reasoning. Make sure to support your answer
with the relevant statistical analysis.
Question 1d. Look back at your answers to Questions 1a-c. Now use what you learned to
answer the following question:
Imagine ZU wanted to use the insights from this research to improve its mental health
support program. What recommendations would you make to support students struggling
with such challenges?
Answer: Add more "markdown" text and code cells below as needed.
# Calculate the expected counts for each cell in the two-way table
# (outer product of row and column totals, divided by the grand total)
expected_female_counts = np.outer(female_counts.sum(axis=1), female_counts.sum(axis=0)) / female_counts.sum().sum()
expected_male_counts = np.outer(male_counts.sum(axis=1), male_counts.sum(axis=0)) / male_counts.sum().sum()

print("\nVariable 2:")
print(male_counts)
Variable 1:
Work_Interest No Yes
Gender
Female 54 89
Male 62 95
Chi-square statistic: 0.0
p-value: 1.0
Variable 2:
Changes_Habits No Yes
Gender
Female 44 99
Male 47 110
Chi-square statistic: 0.0
p-value: 1.0
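The chi-square values reported above (statistic 0.0, p-value 1.0 for both variables) can be cross-checked directly from the two-way tables; a statistic of exactly 0.0 is unusual for real counts and may indicate a computation issue. A minimal verification sketch using scipy's `chi2_contingency` on the Variable 1 counts shown above (this is a checking aid, not part of the graded answer):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Two-way table of Work_Interest by Gender, using the counts reported above
table = pd.DataFrame({"No": [54, 62], "Yes": [89, 95]},
                     index=["Female", "Male"])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under the independence hypothesis
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi-square: {chi2:.3f}, p-value: {p:.3f}, dof: {dof}")
```

A large p-value here would support treating Work_Interest as independent of gender in this sample.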
# Calculate the expected counts for each cell in the two-way table
# (outer product of row and column totals, divided by the grand total)
expected_counts = np.outer(two_way_table.sum(axis=1), two_way_table.sum(axis=0)) / two_way_table.sum().sum()
Two-way table:
Changes_Habits No Yes
Work_Interest
No 43 73
Yes 48 136
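Question 1c asks whether Variable 1 increases the likelihood of Variable 2. Using the two-way table above, this amounts to comparing the overall P(Changes_Habits = Yes) with the conditional P(Changes_Habits = Yes | Work_Interest = Yes). A minimal sketch, assuming the counts shown:

```python
import pandas as pd

# Two-way table: Work_Interest (rows) by Changes_Habits (columns),
# with the counts from the table above
table = pd.DataFrame({"No": [43, 48], "Yes": [73, 136]},
                     index=["No", "Yes"])  # index = Work_Interest

total = table.values.sum()                          # 300 respondents
p_changes = table["Yes"].sum() / total              # P(Changes_Habits = Yes)
p_changes_given_work = table.loc["Yes", "Yes"] / table.loc["Yes"].sum()

print(f"P(Changes=Yes)            = {p_changes:.3f}")
print(f"P(Changes=Yes | Work=Yes) = {p_changes_given_work:.3f}")
```

With these counts, P(Changes=Yes | Work=Yes) = 136/184 ≈ 0.739 versus an overall P(Changes=Yes) = 209/300 ≈ 0.697, so in this sample having work interest is associated with a somewhat higher rate of habit changes.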
Question 2:
Imagine you are the manager of an electronics store in Dubai Mall. You are curious about the distribution of customer ratings of your overall store services, so you ask random customers who visit the store to complete a short survey, recording variables such as their age group and overall experience rating.
To Begin
Run the code below. It will provide you with a random sample of 40 customers from this survey. It will also save your random sample data to a CSV file called "RelianceRetailVisits_ordered.csv". Again, you need to submit this file in the same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")
    df = original_data.sample(n=40, random_state=42)  # random sample of 40 customers

# Fill missing values in the 'Age Group' column rather than removing NaN rows
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years',
                 '46 To 60 years']  # corrected unique values
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order, ordered=True)
df.sort_values(by='Age Group', inplace=True)

df.to_csv('RelianceRetailVisits_ordered.csv', index=False)  # file to submit in the zip folder
df.head()
Use the random sample of data from the csv file you generated to answer the following
questions:
Question 2a. Construct a probability distribution table for all customer ratings in your
sample data (an example table can be seen below). Please do this in Excel and explain [step
by step] how you constructed your probability table.
[Image: example probability distribution table]
Question 2b. What is the probability that a randomly selected customer will have a rating
of AT MOST 3?
Question 2c. Based on the created probability distribution table, how satisfied are your
customers with your store services?
Question 2d. Find the expected rating of your store. Show your work and interpret your
answer in context.
Run the code below. It will generate the probability distribution graph for all your customers' satisfaction ratings and the Standard Deviation.
• Explanation: This formula counts the number of times the rating '1' appears in the
range C2:C
Step 4: Use the SUM function to calculate the total number of ratings.
• Formula: =COUNTA(C2:C41)
• Explanation: This formula counts all non-empty cells in the range C2:C41, giving the
total number of ratings.
Step 5: Divide the count of each rating by the total number of ratings to get the probability
of each rating.
• Formula: =COUNTIF(C2:C41, 1) / COUNTA(C2:C41)
• Explanation: This formula divides the count of '1' ratings by the total count of all
ratings, resulting in the probability of getting a rating of '1'. This formula can be
adjusted by replacing the '1' with any other rating value to find the probability of
that rating occurring.
Step 6: Create a table with six columns and two rows: one for the ratings and one for the
probabilities.
The probability distribution table shows the probability of each customer rating occurring.
For example, the probability of a customer giving a rating of 1 is 0.025. The probability of a
customer giving a rating of 5 is 0.3.
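The Excel steps above (a COUNTIF for each rating divided by a COUNTA total) have a compact pandas equivalent. A minimal sketch with a hypothetical ratings column standing in for column C of the spreadsheet; in the actual workflow the values would come from the rating column of the generated CSV:

```python
import pandas as pd

# Hypothetical ratings; replace with the rating column of your own sample
ratings = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5, 4])

# value_counts(normalize=True) divides each rating's count by the total,
# mirroring the COUNTIF / COUNTA steps described above
prob_table = ratings.value_counts(normalize=True).sort_index()
print(prob_table)
```

`prob_table` then holds one probability per rating value, i.e., the probability distribution table in Series form.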
Question 2b. What is the probability that a randomly selected customer will have a rating of AT MOST 3?

To find the probability that a randomly selected customer will have a rating of AT MOST 3, we sum the probabilities of ratings 1, 2, and 3:

P(X ≤ 3) = P(X = 1) + P(X = 2) + P(X = 3)
P(X ≤ 3) = 0.025 + 0.15 + 0.1 = 0.275

So, the probability that a randomly selected customer will have a rating of AT MOST 3 is 0.275 or 27.5%.
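The same cumulative calculation can be sketched in a few lines of Python, using the three probabilities stated above:

```python
# Probabilities of ratings 1, 2, and 3, as read from the distribution table above
p = {1: 0.025, 2: 0.15, 3: 0.10}

# P(X <= 3) is the sum of the point probabilities for ratings 1 through 3
p_at_most_3 = sum(p.values())
print(f"P(X <= 3) = {p_at_most_3:.3f}")
```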
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from tabulate import tabulate

# Load data
try:
    df = pd.read_csv('RelianceRetailVisits.csv')
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/DanaSaleh1003/IDS-103-Spring-2024/main/RelianceRetailVisits-1.csv")
    df = original_data.sample(n=40, random_state=42)

# Fill missing values in the 'Age Group' column rather than removing NaN rows
df.fillna({'Age Group': '46 To 60 years'}, inplace=True)

# Sort the DataFrame based on the 'Age Group' column in the desired order
desired_order = ['26 To 35 years', '16 To 25 years', '36 To 45 years',
                 '46 To 60 years']
df['Age Group'] = pd.Categorical(df['Age Group'], categories=desired_order, ordered=True)
df.sort_values(by='Age Group', inplace=True)
Question 2d. Find the expected rating of your store. Show your work
and interpret your answer in context.
The expected rating is the average rating that a customer would give, taking into account
the probability of each rating. It is calculated by multiplying each rating by its probability
and then summing the results.
unusual rating range = 3.15 ± 2 × 1.11
= 3.15 ± 2.22
= 0.93 to 5.37

Therefore, any rating below 0.93 or above 5.37 would be considered unusual. In context, since the rating scale only runs from 1 to 5, every possible rating falls inside this range, so no rating would be flagged as unusual by the two-standard-deviation rule.
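The whole calculation above (expected rating, standard deviation, and the mean ± 2 SD "usual" range) can be sketched compactly. The probabilities below are hypothetical, since each student's random sample yields different values; substitute the ones from your own distribution table:

```python
import numpy as np

# Hypothetical probability distribution of ratings; probabilities must sum to 1
ratings = np.array([1, 2, 3, 4, 5])
probs   = np.array([0.10, 0.15, 0.25, 0.30, 0.20])

expected = (ratings * probs).sum()                      # E[X] = sum of x * P(x)
variance = ((ratings - expected) ** 2 * probs).sum()    # Var[X] = sum of (x-E[X])^2 * P(x)
std = np.sqrt(variance)

# Values more than 2 standard deviations from the mean are flagged as unusual
low, high = expected - 2 * std, expected + 2 * std
print(f"E[X] = {expected:.2f}, SD = {std:.2f}, usual range = [{low:.2f}, {high:.2f}]")
```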
Run the code below. It will generate the probability distribution graphs for each of the age groups along with their discrete probability distribution tables, the Expected values, and the Standard Deviation values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
plt.tight_layout()
plt.show()
Question 2f. Identify any trends or differences in customer satisfaction levels (and
variability) among the different age groups.
Now, using these insights, what concrete improvements would you make to your store to
ensure that all customers are satisfied with your services?
Satisfaction Levels:
• 16-25: Generally satisfied (mean rating of 4.00), but with a higher proportion of low
ratings (1 and 2) compared to other groups.
• 26-35: High satisfaction (mean rating of 4.00), with a relatively even distribution of ratings.
• 36-45: The most satisfied age group (mean rating of 4.33), with a lower proportion of low ratings.
• 46-60: Least satisfied age group (mean rating of 2.71), with a higher proportion of
low ratings (1 and 2).
Variability:
• 16-25: Moderate variability in ratings (STD of 1.03).
• 26-35: Highest variability (STD of 1.18), indicating a wider range of experiences.
• 36-45: Lowest variability (STD of 0.52), suggesting a more consistent experience.
• 46-60: High variability (STD of 0.95), indicating significant variation in experiences.
Improvements:
Technology Integration:
• Explore opportunities to integrate technology to streamline processes and enhance
the customer experience.
• Implement self-checkout systems, mobile payment options, and online ordering for
added convenience.
Community Engagement:
• Engage with the local community to build rapport and foster a sense of belonging.
• Sponsor local events, participate in charity initiatives, and support community
causes to strengthen brand loyalty.
Sustainability Initiatives:
• Implement environmentally friendly practices to appeal to eco-conscious
customers.
• Reduce waste, use sustainable materials, and support green initiatives to
demonstrate corporate responsibility.
Data Analytics:
• Utilize data analytics to gain insights into customer behavior and preferences.
• Analyze sales trends, customer feedback, and demographic information to make
data-driven decisions.
Competitive Analysis:
• Conduct regular competitive analysis to stay informed about industry trends and
benchmark against competitors.
• Identify areas where the business can differentiate itself and capitalize on unique
selling points.
Crisis Management:
• Develop a comprehensive crisis management plan to handle unforeseen events
effectively.
• Train staff on emergency procedures, communication protocols, and customer
support during crises.
Expansion Opportunities:
• Explore opportunities for expansion into new markets or locations.
• Conduct market research to identify untapped areas of growth and develop strategic
expansion plans.
Question 3:
Imagine you are working for a prestigious university in the UAE. It is your job to decide
which students are admitted to the university. To help you do this, you analyze the high
school (SAT) scores of potential students. These scores help you understand their academic
readiness and potential for success at the university.
You have just received the scores of applicants who would like to join the university in
September 2024. These scores follow a normal distribution.
To Begin.
Run the code below. It will generate a dataset with the students' scores and calculate the mean (μ) and standard deviation (σ) of these scores. This dataset will be saved as a CSV file called "Scores.csv". Again, you need to submit this file in the same zip folder as your other files.
# Load the following libraries so that they can be applied in the subsequent code blocks
import pandas as pd
import numpy as np
import random

try:
    SATScores = pd.read_csv('Scores.csv')
except FileNotFoundError:
    num_samples = 1000
    mean_score = random.randint(800, 1200)
    std_deviation = random.randint(100, 300)
    scores = np.random.normal(mean_score, std_deviation, num_samples)
    scores = np.round(scores, 0)
    SATScores = pd.DataFrame({'Scores': scores})
    SATScores.to_csv('Scores.csv')

# Report the mean and standard deviation of the scores
print("mean (μ):", SATScores['Scores'].mean())
print("standard deviation (σ):", SATScores['Scores'].std())
Now, use the Scores dataset and the statistics provided by the code, to answer the following
questions.
IMPORTANT:
• Make sure to support your answers by explaining and showing how you came to your
conclusions.
• If you use online calculators then please include screenshots of those calculators as
part of your work.
• Please do not use code to solve these questions. The questions are designed to test your
understanding.
Question 3a. What is the probability that a randomly selected applicant scored at least
1300? Show your work.
Question 3b. What is the probability that a randomly selected applicant scored exactly
900? Show your work.
Question 3c. What percentage of applicants scored between 900 and 1000? Show your
work.
Question 3d. Calculate the 40th percentile of scores among the applicants. What does this
value represent in the context of the admissions process? Show your work.
Question 3e. Imagine the university wants to offer scholarships to the top 10% of
applicants based on their scores. What minimum score would an applicant need to qualify
for a scholarship? Show your work.
Question 3f. Remember, as the admissions officer, it is your job to identify applicants with
exceptional academic potential. Would you automatically recommend that applicants with
SAT scores above 1400 to be admitted into the university? Or do you think additional
criteria should also be considered? Explain your reasoning.
Answer: Add more "markdown" text and code cells below as needed.
Question 3a. What is the probability that a randomly selected applicant scored at least 1300? Show your work.

Answer: To calculate the probability that a randomly selected applicant scored at least 1300, we can use the standard normal distribution (Z-score) formula: Z = (X - μ) / σ, where:
• Z is the Z-score
• X is the score we are interested in (1300)
• μ is the mean score
• σ is the standard deviation

Plugging in the values we know: X = 1300, μ = 1016.566, σ = 299.8892448941829

Z = (1300 - 1016.566) / 299.889 ≈ 0.945

Using an online calculator, we find that the probability of a Z-score of 0.945 or higher is approximately 0.1723. Therefore, the probability that a randomly selected applicant scored at least 1300 is approximately 0.1723 or 17.23%.
Z = (900 - 1016.566) / 299.889 ≈ -0.389

Because the scores are modeled as a continuous normal distribution, the probability of scoring exactly 900 is 0: a single point carries no area under the density curve. The value 0.3486 that an online calculator returns for Z = -0.389 is the cumulative probability P(X ≤ 900), i.e., the probability of scoring at most 900, not exactly 900. If "exactly 900" is instead interpreted as a score that rounds to 900 (between 899.5 and 900.5), the probability is the thin slice of area over that interval, approximately 0.0012.
that applicants with SAT scores above 1400 be admitted into the university. I would
also consider the other factors listed above before making a decision.
Question 4:
Now imagine that it is year 2034 and you are working as a public health researcher in the
UAE. You are working on a project to assess vaccination coverage for a new global
pandemic. The UAE government has implemented a widespread vaccination campaign to
combat the spread of the virus and achieve herd immunity. You want to determine the
proportion of individuals who have received the new vaccine among a sample of 100
residents in different parts of the country.
To Begin.
Run the code below. It will provide you with a random sample of 100 residents. It will save
this data to a CSV file called "Vaccinated.csv". Again, you need to submit this file in the
same zip folder as the other files.
# Load the following libraries so that they can be applied in the subsequent code blocks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

# Run this code. It will generate data and save it to a CSV file called
# "Vaccinated.csv". You need to submit it in the same zip folder as your other files.
try:
    Vaccinated = pd.read_csv('Vaccinated.csv')
except FileNotFoundError:
    num_samples = 100
    vaccinated = np.random.choice(["Yes", "No"], size=num_samples)
    Vaccinated = pd.DataFrame({'Vaccinated': vaccinated})
    Vaccinated.to_csv('Vaccinated.csv')
This analysis provides a concrete look at the vaccination campaign's current standing and
suggests areas for potential improvement to increase vaccination rates.
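As a closing sketch of the analysis Question 4 calls for: estimate the sample proportion of vaccinated residents and attach an approximate 95% confidence interval (the 1.96 multiplier comes from the normal approximation). The simulated sample below is a hypothetical stand-in for reading Vaccinated.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the actual workflow, replace with
# pd.read_csv('Vaccinated.csv')['Vaccinated']
rng = np.random.default_rng(0)
sample = pd.Series(rng.choice(["Yes", "No"], size=100))

n = len(sample)
p_hat = (sample == "Yes").mean()  # sample proportion vaccinated

# Approximate 95% CI for a proportion: p_hat +/- 1.96 * sqrt(p_hat(1-p_hat)/n)
se = np.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_hat = {p_hat:.2f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

A wide interval here would suggest that 100 residents is a small sample for pinning down national coverage, which is itself a useful finding for the campaign.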