
Attachment 1


CIS 2334 Semester Project

Part 3, Due November 27th, 2023, at 3:30 pm


Background Information
The marine biologists' research team is satisfied with the Excel application that you have developed; it
helped them greatly in understanding the abalone across the country. In addition to the analysis you have
done in part 2, the scientists are interested in finding any additional underlying patterns within the
abalone data set. In other words, the research team wants to build mathematical model(s) that could
reveal the fundamental relationships among the variables in the abalone data set.

To build a solid model, you need to go through the following steps and finalize your models in the end.
Note: You need to install the Analysis ToolPak add-in to be able to do this project. It will appear as Data
Analysis on the Data tab of the ribbon. Also, add proper titles for each worksheet.

Project Tasks
Task 1. Prepare the dataset (10 points)
Firstly, you need to prepare the data for building the models. In classic data modeling tasks, you only use
a portion of the data to train your model – this portion of the data is called the training set; the rest of
the data is used to evaluate the performances of your models – this is called the test set.

What you need to do:

a. Create a new Excel file called “Firstname_Lastname_DataModeling.xlsx”.


b. Name your current worksheet “Original Data”.
c. Copy the data in your “Personal Data” worksheet from your semester Project Part 2 and paste
the data set in the “Original Data” worksheet.
d. Create a new worksheet called “Training set”, copy the first 2/3 of the data from the
“Original Data” worksheet, and paste it into the “Training set” worksheet.
e. Create a new worksheet called “Test set”, copy the remaining data (1/3 of the data) from
the “Original Data” worksheet, and paste it into the “Test set” worksheet.
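
For readers who want to check the split outside Excel, here is a minimal Python sketch of steps d and e. The rows are placeholders, not values from the abalone data set, and rounding the 2/3 cut down to a whole row is an assumption; the assignment does not say how to round.

```python
# Placeholder records; these are NOT values from the abalone data set.
rows = [{"id": i} for i in range(1, 10)]  # 9 illustrative rows

def split_two_thirds(records):
    """Return (training, test): the first 2/3 of the rows and the remainder."""
    cut = (2 * len(records)) // 3  # assumption: round the cut down to a whole row
    return records[:cut], records[cut:]

training_set, test_set = split_two_thirds(rows)
print(len(training_set), len(test_set))
```

The split keeps the original row order, as the steps above do; no shuffling is applied.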

Task 2. Find relationships among variables in stacked data (15 points)

Before modeling the data, you need to have a better understanding of the relationships among the
variables. The research team has specified a set of numerical variables that they care the most about.
These numerical variables are listed in the table below. In particular, the scientists are most interested
in the rings of the abalone, since the rings indicate the abalone's age.

Hint: Use Descriptive Statistics in Data Analysis and check Labels in First Row to describe/analyze the
characteristics of each variable. Remember to include the column titles in the Input Range.

Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings

What you need to do:

a. Create a new worksheet called “Stacked data analysis”.


b. Using the “Training set”, explore and create Histograms for different variables listed in the
previous table and then pick the 3 “most interesting” histograms and describe/analyze the
characteristics of each of them.
c. Using the “Training set”, create a Box Plot for Shucked_weight, Viscera_weight and Shell_weight
and describe/analyze the characteristics of each variable.
d. Using the “Training set”, explore and create Scatter Plots for different pairs of variables listed in
the previous table, and then pick the “5 most interesting” scatter plots, describe and analyze the
characteristics of each of them. Hint: you can pick one variable as an instance and pair it with
the rest of variables.
e. Using the “Training set”, calculate the correlations between each pair of variables listed above.
Identify the 5 strongest correlations. Apply conditional formatting to highlight these 5 strongest
correlations.
f. Use the Scatter Plots to illustrate and verify the 5 strongest correlations. Comment on your
findings.
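
As a cross-check on step e, the pairwise correlations can also be sketched in Python. This is a minimal illustration, not the Analysis ToolPak procedure: it computes the Pearson correlation by hand and ranks pairs by absolute value, which is an assumption about what "strongest" means. The column values are toy numbers, not abalone data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strongest_pairs(columns, k=5):
    """Return the k variable pairs with the largest |r|, strongest first."""
    pairs = [(a, b, pearson(columns[a], columns[b]))
             for a, b in combinations(columns, 2)]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:k]

# Toy columns for illustration only (not taken from the abalone data set).
toy = {
    "Length": [1.0, 2.0, 3.0, 4.0],
    "Diameter": [2.0, 4.0, 6.0, 8.0],
    "Rings": [4.0, 3.0, 2.0, 2.0],
}
for a, b, r in strongest_pairs(toy, k=3):
    print(f"{a} vs {b}: r = {r:.3f}")
```

Ranking by absolute value means a strong negative correlation counts as strong; drop the `abs` call if only positive correlations are of interest.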

Task 3. Build regression models for stacked data (15 points)


Since you have revealed the 5 strongest correlations, you want to build regression models that describe
these relationships mathematically.

Hint: Use regression in Data Analysis to build regression models.

What you need to do:

a. Create a new worksheet called “Regressions for stacked data”.


b. Use the “Training set” and build a regression model between the 2 variables that are the most
correlated. Use your common sense in order to determine which variable is the dependent
variable.
c. Analyze your regression equation. Explain the coefficients of your model.
d. Use the “Test set” and compute the test mean squared error for the regression model you have
built.
e. Use the “Training set” and build a regression model for the variables that have the fifth
strongest correlation.
f. Analyze your regression equation. Explain the coefficients of your models.
g. Use the “Test set” and compute the test mean squared error for the regression model you have
built.
h. Compare the mean squared error between the two regression models. Describe and comment
on your findings.
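
The fit-then-evaluate pattern in steps b through g can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a simple one-predictor least-squares line, and a test mean squared error defined as the average squared prediction error over the test rows. The numbers are toy values, not abalone data, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between actual y and the predicted y."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)) / len(x)

# Toy values for illustration only (not taken from the abalone data set).
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
test_x, test_y = [5.0, 6.0], [11.0, 11.0]

intercept, slope = fit_line(train_x, train_y)  # fit on the training rows only
test_mse = mean_squared_error(test_x, test_y, intercept, slope)  # score on test rows
print(intercept, slope, test_mse)
```

The coefficients come from the training rows only; the test rows are used solely to measure prediction error, which is what makes the two models' errors comparable in step h.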

Task 4. Create unstacked data (15 points)
It is very important to look at the 3 different genders separately and see if the relationships are different
for the 3 different genders.

What you need to do:

a. Create a new worksheet called “Unstacked Training set”.


b. Unstack the Training set, separating males, females, and infants.
c. Create a new worksheet called “Unstacked Test set”.
d. Unstack the Test set, separating males, females, and infants.
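
Unstacking amounts to grouping rows by the category column. A minimal Python sketch follows; the column name `Sex` and the codes `M`, `F`, and `I` are assumptions for illustration, so substitute whatever your worksheet uses. The rows are placeholders, not abalone data.

```python
from collections import defaultdict

def unstack(records, key="Sex"):
    """Group records into one list per distinct value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)

# Placeholder rows; "Sex" and the codes M/F/I are assumed names, not confirmed ones.
rows = [
    {"Sex": "M", "Rings": 9},
    {"Sex": "F", "Rings": 10},
    {"Sex": "I", "Rings": 5},
    {"Sex": "M", "Rings": 11},
]
by_sex = unstack(rows)
for label, group in by_sex.items():
    print(label, len(group))
```

Apply the same function to the training rows and to the test rows separately so that the two sets stay apart.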

Task 5. Find relationships among variables in unstacked data (15 points)

What you need to do:

a. Create a new worksheet called “Unstacked data analysis”.


b. Use each gender’s data in the “Unstacked Training set” to create the same set of 3 “most
interesting” histograms as in the Task 2 Step b.
c. Compare the stacked data histograms against each gender’s histogram. Are there any
differences? If so, describe them.
d. Use the whole_weight variable of each gender in the “Unstacked Training set” and for all
genders in the “Training set” to create a Box Plot (four boxes – whole_weight for female,
whole_weight for male, whole_weight for infant, and whole_weight for all). Describe your
findings from the plot.
e. Use each gender’s data in the “Unstacked Training set” to compute the same correlation matrix
as in the Task 2 Step e.
f. Compare the four correlation matrices (one for each gender and one from Task 2 Step e).
Describe the value differences for the 5 most strongly correlated variable pairs identified in Task 2
Step e.
g. Use each gender’s data in the “Unstacked Training set” to create the same set of scatter plots as
in the Task 2 Step f.
h. Compare the stacked data scatter plots against each gender’s scatter plots. Are there any
differences? If so, describe them.

Task 6. Build regression models for unstacked data (15 points)

Next, you need to build regression models on the unstacked data and compare them with the models
with the stacked data.

What you need to do:

a. Create a new worksheet called “Regressions for unstacked data”.


b. Build regression models on the same variables as in Task 3 Step b but use each gender’s data in
“Unstacked Training set”.
c. Explicate your regression equations. Explain the coefficients of your models.
d. Use the “Unstacked Test set”. Compute the test mean squared errors for the regression models
you have built.
e. Build regression models on the same variables as in Task 3 Step e but use each gender’s data in
“Unstacked Training set”.
f. Explicate your regression equations. Explain the coefficients of your models.
g. Use the “Unstacked Test set”. Compute the test mean squared errors for the regression models
you have built.
h. Compare the mean squared errors between the stacked data regression models and the
unstacked data regression models. Describe and comment on your findings.
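
The comparison in step h can be sketched in Python: fit one line on all training rows together (stacked) and one line per group (unstacked), then score each on the same group's test rows. This is a sketch under stated assumptions, with toy (x, y) pairs and hypothetical group labels rather than abalone data.

```python
def fit_line(points):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return my - slope * mx, slope

def mse(points, model):
    """Average squared prediction error of `model` on (x, y) pairs."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)

# Toy data keyed by hypothetical group labels (not abalone values).
train = {"M": [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],
         "I": [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]}
test = {"M": [(4.0, 4.0)], "I": [(4.0, 9.0)]}

stacked_model = fit_line([p for group in train.values() for p in group])
results = {}
for label in train:
    group_model = fit_line(train[label])
    # (test MSE of the stacked model, test MSE of the per-group model)
    results[label] = (mse(test[label], stacked_model), mse(test[label], group_model))
    print(label, results[label])
```

In this toy data the groups follow different lines, so the per-group models score better; with real data the difference may be small or may go the other way.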

Task 7. Build one-variable regression models for Abalone “Rings” (15 points)

The rings on the abalone indicate its age. The most interesting problem that the research team found is
the prediction of the abalone’s age using the other measurement variables in the data. You believe you
can build good regression models to do the prediction.

You need to find the best predictor for the abalone’s age. This is a trial-and-error process.

Hint: Use the variable Rings as an instance and pair it with each of the other variables.

What you need to do:

a. Create a new worksheet called “Single variable regression for Rings”.


b. Explore different variables for the regression model of “Rings”. Use the stacked data.
c. Examine the mean squared error for each model. The smaller the error the better the model.
d. Choose the best regression model. Explain your regression equation. Explain the coefficients of
your model. Report the mean squared error of this best model.
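
The trial-and-error search in steps b through d can be sketched in Python: fit one single-variable model per candidate predictor and keep the one with the smallest mean squared error. This is a minimal sketch; the error here is computed on the same rows used for fitting, which is an assumption because the steps above do not say which rows to score on. The column values are toy numbers, not abalone data.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between actual y and the predicted y."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y)) / len(x)

def best_predictor(columns, target="Rings"):
    """Fit target ~ each other column; return (best name, {name: mse})."""
    scores = {}
    for name, values in columns.items():
        if name == target:
            continue  # the target is not a predictor of itself
        intercept, slope = fit_line(values, columns[target])
        scores[name] = mean_squared_error(values, columns[target], intercept, slope)
    return min(scores, key=scores.get), scores

# Toy columns for illustration only (not taken from the abalone data set).
toy = {
    "Shell_weight": [1.0, 2.0, 3.0, 4.0],
    "Height": [1.0, 3.0, 2.0, 4.0],
    "Rings": [5.0, 7.0, 9.0, 11.0],
}
best, scores = best_predictor(toy)
print(best, scores)
```

To score on held-out rows instead, fit on the training columns and pass the test columns to `mean_squared_error`.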
