FIT1043 A2 Specification - S2 2024 - Gks6arg
Aim
The main aim of this assignment is to conduct predictive analytics, by building predictive models on a
dataset using Python in the Jupyter Notebook environment.
** Not taught in this unit, you are to explore and elaborate these in your report submission. This will serve
as a gentle introduction to lifelong learning, encouraging you to learn independently.
Data
We will explore the following datasets in Task A (plus a dataset of your choice in Task B):
1. Student_List_A2.csv
2. Student_List_A2_Submission.csv
Hand-in Requirements
Please hand in a PDF file containing your code, answers and explanations to questions, a Jupyter
notebook file (.ipynb) containing your Python code to all the questions, a CSV file for your
predictions in Task A5 and a video file:
● The PDF file should contain answers and explanations to the questions.
o You can use Microsoft Word or other word processing software to format your
submission. Alternatively, generate your PDF from your Jupyter notebook formatted
using markdown. Either way, save the final copy as a PDF before submitting.
o Make sure to include screenshots/images of the graphs you generate. Also, do NOT
include screenshots of your code if using Microsoft Word or other word processing
software.
1 DOI: 10.34740/kaggle/ds/5195702
● The .csv file should contain:
o your predictions in Task A5.
● The video file should contain:
o A recording of up to 3 minutes of yourself explaining your answers to Task B1. You
can use Zoom to prepare your recording. Please see Task B for more details.
You will need to submit four separate files (i.e., a .pdf file, a .ipynb file, a .csv file and your video file).
Zip, rar or any other compressed format is not acceptable and will incur a 10% penalty.
Assignment Tasks:
Note: You need to use Python to complete all tasks.
3. Can you identify any missing values in the columns of this dataset? If so, replace
them with the median value of the relevant column.
4. Identify a data quality problem related to the ‘Absences’ column and delete the
rows that exhibit this problem. Refer to Week 4 for information on data quality
problems.
5. Examine the 'GPA' and 'GradeClass' columns together for additional data quality
issues. Propose an appropriate solution for these issues and resolve them.
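The median replacement and row deletion described in steps 3 and 4 can be sketched as follows. This is a minimal illustration only: the small DataFrame stands in for Student_List_A2.csv, the column StudyTimeWeekly is a hypothetical example column, and the rule used to drop rows (negative Absences) is an assumed example of a data quality problem, not necessarily the one present in the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Student_List_A2.csv; all values are made up for illustration.
# StudyTimeWeekly is a hypothetical column name.
df = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "StudyTimeWeekly": [10.0, np.nan, 4.0, 8.0, 6.0],
    "Absences": [2, 5, -1, 0, 3],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Replace missing values with the column median, only in columns that have gaps.
for col in df.columns[df.isna().any()]:
    df[col] = df[col].fillna(df[col].median())

# Assumed validity rule for illustration: keep rows where Absences is non-negative.
valid = df["Absences"] >= 0
df_clean = df[valid].reset_index(drop=True)
```

Inspecting `missing_counts` first shows which columns need treatment, and building a boolean mask such as `valid` keeps the filtering rule explicit so it can be explained in the report.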
A2. Supervised Learning (1.5 marks)
1. Explain supervised machine learning, the notion of labelled data, and train and
test datasets.
2. Use the wrangled data from A1 and separate the features and the label. Note
that:
o the label, in this case, is the ‘GradeClass’
o studentID is not logically a useful predictor of a student's grade so should
not be used as a feature
o GPA is translated to GradeClass. They both represent the same thing so
GPA should not be used as a feature.
o Use the rest of the features as predictors.
3. Use the sklearn.model_selection.train_test_split function to split your data for
training and testing (Keep 80% of the data for training).
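The feature/label separation and the 80/20 split above might look like the sketch below. The DataFrame is a made-up stand-in for the wrangled data from A1; StudyTimeWeekly is a hypothetical feature name, and `random_state=42` is an arbitrary choice added here only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the wrangled data; StudyTimeWeekly is a hypothetical feature.
df = pd.DataFrame({
    "StudentID": range(1, 11),
    "StudyTimeWeekly": [10, 7, 4, 8, 6, 12, 3, 9, 5, 11],
    "Absences": [2, 5, 9, 0, 3, 1, 12, 4, 7, 2],
    "GPA": [3.1, 2.4, 1.2, 3.6, 2.8, 3.9, 0.9, 3.0, 2.1, 3.7],
    "GradeClass": [1, 2, 4, 0, 2, 0, 4, 1, 3, 0],
})

# The label is GradeClass; StudentID and GPA are excluded from the features.
X = df.drop(columns=["StudentID", "GPA", "GradeClass"])
y = df["GradeClass"]

# Keep 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
```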
We demonstrated a k-means clustering algorithm in Week 7. Your task in this part
is to find an interesting dataset and apply k-means clustering on it using Python. For
instance, Kaggle is a private company which runs data science competitions and
provides a list of their publicly available datasets: https://www.kaggle.com/datasets
1. Select a suitable dataset that contains some missing data and at least two
numerical features. Please note that you cannot use the same dataset used in the
applied sessions/lectures in this unit. Please include a link to your dataset in
your report. You may wish to:
● provide the direct link to the public dataset from the internet, or
● place the data file in your Monash student Google Drive and provide its
link in the submission.
2. Perform wrangling on the dataset to handle/treat the missing data and explain
your procedure.
3. Perform k-means clustering in Python on two numerical features of your
dataset, creating k clusters (k >= 2).
4. Visualise the data as well as the results of the k-means clustering, and describe
your findings about the identified clusters.
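As a rough sketch of the clustering step, the example below runs scikit-learn's `KMeans` on two numerical columns. The data are synthetic (two well-separated groups generated with NumPy), the column names `feature_1` and `feature_2` are placeholders, and k = 2, `n_init=10` and `random_state=0` are arbitrary choices made for illustration; a real dataset would first go through the wrangling described in step 2.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: two well-separated groups of 20 points each (illustration only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(20, 2))
data = pd.DataFrame(np.vstack([blob_a, blob_b]),
                    columns=["feature_1", "feature_2"])

# Fit k-means with k = 2 on the two numerical features and label each row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(data[["feature_1", "feature_2"]])

# Cluster centres, one row per cluster and one column per feature.
centres = kmeans.cluster_centers_
```

A scatter plot of `feature_1` against `feature_2`, coloured by the `cluster` column and with `centres` overlaid, is one way to visualise the result for step 4.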
B2. Video Preparation (2 marks)
Presentation is one of the important steps in a data science process. In this task you
will need to prepare a video of yourself, up to 3 minutes long (you can share your code on
screen), in which you describe your approach to the above task (Task B1).
● Please make sure to keep your camera on (show yourself) during recording. You
may want to share your screen with your code while you talk.
Good Luck! ☺