
Machine Learning Project 1

Kama, Rosa or Canadian wheat seed?

Goal of the Project


The aim of this project is to become more familiar with the notions of underfitting and overfitting, as well as with the pipeline used to validate a model. To do so, you will solve a simple classification problem using Decision Trees. The next sections of this document present the data, as well as the different tasks you are expected to perform.

Modalities
After performing the different tasks, you are expected to hand in a report of max. 3 pages, including plots. Additional pages will not be read. When writing your report, keep in mind that we only expect answers to the questions asked in the next sections (in other words, it is not necessary to give additional information about, e.g., how a model works, what the dataset is, the context, etc.). This report can be written in English or in French, and should be exported in .pdf format.
On the implementation side, you will use Python 3 with the scikit-learn framework (http://scikit-learn.org), which provides many tools for machine learning and has detailed documentation. Please check the instructions at https://scikit-learn.org/stable/install.html to install scikit-learn. Your code should be submitted with your report on Webcampus by providing a notebook or a script (.ipynb or .py file(s)).

!!! Please, carefully respect the aforementioned instructions. Due to the high number of reports to evaluate, failure to comply with at least one of the previous instructions will result in a grade of 0/20 !!!

The dataset
The aim of this project is to build a Decision Tree Classifier [1] in order to discriminate 3 different types of wheat seeds thanks to several geometrical features. The 3 types of wheat seeds are named Kama, Rosa and Canadian, corresponding to the labels 1, 2 and 3 in the data files, respectively. For each instance, 7 numerical features are available: the (i) area, (ii) perimeter, (iii) compactness, (iv) length of kernel, (v) width of kernel, (vi) asymmetry coefficient and (vii) length of kernel groove of each seed. These features were obtained through a soft X-ray technique.
Attached to this project description, you will find two data files named “train.txt” and “test.txt”. The first one constitutes your available training data: the first columns contain the 7 features and the last one the corresponding label, for 147 different instances. It is recommended to load the data using the loadtxt() function from numpy [2]. The following Python code shows how to load the data in a scikit-learn-ready fashion:
import numpy as np

data = np.loadtxt("./train.txt")
X, y = data[:, :-1], data[:, -1]

The second file, “test.txt”, only contains additional instances without their corresponding labels (the file only contains 7 columns). In other words, those data cannot be used in your training pipeline and will only be used at the end of the project to validate your model(s) by submitting predictions to an online leaderboard (more details in the last section).

Task 0 - Prepare the data


First of all, you will need to split the data into a training and a testing set in order to validate your model (at this stage, you should not use the “test.txt” file). To do so, we recommend using the train_test_split() function from scikit-learn [3] with a test_size of 0.33 and a random_state of 42.

For this task, you just have to split your data into a training and a testing set. You will use these same splits throughout this project. You do not have to comment on this task in your final report.
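As an illustration, here is a minimal sketch of this split with the recommended settings (the variable names X_train, X_test, y_train and y_test are ours):

from sklearn.model_selection import train_test_split

# Split the labelled data into a training and a testing set,
# using the recommended test_size and random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)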

[1] https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
[2] https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html
[3] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Task 1 - Train your first Decision Tree
Your first task will be to train and evaluate the performance of a Decision Tree
Classifier with the following meta-parameters (see the documentation for more
details):
• criterion = "gini"
• max_depth = 1
• min_samples_split = 2
• min_samples_leaf = 1
• max_leaf_nodes = None
• random_state = 42
A companion library (utils.py) is also attached to this project. It contains several functions to make your life easier. For example, in order to draw the decision trees, you will have to use the plot_tree() function from scikit-learn. However, this function requires the feature names. Those feature names can easily be retrieved with the function get_features() provided in the companion library.
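As an illustration, a minimal sketch of this task is given below. It assumes the splits from Task 0, and that get_features() takes no argument and returns the list of the 7 feature names (check the companion library for its exact interface):

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from utils import get_features  # companion library

# Train the tree with the prescribed meta-parameters.
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=1,
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None,
    random_state=42)
clf.fit(X_train, y_train)

# Training and testing accuracy.
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))

# Draw the tree, using the feature names from the companion library.
plot_tree(clf, feature_names=get_features(), filled=True)
plt.show()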

Train the proposed decision tree classifier. Report the decision tree you obtained, as well as the training and testing accuracy. Does your model present evidence of under- or overfitting? Justify. Discuss the pros and cons of using this decision tree, if any.

Task 2 - Train another Decision Tree


Your second task will be to train and evaluate the performance of a Decision
Tree Classifier with the following meta-parameters:
• criterion = "gini"
• max_depth = 6
• min_samples_split = 2
• min_samples_leaf = 1
• max_leaf_nodes = None
• random_state = 42
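The pipeline is identical to that of Task 1; as a sketch, only max_depth changes:

from sklearn.tree import DecisionTreeClassifier

# Same pipeline as in Task 1; only max_depth differs.
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=6,
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None,
    random_state=42)
clf.fit(X_train, y_train)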
Train the proposed decision tree classifier. Report the decision tree you obtained, as well as the training and testing accuracy. Does your model present evidence of under- or overfitting? Justify. Discuss the pros and cons of using this decision tree, if any.

Task 3 - Find the best Decision Tree
In this task, you will try to find the best decision tree. To do so, you will first only try different values between 1 and 10 for the max_depth meta-parameter, and report a graph of the training/testing accuracy w.r.t. the corresponding choice for max_depth. After that, you are free to play with the other meta-parameters if you want. For each experiment, keep random_state = 42.
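As an illustration, a minimal sketch of this sweep is given below (it assumes the splits from Task 0; the meta-parameters not listed keep the scikit-learn defaults, which match the values used in Tasks 1 and 2):

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

depths = range(1, 11)
train_acc, test_acc = [], []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    train_acc.append(clf.score(X_train, y_train))
    test_acc.append(clf.score(X_test, y_test))

# Training/testing accuracy w.r.t. max_depth.
plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, test_acc, label="testing accuracy")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()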

Train several decision trees with max_depth going from 1 to 10. Report a graph showing the training/testing accuracy w.r.t. max_depth. Discuss this graph by making links with the theory. Spot some choices of max_depth that lead to under- and overfitting. What would be a good choice for max_depth? Justify. Briefly comment on the experiments you made if you played with other meta-parameters.

Task 4 - Join the leaderboard!


The final task will be to submit your model to a private Kaggle competition. To do so, we will first ask you to create a Kaggle account (see https://www.kaggle.com/). If you do not use your real name when registering on Kaggle, please provide in your report the pseudo you used to make your submissions.
In order to make a submission to the competition, you will need to generate a submission file in .csv format. This file will contain your predictions for the instances in the “test.txt” data file. As the true labels are known by Kaggle (but not by you), this submission file will allow Kaggle to evaluate the performance of your model, and to rank it in a leaderboard. The required submission file can easily be generated using the generate_submission() function from the companion library. You are not allowed to use models other than decision trees. Furthermore, you are not allowed to use external data to train your model(s).
The Kaggle competition for this project can be accessed via the following link: https://www.kaggle.com/t/36849c1d78ce45b1820ee5b78c858d0e. Have fun!
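As an illustration, a sketch of the submission step is given below. It assumes clf is your final model and that generate_submission() takes the vector of predictions as argument; the actual signature of this function is not specified here, so check the companion library (utils.py):

import numpy as np
from utils import generate_submission  # companion library

# Load the unlabelled instances ("test.txt" only contains the 7 feature columns).
X_unseen = np.loadtxt("./test.txt")

# Predict with the final model and write the .csv submission file.
# NOTE: the argument list of generate_submission() is assumed here;
# check utils.py for its actual interface.
predictions = clf.predict(X_unseen)
generate_submission(predictions)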

For this task, it is mandatory to perform at least one submission to the Kaggle
competition (if you use a pseudo, please provide it in your report). Note that
your final rank will not be used to evaluate the quality of the project.
