Project 1
Modalities
After performing the different tasks, you are expected to hand in a report of
max. 3 pages, including plots. Additional pages will not be read. When
writing your report, keep in mind that we only expect answers to the questions
asked in the following sections (in other words, it is not necessary to give
additional information about, e.g., how a model works, what the dataset is,
the context, etc.). This report can be written in English or in French, and
should be exported in .pdf format.
On the implementation side, you will use Python 3 with the scikit-learn
framework (http://scikit-learn.org), which provides many tools for machine
learning and has detailed documentation. Please follow the instructions at
https://scikit-learn.org/stable/install.html to install scikit-learn.
Your code should be submitted with your report on Webcampus by
providing a notebook or a script (.ipynb or .py file(s)).
1 The dataset
The aim of this project is to build a Decision Tree Classifier1 in order to
discriminate between 3 different varieties of wheat seeds based on several
geometrical features. The 3 varieties are named Kama, Rosa and Canadian,
corresponding to the labels 1, 2 and 3 in the data files, respectively. For each
instance, 7 numerical features are available: the (i) area, (ii) perimeter,
(iii) compactness, (iv) length of kernel, (v) width of kernel, (vi) asymmetry
coefficient and (vii) length of kernel groove of each seed. These features were
obtained through a soft X-ray technique.
Attached to this project description, you will find two data files named
“train.txt” and “test.txt”. The first one constitutes your available training
data: the first 7 columns contain the features and the last one the corresponding
label, for 147 different instances. It is recommended to load the data using
the loadtxt() function from numpy2. The following Python code presents an
example of how to load the data in a scikit-learn-ready fashion:
import numpy as np
data = np.loadtxt("./train.txt")
X, y = data[:, :-1], data[:, -1]
The second file, “test.txt”, only contains additional instances without their
corresponding labels (the file only contains 7 columns). In other words, these
data cannot be used in your training pipeline; they will only be used at the end
of the project to validate your model(s) by submitting predictions to an online
leaderboard (more details in the last section).
For this task, you just have to split your data into a training and a testing
set3. You will use these same splits all along this project. You do not have to
comment on this task in your final report.
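Assuming the loading code above, a minimal sketch of this split with
scikit-learn's train_test_split3 is given below; the test_size value and the
stratification are illustrative choices, not requirements:

from sklearn.model_selection import train_test_split

# Illustrative split; test_size is a free choice.
# stratify=y keeps the 3 classes in the same proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)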
1 https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
2 https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html
3 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
2 Task 1 - Train your first Decision Tree
Your first task will be to train and evaluate the performance of a Decision Tree
Classifier with the following meta-parameters (see the documentation for more
details):
• criterion = "gini"
• max_depth = 1
• min_samples_split = 2
• min_samples_leaf = 1
• max_leaf_nodes = None
• random_state = 42
A companion library (utils.py) is also attached to this project. It contains
several functions to make your life easier. For example, in order to draw the
decision trees, you will use the plot_tree() function from scikit-learn.
However, this function requires the feature names, which can easily be
retrieved with the get_features() function provided in the companion library.
Train the proposed decision tree classifier. Report the decision tree you
obtained, as well as the training and testing accuracy. Does your model show
evidence of under- or overfitting? Justify. Discuss the pros and cons of using
this decision tree, if any.
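As a starting point, here is a minimal sketch of this task, assuming the
X_train/X_test split from the previous section; get_features() is the
companion-library helper mentioned above, whose exact return value (the 7
feature names) is assumed here:

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
from utils import get_features  # companion library attached to the project

# Decision tree with the meta-parameters prescribed above
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=1,
    min_samples_split=2,
    min_samples_leaf=1,
    max_leaf_nodes=None,
    random_state=42,
)
clf.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))

# Draw the fitted tree with readable feature and class names
plot_tree(clf, feature_names=get_features(),
          class_names=["Kama", "Rosa", "Canadian"], filled=True)
plt.show()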
3 Task 3 - Find the best Decision Tree
In this task, you will now try to find the best decision tree. To do so, you
will first only vary the max_depth meta-parameter, trying values between 1 and
10, and report a graph of the training/testing accuracy w.r.t. the corresponding
choice of max_depth. After that, you are free to play with the other meta-
parameters if you want. For each experiment, keep random_state = 42.
Train several decision trees with max_depth going from 1 to 10. Report a
graph showing the training/testing accuracy w.r.t. max_depth. Discuss this
graph by making links with the theory. Spot some choices of max_depth that
lead to under- and overfitting. What would be a good choice for max_depth?
Justify. Briefly comment on the experiments you made if you played with other
meta-parameters.
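A minimal sketch of the max_depth sweep, again assuming the earlier split (the
plotting details are illustrative):

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

depths = range(1, 11)
train_acc, test_acc = [], []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
    test_acc.append(accuracy_score(y_test, clf.predict(X_test)))

plt.plot(depths, train_acc, marker="o", label="train")
plt.plot(depths, test_acc, marker="o", label="test")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()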
For this task, it is mandatory to perform at least one submission to the Kaggle
competition (if you use a pseudonym, please provide it in your report). Note
that your final rank will not be used to evaluate the quality of the project.
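To generate predictions for the leaderboard, a minimal sketch is given below;
note that the exact submission file format is defined on the competition page,
so the one-label-per-row layout used here is only an assumption:

import numpy as np

# test.txt contains the 7 feature columns but no label column
X_unlabeled = np.loadtxt("./test.txt")

# clf is the final model trained in the previous task
predictions = clf.predict(X_unlabeled).astype(int)

# Hypothetical layout: one predicted label (1, 2 or 3) per row;
# check the competition page for the required columns/headers.
np.savetxt("submission.csv", predictions, fmt="%d")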