Machine Learning in Python
University of Sunderland
The second important thing is the size of the dataset. There is no straightforward answer for how large
it should be. The answer may depend on many factors, for example:
The best small project to start with on a new tool is the classification of iris flowers (the Iris
dataset). This is a good project because it is well understood and is used in many examples:
• Attributes are numeric so you have to figure out how to load and handle data.
• It is a classification problem, allowing you to practice with perhaps an easier type of
supervised learning algorithm.
• It is a multi-class classification problem (multinomial) that may require some specialized
handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a
screen or A4 page).
• All of the numeric attributes are in the same units and the same scale, not requiring any
special scaling or transforms to get started.
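The properties listed above can be checked directly once the data is loaded. The following is a minimal sketch, not the handout's own loading code: it assumes scikit-learn is installed and uses its bundled copy of the Iris data. The variable name dataset and the column names are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: load scikit-learn's bundled copy of the Iris data.
# The name `dataset` and the column names below are illustrative.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# The four measurement columns are numeric; "class" holds the species label.
print(dataset.dtypes)

# Three distinct species make this a multi-class problem.
print(dataset["class"].unique())
```

If you load the data from a CSV file instead, the same two checks apply as long as the columns are given names.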
Now it is time to take a look at the data. We are going to look at it in a few different
ways:
We can get a quick look at the dimensions of the data with the shape property:
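A minimal sketch of this check, assuming the data has been loaded into a pandas DataFrame called dataset (here taken from scikit-learn's bundled copy, which is an assumption rather than the handout's own loading step):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# shape is a property, not a function, so there are no brackets after it.
# It returns (number of rows, number of columns).
print(dataset.shape)  # (150, 5): 150 rows, 4 attributes plus the class column
```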
We can look at the data itself with the following code:
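One way to do this is with the pandas head function. This is a sketch under the assumption that the data sits in a DataFrame called dataset, loaded here from scikit-learn's bundled copy:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# head(n) returns the first n rows of the DataFrame.
print(dataset.head(20))
```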
Note: the 20 in brackets shows the first 20 rows of the dataset; you can edit this to see more or
fewer rows.
We can see a statistical summary of our dataset by using the describe function. This will give us a
summary of each attribute including the count, mean, the min and max values as well as some
percentiles.
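A minimal sketch of the summary, again assuming a DataFrame called dataset built from scikit-learn's bundled Iris data:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# describe() summarises the numeric columns only, so "class" is left out.
# Rows of the result: count, mean, std, min, 25%, 50%, 75%, max.
summary = dataset.describe()
print(summary)
```

Each column of the result corresponds to one attribute, so you can compare the ranges of the four measurements at a glance.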
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as
an absolute count, known as the class distribution.
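The class distribution can be obtained by grouping the rows on the class column and counting each group. A sketch, assuming the column holding the species label is named class (an illustrative name) and the data comes from scikit-learn's bundled copy:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# Group the rows by species and count how many rows fall in each group.
class_counts = dataset.groupby("class").size()
print(class_counts)  # each of the three species has 50 rows
```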
You should now have a basic idea about the data. We need to extend that with some visualizations.
Faculty of Computer Science
University of Sunderland
Data Visualisation
Viewing your data in Python relies on the matplotlib library. Before we look at the Iris dataset, here are
some code examples that will help you understand how data is presented.
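As a first example, here is a minimal sketch of a line plot. The x and y values are made up purely so there is something to draw; they are not taken from any dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical sample values, chosen only to have something to draw.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# Create one figure containing one set of axes, then draw on the axes.
fig, ax = plt.subplots()
ax.plot(x, y, marker="o")  # line plot with a marker at each point
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A simple line plot")

plt.show()
```

The same pattern (create the axes, draw, label, show) applies to the other plot types used in this session.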
Now let’s go back to our Iris dataset. We are going to look at different types of plots.
We start with some univariate plots, that is, plots of each individual variable. Given that the input
variables are numeric, we can create box and whisker plots of each. Enter and run the following code
after you have imported the dataset:
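A sketch of the box and whisker plots using the pandas plot function, which draws with matplotlib. It assumes a DataFrame called dataset built from scikit-learn's bundled Iris data; the 2 by 2 layout is simply a convenient arrangement for four attributes.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# Only the numeric attributes can be drawn as box plots.
features = dataset.drop(columns="class")

# One box plot per attribute, arranged in a 2 x 2 grid.
# sharex/sharey are off so each plot gets its own axis range.
features.plot(kind="box", subplots=True, layout=(2, 2), sharex=False, sharey=False)
fig = plt.gcf()  # the figure that pandas has just drawn on

plt.show()
```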
This should give you a much clearer idea of the distribution of the input attributes.
We can also create a histogram of each input variable to get an idea of the distribution:
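A sketch using the pandas hist function, under the same assumption that the data is held in a DataFrame called dataset built from scikit-learn's bundled copy:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# hist() draws one histogram per numeric column and returns the axes it used.
axes = dataset.drop(columns="class").hist()

plt.show()
```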
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as
we can use algorithms that can exploit this assumption.
Multivariate plots allow us to see the interactions between the variables in our dataset. If we look at
scatterplots of pairs of attributes, it can help us spot structured relationships between those variables.
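One way to draw every pair of attributes at once is the scatter_matrix function from pandas.plotting. A sketch, assuming a DataFrame called dataset built from scikit-learn's bundled Iris data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled Iris data.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = iris.target_names[iris.target]

# Draws a grid with one scatter plot for each pair of attributes.
# The diagonal shows a histogram of each attribute on its own.
axes = scatter_matrix(dataset.drop(columns="class"))

plt.show()
```

Points that fall roughly along a diagonal line in one of the panels suggest that the two attributes in that panel are correlated.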
Given that we are going to be using matplotlib to display 3D data and graphs, we need to switch to a
different coding environment. Close everything down and relaunch Anaconda, but this time select
Spyder instead of Jupyter.