
Faculty of Computer Science

University of Sunderland

Machine Learning in Python


In order to start the machine learning process, you need to possess a set of data to be used for training
the algorithm. It is very important to ensure that the source of the data is credible; otherwise you will
get incorrect results even if the algorithm itself works correctly (the garbage in, garbage out
principle).

The second important thing is the size of the dataset. There is no straightforward answer for how large
it should be. The answer may depend on many factors, for example:

• the type of problem you're looking to solve,
• the number of features in the data,
• the type of algorithm used.

The best small project to start with on a new tool is the classification of iris flowers (the Iris
dataset). This is a good project because it is so well understood and it’s used in tons of examples.

• Attributes are numeric so you have to figure out how to load and handle data.
• It is a classification problem, allowing you to practice with perhaps an easier type of
supervised learning algorithm.
• It is a multi-class classification problem (multi-nominal) that may require some specialized
handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a
screen or A4 page).
• All of the numeric attributes are in the same units and the same scale, not requiring any
special scaling or transforms to get started.

To load the libraries and import the dataset, enter and run:
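The code for this step was shown as a screenshot in the original handout, so the exact lines are unknown. A minimal sketch, assuming the dataset is loaded from scikit-learn's bundled copy of Iris (downloading the CSV from the UCI repository and reading it with pandas would work equally well):

```python
# Load the libraries used throughout this worksheet, then build the Iris
# DataFrame. Using scikit-learn's bundled copy is an assumption made here;
# the handout may instead have downloaded the CSV from the UCI repository.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]  # species name per row
```

The column names and the `dataset` variable are illustrative choices; any consistent naming works.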

Now it is time to take a look at the data. We are going to look at it in a few different ways:

1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.

We can get a quick look at the dimensions of the data with the shape property:
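The original snippet was an image; a self-contained sketch (the loading lines assume scikit-learn's bundled copy of the dataset):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# shape returns (rows, columns) as a tuple.
print(dataset.shape)  # (150, 5)
```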

We can see the whole data set with the following code:
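Again the original code was an image; a sketch using pandas' `head` (the loading lines assume scikit-learn's bundled copy):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# head(n) returns the first n rows of the DataFrame.
print(dataset.head(20))
```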

Note: the 20 in brackets shows the first 20 rows of the dataset; you can edit this to see more or
less data.

We can see a statistical summary of our dataset by using the describe function. This will give us a
summary of each attribute, including the count, mean, and the min and max values, as well as some
percentiles.
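A sketch of this step (loading lines again assume scikit-learn's bundled copy):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# describe() summarises each numeric attribute: count, mean, std,
# min/max and the 25th/50th/75th percentiles.
print(dataset.describe())
```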

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as
an absolute count; this is known as the class distribution.
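A sketch of the class-distribution step (loading lines assume scikit-learn's bundled copy):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# Group the rows by the class column and count each group.
# Iris is balanced: each of the three species has 50 rows.
print(dataset.groupby("class").size())
```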

You should now have a basic idea about the data. We need to extend that with some visualisations.

Data Visualisation

Viewing your data in Python relies on the matplotlib library. Before we look at the Iris dataset, here
are some code examples that will help you understand how data is presented.

Enter and run the code below:
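The handout's snippet was an image, so the exact code is unknown; a minimal matplotlib example in the same spirit (the data values here are purely illustrative):

```python
import matplotlib.pyplot as plt

# A basic line plot: matplotlib joins the (x, y) points with straight lines.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.xlabel("x value")
plt.ylabel("y value")
plt.show()
```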

You should get the following output:

Next we can add some markers:
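The original code is again not reproduced here; a sketch that adds point markers to the same illustrative data:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
# marker draws a symbol at each data point; linestyle and color are optional.
plt.plot(x, y, marker="o", linestyle="--", color="green")
plt.show()
```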

You should get the following output:



Now let’s go back to our Iris dataset; we are going to look at different types of plots.

1. Univariate plots to better understand each attribute.
2. Multivariate plots to better understand the relationships between attributes.

We start with some univariate plots, that is, plots of each individual variable. Given that the input
variables are numeric, we can create box and whisker plots of each. Enter and run the following code
after you have imported the dataset:
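A sketch of the box-and-whisker step (the loading lines assume scikit-learn's bundled copy of the dataset; pandas selects the numeric columns automatically):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# One box-and-whisker plot per numeric attribute, arranged in a 2x2 grid.
dataset.plot(kind="box", subplots=True, layout=(2, 2),
             sharex=False, sharey=False)
plt.show()
```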

This should give you a much clearer idea of the distribution of the input attributes.

We can also create a histogram of each input variable to get an idea of the distribution:
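A sketch of the histogram step (same loading assumption as above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# hist() draws one histogram per numeric attribute.
dataset.hist()
plt.show()
```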

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as
we can use algorithms that can exploit this assumption.

Multivariate plots allow us to see the interactions between the variables in our dataset. If we look at
scatterplots of pairs of attributes, it can help us spot structured relationships between those variables.
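One common way to draw those pairwise scatterplots is pandas' `scatter_matrix`; a sketch (the loading lines assume scikit-learn's bundled copy of the dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])
dataset["class"] = iris.target_names[iris.target]

# One scatterplot for every pair of numeric attributes, with histograms
# on the diagonal. Diagonal groupings of points suggest correlation.
scatter_matrix(dataset)
plt.show()
```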

Given that we are going to use matplotlib to display 3D data and graphs, we need to switch to a
different coding environment. Close everything down and relaunch Anaconda, but this time select
Spyder instead of Jupyter.

Type in and run the following code:
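The handout's Spyder snippet was an image; a sketch of a 3D scatterplot of three of the Iris attributes, coloured by class (this assumes matplotlib 3.2+ for the `projection="3d"` argument, and scikit-learn's bundled copy of the dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=[
    "sepal-length", "sepal-width", "petal-length", "petal-width"])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # needs matplotlib >= 3.2
# Plot three attributes at once, colouring each point by its class label.
ax.scatter(dataset["sepal-length"], dataset["sepal-width"],
           dataset["petal-length"], c=iris.target)
ax.set_xlabel("sepal-length")
ax.set_ylabel("sepal-width")
ax.set_zlabel("petal-length")
plt.show()
```

The choice of which three attributes to plot is arbitrary; swap in any of the four columns.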
