Data Analysis Book Python (Pandas)
This is a short taster course on Data Analysis using Pandas. It's designed to get you up and
running with one of the best and most popular Python libraries to analyze your datasets. You
don't need to buy any software for this course, as everything is free to download.
Part One: Getting Started with Pandas
1. Install Jupyter, Python, Pandas
2. Jupyter Notebook Tutorial
3. Pandas Dataframes
4. Importing Datasets
5. Basic Pandas Operations
6. Dealing with Nulls
Part Two: Analyzing Data with Pandas
1. Pandas Filtering
2. Using loc, iloc in Pandas
3. Pandas GroupBy
4. Pandas Apply Function
5. Pandas plots, graphs, charts
Install Jupyter Notebook on Windows
We're going to be using Jupyter Notebooks to do this Pandas Data Analysis course. You'll be downloading some CSV datasets from us, so you'll need Jupyter installed on your computer. The exception is the first lesson: you don't need to install anything for the Getting Started tutorial, which is here:
Pandas Dataframes - Getting Started
To run the code online, head on over to the Jupyter site here:
Online Jupyter Notebook
For the rest of the Data Analysis course, you'll need Python and Jupyter installed. This is pretty
easy on Windows.
Install Python
If you haven't already done so, you can get Python here:
https://www.python.org/downloads/windows/
Scroll to the bottom of the page. Under Files, download the 64-bit version (Windows installer
(64-bit)).
Install Python on Windows
Once you've downloaded Python, double-click the downloaded file to launch the installer. You
should see this:
Put a check in the box at the bottom, Add python.exe to PATH.
Also, make a note of where Python is installing (the red box and arrow in the image above).
You're going to need this location shortly. But it's usually in
your AppData\Local\Programs folder on Windows. Something like this:
C:\Users\Ken\AppData\Local\Programs\Python\Python311\
The 311 above means Python 3.11. Yours may be different. But you should see a file
called python.exe inside of this folder.
There should also be a folder called Scripts. Look inside of the Scripts folder and note there are
a few entries for pip:
In the image above, we have pip, pip3 and pip3.11. You use pip to install other Python packages.
Like Jupyter.
Next, open a Command Prompt (type cmd into the Windows search box and press Enter). We need to change to our Python folder. You do this with the cd command (short for Change Directory). Enter this:
cd PATH_TO_YOUR_PYTHON_INSTALL
Replace PATH_TO_YOUR_PYTHON_INSTALL with the location you were asked to
remember earlier in this tutorial. (Hopefully, you haven't forgotten this already! If so, search
python.exe.)
Press Enter to change the directory:
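Now you can install Jupyter with pip. The command would be something like this (if Windows can't find pip, try Scripts\pip instead):
pip install jupyter
Wait for pip to finish downloading and installing everything.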
Look inside your Scripts folder again to see lots more files:
Back to the Command Prompt. To launch Jupyter, enter this command:
jupyter notebook
Your Command Prompt should look something like this:
When you press enter, you should see a browser appear with the start page for Jupyter:
Note the address bar at the top. You are running a server on your computer now
(localhost:8888/) and this webpage is called tree.
You could make a start right now on the rest of the course. However, if you were to close the
webpage down, you'd have to restart the Command Prompt, change the directory to the Python
folder, and enter the Jupyter Notebook line again. It would make life a lot easier if you had a
shortcut on your desktop that you could double-click to launch a Jupyter Notebook.
Jupyter Notebook Tutorial
A Jupyter Notebook is a great way to run Python and Pandas code as it means you don't have to
install an IDE. In the previous lesson, we installed Jupyter on Windows. But the only reason we
did so was because we'll need to load up some local datasets for this Pandas course. You can use
the online Jupyter version here (Jupyter Notebook link):
Jupyter Notebook
We have a Python course for beginners here:
Home and Learn Python Course
You can do a lot of the course through the Jupyter website, rather than installing PyCharm.
Let's see how Jupyter Notebooks work, though.
On the left, you'll see a list of folders, files, and any Notebooks you create. Once you create a
Notebook, you can reopen it by simply clicking its link on the left.
Top right of the homepage, click the button that says New. You'll see a dropdown list:
Click on the Folder item. It might look like nothing has happened, but you should see a folder
called Untitled Folder has been added to the list on the left. Select the folder by checking the
box next to it. You'll see some icons appear top left:
Click on the New button again. From the menu, select the Python item:
Creating a new Python file will launch it in a new browser tab. It should look like this:
Notice that it says Untitled at the top. Left-click on the word Untitled and you'll see this box
appear:
Type a new name for your Python file and click the Rename button. You should see your new
title appear at the top in place of Untitled.
Click back on the Tutorials tab for a moment. You'll see your new Notebook in the folder:
If you accidentally close down your Getting Started tab, click on it here to reopen it.
Click back on your Getting Started tab. Let's add some code and run it.
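In the first cell, type a short line of Python - something like this will do:
print('Hello Pandas')
Now click the Run button at the top.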
Our Hello Pandas message is output below the cell. Notice that a new cell gets added whenever
you Run your code.
You can add Python functions to cells and run them from another cell. To clarify this, add the
following Python function to the cell below the first one:
def greetMe(myName):
    return 'Welcome to the lesson, ' + myName
With your cursor inside of the greetMe function, click the Run button again. The only thing that
will happen is that a new cell appears below the function. Add this in the empty cell:
greetMe('Kenny')
Run the third cell and you should see something like this output:
You can cut cells, if you don't want them. Click inside the new empty cell. Type this:
'Cut this'
From the menu at the top, select Edit > Cut Cells. (Or just click the scissors icon.)
The cell should disappear.
To get a new cell, click inside cell 3, the one with the greetMe('Kenny') in it. From
the Insert menu at the top of Jupyter, click Insert > Cell Below:
You'll get a new cell:
OK, that's about it for this lesson. There's not much else to a Jupyter Notebook - they are pretty easy and intuitive to use. Before we move on to the first Data Analysis lesson, let's sort out the desktop shortcut we mentioned in the install lesson.
Windows has a wizard for creating shortcuts: right-click your desktop, select New > Shortcut, and fill in the location of the item. Click Next to see this screen:
Give your shortcut a name and then click Finish. You should see a Command Prompt shortcut appear on your desktop. To test it out, close down your browser with the Jupyter Notebook. Close down the Command Prompt. Double-click your new shortcut and everything should start up again.
Pandas Dataframes
A Dataframe in Pandas is just a table of data, like this one:
    Pet  Number
0   Cat       3
1   Dog       2
2  Fish       7
Simple, hey! This data is a list of pets that people own and how many. The table has three columns. The first column is called the index, and is just a unique value.
There are three rows in the table: row 0, row 1, and row 2. It might be that the first row represents a pet owner. They keep cats and have three of them. The pet owner in the second row, row 1, keeps 2 dogs, while the owner in the third row has 7 fish.
A single column of data, by the way, is called a Series. You can, for example, get just the Pet
column from the Dataframe and do something with this single column (Series).
Start up your Jupyter Notebooks app. Create a new Notebook and rename it to anything you like.
(If you're not sure how to do this, see here: Install Jupyter.)
To construct such a table in Pandas, you first import the library:
import pandas as pd
The pd here is just a variable name. You could call it almost anything you like:
import pandas as pan
The variable is now called pan. You can use this pan variable from now on whenever you want
to use something from the pandas library. (You'll see how it works shortly.) However, we'll stick
with pd as the variable name as it's become a quite common naming convention.
To create the simple table above, we use this syntax:
df_pets = pd.DataFrame(
{
'COL_1_NAME': ['COL_VALUE_1', 'COL_VALUE_2', 'COL_VALUE_3'],
'COL_2_NAME': ['COL_VALUE_1', 'COL_VALUE_2', 'COL_VALUE_3'],
}
)
The df_pets above is just another variable name. The Dataframe will be stored inside of this
variable. After an equal sign, we have this:
pd.DataFrame
The pd variable is the one holding a reference to the pandas library, the one we imported pandas as.
All the round, curly, and square brackets that come after DataFrame can be a pain. Miss one out
and you'll get errors. But you need a pair of round brackets after DataFrame:
pd.DataFrame()
Inside the round brackets you need a pair of curly brackets:
pd.DataFrame( { } )
Curly brackets in Python mean you want a dictionary object. A dictionary is something with a
Key/Value pair. Like this:
Name:Ken
Age:34
Job: Writer
The Keys here are Name, Age and Job. The values are Ken, 34, and Writer.
Inside the curly brackets, you can add your column names and values:
import pandas as pd
df_pets = pd.DataFrame( {
'Pet': ['Cat', 'Dog', 'Fish'],
'Number': [3, 2, 7]
})
df_pets
Notice the format for column names: They go between quotation marks, followed by a colon:
'Pet':
'Number':
You can use single or double quotes.
After the colon, you can add a Python list (the square brackets) for your values.
['Cat', 'Dog', 'Fish'],
Notice the comma after the list - you need that to separate each column and values. (Except the
final one.)
So, copy and paste the following into the first empty cell of your new Jupyter Notebook (The
indent should be one press of the TAB key on your keyboard):
import pandas as pd
df_pets = pd.DataFrame( {
'Pet': ['Cat', 'Dog', 'Fish'],
'Number': [3, 2, 7]
})
df_pets
Press the Run button and you should see this:
Incidentally, you can type the code all on one line. We've spread it over a few lines just so that
you can see the syntax better. So you could do this instead:
df_pets = pd.DataFrame( { 'Pet': ['Cat', 'Dog', 'Fish'], 'Number': [3, 2, 7] } )
(The spaces don't matter, either.)
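Remember that a single column is called a Series. To pull one out of the Dataframe, you put the column name between square brackets. For example, this would get you just the Pet column:
df_pets['Pet']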
Let's move on and import a file that we can use as a Dataframe.
Importing Datasets
In the previous lesson, you learned how to create a DataFrame in Pandas. Once you have a DataFrame, you can start to
manipulate your data. In this lesson, you'll learn how to import some data from a csv file and use
that as a DataFrame.
File Formats
You can import a wide variety of file formats using Pandas, from data in the popular CSV format
to Excel files. You start with the word read, then an underscore, then the format you want to
read. For example, to read in a CSV file, you'd do this:
import pandas as pd
file = pd.read_csv('path_to_file.csv')
For an Excel file, it would be this:
import pandas as pd
file = pd.read_excel('path_to_file.xlsx')
For a JSON file, it would be this:
import pandas as pd
file = pd.read_json('path_to_file.json')
For a full list of the formats, see here on the Pandas site:
Pandas File Formats
So, here's a CSV file for you to download. (CSV stands for comma separated values.)
Download the Pets Data CSV File (Right-click, Save As)
Once you've downloaded the file, you can double-click it to open it up, if you like. There's not much to it:
Now let's import this file and see what it looks like in Pandas.
Start a new Jupyter Notebook. Add the following in the first cell:
import pandas as pd
df_pets = pd.read_csv('PATH_TO_FILE.csv')
df_pets
Replace PATH_TO_FILE with wherever you saved your downloaded file to. In Windows, you
can get the full file path by opening an Explorer window. (A shortcut is the WINDOWS Key + E
on your keyboard.)
Navigate to the folder where the file is. Select the file and, on the Home tab, click the Copy path
button, as in the image below:
Paste between the round brackets of read_csv and you'll have something like this:
pd.read_csv('C:\Users\Ken\Documents\DataScience\csv\pets-data.csv')
You now need to change all the backslashes to forward slashes, otherwise you'll get a Unicode escape error. (Or you can double up each backslash in place of the single backslashes above.)
pd.read_csv('C:/Users/Ken/Documents/DataScience/csv/pets-data.csv')
Or this:
pd.read_csv('C:\\Users\\Ken\\Documents\\DataScience\\csv\\pets-data.csv')
Click the Run button in your Jupyter Notebook and you should see this:
What we've done here is import the pandas library. The variable name we're going to use is pd. On line 2, we've read the CSV file in and stored it all in a variable called df_pets. This variable, df_pets, now contains a Dataframe. We print the Dataframe out on the third line. Notice that, for a big file, not all the results are displayed, just the top five and the bottom five, along with a count of how many rows and columns there are.
If you want to display some records from the top of your results, you can use the head function.
Like this:
df_pets.head()
If you don't type a number between the round brackets of head, you'll get the first 5 rows of data.
Type a number between the round brackets of head and it will display that number of rows:
df_pets.head(10)
If you want to display rows from the bottom instead of rows from the top, use tail instead of
head:
df_pets.tail()
df_pets.tail(10)
Basic Pandas Operations
In the previous lesson, you imported a CSV file and loaded it into a Pandas DataFrame. In this lesson, you'll learn some basic operations you can perform on your Pandas data. Let's start with Shape and Info. If you haven't done the previous lesson, here's the CSV file for you to download. (CSV stands for comma separated values.)
Download the Pets Data CSV File (Right-click, Save As)
Once you've downloaded the file, copy and paste the following code into a new cell in a Jupyter
Notebook:
import pandas as pd
df_pets = pd.read_csv('PATH_TO_FILE.csv')
df_pets
Replace PATH_TO_FILE with wherever you saved your downloaded file to.
Pandas Shape
Whenever you first read a csv file in and create a Dataframe, you'll want to display basic
information about the Dataframe. We'll go through some of the methods you can use.
If you want to display how many rows and columns you have in the Dataframe, you can use
shape. Enter the following into an empty cell in your notebook (this assumes you are following
along from the previous lesson and have created the df_pets DataFrame.):
df_pets.shape
Shape is an attribute of Dataframes, rather than a method or function, so it doesn't need round
brackets.
Press Run at the top of your Jupyter Notebook and you should see this:
df_pets.shape
(27, 4)
The first number between the round brackets is how many rows you have in your Dataframe.
The second number is how many columns you have.
Pandas Info
To get information about the Dataframe, you can use the function info(). Enter the following into
an empty cell in your Notebook:
df_pets.info()
You should see this when you click the Run button (or press CTRL + ENTER on your keyboard
as a shortcut.):
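Pandas Agg
Another way to summarize your data is with the agg function, which runs aggregate functions like mean, min, and max on a column. For a single column, the cell would look something like this (OwnerAge is one of the columns in the pets data):
df_pets.agg( { 'OwnerAge' : ['mean'] } )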
You can add more than one column. Just add a comma after the square bracket, then the new
column and functions. Try this:
df_pets.agg( { 'OwnerAge' : ['mean', 'min', 'max'], 'PetAge': ['mean', 'min', 'max'] } )
Run the command to see the new table:
Pandas value_counts()
One very useful function you can use is called value_counts. As its name suggests, you use it to count values in your data set. As an example, try this in a new cell in your Jupyter Notebook:
df_pets['PetType'].value_counts()
You should see this as the result:
So, you need the name of your Dataframe and a column name between square brackets:
df_pets['PetType']
After a dot, type the function name and its round brackets:
df_pets['PetType'].value_counts()
You get a count of each value in the series; null values aren't counted (see the next lesson for a deeper dive into null values). The display is in descending order. So, the count of 9 comes first here. It's counting how many times the PetType Dog was recorded.
If you want the display in ascending order, you need to add ascending=True between the round brackets of value_counts. Try this:
df_pets['PetType'].value_counts(ascending=True)
The result is this:
This time, 5 is the value at the top and the PetType is Rabbit.
You can do an alphabetical sort of the values. You just need to add a new function on the end -
sort_index. Try this:
df_pets['PetType'].value_counts().sort_index(ascending=True)
The result is this:
Now the PetTypes are sorted alphabetically. Cat comes first because we specified that ascending=True (which is sort_index's default, in fact). For a descending sort, with Rabbit at the top, use ascending=False.
If you want an alphabetical sort when some of the values are equal, you need to add yet
another function on the end:
df_pets['PetType'].value_counts().sort_index().sort_values()
The new function is sort_values. If we had, say, Fish and Rabbit both on a count of 6, the tied PetTypes would be sorted alphabetically in the results.
You can also display the values as proportions. Try this:
df_pets['PetType'].value_counts(normalize=True)
By using normalize=True between the round brackets of value_counts, you get each count as a fraction of the whole (multiply by 100 for a percentage):
Now that you've gotten the hang of the basics, let's move on. We'll take a look at null
values, because you'll get a lot of them in your own data sets, and you need to know how to
deal with them.
Dealing with Nulls
In the previous lesson, you ran some basic Pandas commands to inspect your data. We mentioned something about null values. In this lesson, you'll learn more about these null values.
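To see how many nulls there are in each column, try this in a new cell (assuming the df_pets Dataframe from the previous lesson):
df_pets.isnull().sum()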
We chain two functions together, here, the isnull function and the sum function we used in the
previous lesson. The result is a table of the null values in each column of the data set.
One last point on Nulls - if you use an aggregate function like sum or count, NaN values are ignored.
Pandas Dtypes
Dtypes are the type of data going into your columns. In a new cell in your Notebook, enter the
following:
df_students.dtypes
You should see this:
We have a list of column names and the type of data going into each column.
The First and Last columns are both of the object data type. This means they can hold strings of text. The Math, Physics, and Computers columns store scores in the float64 data type.
We'd like to change some of these types. The object types are OK. But the rest of the types can
be changed. The scores should be integers rather than floats. After all, we don't have scores like 58.7 out of 100. Using the correct types means you won't be using up so much memory in your
data analysis. Not crucial here, with only 100 or so rows. But it would be crucial if you had a
Dataframe with hundreds of thousands of records.
To convert a column to a different type, use the function astype(). In between the round brackets
of astype, you need the data type you're converting to. We'd like to convert to the int type, so the
format is this:
df_students['Math'] = df_students['Math'].astype(int)
Before the equal sign, we're specifying that we want to convert the Math column. This goes
between square brackets after the data frame name, df_students. After the equal sign, we do the
same - specify the column we want to convert. This time, we type a dot and then add astype(int).
With that in mind, add the following four lines to a new cell in your Notebook:
df_students['Math'] = df_students['Math'].astype(int)
df_students['Physics'] = df_students['Physics'].astype(int)
df_students['Computers'] = df_students['Computers'].astype(int)
df_students.dtypes
Run your code to see the following:
Our three score columns are now converted to the int32 data type.
(A word of warning: If you have NaN values in your column, you won't be able to convert to
integers from float. You'll get an error.)
In the next lesson below, you'll learn how to filter your Pandas Dataframes so that you can return
only the data you're interested in.
Pandas Filtering
Let's load up our student data again and see if we can't answer a few questions with the help of Pandas. If you haven't yet downloaded the student data, you can grab it here:
Student Scores Data Set (right click, Save As)
We're going to do some filtering, so it helps to know what these symbols mean (the symbols are
called conditional operators):
Symbol Meaning
== Equal to
!= Not equal to
> Greater than
< Less than
& And
| Or
With that in mind, off we go.
Load up the Student data with these lines, changing PATH_TO_FILE to point to a location on
your own computer:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
To see if it loaded OK, view the first 5 records:
df_students.head()
Nothing we haven't already done in previous lessons. But let's start with a simple question: Did
anyone score 100 on the Math exam?
We could just do this:
df_students['Math'] == 100
In between the square brackets after our Dataframe name, we have the name of one of our
Columns. Because the column name is text, Math needs to go between quote marks. The double
equal sign means equal to. So we're asking if any value in the Math column of
the df_students Dataframe is equal to 100. When you run the line, you should see a printout of
True or False values:
To see more details, you wrap the filter inside the square brackets of your Dataframe, which is df_students for us. Try this in a new Jupyter Notebook cell:
df_students[df_students['Math'] == 100]
When you run the code, you should see that only Alex Huffman scored 100 on the Math exam:
If you like, you can assign the conditional test to a variable:
math_filter = df_students['Math'] == 100
Now, the results of the equality test are placed into the variable we have called math_filter. To
get some output, wrap the variable name in a Dataframe object:
math_filter = df_students['Math'] == 100
df_students[math_filter]
The result is the same as above, but it might be easier to read. Let's ask another question.
Who scored more than 85 on the Computers exam? Ignoring the index numbers, display just their last name and the score.
We clearly need a greater than symbol here. But how do you specify which columns to display?
The answer is to put your column names into a Python list:
df_students[ ['Last', 'Computers'] ]
You can add as many column names as you want. Here, we add the 'Physics' column as well:
df_students[ ['Last', 'Computers', 'Physics'] ]
Notice where all the square brackets are in the line above. If you were to run the line, you'd see
just your selected columns displayed:
This prints out all the students, though. We need to narrow it down to just those who scored more
than 85. To do that, add the conditional statement at the end:
df_students[['Last', 'Computers']][df_students['Computers'] > 85]
Run the code to see the results:
Find all the students who scored more than 85 on both the Math and Physics exams.
We're better off using variables to answer this, as it could be messy to read, otherwise. First, we
can store the Math query into a variable:
math_exam_result = df_students['Math'] > 85
Then do the Physics query:
phys_exam_result = df_students['Physics'] > 85
Now display the results with this line:
df_students[ math_exam_result & phys_exam_result ]
Our two variables go between the square brackets of the Dataframe called df_students. Separating the two is the and symbol (&). Run the code to see this:
If you wanted it all on one line, it would be this:
df_students[(df_students['Math'] > 85) & (df_students['Physics'] > 85)]
Very messy!
In the image above, all the columns are displayed. If we only wanted to display certain columns, we can change our code to this:
df_students[['Last', 'Math', 'Physics']][math_exam_result & phys_exam_result]
When the code is run, the result is this:
If you wanted to know who got more than 85 in Math OR Physics, change the and symbol (&) to an or symbol (|).
Counting
If you wanted to know how many results are in your query, you can wrap the query in a len function. Like this:
len( df_students[df_students['Math'] < 20] )
Run the code and you'll see an answer of 21 displayed.
If you wanted to add some text as well, wrap the above line in a str function: (str is short for
string)
"Scored less than 20 in Math: " + str( len(df_students[df_students['Math'] < 20]) )
The result would be:
Scored less than 20 in Math: 21
OK. Let's move on and explore something else. But getting good at filtering is the way forward,
if you want to get good at Pandas. In the next lesson, you'll learn about loc and iloc, two more
ways to filter your data.
Using loc and iloc in Pandas
Sometimes, you'll want to extract just a few results from your data set. In which case, you can use loc and iloc. The difference between the two is that loc refers to rows and columns by their labels, while iloc refers to them by their integer positions. (loc is short for location.) Think of iloc as an integer location and loc as a label location. Let's see some examples to clear things up.
Here's our student data again, in case you haven't yet downloaded it:
Student Scores Data Set (right click, Save As)
In a new cell in a Jupyter Notebook, access the CSV file with this code:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head()
Replace PATH_TO_FILE with a location on your own computer, wherever you downloaded the
file to.
You should see the following output when you run the code:
Now, suppose you wanted to examine just the Math scores and the student names. The column
names you need are First, Last, and Math. You can use loc to extract just these columns. The
syntax is this:
df_name.loc[rows, columns]
So, in between the square brackets after loc, you need to specify which rows you want and which columns. The two are separated by a comma.
The first column in your data set is 0 rather than 1 (the index numbers in the left column are ignored, as they come from Pandas).
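For example, to grab every row but just those three columns, you could run something like this in a new cell (a colon on its own means 'all rows'):
student_subset = df_students.loc[:, ['First', 'Last', 'Math']]
student_subset
You should see all the students, but only their names and Math scores.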
Slicing with loc and iloc in Pandas
Now try this in a new cell:
student_subset = df_students.loc[0:3, ['First', 'Last', 'Math']]
student_subset
The result is this:
This time, we have numbers before and after the rows colon:
0:3
We're saying we want the rows from row 0 to row 3, which is four rows. (Unlike a plain Python slice, loc includes the end row.)
The colon is really a 'from : to' statement:
from : to
0:3
1:5
17:22
If you want all the rows from a certain number, you can miss out the second number:
5:
10:
17:
If you want all the numbers up to a certain number, miss out the first number:
:3
:5
:17
Play around with the numbers before and after the colon to get a feel for how slicing works.
By the way, you can use slicing on the columns, as well, if you're using iloc:
student_subset = df_students.iloc[:, 0:2]
student_subset.head()
This gets the same two columns as before, just with a list of column positions instead of a slice:
student_subset = df_students.iloc[:, [0, 1]]
You can also grab every other row by adding a second colon:
subset_iloc = df_students.iloc[::2, 0:2]
The code above will get you every other row and just the student name columns.
If you wanted, say, every other row from the first 11 rows, you can write this:
subset_iloc = df_students.iloc[0:11:2, 0:2]
This grabs every other row from row 0 up to (but not including) row 11: rows 0, 2, 4, 6, 8, and 10.
OK, that's enough of loc and iloc. We'll move on and take a look at the important topic of Group
By.
Pandas GroupBy
The groupby function is used quite a lot in Pandas, and you'll need it when you want to look at your data from a categorical point of view. For example, take a look at this spreadsheet:
We have three columns: Pet, OwnerGender, OwnerAge. There are two clear categories here: Pet and OwnerGender. They are categories because there are a limited number of values. In the Pet column, we have four different animals: Cat, Dog, Fish, Rabbit. For OwnerGender, we only have two choices, Male or Female (apologies for the binary nature of this example).
What the groupby function allows you to do is to look at things from the perspective of these
groups. Let's see how it works.
If you haven't already done so, download the pets CSV file and save it to your computer:
Download the Pets Data CSV File (Right-click, Save As)
Start a new Jupyter Notebook and add the following:
import pandas as pd
df_pets = pd.read_csv("PATH_TO_FILE/pets-data.csv")
df_pets.head()
Obviously, change the PATH_TO_FILE to where on your computer you saved the CSV file to.
Run the code and you should see this:
Pretty much what we did before, in previous sections - display the first five rows of our data. Our
Dataframe is called df_pets.
If we only wanted to group by the different types of pets, we could do this:
pets_group = df_pets.groupby('Pet')
pets_group
To the right of an equal sign, we have the name of our Dataframe (df_pets). After the Dataframe
name, we have the groupby function. In between the round brackets of groupby, we have the
name of the column we'd like to split into groups - Pet. (The name of the column goes between
quote marks.)
If you run those two lines in your Notebook, you'll see something like this:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022E0F6BC460>
This message is telling us that the pets_group variable is now a DataFrameGroupBy object.
Which is what we want. However, what we'd like to do is to split our data into two groups. First,
we want to group on the different types of pets, then we want to group on the sex of the owner.
So, run these two lines:
pets_group = df_pets.groupby( ['Pet', 'OwnerGender'] )
pets_group
Again, when you run the lines, you'll see output telling you that pets_group is now a DataFrameGroupBy object. Useful to know, but not much more than that on its own.
What we need to do next is to execute some sort of function on this new DataFrameGroupBy
object. The function we'll use is an agg function. (We did agg functions in a previous lesson, so
won't explain it again.)
Try these two lines in a new Jupyter Notebook cell:
agg_mean = pets_group.agg({'OwnerAge': ['mean']})
agg_mean
Run the code and you'll see this output:
The pets_group variable is now a DataFrameGroupBy object, remember. It has the Pet and
OwnerGender information in it. We're just using an agg function to get the mean age of the pet
owners. We can quickly see that rabbits are preferred by youngsters while cats and dogs are
preferred by older people.
Notice that the Pet column lists our four pets: Cat, Dog, Fish, Rabbit. This was our first group.
For each pet, we have a F or an M, which was the second group. On the first row, you can see
that if the cat owner is a female then the mean age is 39.6. If the cat owner is a male, the mean
age is 49.
You can add other agg functions, just like we did before. Try this:
agg_funcs = pets_group.agg({'OwnerAge': ['mean', 'min', 'max']})
agg_funcs
Run the code to see the following output:
This time, we're using three aggregate functions, mean, min, and max. We have two new
columns under the OwnerAge heading, min and max. These get you the minimum value in the
column and the maximum value.
You can have the agg function calculate more than one column. So, if we had a column for
PetAge, we could do this with the agg function:
aggs2 = pets_group.agg( {
'OwnerAge': ['mean'],
'PetAge': ['mean', 'min', 'max', 'sum']
})
aggs2
The result is this:
Notice how this is laid out and indented in the image above. Because the code is inside brackets, Python lets you indent it however you like. It's a lot easier to read, now that the code is split over multiple lines and is indented.
Notice, too, that there is a comma separating the OwnerAge line and the PetAge line. If you want
to work with more than two columns, just remember to add the comma in the right place.
Now, how can we answer the following question?
Question: What is the mean age of male pet owners and what is the mean age of female pet
owners?
To answer this question, first create a new group by OwnerGender:
owner = df_pets.groupby('OwnerGender')
Once you have that group, you can then specify a column from the group. One of the columns is
called OwnerAge. You can get the mean of this column like this:
owner.OwnerAge.mean()
The code and the output would be this:
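Putting the two lines in one cell, with a comment above them, it would look something like this:
# mean age of owners, grouped by gender
owner = df_pets.groupby('OwnerGender')
owner.OwnerAge.mean()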
Notice the line that's in green. It's green because it's a comment. You can turn any line of code
into a comment by starting the line with the # symbol. But we can quickly see that the mean age
for female pet owners is 30 while the mean age for male pet owners is just under 40.
Here's a follow-up question:
Question: What is the mean age of pet owners by pet type?
You can answer the question in a similar way. Try this code in a new cell:
owner2 = df_pets.groupby('Pet')
owner2.OwnerAge.mean()
Run the code to see the following:
Pandas Apply Function
You can use the Pandas apply function to apply a Python function to your data series or Dataframe. To clear that up, we're going to use our pets CSV file again. If you haven't yet downloaded this dataset, you can do so here:
Download the Pets Data CSV File (Right-click, Save As)
The OwnerGender column in the dataset has an F or an M in it (again, apologies for the binary nature of this example):
Now, what if we wanted to convert the F to Female and the M to Male? How would we go about
this? Well, we can use apply. Let's see how.
Load the CSV pets data again with these lines, changing PATH_TO_FILE to point to a location
on your own computer:
import pandas as pd
df_pets = pd.read_csv('PATH_TO_FILE\\pets-data.csv')
df_pets.head()
In a new Jupyter Notebook cell, add this Python function:
def OwnerGenderFormat(gen):
    if gen == "F":
        return "Female"
    elif gen == "M":
        return "Male"
    else:
        return "None"
It should look like this, in your Notebook: (Make sure the indents are the same as in the image
above. The if should be one press of the tab key on your keyboard and the return should be two
tabs.)
Make sure to run this code. It's just a simple if statement: it returns Female, Male, or None, depending on the value of gen.
Test out this function in a new Notebook cell. Add this to it:
OwnerGenderFormat("F")
Run the code and you should see this:
OK, the function works well when tested. Now to apply it to the OwnerGender column.
In a new Notebook cell, enter these lines:
df_pets['OwnerGender'] = df_pets['OwnerGender'].apply(OwnerGenderFormat)
df_pets.head()
Run the code and the new values for the OwnerGender should display:
To understand the code, first look at what we have to the left of the equal sign (=). We have this:
df_pets['OwnerGender']
This is just referencing the OwnerGender column (a series) of our df_pets Dataframe.
To the right of the equal sign, we have this:
df_pets['OwnerGender'].apply(OwnerGenderFormat)
Again, we have the reference to the OwnerGender column. After a dot, we then have this:
apply(OwnerGenderFormat)
In between the round brackets of the apply function, we have the name of our Python function.
Pandas will apply this function to every cell in the OwnerGender column. Behind the scenes, it
passes each cell value over to the function, storing the value in the gen variable. When it's done,
it writes the results back to the OwnerGender column on the left of the equal sign.
Now let's do something a little more sophisticated. We'll switch back to our Student Data. If you
haven't yet downloaded this dataset, you can do so here:
Student Scores Data Set (right click, Save As)
Load up the dataset with these lines:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head(10)
Obviously, replace PATH_TO_FILE to point to a location on your own computer.
Looking at the data, we can see we have exam scores, but no grades. What we'd like to do is to add a new column with exam grades as the values. Let's start with the Math grades.
The first thing to do is to add a new column. There are lots of ways to add columns in Pandas.
For us, though, it would make sense just to copy the Math column. We can then convert scores
into grades with this new column.
In a new cell in your Jupyter Notebook, add and run this code:
df_students['MathGrades'] = df_students['Math']
df_students.head()
We're creating a new column to the left of the equal sign (=). First, we have our Dataframe name,
df_students. In between the square brackets of the Dataframe, we have given the new column the
name MathGrades. This goes between quote marks.
To the right of the equal sign, we have this:
= df_students['Math']
This just copies the Math column to whatever is on the left of the equal sign, which is a new
column for us.
When you run the code, you should see this:
And there's our new column on the end.
Let's convert the scores to grades. Add this function in a new cell in your Notebook:
def getGrade(val):
    if val >= 90 and val <= 100:
        return "A+"
    elif val >= 80 and val <= 89:
        return "A"
    elif val >= 70 and val <= 79:
        return "B+"
    elif val >= 60 and val <= 69:
        return "B"
    elif val >= 50 and val <= 59:
        return "C+"
    elif val >= 40 and val <= 49:
        return "C"
    elif val >= 30 and val <= 39:
        return "D"
    else:
        return "F"
It should look like this in your Notebook:
Make sure your cursor is flashing inside the cell and Run the code. Running the code when
you've finished typing your function will ensure that Pandas knows about it. Plus, if you've made
any mistakes in your Python code, you'll see an error message you can use to correct your code.
In a new cell, test the code out with this line:
'Grade is ' + getGrade(34)
When you run the code, you should see a grade of D print out.
OK, our new function works. Now let's apply it to our new column. Add this code in a new cell:
df_students['MathGrades'] = df_students['Math'].apply(getGrade)
df_students.head()
Run the code and you should see this:
We now have a column of Math grades.
The code is pretty much the same as before. Last time, we had this:
df_students['MathGrades'] = df_students['Math']
This copied the values in the Math column over to a new column called MathGrades. This time,
we have this:
df_students['MathGrades'] = df_students['Math'].apply(getGrade)
The only difference is the apply function on the end:
apply(getGrade)
Pandas gets a grade for each value in the Math column. It then sets that grade in the MathGrades
column (to the left of the equal sign.) We can do it this way because MathGrades is an exact
copy of the Math column.
See if you can do the other two columns by yourself. Add two new columns. Convert the Physics
and Computers scores to grades.
Now let's use a lambda to add up values in columns, returning the result in a new column.
Suppose we want to add up the scores in our student data columns (Math + Physics +
Computers). The total would go in a new column called ExamTotal. We can construct the
lambda part like this:
apply(lambda colName: colName['Math'] + colName['Physics'] + colName['Computers'] )
We also need to add the name of our Dataframe:
df_students.apply(lambda colName: colName['Math'] + colName['Physics'] +
colName['Computers'] )
This adds up the Math, Physics, and Computers values for each student.
(Notice that we're not using a column name after df_students. That's because we want to apply
our lambda to the entire Dataframe, and not just a single column.)
In Pandas, you can apply your lambda code to either rows or columns. This is done with the axis attribute, which can be set to either 1 or 0. The default is 0, which hands each column to your lambda; setting axis to 1 hands over each row instead, which is what we need here.
apply(lambda x: CODE_GOES_HERE, axis = 1)
apply(lambda x: CODE_GOES_HERE, axis = 0)
Adding the axis attribute, the code would be this:
df_students.apply(lambda colName: colName['Math'] + colName['Physics'] +
colName['Computers'], axis = 1 )
When you run the code, the result is as follows:
Notice how we've spread the code over a few lines. You can do this in your own code if the lines are looking too long or are becoming hard to read.
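To store these totals in the ExamTotal column we mentioned earlier, assign the result back to the Dataframe. A minimal sketch (using row as the lambda's variable name, since each row is what gets passed in):
df_students['ExamTotal'] = df_students.apply(
    lambda row: row['Math'] + row['Physics'] + row['Computers'],
    axis=1)
df_students.head()
Run that and you should see the new ExamTotal column on the end.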
But let's move on - lambdas can improve your Pandas skills, but they can be a bit tricky to get
the hang of. In the next lesson, we'll cover Pandas and plots. You get to create some little charts!
Pandas Plots
In this lesson, we'll start charting our data and create a few Pandas plots. We'll assume you're using a Jupyter Notebook to do these tutorials. This will make life easier, as it has plotting built in. However, you do need to install something called Matplotlib. So, fire up your Command Prompt again. (If you're not sure what this means, see the first lesson here: install.) Navigate to your Python directory. Enter this command:
pip3 install matplotlib
It should look like this in your command prompt:
Let's load our student data again. If you haven't already downloaded this dataset, you can grab a
copy here:
Student Scores Data Set (right click, Save As)
Load the dataset and display the first five rows with these lines (change PATH_TO_FILE to
point to a location on your own computer):
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head()
Let's just extract the Math column (columns are called Series, remember) and see what happens
when we plot it. Add this line in a new cell in your Notebook:
df_students['Math'].plot()
Run the line to see this plot appear:
Using the inbuilt function plot gets you a line chart by default. If you want another type of chart,
you can use one of the following:
plot.area()
plot.bar()
plot.barh()
plot.box()
plot.hexbin()
plot.hist()
plot.kde()
plot.density()
plot.line()
plot.pie()
plot.scatter()
So a line chart would be this:
df_students['Math'].plot.line()
Our chart is a bit messy, though. Let's add the grade columns to our Dataframe, like we did in a previous lesson. First, add this Python function to a new Notebook cell and run it, so that Pandas knows about it:
def getGrade(val):
    if val >= 90 and val <= 100:
        return "A"
    elif val >= 70 and val <= 89:
        return "B"
    elif val >= 50 and val <= 69:
        return "C"
    elif val >= 30 and val <= 49:
        return "D"
    elif val >= 10 and val <= 29:
        return "E"
    else:
        return "F"
It should look like this in your Notebook:
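Next, count how many students are in each grade. The Math version would be something like this (assuming you create the MathGrades column with apply, as in the previous lesson - the Physics and Computers versions are below):
df_students['MathGrades'] = df_students['Math'].apply(getGrade)
seriesMath = df_students['MathGrades'].value_counts().sort_index(ascending=True)
seriesMath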
The code uses value_counts to get a count of how many students are in each grade. (We're
adding a sort on the end.)
We can now create a bar chart. In a new cell, add and run this line:
seriesMath.plot.bar()
You'll see this:
Here's the code to create a bar chart from the Physics grades:
df_students['PhysGrades'] = df_students['Physics'].apply(getGrade)
seriesPhys = df_students['PhysGrades'].value_counts().sort_index(ascending=True)
seriesPhys.plot.bar()
And here's the code for the Computers grades:
df_students['CompGrades'] = df_students['Computers'].apply(getGrade)
seriesComp = df_students['CompGrades'].value_counts().sort_index(ascending=True)
seriesComp.plot.bar()
Make sure to create these two series, seriesPhys and seriesComp, by running the code - we'll be
needing them soon.
The charts look a bit bland, though. You can spruce them up by including a few attributes between the round brackets after the plot type - attributes like figsize, legend, xlabel, ylabel, color, and fontsize. You'll see them in action shortly.
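First, though, let's compare all three subjects in one table. Assuming the seriesMath, seriesPhys, and seriesComp series from above, you can build a new Dataframe from them. A quick sketch:
newDF = pd.DataFrame({
    'Math': seriesMath,
    'Physics': seriesPhys,
    'Computers': seriesComp
})
newDF
Run that and you'll see a table with the grades down the side and a count for each subject.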
The numbers are how many students are in each grade. So, 8 students got an A in Math, 18 students got an A in Physics, while 9 students got an A grade in Computers.
Let's see all this in a bar chart.
Add the following in a new Notebook cell:
newDF.plot.bar( y=['Math', 'Physics', 'Computers'] )
The result is this, when you run the line:
We have a nice bar chart with all three subjects compared for each grade.
Notice what we have between the round brackets of bar:
y=['Math', 'Physics', 'Computers']
We're specifying which columns from our new Dataframe that we want to use in the y axis.
You can also add a column name that you want to use in the x axis, if you need to:
bar(x='Grades', y='GradesCount')
Often, you don't need to specify the y columns, as Pandas usually guesses which columns to use.
bar(x='Grades')
We can add some formatting to our chart, though. Try this:
newDF.plot.bar(y=['Math', 'Physics', 'Computers'],
figsize=(8,8),
legend=True,
xlabel='Exam Grades',
ylabel='Num Achieving Grade',
color={'Math': '#003f5c',
'Physics': '#bc5090',
'Computers': '#ffa600'},
fontsize=20)
Run the code to see the updated chart:
But that's enough of charts, and it's the end of this Pandas short course. Hope you enjoyed it. If you want to take Pandas further, there's a website called Kaggle that's a great place to go to get datasets. Not only that, others will upload the code they used to analyse each dataset, so you can learn from them.