Data Analysis Book Python (Pandas)
This is a short taster course on Data Analysis using Pandas. It's designed to get you up and
running with one of the best and most popular Python libraries to analyze your datasets. You
don't need to buy any software for this course, as everything is free to download.
Part One: Getting Started with Pandas
1. Install Jupyter, Python, Pandas
2. Jupyter Notebook Tutorial
3. Pandas Dataframes
4. Importing Datasets
5. Basic Pandas Operations
6. Dealing with Nulls
Part Two: Analyzing Data with Pandas
1. Pandas Filtering
2. Using loc, iloc in Pandas
3. Pandas GroupBy
4. Pandas Apply Function
5. Pandas plots, graphs, charts
Install Jupyter Notebook on Windows
We're going to be using Jupyter Notebooks to do this Pandas Data Analysis course. You'll be downloading some CSV datasets from us, so you'll need Jupyter installed on your computer. The exception is the first lesson: you don't need to install anything for the Getting Started tutorial, which is here:
Pandas Dataframes - Getting Started
To run the code online, head on over to the Jupyter site here:
Online Jupyter Notebook
For the rest of the Data Analysis course, you'll need Python and Jupyter installed. This is pretty
easy on Windows.
Install Python
If you haven't already done so, you can get Python here:
https://www.python.org/downloads/windows/
Scroll to the bottom of the page. Under Files, download the 64-bit version (Windows installer
(64-bit)).
Install Python on Windows
Once you've downloaded Python, double-click the downloaded file to launch the installer. You
should see this:
Put a check in the box at the bottom, Add python.exe to PATH.
Also, make a note of where Python is installing (the red box and arrow in the image above).
You're going to need this location shortly. But it's usually in
your AppData\Local\Programs folder on Windows. Something like this:
C:\Users\Ken\AppData\Local\Programs\Python\Python311\
The 311 above means Python 3.11. Yours may be different. But you should see a file
called python.exe inside of this folder.
There should also be a folder called Scripts. Look inside of the Scripts folder and note there are
a few entries for pip:
In the image above, we have pip, pip3 and pip3.11. You use pip to install other Python packages.
Like Jupyter.
Next, open a Command Prompt (type cmd into the Windows search box and press Enter). We need to change to our Python folder. You do this with the cd command (short for Change Directory). Enter this:
cd PATH_TO_YOUR_PYTHON_INSTALL
Replace PATH_TO_YOUR_PYTHON_INSTALL with the location you were asked to
remember earlier in this tutorial. (Hopefully, you haven't forgotten this already! If so, search
python.exe.)
Press Enter to change the directory:
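Now you can install Jupyter with pip. The command would be something like this (if Windows can't find pip, try Scripts\pip instead):
pip install jupyter
Wait for pip to finish downloading and installing everything.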
Look inside your Scripts folder again to see lots more files:
Back to the Command Prompt. To launch Jupyter, enter this command:
jupyter notebook
Your Command Prompt should look something like this:
When you press enter, you should see a browser appear with the start page for Jupyter:
Note the address bar at the top. You are running a server on your computer now
(localhost:8888/) and this webpage is called tree.
You could make a start right now on the rest of the course. However, if you were to close the
webpage down, you'd have to restart the Command Prompt, change the directory to the Python
folder, and enter the Jupyter Notebook line again. It would make life a lot easier if you had a
shortcut on your desktop that you could double-click to launch a Jupyter Notebook.
Jupyter Notebook Tutorial
A Jupyter Notebook is a great way to run Python and Pandas code as it means you don't have to
install an IDE. In the previous lesson, we installed Jupyter on Windows. But the only reason we
did so was because we'll need to load up some local datasets for this Pandas course. You can use
the online Jupyter version here (Jupyter Notebook link):
Jupyter Notebook
We have a Python course for beginners here:
Home and Learn Python Course
You can do a lot of the course through the Jupyter website, rather than installing PyCharm.
Let's see how Jupyter Notebooks work, though.
On the left, you'll see a list of folders, files, and any Notebooks you create. Once you create a
Notebook, you can reopen it by simply clicking its link on the left.
Top right of the homepage, click the button that says New. You'll see a dropdown list:
Click on the Folder item. It might look like nothing has happened, but you should see a folder
called Untitled Folder has been added to the list on the left. Select the folder by checking the
box next to it. You'll see some icons appear top left:
Click on the New button again. From the menu, select the Python item:
Creating a new Python file will launch it in a new browser tab. It should look like this:
Notice that it says Untitled at the top. Left-click on the word Untitled and you'll see this box
appear:
Type a new name for your Python file and click the Rename button. You should see your new
title appear at the top in place of Untitled.
Click back on the Tutorials tab for a moment. You'll see your new Notebook in the folder:
If you accidentally close down your Getting Started tab, click on it here to reopen it.
Click back on your Getting Started tab. Let's add some code and run it.
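In the first cell, type a short line of Python - something like this will do:
print('Hello Pandas')
Now click the Run button at the top.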
Our Hello Pandas message is output below the cell. Notice that a new cell gets added whenever
you Run your code.
You can add Python functions to cells and run them from another cell. To clarify this, add the
following Python function to the cell below the first one:
def greetMe(myName):
    return 'Welcome to the lesson, ' + myName
With your cursor inside of the greetMe function, click the Run button again. The only thing that
will happen is that a new cell appears below the function. Add this in the empty cell:
greetMe('Kenny')
Run the third cell and you should see something like this output:
You can cut cells, if you don't want them. Click inside the new empty cell. Type this:
'Cut this'
From the menu at the top, select Edit > Cut Cells. (Or just click the scissors icon.)
The cell should disappear.
To get a new cell, click inside cell 3, the one with the greetMe('Kenny') in it. From
the Insert menu at the top of Jupyter, click Insert > Cell Below:
You'll get a new cell:
OK, that's about it for this lesson. There's not much else to a Jupyter Notebook - they are pretty easy and intuitive to use. Before we move on to the first Data Analysis lesson, let's sort out the desktop shortcut we mentioned in the install lesson.
Windows has a wizard for creating shortcuts: right-click your desktop, select New > Shortcut, and fill in the location of the item. Click Next to see this screen:
Give your shortcut a name and then click Finish. You should see a Command Prompt shortcut appear on your desktop. To test it out, close down your browser with the Jupyter Notebook. Close down the Command Prompt. Double-click your new shortcut and everything should start up again.
Pandas Dataframes
A Dataframe in Pandas is just a table of data, like this one:
    Pet  Number
0   Cat       3
1   Dog       2
2  Fish       7
Simple, hey! This data is a list of pets that people own and how many. The table has three columns. The first column is called the index, and is just a unique value.
There are three rows in the table: row 0, row 1, and row 2. It might be that the first row represents a pet owner. They keep cats and have three of them. The pet owner in the second row, row 1, keeps 2 dogs, while the owner in the third row has 7 fish.
A single column of data, by the way, is called a Series. You can, for example, get just the Pet
column from the Dataframe and do something with this single column (Series).
Start up your Jupyter Notebooks app. Create a new Notebook and rename it to anything you like.
(If you're not sure how to do this, see here: Install Jupyter.)
To construct such a table in Pandas, you first import the library:
import pandas as pd
The pd here is just a variable name. You could call it almost anything you like:
import pandas as pan
The variable is now called pan. You can use this pan variable from now on whenever you want
to use something from the pandas library. (You'll see how it works shortly.) However, we'll stick
with pd as the variable name as it's become a quite common naming convention.
To create the simple table above, we use this syntax:
df_pets = pd.DataFrame(
{
'COL_1_NAME': ['COL_VALUE_1', 'COL_VALUE_2', 'COL_VALUE_3'],
'COL_2_NAME': ['COL_VALUE_1', 'COL_VALUE_2', 'COL_VALUE_3'],
}
)
The df_pets above is just another variable name. The Dataframe will be stored inside of this
variable. After an equal sign, we have this:
pd.DataFrame
The pd variable is the one holding a reference to the pandas library, the one we imported pandas as.
All the round, curly, and square brackets that come after DataFrame can be a pain. Miss one out
and you'll get errors. But you need a pair of round brackets after DataFrame:
pd.DataFrame()
Inside the round brackets you need a pair of curly brackets:
pd.DataFrame( { } )
Curly brackets in Python mean you want a dictionary object. A dictionary is something with a
Key/Value pair. Like this:
Name:Ken
Age:34
Job: Writer
The Keys here are Name, Age and Job. The values are Ken, 34, and Writer.
Inside the curly brackets, you can add your column names and values:
import pandas as pd
df_pets = pd.DataFrame( {
'Pet': ['Cat', 'Dog', 'Fish'],
'Number': [3, 2, 7]
})
df_pets
Notice the format for column names: They go between quotation marks, followed by a colon:
'Pet':
'Number':
You can use single or double quotes.
After the colon, you can add a Python list (the square brackets) for your values.
['Cat', 'Dog', 'Fish'],
Notice the comma after the list - you need that to separate each column and values. (Except the
final one.)
So, copy and paste the following into the first empty cell of your new Jupyter Notebook (The
indent should be one press of the TAB key on your keyboard):
import pandas as pd
df_pets = pd.DataFrame( {
'Pet': ['Cat', 'Dog', 'Fish'],
'Number': [3, 2, 7]
})
df_pets
Press the Run button and you should see this:
Incidentally, you can type the code all on one line. We've spread it over a few lines just so that
you can see the syntax better. So you could do this instead:
df_pets = pd.DataFrame( { 'Pet': ['Cat', 'Dog', 'Fish'], 'Number': [3, 2, 7] } )
(The spaces don't matter, either.)
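Remember that a single column is called a Series. To pull one out of the Dataframe, you put the column name between square brackets. For example, this would get you just the Pet column:
df_pets['Pet']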
Let's move on and import a file that we can use as a Dataframe.
Importing Datasets
In the previous lesson, you learned how to create a DataFrame in Pandas. Once you have a DataFrame, you can start to
manipulate your data. In this lesson, you'll learn how to import some data from a csv file and use
that as a DataFrame.
File Formats
You can import a wide variety of file formats using Pandas, from data in the popular CSV format
to Excel files. You start with the word read, then an underscore, then the format you want to
read. For example, to read in a CSV file, you'd do this:
import pandas as pd
file = pd.read_csv('path_to_file.csv')
For an Excel file, it would be this:
import pandas as pd
file = pd.read_excel('path_to_file.xlsx')
For a JSON file, it would be this:
import pandas as pd
file = pd.read_json('path_to_file.json')
For a full list of the formats, see here on the Pandas site:
Pandas File Formats
So, here's a CSV file for you to download. (CSV stands for comma separated values.)
Download the Pets Data CSV File (Right-click, Save As)
Once you've downloaded the file, you can double-click it to open it up, if you like. There's not much to it:
Now let's import this file and see what it looks like in Pandas.
Start a new Jupyter Notebook. Add the following in the first cell:
import pandas as pd
df_pets = pd.read_csv('PATH_TO_FILE.csv')
df_pets
Replace PATH_TO_FILE with wherever you saved your downloaded file to. In Windows, you
can get the full file path by opening an Explorer window. (A shortcut is the WINDOWS Key + E
on your keyboard.)
Navigate to the folder where the file is. Select the file and, on the Home tab, click the Copy path
button, as in the image below:
Paste between the round brackets of read_csv and you'll have something like this:
pd.read_csv('C:\Users\Ken\Documents\DataScience\csv\pets-data.csv')
You now need to change all the backslashes to forward slashes, otherwise you'll get a Unicode escape error. (Or you can double up each backslash in place of the single backslashes above.)
pd.read_csv('C:/Users/Ken/Documents/DataScience/csv/pets-data.csv')
Or this:
pd.read_csv('C:\\Users\\Ken\\Documents\\DataScience\\csv\\pets-data.csv')
Click the Run button in your Jupyter Notebook and you should see this:
What we've done here is import the pandas library. The variable name we're going to use is pd. On line 2, we've read the CSV file in and stored it all in a variable called df_pets. This variable, df_pets, now contains a Dataframe. We print the Dataframe out on the third line. Notice that, for a big file, not all the results are displayed, just the top five and the bottom five, along with a count of how many rows and columns there are.
If you want to display some records from the top of your results, you can use the head function.
Like this:
df_pets.head()
If you don't type a number between the round brackets of head, you'll get the first 5 rows of data.
Type a number between the round brackets of head and it will display that number of rows:
df_pets.head(10)
If you want to display rows from the bottom instead of rows from the top, use tail instead of
head:
df_pets.tail()
df_pets.tail(10)
Basic Pandas Operations
In the previous lesson, you imported a CSV file and loaded it into a Pandas DataFrame. In this lesson, you'll learn some basic operations you can perform on your Pandas data. Let's start with Shape and Info. If you haven't done the previous lesson, here's the CSV file for you to download. (CSV stands for comma separated values.)
Download the Pets Data CSV File (Right-click, Save As)
Once you've downloaded the file, copy and paste the following code into a new cell in a Jupyter
Notebook:
import pandas as pd
df_pets = pd.read_csv('PATH_TO_FILE.csv')
df_pets
Replace PATH_TO_FILE with wherever you saved your downloaded file to.
Pandas Shape
Whenever you first read a csv file in and create a Dataframe, you'll want to display basic
information about the Dataframe. We'll go through some of the methods you can use.
If you want to display how many rows and columns you have in the Dataframe, you can use
shape. Enter the following into an empty cell in your notebook (this assumes you are following
along from the previous lesson and have created the df_pets DataFrame.):
df_pets.shape
Shape is an attribute of Dataframes, rather than a method or function, so it doesn't need round
brackets.
Press Run at the top of your Jupyter Notebook and you should see this:
df_pets.shape
(27, 4)
The first number between the round brackets is how many rows you have in your Dataframe.
The second number is how many columns you have.
Pandas Info
To get information about the Dataframe, you can use the function info(). Enter the following into
an empty cell in your Notebook:
df_pets.info()
You should see this when you click the Run button (or press CTRL + ENTER on your keyboard
as a shortcut.):
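Pandas Agg
Another way to summarize your data is with the agg function, which runs aggregate functions like mean, min, and max on a column. For a single column, the cell would look something like this (OwnerAge is one of the columns in the pets data):
df_pets.agg( { 'OwnerAge' : ['mean'] } )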
You can add more than one column. Just add a comma after the square bracket, then the new
column and functions. Try this:
df_pets.agg( { 'OwnerAge' : ['mean', 'min', 'max'], 'PetAge': ['mean', 'min', 'max'] } )
Run the command to see the new table:
Pandas value_counts()
One very useful function you can use is called value_counts. As its name suggests, you use it to count values in your data set. As an example, try this in a new cell in your Jupyter Notebook:
df_pets['PetType'].value_counts()
You should see this as the result:
So, you need the name of your Dataframe and a column name between square brackets:
df_pets['PetType']
After a dot, type the function name and its round brackets:
df_pets['PetType'].value_counts()
You get a count of each value in the series; null values aren't counted (see the next lesson for a deeper dive into null values). The display is in descending order. So, the count of 9 comes first here. It's counting how many times the PetType Dog was recorded.
If you want the display in ascending order, you need to add ascending=True between the round brackets of value_counts. Try this:
df_pets['PetType'].value_counts(ascending=True)
The result is this:
This time, 5 is the value at the top and the PetType is Rabbit.
You can do an alphabetical sort of the values. You just need to add a new function on the end -
sort_index. Try this:
df_pets['PetType'].value_counts().sort_index(ascending=True)
The result is this:
Now the PetTypes are sorted alphabetically. Cat comes first because we specified that ascending=True (which is sort_index's default, in fact). For a descending sort, with Rabbit at the top, use ascending=False.
If you want an alphabetical sort when some of the values are equal, you need to add yet
another function on the end:
df_pets['PetType'].value_counts().sort_index().sort_values()
The new function is sort_values. If we had, say, Fish and Rabbit both on a count of 6, the tied PetTypes would be sorted alphabetically in the results.
You can also display the values as proportions. Try this:
df_pets['PetType'].value_counts(normalize=True)
By using normalize=True between the round brackets of value_counts, you get each count as a fraction of the whole (multiply by 100 for a percentage):
Now that you've gotten the hang of the basics, let's move on. We'll take a look at null
values, because you'll get a lot of them in your own data sets, and you need to know how to
deal with them.
Dealing with Nulls
In the previous lesson, you ran some basic Pandas commands to inspect your data. We mentioned something about null values. In this lesson, you'll learn more about these null values.
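To see how many nulls there are in each column, try this in a new cell (assuming the df_pets Dataframe from the previous lesson):
df_pets.isnull().sum()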
We chain two functions together, here, the isnull function and the sum function we used in the
previous lesson. The result is a table of the null values in each column of the data set.
One last point on Nulls - if you use an aggregate function like sum or count, NaN values are ignored.
Pandas Dtypes
Dtypes are the type of data going into your columns. In a new cell in your Notebook, enter the
following:
df_students.dtypes
You should see this:
We have a list of column names and the type of data going into each column.
The First and Last columns are both of the object data type. This means they can hold strings of text. The Math, Physics, and Computers columns store scores in the float64 data type.
We'd like to change some of these types. The object types are OK. But the rest of the types can
be changed. The scores should be integers rather than floats. After all, we don't have scores like 58.7 out of 100. Using the correct types means you won't be using up so much memory in your
data analysis. Not crucial here, with only 100 or so rows. But it would be crucial if you had a
Dataframe with hundreds of thousands of records.
To convert a column to a different type, use the function astype(). In between the round brackets
of astype, you need the data type you're converting to. We'd like to convert to the int type, so the
format is this:
df_students['Math'] = df_students['Math'].astype(int)
Before the equal sign, we're specifying that we want to convert the Math column. This goes
between square brackets after the data frame name, df_students. After the equal sign, we do the
same - specify the column we want to convert. This time, we type a dot and then add astype(int).
With that in mind, add the following four lines to a new cell in your Notebook:
df_students['Math'] = df_students['Math'].astype(int)
df_students['Physics'] = df_students['Physics'].astype(int)
df_students['Computers'] = df_students['Computers'].astype(int)
df_students.dtypes
Run your code to see the following:
Our three score columns are now converted to the int32 data type.
(A word of warning: If you have NaN values in your column, you won't be able to convert to
integers from float. You'll get an error.)
In the next lesson below, you'll learn how to filter your Pandas Dataframes so that you can return
only the data you're interested in.
Pandas Filtering
Let's load up our student data again and see if we can't answer a few questions with the help of Pandas. If you haven't yet downloaded the student data, you can grab it here:
Student Scores Data Set (right click, Save As)
We're going to do some filtering, so it helps to know what these symbols mean (the symbols are
called conditional operators):
Symbol Meaning
== Equal to
!= Not equal to
> Greater than
< Less than
& And
| Or
With that in mind, off we go.
Load up the Student data with these lines, changing PATH_TO_FILE to point to a location on
your own computer:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
To see if it loaded OK, view the first 5 records:
df_students.head()
Nothing we haven't already done in previous lessons. But let's start with a simple question: Did
anyone score 100 on the Math exam?
We could just do this:
df_students['Math'] == 100
In between the square brackets after our Dataframe name, we have the name of one of our
Columns. Because the column name is text, Math needs to go between quote marks. The double
equal sign means equal to. So we're asking if any value in the Math column of
the df_students Dataframe is equal to 100. When you run the line, you should see a printout of
True or False values:
To see more details, you wrap the filter inside the square brackets of your Dataframe, which is df_students for us. Try this in a new Jupyter Notebook cell:
df_students[df_students['Math'] == 100]
When you run the code, you should see that only Alex Huffman scored 100 on the Math exam:
If you like, you can assign the conditional test to a variable:
math_filter = df_students['Math'] == 100
Now, the results of the equality test are placed into the variable we have called math_filter. To
get some output, wrap the variable name in a Dataframe object:
math_filter = df_students['Math'] == 100
df_students[math_filter]
The result is the same as above, but it might be easier to read. Let's ask another question.
Who scored more than 85 on the Computers exam? Ignoring the index numbers, display just their last name and the score.
We clearly need a greater than symbol here. But how do you specify which columns to display?
The answer is to put your column names into a Python list:
df_students[ ['Last', 'Computers'] ]
You can add as many column names as you want. Here, we add the 'Physics' column as well:
df_students[ ['Last', 'Computers', 'Physics'] ]
Notice where all the square brackets are in the line above. If you were to run the line, you'd see
just your selected columns displayed:
This prints out all the students, though. We need to narrow it down to just those who scored more
than 85. To do that, add the conditional statement at the end:
df_students[['Last', 'Computers']][df_students['Computers'] > 85]
Run the code to see the results:
Find all the students who scored more than 85 on both the Math and Physics exams.
We're better off using variables to answer this, as it could be messy to read, otherwise. First, we
can store the Math query into a variable:
math_exam_result = df_students['Math'] > 85
Then do the Physics query:
phys_exam_result = df_students['Physics'] > 85
Now display the results with this line:
df_students[ math_exam_result & phys_exam_result ]
Our two variables go between the square brackets of the Dataframe called df_students. Separating the two is the and symbol (&). Run the code to see this:
If you wanted it all on one line, it would be this:
df_students[(df_students['Math'] > 85) & (df_students['Physics'] > 85)]
Very messy!
In the image above, all the columns are displayed. If we only wanted to display certain columns, we can change our code to this:
df_students[['Last', 'Math', 'Physics']][math_exam_result & phys_exam_result]
When the code is run, the result is this:
If you wanted to know who got more than 85 in Math OR Physics, change the and symbol (&) to an or symbol (|).
Counting
If you wanted to know how many results are in your query, you can wrap the query in a len function. Like this:
len( df_students[df_students['Math'] < 20] )
Run the code and you'll see an answer of 21 displayed.
If you wanted to add some text as well, wrap the above line in a str function: (str is short for
string)
"Scored less than 20 in Math: " + str( len(df_students[df_students['Math'] < 20]) )
The result would be:
Scored less than 20 in Math: 21
OK. Let's move on and explore something else. But getting good at filtering is the way forward,
if you want to get good at Pandas. In the next lesson, you'll learn about loc and iloc, two more
ways to filter your data.
Using loc and iloc in Pandas
Sometimes, you'll want to extract just a few results from your data set. In which case, you can use loc and iloc. The difference between the two is that loc refers to rows and columns by their labels, while iloc refers to them by their integer positions. (loc is short for location.) Think of iloc as an integer location and loc as a label location. Let's see some examples to clear things up.
Here's our student data again, in case you haven't yet downloaded it:
Student Scores Data Set (right click, Save As)
In a new cell in a Jupyter Notebook, access the CSV file with this code:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head()
Replace PATH_TO_FILE with a location on your own computer, wherever you downloaded the
file to.
You should see the following output when you run the code:
Now, suppose you wanted to examine just the Math scores and the student names. The column
names you need are First, Last, and Math. You can use loc to extract just these columns. The
syntax is this:
df_name.loc[rows, columns]
So, in between the square brackets after loc, you need to specify which rows you want and which columns. The two are separated by a comma.
The first column in your data set is 0 rather than 1 (the index numbers in the left column are ignored, as they come from Pandas).
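For example, to grab every row but just those three columns, you could run something like this in a new cell (a colon on its own means 'all rows'):
student_subset = df_students.loc[:, ['First', 'Last', 'Math']]
student_subset
You should see all the students, but only their names and Math scores.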
Slicing with loc and iloc in Pandas
Now try this in a new cell:
student_subset = df_students.loc[0:3, ['First', 'Last', 'Math']]
student_subset
The result is this:
This time, we have numbers before and after the rows colon:
0:3
We're saying we want the rows from row 0 to row 3, which is four rows. (Unlike a plain Python slice, loc includes the end row.)
The colon is really a 'from : to' statement:
from : to
0:3
1:5
17:22
If you want all the rows from a certain number, you can miss out the second number:
5:
10:
17:
If you want all the numbers up to a certain number, miss out the first number:
:3
:5
:17
Play around with the numbers before and after the colon to get a feel for how slicing works.
By the way, you can use slicing on the columns, as well, if you're using iloc:
student_subset = df_students.iloc[:, 0:2]
student_subset.head()
This gets the same two columns as before, just with a list of column positions instead of a slice:
student_subset = df_students.iloc[:, [0, 1]]
You can also grab every other row by adding a second colon:
subset_iloc = df_students.iloc[::2, 0:2]
The code above will get you every other row and just the student name columns.
If you wanted, say, every other row from the first 11 rows, you can write this:
subset_iloc = df_students.iloc[0:11:2, 0:2]
This grabs every other row from row 0 up to (but not including) row 11: rows 0, 2, 4, 6, 8, and 10.
OK, that's enough of loc and iloc. We'll move on and take a look at the important topic of Group
By.
Pandas GroupBy
The groupby function is used quite a lot in Pandas, and you'll need it when you want to look at your data from a categorical point of view. For example, take a look at this spreadsheet:
We have three columns: Pet, OwnerGender, OwnerAge. There are two clear categories here: Pet and OwnerGender. They are categories because there are a limited number of values. In the Pet column, we have four different animals: Cat, Dog, Fish, Rabbit. For OwnerGender, we only have two choices, Male or Female (apologies for the binary nature of this example).
What the groupby function allows you to do is to look at things from the perspective of these
groups. Let's see how it works.
If you haven't already done so, download the pets CSV file and save it to your computer:
Download the Pets Data CSV File (Right-click, Save As)
Start a new Jupyter Notebook and add the following:
import pandas as pd
df_pets = pd.read_csv("PATH_TO_FILE/pets-data.csv")
df_pets.head()
Obviously, change the PATH_TO_FILE to where on your computer you saved the CSV file to.
Run the code and you should see this:
Pretty much what we did before, in previous sections - display the first five rows of our data. Our
Dataframe is called df_pets.
If we only wanted to group by the different types of pets, we could do this:
pets_group = df_pets.groupby('Pet')
pets_group
To the right of an equal sign, we have the name of our Dataframe (df_pets). After the Dataframe
name, we have the groupby function. In between the round brackets of groupby, we have the
name of the column we'd like to split into groups - Pet. (The name of the column goes between
quote marks.)
If you run those two lines in your Notebook, you'll see something like this:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022E0F6BC460>
This message is telling us that the pets_group variable is now a DataFrameGroupBy object.
Which is what we want. However, what we'd like to do is to split our data into two groups. First,
we want to group on the different types of pets, then we want to group on the sex of the owner.
So, run these two lines:
pets_group = df_pets.groupby( ['Pet', 'OwnerGender'] )
pets_group
Again, when you run the lines, you'll see output telling you that pets_group is now a DataFrameGroupBy object. Useful to know, but not much more than that on its own.
What we need to do next is to execute some sort of function on this new DataFrameGroupBy
object. The function we'll use is an agg function. (We did agg functions in a previous lesson, so
won't explain it again.)
Try these two lines in a new Jupyter Notebook cell:
agg_mean = pets_group.agg({'OwnerAge': ['mean']})
agg_mean
Run the code and you'll see this output:
The pets_group variable is now a DataFrameGroupBy object, remember. It has the Pet and
OwnerGender information in it. We're just using an agg function to get the mean age of the pet
owners. We can quickly see that rabbits are preferred by youngsters while cats and dogs are
preferred by older people.
Notice that the Pet column lists our four pets: Cat, Dog, Fish, Rabbit. This was our first group.
For each pet, we have a F or an M, which was the second group. On the first row, you can see
that if the cat owner is a female then the mean age is 39.6. If the cat owner is a male, the mean
age is 49.
You can add other agg functions, just like we did before. Try this:
agg_funcs = pets_group.agg({'OwnerAge': ['mean', 'min', 'max']})
agg_funcs
Run the code to see the following output:
This time, we're using three aggregate functions, mean, min, and max. We have two new
columns under the OwnerAge heading, min and max. These get you the minimum value in the
column and the maximum value.
You can have the agg function calculate more than one column. So, if we had a column for
PetAge, we could do this with the agg function:
aggs2 = pets_group.agg( {
'OwnerAge': ['mean'],
'PetAge': ['mean', 'min', 'max', 'sum']
})
aggs2
The result is this:
Notice how this is laid out and indented in the image above. Because the code is inside brackets, Python lets you indent it however you like. It's a lot easier to read, now that the code is split over multiple lines and is indented.
Notice, too, that there is a comma separating the OwnerAge line and the PetAge line. If you want
to work with more than two columns, just remember to add the comma in the right place.
Now, how can we answer the following question?
Question: What is the mean age of male pet owners and what is the mean age of female pet
owners?
To answer this question, first create a new group by OwnerGender:
owner = df_pets.groupby('OwnerGender')
Once you have that group, you can then specify a column from the group. One of the columns is
called OwnerAge. You can get the mean of this column like this:
owner.OwnerAge.mean()
The code and the output would be this:
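Putting the two lines in one cell, with a comment above them, it would look something like this:
# mean age of owners, grouped by gender
owner = df_pets.groupby('OwnerGender')
owner.OwnerAge.mean()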
Notice the line that's in green. It's green because it's a comment. You can turn any line of code
into a comment by starting the line with the # symbol. But we can quickly see that the mean age
for female pet owners is 30 while the mean age for male pet owners is just under 40.
Here's a follow-up question:
Question: What is the mean age of pet owners by pet type?
You can answer the question in a similar way. Try this code in a new cell:
owner2 = df_pets.groupby('Pet')
owner2.OwnerAge.mean()
Run the code to see the following:
Pandas Apply Function
You can use the Pandas apply function to apply a Python function to your data series or Dataframe. To clear that up, we're going to use our pets CSV file again. If you haven't yet downloaded this dataset, you can do so here:
Download the Pets Data CSV File (Right-click, Save As)
The OwnerGender column in the dataset has an F or an M in it (again, apologies for the binary nature of this example):
Now, what if we wanted to convert the F to Female and the M to Male? How would we go about
this? Well, we can use apply. Let's see how.
Load the CSV pets data again with these lines, changing PATH_TO_FILE to point to a location
on your own computer:
import pandas as pd
df_pets = pd.read_csv('PATH_TO_FILE\\pets-data.csv')
df_pets.head()
In a new Jupyter Notebook cell, add this Python function:
def OwnerGenderFormat(gen):
    if gen == "F":
        return "Female"
    elif gen == "M":
        return "Male"
    else:
        return "None"
It should look like this, in your Notebook: (Make sure the indents are the same as in the image
above. The if should be one press of the tab key on your keyboard and the return should be two
tabs.)
Make sure to run this code. It's just a simple if statement: it returns Female, Male, or None, depending on the value of gen.
Test out this function in a new Notebook cell. Add this to it:
OwnerGenderFormat("F")
Run the code and you should see this:
OK, the function works well when tested. Now to apply it to the OwnerGender column.
In a new Notebook cell, enter these lines:
df_pets['OwnerGender'] = df_pets['OwnerGender'].apply(OwnerGenderFormat)
df_pets.head()
Run the code and the new values for the OwnerGender should display:
To understand the code, first look at what we have to the left of the equal sign (=). We have this:
df_pets['OwnerGender']
This is just referencing the OwnerGender column (a series) of our df_pets Dataframe.
To the right of the equal sign, we have this:
df_pets['OwnerGender'].apply(OwnerGenderFormat)
Again, we have the reference to the OwnerGender column. After a dot, we then have this:
apply(OwnerGenderFormat)
In between the round brackets of the apply function, we have the name of our Python function.
Pandas will apply this function to every cell in the OwnerGender column. Behind the scenes, it
passes each cell value over to the function, storing the value in the gen variable. When it's done,
it writes the results back to the OwnerGender column on the left of the equal sign.
Now let's do something a little more sophisticated. We'll switch back to our Student Data. If you
haven't yet downloaded this dataset, you can do so here:
Student Scores Data Set (right click, Save As)
Load up the dataset with these lines:
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head(10)
Obviously, replace PATH_TO_FILE to point to a location on your own computer.
Looking at the data, we can see we have exam scores, but no grades. What we'd like to do is to add a new column with exam grades as the values. Let's start with the Math grades.
The first thing to do is to add a new column. There are lots of ways to add columns in Pandas.
For us, though, it would make sense just to copy the Math column. We can then convert scores
into grades with this new column.
In a new cell in your Jupyter Notebook, add and run this code:
df_students['MathGrades'] = df_students['Math']
df_students.head()
We're creating a new column to the left of the equal sign (=). First, we have our Dataframe name,
df_students. In between the square brackets of the Dataframe, we have given the new column the
name MathGrades. This goes between quote marks.
To the right of the equal sign, we have this:
= df_students['Math']
This just copies the Math column to whatever is on the left of the equal sign, which is a new
column for us.
When you run the code, you should see this:
And there's our new column on the end.
Let's convert the scores to grades. Add this function in a new cell in your Notebook:
def getGrade(val):
    if val >= 90 and val <= 100:
        return "A+"
    elif val >= 80 and val <= 89:
        return "A"
    elif val >= 70 and val <= 79:
        return "B+"
    elif val >= 60 and val <= 69:
        return "B"
    elif val >= 50 and val <= 59:
        return "C+"
    elif val >= 40 and val <= 49:
        return "C"
    elif val >= 30 and val <= 39:
        return "D"
    else:
        return "F"
It should look like this in your Notebook:
Make sure your cursor is flashing inside the cell and Run the code. Running the code when
you've finished typing your function will ensure that Pandas knows about it. Plus, if you've made
any mistakes in your Python code, you'll see an error message you can use to correct your code.
In a new cell, test the code out with this line:
'Grade is ' + getGrade(34)
When you run the code, you should see a grade of D print out.
OK, our new function works. Now let's apply it to our new column. Add this code in a new cell:
df_students['MathGrades'] = df_students['Math'].apply(getGrade)
df_students.head()
Run the code and you should see this:
We now have a column of Math grades.
The code is pretty much the same as before. Last time, we had this:
df_students['MathGrades'] = df_students['Math']
This copied the values in the Math column over to a new column called MathGrades. This time,
we have this:
df_students['MathGrades'] = df_students['Math'].apply(getGrade)
The only difference is the apply function on the end:
apply(getGrade)
Pandas gets a grade for each value in the Math column. It then sets that grade in the MathGrades
column (to the left of the equal sign.) We can do it this way because MathGrades is an exact
copy of the Math column.
See if you can do the other two columns by yourself. Add two new columns. Convert the Physics
and Computers scores to grades.
Now let's use a lambda to add up values in columns, returning the result in a new column.
Suppose we want to add up the scores in our student data columns (Math + Physics +
Computers). The total would go in a new column called ExamTotal. We can construct the
lambda part like this:
apply(lambda colName: colName['Math'] + colName['Physics'] + colName['Computers'] )
We also need to add the name of our Dataframe:
df_students.apply(lambda colName: colName['Math'] + colName['Physics'] +
colName['Computers'] )
This adds up the Math, Physics, and Computers values for each student.
(Notice that we're not using a column name after df_students. That's because we want to apply
our lambda to the entire Dataframe, and not just a single column.)
In Pandas, you can apply your lambda code to either rows or columns. This is done with the axis attribute, which can be set to either 1 or 0. The default is 0, which hands each column to your lambda; setting axis to 1 hands over each row instead, which is what we need here.
apply(lambda x: CODE_GOES_HERE, axis = 1)
apply(lambda x: CODE_GOES_HERE, axis = 0)
Adding the axis attribute, the code would be this:
df_students.apply(lambda colName: colName['Math'] + colName['Physics'] +
colName['Computers'], axis = 1 )
When you run the code, the result is as follows:
Notice how we've spread the code over a few lines. You can do this in your own code if the lines are looking too long or are becoming hard to read.
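To store these totals in the ExamTotal column we mentioned earlier, assign the result back to the Dataframe. A minimal sketch (using row as the lambda's variable name, since each row is what gets passed in):
df_students['ExamTotal'] = df_students.apply(
    lambda row: row['Math'] + row['Physics'] + row['Computers'],
    axis=1)
df_students.head()
Run that and you should see the new ExamTotal column on the end.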
But let's move on - lambdas can improve your Pandas skills, but they can be a bit tricky to get
the hang of. In the next lesson, we'll cover Pandas and plots. You get to create some little charts!
Pandas Plots
In this lesson, we'll start charting our data and create a few Pandas plots. We'll assume you're using a Jupyter Notebook to do these tutorials. This will make life easier, as it has plotting built in. However, you do need to install something called Matplotlib. So, fire up your Command Prompt again. (If you're not sure what this means, see the first lesson here: install.) Navigate to your Python directory. Enter this command:
pip3 install matplotlib
It should look like this in your command prompt:
Let's load our student data again. If you haven't already downloaded this dataset, you can grab a
copy here:
Student Scores Data Set (right click, Save As)
Load the dataset and display the first five rows with these lines (change PATH_TO_FILE to
point to a location on your own computer):
import pandas as pd
df_students = pd.read_csv('PATH_TO_FILE\\StudentScores.csv')
df_students.head()
Let's just extract the Math column (columns are called Series, remember) and see what happens
when we plot it. Add this line in a new cell in your Notebook:
df_students['Math'].plot()
Run the line to see this plot appear:
Using the inbuilt function plot gets you a line chart by default. If you want another type of chart,
you can use one of the following:
plot.area()
plot.bar()
plot.barh()
plot.box()
plot.hexbin()
plot.hist()
plot.kde()
plot.density()
plot.line()
plot.pie()
plot.scatter()
So a line chart would be this:
df_students['Math'].plot.line()
Our chart is a bit messy, though. Let's add the grade columns to our Dataframe, like we did in a previous lesson. First, add this Python function to a new Notebook cell and run it, so that Pandas knows about it:
def getGrade(val):
    if val >= 90 and val <= 100:
        return "A"
    elif val >= 70 and val <= 89:
        return "B"
    elif val >= 50 and val <= 69:
        return "C"
    elif val >= 30 and val <= 49:
        return "D"
    elif val >= 10 and val <= 29:
        return "E"
    else:
        return "F"
It should look like this in your Notebook:
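Next, count how many students are in each grade. The Math version would be something like this (assuming you create the MathGrades column with apply, as in the previous lesson - the Physics and Computers versions are below):
df_students['MathGrades'] = df_students['Math'].apply(getGrade)
seriesMath = df_students['MathGrades'].value_counts().sort_index(ascending=True)
seriesMath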
The code uses value_counts to get a count of how many students are in each grade. (We're
adding a sort on the end.)
We can now create a bar chart. In a new cell, add and run this line:
seriesMath.plot.bar()
You'll see this:
Here's the code to create a bar chart from the Physics grades:
df_students['PhysGrades'] = df_students['Physics'].apply(getGrade)
seriesPhys = df_students['PhysGrades'].value_counts().sort_index(ascending=True)
seriesPhys.plot.bar()
And here's the code for the Computers grades:
df_students['CompGrades'] = df_students['Computers'].apply(getGrade)
seriesComp = df_students['CompGrades'].value_counts().sort_index(ascending=True)
seriesComp.plot.bar()
Make sure to create these two series, seriesPhys and seriesComp, by running the code - we'll be
needing them soon.
The charts look a bit bland, though. You can spruce them up by including a few attributes between the round brackets after the plot type - attributes like figsize, legend, xlabel, ylabel, color, and fontsize. You'll see them in action shortly.
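First, though, let's compare all three subjects in one table. Assuming the seriesMath, seriesPhys, and seriesComp series from above, you can build a new Dataframe from them. A quick sketch:
newDF = pd.DataFrame({
    'Math': seriesMath,
    'Physics': seriesPhys,
    'Computers': seriesComp
})
newDF
Run that and you'll see a table with the grades down the side and a count for each subject.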
The numbers are how many students are in each grade. So, 8 students got an A in Math, 18 students got an A in Physics, while 9 students got an A grade in Computers.
Let's see all this in a bar chart.
Add the following in a new Notebook cell:
newDF.plot.bar( y=['Math', 'Physics', 'Computers'] )
The result is this, when you run the line:
We have a nice bar chart with all three subjects compared for each grade.
Notice what we have between the round brackets of bar:
y=['Math', 'Physics', 'Computers']
We're specifying which columns from our new Dataframe that we want to use in the y axis.
You can also add a column name that you want to use in the x axis, if you need to:
bar(x='Grades', y='GradesCount')
Often, you don't need to specify the y columns, as Pandas usually guesses which columns to use.
bar(x='Grades')
We can add some formatting to our chart, though. Try this:
newDF.plot.bar(y=['Math', 'Physics', 'Computers'],
figsize=(8,8),
legend=True,
xlabel='Exam Grades',
ylabel='Num Achieving Grade',
color={'Math': '#003f5c',
'Physics': '#bc5090',
'Computers': '#ffa600'},
fontsize=20)
Run the code to see the updated chart:
But that's enough of charts, and it's the end of this Pandas short course. Hope you enjoyed it. If you want to take Pandas further, there's a website called Kaggle that's a great place to go to get datasets. Not only that, others will upload the code they used to analyse each dataset, so you can learn from them.