Data Visualization With Python for Beginners
Visualize Your Data Using Pandas,
Matplotlib and Seaborn
AI PUBLISHING
© Copyright 2020 by AI Publishing
All rights reserved.
First Printing, 2020
Edited by AI Publishing
Ebook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7330426-8-0
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any part
of the content within this book without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for
educational and entertainment purposes only. No warranties of any
kind are expressed or implied. Readers acknowledge that the author
is not engaging in the rendering of legal, financial, medical, or
professional advice. Please consult a licensed professional before
attempting any techniques outlined in this book.
How to contact us
About the Publisher
AI Publishing Is Searching for Authors Like You
Preface
Chapter 1: Introduction
1.1. What is Data Visualization
1.2. Environment Setup
1.3. Python Crash Course
1.4. Data Visualization Libraries
Exercise 1.1
Exercise 1.2
Hands-on Project
Exercise Solutions
Exercise 1.1
Exercise 1.2
Exercise 2.1
Exercise 2.2
Exercise 3.1
Exercise 3.2
Exercise 4.1
Exercise 4.2
Exercise 5.1
Exercise 5.2
Exercise 6.1
Exercise 6.2
Exercise 7.1
Exercise 7.2
Exercise 8.1
Exercise 9.1
Exercise 9.2
Exercise 10.1
Exercise 10.2
Preface
§ Book Approach
The book follows a very simple approach. It is divided into 10
chapters. Chapter 1 contains an introduction while the 2nd
and 3rd chapters cover the Matplotlib library. Python’s
Seaborn library is covered in 4th and 5th chapters while the
6th and 7th chapters explore the Pandas library. The 8th
chapter covers 3-D plotting, while the 9th chapter explains
how to draw maps via the Basemap library. Finally, the 10th
chapter covers interactive data visualization via the Plotly
library.
https://www.aispublishing.net/book-data-visualization
Get in Touch with Us
In the first chapter of this book, you will see how to set up
the Python environment needed to run various data
visualization libraries. The chapter also contains a crash
Python course for absolute beginners in Python. Finally, the
different data visualization libraries that we are going to
study in this book have been discussed. The chapter ends
with a simple exercise.
$ cd /tmp
$ curl -O https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
$ sha256sum Anaconda3-5.2.0-Linux-x86_64.sh
09f53738b0cd3bb96f5b1bac488e5528df9906be2480fe61df40e0e0d19e3d48
Anaconda3-5.2.0-Linux-x86_64.sh
Output
Output
[/home/tola/anaconda3] >>>
The installation will proceed once you press Enter.
Once again, you have to be patient as the installation
process takes some time to complete.
6. You will receive the following result when the
installation is complete. If you wish to use conda
command, type Yes.
Output
…
Installation finished.
Do you wish the installer to prepend Anaconda3
install location to path in your /home/tola/.bashrc?
[yes|no]
[no]>>>
$ source ~/.bashrc
$ conda list
Output:
Script 2:
# A string Variable
first_name = "Joseph"
print(type(first_name))
# An Integer Variable
age = 20
print(type(age))
# Tuples
days = ("Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday")
print(type(days))
# Dictionaries
days2 = {1:"Sunday", 2:"Monday", 3:"Tuesday", 4:"Wednesday",
         5:"Thursday", 6:"Friday", 7:"Saturday"}
print(type(days2))
Output:
<class ‘str’>
<class ‘int’>
<class ‘float’>
<class ‘bool’>
<class ‘list’>
<class ‘tuple’>
<class ‘dict’>
Script 3:
X = 20
Y = 10
print(X + Y)
print(X - Y)
print(X * Y)
print(X / Y)
print(X ** Y)
Output:
30
10
200
2.0
10240000000000
Logical Operators
Script 4:
X = True
Y = False
print(X and Y)
print(X or Y)
print(not(X and Y))
Output:
False
True
True
Comparison Operators
Script 5:
X = 20
Y = 35
print(X == Y)
print(X != Y)
print(X > Y)
print(X < Y)
print(X >= Y)
print(X <= Y)
Output:
False
True
False
True
False
True
Assignment Operators
Script 6:
X = 20; Y = 10
R = X + Y
print(R)
X = 20;
Y = 10
X += Y
print(X)
X = 20;
Y = 10
X -= Y
print(X)
X = 20;
Y = 10
X *= Y
print(X)
X = 20;
Y = 10
X /= Y
print(X)
X = 20;
Y = 10
X %= Y
print(X)
X = 20;
Y = 10
X **= Y
print(X)
Output:
30
30
10
200
2.0
0
10240000000000
Membership Operators
Script 7:
Output:
True
Script 8:
Output:
True
IF Statement
Script 8:
# The if statement
if 10 > 5:
    print("Ten is greater than 5")
Output:
Script 9:
# if-else statement
if 5 > 10:
    print("5 is greater than 10")
else:
    print("10 is greater than 5")
Output:
10 is greater than 5
IF-Elif Statement
Script 10:
if 5 > 10:
    print("5 is greater than 10")
elif 8 < 4:
    print("8 is smaller than 4")
else:
    print("5 is not greater than 10 and 8 is not smaller than 4")
Output:
For Loop
Script 11:
items = range(5)
for item in items:
    print(item)
Output:
0
1
2
3
4
While Loop
Script 12:
c = 0
while c < 10:
    print(c)
    c = c + 1
Output:
0
1
2
3
4
5
6
7
8
9
1.3.9. Functions
Functions, in any programming language, package a piece of
code that needs to be executed many times at different
places in a program. Instead of writing the same code again
and again, you define a function that contains the code
once and then call the function wherever you need it.
Script 13:
def myfunc():
    print("This is a simple function")
# Call the function so that its body actually runs
myfunc()
Output:
You can also pass values to a function. The values are passed
inside the parentheses of the function call. However, you
must also specify the parameter name in the function
definition. In the following script, we define a function named
myfuncparam(). The function accepts one parameter, i.e.,
num. The value passed in the parentheses of the function call
is stored in this num variable and printed by the print()
function inside myfuncparam().
Script 14:
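def myfuncparam(num):
    print("This is a function with parameter value: " + num)
# The argument is an example string; a number would need str(num) above
myfuncparam("Parameter 1")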
Output:
Script 15:
def myreturnfunc():
    return "This function returns a value"
val = myreturnfunc()
print(val)
Output:
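The parameter and return-value ideas above can be combined in one small sketch. The function name describe_number is hypothetical and not part of the book's scripts; str() is used so that a numeric argument can be joined to the message text.

```python
def describe_number(num):
    # str() converts the argument so it can be concatenated with the message
    return "This is a function with parameter value: " + str(num)

# The returned string is stored in a variable and then printed
message = describe_number(10)
print(message)
```

Returning the string instead of printing it inside the function lets the caller decide what to do with the result.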
Script 16:
class Fruit:
    name = "apple"
    price = 10
    def eat_fruit(self):
        print("Fruit has been eaten")
f = Fruit()
f.eat_fruit()
print(f.name)
print(f.price)
Output:
Fruit has been eaten
apple
10
Script 17:
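class Fruit:
    name = "apple"
    price = 10
    # The constructor stores the values passed when the object is created
    def __init__(self, fruit_name, fruit_price):
        self.name = fruit_name
        self.price = fruit_price
    def eat_fruit(self):
        print("Fruit has been eaten")
f = Fruit("Orange", 15)
f.eat_fruit()
print(f.name)
print(f.price)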
Output:
1.4.1. Matplotlib
1.4.2. Seaborn
1.4.3. Basemap
1.4.4. Pandas
1.4.5. Plotly
Exercise 1.1
Question 1
A- For Loop
B- While Loop
C- Both A & B
D- None of the above
Question 2
A- Single Value
B- Double Value
C- More than two values
D- None
Question 3
A- In
B- Out
C- Not In
D- Both A and C
Answer: D
Exercise 1.2
2.1. Introduction
In the first chapter of the book, you saw briefly what data
visualization is, why it is important, and what its various
applications are. You also installed different software that we
will be using in order to execute data visualization scripts in
this book.
In this chapter, you will see how to draw some of the most
commonly used plots with the Matplotlib library.
Script 1:
Output:
Script 2:
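import matplotlib.pyplot as plt
import numpy as np
import math
# 20 evenly spaced values between 0 and 20 and their square roots
x_vals = np.linspace(0, 20, 20)
y_vals = [math.sqrt(i) for i in x_vals]
fig = plt.figure()
ax = plt.axes()
ax.plot(x_vals, y_vals)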
Output:
Script 3:
plt.rcParams["figure.figsize"] = [8,6]
In the output, it is evident that the default plot size has been
increased.
Output:
2.3. Titles, Labels, and Legends
You can improve the aesthetics and readability of your
graphs by adding titles, labels, and legends to your graph.
Let’s first see how to add titles and labels to a plot.
Script 4:
Here in the output, you can see the labels and title that you
specified in Script 4.
Output:
In addition to changing titles and labels, you can also specify
the color for the line plot. To do so, you simply have to pass
shorthand notation for the color name to the plot() function,
for example, “r” for red, “b” for blue, and so on. Here is an
example:
Script 5:
Output:
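As a minimal sketch of the color shorthand described above, the snippet below draws a square root line in red and adds a title and axis labels. The label strings and the Agg backend line are assumptions made so the example runs without a display; they are not taken from the book's Script 5.

```python
import math

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x_vals = np.linspace(0, 20, 20)
y_vals = [math.sqrt(i) for i in x_vals]

fig, ax = plt.subplots()
# "r" is the shorthand notation for red; "b" would give blue
line, = ax.plot(x_vals, y_vals, "r")
ax.set_xlabel("X Values")
ax.set_ylabel("Y Values")
ax.set_title("Square Roots")
```

In a notebook, you can drop the matplotlib.use() line and the figure will be displayed inline.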
To add a legend, you need to make two changes. First, you
have to pass a string value for the label parameter of the
plot() function. Next, you have to pass a value for the loc
parameter of the legend() function of the pyplot module. The
loc parameter specifies the location of your legend.
The following script places a legend at the upper center
of the plot.
Script 6:
You can also plot multiple line plots inside one graph. All you
have to do is call the plot() method twice with different
values for x and y axes. The following script plots a line plot
for square root in red and for a cube function in blue.
Script 7:
Output:
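The legend and multiple-line ideas above can be sketched together as follows. The label texts are illustrative assumptions, and the Agg backend is selected only so the code runs without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x_vals = np.linspace(0, 20, 20)
sqrt_vals = [math.sqrt(i) for i in x_vals]
cube_vals = [i ** 3 for i in x_vals]

fig, ax = plt.subplots()
# Each plot() call adds one line; label is the text shown in the legend
ax.plot(x_vals, sqrt_vals, "r", label="Square Root")
ax.plot(x_vals, cube_vals, "b", label="Cube")
legend = ax.legend(loc="upper center")

labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```

Because the cube values are far larger than the square roots, the red line appears almost flat on a shared y-axis; this is expected with these two functions.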
Script 8:
import pandas as pd
data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\iris_data.csv")
If you do not see any error, the file has been read
successfully. To see the first five rows of the Pandas
dataframe containing the data, you can use the head()
method as shown below:
Script 9:
data.head()
Output:
You can see that the iris_data.csv file has five columns. We
can use values from any two of these columns to plot a line
plot. To do so, for the x and y axes, we need to pass the
dataframe columns to the plot() function of the pyplot
module. To access a column of a Pandas dataframe,
you need to specify the dataframe name followed by a pair
of square brackets. Inside the brackets, the column name is
specified. The following script plots a line plot, where the
x-axis contains values from the sepal_length column, whereas
the y-axis contains values from the petal_length column of
the dataframe.
Script 10:
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.title('Sepal vs Petal Length')
plt.plot(data["sepal_length"], data["petal_length"], 'b')
2.5. Plotting Using TSV Data Source
Like CSV, you can also read a TSV file via the read_csv()
method. You have to pass '\t' as the value for the sep
parameter. Script 11 reads the iris_data.tsv file and stores
it in a Pandas dataframe. Next, the first five rows of the
dataframe are printed via the head() method.
Script 11:
import pandas as pd
data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\iris_data.tsv", sep='\t')
data.head()
Output:
The remaining process to plot the line plot remains the same
as it was for the CSV file. The following script plots a line plot
where the x-axis contains sepal length, and the y-axis
displays petal length.
Script 12:
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.title('Sepal vs Petal Length')
plt.plot(data["SepalLength"], data["PetalLength"], "b")
Output:
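If you do not have the dataset files at hand, the same steps can be tried on a tiny in-memory TSV. This is only a sketch: the four rows below are made-up illustrative values, not taken from iris_data.tsv, and io.StringIO stands in for the file path.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Made-up, tab-separated sample rows standing in for the real file
tsv_text = (
    "SepalLength\tPetalLength\n"
    "4.9\t1.4\n"
    "5.1\t1.5\n"
    "6.3\t4.7\n"
    "7.1\t5.9\n"
)

# sep="\t" tells read_csv() that columns are separated by tabs
data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

fig, ax = plt.subplots()
ax.plot(data["SepalLength"], data["PetalLength"], "b")
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Petal Length")
```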
2.6. Scatter Plot
Script 13:
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.title('Sepal vs Petal Length')
plt.scatter(data["SepalLength"], data["PetalLength"], c = "b")
The output shows a scatter plot with blue points. The plot
clearly shows that with an increase in sepal length, the petal
length of an iris flower also increases.
Output:
Script 14:
Output:
Like line plots, you can plot multiple scatter plots inside one
graph. To do so, you have to call the scatter() method twice
with the same value for the x-axis while different values for
the y-axis. In the following script, you will see two scatter
plots. The first scatter plot plots the relation between sepal
vs. petal length using blue markers, and the second scatter
plot plots the relation between sepal length and sepal width
using red markers.
Script 15:
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.title('Sepal vs Petal Length')
plt.scatter(data["SepalLength"], data["PetalLength"], c = "b",
            marker = "x", label="Petal Length")
plt.scatter(data["SepalLength"], data["SepalWidth"], c = "r",
            marker = "o", label="Sepal Width")
plt.legend(loc='upper center')
Output:
Script 16:
import pandas as pd
data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
data.head()
Output:
To plot a bar plot, you need to call the bar() method. The
categorical values are passed on the x-axis, and
corresponding aggregated numerical values are passed on
the y-axis. The following script plots a bar plot between
genders and ages of the Titanic ship.
Script 17:
Output:
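The description above mentions passing aggregated numerical values on the y-axis. One way to do this, sketched here under the assumption that the average age per gender is the aggregate of interest, is to group the dataframe first. The four rows are made-up sample values, not the Titanic file.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows standing in for the Titanic dataset
data = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Aggregate first: one mean age per category
mean_age = data.groupby("Sex")["Age"].mean()

fig, ax = plt.subplots()
bars = ax.bar(mean_age.index, mean_age.values)
ax.set_xlabel("Gender")
ax.set_ylabel("Age")
ax.set_title("Gender vs Age")
```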
You can also create horizontal bar plots. To do so, you need
to call the barh() method, as shown below:
Script 18:
plt.xlabel('Ages')
plt.ylabel('Class')
plt.title('Class vs Age')
plt.barh(data["Pclass"], data["Age"])
Output:
The output shows the relationship between the passenger
class and the ages of the passengers in the unfortunate
Titanic ship.
2.8. Histograms
Script 19:
plt.title('Age Histogram')
plt.hist(data["Age"])
Output:
Script 20:
plt.title('Fare Histogram')
plt.hist(data["Fare"])
Output:
Script 21:
plt.title('Age Histogram')
plt.hist(data["Age"], bins = 5)
Output:
Script 22:
Output:
In the previous section, you saw how to plot a pie plot using
raw values. Let’s see how to plot a pie plot using a Pandas
dataframe as the source.
Script 23:
import pandas as pd
data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
pclass = data["Pclass"].value_counts()
print(pclass)
Here is an output.
Output:
Script 24:
print(pclass.index.values.tolist())
print(pclass.values.tolist())
Output:
Script 25:
labels = pclass.index.values.tolist()
values = pclass.values.tolist()
explode = (0.05, 0.05, 0.05)
plt.pie(values, explode=explode, labels=labels,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.show()
Output:
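The steps from value_counts() to the pie plot can be tried end to end on a small Series. This is a sketch with made-up class values rather than the Titanic file; the Agg backend is an assumption so that no window is required.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Made-up passenger classes standing in for data["Pclass"]
pclass = pd.Series([3, 1, 3, 2, 3, 1]).value_counts()

# The index holds the unique classes, the values hold their counts
labels = pclass.index.values.tolist()
values = pclass.values.tolist()

fig, ax = plt.subplots()
# autopct formats each wedge's share as a percentage with one decimal
wedges, texts, autotexts = ax.pie(values, labels=labels,
                                  autopct="%1.1f%%", startangle=140)
```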
Script 26:
London = [25,26,32,19,28,39,24]
Tokyo = [20,29,23,35,32,26,18]
Paris= [18,21,28,35,29,25,22]
plt.legend()
plt.show()
Output:
Question 1:
A- color
B- c
C- r
D- None of the above
Question 2:
A- title
B- label
C- axis
D- All of the above
Question 3:
A- autopct = '%1.1f%%'
B- percentage = '%1.1f%%'
C- perc = '%1.1f%%'
D- None of the Above
Exercise 2.2
References
1. https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html
2. http://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html
3. https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html
4. https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html
5. https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.pie.html
6. https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html
Advanced Plotting with Matplotlib
3.1. Introduction
Script 1:
plt.rcParams["figure.figsize"] = [12,8]
plt.subplot(2,2,1)
plt.plot(x_vals, y_vals, 'bo-')
plt.subplot(2,2,2)
plt.plot(x_vals, y_vals, 'rx-')
plt.subplot(2,2,3)
plt.plot(x_vals, y_vals, 'g*-')
plt.subplot(2,2,4)
plt.plot(x_vals, y_vals, 'g*-')
Output:
Script 2:
plt.rcParams["figure.figsize"] = [12,8]
x_vals = np.linspace(0, 20, 20)
y_vals = [math.sqrt(i) for i in x_vals]
plt.subplot(2,3,1)
plt.plot(x_vals, y_vals, 'bo-')
plt.subplot(2,3,2)
plt.plot(x_vals, y_vals, 'rx-')
plt.subplot(2,3,3)
plt.plot(x_vals, y_vals, 'g*-')
plt.subplot(2,3,4)
plt.plot(x_vals, y_vals, 'g*-')
plt.subplot(2,3,5)
plt.plot(x_vals, y_vals, 'bo-')
plt.subplot(2,3,6)
plt.plot(x_vals, y_vals, 'rx-')
Output:
The idea of multiple plots is to show different information in
each plot. The following script plots six line plots in two rows
and three columns. The three line plots in the first row show
the square root of 20 numbers between 0 and 20. The three
line plots in the second row show the cube of the same 20
numbers.
Script 3:
plt.rcParams["figure.figsize"] = [12,8]
# x_vals and y_vals are reused from Script 2; y2_vals holds the cubes
y2_vals = [i ** 3 for i in x_vals]
plt.subplot(2,3,1)
plt.plot(x_vals, y_vals, 'bo-')
plt.subplot(2,3,2)
plt.plot(x_vals, y_vals, 'rx-')
plt.subplot(2,3,3)
plt.plot(x_vals, y_vals, 'g*-')
plt.subplot(2,3,4)
plt.plot(x_vals, y2_vals, 'g*-')
plt.subplot(2,3,5)
plt.plot(x_vals, y2_vals, 'bo-')
plt.subplot(2,3,6)
plt.plot(x_vals, y2_vals, 'rx-')
Output:
Script 4:
plt.rcParams["figure.figsize"] = [12,8]
figure = plt.figure()
Output:
To add a plot to the above figure, all you have to do is call
the plot() method using the axes object. This plot() is the
same as the plot method of the pyplot module. Look at the
following script.
Script 5:
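plt.rcParams["figure.figsize"] = [12,8]
figure = plt.figure()
# Axes rectangle is [left, bottom, width, height]; these values are assumed
axes = figure.add_axes([0.1, 0.1, 0.8, 0.8])
axes.plot(x_vals, y_vals)
axes.set_xlabel('X Axis')
axes.set_ylabel('Y Axis')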
Output:
Script 6:
plt.rcParams["figure.figsize"] = [12,8]
In the output below, you can see a line plot inside another
plot.
Output:
3.4. Using Subplots Function to Create Multiple Plots
Script 7:
Output:
Script 8:
plt.rcParams["figure.figsize"] = [12,8]
Output:
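As a rough sketch of the subplots() approach named in this section's title, the snippet below builds the same two-row, three-column layout used earlier in the chapter: square roots in the first row and cubes in the second. NumPy array operations replace the list comprehensions here, which is a stylistic assumption and not the book's code.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x_vals = np.linspace(0, 20, 20)
y_vals = np.sqrt(x_vals)
y2_vals = x_vals ** 3

# subplots() returns the figure and a 2-D array of axes objects
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))

for ax in axes[0]:       # first row: square roots
    ax.plot(x_vals, y_vals, "bo-")
for ax in axes[1]:       # second row: cubes
    ax.plot(x_vals, y2_vals, "rx-")
```

Each cell is addressed by row and column, for example axes[1][2] for the last plot, instead of the running cell number used by subplot().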
3.5. Saving a Matplotlib Plot
Script 9:
plt.rcParams["figure.figsize"] = [12,8]
figure.savefig(r'E:/Subplots.jpg')
Output:
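A self-contained sketch of saving a figure is shown below. It writes to a temporary folder instead of the E: drive and uses the PNG format; both choices are assumptions made so the example runs on any machine.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

figure, axes = plt.subplots()
axes.plot([0, 1, 2], [0, 1, 4])

# Build a path inside a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "subplots.png")

# savefig() infers the image format from the file extension
figure.savefig(out_path)
saved = os.path.exists(out_path)
print(saved)
```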
Further Readings – Matplotlib Subplots
To study more about the Matplotlib subplots, please check
Matplotlib’s official documentation for Subplots. Get used
to searching and reading this documentation. It is a great
resource of knowledge.
Exercise 3.1
Question 1:
Which plot function will you use to plot a graph in the 5th cell
of a multiple plot figure with four rows and two columns?
A- plt.subplot(5,4,2)
B- plt.subplot(2,4,5)
C- plt.subplot(4,2,5)
D- None of the Above
Question 2:
How will you create a subplot with five rows and three
columns using subplots() function?
A- plt.subplots(nrows=5, ncols=3)
B- plt.subplots(5,3)
C- plt.subplots(rows=5, cols=3)
D- All of the Above
Question 3
A- figure.saveimage()
B- figure.savegraph()
C- figure.saveplot()
D- figure.savefig()
Exercise 3.2
Draw multiple plots with three rows and one column. Show
the sine of any 30 integers in the first plot, the cosine of the
same 30 integers in the second plot, and the tangent of the
same 30 integers in the 3rd plot.
Introduction to the Python Seaborn Library
4.1. Introduction
In the previous two chapters, you saw how to plot different
types of graphs using Python’s Matplotlib library. In this
chapter, you will see how to perform data visualization with
Seaborn, which is yet another extremely handy Python
library for data visualization. The Seaborn library is built on
top of the Matplotlib library. Therefore, you will also need to
import the Matplotlib library before you plot any Seaborn graph.
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [10,8]
tips_data = sns.load_dataset('tips')
tips_data.head()
The above script imports the Matplotlib and Seaborn
libraries. Next, the default plot size is increased to 10 x 8.
After that, the load_dataset() method of the Seaborn
module is used to load the tips dataset. Finally, the first five
records of the tips dataset have been displayed on the
console. Here is the output.
Output:
Script 1:
plt.rcParams["figure.figsize"] = [10,8]
sns.distplot(tips_data['total_bill'])
Output:
Similarly, the following script plots a dist plot for the tip
column of the tips dataset.
Script 2:
sns.distplot(tips_data['tip'])
Output:
The line on top of the histogram shows the kernel density
estimate for the histogram. The line can be removed by
passing False as the value for the kde attribute of the
distplot() function, as shown in the following example.
Script 3:
Output:
Further Readings – Seaborn Distributional Plots [1]
To study more about Seaborn distributional plots, please
check Seaborn’s official documentation for distributional
plots. Try to plot distributional plots with a different set of
attributes, as mentioned in the official documentation.
Script 4:
Script 5:
Output:
Further Readings – Seaborn Joint Plots [2]
To study more about Seaborn joint plots, please check
Seaborn’s official documentation for joint plots. Try to plot
joint plots with a different set of attributes, as mentioned in
the official documentation.
The pair plot is used to plot a joint plot for all the
combinations of numeric and Boolean columns in a dataset.
To plot a pair plot, you need to call the pairplot() function
and pass it your dataset.
Script 6:
sns.pairplot(data=tips_data)
Output:
Script 7:
Output:
The rug plot is the simplest of all the Seaborn plots. The rug
plot basically plots small rectangles for all the data points in
a specific column. The rugplot() function is used to plot a
rug plot in Seaborn. The following script plots a rug plot for
the total_bill column of the tips dataset.
Script 8:
sns.rugplot(tips_data['total_bill'])
Output:
You can see a high concentration of rectangles between 10
and 20, which shows that the total bill amount for most of
the bills is between 10 and 20.
Script 9:
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
Script 10:
You can further categorize the bar plot using the hue
attribute. For example, the following bar plot plots the
average ages of passengers traveling in different classes and
further categorized based on their genders.
Script 11:
You can also plot multiple bar plots depending upon the
number of unique values in a categorical column. To do so,
you need to call the catplot() function and pass the
categorical column name as the value for the col attribute.
The following script plots two bar plots, one for the
passengers who survived the Titanic accident and one for
those who didn’t survive.
Script 12:
sns.catplot(x="pclass", y="age", hue="sex", col="survived",
            data=titanic_data, kind="bar");
Output:
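The hue idea described earlier in this section can be tried without downloading the Titanic dataset. The sketch below uses a made-up six-row dataframe with the same column names; the values are illustrative assumptions only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows using the Titanic column names
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "sex": ["male", "female", "male", "female", "male", "female"],
    "age": [40.0, 35.0, 30.0, 28.0, 25.0, 22.0],
})

# hue splits every class bar into one bar per gender
ax = sns.barplot(x="pclass", y="age", hue="sex", data=df)
hue_labels = sorted(t.get_text() for t in ax.get_legend().get_texts())
print(hue_labels)
```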
The count plot looks like a bar plot. However, unlike the
bar plot, which plots average values, the count plot simply
displays the counts of the occurrences of records for each
unique value in a categorical column. The countplot()
function is used to plot a count plot with Seaborn. The
following script plots a count plot for the pclass column of
the Titanic dataset.
Script 13:
sns.countplot(x='pclass', data=titanic_data)
Output:
Like a bar plot, you can also further categorize the count plot
by passing a value for the hue parameter. The following
script plots a count plot for the passengers traveling in
different classes of the Titanic ship categorized further by
their genders.
Script 14:
sns.countplot(x='pclass', hue='sex', data=titanic_data)
Output:
Script 15:
sns.boxplot(x=titanic_data["fare"])
Output:
Similarly, the following script plots the vertical box plot for
the fare column of the Titanic dataset.
Script 16:
sns.boxplot(y=titanic_data["fare"])
Output:
You can also plot multiple box plots for every unique value in
a categorical column. For instance, the following script plots
box plots for the age column of the passengers who traveled
alone as well as for passengers who were accompanied by at
least one other passenger.
Script 17:
Output:
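A box plot per category can be sketched with a small made-up dataframe. For simplicity, the alone column below holds the strings "yes" and "no" instead of the Boolean values in the Titanic dataset, and the ages are illustrative assumptions. The medians are also computed directly, since the line inside each box marks the median.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows; "alone" is a yes/no string here for simplicity
df = pd.DataFrame({
    "alone": ["yes", "yes", "yes", "no", "no", "no"],
    "age": [30.0, 45.0, 22.0, 5.0, 28.0, 35.0],
})

# One box is drawn for each unique value of the x column
ax = sns.boxplot(x="alone", y="age", data=df)

# The line inside each box corresponds to these medians
medians = df.groupby("alone")["age"].median()
print(medians)
```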
Let’s first discuss the passengers traveling alone, which are
represented by the orange box. The result shows that half of
the passengers were aged more than 30, while the remaining
half was aged less than 30. Among the lower half, the age of
the passengers in the first quartile was between 6 and 23,
while the passengers in the second quartile were aged
between 24 and 30. In the same way, you can get
information about the 3rd and 4th age quartile of the
passengers traveling alone. A comparison of the two box
plots reveals that the median age of the passengers traveling
alone is slightly greater than the median age of the
passengers accompanied by other passengers.
Like bar and count plots, the hue attribute can also be used
to categorize box plots.
For instance, the following script plots box plots for the
passengers traveling alone and along with other passengers,
further categorized based on their genders.
Script 18:
Output:
Script 19:
Output:
You can see that the output doesn’t contain any outliers for
the box plots.
Script 20:
Output:
The output shows that among the passengers traveling
alone, the passengers whose age is less than 15 are very few
as shown by the orange violin plot on the right. This behavior
is understandable as children are normally accompanied by
someone. This can be further verified by looking at the blue
violin plot on the left that corresponds to the passengers
accompanied by other passengers.
Script 21:
sns.violinplot(x='alone', y='age',
               hue='sex', data=titanic_data)
Output:
For a better comparison and to save space, you can also plot
split violin plots. In split violin plots, each half corresponds to
one value in a category column. For instance, the following
script plots two violin plots—one each for the passengers
traveling alone and for the passengers not traveling alone.
Each plot is further split into two parts based on the genders
of the passengers.
Script 22:
sns.violinplot(x='alone', y='age',
               hue='sex', data=titanic_data, split=True)
Output:
Script 23:
Output:
Script 24:
sns.stripplot(x='alone', y='age',
              hue='sex', data=titanic_data)
Output:
Finally, like violin plots, you can also split strip plots, as
demonstrated by the following example.
Script 25:
# Newer Seaborn versions use dodge in place of the older split parameter
sns.stripplot(x='alone', y='age',
              hue='sex', data=titanic_data, dodge=True)
Output:
Further Readings –Seaborn Strip Plot [9]
To study more about Seaborn strip plots, please check
Seaborn’s official documentation for strip plots. Try to plot
strip plots with a different set of attributes, as mentioned in
the official documentation.
Script 26:
sns.swarmplot(x='alone', y='age', data=titanic_data)
Output:
Script 27:
sns.swarmplot(x='alone', y='age',
              hue='sex', data=titanic_data)
Output:
Script 28:
# Newer Seaborn versions use dodge in place of the older split parameter
sns.swarmplot(x='alone', y='age',
              hue='sex', data=titanic_data, dodge=True)
Output:
Exercise 4.1
Question 1
Which plot is used to plot multiple joint plots for all the
combinations of numeric and Boolean columns in a dataset?
A- Joint Plot
B- Pair Plot
C- Dist Plot
D- Scatter Plot
Answer: B
Question 2
A- barplot()
B- jointplot()
C- catplot()
D- mulplot()
Answer: C
Question 3
A- kind
B- type
C- hue
D- col
Answer: A
Exercise 4.2
Plot a swarm violin plot using Titanic data that displays the
fare paid by male and female passengers.
References
1. https://seaborn.pydata.org/generated/seaborn.distplot.html
2. https://seaborn.pydata.org/generated/seaborn.jointplot.html
3. https://seaborn.pydata.org/generated/seaborn.pairplot.html
4. https://seaborn.pydata.org/generated/seaborn.rugplot.html
5. https://seaborn.pydata.org/generated/seaborn.barplot.html
6. https://seaborn.pydata.org/generated/seaborn.countplot.html
7. https://seaborn.pydata.org/generated/seaborn.boxplot.html
8. https://seaborn.pydata.org/generated/seaborn.violinplot.html
9. https://seaborn.pydata.org/generated/seaborn.stripplot.html
10. https://seaborn.pydata.org/generated/seaborn.swarmplot.html
Advanced Plotting with Seaborn
Let’s first import the tips dataset from the Seaborn library.
Script 1:
plt.rcParams["figure.figsize"] = [10,8]
tips_data = sns.load_dataset('tips')
tips_data.head()
Output:
Let’s now plot a scatter plot with the values from the
total_bill column of the tips dataset on the x-axis and values
from the tip column on the y-axis. To plot a scatter plot, you
need to call the scatterplot() function of the Seaborn library.
Script 2:
To change the color of the scatter plot, simply pass a color
shorthand, such as 'r' for red or 'b' for blue, to the color
attribute of the scatterplot() function.
Script 3:
Output:
Finally, to change the marker shape for the scatter plot, you
need to pass a value for the marker attribute. For example,
the following scatter plot plots blue x markers on the scatter
plot.
Script 4:
Output:
Further Readings – Seaborn Scatter Plots [1]
To study more about Seaborn scatter plots, please check
Seaborn’s official documentation for Scatter plots. Try to
plot scatter plots with a different set of attributes, as
mentioned in the official documentation.
Script 5:
sns.set_style('darkgrid')
sns.scatterplot(x="total_bill", y="tip", data=tips_data,
                color = 'b', marker = 'x')
Output:
Script 6:
sns.set_style('whitegrid')
sns.scatterplot(x="total_bill", y="tip", data=tips_data,
                color = 'b', marker = 'x')
Output:
In addition to styling the background, you can style the plot
for different devices via the set_context() function. By
default, the context is set to notebook. However, if you want
to plot your plot on a poster, you can pass poster as a
parameter to the set_context() function. In the output, you
will see a plot with bigger annotations, as shown below.
Script 7:
sns.set_context('poster')
sns.scatterplot(x="total_bill", y="tip", data=tips_data,
                color = 'b', marker = 'x')
Output:
Further Readings – Styling Seaborn Plots [2]
To study more about how to style Seaborn plots, please
check Seaborn’s official documentation for styling Seaborn
plots. Try to apply Seaborn styles with a different set of
attributes, as mentioned in the official documentation.
Script 8:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8,6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
Script 9:
titanic_data.corr()
Output:
From the output above, you can see that we now have
meaningful information across rows as well. In the following
script, we first increase the default plot size and then pass
the correlation matrix of the Titanic dataset to the heatmap()
function to create a heat map.
Script 10:
plt.rcParams["figure.figsize"] = [10,8]
corr_values = titanic_data.corr()
sns.heatmap(corr_values, annot= True)
You can see a heat map in the output, as shown below. The
higher the correlation is, the darker the cell containing the
correlation.
Output:
You can see that the above plot is cropped at the top and
bottom. The following script plots the uncropped plot. In the
following script, we use the set_ylim() method to extend the
y-axis limits by 0.5 units (half a cell) at the top and bottom.
Script 11:
plt.rcParams["figure.figsize"] = [10,8]
corr_values = titanic_data.corr()
ax = sns.heatmap(corr_values, annot= True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Output:
You can also change the default color of the heatmap. To do
so, you need to pass a value for the cmap attribute of the
heatmap() function. Look at the script below:
Script 12:
plt.rcParams["figure.figsize"] = [10,8]
corr_values = titanic_data.corr()
ax = sns.heatmap(corr_values, annot= True, cmap = 'coolwarm')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Output:
In addition to changing the color of the heat map, you can
also specify the line color and width that separates cells in a
heat map.
Let’s import the flights dataset from the Seaborn library. The
flights dataset contains records of the passengers traveling
each month from 1949 to 1960.
Script 13:
plt.rcParams["figure.figsize"] = [10,8]
flights_data = sns.load_dataset('flights')
flights_data.head()
Output:
Script 14:
flights_data_pivot = flights_data.pivot_table(index='month',
    columns='year', values='passengers')
ax = sns.heatmap(flights_data_pivot, cmap = 'coolwarm',
                 linecolor='black', linewidth=1)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Output:
Further Readings – Seaborn Heatmaps [3]
To study more about Seaborn heatmaps, please check
Seaborn’s official documentation for heat maps. Try to plot
heat maps with a different set of attributes, as mentioned
in the official documentation.
Script 15:
flights_data_pivot = flights_data.pivot_table(index='month',
    columns='year', values='passengers')
ax = sns.clustermap(flights_data_pivot, cmap = 'coolwarm')
Output:
As with a heat map, you can also specify the line color and
width separating cells in a cluster map. Here is an example:
Script 16:
flights_data_pivot = flights_data.pivot_table(index='month',
    columns='year', values='passengers')
ax = sns.clustermap(flights_data_pivot, cmap = 'coolwarm',
                    linecolor='black', linewidth=1)
Output:
Further Readings – Seaborn Cluster Maps [4]
To study more about Seaborn cluster maps, please check
Seaborn’s official documentation for cluster maps. Try to
plot cluster maps with a different set of attributes, as
mentioned in the official documentation.
In the previous chapter, you saw how the pair plot can be
used to plot relationships between numeric columns of a
dataset. Before we see pair grids in action, let’s review how
the pair plot works. The following script plots the pair plot
for the tips dataset.
Script 17:
plt.rcParams["figure.figsize"] = [10,8]
tips_data = sns.load_dataset('tips')
sns.pairplot(tips_data)
Output:
Let’s now plot a pair grid for the tips dataset. To do so, you
have to pass the Pandas dataframe containing the tips
dataset to the PairGrid() function, as shown below.
Script 18:
sns.PairGrid(tips_data)
Output:
To actually plot a graph on the grids returned by the
PairGrid() function, you need to call the map() function on
the object returned by the PairGrid() function. Inside the
map function, the type of plot is passed as a parameter. For
instance, the following PairGrid() function plots a scatter
plot for all the pairs of numerical columns in the tips dataset.
Script 19:
pgrids = sns.PairGrid(tips_data)
pgrids.map(plt.scatter)
Output:
With a pair grid, you can plot different types of plots on the
diagonal, on the upper portion above the diagonal, and on the lower
portion below the diagonal. For instance, the following pair grid
plots distributional plots on the diagonal,
kernel density estimation plots on the upper part of the diagonal, and
scatter plots on the lower side of the diagonal.
Script 20:
pgrids = sns.PairGrid(tips_data)
pgrids.map_diag(sns.distplot)
pgrids.map_upper(sns.kdeplot)
pgrids.map_lower(plt.scatter)
Output:
Further Readings – Seaborn Pair Grids [5]
To study more about Seaborn pair grids, please check
Seaborn’s official documentation for pair grids. Try to plot
pair grids with a different set of attributes, as mentioned in
the official documentation.
Script 21:
You can see gender across columns and time across rows, as
specified by the FacetGrid() function’s col and row attributes,
respectively.
Output:
Similarly, you can use the facet grid to plot scatter plots for
the total_bill and tip columns, with respect to the sex and time
columns.
Script 22:
Script 23:
Output:
You can plot regression plots for two columns on the y-axis.
To do so, you need to pass a column name for the hue
parameter of the lmplot() function.
Script 24:
Output:
Script 25:
Output:
Exercise 5.1
Question 1
A- set_style (‘darkgrid’)
B- set_style (‘whitegrid’)
C- set_style (‘poster’)
D- set_context (‘poster’)
Question 2
A- correlation()
B- corr()
C- heatmap()
D- none of the above
Question 3
A- annotate()
B- annot()
C- mark()
D- display()
Exercise 5.2
References
1. https://seaborn.pydata.org/generated/seaborn.scatterplot.html
2. https://seaborn.pydata.org/tutorial/aesthetics.html
3. https://seaborn.pydata.org/generated/seaborn.heatmap.html
4. https://seaborn.pydata.org/generated/seaborn.clustermap.html
5. https://seaborn.pydata.org/generated/seaborn.PairGrid.html
6. https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
7. https://seaborn.pydata.org/generated/seaborn.lmplot.html
Introduction to Pandas Library for Data
Analysis
6.1. Introduction
import pandas as pd
In the second chapter of this book, you saw how the Pandas
library can be used to read CSV and TSV files. Here, we will
just briefly recap how to read CSV files with Pandas. The
following script reads the “titanic_data.csv” file from the
Datasets folder in the Resources. The first five rows of
the Titanic dataset are printed via the head() method
of the Pandas dataframe containing the Titanic dataset.
Script 1:
import pandas as pd
titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
titanic_data.head()
Output:
The read_csv() method reads data from a CSV or TSV file
and stores it in a Pandas dataframe, which is a special object
that stores data in the form of rows and columns.
Script 2:
titanic_pclass1= (titanic_data.Pclass == 1)
titanic_pclass1
Output:
0 False
1 True
2 False
3 True
4 False
…
886 False
887 True
888 False
889 True
890 False
Name: Pclass, Length: 891, dtype: bool
Script 3:
titanic_pclass1= (titanic_data.Pclass == 1)
titanic_pclass1_data = titanic_data[titanic_pclass1]
titanic_pclass1_data.head()
Output:
Script 4:
titanic_pclass_data = titanic_data[titanic_data.Pclass == 1]
titanic_pclass_data.head()
Output:
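Scripts 2 to 4 all rely on the same idea: a comparison on a column produces a Boolean series, and indexing the dataframe with that series keeps only the rows marked True. The following is a minimal sketch of those two steps on a small, made-up dataframe. The passengers dataframe and its values are hypothetical stand-ins for the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: three passengers, two columns.
passengers = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Pclass": [1, 3, 1],
})

# Step 1: the comparison yields a Boolean Series, one value per row.
is_first_class = passengers["Pclass"] == 1

# Step 2: indexing with that Series keeps only the rows marked True.
first_class = passengers[is_first_class]
print(first_class)
```

Storing the mask in a variable, as in Script 3, is handy when the same condition is reused. Writing the comparison inline, as in Script 4, is shorter for one-off filters.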
Another commonly used way to filter rows is the isin()
method. The isin() method takes a list of values and returns True
for those rows where the column used for comparison
contains a value from the list. Indexing the dataframe with that
result keeps only the matching rows. For instance, the following
script filters those rows where the age is 20, 21, or 22.
Script 5:
ages = [20, 21, 22]
age_dataset = titanic_data[titanic_data["Age"].isin(ages)]
age_dataset.head()
Output:
Script 6:
ages = [20, 21, 22]
ageclass_dataset = titanic_data[titanic_data["Age"].isin(ages)
    & (titanic_data["Pclass"] == 1)]
ageclass_dataset.head()
Output:
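Script 6 combines two conditions with the & operator. Here is a small sketch of the same pattern on made-up data. The people dataframe is hypothetical. Note that & binds more tightly than ==, so each comparison has to be wrapped in parentheses or stored in its own variable first.

```python
import pandas as pd

# Hypothetical data: four people with an age and a passenger class.
people = pd.DataFrame({
    "Age": [20, 35, 22, 21],
    "Pclass": [1, 1, 3, 1],
})

ages = [20, 21, 22]

# Each condition is a Boolean Series of the same length as the dataframe.
in_age_list = people["Age"].isin(ages)
first_class = people["Pclass"] == 1

# & keeps a row only when both conditions are True for that row.
both = people[in_age_list & first_class]
print(both)
```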
Script 7:
The output below shows that the dataset now contains only
Name, Sex, and Age columns.
Output:
In addition to filtering columns, you can also drop columns
that you don’t want in the dataset. To do so, you need to call
the drop() method and pass it the list of columns that you
want to drop. For instance, the following script drops the
Name, Age, and Sex columns from the Titanic dataset and
returns the remaining columns.
Script 8:
Output:
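To make the difference between keeping and dropping columns concrete, here is a short sketch on a hypothetical four-column dataframe. The column names mirror the Titanic dataset, but the values are invented. It assumes the filter() and drop() methods of the Pandas dataframe, neither of which changes the original dataframe.

```python
import pandas as pd

# Hypothetical miniature of the Titanic dataset.
frame = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Sex": ["female", "male"],
    "Age": [29, 41],
    "Fare": [71.3, 8.05],
})

# filter() keeps only the listed columns.
kept = frame.filter(["Name", "Sex", "Age"])

# drop() removes the listed columns; axis=1 means "columns".
dropped = frame.drop(["Name", "Age", "Sex"], axis=1)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

Both calls return a new dataframe, so the result has to be assigned to a variable if you want to keep it.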
Script 9:
titanic_pclass1_data = titanic_data[titanic_data.Pclass == 1]
print(titanic_pclass1_data.shape)
titanic_pclass2_data = titanic_data[titanic_data.Pclass == 2]
print(titanic_pclass2_data.shape)
Output:
(216, 12)
(184, 12)
Script 10:
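# Note: DataFrame.append() was removed in pandas 2.0; on newer versions, use pd.concat() as in Script 11.
final_data = titanic_pclass1_data.append(titanic_pclass2_data, ignore_index=True)
print(final_data.shape)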
Output:
(400, 12)
The output now shows that the total number of rows is 400,
which is the sum of the number of rows in the two
dataframes that we concatenated.
Script 11:
final_data = pd.concat([titanic_pclass1_data,
titanic_pclass2_data])
print(final_data.shape)
Output:
(400, 12)
Script 12:
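df1 = final_data[:200]
print(df1.shape)
df2 = final_data[200:]
print(df2.shape)
# Concatenate side by side (axis=1); this produces the third shape in the output
final_data2 = pd.concat([df1, df2], axis=1, ignore_index=True)
print(final_data2.shape)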
Output:
(200, 12)
(200, 12)
(400, 24)
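The (400, 24) shape above can look surprising: joining two 200-row dataframes side by side yields 400 rows, not 200. A plausible explanation is that concat() aligns rows by index label when axis=1, and the two halves do not share any labels. The sketch below illustrates this on two tiny, made-up dataframes. The names top, bottom, and aligned are hypothetical.

```python
import pandas as pd

# Two hypothetical dataframes with the same columns but different index labels.
top = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
bottom = pd.DataFrame({"a": [5, 6], "b": [7, 8]}, index=[2, 3])

# Default (axis=0): rows are stacked, columns are matched by name.
stacked = pd.concat([top, bottom])

# axis=1: columns are placed side by side, rows are matched by index label.
# No labels are shared, so every row is kept and the gaps are filled with NaN.
side_by_side = pd.concat([top, bottom], axis=1)

# Resetting the index of the second frame makes the labels line up.
aligned = pd.concat([top, bottom.reset_index(drop=True)], axis=1)

print(stacked.shape, side_by_side.shape, aligned.shape)
```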
Script 13:
age_sorted_data = titanic_data.sort_values(by=['Age'])
age_sorted_data.head()
Output:
Script 14:
age_sorted_data = titanic_data.sort_values(by=['Age'],
    ascending=False)
age_sorted_data.head()
Output:
You can also pass multiple columns to the by attribute of the
sort_values() function. In such a case, the dataset will be
sorted by the first column, and in case of equal values for
two or more records, the dataset will be sorted by the
second column and so on. The following script first sorts the
data by Age and then by Fare, both in descending order.
Script 15:
age_sorted_data = titanic_data.sort_values(by=['Age', 'Fare'],
    ascending=False)
age_sorted_data.head()
Output:
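The tie-breaking behavior is easier to see on a handful of rows. The following sketch uses a hypothetical fares dataframe in which each age appears twice. It also shows that the ascending attribute accepts a list, so each column can have its own sort direction.

```python
import pandas as pd

# Hypothetical data with repeated ages, so the second column decides ties.
fares = pd.DataFrame({
    "Age": [30, 45, 30, 45],
    "Fare": [10.0, 7.5, 25.0, 60.0],
})

# Both columns in descending order: Age first, Fare breaks ties.
ordered = fares.sort_values(by=["Age", "Fare"], ascending=False)

# One direction per column: Age ascending, Fare descending.
mixed = fares.sort_values(by=["Age", "Fare"], ascending=[True, False])

print(ordered)
print(mixed)
```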
Script 16:
updated_class = titanic_data.Pclass.apply(lambda x : x + 2)
updated_class.head()
The output shows that all the values in the Pclass column
have been incremented by 2.
Output:
0 5
1 3
2 5
3 3
4 5
Script 17:
def mult(x):
    return x * 2

updated_class = titanic_data.Pclass.apply(mult)
updated_class.head()
Output:
0 6
1 2
2 6
3 2
4 6
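The two outputs above can be reproduced without the Titanic file. The sketch below applies the same lambda expression and the same mult() function to a small series whose values match the first five Pclass values shown above (3, 1, 3, 1, 3). It also shows, as a side note, that simple arithmetic can be written directly on the series.

```python
import pandas as pd

# The first five Pclass values implied by the outputs above.
pclass = pd.Series([3, 1, 3, 1, 3], name="Pclass")

def mult(x):
    # Called once per value in the Series.
    return x * 2

via_lambda = pclass.apply(lambda x: x + 2)
via_function = pclass.apply(mult)

# For plain arithmetic, operating on the whole Series gives the same values.
vectorized = pclass * 2

print(via_lambda.tolist())
print(via_function.tolist())
```

The apply() function is most useful when the logic cannot be expressed as a single arithmetic expression, for instance, when the function contains if statements.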
Script 18:
import matplotlib.pyplot as plt
import seaborn as sns
flights_data = sns.load_dataset('flights')
flights_data.head()
Output:
Script 19:
flights_data_pivot = flights_data.pivot_table(index='month',
    columns='year', values='passengers')
flights_data_pivot.head()
Output:
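To see what pivot_table() does to the shape of the data, consider the following sketch on four hypothetical rows in the same long format as the flights dataset. Each unique month becomes a row, each unique year becomes a column, and the passengers values fill the cells.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, month) pair.
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Rows come from 'month', columns from 'year', cell values from 'passengers'.
wide = long_form.pivot_table(index="month", columns="year", values="passengers")
print(wide)
```

In this sketch, the months are plain strings, so the rows are sorted alphabetically. If several rows share the same month and year, pivot_table() averages them by default.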
The crosstab() function is used to compute the cross-tabulation
between two columns. Let’s build a cross-tab matrix between the
passenger class and age columns of the Titanic dataset.
Script 20:
import pandas as pd
titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
titanic_data.head()
pd.crosstab(titanic_data.Pclass, titanic_data.Age,
    margins=True)
Output:
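A cross-tabulation simply counts how often each pair of values occurs together. The sketch below uses a hypothetical six-row riders dataframe and, to keep the table small, crosses passenger class with sex instead of age. With margins=True, an extra All row and All column hold the totals.

```python
import pandas as pd

# Hypothetical data: six passengers with a class and a sex.
riders = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "male", "male"],
})

# Rows are the unique Pclass values, columns the unique Sex values.
table = pd.crosstab(riders["Pclass"], riders["Sex"], margins=True)
print(table)
```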
Script 21:
import numpy as np
titanic_data.Fare = np.where( titanic_data.Age > 20,
titanic_data.Fare +5 , titanic_data.Fare)
titanic_data.head()
Output:
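The where() function of NumPy takes three arguments: a condition, the value to use where the condition is True, and the value to use where it is False. Here is a minimal sketch of Script 21 on hypothetical data, including one missing age. A comparison with a missing value is False, so that row keeps its original fare.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last passenger has no recorded age.
fares = pd.DataFrame({
    "Age": [18, 25, 40, np.nan],
    "Fare": [7.0, 10.0, 30.0, 8.0],
})

# Add 5 to the fare where Age > 20; otherwise keep the fare unchanged.
fares["Fare"] = np.where(fares["Age"] > 20, fares["Fare"] + 5, fares["Fare"])
print(fares)
```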
Exercise 6.1
Question 1
A- 0
B- 1
C- 2
D- None of the above
Question 2
A- sort_dataframe()
B- sort_rows()
C- sort_values()
D- sort_records()
Question 3
A- filter()
B- filter_columns()
C- apply_filter()
D- None of the above
Exercise 6.2
Use the apply function to subtract 10 from the Fare column
of the Titanic dataset, without using lambda expression.
References
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
2. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
3. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
4. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
5. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Pandas for Data Visualization
7.1. Introduction
Script 1:
import pandas as pd
titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
titanic_data.head()
Output:
Script 2:
Output:
Script 3:
import matplotlib.pyplot as plt
titanic_data['Age'].plot(kind='hist')
Output:
Script 4:
Output:
Finally, you can change the color of your histogram by
specifying the color name to the color attribute, as shown
below.
Script 5:
Output:
Further Readings – Pandas Histogram [1]
To study more about the Pandas histogram, please check
Pandas’ official documentation for a histogram. Try to
execute the histogram method with a different set of
attributes, as mentioned in the official documentation.
Script 6:
flights_data = sns.load_dataset('flights')
flights_data.head()
Output:
By default, the index serves as the x-axis. In the above script,
the left-most column, i.e., containing 0,1,2 … is the index
column. To plot a line plot, you have to specify the column
names for x and y axes. If you specify only the column value
for the y-axis, the index is used as the x-axis. The following
script plots a line plot for the passengers column of the
flights data.
Script 7:
Output:
Similarly, you can change the color of the line plot via the
color attribute, as shown below.
Script 8:
Output:
In the previous examples, we didn’t pass the column name
for the x-axis. Let’s see what happens when we specify the
year as the column name for the x-axis.
Script 9:
Output:
The output shows that for each year, we have multiple
values. This is because each year has 12 months. However,
the overall trend remains the same and the number of
passengers traveling by air increases as the years pass.
Script 10:
flights_data.plot.scatter(x='year', y='passengers', figsize=(8,6))
Output:
Like a line plot and histogram, you can also change the color
of a scatter plot by passing the color name as the value for
the color attribute. Look at the following script.
Script 11:
flights_data.plot.scatter(x='year', y='passengers',
    color='red', figsize=(8,6))
Output:
Further Readings – Pandas Scatter Plots [3]
To study more about Pandas scatter plots, please check
Pandas’ official documentation for scatter plots. Try to
execute the scatter() method with a different set of
attributes, as mentioned in the official documentation.
Script 12:
sex_mean = titanic_data.groupby("Sex")["Age"].mean()
print(sex_mean)
print(type(sex_mean.tolist()))
Output:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
<class 'list'>
Script 13:
df = pd.DataFrame({'Gender': ['Female', 'Male'],
    'Age': sex_mean.tolist()})
ax = df.plot.bar(x='Gender', y='Age', figsize=(8,6))
Output:
You can also plot horizontal bar plots via the Pandas library.
To do so, you need to call the barh() function, as shown in
the following example.
Script 14:
df = pd.DataFrame({'Gender': ['Female', 'Male'],
    'Age': sex_mean.tolist()})
ax = df.plot.barh(x='Gender', y='Age', figsize=(8,6))
Output:
Finally, like all the other Pandas plots, you can change the
color of both vertical and horizontal bar plots by passing the
color name to the color attribute of the corresponding
function.
Script 15:
df = pd.DataFrame({'Gender': ['Female', 'Male'],
    'Age': sex_mean.tolist()})
ax = df.plot.barh(x='Gender', y='Age', figsize=(8,6),
    color='orange')
Output:
Further Readings – Pandas Bar Plots [4]
To study more about Pandas bar plots, please check
Pandas’ official documentation for bar plots. Try to execute
the bar plot methods with a different set of attributes, as
mentioned in the official documentation.
To plot box plots via the Pandas library, you need to call the
box() function. The following script plots box plots for all the
numeric columns in the Titanic dataset.
Script 16:
Output:
Further Readings – Pandas Box Plots [5]
To study more about Pandas box plots, please check
Pandas’ official documentation for box plots. Try to
execute the box plot methods with a different set of
attributes, as mentioned in the official documentation.
tips_data = sns.load_dataset('tips')
The output shows that most of the time, the tip is between
two and four dollars.
Output:
Output:
Script 19:
Output:
Script 20:
Output:
Further Readings – Pandas KDE Plots [7]
To study more about Pandas KDE plots, please check
Pandas’ official documentation for KDE plots. Try to
execute the kde plot methods with a different set of
attributes, as mentioned in the official documentation.
In this section, you will see how to plot time series data with
Pandas. You will work with Google stock price data from
7th January 2015 to 7th January 2020. The dataset is available
in the resources folder by the name google_data.csv. The
following script reads the data into a Pandas dataframe.
Script 21:
google_stock = pd.read_csv(r"E:\Data Visualization with Python\Datasets\google_data.csv")
google_stock.head()
Output:
Script 22:
google_stock['Date'] = google_stock['Date'].apply(pd.to_datetime)
google_stock.set_index('Date', inplace=True)
google_stock.plot.line(y='Open', figsize=(12,8))
Output:
Script 23:
# 'A' is the year-end alias; pandas 2.2 and later prefer 'YE'.
google_stock.resample(rule='A').mean()
Output:
Similarly, to compute the monthly mean values for all the columns
in the Google stock dataset, you will need to pass M as a
value for the rule attribute, as shown below.
Script 24:
# 'M' is the month-end alias; pandas 2.2 and later prefer 'ME'.
google_stock.resample(rule='M').mean()
Output:
In addition to aggregate values for all the columns, you can
resample data with respect to a single column. For instance,
the following script prints the yearly mean values for the
opening stock prices of Google stock over a period of five
years.
Script 25:
google_stock['Open'].resample('A').mean()
Output:
Date
2015-12-31 602.676217
2016-12-31 743.732459
2017-12-31 921.121193
2018-12-31 1113.554101
2019-12-31 1187.009821
2020-12-31 1346.470011
Freq: A-DEC, Name: Open, dtype: float64
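Resampling groups the rows of a time series into fixed time bins and then aggregates each bin. The sketch below shows the idea on 60 days of made-up prices instead of the Google stock file. It uses the MS (month start) rule, which is an assumption made here because that alias behaves the same across Pandas versions. Each bin is labeled with the first day of its month.

```python
import pandas as pd

# Hypothetical daily prices: 60 consecutive days starting 1 January 2020,
# with the values 0, 1, 2, ... so the means are easy to verify by hand.
dates = pd.date_range("2020-01-01", periods=60, freq="D")
prices = pd.Series(range(60), index=dates, name="Open", dtype="float64")

# Group the days into calendar months and average each month.
monthly_mean = prices.resample("MS").mean()
print(monthly_mean)
```

January holds the values 0 to 30, and February 2020 holds the values 31 to 59, so the two monthly means are 15 and 45. As in Script 22, resampling requires a datetime index.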
Script 26:
google_stock['Open'].resample('A').mean().plot(kind='bar',
    figsize=(8,6))
Output:
Similarly, here is the line plot for the yearly mean opening
stock prices for Google stock over a period of five years.
Script 27:
google_stock['Open'].resample('A').mean().plot(kind='line',
    figsize=(8,6))
Output:
Further Readings – Pandas Resample Method [8]
To study more about the Pandas time sampling functions
for time series data analysis, please check Pandas’ official
documentation for the resample function. Try to execute
the time resample() method with a different set of
attributes, as mentioned in the official documentation.
Script 28:
google_stock.shift(3).head()
Output:
You can see that the first three rows now contain null values,
while what previously was the first record has now been
shifted to the 4th row.
In the same way, you can shift rows backward. To do so, you
have to pass a negative value to the shift function.
Script 29:
google_stock.shift(-3).tail()
Output:
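Shifting is easier to follow on a five-value series. The following sketch uses hypothetical prices and shows a forward shift, a backward shift, and one common use of shifting: subtracting the previous row from the current row to get the change between consecutive records.

```python
import pandas as pd

# Hypothetical prices for five consecutive days.
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Positive value: data moves down, the first three rows become null.
forward = prices.shift(3)

# Negative value: data moves up, the last three rows become null.
backward = prices.shift(-3)

# Difference between each value and the one before it.
daily_change = prices - prices.shift(1)

print(forward.tolist())
print(backward.tolist())
```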
Exercise 7.1
Question 1
A- set_color()
B- define_color()
C- color()
D- None of the above
Question 2
A- horz_bar()
B- barh()
C- bar_horizontal()
D- horizontal_bar()
Question 3
Exercise 7.2
Display a bar plot using the Titanic dataset that displays the
average age of the passengers who survived vs. those who
did not survive.
References
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html
2. https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.line.html
3. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
4. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html
5. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html
6. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html
7. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html
8. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html
9. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html
3D Plotting with Matplotlib
In the second and third chapters of this book, you saw how
the Matplotlib library can be used to plot two-dimensional
(2D) plots. In fact, in all the previous chapters, you saw how
to plot 2D plots with different Python libraries. In this
chapter, you will briefly see how the Matplotlib library can be
used to plot 3D plots.
Script 1:
import matplotlib.pyplot as plt
figure1 = plt.figure()
axis1 = figure1.add_subplot(projection='3d')
x = [1,7,6,3,2,4,9,8,1,9]
y = [4,6,1,8,3,7,9,1,2,4]
z = [6,4,9,2,7,8,1,3,4,9]
axis1.plot(x, y, z)
axis1.set_xlabel('X-axis')
axis1.set_ylabel('Y-axis')
axis1.set_zlabel('Z-axis')
plt.show()
Output:
In the previous section, we plotted a 3D plot with hard-coded
integers. Let’s now plot a 3D line plot that shows the
relationship between the values in the total_bill, tip, and size
columns of the tips dataset.
Script 2:
plt.rcParams["figure.figsize"] = [10, 8]
tips_data = sns.load_dataset('tips')
tips_data.head()
Output:
The following script converts the values in the total_bill, tip,
and size columns of the tips dataset into a list of values that
will be passed to the plot() function of the axis object.
Script 3:
bill = tips_data['total_bill'].tolist()
tip = tips_data['tip'].tolist()
size = tips_data['size'].tolist()
Script 4:
figure2 = plt.figure()
axis2 = figure2.add_subplot(projection='3d')
axis2.plot(bill, tip, size)
axis2.set_xlabel('bill')
axis2.set_ylabel('tip')
axis2.set_zlabel('size')
plt.show()
Output:
Script 5:
figure2 = plt.figure()
axis2 = figure2.add_subplot(projection='3d')
axis2.scatter(bill, tip, size)
axis2.set_xlabel('bill')
axis2.set_ylabel('tip')
axis2.set_zlabel('size')
plt.show()
Output:
Further Readings – Matplotlib 3D Scatter Plot (2)
To study more about Matplotlib 3D scatter plot functions,
please check Matplotlib’s official documentation. Try to
execute the scatter() method with a different set of
attributes, as mentioned in the official documentation.
Script 6:
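import numpy as np
figure2 = plt.figure()
axis3 = figure2.add_subplot(projection='3d')
x3 = bill
y3 = tip
z3 = np.zeros(tips_data.shape[0])
dx = np.ones(tips_data.shape[0])
dy = np.ones(tips_data.shape[0])
dz = bill
# Draw the bars: (x3, y3, z3) is each bar's base corner; dx, dy, dz are its width, depth, and height
axis3.bar3d(x3, y3, z3, dx, dy, dz)
axis3.set_xlabel('bill')
axis3.set_ylabel('tip')
axis3.set_zlabel('size')
plt.show()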
Output:
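The six position and size arguments of a 3D bar plot are easier to read with only three bars. The sketch below assumes Matplotlib's bar3d() method, where x, y, and z give the base corner of each bar and dx, dy, and dz give its width, depth, and height. All the values are made up, and the Agg backend is selected only so that the sketch also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: three bars placed along a diagonal.
x = np.array([1, 2, 3])      # x position of each bar's base corner
y = np.array([1, 2, 3])      # y position of each bar's base corner
z = np.zeros(3)              # every bar starts at height 0

dx = np.ones(3) * 0.5        # bar width along the x-axis
dy = np.ones(3) * 0.5        # bar depth along the y-axis
dz = np.array([4, 7, 2])     # bar heights

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.bar3d(x, y, z, dx, dy, dz)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("height")
```

Only dz carries the plotted values here. The other five arguments control where each bar stands and how thick it is.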
Exercise 8.1
Plot a scatter plot that shows the distribution of pclass, age,
and fare columns from the Titanic dataset.
References
1. https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#line-plots
2. https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#scatter-plots
3. https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#bar-plots
Interactive Data Visualization with Bokeh
In all the chapters till now, you have been plotting static
graphs. In this chapter and the next one, you will see how to
plot interactive graphs. Interactive graphs are the type of
graphs that show different information based on the actions
performed by the users. In this chapter, you will see how to
plot interactive plots with Python’s Bokeh library. In the next
chapter, you will see how to plot interactive plots with Plotly.
9.1. Installation
Use the pip installer to install the Bokeh library. To do so,
execute the following command on your command line.
$ pip install bokeh
Script 1:
import pandas as pd
import numpy as np
%matplotlib inline
import seaborn as sns
flights_data = sns.load_dataset('flights')
flights_data.head()
Output:
Next, to plot a line plot, we have to first create an object of
the figure class. The following script imports the classes
required to plot the Bokeh plots.
Script 2:
Script 3:
output_file('E:/bokeh.html')
Script 4:
# Note: Bokeh 3.x renamed plot_width and plot_height to width and height.
plot = figure(
    title = 'Years vs Passengers',
    x_axis_label = 'Year',
    y_axis_label = 'Passengers',
    plot_width=600,
    plot_height=400
)
Next, you need data sources that you will use to plot a
graph. We will be plotting the year against the number of
passengers.
Script 5:
year = flights_data['year']
passengers = flights_data['passengers']
Script 6:
At this point in time, the plot has been created and saved.
However, to display the plot, you have to call the show()
method, as shown below.
Script 7:
show(plot)
Output:
Further Readings – Bokeh Line Plot (1)
To study more about Bokeh’s line plot functions, please
check Bokeh’s official documentation for line()function. Try
to execute the line() method with a different set of
attributes, as mentioned in the official documentation.
Script 8:
month_passengers = flights_data.groupby("month")["passengers"].mean()
print(month_passengers.index.tolist())
print(month_passengers.tolist())
Output:
Script 9:
plot2 = figure(
    x_range = month_passengers.index.tolist(),
    title = 'Month vs Passengers',
    x_axis_label = 'Month',
    y_axis_label = 'Passengers',
    plot_height=400
)
Output:
Script 11:
plot3 = figure(
    title = 'Years vs Passengers',
    x_axis_label = 'Year',
    y_axis_label = 'Passengers',
    plot_width=600,
    plot_height=400
)
Script 12:
year = flights_data['year']
passengers = flights_data['passengers']
Script 13:
# Note: newer Bokeh releases use legend_label in place of legend.
plot3.scatter(year, passengers, legend='Years vs Passengers', line_width=2)
show(plot3)
Output:
Let’s plot another scatter plot using the tips dataset. The
scatter plot shows the values from the total_bill column on
the x-axis and the tips on the y-axis. The following script
loads the tips dataset.
Script 14:
tips_data = sns.load_dataset('tips')
tips_data.head()
Output:
And the following script plots a scatter plot showing the
distribution of total_bill vs. tips.
Script 14:
plot4 = figure(
    title = 'Total Bill vs Tips',
    x_axis_label = 'Total Bill',
    y_axis_label = 'Tips',
    plot_width=600,
    plot_height=400
)
Script 15:
total_bill = tips_data['total_bill']
tips = tips_data['tip']
Script 16:
Script 17:
plot5 = figure(
    title = 'Total Bill vs Tips',
    x_axis_label = 'Total Bill',
    y_axis_label = 'Tips',
    plot_width=600,
    plot_height=400
)
Script 18:
plot5.circle(total_bill, tips, radius = 0.5)
show(plot5)
Output:
In this chapter, you saw how to plot interactive plots via the
Bokeh library. In the next chapter, you will see how to plot
interactive plots via the Plotly library, which is yet another
useful library for interactive data plotting.
Exercise 9.1
Question 1
Which object is used to set the width and height of a plot in
Bokeh?
A- figure()
B- width()
C- height()
D- None of the above
Question 2
A- line
B- width
C- line_width
D- length
Question 3
In the Bokeh library, the list of values used to plot bar plots is
passed to the following attribute of the bar plot:
A- values
B- legends
C- y
D- top
Exercise 9.2
Plot a bar plot using the Titanic dataset that displays the
average age of both male and female passengers.
References
1. https://docs.bokeh.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.line
2. https://docs.bokeh.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.vbar
3. https://docs.bokeh.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.circle
Interactive Data Visualization with Plotly
import pandas as pd
import numpy as np
%matplotlib inline
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
Script 1:
import seaborn as sns
flights_data = sns.load_dataset('flights')
flights_data.head()
Output:
Let’s first plot a very simple line plot using Pandas only. To
do so, you need to select the column that you want to
plot and then call the “plot()” method, which draws a static line
plot. The following script plots the passengers column of the
flights dataset.
Script 2:
dataset_filter = flights_data[["passengers"]]
dataset_filter.plot()
Output:
Script 3:
dataset_filter.iplot()
Script 4:
flights_data.iplot(kind='bar', x=['month'], y='passengers')
If you hover the mouse over the plot, you will see the actual number
of passengers traveling in a specific month. The output
shows that the maximum number of passengers travel in the
months of July and August, probably due to vacations.
Output:
Script 5:
Output:
Further Readings – Plotly Cufflinks Bar Plots[2]
To study more about Plotly Cufflinks bar plot functions,
please check Cufflinks’ official documentation for the
Cufflinks bar plot function. Try to execute the cufflinks bar
plot method with a different set of attributes, as mentioned
in the official documentation.
Script 6:
flights_data.iplot(kind='scatter', x='month', y='passengers',
    mode='markers')
Output:
Let’s plot a scatter plot using the tips dataset. The following
script imports the tips dataset from the Seaborn library.
Script 7:
tips_data = sns.load_dataset('tips')
tips_data.head()
Output:
The following script plots a Plotly scatter plot, which shows
values in the total_bill column on the x-axis and values for
the tip column on the y-axis.
Script 8:
The output shows that with the increase in the total bill, the
corresponding tip also increases.
Output:
Further Readings – Plotly Cufflinks Scatter Plots [3]
To study more about Plotly Cufflinks scatter plot functions,
please check Cufflinks’ official documentation for the
Cufflinks scatter plot functions. Try to execute the cufflinks
scatter plot method with a different set of attributes, as
mentioned in the official documentation.
In the previous chapters, you saw how to plot box plots with
the Pandas and Seaborn libraries. You can also plot box plots
via the Plotly and Cufflinks libraries. The following script
plots the box plot for the numeric columns of the tips
dataset.
Script 9:
tips_data.iplot(kind='box')
Output:
Further Readings – Plotly Cufflinks Box Plots [4]
To study more about Plotly Cufflinks box plot functions,
please check Cufflinks’ official documentation for the
cufflinks box plot functions. Try to execute the cufflinks
box plot method with a different set of attributes, as
mentioned in the official documentation.
10.6. Histogram
Script 10:
Output:
Now to plot an interactive histogram via Plotly and Cufflinks,
pass hist as a value to the kind attribute of the iplot()
function.
Script 11:
titanic_data['Age'].iplot(kind='hist', bins=25)
Output:
Exercise 10.1
Question 1
A- plot()
B- iplot()
C- draw()
D- idraw()
Question 2
A- shape, markers
B- shape, scatter
C- mode, marker
D- mode, scatter
Question 3
A- histogram()
B- histo()
C- hist()
D- none of the above
Answer: C
Exercise 10.2
References
1. https://plot.ly/python/v3/ipython-notebooks/cufflinks/#line-charts
2. https://plot.ly/python/v3/ipython-notebooks/cufflinks/#bar-charts
3. https://plot.ly/python/v3/ipython-notebooks/cufflinks/#scatter-plot
4. https://plot.ly/python/v3/ipython-notebooks/cufflinks/#box-plots
5. https://plot.ly/python/v3/ipython-notebooks/cufflinks/#histograms
Hands-on Project
Script 1:
Output:
Another way to view all the columns in the dataset is by
using the columns attribute of the Pandas dataframe, as
shown below.
Script 2:
data_columns = customer_churn.columns.values.tolist()
print(data_columns)
Output:
Script 3:
sns.pairplot(data=customer_churn)
Output:
Script 4:
plt.rcParams["figure.figsize"] = [10, 8]
After pair plot, you are free to choose whichever plot you
want to plot depending upon the task. Let’s see if gender
plays any role in customer churn. You can plot the bar chart
for that, as shown below.
Script 5:
Output:
Script 6:
plt.title('Age Histogram')
plt.hist(customer_churn["Age"])
Output:
Script 7:
plt.scatter(customer_churn["Age"],
    customer_churn["EstimatedSalary"], c='g')
Output:
countries = customer_churn["Geography"].value_counts()
labels = countries.index.values.tolist()
values = countries.values.tolist()
explode = (0.05, 0.05, 0.05)
Output:
The output shows that 50 percent of the customers are from
France, while around 25 percent belong to Germany and
Spain each.
Let’s plot a box plot showing the percentiles of age for the
customers who left the bank and for those who didn’t leave
the bank with respect to gender.
Script 9:
Output:
Output:
Script 11:
corr_values = customer_churn.corr()
sns.heatmap(corr_values, annot= True)
Output:
Interactive plots reveal a lot of runtime information. If you
are interested in runtime information, I would suggest that
you plot an interactive plot. The following script plots an
interactive bar plot for gender and age columns of our
dataset using the Plotly library.
Script 12:
import pandas as pd
import numpy as np
%matplotlib inline
Output:
Exercise Solutions
§ Exercise 1.1
Question 1
A- For Loop
B- While Loop
C- Both A and B
D- None of the above
Answer: A
Question 2
A- Single Value
B- Double Value
C- More than two values
D- None
Answer: C
Question 3
A- In
B- Out
C- Not In
D- Both A and C
Answer: D
§ Exercise 1.2
Print the multiplication table for 9 using a while loop.
Solution
j = 1
while j < 11:
    print("9 x " + str(j) + " = " + str(9 * j))
    j = j + 1
Output:
9 x 1 = 9
9 x 2 = 18
9 x 3 = 27
9 x 4 = 36
9 x 5 = 45
9 x 6 = 54
9 x 7 = 63
9 x 8 = 72
9 x 9 = 81
9 x 10 = 90
§ Exercise 2.1
Question 1
Answer: C
Question 2
A- title
B- label
C- axis
D- All of the above
Answer: B
Question 3
A- autopct = ‘%1.1f%%’
B- percentage = ‘%1.1f%%’
C- perc = ‘%1.1f%%’
D- None of the Above
Answer: A
§ Exercise 2.2
Create a pie chart that shows the distribution of passengers
with respect to their gender, in the unfortunate Titanic ship.
You can use the Titanic dataset for that purpose.
Solution:
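import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
data.head()
sex = data["Sex"].value_counts()
print(sex)
labels = sex.index.values.tolist()
values = sex.values.tolist()
explode = (0.05, 0.05)
# Assumed completion: draw the pie chart from the values prepared above
plt.pie(values, explode=explode, labels=labels, autopct='%1.1f%%')
plt.show()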
Output:
§ Exercise 3.1
Question 1
Which plot function will you use to plot a graph in the 5th cell
of a multiple-plot figure with four rows and two
columns?
A- plt.subplot(5,4,2)
B- plt.subplot(2,4,5)
C- plt.subplot(4,2,5)
D- None of the Above
Answer: C
Question 2
How will you create a subplot with five rows and three
columns using the subplots() function?
A- plt.subplots(nrows=5, ncols=3)
B- plt.subplots(5,3)
C- plt.subplots(rows=5, cols=3)
D- All of the Above
Answer: A
Question 3
A- figure.saveimage()
B- figure.savegraph()
C- figure.saveplot()
D- figure.savefig()
Answer: D
§ Exercise 3.2
Draw multiple plots with three rows and one column. Show
the sine of any 30 integers in the first plot, the cosine of the
same 30 integers in the second plot, and the tangent of the
same 30 integers in the third plot.
Solution:
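import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [12, 8]
# Assumed definitions (not shown in the original): any 30 integers and their sine, cosine, and tangent
x_vals = np.arange(30)
y1_vals, y2_vals, y3_vals = np.sin(x_vals), np.cos(x_vals), np.tan(x_vals)
plt.subplot(3,1,1)
plt.plot(x_vals, y1_vals, 'bo-')
plt.subplot(3,1,2)
plt.plot(x_vals, y2_vals, 'rx-')
plt.subplot(3,1,3)
plt.plot(x_vals, y3_vals, 'g*-')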
Output:
§ Exercise 4.1
Question 1
Which plot is used to plot multiple joint plots for all the
combinations of numeric and Boolean columns in a dataset?
A- Joint Plot
B- Pair Plot
C- Dist Plot
D- Scatter Plot
Answer: B
Question 2
A- barplot()
B- jointplot()
C- catplot()
D- mulplot()
Answer: C
Question 3
A- kind
B- type
C- hue
D- col
Answer: A
§ Exercise 4.2
Plot a swarm violin plot using the Titanic data that displays
the fare paid by male and female passengers. Further,
categorize the plot by passengers who survived and by
those who didn’t.
Solution:
# dodge=True separates the hue levels; older Seaborn releases called this parameter split
sns.swarmplot(x='sex', y='fare',
    hue='survived', data=titanic_data, dodge=True)
Output:
§ Exercise 5.1
Question 1
A- set_style (‘darkgrid’)
B- set_style (‘whitegrid’)
C- set_style (‘poster’)
D- set_context (‘poster’)
Answer: D
Question 2
Which function can be used to find the correlation between
all the numeric columns of a Pandas dataframe?
A- correlation()
B- corr()
C- heatmap()
D- none of the above
Answer: B
Question 3
A- annotate()
B- annot()
C- mark()
D- display()
Answer: B
§ Exercise 5.2
Solution:
Output:
§ Exercise 6.1
Question 1
A- 0
B- 1
C- 2
D- None of the above
Answer: B
Question 2
A- sort_dataframe()
B- sort_rows()
C- sort_values()
D- sort_records()
Answer: C
Question 3
A- filter()
B- filter_columns()
C- apply_filter ()
D- None of the above
Answer: A
§ Exercise 6.2
Solution:
def subt(x):
    return x - 10
updated_class = titanic_data.Fare.apply(subt)
updated_class.head()
Output:
0    -2.7500
1    61.2833
2    -2.0750
3    43.1000
4    -1.9500
Name: Fare, dtype: float64
§ Exercise 7.1
Question 1
A- set_color()
B- define_color()
C- color()
D- None of the above
Answer: C
Question 2
A- horz_bar()
B- barh()
C- bar_horizontal()
D- horizontal_bar()
Answer: B
Question 3
A- shift_back(5)
B- shift(5)
C- shift_behind(-5)
D- shift(-5)
Answer: D
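To show what a negative argument to shift() does, here is a small sketch on a made-up Series. The values are hypothetical; the point is that shift(-5) moves every value five positions toward the start and fills the vacated tail with NaN.

```python
import pandas as pd

# Hypothetical series of eight values.
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# A negative period pulls later values earlier; the last five slots become NaN.
shifted = values.shift(-5)
print(shifted)
```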
§ Exercise 7.2
Draw a bar plot using the Titanic dataset that shows the
average age of the passengers who survived vs. those who
did not survive.
Solution:
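# Mean age per survival group (0 = did not survive, 1 = survived)
surv_mean = titanic_data.groupby('Survived')['Age'].mean()
df = pd.DataFrame({'Survived': ['No', 'Yes'],
                   'Age': surv_mean.tolist()})
ax = df.plot.bar(x='Survived', y='Age', figsize=(8, 6))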
Output:
§ Exercise 8.1
Solution:
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = [8, 6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
pclass = titanic_data['pclass'].tolist()
age = titanic_data['age'].tolist()
fare = titanic_data['fare'].tolist()
figure4 = plt.figure()
axis4 = figure4.add_subplot(projection='3d')
# Plot the three Titanic columns extracted above
axis4.scatter(pclass, age, fare)
axis4.set_xlabel('pclass')
axis4.set_ylabel('age')
axis4.set_zlabel('fare')
plt.show()
Output:
§ Exercise 9.1
Question 1
A- figure()
B- width()
C- height()
D- None of the above
Answer: A
Question 2
A- line
B- width
C- line_width
D- length
Answer: C
Question 3
In the Bokeh library, the list of values used to plot bar plots is
passed to the following attribute of the bar plot:
A- values
B- legends
C- y
D- top
Answer: D
§ Exercise 9.2
Plot a bar plot using the Titanic dataset that displays the
average age of both male and female passengers.
Solution:
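from bokeh.plotting import figure, show

sex_mean = titanic_data.groupby("Sex")["Age"].mean()
plotx = figure(
    x_range=sex_mean.index.tolist(),
    title='Sex vs Age',
    x_axis_label='Sex',
    y_axis_label='Age',
    height=400  # plot_height in Bokeh versions before 3.0
)
# Bar heights are passed through the top attribute
plotx.vbar(x=sex_mean.index.tolist(), top=sex_mean.tolist(), width=0.5)
show(plotx)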
Output:
§ Exercise 10.1
Question 1
A- plot()
B- iplot()
C- draw()
D- idraw()
Answer: B
Question 2
Answer: C
Question 3
A- histogram()
B- histo()
C- hist()
D- none of the above
Answer: C
§ Exercise 10.2
Solution:
titanic_data['Pclass'].iplot(kind='hist')
Output: