Data Visualization Using Python
Don't bother much about the tool used for creating the visuals.
Matplotlib
matplotlib is one of the earliest data visualization libraries in Python and is widely used.
In this course you will learn:
Usage of the matplotlib library in creating basic plots such as line plots, scatter
plots, etc.
Installing Matplotlib
matplotlib is a third-party library and is not part of the Python standard library.
You can install it with the pip utility by running pip install matplotlib on the command line.
Loading matplotlib
matplotlib is loaded using import, as shown in the expression below.
import matplotlib
You can find the version of matplotlib with the below command.
print(matplotlib.__version__)
If matplotlib is already installed and you want to upgrade it, run
pip install --upgrade matplotlib at the command prompt.
About Matplotlib
In matplotlib, everything is organized in a hierarchy.
Within a created figure, one or more axes (subplot) objects are created.
The axes objects are then used for the actual plotting actions.
Figure
Figure refers to the whole area or page on which everything is drawn.
Creating a Figure
A figure is created using the figure function of the pyplot module.
NOTE: The code snippets shown in this course assume that you have
imported pyplot as plt (import matplotlib.pyplot as plt).
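The import convention and figure creation described above can be sketched as follows. This is a minimal illustration rather than the course's original snippet: the Agg backend line is an assumption added only so that the code runs without a display, and the figsize value is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Create an empty figure; figsize is (width, height) in inches.
fig = plt.figure(figsize=(8, 6))

# The figure reports its size in inches and holds no Axes yet.
width, height = fig.get_size_inches()
print(width, height)
print(len(fig.axes))
```

In an interactive session you would drop the matplotlib.use line and call plt.show() to open the figure window.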
Viewing a Figure
The show function of pyplot can be used to view the created figure, as shown below.
fig = plt.figure()
plt.show()
Output
<matplotlib.figure.Figure at 0x185417f0>
You will be able to view a picture only when a figure contains at least
one Axes element.
The default width and height of a figure are 6.4 and 4.8 inches respectively (in matplotlib 2.0 and later).
Axes
An Axes is the region of the figure, available for plotting data.
An Axes contains two Axis objects in case of 2D plots and three Axis objects
in case of 3D plots.
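The relationship between a Figure, an Axes and its Axis objects can be inspected directly. The sketch below covers only the 2D case; the Agg backend line is an assumption added so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)  # a single 2D Axes that fills the figure

# A 2D Axes owns two Axis objects, one per dimension.
x_axis_type = type(ax.xaxis).__name__
y_axis_type = type(ax.yaxis).__name__
print(x_axis_type, y_axis_type)

# The Axes is registered with its parent figure.
print(ax in fig.axes)
```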
Creating an Axes
Syntax
fig = plt.figure()
ax = fig.add_subplot(111)
plt.show()
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set_title("My First Plot")
ax.set_xlabel("X-Axis")
ax.set_ylabel('Y-Axis')
ax.set_xlim([0,5])
ax.set_ylim([0,10])
plt.show()
Plotting Data
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='My First Plot',
xlabel='X-Axis', ylabel='Y-Axis',
xlim=(0, 5), ylim=(0,10))
x = [1, 2, 3, 4]; y = [2, 4, 6, 8]
plt.plot(x, y)
plt.show()
However, explicit is better than implicit. Calling plot on the Axes object (ax.plot) names the target axes, whereas plt.plot draws on whichever axes is currently active. Hence prefer the former style of plotting.
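The difference between the explicit and implicit styles can be sketched as below, reusing the data from the earlier example. The Agg backend line is an assumption added so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
ax.set(title='My First Plot',
       xlabel='X-Axis', ylabel='Y-Axis',
       xlim=(0, 5), ylim=(0, 10))

# Explicit style: the target Axes is named in the call.
lines = ax.plot(x, y)

# Implicit style would be plt.plot(x, y), which draws on the
# currently active Axes returned by plt.gca().
current_is_ax = plt.gca() is ax
print(current_is_ax)
```

With a single Axes both styles draw in the same place; the explicit form matters once a figure holds several Axes.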
Adding a Legend
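The legend uses the label provided to a line drawn using plot, as shown in
the code below.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='My First Plot')
ax.plot([1, 2, 3, 4], [2, 4, 6, 8], label='y = 2x')  # label text is illustrative
ax.legend()
plt.show()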
Types of Plots
Data can be presented in a number of different plots.
In this topic, you will learn how to draw the plots listed below using matplotlib.
Line plot
Scatter plot
Bar plot
Pie plot
Histogram
Box plot
Line Plot
Line Plot is used to visualize a trend in data.
Syntax
plot(x, y)
# 'x' , 'y' : Data values representing two variables.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Daily Temperature in Jan 2018',
xlabel='Day', ylabel='Temperature (in deg)',
xlim=(0, 30), ylim=(25, 35))
days = [1, 5, 8, 12, 15, 19, 22, 26, 29]
temp = [29.3, 30.1, 30.4, 31.5, 32.3, 32.6, 31.8, 32.4, 32.7]
ax.plot(days, temp)
plt.show()
marker : Chooses a marker for data points, e.g., circle, triangle, etc.
A green dashed line of width 3 can be generated by passing color='green',
linestyle='--' and linewidth=3 to plot.
Passing color='green' and marker='o' to plot draws a green line with data points
marked as circles.
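The two styling variants described above can be sketched as follows, using the temperature data from the line plot example. The second series is shifted down by 2 degrees purely so that both lines stay visible in one figure; that shift and the Agg backend line are assumptions of this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

days = [1, 5, 8, 12, 15, 19, 22, 26, 29]
temp = [29.3, 30.1, 30.4, 31.5, 32.3, 32.6, 31.8, 32.4, 32.7]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)

# Green dashed line of width 3.
dashed, = ax.plot(days, temp, color='green', linestyle='--', linewidth=3)

# Green line with each data point marked by a circle.
shifted = [t - 2 for t in temp]  # illustrative offset only
marked, = ax.plot(days, shifted, color='green', marker='o')
```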
Using plot function multiple times is one of the ways to draw multiple lines.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Daily Temperature of Jan 2018',
xlabel='Day', ylabel='Temperature (in deg)',
xlim=(0, 30), ylim=(25, 35))
days = [1, 5, 8, 12, 15, 19, 22, 26, 29]
location1_temp = [29.3, 30.1, 30.4, 31.5, 32.3, 32.6, 31.8, 32.4, 32.7]
location2_temp = [26.4, 26.8, 26.1, 26.4, 27.5, 27.3, 26.9, 26.8, 27.0]
ax.plot(days, location1_temp, color='green', marker='o', linewidth=3)
ax.plot(days, location2_temp)  # second call adds a second line (default style)
plt.show()
Scatter Plot
Scatter plot is very similar to Line Plot .
Scatter Plot is used for showing how one variable is related with another.
Scatter Plot consists of data points. If the spread of data points is linear, then
two variables are highly correlated.
Syntax
scatter(x, y)
# 'x', 'y' : Data values representing two variables.
plt.sca(ax) sets the current Axes to ax and the current Figure to the parent of ax.
scatter plot only marks the data points with the chosen marker.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Daily Temperature of Jan 2018',
xlabel='Day', ylabel='Temperature (in deg)',
xlim=(0, 30), ylim=(25, 35))
days = [1, 5, 8, 12, 15, 19, 22, 26, 29]
temp = [29.3, 30.1, 30.4, 31.5, 32.3, 32.6, 31.8, 32.4, 32.7]
ax.scatter(days, temp)
plt.show()
If the number of values is less than the number of data points considered, then
the list is repeated.
Passing c='green', s=60 and edgecolor='black' to scatter plots green circles of size 60 with black edges.
The plot function can also create a scatter plot when linestyle is set to none and
a marker is chosen.
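Both ways of drawing unconnected markers can be sketched as below. The data comes from the scatter example above; the 2-degree offset on the second series and the Agg backend line are assumptions made only for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

days = [1, 5, 8, 12, 15, 19, 22, 26, 29]
temp = [29.3, 30.1, 30.4, 31.5, 32.3, 32.6, 31.8, 32.4, 32.7]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)

# scatter: green circles of size 60 with black edges.
points = ax.scatter(days, temp, c='green', s=60, edgecolors='black')

# plot: the same kind of picture, with the connecting line switched off.
shifted = [t - 2 for t in temp]  # illustrative offset only
markers_only, = ax.plot(days, shifted, linestyle='none', marker='o', color='green')
```

scatter returns a collection whose size and color can vary per point, while plot returns a single line object with one marker style for all points.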
Bar Plot
Bar Plot is commonly used for comparing categories.
bar and barh are used for plotting vertical and horizontal bar plots
respectively.
Syntax
bar(x,height)
# 'x' : x coordinates of bars.
# 'height' : List of heights of each bar.
barh(y, width)
# 'y' : y coordinates of bars
# 'width' : List of widths.
The code also sets the ticks on X-Axis and labels them.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales',
xlabel='Quarter', ylabel='Sales (in millions)')
quarters = [1, 2, 3]
sales_2017 = [25782, 35783, 36133]
ax.bar(quarters, sales_2017)
ax.set_xticks(quarters)
ax.set_xticklabels(['Q1-2017', 'Q2-2017', 'Q3-2017'])
plt.show()
Red bars with black edges can be drawn by passing color='red' and edgecolor='black' to bar.
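A minimal sketch of the red bars with black edges, reusing the quarterly sales data from the example above. The Agg backend line is an assumption added so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

quarters = [1, 2, 3]
sales_2017 = [25782, 35783, 36133]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales',
       xlabel='Quarter', ylabel='Sales (in millions)')

# color fills the bars; edgecolor outlines them.
bars = ax.bar(quarters, sales_2017, color='red', edgecolor='black')

ax.set_xticks(quarters)
ax.set_xticklabels(['Q1-2017', 'Q2-2017', 'Q3-2017'])
```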
Vertical bar plots are used for comparing more than one category at a time.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales',
xlabel='Quarter', ylabel='Sales (in millions)')
quarters = [1, 2, 3]
x1_index = [0.8, 1.8, 2.8]; x2_index = [1.2, 2.2, 3.2]
sales_2016 = [28831, 30762, 32178]; sales_2017 = [25782, 35783, 36133]
ax.bar(x1_index, sales_2016, color='yellow', width=0.4, edgecolor='black', label='2016')
ax.bar(x2_index, sales_2017, color='red', width=0.4, edgecolor='black', label='2017')
ax.set_xticks(quarters)
ax.set_xticklabels(['Q1', 'Q2', 'Q3'])
ax.legend()
plt.show()
Horizontal bar plots are used while comparing values of one category at a
time.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales',
xlabel='Sales (in millions)', ylabel='Quarter')
quarters = [1, 2, 3]
sales_2017 = [25782, 35783, 36133]
ax.barh(quarters, sales_2017, height=0.6, color='red')
ax.set_yticks(quarters)
ax.set_yticklabels(['Q1-2017', 'Q2-2017', 'Q3-2017'])
plt.show()
Pie Plot
Syntax
pie(x)
# 'x' : sizes of portions, passed either as a fraction or a number.
The pie chart drawn by the code below displays company sales in the first three quarters
of 2017.
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales')
sales_2017 = [25782, 35783, 36133]
ax.pie(sales_2017)
plt.show()
Labels and percentages of the portions are drawn with the code snippet below.
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales')
sales_2017 = [25782, 35783, 36133]
quarters = ['Q1-2017', 'Q2-2017', 'Q3-2017']
ax.pie(sales_2017, labels=quarters, startangle=90, autopct='%
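A complete, runnable version of a labeled pie chart might look like the sketch below. The autopct format string is cut off in the snippet above, so '%1.1f%%' (one decimal place followed by a percent sign) is an assumed value, not necessarily the one the course used. The Agg backend line is likewise an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

sales_2017 = [25782, 35783, 36133]
quarters = ['Q1-2017', 'Q2-2017', 'Q3-2017']

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
ax.set(title='Avg. Quarterly Sales')

# labels names each wedge; autopct formats the percentage drawn inside it.
# '%1.1f%%' is an assumed format string.
ax.pie(sales_2017, labels=quarters, startangle=90, autopct='%1.1f%%')

# Collect the percentage strings that were drawn on the wedges.
percent_texts = [t.get_text() for t in ax.texts if t.get_text().endswith('%')]
print(percent_texts)
```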
Histogram
Histogram is used to visualize the spread of data of a distribution.
Syntax
hist(x)
# 'x' : Data values of a single variable.
import numpy as np
np.random.seed(100)
x = 60 + 10*np.random.randn(1000)
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title="Distribution of Student's Percentage",
ylabel='Count', xlabel='Percentage')
ax.hist(x)
plt.show()
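density : Set to True so that the bins display a probability density (normalized so that the area of the bars sums to 1) rather than the count. This parameter was called normed in older matplotlib releases.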
import numpy as np
np.random.seed(100)
x = 60 + 10*np.random.randn(1000)
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title="Distribution of Student's Percentage",
ylabel='Proportion', xlabel='Percentage')
ax.hist(x, color='blue', bins=30, density=True)
plt.show()
Box Plots
Box plots are also used to visualize the spread of data.
Syntax
boxplot(x)
# 'x' : list of values or list of list of values.
The code below draws a box plot of percentages obtained by 1000 students
of a class.
import numpy as np
np.random.seed(100)
x = 50 + 10*np.random.randn(1000)
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title="Box plot of Student's Percentage",
xlabel='Class', ylabel='Percentage')
ax.boxplot(x)
plt.show()
bootstrap : Number of bootstrap resamples used to compute the confidence
interval (notches) around the median.
A list of data sets can be passed as an argument to plot multiple box
plots, as shown in the code snippet below.
import numpy as np
np.random.seed(100)
x = 50 + 10*np.random.randn(1000)
y = 70 + 25*np.random.randn(1000)
z = 30 + 5*np.random.randn(1000)
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title="Box plot of Student's Percentage",
xlabel='Class', ylabel='Percentage')
# In matplotlib 3.9 and later, the labels parameter is named tick_labels.
ax.boxplot([x, y, z], labels=['A', 'B', 'C'], notch=True, bootstrap=10000)
plt.show()
Matplotlib Styles
matplotlib.pyplot comes with many built-in styles. The chosen style changes the
appearance of a figure.
You can view the styles available in pyplot by printing plt.style.available.
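Listing the available styles can be sketched as follows. The exact list depends on the installed matplotlib version, so no output is shown; the Agg backend line is an assumption added so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# plt.style.available is a list of style names that can be passed to plt.style.use.
styles = plt.style.available
print(len(styles))
print('ggplot' in styles)
```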
Using a Style
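A specific style can be invoked with either of the two expressions shown
below. plt.style.use applies the style to all figures created afterwards, whereas
plt.style.context is a context manager that applies the style only inside a with block.
plt.style.use('ggplot')
or
with plt.style.context('ggplot'):
    fig = plt.figure()  # figures created here use the ggplot style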
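The difference in scope between the two calls can be checked by looking at one rc setting before, inside and after a with block. In this sketch the 'default' style is applied first only to get a known starting point, and axes.facecolor is simply one setting that the ggplot style changes; both choices, like the Agg backend line, are assumptions of the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Global: stays in effect until another style is applied.
plt.style.use('default')
before = plt.rcParams['axes.facecolor']

# Temporary: the style applies only inside the with block.
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot([1, 2, 3, 4], [2, 4, 6, 8])

after = plt.rcParams['axes.facecolor']
print(before == after, inside != before)
```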
Composing Styles
All custom style sheets are placed in a folder, stylelib, present in the config
directory of matplotlib.
import matplotlib
print(matplotlib.get_configdir())
Now, create a file mystyle.mplstyle with the contents shown below and save it
in the folder <matplotlib_configdir>/stylelib/.
axes.titlesize : 24
axes.labelsize : 20
lines.linewidth : 8
lines.markersize : 10
xtick.labelsize : 16
ytick.labelsize : 16
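Reload the style library so that the new style becomes available, and then use it by name.
matplotlib.style.reload_library()
plt.style.use('mystyle')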
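The style file can also be tried out without touching the config directory, because a style may be referenced by file path as well as by name. In the sketch below a temporary directory stands in for the stylelib folder; that substitution and the Agg backend line are assumptions of this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

style_text = """\
axes.titlesize : 24
axes.labelsize : 20
lines.linewidth : 8
lines.markersize : 10
xtick.labelsize : 16
ytick.labelsize : 16
"""

# Assumption: a temporary directory stands in for <matplotlib_configdir>/stylelib/.
with tempfile.TemporaryDirectory() as tmp_dir:
    style_path = os.path.join(tmp_dir, 'mystyle.mplstyle')
    with open(style_path, 'w') as fh:
        fh.write(style_text)

    # Apply the style from its file path, only inside this block.
    with plt.style.context(style_path):
        linewidth = plt.rcParams['lines.linewidth']
        titlesize = plt.rcParams['axes.titlesize']

print(linewidth, titlesize)
```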
matplotlibrc file
matplotlib uses the settings specified in the matplotlibrc file.
The location of the active matplotlibrc file used by matplotlib can be found with
the expression below.
import matplotlib
matplotlib.matplotlib_fname()
Matplotlib rcParams
All rc settings present in the matplotlibrc file are stored in a dictionary
named matplotlib.rcParams.
For example, to change the line width and color, assign new values to
matplotlib.rcParams['lines.linewidth'] and matplotlib.rcParams['lines.color'].
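Changing rc settings at run time can be sketched as below. The assignments are wrapped in matplotlib.rc_context so that the previous values are restored afterwards; that wrapper is a choice made for this example to avoid side effects, and plain assignment outside it changes the settings for the rest of the session. The Agg backend line is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# rc_context restores the previous settings when the block exits.
with matplotlib.rc_context():
    matplotlib.rcParams['lines.linewidth'] = 2
    matplotlib.rcParams['lines.color'] = 'r'

    fig = plt.figure()
    ax = fig.add_subplot(111)
    line, = ax.plot([1, 2, 3, 4], [2, 4, 6, 8])
    width = line.get_linewidth()

print(width)
```

Lines drawn by plot take their color from the axes.prop_cycle setting, so lines.color does not change the color of the line in this example.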
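In this topic, you will see how to create multiple plots in a single figure.
Syntax
subplot(nrows, ncols, index)
# 'nrows', 'ncols' : number of rows and columns of the subplot grid.
# 'index' : position of the subplot in the grid, starting at 1.
subplot creates the Axes object at the given index position and returns it.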
fig = plt.figure(figsize=(10,8))
axes1 = plt.subplot(2, 2, 1, title='Plot1')
axes2 = plt.subplot(2, 2, 2, title='Plot2')
axes3 = plt.subplot(2, 2, 3, title='Plot3')
axes4 = plt.subplot(2, 2, 4, title='Plot4')
plt.show()
The code shown above creates a figure with four subplots, arranged in two rows
and two columns.
The third argument, the index, varies from 1 to 4, and the respective subplots
are drawn in row-major order.
Example 2 of 'subplot'
Now let's try to create a figure with three subplots, where the first subplot
spans all columns of first row.
fig = plt.figure(figsize=(10,8))
axes1 = plt.subplot(2, 2, (1,2), title='Plot1')
axes1.set_xticks([]); axes1.set_yticks([])
axes2 = plt.subplot(2, 2, 3, title='Plot2')
axes2.set_xticks([]); axes2.set_yticks([])
axes3 = plt.subplot(2, 2, 4, title='Plot3')
axes3.set_xticks([]); axes3.set_yticks([])
plt.show()
With GridSpec, a grid with the given number of rows and columns is set up first.
Later, while creating a subplot, the rows and columns of the grid
spanned by the subplot are provided as input to the subplot function.
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(12,10))
gd = gridspec.GridSpec(3,3)
axes1 = plt.subplot(gd[0,:],title='Plot1')
axes1.set_xticks([]); axes1.set_yticks([])
axes2 = plt.subplot(gd[1,:-1], title='Plot2')
axes2.set_xticks([]); axes2.set_yticks([])
axes3 = plt.subplot(gd[1:, 2], title='Plot3')
axes3.set_xticks([]); axes3.set_yticks([])
axes4 = plt.subplot(gd[2, :-1], title='Plot4')
axes4.set_xticks([]); axes4.set_yticks([])
plt.show()
Adding Text
Text can be added to any part of the figure using the text function.
Syntax
text(x, y, s)
# 'x', 'y' : position at which the text is placed.
# 's' : the text string.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='Writing Text',
xlabel='X-Axis', ylabel='Y-Axis',
xlim=(0, 5), ylim=(0, 9))
x = [1, 2, 3, 4]
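A complete version of a text example might look like the sketch below. The y values are reused from the earlier plotting example, while the text string, its position and its font size are illustrative assumptions rather than the course's original values. The Agg backend line is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
ax.set(title='Writing Text',
       xlabel='X-Axis', ylabel='Y-Axis',
       xlim=(0, 5), ylim=(0, 9))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]  # assumed, reused from the earlier examples
ax.plot(x, y)

# text(x, y, s): place the string s at data coordinates (x, y).
# Position, string and font size are illustrative.
note = ax.text(2.2, 4, 'y = 2x', fontsize=14)
```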
Matplotlib Backend
matplotlib can generate plots in different outputs.
In general, backend refers to everything that occurs from the time of executing
a plotting code to generating the figure.
Choosing a Backend
The default backend chosen by matplotlib is available as the backend setting
in the matplotlibrc file.
If you want to alter the backend of many figures, change the value
of the backend setting.
backend : WXAgg
You can also call the use function if you want to go with a specific backend.
import matplotlib
matplotlib.use('WXAgg')
# The above expression must be used before importing pyplot
Saving Figures
Once a backend is chosen, a matplotlib figure can be saved in any format
supported by it.
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.set(title='My First Plot',
xlabel='X-Axis', ylabel='Y-Axis',
xlim=(0, 5), ylim=(0,10))
x = [1, 2, 3, 4]; y = [2, 4, 6, 8]
plt.plot(x, y)
plt.savefig('myplot.png')
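The formats that the active backend can write can be queried from the figure's canvas, and savefig infers the format from the file extension. The sketch below saves into a temporary directory so that it leaves no files behind; that directory and the Agg backend line are assumptions of this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
ax.plot([1, 2, 3, 4], [2, 4, 6, 8])

# Mapping of file extension to format description for the active backend.
formats = fig.canvas.get_supported_filetypes()
print(sorted(formats))

with tempfile.TemporaryDirectory() as tmp_dir:
    png_path = os.path.join(tmp_dir, 'myplot.png')
    fig.savefig(png_path)  # format inferred from the .png extension
    png_size = os.path.getsize(png_path)

print(png_size > 0)
```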
Avoid 3-dimensional charts unless they add value over 2-D charts.
Whenever correlation is plotted, clarify that you have not established any
causal link between the variables.
Prefer labeling data objects directly inside the plot, rather than using legends.
Create a visualization that stands by itself. Avoid adding extra text to explain
the visualization.
import numpy as np
import matplotlib.pyplot as plt

def test_hist_of_a_sample_normal_distribution():
    # Histogram of 1000 samples drawn from a normal distribution (mean 25, std 3).
    fig = plt.figure(figsize=(8,6))
    ax = fig.add_subplot(111)
    np.random.seed(100)
    x1 = 25 + 3.0*np.random.randn(1000)
    ax.set(title='Histogram of a Single Dataset', xlabel='X1', ylabel='Bin Count')
    ax.hist(x1, bins=30)
    plt.savefig('histogram_normal.png')
    plt.show()

test_hist_of_a_sample_normal_distribution()