Python Matplotlib Data Visualization
Seaborn
Jupyter Notebooks: This tutorial is a Jupyter notebook - a document made of cells. Each cell can contain
code written in Python or explanations in plain English. You can execute code cells and view the results,
e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful
platform for experimentation and analysis. Don't be afraid to mess around with the code & break things -
you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output"
menu option to clear all outputs and start again from the top.
Introduction
Data visualization is the graphic representation of data. It involves producing images that communicate
relationships among the represented data to viewers. Visualizing data is an essential part of data analysis and
machine learning. We'll use Python libraries Matplotlib and Seaborn to learn and apply some popular data
visualization techniques. We'll use the words chart, plot, and graph interchangeably in this tutorial.
To begin, let's install and import the libraries. We'll use the matplotlib.pyplot module for basic plots like line &
bar charts. It is often imported with the alias plt. We'll use the seaborn module for more advanced plots. It is
commonly imported with the alias sns.
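A minimal sketch of such a setup cell follows. The install command and the notebook magic appear as comments because they are shell and IPython commands rather than plain Python, and the exact pip flags are an assumption.

```python
# Install the libraries first (run in a terminal, or prefix with "!" in a notebook cell):
#   pip install matplotlib seaborn --upgrade --quiet

import matplotlib.pyplot as plt  # basic plots such as line and bar charts
import seaborn as sns            # more advanced statistical plots

# In a Jupyter notebook, also run this magic command so that plots are
# embedded in the notebook itself:
#   %matplotlib inline
```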
Notice that we also include the special command %matplotlib inline to ensure that our plots are shown and
embedded within the Jupyter notebook itself. Without this command, plots may sometimes show up in pop-up
windows.
Line Chart
The line chart is one of the simplest and most widely used data visualization techniques. A line chart displays
information as a series of data points or markers connected by straight lines. You can customize the shape, size,
color, and other aesthetic elements of the lines and markers for better visual clarity.
Here's a Python list showing the yield of apples (tons per hectare) over six years in an imaginary country called
Kanto.
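A minimal sketch of such a list is shown below. The numbers are illustrative placeholders (an assumption made for this example, not real measurements); any list of six values behaves the same way.

```python
# Hypothetical apple yields (tons per hectare), one value per year for six years
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
```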
We can visualize how the yield of apples changes over time using a line chart. To draw a line chart, we can use the
plt.plot function.
plt.plot(yield_apples)
[<matplotlib.lines.Line2D at 0x7f90047a8520>]
Calling the plt.plot function draws the line chart as expected. It also returns a list of the plots drawn,
[<matplotlib.lines.Line2D at 0x7f90047a8520>], shown within the output. We can include a semicolon
(;) at the end of the last statement in the cell to avoid showing the output and display just the graph.
plt.plot(yield_apples);
Let's enhance this plot step-by-step to make it more informative and beautiful.
plt.plot(years, yield_apples)
[<matplotlib.lines.Line2D at 0x7f90051fcfa0>]
Axis Labels
We can add labels to the axes to show what each axis represents using the plt.xlabel and plt.ylabel
functions.
plt.plot(years, yield_apples)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');
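As a self-contained sketch of the steps so far, the snippet below plots yield against year and labels both axes. The years and yield values are hypothetical and chosen only for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data: six years and the apple yield for each year
years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# Passing years as the first argument puts them on the X-axis
# instead of the default element indices 0 to 5
line, = plt.plot(years, yield_apples)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
```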
plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');
plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
The fmt argument provides a shorthand for specifying the marker shape, line style, and line color. It can be
provided as the third argument to plt.plot .
fmt = '[marker][line][color]'
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
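The fmt shorthand described above can be sketched as follows. The data values are hypothetical. Here 's-b' requests square markers, a solid line, and blue, while 'o--r' requests circle markers, a dashed line, and red. The plt.legend call is an addition for readability.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare) for two fruits over six years
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# fmt is the third positional argument: '[marker][line][color]'
apples_line, = plt.plot(years, apples, 's-b')     # squares, solid line, blue
oranges_line, = plt.plot(years, oranges, 'o--r')  # circles, dashed line, red

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
plt.legend(['Apples', 'Oranges'])
```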
If you don't specify a line style in fmt , only markers are drawn.
plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");
plt.figure(figsize=(12, 6))
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
sns.set_style("darkgrid")
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
You can also edit default styles directly by modifying the matplotlib.rcParams dictionary. Learn more:
https://matplotlib.org/3.2.1/tutorials/introductory/customizing.html#matplotlib-rcparams .
import matplotlib
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
import jovian
jovian.commit(project='python-matplotlib-data-visualization')
'https://jovian.ai/aakashns/python-matplotlib-data-visualization'
The first time you run jovian.commit, you'll be asked to provide an API Key to securely upload the notebook to
your Jovian account. You can get the API key from your Jovian profile page after logging in / signing up.
jovian.commit uploads the notebook to your Jovian account, captures the Python environment, and creates a
shareable link for your notebook, as shown above. You can use this link to share your work and let anyone
(including you) run your notebooks and reproduce your work.
Scatter Plot
In a scatter plot, the values of 2 variables are plotted as points on a 2-dimensional grid. Additionally, you can also
use a third variable to determine the size or color of the points. Let's try out an example.
The Iris flower dataset provides sample measurements of sepals and petals for three species of flowers. The Iris
dataset is included with the Seaborn library and can be loaded as a Pandas data frame.
flowers_df = sns.load_dataset("iris")
flowers_df
flowers_df.species.unique()
Let's try to visualize the relationship between sepal length and sepal width. Our first instinct might be to create a
line chart using plt.plot .
plt.plot(flowers_df.sepal_length, flowers_df.sepal_width);
The output is not very informative as there are too many combinations of the two properties within the dataset.
There doesn't seem to be a simple relationship between them.
We can use a scatter plot to visualize how sepal length & sepal width vary using the scatterplot function from
the seaborn module (imported as sns ).
sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);
Adding Hues
Notice how the points in the above plot seem to form distinct clusters with some outliers. We can color the dots
using the flower species as a hue. We can also make the points larger using the s argument.
plt.figure(figsize=(12, 6))
plt.title('Sepal Dimensions')
sns.scatterplot(x=flowers_df.sepal_length,
y=flowers_df.sepal_width,
hue=flowers_df.species,
s=100);
Plotting using Pandas Data Frames
Seaborn has built-in support for Pandas data frames. Instead of passing each column as a series, you can provide
column names and use the data argument to specify a data frame.
plt.title('Sepal Dimensions')
sns.scatterplot(x='sepal_length',
y='sepal_width',
hue='species',
s=100,
data=flowers_df);
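To try the data argument without loading the full dataset, here is a minimal sketch on a small hand-built data frame. The measurements are hypothetical; only the column names follow those used with flowers_df above.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical measurements, two rows per species, for illustration only
sample_df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

plt.title('Sepal Dimensions')

# Column names are passed as strings; data tells Seaborn where to look them up
ax = sns.scatterplot(x='sepal_length',
                     y='sepal_width',
                     hue='species',
                     s=100,
                     data=sample_df)
```

Seaborn uses the column names as the axis labels and builds the legend from the values in the hue column.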
Histogram
A histogram represents the distribution of a variable by creating bins (intervals) along the range of values and
showing vertical bars to indicate the number of observations in each bin.
For example, let's visualize the distribution of values of sepal width in the flowers dataset. We can use the
plt.hist function to create a histogram.
flowers_df.sepal_width
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
...
145 3.0
146 2.5
147 3.0
148 3.4
149 3.0
Name: sepal_width, Length: 150, dtype: float64
We can immediately see that the sepal widths lie in the range 2.0 - 4.5, and around 35 values are in the range 2.9 -
3.1, which seems to be the most populous bin.
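A minimal sketch of a plt.hist call follows. To stay self-contained it uses only the ten sepal width values visible in the output above rather than the full column, so its counts will not match the figures quoted for the whole dataset. The bin edges built with np.arange are an assumed choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# The ten sepal widths visible in the printed output (first and last five rows)
sepal_widths = [3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 2.5, 3.0, 3.4, 3.0]

# Bin edges 2.0, 2.25, ..., 4.75 (assumed for this sketch)
bins = np.arange(2, 5, 0.25)

# plt.hist returns the count per bin, the bin edges, and the drawn bars
counts, edges, bars = plt.hist(sepal_widths, bins=bins)

plt.title('Distribution of Sepal Width')
plt.xlabel('Sepal width')
plt.ylabel('Number of observations')
```

With the real data, passing flowers_df.sepal_width in place of the list gives the full distribution.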
import numpy as np
Bar Chart
Bar charts are quite similar to line charts, i.e., they show a sequence of values. However, a bar is shown for each
value, rather than points connected by lines. We can use the plt.bar function to draw a bar chart.
plt.bar(years, oranges);
Like histograms, we can stack bars on top of one another. We use the bottom argument of plt.bar to achieve
this.
plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples);
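The stacking behavior can be checked with a self-contained sketch. The yields are hypothetical, and the legend is an addition for readability. Each orange bar starts at the height of the apple bar beneath it.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare) for six years
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# First layer: bars start at zero
apple_bars = plt.bar(years, apples)

# Second layer: bottom shifts each bar up by the matching apple value
orange_bars = plt.bar(years, oranges, bottom=apples)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')
plt.legend(['Apples', 'Oranges'])
```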
tips_df = sns.load_dataset("tips");
tips_df
We might want to draw a bar chart to visualize how the average bill amount varies across different days of the
week. One way to do this would be to compute the day-wise averages and then use plt.bar (try it as an
exercise).
However, since this is a very common use case, the Seaborn library provides a barplot function which can
automatically compute averages.
The lines cutting each bar are error bars. By default, Seaborn draws a 95% confidence interval for the mean, which
gives a sense of the amount of variation in the values. For instance, it seems like the variation in the total bill is
relatively high on Fridays and low on Saturdays.
We can also specify a hue argument to compare bar plots side-by-side based on a third feature, e.g., sex.
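The two barplot variants described above can be sketched on a small hand-built data frame. The bills are hypothetical, and the column names (day, sex, total_bill) are assumed to mirror those of tips_df.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical restaurant bills: four per day, two per sex on each day
bills_df = pd.DataFrame({
    'day': ['Thur'] * 4 + ['Fri'] * 4 + ['Sat'] * 4,
    'sex': ['Male', 'Female'] * 6,
    'total_bill': [10.0, 20.0, 12.0, 18.0,
                   15.0, 25.0, 18.0, 22.0,
                   30.0, 40.0, 25.0, 25.0],
})

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Each bar's height is the mean total_bill for that day
sns.barplot(x='day', y='total_bill', data=bills_df, ax=axes[0])

# hue splits every day into side-by-side bars, one per value of 'sex'
sns.barplot(x='day', y='total_bill', hue='sex', data=bills_df, ax=axes[1])
```

In this sample the day-wise means are 15, 20, and 30, which are the bar heights in the left chart.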
Heatmap
A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors. The best way to understand
it is by looking at an example. We'll use another sample dataset from Seaborn, called flights , to visualize
monthly passenger footfall at an airport over 12 years.
flights_df = sns.load_dataset("flights").pivot(index="month", columns="year", values="passengers")
flights_df
year 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
month
Jan 112 115 145 171 196 204 242 284 315 340 360 417
Feb 118 126 150 180 196 188 233 277 301 318 342 391
Mar 132 141 178 193 236 235 267 317 356 362 406 419
Apr 129 135 163 181 235 227 269 313 348 348 396 461
May 121 125 172 183 229 234 270 318 355 363 420 472
Jun 135 149 178 218 243 264 315 374 422 435 472 535
Jul 148 170 199 230 264 302 364 413 465 491 548 622
Aug 148 170 199 242 272 293 347 405 467 505 559 606
Sep 136 158 184 209 237 259 312 355 404 404 463 508
Oct 119 133 162 191 211 229 274 306 347 359 407 461
Nov 104 114 146 172 180 203 237 271 305 310 362 390
Dec 118 140 166 194 201 229 278 306 336 337 405 432
flights_df is a matrix with one row for each month and one column for each year. The values show the
number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the
sns.heatmap function to visualize the footfall at the airport.
The footfall at the airport in any given year tends to be the highest around July & August.
The footfall at the airport in any given month tends to grow year by year.
We can also display the actual values in each block by specifying annot=True and using the cmap argument to
change the color palette.
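A minimal sketch of an annotated heatmap follows. To stay self-contained it rebuilds a 3x3 excerpt of the table above (January, July, and December for 1949 to 1951) instead of using the full flights_df. The fmt="d" argument and the "Blues" palette are choices made for this sketch.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Excerpt of the passenger table shown above (values in thousands)
sample_df = pd.DataFrame(
    {1949: [112, 148, 118], 1950: [115, 170, 140], 1951: [145, 199, 166]},
    index=pd.Index(['Jan', 'Jul', 'Dec'], name='month'),
)
sample_df.columns.name = 'year'

plt.title('No. of Passengers (1000s)')

# annot=True writes each value inside its block; fmt="d" prints integers;
# cmap selects the color palette
ax = sns.heatmap(sample_df, annot=True, fmt='d', cmap='Blues')
```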
Images
We can also use Matplotlib to display images. Let's download an image from the internet.
from urllib.request import urlretrieve
urlretrieve('https://i.imgur.com/SkPbq.jpg', 'chart.jpg');
Before displaying an image, it has to be read into memory using the PIL module.
from PIL import Image
img = Image.open('chart.jpg')
An image loaded using PIL is an Image object. Converting it with np.array gives a 3-dimensional numpy array
containing pixel intensities for the red, green & blue (RGB) channels of the image.
img_array = np.array(img)
img_array.shape
(481, 640, 3)
plt.imshow(img);
We can turn off the axes & grid lines and show a title using the relevant functions.
plt.grid(False)
plt.title('A data science meme')
plt.axis('off')
plt.imshow(img);
To display a part of the image, we can simply select a slice from the numpy array.
plt.grid(False)
plt.axis('off')
plt.imshow(img_array[125:325,105:305]);
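The array layout and the slicing step can be tried without downloading anything. The sketch below uses a synthetic solid-color image (an assumption standing in for chart.jpg) with the same 640x481 size.

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Synthetic RGB image, 640 pixels wide and 481 pixels tall (stand-in for chart.jpg)
img = Image.new('RGB', (640, 481), color=(200, 30, 30))

# The array is indexed as (row, column, channel), i.e. (height, width, RGB)
img_array = np.array(img)
print(img_array.shape)  # (481, 640, 3)

# Rows 125 to 324 and columns 105 to 304: a 200x200 region
crop = img_array[125:325, 105:305]

plt.grid(False)
plt.axis('off')
plt.imshow(crop)
```

Note that PIL reports size as (width, height), while the array shape lists height first.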
Plotting multiple charts in a grid
Matplotlib and Seaborn also support plotting multiple charts in a grid, using plt.subplots , which returns a set
of axes for plotting.
Here's a single grid showing the different types of charts we've covered in this tutorial.
plt.tight_layout(pad=2);
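A minimal sketch of a grid built with plt.subplots follows, using hypothetical data. Matplotlib charts are drawn by calling methods on each axes object, while Seaborn's axes-level functions such as sns.scatterplot accept the target axes through the ax argument.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical yields (tons per hectare) for six years
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# One row, three columns; axes is an array with one entry per chart
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Matplotlib: call the plotting method on the axes itself
axes[0].plot(years, apples, 's-b')
axes[0].set_title('Line chart')

axes[1].bar(years, oranges)
axes[1].set_title('Bar chart')

# Seaborn: pass the axes via the ax argument
sns.scatterplot(x=apples, y=oranges, s=100, ax=axes[2])
axes[2].set_title('Scatter plot')

# Add spacing between the charts
plt.tight_layout(pad=2)
```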
sns.pairplot(flowers_df, hue='species');
sns.pairplot(tips_df, hue='sex');
Let's save and upload our work before continuing.
import jovian
jovian.commit()
In this tutorial we've covered some of the fundamental concepts and popular techniques for data visualization
using Matplotlib and Seaborn. Data visualization is a vast field and we've barely scratched the surface here. Check
out these references to learn and discover more:
Data Visualization cheat sheet: https://jovian.ml/aakashns/dataviz-cheatsheet
You are now ready to move on to the next tutorial: Exploratory Data Analysis - A Case Study
2. What is Matplotlib?
3. What is Seaborn?
4. How do you install Matplotlib and Seaborn?
5. How do you import Matplotlib and Seaborn? What are the common aliases used while importing these modules?
11. How do you plot multiple line charts on the same axes?
12. How do you show a legend for a line chart with multiple lines?
15. What are the different options for styling lines & markers in line charts? Illustrate with examples.
18. Where can you see a list of all the arguments accepted by plt.plot?
19. How do you change the size of the figure using Matplotlib?
20. How do you apply the default styles from Seaborn globally for all charts?
21. What are the predefined styles available in Seaborn? Illustrate with examples.
25. How do you decide when to use a scatter plot vs. a line chart?
26. How do you specify the colors for dots on a scatter plot using a categorical variable?
27. How do you customize the title, figure size, legend, etc., for Seaborn plots?
31. How do you draw a histogram using Matplotlib? Illustrate with an example.
38. How do you draw a bar chart using Matplotlib? Illustrate with an example.
43. What do the lines cutting the bars in a Seaborn bar plot represent?
49. How do you draw a heat map using Seaborn? Illustrate with an example.
51. How do you show the original values from the dataset on a heat map?
52. How do you download images from a URL in Python?
55. How do you convert an image loaded using PIL into a Numpy array?
56. How many dimensions does a Numpy array for an image have? What does each dimension represent?
60. How do you turn off the axes and gridlines in a chart?
62. How do you plot multiple charts in a grid using Matplotlib and Seaborn? Illustrate with examples.
66. Where can you learn about the different types of charts you can create using Matplotlib and Seaborn?