Beginner Guide Matplotlib Data Visualization Exploration Python
I love working with matplotlib in Python. It was the first visualization library I learned to master and it has
stayed with me ever since. There is a reason why matplotlib is the most popular Python library for data
visualization and exploration: the flexibility and agility it offers are unparalleled!
Matplotlib provides an easy but comprehensive visual approach to present our findings. There are a
number of visualizations we can choose from to present our results, as we’ll soon see in this tutorial.
From histograms to scatterplots, matplotlib lays down an array of colors, themes, palettes, and other
options to customize and personalize our plots. matplotlib is useful whether you’re performing data
exploration for a machine learning project or simply want to create dazzling and eye-catching charts.
Note: If you’re new to the world of Python, we highly recommend taking the below popular free courses:
What is matplotlib?
Let’s put a formal definition to matplotlib before we dive into the crux of the article. If this is the first time
you’ve heard of matplotlib, here’s the official description:
“Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive
environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web
application servers, and four graphical user interface toolkits.”
You can draw all sorts of charts and visualizations using matplotlib. I will be exploring the most common
plots in the matplotlib Python library in this tutorial. We will first understand the dataset at hand and then
start building different plots using matplotlib, including scatterplots and line charts!
Note: If you’re looking for a matplotlib alternative or want to explore other Python visualization libraries,
check out the below tutorial on Seaborn:
Become a Data Visualization Whiz with this Comprehensive Guide to Seaborn in Python
Bar Graph
Pie Chart
Box Plot
Histogram
Line Chart and Subplots
Scatter Plot
Before we get into the different visualizations and chart types, I want to spend a few minutes
understanding the data. This is a critical part of the machine learning pipeline and we should pay full
attention to it.
We will be analyzing the Food Demand Forecasting project in this matplotlib tutorial. The aim of this
project is to predict the number of food orders that customers will place in the upcoming weeks with the
company. We will, of course, only spend time on the exploration stage of the project.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    # this stylesheet was called 'seaborn' before matplotlib 3.6
    plt.style.use('seaborn-v0_8')

I have used a matplotlib stylesheet to make our plots look neat and pretty. Here, I have used the ‘seaborn’
stylesheet (named ‘seaborn-v0_8’ since matplotlib 3.6). However, there are plenty of other stylesheets in
Matplotlib which you can use to best suit your presentation style.
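To browse the stylesheets that come with your installation, here is a minimal sketch. It relies on `plt.style.available` and `plt.style.context()`; the choice of 'ggplot' and the plotted numbers are only illustrative assumptions, not something the dataset requires.

```python
import matplotlib.pyplot as plt

# names of all stylesheets bundled with the installed matplotlib version
styles = sorted(plt.style.available)
print(styles)

# apply a style only inside this block, leaving the global style untouched
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8])
    ax.set_title("Styled with ggplot")
```

Using the context manager is handy when you want to try a style on one figure without changing the look of every plot that follows. Add plt.show() if you want to display the figure.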
Our dataset has three dataframes: df_meal describing the meals, df_center describing the food centers,
and df_food describing the overall food order. Have a look at them below:
    df_meal = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\meal_info.csv')
    df_meal.head()

    df_center = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\fulfilment_center_info.csv')
    df_center.head()

    df_food = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\train_food.csv')
    df_food.head()
I will first merge all three dataframes into a single dataframe. This will make it easier to manipulate the
data while plotting it:

    # attach the center details, then the meal details, to every order row
    df = pd.merge(df_food, df_center, on='center_id')
    df = pd.merge(df, df_meal, on='meal_id')
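If you want to try the merge without the CSV files, here is a small sketch on toy dataframes. The values are made up, and the `validate` argument is my own addition (not part of the original code): it asks pandas to raise an error if a key that should be unique on the right-hand side is duplicated.

```python
import pandas as pd

# toy stand-ins for the three CSV files (values are made up for illustration)
df_food = pd.DataFrame({
    "center_id": [10, 10, 20, 30],
    "meal_id": [1, 2, 1, 2],
    "num_orders": [100, 50, 70, 30],
})
df_center = pd.DataFrame({
    "center_id": [10, 20, 30],
    "center_type": ["TYPE_A", "TYPE_B", "TYPE_A"],
})
df_meal = pd.DataFrame({
    "meal_id": [1, 2],
    "category": ["Beverages", "Rice Bowl"],
})

# many order rows map to one center / one meal; validate raises if a key is duplicated
df = pd.merge(df_food, df_center, on="center_id", validate="many_to_one")
df = pd.merge(df, df_meal, on="meal_id", validate="many_to_one")

print(df.shape)  # one row per order, with the center and meal columns attached
```

The number of rows stays the same as in df_food, which is a quick sanity check that the merge did not duplicate or drop any orders.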
Right – now let’s jump into the different chart types we can create using matplotlib in Python!
1. Bar Graph using matplotlib
First, we want to find the most popular food item that customers have bought from the company.
I will be using the Pandas pivot_table function to find the total number of orders for each category of the
food item:
    # total number of orders for each food category
    table = pd.pivot_table(data=df, index='category', values='num_orders', aggfunc='sum')
    table
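To see what the pivot table is doing, here is a minimal sketch on toy data (the numbers are made up). It assumes only the two columns used above and shows that, for a single index column with a sum, pivot_table gives the same totals as a groupby.

```python
import pandas as pd

# toy data; the column names follow the article, the numbers are made up
df = pd.DataFrame({
    "category": ["Beverages", "Beverages", "Rice Bowl", "Pizza"],
    "num_orders": [100, 70, 50, 30],
})

# pivot_table with a single index column and a sum aggregation ...
table = pd.pivot_table(data=df, index="category", values="num_orders", aggfunc="sum")

# ... gives the same totals as a groupby followed by sum
totals = df.groupby("category")["num_orders"].sum()

print(table)
```

The difference is mostly in the shape of the result: pivot_table returns a dataframe with a num_orders column, while the groupby version returns a Series.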
Bar graphs are best used when we need to compare the quantity of categorical values within the same category.
    # bar graph
    plt.bar(table.index, table['num_orders'])

    # xticks
    plt.xticks(rotation=70)

    # x-axis label
    plt.xlabel('Food item')

    # y-axis label
    plt.ylabel('Quantity sold')

    # plot title
    plt.title('Most popular food')

    # save plot
    plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_6.png', dpi=300, bbox_inches='tight')

    # display plot
    plt.show()
It is always important to label your axes. You can do this by employing the plt.xlabel() and plt.ylabel()
functions. You can use plt.title() to set the title of the plot. If your xticks are overlapping, rotate them
using the rotation parameter in plt.xticks() so that they are easy for the audience to read.
You can save your plot using the plt.savefig() function by providing the file path as a parameter. Finally,
always display your plot using plt.show().
While analyzing the plot, we can see that Beverages were the most popular food item sold by the company.
Wait, was it because they were sold with almost all the meals? Was Rice Bowl the most popular food item?
Let’s divide the total food item order by the number of unique meals it is present in.
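One way this division could be done is sketched below. This is my own minimal version on toy data, not necessarily the code used for the plot: it assumes the merged dataframe has the category, meal_id and num_orders columns, and the numbers are made up.

```python
import pandas as pd

# toy merged data (made-up numbers): Beverages appear in three meals, Rice Bowl in one
df = pd.DataFrame({
    "category": ["Beverages", "Beverages", "Beverages", "Rice Bowl", "Rice Bowl"],
    "meal_id": [1, 2, 3, 4, 4],
    "num_orders": [100, 90, 80, 120, 60],
})

grouped = df.groupby("category")
total_orders = grouped["num_orders"].sum()    # total orders per category
unique_meals = grouped["meal_id"].nunique()   # number of distinct meals per category

# average orders per unique meal in each category
orders_per_meal = total_orders / unique_meals
print(orders_per_meal.sort_values(ascending=False))
```

In this toy example Beverages have more orders in total, but Rice Bowl comes out ahead once we account for the number of meals each category appears in.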
Yes, our hypothesis was correct! Rice Bowl was indeed the most popular food item sold by the company.
2. Pie Chart using matplotlib
A pie chart is suitable to show the proportional distribution of items within the same category.
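The pie chart code in this section reads from a dictionary called d_cuisine. Here is a hedged sketch of how such a dictionary could be built; the cuisine column name and the numbers are assumptions for illustration, and each value is the fraction of all orders that a cuisine accounts for.

```python
import pandas as pd

# toy merged data; 'cuisine' is an assumed column name, the numbers are made up
df = pd.DataFrame({
    "cuisine": ["Italian", "Thai", "Indian", "Continental", "Italian"],
    "num_orders": [30, 25, 20, 10, 15],
})

# fraction of all orders that each cuisine accounts for
share = df.groupby("cuisine")["num_orders"].sum() / df["num_orders"].sum()

# plain dict of cuisine -> fraction
d_cuisine = share.to_dict()
print(d_cuisine)
```

Note that groupby sorts the keys alphabetically, so with these four cuisines Italian ends up as the third entry of the dictionary.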
    # pie plot; explode offsets the third wedge
    plt.pie([x*100 for x in d_cuisine.values()], labels=list(d_cuisine.keys()), autopct='%0.1f', explode=[0, 0, 0.1, 0])

    # label the plot
    plt.title('Cuisine share %')
    plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_8.png', dpi=300, bbox_inches='tight')
    plt.show()
I used plt.pie() to draw the pie chart and adjusted its parameters to make it more appealing.
The autopct parameter was used to print the values within the pie chart up to 1 decimal place.
The explode parameter was used to offset the Italian wedge to make it stand out from the rest. This
makes it instantly clear to the viewer that people love Italian food!
A pie chart becomes hard to read when there are a lot of items within a category: each slice shrinks and the
items become difficult to tell apart.
3. Box Plot using matplotlib
Since we are discussing cuisine, let’s check out which one is the most expensive cuisine! For this, I will be
using a Box Plot.
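A box plot gives statistical information about the distribution of numeric data divided into different groups. It is useful for detecting
outliers within each group.
The lower, middle and upper parts of the box represent the 25th (Q1), 50th, and 75th (Q3) percentile values
respectively
The top whisker extends to the largest data point that is no greater than Q3+1.5*IQR, where IQR = Q3-Q1
The bottom whisker extends to the smallest data point that is no less than Q1-1.5*IQR
Points beyond the whiskers are outliers and are shown as scatter points
The position of the median within the box and the relative lengths of the whiskers show skewness in the data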
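The quantities in the list above can be computed by hand. Here is a minimal sketch with NumPy on made-up prices; it uses the 1.5*IQR rule described above, which, as far as I know, matches matplotlib's default of whis=1.5.

```python
import numpy as np

# made-up prices with one obvious outlier
prices = np.array([100, 120, 130, 150, 160, 170, 200, 600])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# fences: points beyond these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers stop at the most extreme data points that are still inside the fences
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the upper fence is 252.5, so the top whisker ends at 200 (the largest point below the fence) and 600 is flagged as an outlier.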
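The box plot code in this section expects a dictionary called c_price that maps each cuisine to its prices. Here is one hedged way to build it; the cuisine and checkout_price column names are assumptions, and the numbers are made up.

```python
import pandas as pd

# toy merged data; 'cuisine' and 'checkout_price' are assumed column names
df = pd.DataFrame({
    "cuisine": ["Italian", "Italian", "Thai", "Thai", "Continental", "Continental"],
    "checkout_price": [250.0, 300.0, 200.0, 220.0, 600.0, 650.0],
})

# dict of cuisine -> Series of prices, one entry per box in the plot
c_price = {name: group["checkout_price"] for name, group in df.groupby("cuisine")}

for name, prices in c_price.items():
    print(name, prices.median())
```

Each dictionary value becomes one box, and the keys can be used as the tick labels.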
    # plotting boxplot, one box per cuisine
    # (the labels parameter is called tick_labels from matplotlib 3.9 onwards)
    plt.boxplot(list(c_price.values()), labels=list(c_price.keys()))

    # x and y-axis labels
    plt.xlabel('Cuisine')
    plt.ylabel('Price')

    # plot title
    plt.title('Analysing cuisine price')

    # save and display
    plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_9.png', dpi=300, bbox_inches='tight')
    plt.show()
Continental cuisine was the most expensive cuisine served by the company! Even its median price is higher
than the maximum price of all the other cuisines.
A box plot does not show how the individual data points are distributed within each group.
4. Histogram using matplotlib
On the topic of prices, did we forget to inspect the base price and checkout price? Don’t worry, we will do
that using a histogram.
A histogram shows the distribution of numeric data over a continuous interval by segmenting the data into different bins. It is useful for
inspecting skewness in the data.
Since base_price is a continuous variable, we will use a histogram to inspect how the orders are spread
across its range. We can do this using plt.hist().
The tricky part is choosing the number of bins. By default, it is 10. However, there is no
correct answer and you can vary it according to your dataset to best visualize it.
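If you would rather not pick the number by trial and error, NumPy offers a few data-driven rules. The sketch below compares them on made-up prices (a stand-in for df['base_price']); the rule names are NumPy's, and plt.hist() accepts the same strings for its bins parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up prices standing in for df['base_price']
base_price = rng.normal(loc=300, scale=80, size=1000)

# compare a fixed bin count with three data-driven rules
bin_counts = {}
for rule in (10, "sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(base_price, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges delimit n - 1 bins
    print(rule, "->", bin_counts[rule], "bins")
```

These rules are only a starting point. It is still worth trying a couple of values around the suggestion and keeping the one that shows the shape of your data most clearly.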
    # plotting histogram
    plt.hist(df['base_price'], rwidth=0.9, alpha=0.3, color='blue', bins=15, edgecolor='red')

    # x and y-axis labels
    plt.xlabel('Base price range')
    plt.ylabel('Distinct order')

    # plot title
    plt.title('Inspecting price effect')

    # save and display the plot
    plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_10.png', dpi=300, bbox_inches='tight')
    plt.show()
I have chosen the number of bins as 15 and it is evident that most of the orders had a base price of ~300.
It is easy to confuse histograms with bar plots. But remember, histograms are used with continuous data whereas bar plots are used
with categorical data.
5. Line Plot and Subplots using matplotlib
A line plot is useful for visualizing the trend in a numerical value over a continuous time interval.
How are the weekly and monthly sales of the company varying? This is a critical business question that
makes or breaks the marketing strategy.
Before exploring that, I will create two lists for storing the week-wise and month-wise revenue of the
company:
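Here is a hedged sketch of how these two lists could be built. It rests on several assumptions of mine: that the data has week and checkout_price columns, that revenue is the checkout price times the number of orders, and that a month can be approximated as four consecutive weeks. The numbers are made up.

```python
import pandas as pd

# toy merged data; 'week' and 'checkout_price' are assumed column names
df = pd.DataFrame({
    "week": [1, 1, 2, 3, 4, 5, 6, 8],
    "checkout_price": [100.0, 200.0, 150.0, 100.0, 120.0, 200.0, 100.0, 50.0],
    "num_orders": [10, 5, 4, 10, 5, 3, 8, 20],
})

# revenue per row, assuming revenue = price paid * number of orders
df["revenue"] = df["checkout_price"] * df["num_orders"]

# approximate a month as a block of four consecutive weeks (weeks 1-4 -> month 1)
df["month"] = (df["week"] - 1) // 4 + 1

weekly = df.groupby("week")["revenue"].sum()
monthly = df.groupby("month")["revenue"].sum()

# plain lists, convenient to pass to plotting calls
week, week_revenue = weekly.index.tolist(), weekly.tolist()
month, month_revenue = monthly.index.tolist(), monthly.tolist()
print(month, month_revenue)
```

Keeping the x values (week, month) and the y values (revenue) in separate lists makes them easy to hand to a line plot.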
I will compare the revenue of the company in every week as well as in every month using two line plots
drawn side by side. For this, I will be using the plt.subplots() function.
Matplotlib subplots make it easy to view and compare different plots in the same figure.
To understand how this function works, you need to know what Figure, Axes, and Axis are in a matplotlib
plot.
Figure is the outermost container for the Matplotlib plot(s). There can be a single plot or multiple plots, called
Axes, within a Figure. Each of these Axes contains an x-axis and a y-axis, and each of those is known as an Axis.
The plt.subplots() function returns the figure and the axes. You can tell the function how you
want to arrange the axes within the figure using the nrows and ncols parameters.
You can even adjust the size of the figure using the figsize parameter.
When there is more than one subplot, the axes are returned as a NumPy array. To plot on a specific Axes, index
into the array as you would into a list. The rest of the plotting is done the same way as for simple plots.
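Here is a minimal sketch of two line plots drawn side by side with plt.subplots(). The revenue figures, colors and markers are made up for illustration; only the nrows, ncols and figsize parameters come from the description above.

```python
import matplotlib.pyplot as plt

# made-up revenue figures standing in for the week-wise and month-wise lists
week = [1, 2, 3, 4, 5, 6, 7, 8]
week_revenue = [20, 22, 19, 25, 27, 26, 30, 31]
month = [1, 2]
month_revenue = [86, 114]

# one Figure holding two Axes side by side
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

# ax is a NumPy array of Axes; index it to pick a subplot
ax[0].plot(week, week_revenue, color="green", marker="o")
ax[0].set_xlabel("Week")
ax[0].set_ylabel("Revenue")
ax[0].set_title("Weekly revenue")

ax[1].plot(month, month_revenue, color="skyblue", marker="o")
ax[1].set_xlabel("Month")
ax[1].set_ylabel("Revenue")
ax[1].set_title("Monthly revenue")

# keep labels and titles of neighbouring subplots from overlapping
fig.tight_layout()
```

Notice that on an Axes object the labelling methods are called set_xlabel(), set_ylabel() and set_title(), rather than plt.xlabel(), plt.ylabel() and plt.title(). Add plt.show() at the end to display the figure.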
We can see an increasing trend in the number of food orders with the number of weeks and months,
though the trend is not very strong.
6. Scatter Plot using matplotlib
Finally, I will try to analyze whether the center type had any effect on the number of orders. I will do
this by comparing a scatter plot, a boxplot and a bar graph in the same figure.
We have already seen the use of boxplots and bar graphs, but scatter plots have their own advantages.
Scatter plots are useful for showing the relationship between two variables. Any correlation between variables or outliers in the data
can be easily spotted using scatter plots.
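A scatter plot pairs nicely with a correlation coefficient, which puts a number on what the eye sees. Here is a small sketch on made-up data (not the food dataset) using np.corrcoef for the Pearson correlation.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# made-up data: y rises with x, plus some noise
x = rng.uniform(1, 7, size=200)
y = 3.0 * x + rng.normal(scale=1.0, size=200)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]

# scatter plot with the coefficient in the title
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title(f"Pearson r = {r:.2f}")
```

A value of r close to 1 or -1 means the points fall close to a straight line, while a value near 0 means there is little linear relationship. The scatter plot remains useful even then, because it can reveal non-linear patterns and outliers that a single number hides.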
    center_type_name = ['TYPE_A', 'TYPE_B', 'TYPE_C']

    # relation between op area and number of orders
    op_table = pd.pivot_table(df, index='op_area', values='num_orders', aggfunc='sum')

    # relation between center type and op area
    c_type = {}
    for i in center_type_name:
        c_type[i] = df[df['center_type'] == i].op_area

    # relation between center type and num of orders
    center_table = pd.pivot_table(df, index='center_type', values='num_orders', aggfunc='sum')

    # subplots: three rows, one column
    fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(8, 12))

    # scatter plot
    ax[0].scatter(op_table.index, op_table['num_orders'], color='pink')
    ax[0].set_xlabel('Operation area')
    ax[0].set_ylabel('Number of orders')
    ax[0].set_title('Does operation area affect num of orders?')
    # (the next line is cut off in the source)
    ax[0].annotate('optimum operation area of 4 km^2',xy=(4.2,1.1*10**7),xytext=(7,1.1*10**7),arrowprops=dict(facecolor='black', shri

    # boxplot (the labels parameter is called tick_labels from matplotlib 3.9 onwards)
    ax[1].boxplot(list(c_type.values()), labels=list(c_type.keys()))
    ax[1].set_xlabel('Center type')
    ax[1].set_ylabel('Operation area')
    ax[1].set_title('Which center type had the optimum operation area?')

    # bar graph
    ax[2].bar(center_table.index, center_table['num_orders'], alpha=0.7, color='orange', width=0.5)
    ax[2].set_xlabel('Center type')
    ax[2].set_ylabel('Number of orders')
    ax[2].set_title('Orders per center type')

    # show figure
    plt.tight_layout()
    plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_12.png', dpi=300, bbox_inches='tight')
    plt.show()
The scatter plot makes it instantly visible that the optimum operation area of a center is 4 sq. km. The
boxplot shows that the TYPE_A center type had the largest number of optimum-size centers, as indicated by its
compact box with a median around 4 sq. km. Because of this, TYPE_A centers had more orders placed by customers
than any other center type.
End Notes
You are now a step closer to creating wonderful plots in Matplotlib. However, the best way to master
plotting is to practice, practice and practice!
For this, I suggest you go through other such amazing datasets on the DataHack platform and visualize till
you dream in plots!
Next, you can go through the below resources to build your existing skillset:
Aniruddha Bhandari
I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize it and predict
the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from
the challenges that I face, with the community to make the world a better place!