Beginner Guide Matplotlib Data Visualization Exploration Python

The document provides an overview of matplotlib, the most popular Python library for data visualization and exploration. It discusses how matplotlib can be used to create various visualizations like bar graphs, pie charts, box plots, histograms, line charts, and scatter plots from data. It also includes code samples to demonstrate creating basic bar graphs and pie charts from sample food order data using matplotlib in Python. The code samples show how to customize visual aspects of the plots like labels, titles, saving figures, and displaying the final plots.

Uploaded by

udayalugolu6363
Copyright
© All Rights Reserved

A Beginner's Guide to matplotlib for Data Visualization and Exploration in Python
BEGINNER DATA VISUALIZATION PYTHON STRUCTURED DATA TECHNIQUE

matplotlib – The Most Popular Python Library for Data Visualization and Exploration

I love working with matplotlib in Python. It was the first visualization library I learned to master and it has
stayed with me ever since. There is a reason why matplotlib is the most popular Python library for data
visualization and exploration – the flexibility and agility it offers are unparalleled!

Matplotlib provides an easy but comprehensive visual approach to present our findings. There are a
number of visualizations we can choose from to present our results, as we’ll soon see in this tutorial.

From histograms to scatterplots, matplotlib lays down an array of colors, themes, palettes, and other
options to customize and personalize our plots. matplotlib is useful whether you’re performing data
exploration for a machine learning project or simply want to create dazzling and eye-catching charts.

Note: If you’re new to the world of Python, we highly recommend taking the below popular free courses:

Python for Data Science


Pandas for Data Analysis in Python

What is matplotlib?
Let’s put a formal definition to matplotlib before we dive into the crux of the article. If this is the first time
you’ve heard of matplotlib, here’s the official description:

“Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive
environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web
application servers, and four graphical user interface toolkits.”

You can draw up all sorts of charts and visualizations using matplotlib. I will be exploring the most common
plots in the matplotlib Python library in this tutorial. We will first understand the dataset at hand and then
start building different plots using matplotlib, including scatterplots and line charts!

Note: If you’re looking for a matplotlib alternative or want to explore other Python visualization libraries,
check out the below tutorial on Seaborn:

Become a Data Visualization Whiz with this Comprehensive Guide to Seaborn in Python

Here are the Visualizations We’ll Design using matplotlib

Bar Graph
Pie Chart
Box Plot
Histogram
Line Chart and Subplots
Scatter Plot

Understanding the Dataset and the Problem Statement

Before we get into the different visualizations and chart types, I want to spend a few minutes
understanding the data. This is a critical part of the machine learning pipeline and we should pay full
attention to it.

We will be analyzing the Food Demand Forecasting project in this matplotlib tutorial. The aim of this
project is to predict the number of food orders that customers will place in the upcoming weeks with the
company. We will, of course, only spend time on the exploration stage of the project.

Let us first import the relevant libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# use a stylesheet for neater plots
# (matplotlib 3.6 and later name this style 'seaborn-v0_8')
plt.style.use('seaborn')

I have used a matplotlib stylesheet to make our plots look neat and pretty. Here, I have used the ‘seaborn’
stylesheet. However, there are plenty of other stylesheets in Matplotlib which you can use to best suit your
presentation style.
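
To browse those other stylesheets, here is a minimal sketch that is not part of the original tutorial. The names `preferred` and `chosen` are hypothetical, and the fallback order is only an assumption; the point is that `plt.style.available` lists what your installed matplotlib ships with, so you can pick a style that actually exists.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# names of all stylesheets bundled with the installed matplotlib
available = plt.style.available

# the seaborn-like style is called 'seaborn' on older releases and
# 'seaborn-v0_8' on matplotlib 3.6+; fall back to 'default' otherwise
preferred = ['seaborn-v0_8', 'seaborn', 'ggplot']
chosen = next((name for name in preferred if name in available), 'default')

plt.style.use(chosen)
print('Using style:', chosen)
```
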
Our dataset has three dataframes: df_meal describing the meals, df_center describing the food centers,
and df_food describing the overall food order. Have a look at them below:

# every backslash in a Windows path must be doubled inside a normal string
df_meal = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\meal_info.csv')
df_meal.head()

df_center = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\fulfilment_center_info.csv')
df_center.head()

df_food = pd.read_csv('C:\\Users\\Dell\\Desktop\\train_food\\train_food.csv')
df_food.head()

I will first merge all the three dataframes into a single dataframe. This will make it easier to manipulate the
data while plotting it:

# attach the center details, then the meal details, to every order row
df = pd.merge(df_food, df_center, on='center_id')
df = pd.merge(df, df_meal, on='meal_id')
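
To see what the two merges do, here is a small sketch on made-up data. The toy frames `orders`, `centers` and `meals` are hypothetical stand-ins for df_food, df_center and df_meal, and the `validate` argument is my own addition: it asks pandas to raise an error if a key is unexpectedly duplicated in the lookup table.

```python
import pandas as pd

# toy stand-ins for df_food, df_center and df_meal (made-up values)
orders = pd.DataFrame({
    'center_id': [10, 10, 11],
    'meal_id': [1, 2, 1],
    'num_orders': [5, 7, 3],
})
centers = pd.DataFrame({'center_id': [10, 11],
                        'center_type': ['TYPE_A', 'TYPE_B']})
meals = pd.DataFrame({'meal_id': [1, 2],
                      'category': ['Beverages', 'Rice Bowl']})

# validate='many_to_one' raises MergeError if a key repeats on the right side
merged = pd.merge(orders, centers, on='center_id', validate='many_to_one')
merged = pd.merge(merged, meals, on='meal_id', validate='many_to_one')

# every order row survives because each key has a match in the lookup tables
print(len(merged) == len(orders))
print(merged)
```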

Right – now let’s jump into the different chart types we can create using matplotlib in Python!

1. Bar Graph using matplotlib

First, we want to find the most popular food item that customers have bought from the company.

I will be using the Pandas pivot_table function to find the total number of orders for each category of the
food item:

# total number of orders for each food category
# (recent pandas versions prefer the string 'sum' over np.sum here)
table = pd.pivot_table(data=df, index='category', values='num_orders', aggfunc='sum')
table
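
If pivot_table feels opaque, the sketch below shows on made-up data that this particular pivot is just a grouped sum. The frame `df_toy` and its values are hypothetical; only the column names follow the tutorial.

```python
import pandas as pd

# made-up orders; column names follow the tutorial
df_toy = pd.DataFrame({
    'category': ['Beverages', 'Rice Bowl', 'Beverages', 'Pizza'],
    'num_orders': [5, 7, 3, 4],
})

# pivot table: one row per category, summing num_orders
table_toy = pd.pivot_table(data=df_toy, index='category',
                           values='num_orders', aggfunc='sum')

# the same totals with groupby, returned as a Series instead of a DataFrame
grouped = df_toy.groupby('category')['num_orders'].sum()

print(table_toy)
print(grouped)
```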

Next, I will try to visualize this using a bar graph.

Bar graphs are best used when we need to compare the quantity of categorical values within the same category.

A bar graph is generated using plt.bar() in matplotlib:

# bar graph
plt.bar(table.index, table['num_orders'])

# rotate the xticks so they do not overlap
plt.xticks(rotation=70)

# x-axis label
plt.xlabel('Food item')

# y-axis label
plt.ylabel('Quantity sold')

# plot title
plt.title('Most popular food')

# save plot
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_6.png', dpi=300, bbox_inches='tight')

# display plot
plt.show()

It is always important to label your axes. You can do this by employing the plt.xlabel() and plt.ylabel()
functions. You can use plt.title() for naming the title of the plot. If your xticks are overlapping, rotate them
using the rotation parameter in plt.xticks() so that they are easy to view for the audience.

You can save your plot using the plt.savefig() function by providing the file path as a parameter. Finally,
always display your plot using plt.show().
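
The same labelling and saving steps can be sketched with the object-oriented interface. This is only an illustration on made-up numbers: the `categories` and `orders` values and the temporary output path are assumptions. Note that the figure is saved before anything is shown, since in a script a figure may already be closed once plt.show() returns.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so nothing pops up on screen
import matplotlib.pyplot as plt

categories = ['Beverages', 'Rice Bowl', 'Sandwich']  # made-up values
orders = [40, 33, 17]

fig, ax = plt.subplots()
ax.bar(categories, orders)

# axis labels and title, the Axes-method equivalents of plt.xlabel() etc.
ax.set_xlabel('Food item')
ax.set_ylabel('Quantity sold')
ax.set_title('Most popular food')

# rotate the x tick labels so long names do not overlap
ax.tick_params(axis='x', labelrotation=70)

# save first, then show or close the figure
out_path = os.path.join(tempfile.mkdtemp(), 'bar_example.png')
fig.savefig(out_path, dpi=300, bbox_inches='tight')
plt.close(fig)
print('Saved to', out_path)
```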

While analyzing the plot, we can see that Beverages were the most popular food item sold by the company.
Wait, was it because they were sold with almost all the meals? Was Rice Bowl the most popular food item?

Let’s divide the total orders for each food category by the number of unique meals in that category.

# dictionary for average orders per meal in each food category
item_count = {}

for category in table.index:
    # number of distinct meals that belong to this category
    n_meals = df_meal[df_meal['category'] == category].shape[0]
    item_count[category] = table.loc[category, 'num_orders'] / n_meals

# bar plot
plt.bar(list(item_count.keys()), list(item_count.values()), color='orange')

# adjust xticks
plt.xticks(rotation=70)

# label x-axis
plt.xlabel('Food item')

# label y-axis (the bars show orders divided by number of meals)
plt.ylabel('Orders per meal')

# label the plot
plt.title('Orders per meal by food item')

# save plot
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_7.png', dpi=300, bbox_inches='tight')

# display plot
plt.show()

Yes, our hypothesis was correct! Rice Bowl was indeed the most popular food item sold by the company.
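
The normalisation above can also be written without an explicit loop. Here is a sketch on made-up data (the frames `meal_info` and `orders` and all their values are hypothetical) that divides total orders per category by the number of distinct meals in it:

```python
import pandas as pd

# made-up meal catalogue: three beverages, one rice bowl
meal_info = pd.DataFrame({
    'meal_id': [1, 2, 3, 4],
    'category': ['Beverages', 'Beverages', 'Beverages', 'Rice Bowl'],
})
# made-up orders
orders = pd.DataFrame({
    'meal_id': [1, 2, 3, 4, 4],
    'num_orders': [10, 20, 30, 25, 25],
})

merged = orders.merge(meal_info, on='meal_id')

# total orders per category, and number of distinct meals per category
total_orders = merged.groupby('category')['num_orders'].sum()
meals_per_category = meal_info.groupby('category')['meal_id'].nunique()

# pandas aligns both Series on the category index before dividing
orders_per_meal = total_orders / meals_per_category
print(orders_per_meal)
```

In this toy data Beverages has more total orders, yet Rice Bowl has more orders per meal, which is the same effect the plot above reveals.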

Bar graphs should not be used for continuous values.

2. Pie Chart using matplotlib

Let us now see the ratio of orders from each cuisine.

A pie chart is suitable to show the proportional distribution of items within the same category.

# dictionary for cuisine and its share of total orders
d_cuisine = {}

# total number of orders
total = df['num_orders'].sum()

# find ratio of orders per cuisine
for c in df['cuisine'].unique():
    # number of orders for the cuisine
    c_order = df[df['cuisine'] == c]['num_orders'].sum()
    d_cuisine[c] = c_order / total

Let’s plot the pie chart:

# pie plot; explode offsets the third wedge from the rest
plt.pie([x * 100 for x in d_cuisine.values()], labels=list(d_cuisine.keys()), autopct='%0.1f', explode=[0, 0, 0.1, 0])

# label the plot
plt.title('Cuisine share %')

# save and display
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_8.png', dpi=300, bbox_inches='tight')
plt.show()

I used plt.pie() to draw the pie chart and adjusted its parameters to make it more appealing.
The autopct parameter was used to print the values within the pie chart up to 1 decimal place.
The explode parameter was used to offset the Italian wedge to make it stand out from the rest. This
makes it instantly clear to the viewer that people love Italian food!

A pie chart is rendered useless when there are a lot of items within a category. This will decrease the size of each slice and there will
be no distinction between the items.
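
The explode list in the code above is hard-coded for four cuisines in a fixed order. As a sketch of one way to avoid that, the snippet below builds it from the data so that the largest wedge is always the one pulled out. The `shares` values are made up and are not the real results from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# made-up cuisine shares (not the actual figures from the dataset)
shares = {'Thai': 0.25, 'Indian': 0.22, 'Italian': 0.36, 'Continental': 0.17}

labels = list(shares)
largest = max(shares, key=shares.get)

# offset only the wedge with the largest share
explode = [0.1 if name == largest else 0 for name in labels]

fig, ax = plt.subplots()
ax.pie(list(shares.values()), labels=labels, autopct='%0.1f', explode=explode)
ax.set_title('Cuisine share %')
plt.close(fig)
print(largest, explode)
```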

3. Box Plot using matplotlib

Since we are discussing cuisine, let’s check out which one is the most expensive cuisine! For this, I will be
using a Box Plot.

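A box plot gives statistical information about the distribution of numeric data divided into different groups. It is useful for detecting
outliers within each group.

The lower, middle and upper parts of the box represent the 25th, 50th, and 75th percentile values (Q1, the median, and Q3)
respectively
The top whisker extends to the largest data point that is at most Q3+1.5*IQR
The bottom whisker extends to the smallest data point that is at least Q1-1.5*IQR
Outliers, the points beyond the whiskers, are shown as scatter points
The position of the median within the box and the lengths of the whiskers show skewness in the data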
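
The quantities in the list above can be worked out by hand. Here is a small sketch with NumPy on a made-up sample; the variable names are my own, and it assumes the default whisker factor of 1.5 that plt.boxplot() uses.

```python
import numpy as np

# made-up sample with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# fences: the limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers stop at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# everything outside the fences is drawn as an individual point
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```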

# dictionary for base price per cuisine
c_price = {}
for i in df['cuisine'].unique():
    c_price[i] = df[df['cuisine'] == i].base_price

Plotting the boxplot below:

# plotting boxplot
# (matplotlib 3.9 and later rename the labels parameter to tick_labels)
plt.boxplot(list(c_price.values()), labels=list(c_price.keys()))

# x and y-axis labels
plt.xlabel('Cuisine')
plt.ylabel('Price')

# plot title
plt.title('Analysing cuisine price')

# save and display
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_9.png', dpi=300, bbox_inches='tight')
plt.show()

Continental cuisine was the most expensive cuisine served by the company! Even its median price is higher
than the maximum price of all the other cuisines.

A box plot does not show the distribution of data points within each group.

4. Histogram using matplotlib

On the topic of prices, did we forget to inspect the base price and checkout price? Don’t worry, we will do
that using a histogram.

A histogram shows the distribution of numeric data through a continuous interval by segmenting data into different bins. It is useful for
inspecting skewness in the data.

Since base_price is a continuous variable, we will inspect how the orders are distributed across its range using a
histogram. We can do this using plt.hist().

The tricky part is choosing the number of bins. By default, it is 10. However, there is no
correct answer and you can vary it according to your dataset to best visualize it.

# plotting histogram
plt.hist(df['base_price'], rwidth=0.9, alpha=0.3, color='blue', bins=15, edgecolor='red')

# x and y-axis labels
plt.xlabel('Base price range')
plt.ylabel('Distinct order')

# plot title
plt.title('Inspecting price effect')

# save and display the plot
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_10.png', dpi=300, bbox_inches='tight')
plt.show()

I have chosen the number of bins as 15 and it is evident that most of the orders had a base price of ~300.
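
To get a feel for how the bin count changes the picture, here is a sketch using NumPy on synthetic prices. The data is randomly generated and is not from the food dataset. np.histogram computes the same counts and bin edges that plt.hist() draws, and the bins argument of plt.hist() likewise accepts either an integer or a rule name such as 'auto'.

```python
import numpy as np

# synthetic prices centred around 300 (not the real base_price column)
rng = np.random.default_rng(0)
prices = rng.normal(loc=300, scale=80, size=1000)

# same data, two different bin counts
counts_10, edges_10 = np.histogram(prices, bins=10)
counts_15, edges_15 = np.histogram(prices, bins=15)

# let NumPy pick the bin edges with its 'auto' rule
edges_auto = np.histogram_bin_edges(prices, bins='auto')

# n bins always produce n + 1 edges, and every point lands in some bin
print(len(edges_10), len(edges_15), len(edges_auto))
print(counts_10.sum(), counts_15.sum())
```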

It is easy to confuse histograms with bar plots. But remember, histograms are used with continuous data whereas bar plots are used
with categorical data.

5. Line Plot and Subplots using matplotlib

A line plot is useful for visualizing the trend in a numerical value over a continuous time interval.

How are the weekly and monthly sales of the company varying? This is a critical business question that
makes or breaks the marketing strategy.

Before exploring that, I will create lists for storing the week-wise and month-wise revenue of the
company:

# new revenue column (vectorized, so no row-wise apply is needed)
df['revenue'] = df['checkout_price'] * df['num_orders']

# new month column: every 4 weeks are counted as one month
df['month'] = df['week'] // 4

# lists to store month-wise revenue
# (loop over the values actually present so the last month is not dropped)
month = sorted(df['month'].unique())
month_order = [df[df['month'] == m].revenue.sum() for m in month]

# lists to store week-wise revenue
week = sorted(df['week'].unique())
week_order = [df[df['week'] == w].revenue.sum() for w in week]
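
The same week-wise and month-wise totals can also be obtained with groupby. The sketch below uses a made-up frame `toy`; its column names follow the tutorial but the values are invented for illustration.

```python
import pandas as pd

# made-up orders; column names follow the tutorial
toy = pd.DataFrame({
    'week': [1, 1, 2, 3, 4, 5],
    'checkout_price': [100.0, 150.0, 200.0, 100.0, 120.0, 80.0],
    'num_orders': [2, 1, 3, 4, 5, 10],
})

toy['revenue'] = toy['checkout_price'] * toy['num_orders']
toy['month'] = toy['week'] // 4  # same 4-weeks-per-month rule as above

# one total per week and per month, indexed by the week or month number
weekly = toy.groupby('week')['revenue'].sum()
monthly = toy.groupby('month')['revenue'].sum()

print(weekly)
print(monthly)
```

The resulting Series can be passed straight to a line plot, for example as plot(weekly.index, weekly.values).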

I will compare the revenue of the company in every week as well as in every month using two line-plots
drawn side by side. For this, I will be using the plt.subplots() function.

Matplotlib subplots make it easy to view and compare different plots in the same figure.

To understand how this function works, you need to know what Figure, Axes, and Axis are in a matplotlib
plot.

Figure is the outermost container for the Matplotlib plot(s). There can be a single plot or multiple plots, called
Axes, within a Figure. Each of these Axes contains an x-axis and a y-axis, each known as an Axis.

The plt.subplots() function returns the figure and axes. You can specify how you want to arrange the axes
within the figure using the nrows and ncols parameters.
You can even adjust the size of the figure using the figsize parameter.

When there is more than one subplot, the Axes are returned as a NumPy array. To plot on a specific Axes, you can
index into the array as you would a list. The rest of the plotting is done the same way as simple plots:
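
Here is a short sketch of what plt.subplots() hands back for different layouts. The variable names are my own; the point is how the shape of the returned Axes container depends on nrows and ncols.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# one row, two columns: a 1-D array holding two Axes
fig_row, ax_row = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# two rows, two columns: a 2-D array indexed as grid[row, col]
fig_grid, grid = plt.subplots(nrows=2, ncols=2)

# a single subplot: a lone Axes object, not an array
fig_one, single = plt.subplots()

# squeeze=False always returns a 2-D array, whatever the layout
fig_2d, always_2d = plt.subplots(nrows=1, ncols=2, squeeze=False)

print(type(ax_row).__name__, ax_row.shape)
print(grid.shape, always_2d.shape)
print(type(single).__name__)

plt.close('all')
```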

# subplots returns a Figure and an array of Axes
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))

# manipulating the first Axes
ax[0].plot(week, week_order)
ax[0].set_xlabel('Week')
ax[0].set_ylabel('Revenue')
ax[0].set_title('Weekly income')

# manipulating the second Axes
ax[1].plot(month, month_order)
ax[1].set_xlabel('Month')
ax[1].set_ylabel('Revenue')
ax[1].set_title('Monthly income')

# save and display the plot
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_11.png', dpi=300, bbox_inches='tight')
plt.show()

We can see an increasing trend in revenue over the weeks and months,
though the trend is not very strong.

6. Scatter Plot using matplotlib

Finally, I will try to analyze whether the center type had any effect on the number of orders.
I will do this by comparing a scatter plot, a boxplot and a bar graph in the same figure.

We have already seen the use of boxplots and bar graphs, but scatter plots have their own advantages.
Scatter plots are useful for showing the relationship between two variables. Any correlation between variables or outliers in the data
can be easily spotted using scatter plots.
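
A scatter plot shows a relationship visually, and a correlation coefficient puts a number on it. The sketch below pairs the two on made-up values; `op_area` and `num_orders` here are small hypothetical arrays, not columns from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# made-up values with a roughly linear relationship
op_area = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
num_orders = np.array([11.0, 19.0, 32.0, 38.0, 52.0])

# Pearson correlation: close to +1 or -1 means a strong linear relationship
r = np.corrcoef(op_area, num_orders)[0, 1]

fig, ax = plt.subplots()
ax.scatter(op_area, num_orders, color='pink')
ax.set_xlabel('Operation area')
ax.set_ylabel('Number of orders')
ax.set_title('Correlation: %.2f' % r)
plt.close(fig)

print(round(r, 2))
```

Keep in mind that the correlation coefficient only captures linear relationships, so it is worth looking at the scatter plot as well.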

center_type_name = ['TYPE_A', 'TYPE_B', 'TYPE_C']

# relation between op area and number of orders
op_table = pd.pivot_table(df, index='op_area', values='num_orders', aggfunc='sum')

# relation between center type and op area
c_type = {}
for i in center_type_name:
    c_type[i] = df[df['center_type'] == i].op_area

# relation between center type and num of orders
center_table = pd.pivot_table(df, index='center_type', values='num_orders', aggfunc='sum')

# subplots
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(8, 12))

# scatter plot
ax[0].scatter(op_table.index, op_table['num_orders'], color='pink')
ax[0].set_xlabel('Operation area')
ax[0].set_ylabel('Number of orders')
ax[0].set_title('Does operation area affect num of orders?')
ax[0].annotate('optimum operation area of 4 km^2',xy=(4.2,1.1*10**7),xytext=(7,1.1*10**7),arrowprops=dict(facecolor='black', shri

# boxplot
ax[1].boxplot(list(c_type.values()), labels=list(c_type.keys()))
ax[1].set_xlabel('Center type')
ax[1].set_ylabel('Operation area')
ax[1].set_title('Which center type had the optimum operation area?')

# bar graph
ax[2].bar(center_table.index, center_table['num_orders'], alpha=0.7, color='orange', width=0.5)
ax[2].set_xlabel('Center type')
ax[2].set_ylabel('Number of orders')
ax[2].set_title('Orders per center type')

# show figure
plt.tight_layout()
plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_12.png', dpi=300, bbox_inches='tight')
plt.show()

The scatter plot makes it instantly visible that the optimum operation area of a center is 4 km sq. The
boxplot shows that the TYPE_A center type had the largest number of optimum-size centers, as indicated by its
compact box with a median around 4 km sq. This may explain why they had more orders placed by customers
than any other center type.

End Notes

You are now a step closer to creating wonderful plots in Matplotlib. However, the best way to master
plotting is to practice, practice and practice!

For this, I suggest you go through other such amazing datasets on the DataHack platform and visualize till
you dream in plots!

Next, you can go through the below resources to build on your existing skill set:

Python for Data Science


Pandas for Data Analysis in Python
Tableau from Scratch: Become a Data Visualization Rockstar

Article Url - https://www.analyticsvidhya.com/blog/2020/02/beginner-guide-matplotlib-data-visualization-exploration-python/

Aniruddha Bhandari
I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize it and predict
the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from
the challenges that I face, with the community to make the world a better place!
