Data Visualization with Python
readme.md 11/15/2022
Introduction to Matplotlib
Backend Layer: Does the heavy lifting by communicating with the drawing toolkits on your machine. It is the most complex layer.
Artist Layer: Allows full control and fine-tuning of the Matplotlib figure, the top-level container for all plot elements.
Scripting Layer: The lightest of the three interfaces, designed to make Matplotlib work like a MATLAB script.
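# Setup for the artist-layer example. The original setup lines are missing here;
# the Agg canvas below is an assumption.
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure
import numpy as np

fig = Figure()
canvas = FigureCanvas(fig)
x = np.random.randn(10000)

# Now use a figure method to create an Axes artist; the Axes artist is
# added automatically to the figure container fig.axes.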
# Here "111" is from the MATLAB convention: create a grid with 1 row and 1
# column, and use the first cell in that grid for the location of the new
# Axes.
ax = fig.add_subplot(111)
# Call the Axes method hist to generate the histogram; hist creates a
# sequence of Rectangle artists for each histogram bar and adds them
# to the Axes container. Here "100" means create 100 bins.
ax.hist(x, 100)
import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(10000)
plt.hist(x, 100)
plt.title(r'Normal distribution with $\mu=0, \sigma=1$')
plt.savefig('matplotlib_histogram.png')
plt.show()
%matplotlib notebook
import matplotlib.pyplot as plt
plt.plot(5, 5, 'o')
%matplotlib is a magic function; to have plots rendered inline in the browser, you pass in inline as the backend.
Matplotlib has a number of different backends available. One limitation of the inline backend is that you cannot modify a figure once it's rendered.
So after rendering the above figure with the inline backend, there is no way to add, for example, a figure title or axis labels. You will need to generate a new plot and add the title and the axis labels before calling the show function.
A backend that overcomes this limitation is the notebook backend. With the notebook backend in place, each plt function you call checks whether an active figure exists and, if so, is applied to that figure. If no figure exists, a new one is rendered. So when we call the plt.plot function to plot a circular mark at position (5, 5), the backend first checks whether an active figure exists.
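The active-figure behaviour described above can be sketched as a plain script. This is only an illustration: it assumes the non-interactive Agg backend (chosen so the sketch runs without a notebook; it is not the notebook backend itself), and the title and labels are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs as a script
import matplotlib.pyplot as plt

plt.close("all")                 # start with no active figure
plt.plot(5, 5, "o")              # no figure exists yet, so pyplot creates one
fig_after_plot = plt.gcf()

# These calls find the figure created above and modify it in place.
plt.title("A single point")
plt.xlabel("X")
plt.ylabel("Y")
fig_after_labels = plt.gcf()

same_figure = fig_after_plot is fig_after_labels
axes_title = fig_after_labels.axes[0].get_title()
print(same_figure, axes_title)
```

Every pyplot call after the first one lands on the same figure, which is the behaviour the notebook backend exposes interactively.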
Matplotlib - Pandas
Another great thing about Matplotlib is that pandas builds on it: Series and DataFrame objects have a built-in plot method that draws with Matplotlib.
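A minimal sketch of the pandas plotting interface follows. The data frame is hypothetical and the Agg backend is an assumption made so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, only to have something to plot.
df = pd.DataFrame({"total": [10, 25, 15, 30]}, index=[2010, 2011, 2012, 2013])

# Series.plot returns a Matplotlib Axes, so the usual Axes methods still apply.
ax = df["total"].plot(kind="line", title="Hypothetical totals per year")
ax.set_xlabel("Year")
ax.set_ylabel("Total")

is_matplotlib_axes = isinstance(ax, matplotlib.axes.Axes)
```

Because the return value is an ordinary Axes object, pandas plots can be customised with the same calls used elsewhere in this document.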
Histogram
A histogram is a graph that shows the frequency distribution of numerical data using rectangles. The height of a rectangle (the vertical axis) represents the frequency of a variable (the amount, or how often values in that range appear). The width of the rectangle (horizontal axis) represents the range of values covered by the bin (for instance, minutes, years, or ages).
A default histogram of the distribution of immigration to Canada in 2013 has bins that are not aligned with the tick marks on the horizontal axis. This can make the histogram hard to read.
One way to solve this issue is to borrow the histogram function from the NumPy library. By default, histogram:
partitions the spread of the data in column 2013 into 10 bins of equal width,
computes the number of data points that fall in each bin,
returns this frequency (count) and the bin edges (bin_edges).
Passing bin_edges as the tick positions of the horizontal axis then aligns the ticks with the bins.
plt.ylabel('Number of Countries')
plt.xlabel('Number of Immigrants')
plt.show()
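The NumPy-based fix described above can be sketched as follows. The data is randomly generated as a stand-in for the 2013 column (an assumption, since the real data is not loaded here), and the Agg backend is assumed so that the snippet runs as a script.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs as a script
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 35000, size=195)   # hypothetical stand-in for the 2013 column

# np.histogram uses 10 equal-width bins by default and returns counts and edges.
count, bin_edges = np.histogram(values)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(values, bins=bin_edges)
ax.set_xticks(bin_edges)                    # ticks now coincide with the bin edges
ax.set_ylabel("Number of Countries")
ax.set_xlabel("Number of Immigrants")
```

Ten bins produce eleven edges, so there is one tick at each boundary between bars.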
Bar Charts
A bar chart is a very popular visualization tool. Unlike a histogram, a bar chart (also known as a bar graph) is a type of plot where the length of each bar is proportional to the value of the item that it represents. It is commonly used to compare the values of a variable at a given point in time.
df_continents['Total'].plot(
    kind='pie',
    figsize=(15, 8),
    autopct='%1.1f%%',
    startangle=90,
    shadow=True,
    labels=None,           # turn off labels on pie chart
    pctdistance=1.12,      # the ratio between the center of each pie slice and the start of the text generated by autopct
    colors=colors_list,    # add custom colors
    explode=explode_list   # 'explode' lowest 3 continents
)
plt.axis('equal')
# add legend
plt.legend(labels=df_continents.index, loc='upper left')
plt.show()
Box Plots
In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles.
The spacings in each subsection of the box plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary.
In the most straightforward method, the boundary of the lower whisker is the minimum value of the data set, and the boundary of the upper whisker is the maximum value of the data set.
Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the
upper quartile (Q3), a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest
observed data point from the dataset that falls within this distance.
Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile (Q1) and a whisker is drawn
down to the lowest observed data point from the dataset that falls within this distance. Because the whiskers
must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the
same for both sides. All other observed data points outside the boundary of the whiskers are plotted as
outliers. The outliers can be plotted on the box plot as a dot, a small circle, a star, etc.
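The 1.5 IQR rule described above can be worked through by hand. The sample below is hypothetical and chosen so that exactly one point falls outside the whiskers.

```python
import numpy as np

data = np.array([2, 4, 5, 7, 8, 9, 10, 12, 30])   # hypothetical sample

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Distances of 1.5 * IQR measured out from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observed points that are still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

# Everything beyond the fences is drawn as an outlier.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(lower_whisker, upper_whisker, outliers)
```

Here the fences sit at -2.5 and 17.5, but the whiskers stop at 2 and 12 because they must end at observed points, which is why the two whiskers have different lengths.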
df_china.describe()
Country China
count 34.000000
mean 19410.647059
std 13568.230790
min 1527.000000
25% 5512.750000
50% 19945.000000
75% 31568.500000
max 42584.000000
Scatter Plots
We can mathematically analyze the trend using a regression line (line of best fit).
To get the equation of the line of best fit, we will use NumPy's polyfit() method, passing in the following:
# we can use the sum() method to get the total population per year
df_tot = pd.DataFrame(df_can[years].sum(axis=0))
# change the years to type int (useful for regression later on)
df_tot.index = map(int, df_tot.index)
# reset the index so that the years become a column (needed before naming two columns)
df_tot.reset_index(inplace=True)
# rename columns
df_tot.columns = ['year', 'total']
plt.show()
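The line-of-best-fit step mentioned in this section can be sketched with np.polyfit. The yearly totals below are hypothetical and lie exactly on a line, so the fitted coefficients are easy to check; real data would scatter around the line.

```python
import numpy as np

# Hypothetical yearly totals that grow by 5000 per year from 100000 in 1980.
years = np.arange(1980, 1990)
totals = 5000.0 * (years - 1980) + 100000.0

# deg=1 fits total = slope * year + intercept by least squares.
slope, intercept = np.polyfit(years, totals, deg=1)

# The fitted equation can then be evaluated for any year.
predicted_1990 = slope * 1990 + intercept
print(f"total = {slope:.0f} * year + {intercept:.0f}")
```

np.polyfit returns the coefficients from the highest degree down, so for a straight line the slope comes first and the intercept second.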
Bubble Plots
To plot two different scatter plots in one figure, we can pass the Axes of the first plot to the second plot via the ax parameter.
We will also pass in the weights using the s parameter. Given that the normalized weights are between 0 and 1, they won't be visible on the plot. Therefore, we will:
multiply the weights by 2000 to scale them up on the graph, and
add 10 to compensate for the minimum value (which has a weight of 0 and would therefore stay at 0 after scaling by $\times 2000$).
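The weights passed to s can be sketched as below. The source does not show how norm_china is computed, so min-max scaling is an assumption here, and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical yearly immigration counts for one country.
china = pd.Series([5123, 6682, 3308, 1863, 1527],
                  index=[1980, 1981, 1982, 1983, 1984])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
norm_china = (china - china.min()) / (china.max() - china.min())

# Scale up for visibility, then add an offset so the minimum still gets a marker.
marker_sizes = norm_china * 2000 + 10
print(marker_sizes)
```

Without the offset of 10, the year with the smallest count would get a marker of size 0 and vanish from the plot.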
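# transposed dataframe
df_can_t = df_can[years].transpose()
# cast the years (the index) to type int so they can be used on a numeric axis
df_can_t.index = map(int, df_can_t.index)
# let's label the index. This will automatically be the column name when we reset the index
df_can_t.index.name = 'Year'
# reset the index so that 'Year' becomes a column that the scatter plot can use as x
df_can_t.reset_index(inplace=True)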
# China
ax0 = df_can_t.plot(kind='scatter',
x='Year',
y='China',
figsize=(15, 8),
alpha=0.5, # transparency
color='green',
s=norm_china * 2000 + 10, # pass in weights
xlim=(1975, 2015)
)
# India
ax1 = df_can_t.plot(kind='scatter',
x='Year',
y='India',
alpha=0.5,
color="blue",
s=norm_india * 2000 + 10,
ax=ax0
)
ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from China and India from 1980 to 2013')
ax0.legend(['China', 'India'], loc='upper left', fontsize='x-large')
To create a waffle chart, use the function create_waffle_chart, which takes the following parameters as input:
plt.xticks([])
plt.yticks([])
total_values = values_cumsum[len(values_cumsum) - 1]
# create legend
legend_handles = []
for i, category in enumerate(categories):
    if value_sign == '%':
        label_str = category + ' (' + str(values[i]) + value_sign + ')'
    else:
        label_str = category + ' (' + value_sign + str(values[i]) + ')'
    color_val = colormap(float(values_cumsum[i]) / total_values)
    legend_handles.append(mpatches.Patch(color=color_val, label=label_str))
Word Clouds
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
df_can = pd.read_excel(
'Canada.xlsx',
sheet_name='Canada by Citizenship',
skiprows=range(20),
skipfooter=2)
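# rename the raw columns so that the country name is available as 'Country'
df_can.rename(columns={'OdName': 'Country', 'AreaName': 'Continent', 'RegName': 'Region'}, inplace=True)
# for sake of consistency, let's also make all column labels of type string
df_can.columns = list(map(str, df_can.columns))
# set the country name as index - useful for quickly looking up countries using .loc method
df_can.set_index('Country', inplace=True)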
# years that we will be using in this lesson - useful for plotting later on
years = list(map(str, range(1980, 2014)))
print('data dimensions:', df_can.shape)
total_immigration = df_can['Total'].sum()
# total_immigration
max_words = 90
word_string = ''
for country in df_can.index.values:
    # check if country's name is a single-word name
    if country.count(" ") == 0:
        repeat_num_times = int(df_can.loc[country, 'Total'] / total_immigration * max_words)
        word_string = word_string + ((country + ' ') * repeat_num_times)
wordcloud = WordCloud(background_color='white').generate(word_string)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
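df_dsn_tot = pd.DataFrame(df_dsn[years].sum(axis=0))
# change the years to type float (useful for regression later on)
df_dsn_tot.index = map(float, df_dsn_tot.index)
# reset the index so that the years become a column (needed before naming two columns)
df_dsn_tot.reset_index(inplace=True)
# rename columns
df_dsn_tot.columns = ['year', 'total']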
plt.figure(figsize=(15, 10))
sns.set(font_scale=1.5)
sns.set_style('whitegrid')
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import folium
df_can = pd.read_excel(
'Canada.xlsx',
sheet_name='Canada by Citizenship',
skiprows=range(20),
skipfooter=2)
df_can.rename(columns={'OdName':'Country',
'AreaName':'Continent','RegName':'Region'}, inplace=True)
# for sake of consistency, let's also make all column labels of type string
df_can.columns = list(map(str, df_can.columns))
# years that we will be using in this lesson - useful for plotting later on
years = list(map(str, range(1980, 2014)))
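print('data dimensions:', df_can.shape)
import json
with open('world_countries.json') as f:
    world_geo = json.load(f)
# create a plain world map to draw the choropleth on
world_map = folium.Map(location=[0, 0], zoom_start=2)
# generate choropleth map using the total immigration of each country to Canada from 1980 to 2013
# (folium.Choropleth is used here because newer folium releases no longer provide Map.choropleth)
folium.Choropleth(
    geo_data=world_geo,
    data=df_can,
    columns=['Country', 'Total'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Immigration to Canada'
).add_to(world_map)
# display map
world_map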
Dash is a Python framework for building analytical web applications. It is written on top of Flask, Plotly.js,
and React.js. Dash is well-suited for building data visualization apps with highly custom user interfaces.
Panel works with visualizations from Bokeh, Matplotlib, HoloViews, and many other Python plotting
libraries, making them instantly viewable either individually or when combined with interactive widgets
that control them.
Voilà turns Jupyter notebooks into standalone web applications. It can be used with separate layout
tools like jupyter-flex or templates like voila-vuetify.
Streamlit turns data scripts into shareable web apps, following 3 main principles:
embrace Python scripting,
treat widgets as variables, and
reuse data and computation.
Plotly python
Plotly cheatsheet
Open-source datasets
plotly.graph_objects
If Plotly Express does not provide a good starting point, it is possible to use the more generic go.Scatter
class from plotly.graph_objects. Whereas plotly.express has two functions, scatter and line, go.Scatter
can be used for plotting both points (markers) and lines, depending on the value of mode. The different options
of go.Scatter are documented in its reference page.
# using plotly
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
airline_data = pd.read_csv('airline_data.csv',
encoding = "ISO-8859-1",
dtype={'Div1Airport': str, 'Div1TailNum': str,
'Div2Airport': str, 'Div2TailNum': str})
print("Data Shape:", airline_data.shape)
df_sample500 = airline_data.sample(n=500, random_state=42)
# df_sample500.head()
print("Sample Shape:", df_sample500.shape)
Extract average monthly arrival delay time and see how it changes over the year
# Group the data by Month and compute average over arrival delay time.
line_data = df_sample500.groupby('Month')['ArrDelay'].mean().reset_index()
# Display the data
line_data
plotly.express
Bar Charts
# Group the data by destination state. Compute total number of flights to each state.
bar_data = df_sample500.groupby(['DestState'])['Flights'].sum().reset_index()
# Use plotly express bar chart function px.bar. Provide input data, x and y axis variable, and title of the chart.
# This will give total number of flights to the destination state.
fig = px.bar(bar_data, x="DestState", y="Flights",
             title='Total number of flights to the destination state')
fig.show()
Bubble Charts
A bubble chart is a scatter plot in which a third dimension of the data is shown through the size of markers.
For other types of scatter plot, see the scatter plot documentation.
Histograms
Pie Chart
Sunburst Charts
Hierarchical view in the order of month and destination state holding value of number of flights
Dashboard
Dash Basics
Dash is an open-source Python library for building user interfaces for reactive, web-based applications. It is
enterprise-ready and a first-class member of Plotly’s open-source tools.
Dash applications are web servers running Flask and communicating JSON packets over HTTP requests.
Dash’s frontend renders components using React.js. It is easy to build a graphical user interface
with Dash because it abstracts away the technologies required to build the application.
Dash is declarative and reactive. Dash output can be rendered in a web browser and deployed to
servers.
Dash uses a simple reactive decorator for binding code to the UI. Dash apps are inherently mobile and
cross-platform ready.
# Randomly sample 500 data points. Setting the random state to be 42 so that we get same result.
A callback function is a Python function that Dash calls automatically whenever an input component's
property changes. The callback function is decorated with the @app.callback decorator. (Decorators wrap a function,
modifying its behavior.)
# Group the data by Month and compute average over arrival delay time.
line_data = df.groupby('Month')['ArrDelay'].mean().reset_index()
#
fig = go.Figure(data=go.Scatter(x=line_data['Month'],
y=line_data['ArrDelay'],
mode='lines',
marker=dict(color='green')))
fig.update_layout(title='Month vs Average Flight Delay Time',
xaxis_title="Month",
yaxis_title='ArrDelay')
return fig
More Outputs
Dashboard Components
Monthly average carrier delay by reporting airline for the given year.
Monthly average weather delay by reporting airline for the given year.
Monthly average national air system delay by reporting airline for the given year.
Monthly average security delay by reporting airline for the given year.
Monthly average late aircraft delay by reporting airline for the given year.
app.layout = html.Div(children=[
html.H1('Flight Delay Time Statistics',
style={'textAlign': 'left',
'color': '#503D36',
'font-size': 30}),
html.Div(["Input Year: ",
dcc.Input(id='input-year',
type='number',
value='2010',
style={'height': '35px',
'font-size': 30}),],
style={'font-size': 30}),
html.Br(),
html.Br(),
html.Div([
html.Div(dcc.Graph(id='carrier-plot')),
html.Div(dcc.Graph(id='weather-plot'))
], style={'display': 'flex'}),
html.Div([
html.Div(dcc.Graph(id='nas-plot')),
html.Div(dcc.Graph(id='security-plot'))
], style={'display': 'flex'}),
html.Div(dcc.Graph(id='late-plot'), style={'width':'50%'})
])
def compute_info(airline_data, entered_year):
    """
    This function takes in airline data and selected year as an input and performs
    computation for creating charts and plots.

    Arguments:
        airline_data: Input airline data.
        entered_year: Input year for which computation needs to be performed.

    Returns:
        Computed average dataframes for carrier delay, weather delay, NAS delay,
        security delay, and late aircraft delay.
    """
    # Select data
    df = airline_data[airline_data['Year'] == int(entered_year)]
    # Compute delay averages
    avg_car = df.groupby(['Month', 'Reporting_Airline'])['CarrierDelay'].mean().reset_index()
    avg_weather = df.groupby(['Month', 'Reporting_Airline'])['WeatherDelay'].mean().reset_index()
    avg_NAS = df.groupby(['Month', 'Reporting_Airline'])['NASDelay'].mean().reset_index()
    avg_sec = df.groupby(['Month', 'Reporting_Airline'])['SecurityDelay'].mean().reset_index()
    avg_late = df.groupby(['Month', 'Reporting_Airline'])['LateAircraftDelay'].mean().reset_index()
    return avg_car, avg_weather, avg_NAS, avg_sec, avg_late
# Callback decorator
@app.callback( [
Output(component_id='carrier-plot', component_property='figure'),
Output(component_id='weather-plot', component_property='figure'),
Output(component_id='nas-plot', component_property='figure'),
Output(component_id='security-plot', component_property='figure'),
Output(component_id='late-plot', component_property='figure'),
],
Input(component_id='input-year', component_property='value'))
# Computation to callback function and return graph
def get_graph(entered_year):
Dashboard Summary
The best dashboards answer critical business questions. They help a business make informed decisions,
thereby improving performance.
import pandas as pd
import dash
# Since Dash 2.0, html and dcc ship inside the dash package; the separate
# dash_html_components and dash_core_components packages are deprecated.
from dash import dcc, html, no_update
from dash.dependencies import Input, Output, State
import plotly.graph_objects as go
import plotly.express as px
app = dash.Dash(__name__)
# REVIEW1: Clear the layout and do not display exception till callback gets executed
app.config.suppress_callback_exceptions = True
app.layout = html.Div(children=[#TASK 3A
html.H1('Car Automobile Components',
style={'textAlign': 'center',
'color': '#503D36',
'font-size': 24}),
#outer division starts
html.Div([
# First inner division for adding dropdown helper text for Selected Drive wheels
html.Div([
#TASK 3B
html.H2('Drive Wheels Type:', style={'margin-right': '2em'}),
]),
#TASK 3C
dcc.Dropdown(
id='demo-dropdown',
options=[
{'label': 'Rear Wheel Drive', 'value': 'rwd'},
{'label': 'Front Wheel Drive', 'value': 'fwd'},
{'label': 'Four Wheel Drive', 'value': '4wd'}
],
value='rwd'
),
#Second Inner division for adding 2 inner divisions for 2 output graphs
html.Div([
#TASK 3D
html.Div([ ], id='plot1'),
html.Div([ ], id='plot2')
], style={'display': 'flex'}),
])
#outer division ends
])
#layout ends
if __name__ == '__main__':
    app.run_server()
Dash Airline
# REVIEW1: Clear the layout and do not display exception till callback gets executed
app.config.suppress_callback_exceptions = True
# List of years
year_list = list(range(2005, 2021))
def compute_data_choice_1(df):
    """
    Argument:
        df: Filtered dataframe

    Returns:
        Dataframes to create graph.
    """
    # Cancellation Category Count
    bar_data = df.groupby(['Month', 'CancellationCode'])['Flights'].sum().reset_index()
    # Average flight time by reporting airline
    line_data = df.groupby(['Month', 'Reporting_Airline'])['AirTime'].mean().reset_index()
    # Diverted Airport Landings
    div_data = df[df['DivAirportLandings'] != 0.0]
    # Source state count
    map_data = df.groupby(['OriginState'])['Flights'].sum().reset_index()
    # Destination state count
    tree_data = df.groupby(['DestState', 'Reporting_Airline'])['Flights'].sum().reset_index()
    return bar_data, line_data, div_data, map_data, tree_data
def compute_data_choice_2(df):
    """
    Arguments:
        df: Input airline data.

    Returns:
        Computed average dataframes for carrier delay, weather delay, NAS delay,
        security delay, and late aircraft delay.
    """
    # Compute delay averages
    avg_car = df.groupby(['Month', 'Reporting_Airline'])['CarrierDelay'].mean().reset_index()
    avg_weather = df.groupby(['Month', 'Reporting_Airline'])['WeatherDelay'].mean().reset_index()
    avg_NAS = df.groupby(['Month', 'Reporting_Airline'])['NASDelay'].mean().reset_index()
    avg_sec = df.groupby(['Month', 'Reporting_Airline'])['SecurityDelay'].mean().reset_index()
    avg_late = df.groupby(['Month', 'Reporting_Airline'])['LateAircraftDelay'].mean().reset_index()
    return avg_car, avg_weather, avg_NAS, avg_sec, avg_late
# Application layout
app.layout = html.Div(children=
[
# TASK1: Add title to the dashboard
# Enter your code below. Make sure you have correct formatting.
id='input-year',
# Update dropdown values using list comprehension
options=[{'label': i, 'value': i} for i in year_list],
placeholder="Select a year",
style={'width':'80%', 'padding':'3px', 'font-size':
'20px', 'text-align-last' : 'center'}),
# Place them next to each other using the division style
],
style={'display': 'flex'}
),
]
),
html.Div(
[
html.Div([ ], id='plot2'),
html.Div([ ], id='plot3')
],
style={'display': 'flex'}
),
# TASK3: Add a division with two empty divisions inside. See above division for example.
# Enter your code below. Make sure you have correct formatting.
html.Div(
[
html.Div([ ], id='plot4'),
html.Div([ ], id='plot5')
],
style={'display': 'flex'}
),
])
# REVIEW4: Holding output state till user enters all the form information. In this case, it will be chart type and year
[
State("plot1", 'children'),
State("plot2", "children"),
State("plot3", "children"),
State("plot4", "children"),
State("plot5", "children")
])
# Add computation to callback function and return graph
def get_graph(chart, year, children1, children2, c3, c4, c5):
    # Select data
    df = airline_data[airline_data['Year'] == int(year)]
    if chart == 'OPT1':
        # Compute required information for creating graph from the data
        bar_data, line_data, div_data, map_data, tree_data = compute_data_choice_1(df)
        map_fig.update_layout(
            title_text='Number of flights from origin state',
            geo_scope='usa'
        )  # Plot only the USA instead of globe
airline
        # Enter your code below. Make sure you have correct formatting.
        tree_fig = px.treemap(tree_data, path=['DestState', 'Reporting_Airline'],
                              values='Flights',
                              color='Flights',
                              color_continuous_scale='RdBu',
                              title='Flight count by airline to destination state'
                              )
        # Create graph
        carrier_fig = px.line(avg_car, x='Month', y='CarrierDelay', color='Reporting_Airline',
                              title='Average carrier delay time (minutes) by airline')
        weather_fig = px.line(avg_weather, x='Month', y='WeatherDelay', color='Reporting_Airline',
                              title='Average weather delay time (minutes) by airline')
        nas_fig = px.line(avg_NAS, x='Month', y='NASDelay', color='Reporting_Airline',
                          title='Average NAS delay time (minutes) by airline')
        sec_fig = px.line(avg_sec, x='Month', y='SecurityDelay', color='Reporting_Airline',
                          title='Average security delay time (minutes) by airline')
        late_fig = px.line(avg_late, x='Month', y='LateAircraftDelay', color='Reporting_Airline',
                           title='Average late aircraft delay time (minutes) by airline')
        return [dcc.Graph(figure=carrier_fig),
                dcc.Graph(figure=weather_fig),
                dcc.Graph(figure=nas_fig),
                dcc.Graph(figure=sec_fig),
                dcc.Graph(figure=late_fig)]