Python - Adv - 3 - Jupyter Notebook (Student)
Introduction
The base library for visualization in Python is matplotlib . Nearly every other library for
visualizing data is built on top of it. However, despite being incredibly flexible and powerful,
matplotlib is difficult to use for data analysis. Instead of being developed with one single API
design, it has grown organically as every new update needed to ensure backwards compatibility
with old code (otherwise all libraries building on it would break until updated). This continuity is part
of what makes it so attractive and simultaneously complicated.
Furthermore, matplotlib is designed to visualize anything, not just data. Because we're most
interested in examining and presenting relationships between data, however, we will use a different
library, seaborn . This library is specifically designed for statistical data visualization and provides
a consistent and easy-to-use API.
Scatter plots
We may, of course, be interested in more than just the x- and y-values. We can use additional
arguments to relplot(...) to distinguish data points.
Points are now colored differently depending on whether the entry in the dataset corresponds to a
smoker or not. We can do the same for the size and style aesthetics as well.
Be warned that combining several aesthetics can make plots extremely difficult to parse.
The hue and size aesthetics have been categorical so far, meaning that distinct colors and
sizes were chosen for each possible, discrete value of the dataframe columns they were applied
to. They can also be applied to continuous, numerical variables. In this case, the color palette will
automatically be set to a gradient. We will see further on how to customize colors.
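As a minimal sketch of the difference between a categorical and a numerical hue, the following uses a small made-up data frame (the column names and values are assumptions, not the dataset used in this section):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: bills, tips, a yes/no flag and a party size
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=60),
    "smoker": rng.choice(["Yes", "No"], size=60),
    "party_size": rng.integers(1, 7, size=60),
})
demo["tip"] = 0.15 * demo["total_bill"] + rng.normal(0, 1, size=60)

# Categorical hue: one distinct color per level of "smoker"
g_cat = sns.relplot(x="total_bill", y="tip", hue="smoker", data=demo)

# Numerical hue: colors are taken from a gradient instead
g_num = sns.relplot(x="total_bill", y="tip", hue="party_size", data=demo)
```

The same pattern should apply to size=... and style=..., although style only makes sense for categorical columns.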
Line plots
By default, seaborn will create a scatterplot. In the case of time series, we may be interested in
creating a line plot to better visualize trends. We can do this by simply adding a kind="line"
argument (by default, this argument is kind="scatter" ).
In [ ]: df = pd.DataFrame({
"time": np.arange(500),
"value": np.random.randn(500).cumsum()})
By default, the dataframe will be sorted so that the x-values are in ascending order. This ensures
that the line plot looks like a timeseries plot. This can, however, be disabled by setting
sort=False . This could be useful, for example, if we are following the movement of an object or
tracking how two variables change simultaneously through time.
Line plots have the same aesthetic mapping possibilities as scatter plots, hue , size , and
style , and they can also be combined in the same way. Notice how multiple lines are created
and only points with the identical mapped aesthetics are connected. That means, if we create a
line plot that maps a variable to hue and to style , we will end up with an individual line for each
existing combination of variables in our data.
In [ ]: # DataFrame.append was removed in pandas 2.0, so the parts are combined with pd.concat
# One random walk per (region, division) combination
groups = [("North", "A"), ("North", "B"), ("North", "C"),
          ("South", "A"), ("South", "B")]
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(500),
        "value": np.random.randn(500).cumsum(),
        "region": region, "division": division})
    for region, division in groups], ignore_index=True)
sns.relplot(
    x="time", y="value", kind="line", hue="region",
    style="division", data=df)
In [ ]: df.head()
If using the style parameter, we can also decide whether we want dashes, dots, or both.
In [ ]: # One short random walk per region, combined with pd.concat
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(20),
        "value": np.random.randn(20).cumsum(),
        "region": region})
    for region in ["North", "South"]], ignore_index=True)
sns.relplot(x="time", y="value", kind="line",
    style="region", markers=True, data=df)
Aggregating Data
Often, we may have data with multiple measurements for the same data point, i.e. x-value. For
example, we might have several temperature sensors in a device as a failsafe. seaborn can
automatically aggregate y-values for identical x-values. By default, it plots the mean and the 95%
confidence interval around this mean in either direction.
Because seaborn uses bootstrapping to compute the confidence intervals and this is a
time-consuming process, it may be better to either switch to the standard deviation ( ci="sd" ) or turn
this off entirely and only plot the mean ( ci=None ). In seaborn 0.12 and later, the ci parameter
is deprecated in favor of errorbar , e.g. errorbar="sd" or errorbar=None .
We can also change our estimator to any aggregation function, such as np.median(...) ,
np.sum(...) , or even np.max(...) . If we want to turn off aggregation then we just set
estimator=None . Note that this plots every measurement individually, so with several y-values
per x-value the line passes through all of them and takes on a jagged, zigzag shape.
Plotting Dates
Because they're so ubiquitous, seaborn natively supports the date format and will automatically
format plots accordingly.
In [ ]: df = pd.DataFrame({
"time": pd.date_range("2017-1-1", periods=500),
"value": np.random.randn(500).cumsum()})
df.head()
Exercises
1. Load the iris.csv dataset and create a scatter plot relating the petal length to the petal
width.
In [ ]: ###
In [ ]: # MC
iris = pd.read_csv("../data/iris.csv")
sns.relplot(x="petal_length", y="petal_width", data=iris)
2. Load the diamonds.csv dataset. Plot the carats versus the price again, but this time make
sure that points are colored based on the cut.
In [ ]: ###
In [ ]: # MC
diamonds = pd.read_csv("../data/diamonds.csv")
sns.relplot(data=diamonds, x="carat", y="price", hue="cut")
3. Load the mpg.csv dataset and create a line plot relating the mean mpg to the model_year .
Make sure each country of origin is shown in a separate line style.
In [ ]: ###
In [ ]: # MC
mpg = pd.read_csv("../data/mpg.csv")
sns.relplot(data=mpg, x="model_year", y="mpg",
kind="line", style="origin")
4. This time, use pandas to find the mean mpg value for each model_year and each country
of origin . Create a line plot relating the mean mpg to the model_year with one line for
each country of origin , as above.
Hint: Remember groupby ? Remember how we can use it for multiple columns
simultaneously?
Note: seaborn cannot use the index, even if it is named. You must use *.reset_index()
to ensure that the columns you grouped by are columns in the new data frame.
In [ ]: ###
In [ ]: # MC
mpg = pd.read_csv("../data/mpg.csv")
mpg_mean = mpg.groupby(["model_year", "origin"])["mpg"].mean()
mpg_mean = mpg_mean.reset_index()
sns.relplot(data=mpg_mean, x="model_year", y="mpg", kind="line",
style="origin")
5. Consider the following (fake) stock data. Create a line plot from this data with one line for each
stock symbol and format the x-axis as a date.
In [ ]: ###
np.random.seed(101)
# DataFrame.append was removed in pandas 2.0, so the two symbols are combined with pd.concat
stock_data = pd.concat([
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": "TRDS"}),
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": "RISL"})], ignore_index=True)
stock_data.head()
In [ ]: # MC
np.random.seed(101)
# DataFrame.append was removed in pandas 2.0, so the two symbols are combined with pd.concat
stock_data = pd.concat([
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": "TRDS"}),
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": "RISL"})], ignore_index=True)
stock_fig = sns.relplot(data=stock_data, x="time", y="value",
hue="symbol", kind="line")
stock_fig.fig.autofmt_xdate()
The simplest way to represent the relationship between continuous and categorical data is with a
categorical scatter plot that represents the distribution of (continuous) values for each category.
For this, we can use catplot(...) with its default value kind="strip" .
seaborn automatically adds jitter to the points to reduce their overlap. We can adjust this jitter by
passing a value between 0 and 1 (exclusive) or eliminate this jitter entirely by passing a boolean
False . Note that a value of 1 is interpreted as True and the default jitter width is used!
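The three jitter settings can be compared side by side. This sketch uses a made-up data frame with three groups; the column names are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 30 scores in each of three groups
rng = np.random.default_rng(3)
groups = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "score": rng.normal(size=90),
})

g_default = sns.catplot(x="group", y="score", data=groups)              # default jitter
g_narrow = sns.catplot(x="group", y="score", jitter=0.05, data=groups)  # narrow jitter
g_none = sns.catplot(x="group", y="score", jitter=False, data=groups)   # all points on one line
```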
We can also prevent point overlap entirely by using a swarm plot. This will create a useful visual
approximation of the distribution of the values.
Categorical plots only support the hue aesthetic, not the style or size aesthetics.
seaborn will make assumptions on the nature of your data. For example, if you pass two
continuous, numerical variables to catplot(...) , it will try to treat the x-axis as a categorical
variable.
Notice that seaborn will fail if you attempt to place the pseudo-categorical variable onto the
y-axis. We can, however, invert our axes if one of the variables is truly categorical, i.e. not
numerical.
Distribution Plots
Swarm plots are good for approximating distributions, but we often want to have an exact
description of the data distribution. For this, we can use box plots and variants thereof.
Boxplots encode valuable information about our distribution. For each subset of the data, i.e. each
box, the following pieces of information are shown:
Note that hue assumes a categorical variable when used on catplot(...) and seaborn will
therefore automatically convert numerical variables into categorical ones.
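As a rough sketch, the statistics behind a conventional box plot can be computed by hand. This assumes the common convention of whiskers reaching to the most extreme points within 1.5 times the interquartile range, which, to my knowledge, is also seaborn's default; the data is made up:

```python
import numpy as np

# Hypothetical sample with one obvious outlier appended
rng = np.random.default_rng(4)
values = np.append(rng.normal(50, 10, size=200), [120.0])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach to the most extreme data points within 1.5 * IQR of the box
lower_whisker = values[values >= q1 - 1.5 * iqr].min()
upper_whisker = values[values <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual outlier point
outliers = values[(values < lower_whisker) | (values > upper_whisker)]
print(q1, median, q3, lower_whisker, upper_whisker, outliers)
```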
When quantiles aren't enough, seaborn can also display a violin plot. This kind of plot estimates
a density and plots it as a distribution.
If a variable has only two possible values and is mapped to the hue aesthetic, then split=True
can be used to combine the two density estimates to compare them more easily.
Violin plots estimate the density. This kernel density estimator (KDE) requires a parameter, called
bandwidth, that determines how smooth or how detailed the density plot will be. Understanding
violin plots can therefore be more difficult and potentially misleading.
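The effect of the bandwidth can be illustrated with a hand-written Gaussian KDE. This is a simplified sketch, not seaborn's implementation: here the hypothetical `bandwidth` argument is an absolute width in data units, whereas seaborn derives its kernel width from the spread of the data.

```python
import numpy as np

def gaussian_kde_by_hand(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

def count_peaks(density):
    """Number of strict local maxima along the grid."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

# Hypothetical bimodal sample: two clusters around -2 and +2
rng = np.random.default_rng(5)
samples = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)])
grid = np.linspace(-15, 15, 3001)

rough = gaussian_kde_by_hand(samples, grid, bandwidth=0.01)  # spiky: follows single points
smooth = gaussian_kde_by_hand(samples, grid, bandwidth=3.0)  # over-smoothed: modes blur together
print(count_peaks(rough), count_peaks(smooth))
```

A very small bandwidth shows noise as if it were structure, while a very large one hides the two clusters, which is why the same data can produce quite different violins.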
Violin plots automatically show the corresponding box plot stats inside. We can change this to
Like with line plots, we may be interested in summary statistics over our data. For this, we can use
a bar plot. seaborn will compute a summary statistic, such as the mean, as well as confidence
intervals for each individual category (denoted by the x-axis).
In [ ]: # Compute the mean survival rate for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic)
In [ ]: # Compute the total number of survivors for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar",
estimator=np.sum, data=titanic)
If we're just interested in counting the number of occurrences of a single variable, we can use
kind="count" .
An alternative to a barplot is a "point plot", which connects groups. This can be used to track
pseudo-timeseries data that may only have a few categorical time points, e.g. sales data for 5
years. Notice how it connects data subgroups with the same value of the variable mapped to the
hue aesthetic ( sex ).
As before, we can also change the estimator and confidence interval method for point plots.
Exercises
1. Load the diamonds.csv dataset and create a categorical scatter plot that relates the price to
the cut.
In [ ]: ###
In [ ]: # MC
diamonds = pd.read_csv("../data/diamonds.csv")
sns.catplot(data=diamonds, x="cut", y="price")
2. Change the jitter width of the previous plot so that the dot-columns are touching.
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="cut", y="price", jitter=0.5)
3. This time, create a box plot that relates the carats to the clarity.
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="clarity", y="carat", kind="box")
4. Create a subset of the diamonds data consisting of only diamonds with colors of "J" (worst)
and "D" (best) and only with clarity "IF".
Hint: We can combine boolean masks for Pandas like so: diamonds.loc[(condition1) &
(condition2)]
Create a violin plot relating the price to the clarity and map the color to the hue aesthetic.
Make sure the density estimates for each color are combined in each violin.
In [ ]: ###
In [ ]: # MC
diamonds_jd = diamonds.loc[
diamonds["color"].isin(("J", "D")) &
(diamonds["clarity"] == "IF")]
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin")
5. Play with the bandwidth parameter ( bw ) for the previous plot. How can you interpret the plot
for bw=0.01 , bw=0.1 , and bw=1 ?
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin", bw=0.01)
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin", bw=0.1)
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin", bw=1)
6. Using the full diamond dataset again, use a bar plot to determine how many diamonds there
are of each cut.
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="cut", kind="count")
Element Ordering
All of the above plots allow us to customize the order of elements, both on the axes as well as for
the aesthetics. Naturally, the functions will only enable ordering aesthetics that are supported, e.g.
catplot(...) has no size_order or style_order arguments and relplot(...) has no
order argument as both axes depict continuous values.
In [ ]: # Compute the mean survival rate for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar",
order=["female", "male"], data=titanic)
In [ ]: sns.relplot(
x="total_bill", y="tip", hue="smoker", size="day", style="time",
style_order=["Lunch", "Dinner"],
size_order=["Thur", "Fri", "Sat", "Sun"],
hue_order=["Yes", "No"], data=tips)
Faceting
We can also instruct the functions relplot(...) and catplot(...) to create multiple plots if
we simply have too much detail to show in one. The parameters col=... and row=... let us
further split apart the data and show subsets in individual plots.
In [ ]: titanic.head()
In [ ]: # Compute the mean survival rate for each sex and class, with one plot per embarkation town
sns.catplot(x="sex", y="survived", hue="class",
kind="bar", col="embark_town", data=titanic)
In [ ]: tips.head()
relplot(kind=...)
scatter: scatterplot() --> Calls matplotlib.pyplot.scatter()
line: lineplot() --> Calls matplotlib.pyplot.plot()
catplot(kind=...)
strip: stripplot() --> Calls multiple matplotlib functions
swarm: swarmplot() --> Calls multiple matplotlib functions
box: boxplot() --> Calls matplotlib.pyplot.boxplot()
violin: violinplot() --> Calls multiple matplotlib functions
bar: barplot() --> Calls matplotlib.pyplot.bar()
count: countplot() --> Calls matplotlib.pyplot.bar()
point: pointplot() --> Calls multiple matplotlib functions
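The relationship in the listing above can be sketched as follows, using a tiny made-up data frame: the figure-level function creates its own figure and returns a grid object, while the underlying function draws onto a single matplotlib Axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data, only used to have something to draw
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3]})

# Figure-level: creates its own figure and returns a FacetGrid
grid = sns.relplot(x="x", y="y", kind="scatter", data=demo)

# Axes-level: draws onto a matplotlib Axes (here one we created) and returns it
fig, ax = plt.subplots()
returned = sns.scatterplot(x="x", y="y", data=demo, ax=ax)
```

The axes-level functions are convenient when a seaborn plot has to be placed into an existing matplotlib layout.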
Customizing Plots
Title
Legend
In [ ]: myFigure._legend.texts
In [ ]: myFigure._legend.texts[0].set_text("1st")
myFigure._legend.texts[1].set_text("2nd")
myFigure._legend.texts[2].set_text("3rd")
myFigure._legend.texts
In [ ]: myFigure.fig
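A self-contained sketch of setting a title and editing the legend, using a small made-up data frame in place of the titanic data (all values are invented). Note that _legend is a private attribute of the grid, so this approach may change between seaborn versions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: every sex/class combination appears twice
demo = pd.DataFrame({
    "sex": ["female", "male"] * 6,
    "class": ["First", "Second", "Third"] * 4,
    "survived": [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
})

myFigure = sns.catplot(x="sex", y="survived", hue="class", kind="bar",
                       hue_order=["First", "Second", "Third"], data=demo)

# Title: suptitle belongs to the matplotlib figure behind the grid
myFigure.fig.suptitle("Survival rate by sex and class", y=1.02)

# Legend: set a title and overwrite the entry texts in order
myFigure._legend.set_title("Passenger class")
for text, label in zip(myFigure._legend.texts, ["1st", "2nd", "3rd"]):
    text.set_text(label)
```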
Axis Labels
In some cases, tick labels may be too dense and must be rotated.
In [ ]: myFigure.set_yticklabels(rotation=30)
myFigure.fig
Axis Limits
We use matplotlib to set our axis limits.
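One way to do this, sketched with made-up data, is to call the matplotlib methods on the axes that the grid exposes:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data with a wide value range
demo = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) ** 2})
g_lim = sns.relplot(x="x", y="y", data=demo)

# matplotlib Axes methods; plt.xlim(...) / plt.ylim(...) act on the current axes instead
g_lim.ax.set_xlim(0, 10)
g_lim.ax.set_ylim(0, 100)
```

Points outside the limits are not deleted from the data, they are only cut off from view.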
Color
There are far more methods of creating and choosing color palettes in seaborn than could
possibly be shown here.
In general, we set the colors of our plot with the parameter palette=... . The simplest way to do
this is to pass a list of colors in the order of the hue levels (here the passenger class), or a
dictionary relating each level to a color. The colors can be given either as a string (insofar as
the color is known to seaborn), in hexadecimal format indicating the color channel intensities
( #RRGGBB ), or as a tuple/list with 3 values indicating the color mixing ( [r, g, b] , values
should be between 0 and 1).
In [ ]: sns.catplot(
x="sex", y="survived", hue="class", kind="bar",
hue_order=["First", "Second", "Third"],
palette=["red", "#00FF00", (0, 1, 1)],
data=titanic)
The xkcd color survey produced a set of 954 named colors (https://xkcd.com/color/rgb/) by asking
people to name random RGB colors.
This becomes tiresome for many categories so seaborn offers several functions to generate
color palettes automatically. Some of these include:
sns.cubehelix_palette(...)
sns.diverging_palette(...)
sns.dark_palette(...)
Any of the ColorBrewer (http://colorbrewer2.org) presets
... and many more
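A short sketch of a few of these generators. The specific arguments (palette name, hue angles, color names, number of colors) are arbitrary choices for illustration:

```python
import seaborn as sns

named = sns.color_palette("Blues", 5)               # a ColorBrewer preset with 5 colors
helix = sns.cubehelix_palette(5)                    # sequential, varying hue and brightness
diverging = sns.diverging_palette(240, 10, n=5)     # two hues meeting at a neutral midpoint
dark = sns.dark_palette("seagreen", 5)              # from dark gray to the named color
xkcd = sns.xkcd_palette(["windows blue", "amber"])  # colors looked up by xkcd name

# Each result is a list of RGB tuples and can be passed directly as palette=...
print(named[0])
```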
Themes
Beyond color, seaborn also has support for themes. There are five built-in seaborn themes:
darkgrid , whitegrid , dark , white , and ticks . They can be invoked with
sns.set_style(...) .
In [ ]: sns.set_style("dark")
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)
In [ ]: sns.set_style("ticks")
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)
We can edit these styles to our liking. Note that the floating point numbers are actually strings!
Lastly, we can also use sns.set(...) to tweak our plots, such as font size scaling.
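A minimal sketch of inspecting and editing a style. The overridden key and its value are arbitrary examples:

```python
import seaborn as sns

# axes_style returns the settings of a theme as a dictionary
ticks_style = sns.axes_style("ticks")
print(sorted(ticks_style)[:5])

# Override individual entries; gray levels are given as strings such as ".9", not floats
custom = sns.axes_style("darkgrid", rc={"axes.facecolor": ".9"})

# Apply a theme with the same override to all following plots
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# set(...) can additionally scale all fonts (this call also switches the style to "ticks")
sns.set(style="ticks", font_scale=1.5)
```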
Saving Plots
Typically, an analysis pipeline won't run in Jupyter, or any other interactive environment, but as a
script that generates a report. We can use seaborn to this end by saving our plots.
seaborn supports saving both in bitmap format, e.g. PNG, as well as in vector format, e.g. PDF.
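As a sketch, the grid object returned by relplot(...) or catplot(...) has a savefig(...) method that picks the format from the file extension. The data and the temporary output directory here are placeholders:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs as a script; omit in Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical data, only used to have something to save
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3]})
g_save = sns.relplot(x="x", y="y", data=demo)

out_dir = tempfile.mkdtemp()  # placeholder output location
png_path = os.path.join(out_dir, "output.png")
pdf_path = os.path.join(out_dir, "output.pdf")

g_save.savefig(png_path, dpi=150)  # bitmap: fixed pixel grid, resolution set by dpi
g_save.savefig(pdf_path)           # vector: stays sharp at any zoom level
```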
Exercises
1. Using the full diamond dataset again, use a bar plot to determine how many diamonds there
are of each clarity. Create facets for the cut (columns) and color (rows).
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="clarity", kind="count",
col="cut", row="color")
2. Create a box plot that relates the carats to the clarity and place the boxes in the correct order
(I1 , SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
In [ ]: ###
In [ ]: # MC
sns.catplot(
data=diamonds, x="clarity", y="carat", kind="box",
order=("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
3. Plot the relationship between the x and y columns of the diamonds dataframe. Limit the
x-axis to the interval [3, 11] and the y-axis to the interval [0, 15] to remove outliers.
In [ ]: ###
In [ ]: # MC
myFig = sns.relplot(data=diamonds, x="x", y="y")
plt.xlim(3, 11)
plt.ylim(0, 15)
In [ ]: ###
In [ ]: # MC
exercise = pd.read_csv("../data/exercise.csv")
efig = sns.catplot(data=exercise, x="diet", y="pulse",
kind="box", hue="kind", col="time")
efig.set_axis_labels(x_var="Diet", y_var="Pulse")
Open Exercise
5. Using the dataset tips , plot any relationship between variables you find worth investigating
and make the figure "presentation-ready". That means:
Save your plot as "output.png" and "output.pdf" and compare the two images. What happens
to them when you zoom in very close?
In [ ]: ###
In [ ]: # MC