Python - Adv - 3 - Jupyter Notebook (Student)

8/22/22, 9:23 PM Python_Day6_MC - Jupyter Notebook
In [ ]: %matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# At the time of creating this material, there was a versioning issue
# between seaborn and numpy that results in a FutureWarning. This does
# not affect the results and will presumably be fixed in some update cycle
# but creates an annoying warning message we don't want to see every time.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Visualization with Seaborn

Visualization with Seaborn
Introduction
Relationships Between Continuous Variables
Scatter plots
Line plots
Aggregating Data
Plotting Dates
Exercises
Relationships to Categorical Variables
Categorical Scatter Plots
Distribution Plots
Exercises
Element Ordering
Faceting
Under the Hood
Customizing Plots
Plot Text and Axis Labels
Axis Limits
Color
Themes
Saving Plots
Exercises
Introduction
localhost:8890/notebooks/2022/22Aug/PRJ63504 Capstone (Python)/CADS/Python for Analytics (Advanced)/Day 6/Python_Day6_MC.ipynb 1/18

The base library for visualization in Python is matplotlib . Many other libraries for
visualizing data, including seaborn , are built on top of it. However, despite being incredibly
flexible and powerful, matplotlib is difficult to use for data analysis. Rather than following one
single API design, it has grown organically, because every update had to remain backwards compatible
with old code (otherwise all libraries building on it would break until updated). This continuity is part
of what makes it so attractive and, at the same time, so complicated.
Furthermore, matplotlib is designed to visualize anything, not just data. Because we're most
interested in examining and presenting relationships between data, however, we will use a different
library, seaborn . This library is specifically designed for statistical data visualization and provides
a consistent and easy-to-use API.
Relationships Between Continuous Variables

Visualizing the relationship between continuous variables is as simple as plotting the values of
both variables for each data entry on the x- and y-axes of a plot.
Scatter plots
In [ ]: tips = pd.read_csv("../data/tips.csv")

tips.head()
In [ ]: sns.relplot(x="total_bill", y="tip", data=tips)
We may, of course, be interested in more than just the x- and y-values. We can use additional
arguments to relplot(...) to distinguish data points by further variables.
In [ ]: sns.relplot(x="total_bill", y="tip", hue="day", data=tips)
Points are now colored according to the day of the week recorded for each entry in the
dataset. We can do the same for the size and style aesthetics as well.
In [ ]: sns.relplot(x="total_bill", y="tip", size="smoker", data=tips)
In [ ]: sns.relplot(x="total_bill", y="tip", style="day", data=tips)
The aesthetic mappings can be combined as desired to visualize up to 5 dimensions in our
datasets via the x , y , hue , style , and size arguments.
In [ ]: sns.relplot(x="total_bill", y="tip", hue="smoker", size="day", style="time", data=tips)
Be warned that this will make plots extremely difficult to parse visually.

The hue and size aesthetics have been categorical so far, meaning that distinct colors and
sizes were chosen for each possible, discrete value of the dataframe columns they were applied
to. They can also be applied to continuous, numerical variables. In this case, the color palette will
automatically be set to a gradient. We will see further on how to customize colors.
In [ ]: sns.relplot(x="total_bill", y="tip", hue="size", data=tips)
In [ ]: sns.relplot(x="total_bill", y="tip", size="tip", data=tips, kind="scatter")
Line plots
By default, seaborn will create a scatterplot. In the case of time series, we may be interested in
creating a line plot to better visualize trends. We can do this by simply adding a kind="line"
argument (by default, this argument is kind="scatter" ).
In [ ]: df = pd.DataFrame({
"time": np.arange(500),
"value": np.random.randn(500).cumsum()})
In [ ]: sns.relplot(x="time", y="value", kind="line", data=df)
By default, the dataframe will be sorted so that the x-values are in ascending order. This ensures
that the line plot looks like a timeseries plot. This can, however, be disabled by setting
sort=False . This could be useful, for example, if we are following the movement of an object or
tracking how two variables change simultaneously through time.
In [ ]: df = pd.DataFrame(np.random.randn(500, 2).cumsum(axis=0), columns=["x", "y"])
In [ ]: sns.relplot(x="x", y="y", sort=False, kind="line", data=df)
Line plots have the same aesthetic mapping possibilities as scatter plots, hue , size , and
style , and they can also be combined in the same way. Notice how multiple lines are created
and only points with identical mapped aesthetics are connected. That means, if we create a
line plot that maps one variable to hue and another to style , we will end up with an individual
line for each existing combination of those variables in our data.
In [ ]: # One random walk per (region, division) combination; the "time" column is
# assumed to mirror the earlier example (np.arange(500))
combos = [("North", "A"), ("North", "B"), ("North", "C"),
          ("South", "A"), ("South", "B")]
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(500),
        "value": np.random.randn(500).cumsum(),
        "region": region, "division": division})
    for region, division in combos], ignore_index=True)

sns.relplot(
    x="time", y="value", kind="line", hue="region",
    style="division", data=df)
In [ ]: df.head()
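To check how many lines such a plot will contain, we can count the combinations that actually occur in the data. The following is a minimal sketch with a small made-up dataframe (the name demo and its values are illustrative and not part of the course data). Only three of the four possible region/division pairs occur, so we would expect three lines.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: three (region, division) pairs, five time steps each
pairs = [("North", "A"), ("North", "B"), ("South", "A")]
demo = pd.concat(
    [pd.DataFrame({
        "time": np.arange(5),
        "value": rng.standard_normal(5).cumsum(),
        "region": region,
        "division": division})
     for region, division in pairs],
    ignore_index=True)

# Each existing combination of the mapped variables becomes its own line
n_lines = demo.groupby(["region", "division"]).ngroups
print(n_lines)
```

Passing demo to sns.relplot(...) with hue="region" and style="division" should then draw one line per counted group.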
In [ ]: # Using size instead of style

sns.relplot(x="time", y="value", kind="line", hue="region", size="division", data=df)
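If using the style parameter, we can also decide whether we want dashes, markers, or both.
In [ ]: # Two random walks, one per region (a series length of 20 is assumed)
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(20),
        "value": np.random.randn(20).cumsum(),
        "region": region})
    for region in ("North", "South")], ignore_index=True)
sns.relplot(x="time", y="value", kind="line",
            style="region", markers=True, data=df)
In [ ]: sns.relplot(x="time", y="value", kind="line", style="region",
            dashes=False, data=df)
In [ ]: sns.relplot(x="time", y="value", kind="line", style="region",
            dashes=False, markers=True, data=df)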
Aggregating Data

Often, we may have data with multiple measurements for the same data point, i.e. x-value. For
example, we might have several temperature sensors in a device as a failsafe. seaborn can
automatically aggregate y-values for identical x-values. By default, it plots the mean and the 95%
confidence interval around this mean in either direction.
In [ ]: fmri = pd.read_csv("../data/fmri.csv")

fmri.head()
In [ ]: fmri.loc[(fmri["timepoint"] == 18)].head()
In [ ]: sns.relplot(x="timepoint", y="signal", kind="line", data=fmri)
Because seaborn uses bootstrapping to compute the confidence intervals and this is a
time-consuming process, it may be better to either switch to the standard deviation ( ci="sd" ) or turn
this off entirely and only plot the mean ( ci=None ). Newer seaborn releases (0.12 and later)
express the same choices as errorbar="sd" and errorbar=None .
In [ ]: sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=fmri)
In [ ]: sns.relplot(x="timepoint", y="signal", kind="line", ci=None, data=fmri)
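To see why the confidence band is the expensive part, here is a rough sketch of a bootstrap interval for the mean at a single x-value, written with numpy only. The readings are made up, and the number of resamples (1000) is an assumption for illustration; this is not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical repeated measurements that all share one x-value
readings = rng.normal(loc=20.0, scale=2.0, size=30)

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(readings, size=readings.size, replace=True).mean()
    for _ in range(1000)])

mean = readings.mean()
# The central 95% of the bootstrap means approximates the confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])
# The standard deviation needs a single pass over the data instead
spread = readings.std()

print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f}), sd={spread:.2f}")
```

This resampling has to be repeated for every x-value in the plot, which is why switching to the standard deviation or disabling the band can speed things up noticeably.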
We can also change our estimator to any aggregation function, such as np.median(...) ,
np.sum(...) , or even np.max(...) . If we want to turn off aggregation entirely, we just set
estimator=None . Note that this plots every individual measurement, so the line jumps back and
forth between the several y-values that share each x-value.
In [ ]: sns.relplot(x="timepoint", y="signal", kind="line",

estimator=np.median, data=fmri)
In [ ]: sns.relplot(x="timepoint", y="signal", kind="line",

estimator=None, data=fmri)
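The aggregation itself can be reproduced with a pandas groupby , which is a handy way to check what a line plot will show. A small sketch with made-up numbers (the dataframe demo is hypothetical):

```python
import pandas as pd

# Hypothetical data: three measurements at each of two timepoints
demo = pd.DataFrame({
    "timepoint": [0, 0, 0, 1, 1, 1],
    "signal": [1.0, 2.0, 9.0, 2.0, 4.0, 6.0]})

# One y-value per x-value, as plotted with the default estimator (the mean)
means = demo.groupby("timepoint")["signal"].mean()
# The same, but for estimator=np.median
medians = demo.groupby("timepoint")["signal"].median()

print(means)
print(medians)
```

At timepoint 0 the single large value pulls the mean well above the median, which is the kind of difference the choice of estimator makes visible.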
Plotting Dates
Because they're so ubiquitous, seaborn natively supports the date format and will automatically
format plots accordingly.
In [ ]: pd.date_range("2017-1-1", periods=5)
In [ ]: pd.date_range("1-1-2017", "22-3-2017")
In [ ]: df = pd.DataFrame({
    "time": pd.date_range("2017-1-1", periods=500),
    "value": np.random.randn(500).cumsum()})
df.head()

In [ ]: g = sns.relplot(x="time", y="value", kind="line", data=df)

g.fig.autofmt_xdate()
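When building date ranges by hand, note that a string such as "22-3-2017" can only be read day-first, while pandas reads ambiguous strings month-first by default. A small sketch using ISO 8601 strings (year-month-day), which avoid that ambiguity:

```python
import pandas as pd

# ISO 8601 dates (YYYY-MM-DD) leave no doubt about which part is the month
days = pd.date_range("2017-01-01", "2017-03-22")

print(len(days))          # number of daily entries, both end points included
print(days[0], days[-1])
```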
Exercises
1. Load the iris.csv dataset and create a scatter plot relating the petal length to the petal
width.
In [ ]: ###
In [ ]: # MC
iris = pd.read_csv("../data/iris.csv")
sns.relplot(x="petal_length", y="petal_width", data=iris)
2. Load the diamonds.csv dataset. Plot the carats versus the price again, but this time make
sure that points are colored based on the cut.
In [ ]: ###
In [ ]: # MC
diamonds = pd.read_csv("../data/diamonds.csv")
sns.relplot(data=diamonds, x="carat", y="price", hue="cut")
3. Load the mpg.csv dataset and create a line plot relating the mean mpg to the model_year .
Make sure each country of origin is shown in a separate line style.
In [ ]: ###
In [ ]: # MC
mpg = pd.read_csv("../data/mpg.csv")
sns.relplot(data=mpg, x="model_year", y="mpg",
kind="line", style="origin")
4. This time, use pandas to find the mean mpg value for each model_year and each country
of origin . Create a line plot relating the mean mpg to the model_year with one line for
each country of origin , as above.
Hint: Remember groupby ? Remember how we can use it for multiple columns
simultaneously?
Note: seaborn cannot use the index, even if it is named. You must use *.reset_index()
to ensure that the columns you grouped by are columns in the new data frame

In [ ]: ###
In [ ]: # MC
mpg = pd.read_csv("../data/mpg.csv")
mpg_mean = mpg.groupby(["model_year", "origin"])["mpg"].mean()
mpg_mean = mpg_mean.reset_index()
sns.relplot(data=mpg_mean, x="model_year", y="mpg", kind="line",
style="origin")
5. Consider the following (fake) stock data. Create a line plot from this data with one line for each
stock symbol and format the x-axis as a date.
In [ ]: ###
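np.random.seed(101)
# NOTE: the "time" and "value" columns are assumed (modeled on the date
# example above); only the "symbol" entries of this cell are known.
stock_data = pd.concat([
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": symbol})
    for symbol in ("TRDS", "RISL")], ignore_index=True)
stock_data.head()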
In [ ]: # MC
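np.random.seed(101)
# NOTE: the "time" and "value" columns are assumed (modeled on the date
# example above); only the "symbol" entries of this cell are known.
stock_data = pd.concat([
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": symbol})
    for symbol in ("TRDS", "RISL")], ignore_index=True)

stock_fig = sns.relplot(data=stock_data, x="time", y="value",
                        hue="symbol", kind="line")
stock_fig.fig.autofmt_xdate()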
Relationships to Categorical Variables

We've already seen how we can show dependence on categorical variables with the various
aesthetics in the previous section ( hue , size , and style ). Often, we may not have two
continuous variables to relate to each other, though. For this, we use the seaborn function
catplot(...) which can create multiple kinds of categorical plots.
Categorical Scatter Plots

The simplest way to represent the relationship between continuous and categorical data is with a
categorical scatter plot that represents the distribution of (continuous) values for each category.
For this, we can make use of the default value kind="strip" .
In [ ]: tips = pd.read_csv("../data/tips.csv")

tips.head()
In [ ]: sns.catplot(x="day", y="total_bill", data=tips)
seaborn automatically adds jitter to the points to reduce their overlap. We can adjust this jitter by
passing a value between 0 and 1 (exclusive) or eliminate this jitter entirely by passing a boolean
False . Note that a value of 1 is interpreted as True and the default jitter width is used!
In [ ]: sns.catplot(x="day", y="total_bill", jitter=False, data=tips)
In [ ]: # When a number is passed, this corresponds to a relative width

# jitter=0.5 will typically mean that the "point columns" touch.
sns.catplot(x="day", y="total_bill", jitter=0.3, data=tips)
We can also prevent point overlap entirely by using a swarm plot. This will create a useful visual
approximation of the distribution of the values.
In [ ]: sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)
Categorical plots only support the hue aesthetic, not the style or size aesthetics.
In [ ]: sns.catplot(x="day", y="total_bill", kind="swarm", hue="sex", data=tips)
seaborn will make assumptions on the nature of your data. For example, if you pass two
continuous, numerical variables to catplot(...) , it will try to treat the x-axis as a categorical
variable.
In [ ]: sns.catplot(x="size", y="total_bill", kind="swarm", data=tips)
In [ ]: sns.catplot(x="total_bill", y="size", kind="swarm", data=tips)
Notice that this will break seaborn if you attempt to place the pseudo-categorical variable onto the
y-axis. We can, however, invert our axes if one of the variables is truly categorical, i.e. not
numerical.
In [ ]: sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)
In [ ]: sns.catplot(x="total_bill", y="day", kind="swarm", data=tips)

Distribution Plots
Swarm plots are good for approximating distributions, but we often want to have an exact
description of the data distribution. For this, we can use box plots and variants thereof.
In [ ]: sns.catplot(x="day", y="total_bill", kind="box", data=tips)
Boxplots encode valuable information about our distribution. For each subset of the data, i.e. each
box, the following pieces of information are shown:
The central line of each box represents the median value.
The top and bottom of each box are the 3rd and 1st quartile, respectively. This means that 25% of
all values are below the bottom line and 25% are above the top line, i.e. 50% of all values are
within the colored region.
The whiskers denote the outlier limits (by default, they reach to the most extreme values that lie
within 1.5 times the box height of the box edges). Any value between the whiskers is considered "normal".
The points outside of the whiskers are outliers that may require special attention.
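The numbers behind a box can be computed by hand. The following sketch uses made-up values and assumes the common 1.5 × IQR rule for the outlier limits (this is matplotlib's default whisker setting).

```python
import numpy as np

# Hypothetical values with one obviously extreme entry
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Box edges and central line
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Outlier limits under the 1.5 * IQR rule
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points beyond the limits are drawn individually as outliers
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, median, q3, outliers)
```

The whiskers themselves are drawn at the most extreme data points that still fall inside these limits, so they need not be symmetric around the box.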
The hue argument can be used to show additional, nested relationships
In [ ]: sns.catplot(x="day", y="total_bill", kind="box", hue="sex", data=tips)
Note that hue assumes a categorical variable when used on catplot(...) and seaborn will
therefore automatically convert numerical variables into categorical ones.
In [ ]: sns.catplot(x="day", y="total_bill", kind="box", hue="size", data=tips)
When quantiles aren't enough, seaborn can also display a violin plot. This kind of plot estimates
a density and plots it as a distribution
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin", data=tips)
If a variable has only two possible values and is mapped to the hue aesthetic, then split=True
can be used to combine the two density estimates to compare them more easily.
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",

hue="sex", split=True, data=tips)
Violin plots estimate the density. This kernel density estimator (KDE) requires a parameter, called
bandwidth, that determines how smooth or how detailed the density plot will be. Understanding
violin plots can therefore be more difficult and potentially misleading.
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin", bw=0.1, data=tips)
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin", bw=5, data=tips)
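The effect of the bandwidth is easiest to see on a tiny example. Below is a hand-written Gaussian KDE in numpy, a sketch only: the helper names kde_curve and count_peaks are made up for illustration, and bandwidth here is the absolute kernel width, whereas (to our understanding) the bw argument above acts as a scale factor on a data-dependent default.

```python
import numpy as np

def kde_curve(samples, grid, bandwidth):
    """Gaussian KDE: the average of one Gaussian bump per sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

def count_peaks(density):
    """Number of interior grid points that are higher than both neighbours."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

samples = np.array([10.0, 20.0, 30.0])   # hypothetical bills
grid = np.linspace(0.0, 40.0, 401)

narrow = kde_curve(samples, grid, bandwidth=1.0)    # very detailed
wide = kde_curve(samples, grid, bandwidth=10.0)     # very smooth

print(count_peaks(narrow), count_peaks(wide))
```

With the narrow kernel every sample shows up as its own bump, while the wide kernel merges all three into a single hill. Neither picture is wrong, but they suggest very different distributions, which is why violin plots should be read with the bandwidth in mind.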
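Violin plots automatically show the corresponding box plot stats inside. We can change this to
showing sticks , points , or nothing at all.
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            inner="stick", data=tips)
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            inner="points", data=tips)
In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            inner=None, data=tips)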
Like with line plots, we may be interested in summary statistics over our data. For this, we can use
a bar plot. seaborn will compute a summary statistic, such as the mean, as well as confidence
intervals for each individual category (denoted by the x-axis).
In [ ]: titanic = pd.read_csv("../data/titanic.csv")

titanic.head()
In [ ]: # Compute the mean survival rate for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic)
In [ ]: # Compute the total number of survivors for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar",
            estimator=np.sum, data=titanic)
If we're just interested in counting the number of occurrences of a single variable, we can use
kind="count" .
In [ ]: # Count the number of passengers by sex and class

sns.catplot(x="sex", hue="class", kind="count", data=titanic)
An alternative to a barplot is a "point plot", which connects groups. This can be used to track
pseudo-timeseries data that may only have a few categorical time points, e.g. sales data for 5
years. Notice how it connects data subgroups with the same value of the variable mapped to the
hue aesthetic ( sex ).
In [ ]: sns.catplot(x="class", y="survived", hue="sex", kind="point", data=titanic)
As before, we can also change the estimator and confidence interval method for point plots.
In [ ]: sns.catplot(x="class", y="survived", hue="sex", kind="point",

estimator=np.mean, ci="sd", data=titanic)
Exercises

1. Load the diamonds.csv dataset and create a categorical scatter plot that relates the price to
the cut
In [ ]: ###
In [ ]: # MC
diamonds = pd.read_csv("../data/diamonds.csv")
sns.catplot(data=diamonds, x="cut", y="price")
2. Change the jitter width of the previous plot so that the dot-columns are touching.
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="cut", y="price", jitter=0.5)
3. This time, create a box plot that relates the carats to the clarity
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="clarity", y="carat", kind="box")
4. Create a subset of the diamonds data consisting of only diamonds with colors of "J" (worst)
and "D" (best) and only with clarity "IF".
Hint: We can combine boolean masks for Pandas like so: diamonds.loc[(condition1) &
(condition2)]
Create a violin plot relating the price to the clarity and map the color to the hue aesthetic.
Make sure the density estimates for each color are combined in each violin.
In [ ]: ###
In [ ]: # MC
diamonds_jd = diamonds.loc[
diamonds["color"].isin(("J", "D")) &
(diamonds["clarity"] == "IF")]

sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin")
5. Play with the bandwidth parameter ( bw ) for the previous plot. How can you interpret the plot
for bw=0.01 , bw=0.1 , and bw=1 ?

In [ ]: ###
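In [ ]: # MC
sns.catplot(data=diamonds_jd, x="clarity", y="price",
            hue="color", split=True, kind="violin", bw=0.01)
sns.catplot(data=diamonds_jd, x="clarity", y="price",
            hue="color", split=True, kind="violin", bw=0.1)
sns.catplot(data=diamonds_jd, x="clarity", y="price",
            hue="color", split=True, kind="violin", bw=1)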
6. Using the full diamond dataset again, use a bar plot to determine how many diamonds there
are of each cut.
In [ ]: ###
In [ ]: # MC
sns.catplot(data=diamonds, x="cut", kind="count")
Element Ordering
All of the above plots allow us to customize the order of elements, both on the axes as well as for
the aesthetics. Naturally, the functions will only enable ordering aesthetics that are supported, e.g.
catplot(...) has no size_order or style_order arguments and relplot(...) has no
order argument as both axes depict continuous values.
In [ ]: # Compute the mean survival rate for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar",
order=["female", "male"], data=titanic)
In [ ]: sns.catplot(x="sex", y="survived", hue="class", kind="bar",

order=["female", "male"], data=titanic,
hue_order=["First", "Second", "Third"])
In [ ]: sns.relplot(
x="total_bill", y="tip", hue="smoker", size="day", style="time",
style_order=["Lunch", "Dinner"],
size_order=["Thur", "Fri", "Sat", "Sun"],
hue_order=["Yes", "No"], data=tips)
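Instead of repeating the order in every call, the order can also be stored in the data itself by converting the column to an ordered pandas categorical. To our understanding, seaborn picks up the category order of such a column when no explicit order is passed; the sketch below only demonstrates the pandas side, with made-up values.

```python
import pandas as pd

# Hypothetical excerpt of a tips-like table
demo = pd.DataFrame({
    "day": ["Sun", "Thur", "Sat", "Fri", "Thur"],
    "total_bill": [20.5, 11.0, 35.2, 18.3, 9.8]})

# Store the desired order of the weekdays in the column itself
day_order = ["Thur", "Fri", "Sat", "Sun"]
demo["day"] = pd.Categorical(demo["day"], categories=day_order, ordered=True)

# Sorting (and grouping) now follows the weekday order, not the alphabet
sorted_days = demo.sort_values("day")["day"].tolist()
print(sorted_days)
```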
Faceting
We can also instruct the functions relplot(...) and catplot(...) to create multiple plots if
we simply have too much detail to show in one. The parameters col=... and row=... let us
further split apart the data and show subsets in individual plots.

In [ ]: titanic.head()
In [ ]: # Compute the mean survival rate for each sex and class, with one plot per embarkation town
sns.catplot(x="sex", y="survived", hue="class",
kind="bar", col="embark_town", data=titanic)
In [ ]: tips.head()
In [ ]: sns.relplot(x="total_bill", y="tip", hue="sex",

row="day", col="smoker", data=tips)
In [ ]: sns.relplot(x="total_bill", y="tip", hue="sex",

row="day", col="smoker", data=tips,
row_order=["Thur", "Fri", "Sat", "Sun"])
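The object returned by these functions holds the grid of subplots, so we can check how the facets were laid out. A minimal sketch with a made-up, tips-like dataframe (the Agg backend line is only there so the example also runs outside of a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Hypothetical data with two days and two smoker categories
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 15.0, 30.0, 25.0, 12.0],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 2.5],
    "day": ["Sat", "Sat", "Sun", "Sun", "Sat", "Sun"],
    "smoker": ["Yes", "No", "Yes", "No", "No", "Yes"]})

g = sns.relplot(
    data=demo, x="total_bill", y="tip",
    row="day", col="smoker",
    row_order=["Sat", "Sun"], col_order=["No", "Yes"])

# One subplot per (row, column) combination
grid_shape = g.axes.shape
print(grid_shape)
```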
Under the Hood

seaborn is a high-level interface for matplotlib . The two functions introduced here call other,
intermediate functions, which in turn call matplotlib functions.
relplot(kind=...)
scatter: scatterplot() --> matplotlib.pyplot.scatter()
line: lineplot() --> matplotlib.pyplot.plot()
catplot(kind=...)
strip: stripplot() --> Calls multiple matplotlib functions
swarm: swarmplot() --> Calls multiple matplotlib functions
box: boxplot() --> Calls matplotlib.pyplot.boxplot()
violin: violinplot() --> Calls multiple matplotlib functions
bar: barplot() --> Calls matplotlib.pyplot.bar()
count: countplot() --> Calls matplotlib.pyplot.bar()
point: pointplot() --> Calls multiple matplotlib functions
seaborn is essentially a "convenience" to make matplotlib more accessible.
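The layering is visible in what the functions return. As a small sketch with made-up data: the intermediate function scatterplot() hands back a plain matplotlib axes object, while relplot(...) wraps its subplots in a seaborn FacetGrid .

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.axes
import pandas as pd
import seaborn as sns

# Hypothetical data
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 1.0, 4.0, 3.0]})

# Intermediate function: draws onto a matplotlib Axes and returns it
ax = sns.scatterplot(data=demo, x="x", y="y")

# High-level function: creates its own figure and returns a FacetGrid
g = sns.relplot(data=demo, x="x", y="y", kind="scatter")

print(type(ax).__name__, type(g).__name__)
```

This is also why the customization steps further on switch between seaborn methods on the grid and matplotlib methods on the figure or axes.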
Customizing Plots
Plot Text and Axis Labels

Customizing the text of axis labels is unfortunately not as intuitive as building the plots. This is
because seaborn builds heavily on matplotlib but attempts to reduce the fine granularity of
building a plot with the latter. For example, to create the faceted plots above using matplotlib ,
we would have to subset the data into all possible variants, build each individual plot, arrange them
in a grid, and then add the legend and axis titles. seaborn makes this step somewhat easier, but
cannot get around this granularity when it comes to customizing plots.

Title
In [ ]: # Look at documentation for more info

myFigure = sns.catplot(
x="sex", y="survived", hue="class", kind="bar",
hue_order=["First", "Second", "Third"], data=titanic)
myFigure.fig.suptitle("Titanic Survivors", fontsize=15)
Legend
In [ ]: myFigure._legend.set_title("Passenger Class")

myFigure.fig
Legend labels are stored as Text(...) elements
In [ ]: myFigure._legend.texts
We can change these by calling *.set_text(...) on each of them
In [ ]: myFigure._legend.texts[0].set_text("1st")
myFigure._legend.texts[1].set_text("2nd")
myFigure._legend.texts[2].set_text("3rd")
myFigure._legend.texts
In [ ]: myFigure.fig
Axis Labels
In [ ]: myFigure.set_axis_labels(x_var="Passenger Sex", y_var="Survival Rate")

myFigure.fig
We can set the value of categorical tick labels as follows:
In [ ]: myFigure.set_xticklabels(labels=["Apples", "Oranges"])

myFigure.fig
Rotate Tick Labels
In some cases, tick labels may be too dense and must be rotated
In [ ]: myFigure.set_yticklabels(rotation=30)
myFigure.fig
Axis Limits
We use matplotlib to set our axis limits
In [ ]: sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)

plt.xlim(20, 40)
plt.ylim(2, 8)
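plt.xlim(...) and plt.ylim(...) act on the current matplotlib axes. For plots with several facets, one alternative is the set(...) method of the returned grid, which applies a setting to every subplot explicitly. A sketch with made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Hypothetical tips-like data
demo = pd.DataFrame({
    "total_bill": [10.0, 22.0, 35.0, 28.0, 41.0, 18.0],
    "tip": [1.5, 3.0, 6.0, 4.0, 7.5, 2.5],
    "smoker": ["Yes", "No", "Yes", "No", "No", "Yes"]})

g = sns.relplot(data=demo, x="total_bill", y="tip", col="smoker")

# Apply the same limits to all facets at once
g.set(xlim=(20, 40), ylim=(2, 8))

limits = [(ax.get_xlim(), ax.get_ylim()) for ax in g.axes.flat]
print(limits)
```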
Color
There are far more methods of creating and choosing color palettes in seaborn than could
possibly be shown here.
In general, we set the colors of our plot with the parameter palette=... . The simplest way to do
this is to pass a list of colors, one for each level of the mapped variable (here the passenger
class, in the order given by hue_order ); a dictionary relating each level to a color works as well. The
colors can be given either as a string (insofar as the color is known to seaborn), in hexadecimal
format indicating the color channel intensities ( #RRGGBB ), or as a tuple/list with 3 values indicating
the color mixing ( [r, g, b] , values should be between 0 and 1).
In [ ]: sns.catplot(
    x="sex", y="survived", hue="class", kind="bar",
    hue_order=["First", "Second", "Third"],
    palette=["red", "#00FF00", (0, 1, 1)],
    data=titanic)
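The xkcd color survey produced a set of 954 named colors (https://xkcd.com/color/rgb/) that give
names to commonly perceived RGB colors. They can be used with the prefix xkcd: , e.g. "xkcd:sky blue" .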
This becomes tiresome for many categories so seaborn offers several functions to generate
color palettes automatically. Some of these include:
sns.cubehelix_palette(...)
sns.diverging_palette(...)
sns.dark_palette(...)
Any of the ColorBrewer (http://colorbrewer2.org) presets
... and many more
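As a small sketch of the automatic route: sns.color_palette(...) returns a list of RGB tuples, which can be passed directly as palette=... or zipped into a dictionary. The preset name "Blues" is one of the ColorBrewer presets; the variable names are only illustrative.

```python
import seaborn as sns

classes = ["First", "Second", "Third"]

# Ask for as many colors as there are categories
colors = sns.color_palette("Blues", n_colors=len(classes))

# Optionally tie each category to its color explicitly
palette = dict(zip(classes, colors))

for name, rgb in palette.items():
    print(name, tuple(round(channel, 2) for channel in rgb))
```

The dictionary form has the advantage that each category keeps its color even if the order of the categories changes between plots.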
Themes
Beyond color, seaborn also has support for themes. There are five built-in seaborn themes:
darkgrid , whitegrid , dark , white , and ticks . They can be invoked with
sns.set_style(...)
In [ ]: sns.set_style("dark")
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)
In [ ]: sns.set_style("ticks")

We can edit these styles to our liking. Note that the floating point numbers are actually strings!
In [ ]: # See current style details

sns.set_style("ticks")
sns.axes_style()
In [ ]: # Overwrite styles

sns.set_style("ticks", {"text.color": '1'})
Lastly, we can also use sns.set(...) to tweak our plots, such as font size scaling.
In [ ]: # Scale all font sizes

sns.set(font_scale=1)
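sns.set_style(...) changes the style for all following plots. If a style should only apply to a single plot, sns.axes_style(...) can be used as a context manager instead. A brief sketch:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import seaborn as sns

# axes_style(...) returns the settings of a theme as a dictionary
whitegrid = sns.axes_style("whitegrid")
grid_on = whitegrid["axes.grid"]

# Used in a with-block, the theme only applies to plots created inside it
with sns.axes_style("whitegrid"):
    inside = matplotlib.rcParams["axes.grid"]

print(grid_on, inside)
```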
Saving Plots
Typically, an analysis pipeline won't run in Jupyter, or any other interactive environment, but as a
script that generates a report. We can use seaborn to this end by saving our plots.
In [ ]: myFigure = sns.catplot(
    x="sex", y="survived", hue="class", kind="bar",
    hue_order=["First", "Second", "Third"],
    order=["male", "female"], data=titanic, ci=None)
myFigure.fig.suptitle("Titanic Survivors", fontsize=20)
myFigure._legend.set_title("Legend Title")
myFigure._legend.texts[0].set_text("1st")
myFigure._legend.texts[1].set_text("2nd")
myFigure._legend.texts[2].set_text("3rd")
myFigure.set_axis_labels(x_var="Passenger Sex", y_var="Survival Rate")
# Show the survival rate (0 to 1) as a percentage on the y-axis
myFigure.ax.yaxis.set_major_formatter(
    plt.FuncFormatter(lambda x, pos: "{:.0f}%".format(x * 100)))
myFigure.set_xticklabels(labels=["Male", "Female"])
myFigure.set_xticklabels(rotation=30)

# Save the plot in all its glory
myFigure.savefig("output.png")
seaborn supports saving both in bitmap format, e.g. PNG, as well as in vector format, e.g. PDF.
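A short sketch of saving the same figure in both kinds of format. The data is made up, and the files are written to a temporary directory (an assumption made here so that the example leaves nothing behind); the format is chosen from the file extension.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# Hypothetical data
demo = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "survived": [0, 1, 1, 1]})

g = sns.catplot(data=demo, x="sex", y="survived", kind="bar")

out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "output.png")
pdf_path = os.path.join(out_dir, "output.pdf")

g.savefig(png_path, dpi=150)  # bitmap: resolution is fixed by dpi
g.savefig(pdf_path)           # vector: stays sharp at any zoom level

print(os.path.getsize(png_path), os.path.getsize(pdf_path))
```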
Exercises
1. Using the full diamond dataset again, use a bar plot to determine how many diamonds there
are of each clarity. Create facets for the cut (columns) and color (rows).
In [ ]: ###

In [ ]: # MC
sns.catplot(data=diamonds, x="clarity", kind="count",
col="cut", row="color")
2. Create a box plot that relates the carats to the clarity and place the boxes in the correct order
(I1 , SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
In [ ]: ###
In [ ]: # MC
sns.catplot(
data=diamonds, x="clarity", y="carat", kind="box",
order=("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
3. Plot the relationship between the x and y columns of the diamonds dataframe. Limit the x-
axis to the interval [3, 11] and the y-axis to the interval [0, 15] to remove outliers
In [ ]: ###
In [ ]: # MC
myFig = sns.relplot(data=diamonds, x="x", y="y")
plt.xlim(3, 11)
plt.ylim(0, 15)
4. Load the exercise.csv dataset.

A. Plot the relationship between the pulse and diet as a boxplot.
B. Map the kind of exercise to the hue
C. Facet the data into columns so that we have one plot for each timepoint
In [ ]: ###
In [ ]: # MC
exercise = pd.read_csv("../data/exercise.csv")
efig = sns.catplot(data=exercise, x="diet", y="pulse",
kind="box", hue="kind", col="time")
efig.set_axis_labels(x_var="Diet", y_var="Pulse")
Open Exercise
5. Using the dataset tips , plot any relationship between variables you find worth investigating
and make the figure "presentation-ready". That means:
Use aesthetics ( hue , style , size ) and facets where appropriate.

Create a figure title and label the axes and legend.
Format tick marks if necessary.
Edit tick labels and legend entries if necessary.
Find a visually appealing color palette.

Choose one of the base themes and play around with the options to them until they are to
your liking.
Save your plot as "output.png" and "output.pdf" and compare the two images. What happens
to them when you zoom in very close?
In [ ]: ###
In [ ]: # MC
