Unit 3 (Python)
Data Visualization
What is data visualization?
Data visualization is the representation of data through use of common graphics, such
as charts, plots, infographics, and even animations.
These visual displays of information communicate complex data relationships and
data-driven insights in a way that is easy to understand.
Data analysts and data scientists use it to discover and explain patterns and trends.
Numbers and tables can be overwhelming, but visuals like charts and graphs can reveal
patterns and trends that would be difficult to see otherwise. For example, a bar chart
can show you which products are selling the best, while a line graph can show you how
sales have changed over time.
Identify patterns and trends: Visuals highlight relationships and insights buried within
numbers, making it easier to spot trends, outliers, and correlations.
Simplify complex information: Graphs and charts break down complex data into
digestible chunks, making it easier for humans to grasp the overall picture.
Reveal hidden insights: Certain visual patterns might not be apparent in raw data, but
become clear when represented visually.
Communicate findings effectively: Visuals capture attention and hold it, making data
presentations more engaging and impactful than solely relying on text or numbers.
Share information with diverse audiences: Visualizations transcend language barriers
and cater to learners of all styles, ensuring everyone can understand the data.
Inform better decisions: By readily understanding patterns and relationships in
data, individuals and organizations can make more informed choices based on
evidence.
Track progress and identify areas for improvement: Visualizing data over time reveals
progress towards goals and helps identify areas needing adjustments.
o Matplotlib: The foundation for many other libraries, providing low-level building
blocks for creating static, animated, and interactive visualizations.
o Seaborn: Built on top of Matplotlib, it offers a high-level interface for creating
aesthetically pleasing and informative statistical graphics.
o Plotly: Excels in creating interactive, web-based visualizations, ideal for sharing
and embedding in web applications.
o Bokeh: Another option for interactive visualizations, with a focus on web-based
plots and dashboards.
o Pandas Visualization: Built into the Pandas data analysis library, it offers
convenient plotting functions directly from DataFrames.
Seamless Integration with Data Analysis Tools:
o Python's powerful data analysis libraries like NumPy and Pandas make it easy to
clean, manipulate, and prepare data for visualization.
o Integration with libraries like SciPy and Statsmodels enables statistical modeling
and analysis within the same Python environment.
o This smooth integration streamlines the data analysis and visualization
process, from data preparation to visualization and interpretation.
Interactive Visualizations:
o Plotly and Bokeh enable the creation of dynamic and interactive visualizations
that engage users and facilitate exploration of data.
o These visualizations can be easily shared and embedded in web applications for
widespread dissemination.
Strong Community and Support:
o Python boasts a large and active community of data scientists and
developers, providing extensive resources, tutorials, and support for
visualization tasks.
o Ongoing development and updates ensure that Python's visualization
capabilities remain cutting-edge.
Python libraries for data visualization
Matplotlib
Matplotlib is a Python plotting library for creating static, animated, and interactive
visualizations. It is built on NumPy, Python's numerical mathematics library.
Despite being over a decade old, it remains the most popular plotting library in the
Python world.
Because matplotlib was one of the earliest Python data visualization libraries, many other
libraries have been built on top of it or are intended to work in tandem with it.
Although matplotlib is great for visualizing details, it isn't very practical for quickly and
easily creating publication-quality charts.
Plotly
Plotly is a free open-source graphing library for creating data visualizations.
Plotly (plotly.py) is a Python library that is built on top of the Plotly JavaScript library
(plotly.js) and can be used to create web-based data visualizations that can be
displayed in Jupyter notebooks or web applications using Dash, or saved as individual
HTML files.
Plotly supports scatter plots, histograms, line charts, bar charts, pie charts, error bars,
box plots, multiple axes, sparklines, dendrograms, 3-D charts, and other chart types.
Contour plots, which are uncommon in other data visualization libraries, are also
available in Plotly. Plotly is also available for use without an internet connection.
Seaborn
GGplot
ggplot is a versatile library for plotting graphs in Python, modeled on ggplot2, which
was originally implemented in R.
It is a domain-specific language for building visualizations, primarily for data
analysis.
ggplot allows a graph to be plotted in a straightforward manner, often with only two
lines of code.
The same plot written with matplotlib, on the other hand, can be more complex and
involve many more lines of code. As a result, ggplot makes graph coding easier.
Bokeh
Matplotlib is a multiplatform data visualization library built on NumPy arrays, and designed
to work with the broader SciPy stack.
One of Matplotlib’s most important features is its ability to play well with many operating
systems and graphics backends. Matplotlib supports dozens of backends and output types,
which means you can count on it to work regardless of which operating system you are using
or which output format you wish. This cross-platform, everything-to-everyone approach has
been one of the great strengths of Matplotlib.
In recent years, however, the interface and style of Matplotlib have begun to show their age.
Newer tools like ggplot and ggvis, along with web visualization toolkits based on D3.js and
HTML5 canvas, often make Matplotlib feel clunky and old-fashioned.
Importing matplotlib
Just as we use the np shorthand for NumPy and the pd shorthand for
Pandas, we will use some standard shorthands for Matplotlib imports:
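The import lines themselves did not survive in these notes. The conventional shorthands, shown here as a minimal sketch, are mpl for the top-level package and plt for the pyplot interface; the version print is only an illustrative check.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# plt is the interface used in almost every example that follows;
# mpl is mainly needed for configuration and version checks.
print(mpl.__version__)
```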
1. Plot Style:
Matplotlib has various predefined styles that can be applied to change the overall
appearance of plots.
Example:
plt.style.use('ggplot')
Add labels to the x and y axes, and a title to the entire plot.
Example:
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Title of the Plot')
4. Legend:
Add a legend to the plot to label different elements.
Example:
plt.plot(x1, y1, label='Line 1')
plt.plot(x2, y2, label='Line 2')
plt.legend()
5. Grid:
Display grid lines to aid in reading values from the plot.
Example:
plt.grid(True)
6. Figure Size
Adjust the size and aspect ratio of the entire figure.
Example:
plt.figure(figsize=(8, 6))
7. Axis Limits:
Set limits for the x and y axes to focus on a specific region of interest.
Example:
plt.xlim(0, 10)
plt.ylim(0, 20)
8. Background Color:
Change the background color of the entire plot.
Example:
plt.figure(facecolor='lightgray')
These are just a few examples of the many customization options available in Matplotlib.
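The individual options above can be combined in a single script. The following is a minimal sketch; the data (two straight lines) and the label text are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('ggplot')                          # predefined plot style

x = np.linspace(0, 10, 100)                      # illustrative data
fig = plt.figure(figsize=(8, 6), facecolor='lightgray')  # size and background

plt.plot(x, x, label='Line 1')
plt.plot(x, 2 * x, label='Line 2')

plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Title of the Plot')
plt.legend()                                     # uses the label= values above
plt.grid(True)
plt.xlim(0, 10)                                  # focus on a region of interest
plt.ylim(0, 20)
plt.show()
```

Note that plt.figure() must be called before the plotting commands, because it creates the figure that the later calls draw into.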
Line plots
Line plots are a type of data visualization that uses lines to connect individual data points,
providing a clear representation of how a particular variable changes over a continuous
interval. Line plots can be used for the following:
Visualizing Trends: Line plots are highly effective in revealing trends or patterns in
data. The connected lines make it easy to observe how a variable changes over a
continuous range, providing insights into the overall direction of the data.
Showing Relationships: Line plots are particularly useful for illustrating relationships
between two variables, especially when one variable is dependent on the other. The
slope and direction of the line indicate the nature of the relationship.
Highlighting Time Series Data: Line plots are commonly used for visualizing time
series data, where the x-axis represents time. This allows analysts to observe changes
in a variable over time, making trends, cycles, and seasonality more apparent.
Comparing Multiple Series: Line plots can accommodate multiple lines on the same
graph, making it easy to compare the trends of different variables or multiple groups.
This is useful for identifying patterns and differences across categories.
Example:
1. For analyzing the monthly sales performance of a retail store over the course of a year,
a line plot would be particularly useful. The x-axis would represent the months
(January through December), and the y-axis would represent the total sales for each
month.
2. For analyzing the historical stock prices of a company over several years, a line plot
would provide a clear depiction of how the stock prices have changed over time, allowing
you to identify upward or downward trends.
Line plots can be created in Python with Matplotlib's pyplot module. To build a line plot, first
import Matplotlib. It is a standard convention to import matplotlib.pyplot as plt.
• To define a plot, you need some values, the matplotlib.pyplot module, and an idea of what
you want to display.
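As a minimal sketch of these three ingredients (values, the pyplot module, and an idea of what to show), the following plots monthly sales as a single line. The sales figures are invented for illustration.

```python
import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 145, 170, 190]   # made-up values

lines = plt.plot(months, sales, marker='o')  # one call draws one line
plt.xlabel('Month')
plt.ylabel('Total sales')
plt.title('Monthly sales')
plt.show()
```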
More than one line can be drawn in the same plot. To add another line, just call the
plt.plot(x, y) function again.
Example:
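import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-1, 1, 50)   # 50 evenly spaced points in [-1, 1]
y1 = 2 * x + 1               # straight line
y2 = 2 ** x + 1              # exponential curve

plt.figure(num=3, figsize=(8, 5))
plt.plot(x, y2)                                 # first line, default style
plt.plot(x, y1, linewidth=1.0, linestyle='--')  # second line, thin and dashed
plt.show()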
Scatter plots
A scatter plot is a visual representation of how two variables relate to each other. We can use
scatter plots to explore the relationship between two variables, for example by looking for
any correlation between them.
Key Features:
Scatter plots display individual data points on a two-dimensional graph, where each point
represents the values of two variables.
Each data point is usually marked with a dot or another symbol, making it easy to distinguish
individual observations.
Unlike line plots, scatter plots do not connect data points with lines. This lack of connection
allows for a clear view of the distribution of points.
Scatter plots are effective for visualizing the spread or dispersion of data points along both
axes. They provide insight into the range of values for each variable.
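A minimal sketch of a scatter plot with plt.scatter() follows. The two variables are randomly generated for illustration, with y constructed to depend on x so that a positive correlation is visible.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for reproducibility
x = rng.uniform(0, 10, 50)               # illustrative variable 1
y = 2 * x + rng.normal(0, 2, 50)         # variable 2, depends on x plus noise

points = plt.scatter(x, y, marker='o')   # one unconnected marker per observation
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of y against x')
plt.show()

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```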
Visualizing errors
Error bars can be added to Matplotlib line plots and bar graphs. Error is the difference
between the calculated (or measured) value and the actual value.
Without error bars, graphs give the impression that a measured or calculated number is
known to a high level of precision. The function matplotlib.pyplot.errorbar() plots y
versus x as lines and/or markers with attached error bars.
Adding an error bar in Matplotlib is simple: we only have to supply the size of the error.
We use the command:
plt.errorbar(x, y, yerr=2, capsize=3)
Where:
yerr = The error value along the Y axis. A single number applies the same error to every
point; a sequence gives each point its own error value.
capsize = The length of the caps at the lower and upper ends of the error bar.
A simple example plots only one point, with an error of 10% on the Y axis.
c) fmt is the format of the marker; in this case a blue ("b") point ("o").
d) capsize is the size of the caps at the lower and upper ends of the error bar.
e) ecolor is the color of the error bar. The default color is the marker color.
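The code for this one-point example is missing from the notes. A minimal sketch, assuming an arbitrary point (4, 3.5) and the parameters described above, could look like this:

```python
import matplotlib.pyplot as plt

x, y = 4, 3.5            # a single illustrative point (assumed values)
y_error = 0.10 * y       # error is 10% of the y value

container = plt.errorbar(
    x, y,
    yerr=y_error,        # size of the error along the Y axis
    fmt='bo',            # blue ("b") point marker ("o")
    capsize=5,           # length of the caps on the error bar
    ecolor='red',        # error bar color (defaults to the marker color)
)
plt.show()
```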
• A contour line or isoline of a function of two variables is a curve along which the function
has a constant value. It is a cross-section of the three-dimensional graph of the function f(x,
y) parallel to the x, y plane.
• Contour lines are used e.g. in geography and meteorology. In cartography, a contour line
joins points of equal height above a given level, such as mean sea level.
Example:
import numpy as np
import matplotlib.pyplot as plt

xlist = np.linspace(-3.0, 3.0, 3)
ylist = np.linspace(-3.0, 3.0, 4)
X, Y = np.meshgrid(xlist, ylist)
# Z was not defined in the original listing; the distance from the origin
# is assumed here so that the example runs.
Z = np.sqrt(X**2 + Y**2)

plt.figure()
cp = plt.contour(X, Y, Z, colors='black', linestyles='dashed')
plt.clabel(cp, inline=True, fontsize=10)
plt.title('Contour Plot')
plt.xlabel('x (cm)')
plt.ylabel('y (cm)')
plt.show()
When creating a contour plot, we can also specify the color map.
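As a minimal sketch of specifying a color map, the following draws a filled contour plot with plt.contourf-style shading and the 'viridis' color map. The function Z (distance from the origin) and the grid size are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3.0, 3.0, 100)
y = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(x, y)
Z = np.sqrt(X**2 + Y**2)       # assumed example function

fig, ax = plt.subplots()
cp = ax.contourf(X, Y, Z, levels=10, cmap='viridis')  # cmap selects the color map
fig.colorbar(cp, ax=ax)        # legend mapping colors to Z values
ax.set_title('Filled Contour Plot')
ax.set_xlabel('x (cm)')
ax.set_ylabel('y (cm)')
plt.show()
```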
Histograms
In a histogram, the data are grouped into ranges (e.g. 10 - 19, 20 - 29) and then plotted as
connected bars. Each bar represents a range of data. The width of each bar is proportional
to the width of each category, and the height is proportional to the frequency or percentage
of that category.
It provides a visual interpretation of numerical data by showing the number of data points
that fall within a specified range of values called "bins".
Histograms can display a large amount of data and the frequency of the data values. The
median and the shape of the distribution can be estimated from a histogram. In addition,
it can show any outliers or gaps in the data.
Parameters: The plt.hist() method accepts the following parameters, among others:
x : The input data, a sequence (or several sequences) of values.
bins : Optional. An integer (number of bins), a sequence (bin edges), or a
string (a binning strategy).
range : Optional. The lower and upper range of the bins.
bottom : Optional. The location of the bottom baseline of each bin.
histtype : Optional. The type of histogram to draw:
{'bar', 'barstacked', 'step', 'stepfilled'}
color : Optional. A color spec or sequence of color specs, one per dataset.
label : Optional. A string, or sequence of strings to match multiple
datasets.
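The parameters above can be seen together in a short sketch. The data are randomly generated test scores, an assumption made only to have something to bin.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)                    # fixed seed for reproducibility
data = rng.normal(loc=50, scale=10, size=1000)     # illustrative data

counts, bin_edges, patches = plt.hist(
    data,
    bins=10,              # number of equal-width bins
    range=(10, 90),       # lower and upper range of the bins
    histtype='bar',       # one of 'bar', 'barstacked', 'step', 'stepfilled'
    color='steelblue',
    label='scores',
)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```

plt.hist() returns the count in each bin, the bin edges (one more than the number of bins), and the drawn bar objects. Values outside range are ignored.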
The simplest legend can be created with the plt.legend() command, which automatically
creates a legend for any labeled plot elements.
There are many ways we might want to customize such a legend. For example, we can specify
the location and turn off the frame (the loc and frameon arguments).
We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha
value) of the frame, or change the padding around the text (the shadow, framealpha and
borderpad arguments).
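The legend examples are missing from the notes. The following is a minimal sketch of the three variants described above, using sine and cosine curves as assumed example data. Each call to ax.legend() replaces the previous legend, so only the last one is visible in the final figure.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')      # label= feeds the legend
ax.plot(x, np.cos(x), '--r', label='Cosine')

# 1) Simplest legend: picks up every labeled element
ax.legend()

# 2) Choose the location and turn off the frame
ax.legend(loc='upper left', frameon=False)

# 3) Rounded box, shadow, opaque frame, extra padding around the text
legend = ax.legend(fancybox=True, shadow=True, framealpha=1, borderpad=1)
plt.show()
```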
Multiple subplots
Subplots are groups of axes that exist together in a single Matplotlib figure. The subplots()
function in the Matplotlib library helps in creating layouts with multiple subplots and
provides control over each individual plot that is created.
subplots() without arguments returns a Figure and a single Axes. This is the simplest
and recommended way of creating a single Figure and Axes.
There are at least three different ways to create plots (called axes) in Matplotlib:
plt.axes(), fig.add_axes() and plt.subplots().
• plt.axes(): The most basic method of creating an axes is to use the plt.axes function. It takes
an optional argument of four numbers, [left, bottom, width, height], in the figure coordinate
system, which ranges from 0 at the bottom left of the figure to 1 at the top right of the
figure.
• By calling plt.subplot(n, m, k), we subdivide the figure into n rows and m columns and
specify that plotting should be done on subplot number k. Subplots are numbered from 1, row
by row, from left to right.
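Both approaches can be sketched briefly. The inset position and the 2 x 3 grid below are arbitrary choices for illustration; the list passed to plt.axes() is [left, bottom, width, height] in figure coordinates.

```python
import matplotlib.pyplot as plt

# plt.axes(): a standard axes plus a small inset in the top-right corner
fig1 = plt.figure()
main_ax = plt.axes()                          # fills the default area
inset_ax = plt.axes([0.65, 0.65, 0.2, 0.2])   # [left, bottom, width, height]

# plt.subplot(n, m, k): 2 rows, 3 columns, k runs from 1 to 6 row by row
fig2 = plt.figure()
for k in range(1, 7):
    ax = plt.subplot(2, 3, k)
    ax.text(0.5, 0.5, str((2, 3, k)), ha='center')  # show the index in each cell
plt.show()
```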
plt.subplots: The Whole Grid in One Go
The approach just described can become quite tedious when creating a large grid of subplots,
especially if you'd like to hide the x- and y-axis labels on the inner plots. For this purpose,
plt.subplots() is the easier tool to use (note the s at the end of subplots). Rather than creating
a single subplot, this function creates a full grid of subplots in a single line, returning them in
a NumPy array. The arguments are the number of rows and number of columns, along with
optional keywords sharex and sharey, which allow you to specify the relationships between
different axes.
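A minimal sketch of plt.subplots() with shared axes follows; the 2 x 3 grid is an arbitrary choice. With sharex='col' and sharey='row', inner tick labels are hidden automatically.

```python
import matplotlib.pyplot as plt

# One call creates the figure and a 2 x 3 grid of axes
fig, axes = plt.subplots(2, 3, sharex='col', sharey='row')

# axes is a NumPy array, indexed [row, column] starting from 0
for i in range(2):
    for j in range(3):
        axes[i, j].text(0.5, 0.5, str((i, j)), fontsize=18, ha='center')
plt.show()
```

Unlike plt.subplot(), where subplot numbers start at 1, the array returned by plt.subplots() uses ordinary zero-based Python indexing.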
There are four important parameters that you will commonly use with annotate(); only the
annotation text and xy are strictly required.
b) xy: The place where you want your arrowhead to point to. In other words, the place you
want to annotate. This is a tuple containing two values, x and y.
Example :
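The example code is missing from the notes. The sketch below annotates the maximum of a sine curve. The choice of arguments (the annotation text, xy, xytext and arrowprops) is an assumption about which parameters the notes had in mind; all four are real annotate() arguments.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

note = ax.annotate(
    'local maximum',                 # text of the annotation
    xy=(np.pi / 2, 1.0),             # point the arrowhead points to
    xytext=(3, 1.3),                 # where the text itself is placed
    arrowprops=dict(facecolor='black', shrink=0.05),  # arrow styling
)
ax.set_ylim(-1.5, 1.8)               # leave room for the text
plt.show()
```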
This can be done manually with the plt.text/ax.text command, which will place text at a
particular x/y value:
Example :
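A minimal sketch of placing text at a data coordinate with ax.text() is shown here; the data points and label wording are invented for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 25], marker='o')

# Place text at x=2, y=21 (data coordinates), centered horizontally
label = ax.text(2, 21, 'peak at x=2', ha='center')
plt.show()
```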
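• Two imports are used for 3D plotting. The first is the standard import statement for
plotting with Matplotlib (import matplotlib.pyplot as plt), which you would see for 2D
plotting as well. The second, the import of the Axes3D class (from mpl_toolkits.mplot3d
import Axes3D), is required for enabling 3D projections in older Matplotlib versions (before
3.2). It is, otherwise, not used anywhere else.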
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
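Once a 3D axes exists, the usual plotting methods accept a third coordinate. The following is a minimal sketch that draws a helix; the curve is an assumed example, and the Axes3D import is kept only for compatibility with older Matplotlib versions.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers '3d' on old versions

fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111, projection='3d')

t = np.linspace(0, 4 * np.pi, 200)       # parameter along the curve
ax.plot(np.cos(t), np.sin(t), t)         # x, y and z coordinates of a helix
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
```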
The purpose of plotting geographic maps in data visualization is to spatially represent and
analyze data across geographical locations. Geographic maps provide a visual and intuitive
way to explore patterns, trends, and relationships related to location-based data. Here are
several key purposes for plotting geographic maps:
Geographic maps allow for the visualization of how data is distributed across different
regions or locations. Example: Analyzing the spatial distribution of population density,
income levels, or disease outbreaks in different areas.
Maps help in identifying spatial patterns and trends that may not be immediately
apparent in tabular data. Example: Visualizing the migration patterns of wildlife,
identifying hotspots of criminal activity, or observing regional variations in climate.
Maps facilitate the exploration of relationships between different geographic entities.
Example: Understanding trade relationships between countries, exploring
transportation networks, or analyzing connections between different landmarks.
Geographic maps are useful for analyzing demographic data in different regions.
Example: Visualizing age distribution, educational attainment, or cultural diversity across
cities or countries.
Maps help identify spatial correlation and clustering of data points. Example: Mapping
the distribution of customer locations to identify potential markets, or visualizing
clusters of disease cases for epidemiological studies.
• Basemap is a toolkit for the Python visualization library Matplotlib. Its main function is
to draw 2D maps, which are important for visualizing spatial data. Basemap itself does not
do any plotting, but provides the ability to transform coordinates into one of 25 different
map projections.
• Matplotlib is then used to plot contours, images, vectors, lines or points in the
transformed coordinates. Basemap includes the GSHHS coastline dataset, as well as datasets
from GMT for rivers, states and national boundaries.
• These datasets can be used to plot coastlines, rivers and political boundaries on a map at
several different resolutions. Under the hood, Basemap uses the Geometry Engine-Open Source
(GEOS) library to clip coastline and boundary features to the desired map projection
area. In addition, Basemap provides the ability to read shapefiles.
• For example, if we wanted to show all the different types of endangered plants within a
region, we would use a base map showing roads, provincial and state boundaries, waterways
and elevation. Onto this base map, we could add layers that show the location of different
categories of endangered plants. One added layer could be trees, another layer could be
mosses and lichens, another layer could be grasses.
The most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto
a map background. For simple plotting and text, any plt function works on the map; you can
use the Basemap instance to project latitude and longitude coordinates to (x, y) coordinates
for plotting with plt.
Seaborn plots
• Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics. Seaborn is an open-
source Python library.
• Seaborn helps you explore and understand your data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots.
• Key features:
a) Seaborn is a statistical plotting library
Seaborn works easily with dataframes and the Pandas library. The graphs created can also be
customized easily.
d) Options for visualizing univariate or bivariate distributions and for comparing them
between subsets of data
e) Automatic estimation and plotting of linear regression models for different kinds of
dependent variables
f) High-level abstractions for structuring multi-plot grids that let you easily build complex
visualizations
g) Concise control over matplotlib figure styling with several built-in themes
h) Tools for choosing color palettes that faithfully reveal patterns in your data.
Some plots:
1.) Scatter plot
2.) Heatmap
3.) Pairplot
Matplotlib vs seaborn:
Limitations of Seaborn:
Focused on Statistics: While Seaborn is excellent for statistical data visualization, it may not
offer the same level of flexibility as Matplotlib for general-purpose plotting.
Learning Curve: Users familiar with Matplotlib might need some time to adjust to Seaborn's
conventions and functions, especially if they are used to more granular control over plot
customization.
Plotly plots
Plotly library in Python is an open-source library that can be used for data visualization and
understanding data simply and easily. Plotly supports various types of plots like line charts,
scatter plots, histograms, box plots, etc.
While Matplotlib has long been the foundation for static visualizations, Plotly steps onto
the stage with a focus on interactivity and modern web-based charts, offering a vibrant
alternative to traditional data plotting tools.
Key features:
Drawbacks:
1. Web Dependencies: While Plotly is excellent for web-based projects, it may not
be the most suitable choice for creating static, publication-quality images without
additional steps or libraries.
2. Performance with Large Datasets: Plotly may experience performance issues with
extremely large datasets or complex visualizations, potentially leading to slower
rendering times.
Matplotlib vs plotly
Feature: Interactivity
Plotly: Offers high interactivity with zooming, panning, and hover features out of the
box. Well-suited for web-based applications.
Matplotlib: Basic interactivity capabilities. May require additional effort to achieve
high interactivity.
Feature: Ease of Use
Plotly: Plotly Express provides a high-level interface with concise syntax for creating
complex visualizations.
Matplotlib: More granular control over plot elements but often requires more code for
complex visualizations.
Feature: Web Integration
Plotly: Well-suited for web-based applications. Can be easily embedded in web pages and
used to create interactive dashboards.
Matplotlib: Primarily designed for static plotting. Integration into web applications may
require additional steps.
Feature: Plot Types
Plotly: Supports a wide range of plot types, including 3D plots, geographic maps, and
various interactive charts.
Matplotlib: Offers standard 2D plots and limited 3D capabilities.
Feature: Community Support
Plotly: Has a growing community and is widely used in data science and web development.
Matplotlib: Well-established with a large and mature community.
Exploring ggplot
ggplot2 is built on the philosophy of the Grammar of Graphics, a systematic approach to
describing and building complex visualizations through a consistent and structured grammar.
Advantages of ggplot2:
1. Declarative Syntax: ggplot2 uses a declarative syntax, allowing users to express what
they want in their plot rather than specifying how to achieve it. This results in concise
and expressive code.
2. Layered Structure: Plots in ggplot2 are built layer by layer, making it easy to add
components such as points, lines, and annotations. This layering system contributes to
the flexibility and extensibility of the library.
3. Faceting: ggplot2 supports faceting, allowing users to create multiple plots based on
the levels of a categorical variable. This is useful for exploring how relationships differ
across subgroups.
4. Themes and Customization: The library provides a variety of themes to control the
overall appearance of plots. Additionally, users can customize almost every aspect of
a plot to match specific requirements.
Limitations of ggplot2:
Learning Curve: For users new to the Grammar of Graphics philosophy, there might be
a learning curve in understanding the different components and layers. However, once
mastered, it can lead to more efficient and expressive code.
Exploring PyViz
PyViz is not a single library but a collection of tools and libraries that work together to create
a holistic visualization ecosystem. It includes libraries such as HoloViews, GeoViews, Panel,
and others.
PyViz libraries often use a declarative syntax, allowing users to express what they want to
visualize rather than specifying how to create the visualization. This can lead to more concise
and expressive code.
PyViz emphasizes the creation of interactive dashboards with widgets for user interaction.
This is particularly useful for creating dynamic and responsive visualizations.
Advantages of PyViz
Limitations of PyViz
1. Learning Curve: While PyViz aims to simplify the process of creating visualizations,
there might still be a learning curve for users new to the ecosystem, especially when
working with multiple libraries.
2. Community Size: The PyViz community, while growing, might not be as extensive as
some other visualization ecosystems. This can impact the availability of community-
generated resources and support.
3. Matplotlib Integration: While PyViz can integrate with Matplotlib, it might not provide
the same level of fine-grained control over plot elements as using Matplotlib directly.
Bokeh
Bokeh specializes in creating interactive and dynamic visualizations with features like
zooming, panning, and hovering over data points.
Bokeh is designed to be embedded in web applications, making it suitable for creating
interactive dashboards and web-based data visualizations.
Advantages:
Limitations:
1. Learning Curve: Bokeh may have a learning curve, especially for users new to web-
based plotting libraries, as it involves understanding concepts like Bokeh models and
layouts.
2. Plot Customization: While Bokeh offers good customization options, achieving highly
specific plot configurations may require more effort compared to more granular
libraries.
Panel
Panel is a high-level app and dashboarding framework built on top of Bokeh, making it easy
to create complex dashboards and applications.
Panel provides a component-based layout system that allows users to build dashboards using
a combination of charts, widgets, and custom HTML components.
Advantages:
Limitations:
Advantages of Yellowbrick
Limitations of Yellowbrick
1. Domain-Specific: Yellowbrick is primarily focused on visualizations for model
evaluation. Users seeking more general-purpose data visualization or plotting
capabilities may need to complement it with other libraries like Matplotlib or Seaborn.
2. Learning Curve: While Yellowbrick aims to simplify the process of model evaluation,
users who are not familiar with the underlying concepts of machine learning evaluation
metrics may still require some learning to interpret the visualizations effectively.
3. Availability of Visualizers: Some specific models or tasks may not have dedicated
visualizers within the Yellowbrick library. Users might need to explore additional tools
or create custom visualizations for specific requirements.