
11 Data Visualization

This document discusses data visualization using the ggplot2 package in R. It covers several key topics: 1. An introduction to ggplot2 and its grammar for creating statistical graphics from data mappings. 2. The importance of data visualization, demonstrated through Anscombe's quartet examples. 3. How to create line and path plots to show changes over time using geom_line() and geom_path(). 4. How to create a bar chart to show a categorical variable using geom_bar(). 5. It will next cover plotting a continuous variable.

Uploaded by

YEOW
Copyright
© © All Rights Reserved

CT127-3-2 Programming for Data Analysis

DATA VISUALIZATION

Module Code & Module Title Slide Title SLIDE 1


TOPIC LEARNING OUTCOMES

At the end of this topic, you should be able to:

• Understand how to use the ggplot2 package for data visualization.



Contents & Structure
- Data Visualization
- ggplot2 package
- Line and path plots
- Bar chart
- Histogram
- Frequency Polygon
- Box plot
- Scatter plot
- Count plot
- Using colors and shapes in plots
- Axis and plot labels
- Facetting



1. Data Visualization using the ggplot2 package


Data Visualization

• Data visualization is the graphic representation of data.
• Graphics are used in statistics primarily for two reasons:
exploratory data analysis (EDA) and presenting results.
• The ggplot2 package is widely used to perform data
visualization in R.
• To install ggplot2: install.packages("ggplot2")
• To load ggplot2: library(ggplot2)


ggplot2 package

• Unlike most other graphics packages, ggplot2 has a deep
underlying grammar.
• This grammar is made up of a set of independent
components that can be composed in many different ways.
This makes ggplot2 very powerful because you are not
limited to a set of pre-specified graphics, but you can create
new graphics that are precisely tailored for your problem.
• In brief, the grammar tells us that a statistical graphic is a mapping
from data to aesthetic attributes (colour, shape, size) of geometric
objects (points, lines, bars).


ggplot2 package

• Every ggplot2 plot has three key components:
1. Data,
2. A set of aesthetic mappings between variables in the data and
visual properties, and
3. At least one layer which describes how to render each
observation.

• The basic structure for ggplot2 starts with the ggplot function,
which takes the data as its first argument. After that, layers can
be added using the + symbol.


ggplot2 package

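Here’s a simple example:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

The + must end the first line: if it starts the next line instead, R evaluates
the first line on its own and the layer is never added.

This produces a scatterplot defined by:

1. Data: mpg.
2. Aesthetic mapping: engine size (displ) mapped to x position, highway
miles per gallon (hwy) to y position.
3. Layer: points.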
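
Because layers are added with +, the same plot can also be built up in steps. The sketch below stores the data and mappings in a variable and then adds layers one at a time. The variable names (base, scatter, with_trend) and the choice of geom_smooth(method = "lm") as a second layer are illustrative assumptions, not part of the original slides.

```r
library(ggplot2)

# Data and aesthetic mappings only: no layer yet, so printing this shows empty axes
base <- ggplot(mpg, aes(x = displ, y = hwy))

# First layer: one point per observation
scatter <- base + geom_point()

# Second layer (illustrative): a straight-line trend fitted with lm(),
# drawn without the confidence band
with_trend <- scatter +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(with_trend)
```

Each intermediate object is a complete plot specification, so base, scatter, and with_trend can be printed or extended independently.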

2. Importance of Data Visualization



Anscombe's quartet

• Frank Anscombe constructed four datasets. Each
dataset consists of eleven (x, y) points. These
datasets have nearly identical simple descriptive
statistics.
• Anscombe's quartet demonstrates both the
importance of graphing data before analyzing it and
the effect of outliers and other influential
observations on statistical properties.

Image: Frank Anscombe (https://en.wikipedia.org/wiki/Frank_Anscombe)

Anscombe's quartet

https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Anscombe's quartet

The four panels of the figure, numbered 1 to 4, show:

1. Two variables that are correlated and follow the assumption of normality.
2. Data that are not normally distributed, with a non-linear relationship.
3. A perfect linear relationship, except for one outlier which lowers the
correlation coefficient from 1 to 0.816.
4. One outlier that is enough to produce a high correlation coefficient,
even though the relationship between the two variables is not linear.

https://en.wikipedia.org/wiki/Anscombe%27s_quartet
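
The claim can be checked directly in R, which ships the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4). The following is a minimal sketch; the helper names quartet_stats and long are assumptions made for illustration.

```r
library(ggplot2)

# Summary statistics for each of the four (x, y) pairs
quartet_stats <- data.frame(
  dataset = 1:4,
  mean_x  = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y  = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  cor_xy  = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# The means and correlations are nearly identical across the four datasets
print(round(quartet_stats, 2))

# Stack the four datasets into long form so they can be plotted side by side
long <- data.frame(
  dataset = rep(1:4, each = nrow(anscombe)),
  x = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# One scatterplot per dataset: the differences are obvious once plotted
ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~dataset)
```

The numbers alone would suggest four interchangeable datasets; only the plots reveal the curve, the outlier, and the single influential point.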

3. Line and Path plots



Line and Path plots

• Line and path plots are typically used for time series data.

• Line plots join the points from left to right, while path plots join
them in the order that they appear in the dataset.

• Line plots usually have time on the x-axis, showing how a single
variable has changed over time. Path plots show how two variables
have simultaneously changed over time, with time encoded in the
way that observations are connected.



Line and Path plots
• The geom_line() function in the ggplot2 package is used to plot line
graphs.
• geom_line() connects the observations in order of the
variable on the x axis.
ggplot(economics, aes(x=date, y=pop)) +
geom_line()


Line and Path plots
geom_path() connects observations in original order.
ggplot(economics, aes(unemploy / pop, uempmed, colour = date)) +
geom_path()
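
The difference between the two geoms is easiest to see on data that is not sorted by x. The sketch below uses a small hypothetical data frame (toy, invented for illustration) and inspects the order in which each geom will connect the points.

```r
library(ggplot2)

# Hypothetical toy data: the rows are deliberately NOT sorted by x
toy <- data.frame(x = c(1, 3, 2, 4), y = c(1, 3, 2, 1))

# geom_line() joins the points from left to right along the x axis
p_line <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# geom_path() joins the points in the order of the rows: x = 1, 3, 2, 4
p_path <- ggplot(toy, aes(x = x, y = y)) + geom_path()

# layer_data() returns the data as each geom will draw it
layer_data(p_line)$x
layer_data(p_path)$x
```

For a time series already sorted by date, such as economics, both geoms draw the same line; they differ only when the row order and the x order disagree.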

4. Plotting a Categorical Variable using Bar chart


Bar chart
• A bar chart displays a categorical variable as bars whose heights are
proportional to the values they represent.
• ggplot() with geom_bar() in the ggplot2 package are used to
plot bar charts.
• geom_bar() makes the height of each bar proportional to the number of
cases in each group.
ggplot(diamonds, aes(x = cut)) +
geom_bar()
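
Since geom_bar() does the counting itself, its bar heights should match a manual tabulation of the variable. The sketch below checks this and also shows geom_col(), the ggplot2 geom for data whose bar heights are already computed. The names p_counted, cut_counts, n, and p_precomputed are illustrative assumptions.

```r
library(ggplot2)

# geom_bar() counts the rows in each cut category
p_counted <- ggplot(diamonds, aes(x = cut)) + geom_bar()

# The same counts computed by hand with table()
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")
print(cut_counts)

# geom_col() uses the supplied y values directly as the bar heights
p_precomputed <- ggplot(cut_counts, aes(x = cut, y = n)) + geom_col()

# The counts used by geom_bar() agree with the manual tabulation
all(layer_data(p_counted)$count == cut_counts$n)
```

Use geom_bar() for raw observations and geom_col() when the summary values are already in the data.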

5. Plotting a Continuous Variable using Histogram/Frequency Polygon/Boxplot


Histogram

• A histogram shows the distribution of values for a variable.
• Histograms break the data into buckets and the heights of
the bars represent the number of observations that fall into
each bucket.


Histogram
ggplot() with geom_histogram() in the ggplot2 package are used to plot
histograms.
ggplot(diamonds, aes(x=carat)) +
geom_histogram()
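
Called without arguments, geom_histogram() falls back to 30 buckets and prints a message suggesting that a better value be chosen. A minimal sketch of setting the bucket width explicitly follows; the width of 0.1 carat is an arbitrary choice for illustration.

```r
library(ggplot2)

# binwidth is expressed in the units of the x variable (carats here)
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

print(p_hist)

# Every diamond falls into exactly one bucket, so the bar heights
# add up to the number of rows in the data
sum(layer_data(p_hist)$count) == nrow(diamonds)
```

Trying several bin widths is worthwhile, because a width that is too large hides detail and one that is too small produces a noisy plot.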



Frequency Polygon

• Like histograms, frequency polygons show the distribution of a single
numeric variable.
• Histograms use bars and frequency polygons use lines.

ggplot(diamonds, aes(x=carat)) +
geom_freqpoly()


Boxplot

• A boxplot depicts groups of numerical data through their quartiles.
• The middle line in the boxplot represents the median and the
box is bounded by the first and third quartiles.
• The Interquartile Range (IQR) represents the middle 50% of
data.

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

Boxplot

• A series of hourly temperatures were measured throughout the
day in degrees Fahrenheit.
• The recorded values are listed in order as follows: 64, 64, 64,
65, 70, 73, 73, 74, 74, 75, 76, 77, 77, 77, 77, 79, 80, 82, 82,
83, 83, 85, 86, 88.
> summary(temp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
64.00 73.00 77.00 76.17 82.00 88.00
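
The summary above can be reproduced by entering the 24 readings as a vector. The sketch below also computes the IQR and the usual 1.5 × IQR fences that boxplots use to flag outliers. That fence rule is the standard boxplot convention and is not stated on the slide, and the names iqr, lower_fence, upper_fence, and outliers are illustrative.

```r
library(ggplot2)

# The 24 hourly temperatures (degrees Fahrenheit) listed above
temp <- c(64, 64, 64, 65, 70, 73, 73, 74, 74, 75, 76, 77,
          77, 77, 77, 79, 80, 82, 82, 83, 83, 85, 86, 88)

summary(temp)                 # minimum, quartiles, median, mean, maximum

iqr <- IQR(temp)              # third quartile minus first quartile: 82 - 73 = 9

# Conventional boxplot fences: 1.5 * IQR beyond the quartiles
lower_fence <- unname(quantile(temp, 0.25)) - 1.5 * iqr   # 73 - 13.5 = 59.5
upper_fence <- unname(quantile(temp, 0.75)) + 1.5 * iqr   # 82 + 13.5 = 95.5

# No reading lies outside the fences, so the boxplot shows no outlier points
outliers <- temp[temp < lower_fence | temp > upper_fence]

ggplot(data.frame(temp = temp), aes(y = temp)) +
  geom_boxplot()
```

The box spans 73 to 82 with the median line at 77, and the whiskers reach the minimum (64) and maximum (88) because both lie inside the fences.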



Boxplot
ggplot() with geom_boxplot() in the ggplot2 package are used to produce
boxplots.
ggplot(diamonds, aes(y=carat, x=1)) +
geom_boxplot()

Although the plot is one-dimensional and uses only a y aesthetic, a constant
x aesthetic (here 1) is supplied as a placeholder. Recent versions of ggplot2
also accept the y aesthetic alone.

Boxplot

https://r4ds.had.co.nz/exploratory-data-analysis.html#missing-values-2

6. Visualize the covariation between two continuous variables using Scatterplot


Scatterplot

• A scatterplot is a diagram used to visualize the
covariation between two continuous variables. Covariation is
the tendency of two or more variables to vary together.
• Every point represents one observation of two variables,
where the x-axis represents one variable and the y-axis
the other.


Scatterplot
• ggplot() with geom_point() in the ggplot2 package are used to create
scatterplots.
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point()

7. Visualize the covariation between two categorical variables using Count plot


Count plot

The covariation between two categorical variables can be explored by counting the
number of observations for each combination.
ggplot(diamonds, aes(x = cut, y = color)) +
geom_count() +
labs(title="The co-variation between diamond's cut quality and color",
x="cut", y="color")

In the legend, n represents how many observations occurred
at each combination of values.
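
The counts behind the dots can be inspected with a cross-tabulation. The following sketch assumes the illustrative names combo_counts and p_count.

```r
library(ggplot2)

# Cross-tabulate the two categorical variables:
# rows are cut levels, columns are color levels
combo_counts <- table(cut = diamonds$cut, color = diamonds$color)
print(combo_counts)

# Each cell of the table corresponds to one dot in the count plot,
# and the dot size encodes the cell count (n)
p_count <- ggplot(diamonds, aes(x = cut, y = color)) + geom_count()

# The dot counts and the table both account for every row exactly once
sum(layer_data(p_count)$n) == sum(combo_counts)
```

The table gives exact numbers, while the count plot makes the largest and smallest combinations easy to spot at a glance.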

8. Visualize the covariation between a categorical and a continuous variable using Box plot

Box plot

The covariation between a categorical and a continuous variable can be explored
using boxplots.
ggplot(diamonds, aes(x = color, y = price)) +
geom_boxplot() +
labs(title="The co-variation between diamond's color and price",
x="color", y="price")

9. More about plots


Using colors in plots

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point()


Using colors in plots

ggplot(data=diamonds, aes(x=carat)) +
geom_histogram(col="white", fill="blue")
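
The two colour examples so far work differently: a colour placed inside aes() is mapped from a variable, while a colour passed directly to the geom is set to a fixed value. A minimal sketch contrasting the two on the same scatterplot follows; p_mapped and p_fixed are illustrative names.

```r
library(ggplot2)

# Mapped: the colour varies with the cut variable and a legend is drawn
p_mapped <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point()

# Set: every point gets the same fixed colour and no legend is drawn
p_fixed <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(colour = "blue")

# Colours actually used by each plot
unique(layer_data(p_fixed)$colour)
length(unique(layer_data(p_mapped)$colour))
```

ggplot2 accepts both spellings, colour and color, as well as the abbreviation col used in the histogram example.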



Using colors in plots

ggplot(diamonds, aes(carat, colour = cut)) +
geom_freqpoly()


Using shapes in plots

ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=cut)) +
geom_point()


Axis and plot labels
ggplot(data=diamonds, aes(carat)) +
geom_histogram(col="white", fill="blue") +
labs(title="Histogram for carat", x="Carat", y="Count")


Facetting

• Facetting creates tables of graphics by splitting the data into subsets and
displaying the same graph for each subset.
• To facet a plot you simply add a facetting specification with
facet_wrap(), which takes the name of a variable preceded by ~.
ggplot(diamonds, aes(x=carat)) +
geom_histogram() +
facet_wrap(~cut)
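
facet_wrap() creates one panel per level of the facetting variable and accepts layout arguments such as ncol. The sketch below assumes an arbitrary bin width of 0.1 and a two-column layout, both chosen only for illustration.

```r
library(ggplot2)

p_facets <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~cut, ncol = 2)   # ncol sets the number of panel columns

print(p_facets)

# One panel is created for each level of cut
length(unique(layer_data(p_facets)$PANEL)) == nlevels(diamonds$cut)
```

All panels share the same axes by default, which makes the subsets directly comparable.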



Facetting
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
facet_wrap(~cut)



Quick Review Questions
• Describe the following plots and how to use them in R:
– Line plot
– Path plot
– Bar chart
– Histogram plot
– Frequency polygon plot
– Box plot
– Scatter plot
– Count plot
• How do you use colors and shapes in plots?
• How do you give axis and plot labels in R?
• How do you facet a plot in R?


Summary of Main Teaching Points
• Line and path plots
• Bar chart
• Histogram
• Frequency Polygons
• Box plot
• Scatter plot
• Count plot
• Using colors and shapes in plots
• Axis and plot labels
• Facetting



What To Expect Next Week

In Class
• Data Manipulation

Preparation for Class
• Various manipulation functions