R Programming Unit 3
UNIT – 3
STATISTICS & PROBABILITY
R Data Visualization
In R, we can create visually appealing data visualizations with a few lines of code, using R's graphics functions and add-on packages. Data visualization is an efficient technique for
gaining insight into data through a visual medium. With the help of visualization techniques, a
reader can spot hidden patterns in data that might otherwise be overlooked, and can
extract key insights from large datasets efficiently.
R Visualization Packages
R provides a series of packages for data visualization. These packages are as follows:
1) plotly
The plotly package creates interactive, high-quality graphs for the web. This package builds on the
JavaScript library plotly.js.
2) ggplot2
R allows us to create graphics declaratively. R provides the ggplot2 package for this purpose. This
package is known for its elegant, high-quality graphs, which set it apart from other visualization
packages.
3) tidyquant
tidyquant is a package for carrying out quantitative financial analysis.
It fits into the tidyverse and is used for importing,
analyzing, and visualizing financial data.
4) taucharts
Data plays an important role in taucharts. The library provides a declarative interface for rapid
mapping of data fields to visual properties.
5) ggiraph
It is a tool that allows us to create dynamic ggplot graphs. This package allows us to add tooltips,
JavaScript actions, and animations to the graphics.
6) geofacet
This package provides geofaceting functionality for 'ggplot2'. Geofaceting arranges a sequence of
plots for different geographical entities into a grid that preserves some of the geographical
orientation.
7) googleVis
googleVis provides an interface between R and Google's charts tools. With the help of this
package, we can create web pages with interactive charts based on R data frames.
8) RColorBrewer
This package provides color schemes for maps and other graphics, which are designed by Cynthia
Brewer.
9) dygraphs
The dygraphs package is an R interface to the dygraphs JavaScript charting library. It provides
rich features for charting time-series data in R.
10) shiny
R Graphics
Graphics play an important role in bringing out the important features of the data. Graphics are
used to examine marginal distributions, relationships between variables, and summaries of very
large datasets. They are an important complement to many statistical and computational techniques.
Standard Graphics
R's standard graphics are available through the graphics package, which includes several functions that
produce statistical plots, such as:
o Scatterplots
o Pie charts
o Boxplots
o Barplots etc.
Each of these graphs is typically produced with a single function call.
There are some key elements of a statistical graphic. These elements are the basics of the grammar
of graphics. Let's discuss each of the elements one by one to gain the basic knowledge of graphics.
1) Data
Data is the most crucial thing which is processed and generates an output.
2) Aesthetic Mappings
Aesthetic mappings are one of the most important elements of a statistical graphic. They control the
relation between graphics variables and data variables. In a scatter plot, for example, an aesthetic
mapping can map the temperature variable of a data set to the x variable, or the species of a plant
to the color of the dots.
3) Geometric Objects
Geometric objects express each observation visually, for example as a point, using the aesthetic mappings. In a
scatter plot, two variables in the data set are mapped to the x and y coordinates of each point.
4) Statistical Transformations
Statistical transformations summarize the data before it is drawn in the plot. For example, a
statistical transformation can approximate the data with a regression line in the
x, y coordinates, or count the occurrences of certain values.
5) Scales
Scales map the data values to values in the coordinate system of the graphics device.
6) Coordinate system
The coordinate system plays an important role in the plotting of the data.
o Cartesian
o Polar
7) Faceting
Faceting is used to split the data into subgroups and draw sub-graphs for each group.
Visualization makes business information more appealing to look at, and graphics
and charts are easier to understand than a written document full of text and numbers. It can therefore
attract a wider audience and promote the widespread use of business insights, which leads to better
decisions.
2. Efficiency
Visualization allows us to display a lot of information in a small space. Although the
decision-making process in business is inherently complex and multifunctional, displaying evaluation
findings in a graph allows companies to organize a lot of interrelated information in useful
ways.
3. Location
Visualizations that use features such as geographic maps and GIS are particularly relevant to
business when location is an important factor. Maps can show business insights
from various locations, along with the seriousness of the issues, the reasons behind them, and
the working groups that address them.
1. Cost
Developing visualization applications in R can cost a good amount of money. Small companies in
particular may not be able to spend many resources on them. To generate reports, many
companies employ professionals to create charts, which increases costs. Small enterprises
often operate in resource-limited settings, yet timely evaluation results
can be of high importance to them.
2. Distraction
At times, data visualization applications create highly complex, graphics-rich reports
and charts, which may entice users to focus more on form than on function. If visual appeal is put
first, the overall value of the graphic representation will be minimal. In a resource-limited
setting, it is important to understand how resources can best be used and not to get caught up in the
graphics trend without a clear purpose.
R Pie Charts
The R programming language has several libraries for creating charts and graphs. A pie chart
represents values as slices of a circle in different colors. Slices are labeled
with a description, and the number corresponding to each slice can also be shown in the chart.
However, the R documentation does not recommend pie charts and notes that their features are
limited. Its authors recommend a bar chart or dot plot over a pie chart because people are able to judge
length more accurately than volume.
Pie charts are created with the pie() function, which takes a vector of positive numbers as
input. Additional parameters control labels, colors, titles, etc. The commonly used arguments are:
pie(x, labels, radius, main, col, clockwise)
Here,
1. x is a vector that contains the numeric values used in the pie chart.
2. labels gives the descriptions of the slices.
3. radius sets the radius of the pie chart.
4. main gives the title of the chart.
5. col defines the color palette.
6. clockwise is a logical value that indicates whether the slices are drawn clockwise or
anti-clockwise.
Note: The length of the palette should be the same as the number of values that we have for the chart.
We can use the length() function to obtain that number.
Let's see an example to understand how these methods work in creating an attractive pie chart with
title and color.
Example
Output:
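The pie() call described above can be sketched as follows. The city names and values are hypothetical data chosen only for illustration, and rainbow() is just one convenient way to build a palette of the right length.

```r
# Hypothetical data: values for four cities (illustrative only)
x <- c(20, 65, 15, 50)
city <- c("Delhi", "Mumbai", "Pune", "Chennai")

# One color per slice, so the palette length matches the data length
slice_cols <- rainbow(length(x))

# Draw the pie chart with labels, a title, and the color palette
pie(x, labels = city, main = "City pie chart", col = slice_cols)
```

To save the chart to a file instead of the screen, the pie() call can be wrapped between png(file = "...") and dev.off().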
There are two additional properties of a pie chart: slice percentages and a chart legend. We can
show the data as percentages, and we can add legends to plots in R by using the
legend() function, which has the following syntax:
legend(x, y = NULL, legend, fill, col, bg)
Here,
Example
Output:
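A minimal sketch of slice percentages combined with a legend, using hypothetical city data. The legend position "topright" and the text size given by cex are arbitrary choices, not requirements.

```r
# Hypothetical data: values for four cities (illustrative only)
x <- c(20, 65, 15, 50)
city <- c("Delhi", "Mumbai", "Pune", "Chennai")
slice_cols <- rainbow(length(x))

# Convert each value to a percentage of the total, rounded to one decimal place
pct <- round(100 * x / sum(x), 1)

# Label the slices with percentages and identify the cities in a legend
pie(x, labels = paste0(pct, "%"), main = "City pie chart", col = slice_cols)
legend("topright", legend = city, fill = slice_cols, cex = 0.8)
```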
R Bar Charts
A bar chart is a pictorial representation in which the numerical values of variables are represented by
the length or height of rectangles of equal width. A bar chart is used for summarizing a set of
categorical data: the length of each bar is proportional to the value of the variable it represents.
In R, we can create a bar chart to visualize the data in an efficient manner. For this purpose, R
provides the barplot() function, which has the following syntax:
barplot(H, xlab, ylab, main, names.arg, col)
1. H A vector or matrix which contains numeric values used in the bar chart.
Like pie charts, bar charts gain more functionality by passing more arguments
to the barplot() function. We can add a title to the bar chart or color the bars with
the main and col parameters, respectively. Another parameter, names.arg, is
a vector with the same number of values as the input vector; it describes the
meaning of each bar.
Let's see an example to understand how labels, titles, and colors are added in our bar chart.
Example
Output:
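A short sketch of a labeled, titled, and colored bar chart. The revenue figures and the color name are hypothetical choices for illustration.

```r
# Hypothetical monthly revenue figures (illustrative only)
revenue <- c(12, 5, 15, 18, 9)
months <- c("Jan", "Feb", "Mar", "Apr", "May")

# names.arg labels each bar, main sets the title, col and border set the colors
mids <- barplot(revenue, names.arg = months, xlab = "Month", ylab = "Revenue",
                main = "Revenue bar chart", col = "steelblue", border = "black")

# barplot() returns the x positions of the bar midpoints
print(mids)
```

The returned midpoints are useful when text or other annotations need to be placed above individual bars.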
We can create bar charts with groups of bars and stacks using matrices as input values in each bar.
One or more variables are represented as a matrix that is used to construct group bar charts and
stacked bar charts.
Example
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
colors <- c("cadetblue3", "deeppink2", "goldenrod1")
# Creating the matrix of the values: one row per region, one column per month
Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16), nrow = 3, ncol = 5, byrow = TRUE)
# Giving the chart file a name
png(file = "stacked_chart.png")
# Creating the stacked bar chart
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month", ylab = "Revenue", col = colors)
# Adding a legend that maps the colors to the regions
legend("topleft", regions, cex = 1.3, fill = colors)
# Saving the file
dev.off()
Output:
R Boxplot
A boxplot shows how the values in a data set are distributed. It summarizes the data with five
numbers: the minimum, the first quartile, the median, the third quartile, and the
maximum. Boxplots are also useful for comparing the distributions of several groups
by drawing a boxplot for each of them.
R provides the boxplot() function to create a boxplot. It has the following
syntax:
boxplot(x, data, notch, varwidth, names, main)
Here,
1. x It is a vector or a formula.
4. varwidth It is also a logical value; set it to TRUE to draw each box with a width proportional to the square root of its sample size.
5. names It is the group of labels that will be printed under each boxplot.
Let's see an example to understand how we can create a boxplot in R. In the example below, we
will use the "mtcars" dataset that ships with R. We will use only two of its columns,
"mpg" and "cyl". The example creates a boxplot of the relation between mpg and
cyl, i.e., miles per gallon and number of cylinders, respectively.
Example
Output:
Example
Output:
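A minimal sketch of the mtcars boxplot described above. The axis labels and title are illustrative choices; mtcars itself ships with R.

```r
# mtcars ships with R; mpg is miles per gallon, cyl is the number of cylinders
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "Miles per gallon",
              main = "Mileage data", varwidth = TRUE)

# boxplot() invisibly returns the statistics it drew: one column per group,
# rows are lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$names)
print(bp$stats)
```

The formula mpg ~ cyl reads as "mpg grouped by cyl", so one box is drawn for each distinct cylinder count.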
R Histogram
A histogram is a type of bar chart that shows how many values fall into each of a set of
value ranges (bins). A histogram is used to show a distribution, whereas a bar
chart is used for comparing different entities. In a histogram, the height of each bar represents
the number of values present in the given range.
For creating a histogram, R provides the hist() function, which takes a vector as input and accepts
further parameters that add functionality. The hist() function has the following syntax:
hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)
Here,
Let's see an example in which we create a simple histogram with the help of parameters
like v, main, col, etc.
Example
Output:
Let's see some more examples that use different parameters of the hist() function to
add more functionality or to create a more attractive chart.
Output:
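A short sketch of hist() with several of the parameters listed in the syntax above. The input vector is hypothetical, and breaks = 5 is only a suggestion to R for the number of bins.

```r
# Hypothetical measurements (illustrative only)
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# breaks suggests the number of bins; col and border style the bars
h <- hist(v, main = "Histogram of v", xlab = "Weight", ylab = "Frequency",
          col = "lightgreen", border = "black", breaks = 5)

# hist() returns the bin boundaries and the count in each bin
print(h$breaks)
print(h$counts)
```

Every value lands in exactly one bin, so the counts always add up to length(v).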
R Line Graphs
A line graph (also called a line chart) is a pictorial representation of information that changes
continuously over time. It connects a series of data points with line segments, so the line moves
up and down with the data. Line charts are used to identify trends in data and to compare
different events, information, and situations. For line graph construction, R provides the plot()
function, which has the following syntax:
plot(v, type, col, xlab, ylab)
Here,
2. type This parameter takes the value "l" to draw only the lines, "p" to draw only the
points, and "o" to draw both lines and points.
6. col It is used to give the color for both the points and lines.
Let’s see a basic example to understand how plot() function is used to create the line graph:
Example
Output:
Like other graphs and charts, a line chart gains more features through additional parameters.
We can color the lines and points, label the axes, and give the chart a title.
Let's see an example to understand how these parameters are used in the plot() function to create an
attractive line graph.
Example
Output:
More than one line can be drawn on the same chart with the lines() function, which takes an
additional input vector and adds a line for it to the existing plot. Let's see an example to
understand how this function is used:
Example
Output:
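A minimal sketch of a two-series line chart built with plot() and lines(). Both vectors are hypothetical data, and the shared ylim is an added precaution so that the second series is not clipped.

```r
# Hypothetical counts for two series over five months (illustrative only)
v <- c(13, 22, 28, 7, 31)
w <- c(11, 13, 32, 6, 35)

# Shared y-axis limits so that both series fit inside the plot region
y_range <- range(c(v, w))

# type = "o" draws both points and lines
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Articles written",
     main = "Article chart", ylim = y_range)

# lines() adds the second series to the plot that is already open
lines(w, type = "o", col = "blue")
```

Note that lines() cannot start a plot on its own; it only adds to a plot created by an earlier plot() call.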
Let's see an example to understand how ggplot2 is used to create a line graph. The example
below uses a small data frame based on the predefined ToothGrowth dataset, which describes the effect of vitamin
C on tooth growth in guinea pigs.
Example
library(ggplot2)
# Creating data for the graph
data_frame <- data.frame(dose = c("D0.5", "D1", "D2"),
                         len = c(4.2, 10, 29.5))
head(data_frame)
png(file = "multi_line_graph2.png")
# Basic line plot with points
ggplot(data = data_frame, aes(x = dose, y = len, group = 1)) + geom_line() + geom_point()
# Change the line type
ggplot(data = data_frame, aes(x = dose, y = len, group = 1)) + geom_line(linetype = "dashed") + geom_point()
# Change the color
ggplot(data = data_frame, aes(x = dose, y = len, group = 1)) + geom_line(color = "red") + geom_point()
dev.off()
Output:
R Scatterplots
Scatter plots are used to compare variables. Such a comparison is needed when
we want to determine how much one variable is affected by another. In a scatterplot, the data
is represented as a collection of points, and each point gives the values of the two
variables: one variable is plotted on the vertical axis and the other on the horizontal axis. In R, there
are two ways of creating a scatterplot: the plot() function and the functions of the ggplot2
package.
Here,
Let's see an example to understand how we can construct a scatterplot using the plot function. In
our example, we will use the dataset "mtcars", which is the predefined dataset available in the R
environment.
Example
Output:
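A minimal sketch of a scatterplot drawn with plot() from mtcars. The choice of the wt and mpg columns, the axis limits, and the added correlation check are assumptions made for illustration.

```r
# Keep only the two columns being compared: weight (wt) and mileage (mpg)
input <- mtcars[, c("wt", "mpg")]

# One point per car: weight on the horizontal axis, mileage on the vertical axis
plot(x = input$wt, y = input$mpg, xlab = "Weight", ylab = "Mileage",
     xlim = c(1.5, 5.5), ylim = c(10, 35), main = "Weight vs mileage")

# The correlation coefficient summarizes the direction of the relationship
r <- cor(input$wt, input$mpg)
print(r)
```

A negative coefficient matches what the plot shows: heavier cars tend to have lower mileage.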
The ggplot2 package provides the ggplot() and geom_point() functions for creating a scatterplot. The
ggplot() function takes a series of inputs: the first is the input data frame, and the
second is the aes() function, in which we specify the x-axis and y-axis variables.
Let's start understanding how the ggplot2 package is used with the help of an example where we
have used the familiar dataset "mtcars".
Example
Output:
We can add more features to make a scatter plot more attractive. Below are some examples
in which different parameters are added.
Introduction:
In random collections of data from independent sources, the distribution of the data is commonly
observed to be normal. This means that if we plot the value of the variable on the horizontal
axis and the count of values on the vertical axis, we get a bell-shaped curve. The center of the curve
represents the mean of the data set: fifty percent of the values lie to the left of
the mean and the other fifty percent to the right. This is referred to as the normal
distribution.
1. x It is a vector of numbers.
2. p It is a vector of probabilities.
3. n It is the number of observations (the sample size).
4. mean It is the mean value of the sample data whose default value is zero.
Let's start understanding how these functions are used with the help of the examples.
dnorm():Density
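The dnorm() function of R calculates the height of the probability density at each point for a
given mean and standard deviation. The probability density of the normal distribution with mean $\mu$ and standard deviation $\sigma$ is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$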
Example
Output:
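A short sketch of dnorm() evaluated over a grid and plotted as a curve. The mean of 2.0, the standard deviation of 0.5, and the grid from -10 to 10 are assumed values chosen for illustration.

```r
# Grid of x values and an assumed mean and standard deviation
x <- seq(-10, 10, by = 0.1)
mu <- 2.0
sigma <- 0.5

# Height of the normal density at each x
y <- dnorm(x, mean = mu, sd = sigma)
plot(x, y, type = "l", main = "Normal density", ylab = "Density")

# The curve peaks at the mean, where the height is 1 / (sigma * sqrt(2 * pi))
peak <- dnorm(mu, mean = mu, sd = sigma)
print(peak)
```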
pnorm():Direct Look-Up
The pnorm() function is also known as the "Cumulative Distribution Function". It
calculates the probability that a normally distributed random number is less than or equal to
a given value. The cumulative distribution is as follows:
$F(x) = P(X \le x)$
Example
Output:
qnorm():Inverse Look-Up
The qnorm() function takes a probability value as input and returns the number whose
cumulative probability matches that value. The cumulative distribution function and the
inverse cumulative distribution function are related by
$p = F(x)$
$x = F^{-1}(p)$
Example
Output:
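The inverse relationship between pnorm() and qnorm() can be checked with a round trip. This is a small sketch for the standard normal distribution; the value 1.96 is an arbitrary test point.

```r
# Cumulative probability below x = 1.96 for the standard normal distribution
p <- pnorm(1.96, mean = 0, sd = 1)
print(p)

# qnorm() inverts pnorm(): feeding p back in recovers the original x
x_back <- qnorm(p, mean = 0, sd = 1)
print(x_back)

# Half of the probability lies below the mean
print(pnorm(0))
```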
rnorm():Random variates
The rnorm() function is used for generating normally distributed random numbers. This function
generates random numbers by taking the sample size as an input. Let's see an example in which
we draw a histogram for showing the distribution of the generated numbers.
Example
Output:
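A minimal sketch of rnorm() followed by a histogram. The sample size, mean, standard deviation, and the fixed seed are all assumptions; the seed is only there so that repeated runs give the same sample.

```r
# Fixing the seed makes the random sample reproducible (an assumption of this sketch)
set.seed(42)

# 1000 draws from a normal distribution with mean 50 and standard deviation 5
y <- rnorm(1000, mean = 50, sd = 5)

# The histogram of the draws approximates the bell-shaped curve
h <- hist(y, main = "Normal distribution", col = "darkorange")

# The sample mean and standard deviation should be close to 50 and 5
print(mean(y))
print(sd(y))
```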
Poisson Distribution:
The Poisson distribution with rate parameter λ is often used to represent the number of events x occurring in a fixed
interval of time or space,
where x=0,1,2,3,….
dpois()
ppois()
qpois()
rpois()
These four functions are dpois (density), ppois (distribution function), qpois (quantile
function), and rpois (random generation). The probability mass function dpois and the cumulative
distribution function ppois are defined on the non-negative integers.
The probability mass function (PMF) of the poisson distribution is given by the formula:
$P(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$
dpois():
The dpois() function computes the Poisson probability mass function, which can then be illustrated in an R plot.
dpois(k, lambda) returns the probability that the random variable takes exactly the value k.
Syntax:
dpois(x, lambda, log = FALSE)
Ex:
lambda <- 2
# Compute the probability mass function (PMF) for a specific value of k
k <- 3
pmf_value <- dpois(k, lambda)
cat("Probability mass function (PMF) for k =", k, ":", pmf_value, "\n")
Output:
Probability mass function (PMF) for k = 3 : 0.180447
ppois():
The ppois() function computes the cumulative distribution function, which can then be illustrated in an R plot.
It calculates the probability that the random variable will be less than or equal to a given
number.
Syntax:
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
Ex:
lambda <- 4
# Calculate the CDF for k
k <- 3
cdf_value <- ppois(k, lambda)
cat("Cumulative distribution function (CDF) for k =", k, "and lambda =", lambda, ":", cdf_value, "\n")
Output:
Cumulative distribution function (CDF) for k = 3 and lambda = 4 : 0.4334701
rpois():
The function rpois() is used for generating random numbers from a given Poisson distribution.
Syntax:
rpois(n, lambda)
n: number of random numbers needed
lambda: mean per interval
Ex:
lam<-2.5
rv<-rpois(10,lam)
print(rv)
Output: [1] 0 5 2 0 3 3 0 0 2 5
qpois():
The function qpois() is used for computing quantiles of a given Poisson distribution.
In probability, quantiles are cut points that divide the range of a probability distribution into
intervals that have equal probabilities.
Syntax:
qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
p: probability (or vector of probabilities) for which the quantile is required
lambda: mean per interval
lower.tail: if TRUE, p is the left-tail probability P(X ≤ x); otherwise it is the right-tail probability P(X > x)
Ex:
lambda <- 2
probability <- 0.7
quantile_value <- qpois(probability, lambda)
cat("Quantile value for probability", probability, "and lambda", lambda, ":", quantile_value, "\n")
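The quantile returned by qpois() can be checked against ppois(): it is the smallest count whose cumulative probability reaches the requested probability. A short sketch using the same lambda and probability as the example above:

```r
lambda <- 2
probability <- 0.7

# Smallest k such that P(X <= k) >= probability
k <- qpois(probability, lambda)
print(k)

# The cumulative probabilities at k and at k - 1 bracket the requested probability
print(ppois(k, lambda))
print(ppois(k - 1, lambda))
```

Because the Poisson distribution is discrete, the cumulative probability at the returned quantile is usually larger than the requested probability rather than equal to it.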
Binomial Distribution
The binomial distribution is a discrete probability distribution that gives the probability of a
number of successes in a series of independent experiments, each of which has only two possible
outcomes. Tossing a coin is the classic example of the binomial distribution: each toss
gives either a head or a tail. The probability of getting exactly three heads when
tossing a coin ten times is calculated with the binomial distribution.
1. x It is a vector of numbers.
2. p It is a vector of probabilities.
3. n It is the number of observations.
Let's start understanding how these functions are used with the help of the examples
Example
Output:
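As a sketch of the density function dbinom(), the coin example from the introduction (exactly three heads in ten tosses) can be computed directly. A fair coin with prob = 0.5 is assumed, and the comparison with choose() is only there to show what the function evaluates.

```r
# Probability of exactly 3 heads in 10 tosses of a fair coin
p3 <- dbinom(3, size = 10, prob = 0.5)
print(p3)

# The same value from the binomial formula: choose(10, 3) * 0.5^3 * 0.5^7
by_hand <- choose(10, 3) * 0.5^10
print(by_hand)
```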
The pbinom() function of R calculates the cumulative probability (a single value representing the
probability) of an event. In simple words, it calculates the cumulative distribution function of the
given binomial distribution.
Example
Output:
The qbinom() function of R takes a probability value and returns the number whose cumulative
probability matches it. In simple words, it calculates the inverse cumulative
distribution function of the binomial distribution.
Let's find the number of heads that has a cumulative probability of 0.45 when a coin is tossed 51 times.
Example
Output:
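A minimal sketch of the 51-toss question, assuming a fair coin (prob = 0.5), with pbinom() used to confirm the result returned by qbinom().

```r
# Smallest number of heads k with P(X <= k) >= 0.45 in 51 tosses of a fair coin
k <- qbinom(0.45, size = 51, prob = 0.5)
print(k)

# pbinom() confirms that k is the first count to reach the target probability
print(pbinom(k, size = 51, prob = 0.5))
print(pbinom(k - 1, size = 51, prob = 0.5))
```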
rbinom()
The rbinom() function of R is used to generate the required number of random values for a given
probability and a given number of trials.
Let's see an example in which we generate nine random values, each the number of successes in 160 trials with a success probability
of 0.5.
Example
Output:
dunif():
The dunif() function is used to compute the density of the uniform distribution at specified points.
The uniform distribution is not as commonly used as other distributions. For a continuous
probability distribution, the density is the value of the probability density function at x, i.e. f(x).
Syntax:
dunif(x, min = 0, max = 1)
x: represents vector
min: lower limit of the distribution (default value is 0 in R)
max: upper limit of the distribution (default value is 1 in R)
Ex:
# The density is 1 / (max - min) inside [min, max] and 0 outside it
xval <- seq(0, 1, by = 0.1)
den_val <- dunif(xval, min = 0, max = 1)
print(den_val)
Output: [1] 1 1 1 1 1 1 1 1 1 1 1
punif():
The punif() function in R is used to compute the cumulative distribution function (CDF) of the
uniform distribution. It calculates the probability that a random observation from a uniform
distribution will be less than or equal to a specific value.
Syntax:
punif(q, min = 0, max = 1, lower.tail = TRUE)
q: the quantile (a numeric vector of values)
min: lower limit of the distribution (default value is 0)
max: upper limit of the distribution (default value is 1)
Ex:
q <- 0.4
minv <- 0.2
maxv <- 0.8
# P(X <= q) = (q - minv) / (maxv - minv)
print(punif(q, min = minv, max = maxv))
Output: [1] 0.3333333
qunif():
The qunif() function in R is used to compute quantiles of the uniform distribution.
Syntax:
qunif(p, min, max)
p: the probability at which to compute the quantile
min: the minimum value of the distribution (lower limit of the interval)
max: the maximum value of the distribution (upper limit of the interval)
The qunif() function returns the quantiles corresponding to the probabilities provided in the p
argument.
Ex:
prob <- c(0.2, 0.5, 0.8)  # probabilities
minv <- 2                 # lower limit of the interval
maxv <- 5                 # upper limit of the interval
# The quantile for probability p is minv + p * (maxv - minv)
print(qunif(prob, min = minv, max = maxv))
Output: [1] 2.6 3.5 4.4
runif():
The runif() function in R is used to generate random numbers from a uniform distribution. It
produces a specified number of random variates within a defined interval.
Syntax:
runif(n, min = 0, max = 1)
n: the number of random values to generate
min: the lower limit of the interval (default = 0)
max: the upper limit of the interval (default = 1)
Ex:
runif(5)
Bernoulli Distribution:
The Bernoulli distribution is a discrete probability distribution that represents the outcome
of a random experiment with two possible outcomes: success and failure. It is named after Jacob
Bernoulli, a Swiss mathematician, and is a special case of the binomial distribution. A Bernoulli
random variable X takes the value 1 with probability p and the value 0 with probability 1-p, where 0≤p≤1.
The mean (expected value) of the Bernoulli random variable is
E[X]=p, and the variance is Var[X]=p(1-p).
In the R programming language, the extraDistr package provides 4 functions for the Bernoulli distribution.
dbern():
The dbern() function in R programming measures the density (probability mass) function of the Bernoulli distribution.
Ex:
library(extraDistr)
prob <- 0.3
xval <- c(0, 1)
den <- dbern(xval, prob)
print(den)
pbern():
The pbern() function in R programming gives the distribution function of the Bernoulli distribution.
The distribution function, also called the cumulative distribution function (CDF),
describes the probability that a variate X takes on a value less than or equal to a number
x.
Ex:
library(extraDistr)
prob <- 0.3
xval <- c(0, 1)
# P(X <= 0) = 1 - prob and P(X <= 1) = 1
cdf <- pbern(xval, prob)
print(cdf)
Output: [1] 0.7 1.0
qbern():
qbern() gives the quantile function for the Bernoulli distribution.
A quantile function in statistical terms specifies the value of the random variable such that
the probability of the variable being less than or equal to that value equals the given probability.
Parameter:
p: vector of probabilities.
prob: probability of success on each trial
lower.tail: logical value
log.p: logical; if TRUE, probabilities p are given as log(p).
Ex:
# Install extraDistr if it is not already available
if (!require(extraDistr)) {
  install.packages("extraDistr")
}
library(extraDistr)
p <- 0.3
# Quantile at probability p; prob is left at its default of 0.5
quant_val <- qbern(p)
print(quant_val)
Output: [1] 0
rbern():
The rbern() function in R programming is used to generate a vector of random numbers that follow a
Bernoulli distribution.
Syntax:
rbern(n, prob)
Parameter:
n: number of observations
prob: probability of success
Ex:
# install.packages("extraDistr")
# Loading the extraDistr package
library(extraDistr)
prob <- 0.4
# The number of observations comes first, then the success probability
samp <- rbern(10, prob)
print(samp)
Output: [1] 0 0 1 0 0 1 1 0 0 0
Student t-distribution:
The Student's t-distribution (or simply the t-distribution) is a probability distribution that arises in the
problem of estimating the mean of a normally distributed population when the sample size is small
and the population standard deviation is unknown. It is also used for constructing confidence
intervals and hypothesis tests on the population mean.
Example
Program to demonstrate the use of dt() along with a chart to visualize the t-distribution:
# Define the range of values
x <- seq(-3, 3, length.out = 100)
# Set the degrees of freedom
df <- 5
# Compute the probability density function for the given values
densities <- dt(x, df)
# Plot the density curve
plot(x, densities, type = "l", main = "t-distribution (df = 5)", xlab = "x", ylab = "Density")
Output:
pt():
pt(q, df) computes the cumulative distribution function (CDF) of the Student's t-distribution
for the given values.
q: Vector of quantiles.
df: Degrees of freedom.
Example
cumulative_prob <- pt(1.5, 10)
print(cumulative_prob)
Output: [1] 0.9177463
Example
Program to demonstrate the use of pt() along with a chart to visualize the t-distribution:
# Set the degrees of freedom
df <- 10
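One way such a program might continue is sketched below. Only the degrees of freedom come from the listing above; the grid from -4 to 4, the plot labels, and the reference lines are assumptions.

```r
# Degrees of freedom, as in the listing above
df <- 10

# Grid of quantiles (the range -4 to 4 is an arbitrary choice)
q <- seq(-4, 4, length.out = 200)

# Cumulative probability P(T <= q) at each grid point
cdf <- pt(q, df)

plot(q, cdf, type = "l", main = "CDF of the t-distribution (df = 10)",
     xlab = "q", ylab = "Cumulative probability")

# The t-distribution is symmetric about 0, so half the probability lies below 0
abline(h = 0.5, v = 0, lty = 2)
```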
qt():
qt(p, df): Computes the quantiles of the Student's t-distribution for the given probabilities.
p: Vector of probabilities.
df: Degrees of freedom.
Example
quantiles <- qt(0.95, 10)
print(quantiles)
Output: [1] 1.812461
Example
Program to demonstrate the use of qt() along with a chart program to visualize the
t-distribution:
Output
rt()
rt(n, df): Generates random deviates from a Student’s t-distribution.
n: Number of observations to generate.
df: Degrees of freedom.
Example
samp <- rt(10, 10)
print(samp)
Output:
[1] -1.22383726  0.94088638 -0.07026774 -1.85703872  0.95586137 -1.11909061
[7] -0.02851959  0.72807861  0.46058078 -0.16748534
Example
Program to demonstrate the use of rt() along with a chart to visualize the t-distribution:
# Generating random numbers from a t-distribution
random_t <- rt(n = 100, df = 5)
# Creating a histogram
hist(random_t, col = "skyblue", main = "Histogram of Random Numbers from t-Distribution", xlab = "Value")
Output: