Unit 3
DESCRIPTIVE STATISTICS
Descriptive statistics is the branch of statistics that describes and
summarizes data in its present state. It makes the data easier to understand
and gives us the knowledge about the data that is necessary to perform further
analysis. Measures of central tendency such as the mean, median and mode are
good examples of descriptive statistics.
The R programming language provides many simple yet effective functions for
performing descriptive statistics and learning more about our data.
Summarizing the data, calculating average measures, finding cumulative
measures and summarizing the rows or columns of data structures are all
possible with short commands.
R provides two very simple functions that can instantly summarize our data
for us.
The str() function takes a single object as an argument and compactly shows
us the structure of the input object. It shows us details like length, data type,
names and other specifics about the components of the object.
The summary() function also takes a single object as an argument. For each
numeric component or variable in the object, it returns summary measures:
the minimum, 1st quartile, median, mean, 3rd quartile and maximum.
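The two functions can be tried on any data frame. The sketch below is a minimal illustration that assumes the built-in trees dataset (the same dataset used later in these notes); the variable names tree_summary and height_summary are chosen here only for the example.

```r
# Built-in dataset: girth, height and volume of 31 black cherry trees
data(trees)

# Compact structure: dimensions, column names, types and first few values
str(trees)

# Per-column minimum, quartiles, median, mean and maximum
tree_summary <- summary(trees)
print(tree_summary)

# On a single numeric vector, summary() returns six named values
height_summary <- summary(trees$Height)
print(names(height_summary))
```

str() is usually the first call made on a new object, because it shows at a glance which columns are numeric and therefore meaningful to pass to summary().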
AVERAGE MEASURES
R provides a number of functions that give us different average measures for
given data. These average measures include:
Mean: The mean of a set of numeric or logical values (a vector, or a row or
column of another data structure) can be found using the mean() function.
Median: Finding the median of a set of numeric or logical values is also very
easy by using the median() function.
Standard deviation: The standard deviation of a set of numerical values can
be found using the sd() function.
Variance: The var() function gives us the variance of a set of numeric or
logical values.
Maximum: In a given set of numeric or logical values, we can use
the max() function to find the largest value in the set.
Note: If the set contains NA, max() returns NA unless its na.rm argument
is set to TRUE.
Minimum: The min() function is a handy way to find the smallest value in a
set of numeric values.
Note: Like max(), min() returns NA if the set contains NA, unless na.rm is
set to TRUE.
Sum: The sum of a set of numerical values can be found by simply using
the sum() function.
Length: The length or the number of values in a set is given by
the length() function.
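The functions listed above can be sketched together as follows. The vectors x and y hold hypothetical sample values chosen only for illustration.

```r
# Hypothetical sample values, chosen only for illustration
x <- c(4, 8, 15, 16, 23, 42)

mean(x)      # arithmetic mean: 18
median(x)    # middle value: 15.5
sd(x)        # sample standard deviation (divides by n - 1)
var(x)       # sample variance, equal to sd(x)^2
max(x)       # 42
min(x)       # 4
sum(x)       # 108
length(x)    # 6

# A missing value propagates to the result unless na.rm = TRUE is supplied
y <- c(x, NA)
max(y)                 # NA
max(y, na.rm = TRUE)   # 42
mean(y, na.rm = TRUE)  # 18

# Logical values count as 0/1, so mean() gives the proportion of TRUE
mean(c(TRUE, FALSE, TRUE, TRUE))  # 0.75
```

Note that length(y) still counts the NA, so it returns 7 rather than 6.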
CUMULATIVE MEASURES IN R
Cumulative measures are statistical measures that are calculated sequentially:
each element of the result summarizes the data up to that position. They
provide insight into the progression and growth of the data. R provides the
following functions for calculating cumulative measures:
Cumulative sum: The cumsum() function calculates the cumulative sum of a
given vector.
Cumulative max: To find the cumulative maximum value of an input vector,
you can use the cummax() function.
Cumulative min: You can find the cumulative minimum values in a vector by
using the cummin() function.
Cumulative product: Using the cumprod() function, you can find the
cumulative product of a vector.
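A short sketch of the four cumulative functions, applied to a hypothetical vector v chosen only for illustration. Each result has the same length as the input, and element i summarizes the first i values.

```r
# Hypothetical input vector, chosen only for illustration
v <- c(3, 1, 4, 1, 5)

cumsum(v)   # running total:   3  4  8  9 14
cummax(v)   # running maximum: 3  3  4  4  5
cummin(v)   # running minimum: 3  1  1  1  1
cumprod(v)  # running product: 3  3 12 12 60
```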
VISUALISATIONS
Graphs are a widely used feature of R for creating many types of charts and
visualizations. R offers a rich set of packages and functions for plotting an
input data set during data analysis. The most commonly used graphs in R are
scatter plots, box plots, line graphs, pie charts, histograms and bar charts.
R supports both two-dimensional and three-dimensional plots for exploratory
data analysis. Base R functions such as plot(), barplot() and pie() are used
to draw graphs, and packages such as ggplot2 support more advanced graphics.
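As a rough sketch of the base functions named above, the following draws a bar chart, a pie chart and a line graph. The counts vector and all titles are hypothetical values introduced only for this example.

```r
# Hypothetical counts per group, used only for illustration
counts <- c(A = 12, B = 7, C = 21)

# Bar chart: barplot() returns the x positions of the bar midpoints
mids <- barplot(counts, main = "Counts by group",
                xlab = "Group", ylab = "Count")

# Pie chart of the same counts
pie(counts, main = "Share by group")

# Line graph: type = "l" joins the points with lines
plot(1:10, (1:10)^2, type = "l",
     main = "Line graph", xlab = "x", ylab = "x squared")
```

Each call opens or reuses the current graphics device, so running the three calls in sequence replaces the previous plot unless the device is split into panels first.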
TYPES OF GRAPHS IN R
Histogram: A histogram is a graphical tool that works on a single variable.
The values of the variable are grouped into bins, and the number of values
falling in each bin, termed the frequency, is counted. These counts are then
used to plot one frequency bar per bin. The height of each bar represents the
frequency.
In R, we can employ the hist() function to generate a histogram, for example
a simple histogram of the tree heights in the built-in trees dataset.
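A histogram of the tree heights can be sketched as below. This is a minimal example, assuming the built-in trees dataset; the title, axis label and colour are illustrative choices rather than required settings.

```r
# Histogram of tree heights from the built-in trees dataset
h <- hist(trees$Height,
          main = "Histogram of tree heights",
          xlab = "Height (ft)",
          col = "lightblue")

# hist() invisibly returns the bin boundaries and the count in each bin
h$breaks
h$counts
```

The number of bins is chosen automatically by default; the breaks argument of hist() can be used to suggest a different binning.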
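SCATTERPLOT
A scatterplot shows the relationship between two variables. The code below
plots girth against height and against volume for the trees dataset and adds
a fitted regression line to each plot.
# Girth vs. height, with a least-squares regression line
plot(Height ~ Girth, data = trees, main = "Scatterplot of Girth vs Height", xlab = "Tree Girth", ylab = "Tree Height")
abline(lm(Height ~ Girth, data = trees), col = "blue", lwd = 2)
# Girth vs. volume, with a least-squares regression line
plot(Volume ~ Girth, data = trees, main = "Scatterplot of Girth vs Volume", xlab = "Tree Girth", ylab = "Tree Volume")
abline(lm(Volume ~ Girth, data = trees), col = "blue", lwd = 2)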
SCATTERPLOT MATRICES
R lets us compare several variables at once using scatterplot
matrices. Implementing the visualization is quite simple and can
be achieved with the pairs() function, as shown below.
pairs(trees, main = "Scatterplot matrix for trees dataset")
BOXPLOT
A boxplot is a way of visualizing data through boxes and whiskers. First, the
variable values are sorted in ascending order, and then the data is divided
into quarters.
The box in the plot spans the middle 50% of the data, a range known as the
interquartile range (IQR). The black line inside the box represents the median.
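The description above can be sketched with the boxplot() function. The example assumes the built-in trees dataset, and the title and axis label are illustrative choices.

```r
# Boxplot of tree heights; boxplot() invisibly returns the plotted statistics
b <- boxplot(trees$Height,
             main = "Boxplot of tree heights",
             ylab = "Height (ft)")

# Five values drawn: lower whisker, lower hinge, median,
# upper hinge, upper whisker
b$stats

# The interquartile range is the distance between the 1st and 3rd quartiles
IQR(trees$Height)
```

The whiskers extend to the most extreme values that lie within 1.5 times the box length of the box by default; any points beyond them are drawn individually.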