Exploratory Data Analysis - NOTES
To detect mistakes.
NB: Before we start EDA in R, we will look at some important concepts that will help us in
Why use R
i. Statistical computing
R Basics Sessions
R and R-Studio
(IDE) that provides features to make using and managing R much easier.
1. Getting help in R
To get help on specific topics, we can use the help() function along with the topic we want to
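As a brief illustration of the help system, the calls below look up the documentation for mean( ). The choice of mean as the topic is only an example; any function name can be used in its place.

```r
# Open the documentation page for the mean() function
help("mean")

# The question mark is shorthand for the same lookup
?mean

# Show the arguments a function accepts without opening the full page
args(mean)
```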
2. Operations in R.
3. !, $ - Logical Operators
4. ~ - Model Formulae
6. : - Creating Sequence
a. Arithmetic Operators
Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
Operator Description
== exactly equal to
!= not equal to
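The operators in the two tables above can be tried directly at the console. The short sketch below uses the made-up values 7 and 2 purely for illustration.

```r
# Arithmetic operators
7 + 2    # addition: 9
7 - 2    # subtraction: 5
7 * 2    # multiplication: 14
7 / 2    # division: 3.5
7 ^ 2    # exponentiation: 49
7 ** 2   # alternative spelling of ^, also 49

# Comparison operators return TRUE or FALSE
a <- 7
b <- 2
a == b   # exactly equal to: FALSE
a != b   # not equal to: TRUE
```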
We will now start looking at exploratory data analysis in R.
1. Measures of Location.
i) Measures of Central Tendency
Measures that indicate the approximate center of a distribution are called
Central tendency tells about how the group of data is clustered around the
Arithmetic Mean
Geometric Mean
Harmonic Mean
Arithmetic Mean
The arithmetic mean, often simply called the average, represents the central
value of the data distribution. It is calculated by adding all the values and then dividing by the
number of values.
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
# Print mean
y <- mean(x)
print(y)
Press Ctrl + R (or Ctrl + Enter in RStudio) to run the code, or click Run at the top of the script editor.
[1] 21.5
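# The same mean computed step by step
W <- sum(x)             # total of all the values
MeanX <- sum(x) / 14    # divide by the known number of values
n <- length(x)          # let R count the values instead
XMean <- sum(x) / n
Xmean <- W / n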
Let’s begin by looking at a simple example with a dataset that comes pre-loaded in your
version of R, called cars by Ezekiel (1930). These data give the speed of cars and the
If we were to compute the mean for cars$speed (that is, the variable speed in our dataset called cars),
we would simply sum the values in the column for speed and divide by 50, the number of observations.
(4 + 4 + 7 + 7 + 8 + 9 + 10 + 10 + 10 + 11 + 11 + 12 + 12 + 12 + 12 + 13 + 13 + 13 + 13 + 14
+ 14 + 14 + 14 + 15 + 15 + 15 + 16 + 16 + 17 + 17 + 17 + 18 + 18 + 18 + 18 + 19 + 19 + 19
+ 20 + 20 + 20 + 20 + 20 + 22 + 23 + 24 + 24 + 24 + 24 + 25) / 50
These data are too many to add up by hand. To work with a large data set that is
pre-loaded in R, we can let R do the arithmetic:
View(cars)
sumofspeed <- sum(cars$speed)
sumofspeed / 50
## [1] 15.4
sum(cars$speed) / length(cars$speed)
## [1] 15.4
mean(cars$speed)
## [1] 15.4
Computing the mean for the cars data worked out nicely because there were no missing
values (NAs). If there were NAs, we could omit them from the calculation by setting na.rm = TRUE. For
example:
mean(cars$speed, na.rm=TRUE)
## [1] 15.4
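Because cars has no missing values, the effect of na.rm is easier to see on a small vector that does contain one. The vector speeds below is invented for this illustration and is not part of the cars data.

```r
# Hypothetical speeds with one missing value (not taken from cars)
speeds <- c(4, 7, NA, 10, 12)

mean(speeds)                 # NA, because one value is missing
mean(speeds, na.rm = TRUE)   # 8.25, the mean of the four observed values

# Count how many values are missing
sum(is.na(speeds))           # 1
```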
While the mean is not a new concept to you, there’s some notation that is important for you
to understand.
n. Used to refer to the sample size. The number of sample of observations (rows) that we are
1. Geometric Mean
2. Harmonic Mean
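Base R does not appear to provide dedicated functions for these two means, so the sketch below writes them out from their standard definitions. The names geometric_mean and harmonic_mean are made up for this illustration, and both functions assume that every value is strictly positive.

```r
# Geometric mean: the n-th root of the product, computed on the log scale
geometric_mean <- function(x) exp(mean(log(x)))

# Harmonic mean: the number of values divided by the sum of their reciprocals
harmonic_mean <- function(x) length(x) / sum(1 / x)

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)

harmonic_mean(x)
geometric_mean(x)
mean(x)   # for positive data: harmonic <= geometric <= arithmetic
```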
The median
The median is another measure of central tendency. The middle value in a set of observations
is the median. For cars$speed, we can sort our variables in ascending order using the sort( )
Example 1
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
# Print median
print(median(x))
# output
## [1] 21.5
Example 2
sort(cars$speed)
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
In this case there are 50 observations, so the middle falls between positions 25 and 26. Both values
are 15, so the median is 15. If the value at position 25 were 14 and the value at position 26 were 15,
we would take the average of the two values, giving 14.5.
median(cars$speed)
## [1] 15
The Mode
The mode of a set of observations is the value that occurs most frequently. R has no
standard function that computes the statistical mode (the built-in mode( ) function reports how an
object is stored instead). However, you can create a simple frequency table with table( ) and pick
out the value with the highest count:
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8)
y <- table(x)
# Mode of x
m <- names(y)[y == max(y)]
# Print the frequency table, then the mode
print(y)
print(m)
1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56
1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 2
[1] "23"
# Defining vector
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8, 56, 56)
y <- table(x)
# Mode of x
m <- names(y)[y == max(y)]
# Print the frequency table, then the mode
print(y)
print(m)
1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56
1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 4
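For cars$speed, the frequency table is:
modeforcars <- table(cars$speed)
modeforcars
## 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
## 2 2 1 1 3 2 4 4 4 3 2 3 4 3 5 1 1 4 1
You can then extract the mode by selecting the name of the entry with the largest count:
names(modeforcars)[modeforcars == max(modeforcars)]
## [1] "20"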
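The frequency-table approach can be wrapped in a small reusable function. This is a minimal sketch: stat_mode is a hypothetical helper name, not a built-in R function, and it returns every value that ties for the highest count.

```r
# stat_mode is a made-up helper name; base R has no such function
stat_mode <- function(x) {
  counts <- table(x)
  modes <- names(counts)[counts == max(counts)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(cars$speed)                # 20
stat_mode(c(3, 23, 23, 56, 56, 8))   # 23 56 (two values tie, so both are returned)
```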
Exercise 2
1. Find the mean, median and mode using the mtcars dataset pre-loaded in R.
Quartiles, deciles and percentiles divide a sorted data set into four, ten and one hundred equal parts, respectively.
a) Quartiles
There are several quartiles of an observation variable. The first quartile, or lower
quartile, is the value that cuts off the first 25% of the data when it is sorted in
ascending order. The second quartile, or median, is the value that cuts off the first
50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.
Find the quartiles of the eruption durations in the data set faithful.
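One way to approach this task is with the quantile( ) function. The sketch below assumes the default quantile method used by R; the variable names duration and q are chosen only for this example.

```r
# Eruption durations (in minutes) from the built-in faithful data set
duration <- faithful$eruptions

# With no probs argument, quantile() returns the minimum,
# the three quartiles and the maximum
quantile(duration)

# Only the three quartiles
q <- quantile(duration, probs = c(0.25, 0.50, 0.75))
q
```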
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and
b) Deciles
In statistics, deciles are numbers that split a dataset into ten groups of equal
frequency. The first decile is the point where 10% of all data values lie below it. The
second decile is the point where 20% of all data values lie below it, and so on. We can
Calculate Deciles in R
The following code shows how to create a dataset with 20 values and then calculate
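#create dataset
data <- c(56, 58, 64, 67, 68, 73, 78, 83, 84, 88, 89, 90, 91, 92, 93, 93, 94, 95, 97, 99)
#calculate the deciles (10th, 20th, ..., 90th percentiles) of the dataset
quantile(data, probs = seq(0.1, 0.9, by = 0.1))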
c) Percentiles
The nth percentile of a dataset is the value that cuts off the first n percent of the data values
when the values are sorted in ascending order.
For example, the 90th percentile of a dataset is the value that cuts off the bottom 90% of the
data values from the top 10%.
One of the most commonly used percentiles is the 50th percentile, which represents the
median value of a dataset: this is the value at which 50% of all data values fall below.
What score does a student need to earn on a particular test to be in the top 10% of
scores? To answer this, we would find the 90th percentile of all scores, which is the
value that separates the bottom 90% of values from the top 10%.
What heights encompass the middle 50% of heights for students at a particular
school? To answer this, we would find the 75th percentile of heights and 25th
percentile of heights, which are the two values that determine the upper and lower
To Calculate Percentiles in R
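We can easily calculate percentiles in R using the quantile() function, which uses the
following syntax:
quantile(x, probs = seq(0, 1, 0.25))
where x is a numeric vector and probs is a numeric vector of probabilities between 0 and 1.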
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
We apply the quantile function to compute the percentiles of eruptions with the desired
percentage ratios.
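A minimal sketch of that call is shown below. The percentages are written as proportions (0.32 for the 32nd percentile, and so on); the variable names duration and p are chosen only for this example.

```r
# Eruption durations (in minutes) from the built-in faithful data set
duration <- faithful$eruptions

# 32nd, 57th and 98th percentiles, given as proportions
p <- quantile(duration, probs = c(0.32, 0.57, 0.98))
p
```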
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330
minutes respectively.
1. Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in
2. Measures of Spread.
Spread is the degree of scatter or variation of the variable about the central value.
i) The range
ii) Inter-Quartile range
iii) Quartile Deviation also called semi Inter-Quartile range
iv) Mean Absolute Deviation
v) Variance
vi) Standard deviation
In addition to computing measures of central tendency, another summary statistic we’d like to
compute is variability. How spread out are the data? How far from the mean and median do
The range of a variable is the largest value minus the smallest value. We can compute the
largest value using the max( ) function and the smallest value using the min( ) function. In the
min(cars$speed)
## [1] 4
max(cars$speed)
## [1] 25
R has an even better function, range( ) that outputs the minimum and maximum value in a
range(cars$speed)
## [1] 4 25
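Note that range( ) returns the two end points rather than their difference. To obtain the range as a single number (largest value minus smallest value), the end points can be subtracted, as in this short sketch.

```r
speed_limits <- range(cars$speed)    # the two end points: 4 and 25
diff(speed_limits)                   # 21

# The same result from max() and min() directly
max(cars$speed) - min(cars$speed)    # 21
```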
Interquartile range
The interquartile range is similar to the range, but instead of calculating the difference
between the biggest and smallest value, you calculate the difference between the 25th
and 75th percentiles.
We can calculate the interquartile range (IQR) using IQR( ). This is the range spanned by the
middle half of the data, that is, the 75th percentile minus the 25th percentile.
IQR(cars$speed)
## [1] 7
quantile(cars$speed)
## 0% 25% 50% 75% 100%
## 4 12 15 19 25
quantile(cars$speed, probs = c(0.25, 0.75))
## 25% 75%
## 12 19
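The list of spread measures also names the quartile deviation and the mean absolute deviation. The sketch below computes both from their usual definitions. Note that R's built-in mad( ) function returns the median absolute deviation, which is a different statistic, so the mean absolute deviation is written out by hand here.

```r
speed <- cars$speed

# Quartile deviation (semi-interquartile range): half of the IQR
quartile_deviation <- IQR(speed) / 2
quartile_deviation   # 3.5

# Mean absolute deviation about the mean
mean_abs_dev <- mean(abs(speed - mean(speed)))
mean_abs_dev
```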
The variance is a numerical measure of how the data values are dispersed around the mean.
The variance measures how far a set of numbers are spread out. (A variance of zero indicates
that all the values are identical.) A non-zero variance is always positive: A small variance
indicates that the data points tend to be very close to the mean (expected value). A high
variance indicates that the data points are very spread out from the mean and from each other.
The variance of a dataset X is sometimes written as Var(X) but more commonly denoted as
$s^2$ for a given sample. The formula for the sample variance is:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
var(cars$speed)
## [1] 27.95918
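To connect the sample variance to its definition, the sketch below computes it step by step (squared deviations from the mean, summed and divided by n - 1) and compares the result with var( ).

```r
speed <- cars$speed
n <- length(speed)

# Deviation of each observation from the mean
deviations <- speed - mean(speed)

# Sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (n - 1)
manual_var

# The built-in function gives the same value
all.equal(manual_var, var(speed))   # TRUE
```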
Standard deviation
The square root of the variance is the standard deviation. Below is the formula for the sample
standard deviation.
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
sqrt(var(cars$speed))
## [1] 5.287644
sd(cars$speed)
## [1] 5.287644
Skew and kurtosis are two more descriptive statistics that you may encounter.
Skewness is a measure of symmetry. If there are more extremely large values than extremely
small ones, the data can be described as positively skewed. If the data tend to have a lot of
extreme small values and not many extremely large values then the data is considered
negatively skewed. As a rule, negative skewness indicates that the mean of the data values is
less than the median, and the data distribution is left-skewed. Positive skewness would
indicate that the mean of the data values is larger than the median, and the data distribution is
right-skewed.
Figure: From left to right: Positive skew, no skew, and negative skew
We can compute the skew by using a function called skew( ) from the psych package.
library(psych)
skew(cars$speed)
## [1] -0.1105533
Kurtosis is a measure of the pointiness (peakedness) of the data distribution. It also tells us
how fat or thin the tails of a distribution are relative to a normal distribution. Negative
kurtosis indicates a flat data distribution, which is said to be platykurtic. Positive kurtosis
indicates a peaked distribution, which is said to be leptokurtic. Incidentally, the normal
distribution has zero excess kurtosis, and is said to be mesokurtic.
We can compute the kurtosis by using a function called kurtosi( ) from the psych package.
kurtosi(cars$speed)
## [1] -0.6730924
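For readers without the psych package, skewness and kurtosis can also be sketched in base R from the central moments. The function names moment_skewness and excess_kurtosis are made up for this illustration. Several conventions exist for small-sample adjustments, so these values need not match the psych output digit for digit, although the signs should agree.

```r
# Moment-based skewness: third central moment over the
# second central moment raised to the power 1.5
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared
# second central moment, minus 3 (the value for a normal distribution)
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

moment_skewness(cars$speed)         # negative: slight left skew
excess_kurtosis(cars$speed)         # negative: flatter than a normal distribution
moment_skewness(c(1, 2, 3, 4, 5))   # 0 for a perfectly symmetric sample
```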
Where do you think cars$speed falls? Let’s plot it. See the figure below.
There’s an easier way to compute some measures of central tendency and variability using
the summary( ) function. The summary function provides the min( ), max( ), median( ),
mean( ), the 75% and 25% quantiles. To compute all these measures for a single variable
## speed dist
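As a short sketch, summary( ) can be applied either to one variable or to the whole data frame. Individual statistics can be pulled out of the result by name; the variable name s is chosen only for this example.

```r
# Summary of a single variable
s <- summary(cars$speed)
s

# Individual entries can be extracted by name
s[["Mean"]]      # 15.4
s[["Median"]]    # 15

# Summary of every column in the data frame
summary(cars)
```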
A similar function to the summary( ) function is the describe( ) function in the psych
package. This function is useful when your data are interval or ratio scale. Unlike the
summary( ) function, it calculates the descriptive statistics for any type of variable you
give it. It also includes other measures that we discussed earlier, such as the trimmed mean
(default is 10%), skew, kurtosis, and range. n is the sample size (or the number of non-missing
values).
describe(cars)
There are more advanced functions to compute descriptive statistics by group using the psych
package. One such function is describeBy( ). You can specify a grouping variable. Let’s say
we wanted to obtain descriptive statistics separately for each grouping of data. For example,
we could group our data by the different speeds in cars. We could use speed as our grouping
variable as follows:
describeBy(cars, group=cars$speed)
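Grouped statistics can also be sketched in base R, without the psych package. The example below is an alternative to describeBy( ), not a reproduction of its output: it computes only the mean stopping distance and the number of observations for each speed.

```r
# Mean stopping distance for each distinct speed
group_means <- tapply(cars$dist, cars$speed, mean)
head(group_means)

# Number of observations in each speed group
group_sizes <- table(cars$speed)
head(group_sizes)

# The same group means as a data frame, using a formula
aggregate(dist ~ speed, data = cars, FUN = mean)
```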
Bivariate Data
So far we have confined our discussion to distributions involving only one variable.
Sometimes, in practical applications, we come across a set of data in which each
item comprises the values of two or more variables. With two variables, the data form n pairs:
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
1. Scatter Diagrams.
2. Correlation.
3. Regression.
1. Scatter Diagrams.
A scatter diagram is a tool for analysing relationships between two variables. One variable is
plotted on the horizontal axis and the other is plotted on the vertical axis. The pattern of their
intersecting points can graphically show relationship patterns. Most often a scatter diagram is
i) The basic function is plot(x, y), where x and y are numeric vectors denoting the
# Simple Scatterplot
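The data used for the original plot are not shown here, so the sketch below assumes the cars data set from earlier in these notes, with speed on the horizontal axis and stopping distance on the vertical axis. The title and axis labels are illustrative choices.

```r
# Scatter diagram of stopping distance against speed
plot(cars$speed, cars$dist,
     main = "Stopping distance against speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     pch = 19)   # pch = 19 draws solid circles
```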
Plot 2
Scatter diagrams will generally show one of six possible correlations between the
ii) Strong Negative Correlation The value of Y clearly decreases as the value of X
iii) Weak Positive Correlation The value of Y increases slightly as the value of X
iv) Weak Negative Correlation The value of Y decreases slightly as the value of X
v) Complex Correlation The value of Y seems to be related to the value of X, but the
2. Correlation
Correlation is a statistical method to measure the relationship between the two quantitative
The correlation coefficient (r) measures the strength and direction of the (linear) relationship
between the two quantitative variables. r can range from +1 (perfect positive correlation) to
-1 (perfect negative correlation).
Positive values of r indicate a positive relationship, and negative values a negative one. The higher the
absolute value of r, the stronger the correlation. If the value of r is 0, it indicates that there
is no linear relationship between the two variables.
The below table suggests the interpretation of r at different absolute values. These cut-offs
are arbitrary and should be used judiciously while interpreting the dataset.
Note: a) Unless otherwise specified, a correlation coefficient usually refers to Pearson’s r.
on underlying data types, sample size, linear or non-linear relationships between the
Pearson correlation (r) measures a linear dependence between two variables (x and y). It’s
also known as a parametric correlation test because it depends on the distribution of the data.
cor.test() test for association/correlation between paired samples. It returns both the
Example 1
# correlation of vectors in R
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)
cor(x, y)
Example 2
x <- c(0,1,1,2,3,5,8,13,21,34)
y <- log(x+1)
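The calls that go with these vectors are not shown, so the sketch below is one plausible way to analyse them: cor( ) for the coefficient itself and cor.test( ) for the accompanying significance test. The Spearman line is an added illustration of a rank-based alternative.

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)

# Pearson correlation (the default method)
r <- cor(x, y)
r

# Spearman rank correlation: equals 1 here because y rises whenever x rises
rho <- cor(x, y, method = "spearman")
rho

# cor.test() returns the estimate together with a p-value
result <- cor.test(x, y)
result$estimate
result$p.value
```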
3. Regression
Regression analysis, in general sense, means the estimation or prediction of the unknown
value of one variable from the known value of the other variable.
Regression analysis can be thought of as being sort of like the flip side of correlation.
It has to do with finding the equation for the kind of straight lines you were just looking at
Suppose we have a sample of size n and it has two sets of measures, denoted by x and y. We
can predict the values of y given the values of x by using the equation $y = a + bx$.
Not every problem can be solved with the same algorithm. In this case, linear regression
assumes that there exists a linear relationship between the response variable and the
explanatory variables. This means that you can fit a line between the two (or more) variables.
In this particular example, you can calculate the height of a child if you know her age:
$\text{Height} = a + b \times \text{Age}$
In this case, “a” and “b” are called the intercept and the slope respectively. With the same
example, “a” or the intercept, is the value from where you start measuring. Newborn babies
with zero months are not zero centimeters necessarily; this is the function of the intercept.
The slope measures the change of height with respect to the age in months. In general, for
every month older the child is, his or her height will increase by “b”.
Linear Regression in R
Y = mX + c, where
m = slope
c = Y-intercept
Linear Models
Since simple linear regression requires just one target, let’s take the “Sepal.Length” attribute as our
target(Y) and “Sepal.Width” attribute as Predictor(X) to find if there exists any kind of
Example 1
Example 2
model1<- lm(Y~X)
Y = 3.41895 – 0.06188X
Example 3
Model2 <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(Model2)
R-squared: 0.9271 × 100 = 92.71%, implying that 92.71% of the variability of Y has been explained by X.
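As a minimal sketch of this last model, the code below fits Petal.Width on Petal.Length using the built-in iris data, then extracts the coefficients and R-squared directly instead of reading them off the printed summary. The object names model, r_squared and new_flowers are chosen for this example, and the petal lengths used for prediction are invented.

```r
# Fit a simple linear regression of petal width on petal length
model <- lm(Petal.Width ~ Petal.Length, data = iris)

# Intercept and slope of the fitted line
coef(model)

# Proportion of variability in Petal.Width explained by Petal.Length
r_squared <- summary(model)$r.squared
round(r_squared, 4)   # 0.9271

# Predict petal width for two hypothetical petal lengths
new_flowers <- data.frame(Petal.Length = c(1.5, 4.5))
predict(model, newdata = new_flowers)
```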