
Exploratory Data Analysis - NOTES


Exploratory Data Analysis

 This is the first step in analysing data from an experiment.

 Here, we do a descriptive statistics analysis of the data.

 Some of the main reasons why we do EDA are:

 To detect mistakes.

 To determine the relationship between variables.

 To determine the main characteristics or features of the data.

NB: Before we start EDA in R, we will look at some important concepts that will help us in

handling this unit.

###################################################

Things to Know Before Start Learning R

Why use R

• R is an open source programming language and software environment for

statistical computing and graphics.

• R is an object-oriented programming environment, much more so than most other

statistical software packages.

• R is a comprehensive statistical platform, offering all manner of data-analytic

techniques – virtually any type of data analysis can be done in R.

• R has state-of-the-art graphics capabilities – you can visualize complex data.

• R is a powerful platform for interactive data analysis and exploration.

• R makes it easy to get data into a usable form from multiple sources.

• R functionality can be integrated into applications written in other languages,

including C++, Java, Python, PHP, SAS and SPSS.

• R runs on a wide array of platforms, including Windows, Unix and Mac OS X.

• R is extensible; can be expanded by installing “packages”



Applications of R Programming in Real World

i. Statistical computing

ii. Data Science

iii. Machine Learning

Downloading and Installing R

########################################################

R Basics Sessions

R and R-Studio

R has a basic graphical user interface (GUI). RStudio is an Integrated Development Environment

(IDE) that provides features that make using and managing R much easier.

 Looking at the R window and the RStudio window with simple examples.

1. Getting help in R

To get help on specific topics, we can use the help() function along with the topic we want to

search. We can also use the ? operator for this. Example:

help(Syntax)

?Syntax

2. Operations in R.

R uses the following operators:

1. +, -, *, /, %%, ^ - Arithmetic Operators

2. >, >=, <, <=, ==, != - Relational Operators

3. !, &, | - Logical Operators

4. ~ - Model Formulae

5. <-, = - Assignment Operators

6. : - Creating Sequences

a. Arithmetic Operators

Operator Description

+ addition

- subtraction

* multiplication

/ division

^ or ** exponentiation

b. Relational Operators include:

Operator Description

> greater than

>= greater than or equal to

== exactly equal to

!= not equal to
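A minimal sketch of these operators in action (the values are chosen arbitrarily for illustration):

```r
# Arithmetic operators
5 + 2    # 7
5 %% 2   # 1 (modulus: remainder of 5 divided by 2)
5 ^ 2    # 25 (exponentiation; 5 ** 2 is equivalent)

# Relational operators return logical values
5 > 2    # TRUE
5 == 2   # FALSE
5 != 2   # TRUE

# Logical operators combine logical values
TRUE & FALSE   # FALSE (and)
TRUE | FALSE   # TRUE  (or)
!TRUE          # FALSE (not)

# Assignment and sequence creation
v <- 1:5       # v now holds the sequence 1 2 3 4 5
```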

EXPLORATORY DATA ANALYSIS
We will now start looking at exploratory data analysis in R.

1. Measures of Location.
i) Measures of Central Tendency
Measures that indicate the approximate center of a distribution are called

measures of central tendency.

Central tendency tells about how the group of data is clustered around the

centre value of the distribution.

Here we will look at the:

 Arithmetic Mean

 Geometric Mean

 Harmonic Mean

 Median

 Mode

Arithmetic Mean

The arithmetic mean, commonly called the average, represents the central value of the data

distribution. It is calculated by adding all the values and then dividing by the

total number of observations.

Formula: mean(x) = (x1 + x2 + ... + xn) / n

In R language, arithmetic mean can be calculated by mean() function.

Example:

# defining vector

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)

# Print mean

mean(x)

Or

y = mean(x)

or

print(mean(x))

Press Ctrl + R (Ctrl + Enter in RStudio) to get the output, or click Run at the top of the script editor.

Output:

[1] 21.5

NB: You can also calculate the mean using:

W = sum(x)

MeanX = sum(x)/14

Or

n = length(x)

XMean = sum(x)/n

Or

Xmean = W/n
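A quick check confirms that all of these expressions give the same value:

```r
x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)
W <- sum(x)       # 301
n <- length(x)    # 14

mean(x)   # 21.5
W / n     # 21.5, the same value
```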

Given a Large Data set:

Let’s begin by looking at a simple example with a dataset that comes pre-loaded in your

version of R, called cars by Ezekiel (1930). These data give the speed of cars and the

distances taken to stop.

If we were to compute the mean for cars$speed (the variable speed in our dataset called cars),

we would simply sum the values in the column for speed and divide by 50.

(4 + 4 + 7 + 7 + 8 + 9 + 10 + 10 + 10 + 11 + 11 + 12 + 12 + 12 + 12 + 13 + 13 + 13 + 13 + 14

+ 14 + 14 + 14 + 15 + 15 + 15 + 16 + 16 + 17 + 17 + 17 + 18 + 18 + 18 + 18 + 19 + 19 + 19

+ 20 + 20 + 20 + 20 + 20 + 22 + 23 + 24 + 24 + 24 + 24 + 25) / 50

Or quite simply: 770/50 = 15.4.

But for large data sets it is impractical to calculate the mean by hand. To work with a large data set that is

pre-loaded in R, we:

View the data:

View(cars)

or

cars

In R, we can compute the mean in several ways:

sumofspeed <- sum(cars$speed)

sumofspeed / 50

## [1] 15.4

or

sum(cars$speed) / length(cars$speed)

## [1] 15.4

or simply using the mean( ) function

mean(cars$speed)

## [1] 15.4

N/B:

Computing the mean for the cars data worked out nicely because there were no missing

values or NAs. If there were NAs we would be able to omit those from our calculations. For

example,

mean(cars$speed, na.rm=TRUE)

## [1] 15.4
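To see the effect of na.rm, here is a small sketch with a made-up vector containing a missing value:

```r
z <- c(4, 8, NA, 12)   # hypothetical data with one missing value
mean(z)                # NA: any NA makes the plain mean undefined
mean(z, na.rm = TRUE)  # 8: the NA is dropped before averaging
```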

While the mean is not a new concept to you, there’s some notation that is important for you

to understand.

n. Used to refer to the sample size: the number of observations (rows) that we are

averaging. In the above example n = 50.

x. Used to refer to the sample elements.

EXERCISE

1. Geometric Mean

2. Harmonic Mean

The median

The median is another measure of central tendency: the middle value in a set of observations.

For cars$speed, we can sort the values in ascending order using the sort( )

function. This can help us identify the median.

Example 1

# Defining vector

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23)

# Print Median

median(x)

# output

21.5

Example 2

sort(cars$speed)

## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15

## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

In this case, with 50 observations the median is the average of the values at positions 25 and

26; both are 15, so the median is 15. If the value at position 25 were 14 and the value at

position 26 were 15, we would take the average of the two values and the median would be 14.5.

An easier way to compute the median is to use the median( ) function:

median(cars$speed)

## [1] 15

The Mode

The mode of a set of observations is the value that occurs most frequently. There’s not a

standard function in R that computes the mode. However, you can create a simple frequency

table to tally the number of times each value occurs.

Example 1: Single-mode value

In R, there is no built-in function to calculate the mode, so we write code to find the

mode for a given set of values.

# Defining vector

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8)

# Generate frequency table

y <- table(x)

# Print frequency table

print(y)

# Mode of x

m <- names(y)[which(y == max(y))]

# Print mode

print(m)

Output:

1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56

1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 2

[1] "23"

Example 2: Multiple Mode values



# Defining vector

x <- c(3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29, 56, 37, 45, 1, 25, 8, 56, 56)

# Generate frequency table

y <- table(x)

# Print frequency table

print(y)

# Mode of x

m <- names(y)[which(y == max(y))]

# Print mode

print(m)

Output:

1 3 5 7 8 12 13 14 20 23 25 29 37 39 40 45 56

1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 4

[1] "23" "56"

table(cars$speed)

##

## 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25

## 2 2 1 1 3 2 4 4 4 3 2 3 4 3 5 1 1 4 1

Here we see that the value 20 occurs 5 times.



You can also compute the mode using the following algorithm:

modeforcars <- table(as.vector(cars$speed))

names(modeforcars)[modeforcars == max(modeforcars)]

## [1] "20"
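The same tally-based algorithm can be wrapped in a small helper function. (get_mode is our own name for this sketch; it is not part of base R.)

```r
# get_mode: return the most frequent value(s) in a numeric vector
get_mode <- function(v) {
  freq <- table(v)
  # names() gives characters, so convert back to numbers
  as.numeric(names(freq)[freq == max(freq)])
}

get_mode(cars$speed)        # 20
get_mode(c(1, 2, 2, 3, 3))  # 2 3 (both modes are returned)
```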

Exercise 2
1. Find the Mean, Median and Mode using mtcars dataset pre-loaded in R.

ii) Measures of Relative Positioning


 The commonly used quantiles are: Quartiles, Deciles and Percentiles.

 These divide a sorted data set into four, ten and one hundred equal parts, respectively.

a) Quartiles

The quartiles of a variable split its sorted values into four equal parts. The first quartile, or lower

quartile, is the value that cuts off the first 25% of the data when it is sorted in

ascending order. The second quartile, or median, is the value that cuts off the first

50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.

Example

Find the quartiles of the eruption durations in the data set faithful.

Solution

We apply the quantile function to compute the quartiles of eruptions.

duration = faithful$eruptions # the eruption durations

quantile(duration) # apply the quantile function

0% 25% 50% 75% 100%

1.6000 2.1627 4.0000 4.4543 5.1000

Answer

The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and

4.4543 minutes respectively.

b) Deciles

In statistics, deciles are numbers that split a dataset into ten groups of equal

frequency. The first decile is the point where 10% of all data values lie below it. The

second decile is the point where 20% of all data values lie below it, and so on. We can

use the following syntax to calculate the deciles for a dataset in R:



quantile(data, probs = seq(.1, .9, by = .1))

Example

Calculate Deciles in R

The following code shows how to create a dataset with 20 values and then calculate

the values for the deciles of the dataset:

#create dataset

data <- c(56, 58, 64, 67, 68, 73, 78, 83, 84, 88,89, 90, 91, 92, 93, 93, 94, 95, 97, 99)

#calculate deciles of dataset

quantile(data, probs = seq(.1, .9, by = .1))

Output

10% 20% 30% 40% 50% 60% 70% 80% 90%

63.4 67.8 76.5 83.6 88.5 90.4 92.3 93.2 95.2

The way to interpret the deciles is as follows:

10% of all data values lie below 63.4

20% of all data values lie below 67.8.

30% of all data values lie below 76.5.

40% of all data values lie below 83.6.

50% of all data values lie below 88.5.

60% of all data values lie below 90.4.

70% of all data values lie below 92.3.

80% of all data values lie below 93.2.

90% of all data values lie below 95.2.



c) Percentiles

The nth percentile of a dataset is the value that cuts off the first n percent of the data values

when all of the values are sorted from least to greatest.

For example, the 90th percentile of a dataset is the value that cuts off the bottom 90% of the

data values from the top 10% of data values.

One of the most commonly used percentiles is the 50th percentile, which represents the

median value of a dataset: the value below which 50% of all data values fall.

Percentiles can be used to answer questions such as:

 What score does a student need to earn on a particular test to be in the top 10% of

scores? To answer this, we would find the 90th percentile of all scores, which is the

value that separates the bottom 90% of values from the top 10%.

 What heights encompass the middle 50% of heights for students at a particular

school? To answer this, we would find the 75th percentile of heights and 25th

percentile of heights, which are the two values that determine the upper and lower

bounds for the middle 50% of heights.

To Calculate Percentiles in R

We can easily calculate percentiles in R using the quantile() function, which uses the

following syntax:

quantile(x, probs = seq(0, 1, 0.25))

Example

Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.

Solution

We apply the quantile function to compute the percentiles of eruptions with the desired

percentage ratios.

duration = faithful$eruptions # the eruption durations

quantile(duration, c(.32, .57, .98))

32% 57% 98%

2.3952 4.1330 4.9330

Answer

The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330

minutes respectively.
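Returning to the earlier test-score question, here is a sketch with a made-up vector of scores (the data are hypothetical, chosen only to illustrate quantile()):

```r
scores <- c(55, 61, 64, 70, 72, 75, 78, 81, 84, 90)

# The 90th percentile separates the bottom 90% from the top 10%
quantile(scores, probs = 0.90)

# The 25th and 75th percentiles bound the middle 50%
quantile(scores, probs = c(0.25, 0.75))
```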

Exercise

1. Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in

faithful.

2.

2. Measures of Spread / Dispersion
Spread is the degree of scatter or variation of the variable about the central value.

Examples of these measures include:

i) The range
ii) Inter-Quartile range
iii) Quartile Deviation also called semi Inter-Quartile range
iv) Mean Absolute Deviation
v) Variance
vi) Standard deviation

In addition to computing measures of central tendency, another summary statistic we’d like to

compute is variability. How spread out are the data? How far from the mean and median do

the observed values tend to be?

Range

The range of a variable is the largest value minus the smallest value. We can compute the

largest value using the max( ) function and the smallest value using the min( ) function. In the

case with cars$speed, the range is 25 – 4 or 21.

min(cars$speed)

## [1] 4

max(cars$speed)

## [1] 25

R has an even better function, range( ), that outputs the minimum and maximum values in a

vector:

range(cars$speed)

## [1] 4 25

Interquartile range

The interquartile range is similar to the range, but instead of calculating the difference

between the biggest and smallest values, you calculate the difference between the 25th

percentile and the 75th percentile.

We can calculate the interquartile range (IQR) using IQR( ). This is the range spanned by the

middle half of the data; that is, the 75th percentile minus the 25th percentile.

IQR(cars$speed)

## [1] 7

We can see all quantiles by typing the following:

quantile(cars$speed)

## 0% 25% 50% 75% 100%

## 4 12 15 19 25

Or just to see the 25% and 75% we can type:

quantile(cars$speed, probs=c(.25, .75))

## 25% 75%

## 12 19

Therefore, you can see the IQR is simply 19 – 12.

Variance

The variance is a numerical measure of how the data values are dispersed around the mean.

The variance measures how far a set of numbers is spread out. (A variance of zero indicates

that all the values are identical.) A non-zero variance is always positive: a small variance

indicates that the data points tend to be very close to the mean (expected value). A high

variance indicates that the data points are very spread out from the mean and from each other.

The variance of a dataset X is sometimes written as Var(X), but is more commonly denoted

s² for a given sample. The formula for the sample variance is:

s² = Σ(xᵢ − x̄)² / (n − 1)

To compute the sample variance in R we would type the following:

var(cars$speed)

## [1] 27.95918
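We can verify var() against the sample-variance formula directly:

```r
speed <- cars$speed
n <- length(speed)  # 50
# Sample variance computed from the formula
manual_var <- sum((speed - mean(speed))^2) / (n - 1)
manual_var   # 27.95918
var(speed)   # 27.95918, the same value
```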

Standard deviation

The square root of the variance is the standard deviation. The formula for the sample

standard deviation is:

s = √( Σ(xᵢ − x̄)² / (n − 1) )

To compute the sample standard deviation in R, type the following:

sqrt(var(cars$speed))

## [1] 5.287644

or you can use the sd( ) function

sd(cars$speed)

## [1] 5.287644

Measures of Skew and kurtosis

Skew and kurtosis are two more descriptive statistics that you may encounter.

Skew

Skewness is a measure of symmetry. If there are more extremely large values than extremely

small ones, the data can be described as positively skewed. If the data tend to have a lot of

extremely small values and not many extremely large values, then the data are considered

negatively skewed. As a rule, negative skewness indicates that the mean of the data values is

less than the median, and the data distribution is left-skewed. Positive skewness would

indicate that the mean of the data values is larger than the median, and the data distribution is

right-skewed. See Figure below for an illustration.

Figure: From left to right: Positive skew, no skew, and negative skew

We can compute the skew by using a function called skew( ) from the psych package.

library(psych)

skew(cars$speed)

## [1] -0.1105533

Kurtosis

Kurtosis is a measure of the peakedness or “pointiness” of a data distribution: it tells us

how fat or thin the tails of a distribution are relative to a normal distribution. Negative

kurtosis indicates a flat data distribution, which is said to be platykurtic. Positive kurtosis

indicates a peaked distribution, which is said to be leptokurtic. Incidentally, the normal

distribution has zero kurtosis, and is said to be mesokurtic.



We can compute the kurtosis by using a function called kurtosi( ) from the psych package.

kurtosi(cars$speed)

## [1] -0.6730924

Where do you think cars$speed falls? Let’s plot it.
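Since the original figure is not reproduced here, a histogram lets you judge the skew and peakedness of cars$speed by eye:

```r
# Histogram of speed; a roughly symmetric, fairly flat shape matches the
# small negative skew and negative kurtosis computed above
hist(cars$speed, main = "Distribution of cars$speed", xlab = "Speed (mph)")
```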

DESCRIBE AND SUMMARY FUNCTIONS.

There’s an easier way to compute some measures of central tendency and variability using

the summary( ) function. The summary function provides the minimum, maximum, median,

mean, and the 25% and 75% quantiles. To compute all these measures for a single variable

type:

summary(cars$speed)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 4.0 12.0 15.0 15.4 19.0 25.0

To summarize a data frame, type:

summary(cars)

## speed dist

## Min. : 4.0 Min. : 2.00

## 1st Qu.:12.0 1st Qu.: 26.00

## Median :15.0 Median : 36.00

## Mean :15.4 Mean : 42.98

## 3rd Qu.:19.0 3rd Qu.: 56.00

## Max. :25.0 Max. :120.00

Describing a data frame

A similar function to the summary( ) function is the describe( ) function in the psych

package. This function is useful when your data are interval or ratio scale. Unlike the

summary ( ) function, it calculates the descriptive statistics for any type of variable you

give it. It also includes other measures that we discussed earlier such as the trimmed mean

(default is 10%), skew, kurtosis, and range. n is the sample size (the number of non-missing

values).

describe(cars)

There are more advanced functions to compute descriptive statistics by group using the psych

package. One such function is describeBy( ). You can specify a grouping variable. Let’s say

we wanted to obtain descriptive statistics separately for each grouping of data. For example,

we could group our data by the different speeds in cars. We could use speed as our grouping

variable as follows:

describeBy(cars, group=cars$speed)

Bivariate Data
So far we have confined our discussion to the distributions involving only one variable.

Sometimes, in practical applications, we might come across a set of data where each

item of the set comprises the values of two or more variables.

A Bivariate Data set is a set of paired measurements of the form:

(x1, y1), (x2, y2), ..., (xn, yn).

1. Scatter Diagrams.

2. Correlation.

3. Regression.

1. Scatter Diagrams.

A scatter diagram is a tool for analysing relationships between two variables. One variable is

plotted on the horizontal axis and the other is plotted on the vertical axis. The pattern of their

intersecting points can graphically show relationship patterns. Most often a scatter diagram is

used to look for evidence of a cause-and-effect relationship, though correlation alone cannot prove causation.

There are many ways to create a scatterplot in R.



i) The basic function is plot(x, y), where x and y are numeric vectors denoting the

(x,y) points to plot.

# Simple Scatterplot

x<-c(1,2,3,4,5,6,7)

y<-c(2,4,6,8,10,12,14)

plot(x,y)

Plot 2

attach(mtcars)

plot(wt, mpg, main="Scatterplot Example",

xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)

Scatter diagrams will generally show one of six possible correlations between the

variables:

i) Strong Positive Correlation: the value of Y clearly increases as the value of X

increases.

ii) Strong Negative Correlation: the value of Y clearly decreases as the value of X

increases.

iii) Weak Positive Correlation: the value of Y increases slightly as the value of X

increases.

iv) Weak Negative Correlation: the value of Y decreases slightly as the value of X

increases.

v) Complex Correlation: the value of Y seems to be related to the value of X, but the

relationship is not easily determined.

vi) No Correlation: there is no demonstrated connection between the two variables.



2. Correlation

Correlation is a statistical method to measure the relationship between the two quantitative

variables.

The correlation coefficient (r) measures the strength and direction of (linear) relationship

between the two quantitative variables. r can range from +1 (perfect positive correlation) to -

1 (perfect negative correlation).

Positive values of r indicate a positive relationship, and vice versa. The higher the

absolute value of r, the stronger the correlation. If the value of r is 0, there is no

linear relationship between the two variables.

Interpretation of correlation coefficient (r)

Interpretation guides suggest labels for r at different absolute values. These cut-offs

are arbitrary and should be used judiciously while interpreting the dataset.

Note: In interpretation, correlation can be positive or negative based on the sign of r

Types of correlation coefficients (r)

There are three main types of correlation coefficients:

i) Pearson’s product-moment correlation coefficient.

ii) Spearman’s rank-order (Spearman’s rho) correlation coefficient.

iii) Kendall’s Tau correlation coefficient.

Note: a) Most of the time, “correlation coefficient” refers to Pearson’s r unless otherwise specified.

b) The appropriate usage of different types of correlation coefficients largely depends

on underlying data types, sample size, linear or non-linear relationships between the

two variables, and their distributions.

i) Pearson’s product-moment correlation coefficient.

Pearson correlation (r) measures a linear dependence between two variables (x and y). It is

also known as a parametric correlation test because it depends on the distribution of the data.

Strictly, it should be used only when x and y come from a normal distribution.

r = Σ(x − mx)(y − my) / √( Σ(x − mx)² · Σ(y − my)² ), where mx and my are the means of the x and y variables.



Correlation coefficient can be computed in R using the functions cor() or cor.test():

cor() computes the correlation coefficient

cor.test() tests for association/correlation between paired samples. It returns both the

correlation coefficient and the significance level (or p-value) of the correlation.

The simplified formats are:

cor(x, y, method = c("pearson", "kendall", "spearman"))

cor.test(x, y, method=c("pearson", "kendall", "spearman"))

Where;

x, y: numeric vectors with the same length

method: the correlation method to use.

Example 1

# correlation of vectors in R

x <- c(0,1,1,2,3,5,8,13,21,34)

y <- log(x+1)

cor(x,y)

Example 2

x <- c(0,1,1,2,3,5,8,13,21,34)

y <- log(x+1)

cor(x,y,method="pearson")
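cor.test() adds a significance test on top of the coefficient; applied to the same vectors:

```r
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
y <- log(x + 1)

res <- cor.test(x, y, method = "pearson")
res$estimate   # the correlation coefficient r
res$p.value    # the significance level (p-value) of the correlation
```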

3. Regression
Regression analysis, in a general sense, means the estimation or prediction of the unknown

value of one variable from the known value of another variable.

Regression analysis can be thought of as the flip side of correlation.

It has to do with finding the equation for the kind of straight lines you were just looking at.

Suppose we have a sample of size n with two sets of measures, denoted by x and y. We

can predict the values of y given the values of x by using the equation

ŷ = a + bx

Not every problem can be solved with the same algorithm. In this case, linear regression

assumes that there exists a linear relationship between the response variable and the

explanatory variables. This means that you can fit a line between the two (or more variables).

In this particular example, you can calculate the height of a child if you know her age:

height = a + b × age
In this case, “a” and “b” are called the intercept and the slope respectively. With the same

example, “a” or the intercept, is the value from where you start measuring. Newborn babies

with zero months are not zero centimeters necessarily; this is the function of the intercept.

The slope measures the change of height with respect to the age in months. In general, for

every month older the child is, his or her height will increase by “b”.
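As a sketch of this idea, a linear model fitted to made-up measurements (both vectors are hypothetical, invented purely for illustration):

```r
# Hypothetical data: age in months and height in cm for six children
age_months <- c(1, 3, 5, 7, 9, 11)
height_cm  <- c(52, 57, 61, 66, 70, 75)

fit <- lm(height_cm ~ age_months)
coef(fit)   # "(Intercept)" is a; "age_months" is the slope b
```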

Linear Regression in R

A linear regression can be calculated in R with the command lm().

Dependent Variable (Target) : Continuous

Independent Variable (Predictor(s)): Continuous/Discrete



Y = mX + c , where

m = slope of straight line

c = Y-intercept

R-Codes to load Data

require("datasets")

data("iris")

str(iris)

head(iris)

Linear Models

Since simple linear regression requires just one predictor, let’s take the “Sepal.Width”

attribute as our target (Y) and the “Sepal.Length” attribute as our predictor (X), to find whether any kind of

relationship exists between them.

Example 1

y <- c(1, 2, 3, 4, 5, 6, 7)

x <- c(2, 5, 6, 8, 9, 10, 18)

M <- lm(y ~ x)

summary(M)

Example 2

Y<- iris[,"Sepal.Width"] # select Target attribute

X<- iris[,"Sepal.Length"] # select Predictor attribute

head(X)


model1<- lm(Y~X)

model1 # provides regression line coefficients i.e. slope and y-intercept

Y = 3.41895 – 0.06188X

Interpretation

When X = 0, Y = 3.41895 (the intercept).

For a unit increase in X, Y decreases by 0.06188 (the slope).

Example 3

Model2 <- lm(Petal.Width ~ Petal.Length, data = iris)

summary(Model2)

The results can be interpreted as follows:

lm(Petal.Width ~ Petal.Length).

Petal.Width = -0.363076 + 0.415755 Petal.Length

R-squared = 0.9271, i.e. 92.71%, implying that 92.71% of the variability of Y has been explained

by X, leaving 7.29% unexplained.
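The R-squared value can also be extracted programmatically from the model summary:

```r
model2 <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(model2)$r.squared   # approximately 0.9271, i.e. 92.71%
```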



EXPLORATORY DATA ANALYSIS

PLOTS/GRAPHICS
