Exploratory Data Analysis Using R
Introduction to R Programming
This tutorial introduces R programming for statistics and data science. It covers the R
software editors RGui and R-Studio and their components, the features and applications of R,
and how to develop R scripts, with worked examples throughout.
Many routines have been written for R analytics by people all over the world and made freely
available from the R project Website as packages. However, the basic installation (for Linux,
Windows, or Mac) contains a powerful set of tools for most purposes.
History of R language
Ross Ihaka and Robert Gentleman developed R at the University of Auckland. R is an
implementation of the S programming language, which John Chambers and colleagues developed at
Bell Laboratories, combined with lexical scoping semantics inspired by Scheme. R was named
partly after the first names of its two authors and partly as a play on the name of S. The
project was conceived in 1992, with an initial version released in 1995 and a stable beta
version in 2000.
R-Studio:
R-Studio is an integrated development environment (IDE) for R language. R-Studio is a code
editor and development environment, with some nice features that make code development in R
easy and fun.
a) Features of R-Studio
Code highlighting that gives different colors to keywords and variables, making it easier to read
Automatic bracket matching
Code completion, so as to reduce the effort of typing the commands in full
Easy access to R Help, with additional features for exploring functions and parameters of
functions
R-Studio is available free of charge for Linux, Windows, and Mac devices, which makes it a good
option to use with R. To open R-Studio, click the R-Studio icon in the menu system or on the
desktop.
Components of R-Studio
Source – Top left corner of the screen contains a text editor that lets the user work with source
script files. Multiple lines of code can also be entered here. Users can save R script file to disk
and perform other tasks on the script.
Console – Bottom left corner is the R console window. The console in R-Studio is identical to
the console in RGui. All the interactive work of R programming is performed in this window.
Workspace and History – The top right corner is the R workspace and history window. This
provides an overview of the workspace, where the variables created in the session along with
their values can be inspected. This is also the area where the user can see a history of the
commands issued in R.
Files, Plots, Packages, and Help – The bottom right corner gives access to the following tools:
Files – This is where the user can browse folders and files on a computer.
Plots – This is where R displays the user’s plots.
Packages – This is where the user can view a list of all the installed packages.
Help – This is where you can browse the built-in Help system of R.
The exercises that follow put these tools to use, starting with the iris dataset.
1. Load the ‘iris.csv’ file and display the names and type of each column.
Find statistics such as min, max, range, mean, median, variance, and standard
deviation for each column of data.
SOLUTION:
> print(iris)
> print (min(iris$Sepal.Length))
[1] 4.3
> print (max(iris$Sepal.Length))
[1] 7.9
> print (range(iris$Sepal.Length))
[1] 4.3 7.9
> print (mean(iris$Sepal.Length))
[1] 5.843333
> print (median(iris$Sepal.Length))
[1] 5.8
> print (var(iris$Sepal.Length))
[1] 0.6856935
> print (sd(iris$Sepal.Length))
[1] 0.8280661
> print (min(iris$Sepal.Width))
[1] 2
> print (max(iris$Sepal.Width))
[1] 4.4
> print (range(iris$Sepal.Width))
[1] 2.0 4.4
> print (mean(iris$Sepal.Width))
[1] 3.057333
> print (median(iris$Sepal.Width))
[1] 3
> print (var(iris$Sepal.Width))
[1] 0.1899794
> print (sd(iris$Sepal.Width))
[1] 0.4358663
> print (min(iris$Petal.Length))
[1] 1
> print (max(iris$Petal.Length))
[1] 6.9
> print (range(iris$Petal.Length))
[1] 1.0 6.9
> print (mean(iris$Petal.Length))
[1] 3.758
> print (median(iris$Petal.Length))
[1] 4.35
> print (var(iris$Petal.Length))
[1] 3.116278
> print (sd(iris$Petal.Length))
[1] 1.765298
> print (min(iris$Petal.Width))
[1] 0.1
> print (max(iris$Petal.Width))
[1] 2.5
> print (range(iris$Petal.Width))
[1] 0.1 2.5
> print (mean(iris$Petal.Width))
[1] 1.199333
> print (median(iris$Petal.Width))
[1] 1.3
> print (var(iris$Petal.Width))
[1] 0.5810063
> print (sd(iris$Petal.Width))
[1] 0.7622377
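The exercise also asks for the name and type of each column, which the listing above does not show. The following is a minimal sketch, assuming the built-in iris data frame; a file such as ‘iris.csv’ would instead be loaded with read.csv(), and its column names may differ. The name column_stats is hypothetical, and the "range" row here is the width max - min rather than the pair returned by range().

```r
# Built-in iris data; for a CSV file one would use e.g.
# iris <- read.csv("iris.csv")   (assumed file name and location)
data(iris)

# Names and types of each column
print(names(iris))
print(sapply(iris, class))

# Summary statistics for every numeric column, collected in one table
numeric_cols <- iris[sapply(iris, is.numeric)]
column_stats <- sapply(numeric_cols, function(x) {
  c(min = min(x), max = max(x), range = max(x) - min(x),
    mean = mean(x), median = median(x), var = var(x), sd = sd(x))
})
print(round(column_stats, 4))
```

Each column of the resulting table corresponds to one attribute, so the values can be compared with the individual print() calls above.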
The columns of a data frame can be normalized using the Min-Max normalization technique, which
applies the following formula to each value of the feature to be normalized, rescaling it to
the range 0 to 1. This technique is traditionally used with K-Nearest Neighbors (KNN)
classification problems.
(x-min(x))/(max(x)-min(x))
> x <- iris$Sepal.Length
> normalized = (x - min(x)) / (max(x) - min(x))
> print(normalized)
> x1 <- iris$Sepal.Width
> normalized = (x1 - min(x1)) / (max(x1) - min(x1))
> print(normalized)
> x2 <- iris$Petal.Length
> normalized = (x2 - min(x2)) / (max(x2) - min(x2))
> print(normalized)
> x3 <- iris$Petal.Width
> normalized = (x3 - min(x3)) / (max(x3) - min(x3))
> print(normalized)
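The same formula can be wrapped in a function and applied to every numeric column at once, so that each result is kept rather than overwritten. This is a sketch assuming the built-in iris data frame; the names min_max and iris_normalized are hypothetical.

```r
data(iris)

# Min-Max normalization: rescales a numeric vector to the range [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply it to the numeric columns only, leaving Species untouched
numeric_cols <- sapply(iris, is.numeric)
iris_normalized <- iris
iris_normalized[numeric_cols] <- lapply(iris[numeric_cols], min_max)

print(head(iris_normalized))

# Each normalized column should now span exactly 0 to 1
print(sapply(iris_normalized[numeric_cols], range))
```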
3. Generate histograms for any one variable (sepal length/ sepal width/ petal
length/ petal width) and generate scatter plots for every pair of variables
showing each species in different color.
SOLUTION:
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
‘v’ is a vector containing numeric values used in histogram.
‘main’ indicates title of the chart.
‘col’ is used to set color of the bars.
‘border’ is used to set border color of each bar.
‘xlab’ is used to give description of x-axis.
‘xlim’ is used to specify the range of values on the x-axis.
‘ylim’ is used to specify the range of values on the y-axis.
‘breaks’ is used to suggest the number of bars or to give the break points between them.
# print iris
print(iris)
> x1<-iris$Sepal.Length
> x2<-iris$Sepal.Width
> x3<-iris$Petal.Length
> x4<-iris$Petal.Width
>hist(x1)
>hist(x2)
>hist(x3)
>hist(x4)
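The parameters described above can be combined in a single call. The following sketch assumes the built-in iris data frame; the title, colors, axis limits, and number of breaks are illustrative choices, not values from the exercise.

```r
data(iris)

# Histogram of sepal length using the parameters described above
h <- hist(iris$Sepal.Length,
          main   = "Histogram of sepal length",  # chart title
          xlab   = "Sepal length (cm)",          # x-axis description
          xlim   = c(4, 8),                      # range of the x-axis
          col    = "lightblue",                  # bar color
          border = "black",                      # bar border color
          breaks = 8)                            # suggested number of bars

# hist() also returns the bin boundaries and counts it used
print(h$breaks)
print(h$counts)
```

R treats a single number given for breaks as a suggestion, so the number of bars actually drawn can differ from the value requested.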
Scatter Plots
Scatter plots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is chosen in the horizontal axis and another in the
vertical axis.
The simple scatter plot is created using the plot() function.
Syntax: The basic syntax for creating a scatter plot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
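The exercise asks for a scatter plot of every pair of variables with each species in a different color. One possible sketch uses pairs() for the full grid and plot() for a single pair, assuming the built-in iris data frame. The color choices and the names species_colors and point_colors are hypothetical.

```r
data(iris)

# One color per species, looked up by species name
species_colors <- c(setosa = "red", versicolor = "darkgreen", virginica = "blue")
point_colors <- species_colors[as.character(iris$Species)]

# Scatter plot matrix: every pair of the four numeric variables
pairs(iris[, 1:4], col = point_colors, pch = 19,
      main = "Iris: all pairs of variables by species")

# A single pair drawn with plot(), with a legend for the species
plot(iris$Petal.Length, iris$Petal.Width,
     col = point_colors, pch = 19,
     xlab = "Petal length", ylab = "Petal width",
     main = "Petal width vs. petal length")
legend("topleft", legend = names(species_colors),
       col = species_colors, pch = 19)
```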
4. Generate box plots for each of the numerical attributes. Identify the
attribute with the highest variance.
SOLUTION:
A boxplot summarizes how the data in a data set is distributed. It splits the data at three
quartiles and shows the minimum, maximum, median, first quartile, and third quartile of the
data set. It is also useful for comparing the distribution of data across data sets by drawing
a boxplot for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax: The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
# print iris
> print(iris)
> x1 <- iris$Sepal.Length
> x2 <- iris$Sepal.Width
> x3 <- iris$Petal.Length
> x4 <- iris$Petal.Width
# Highest variance
> var1 = var(x1)
> var2 = var(x2)
> var3 = var(x3)
> var4 = var(x4)
> max1 = max(c(var1, var2, var3, var4))
> print(max1)
[1] 3.116278
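The listing above reports the largest variance but neither draws the box plots nor names the attribute. The following sketch does both, assuming the built-in iris data frame; the names variances and highest are hypothetical.

```r
data(iris)
numeric_cols <- iris[, 1:4]

# One box per numeric attribute, drawn side by side
boxplot(numeric_cols,
        main = "Iris numeric attributes",
        ylab = "Measurement (cm)",
        col  = "lightgray")

# Variance of each attribute, and the name of the attribute with the largest one
variances <- sapply(numeric_cols, var)
print(variances)
highest <- names(which.max(variances))
print(highest)
```

The value 3.116278 printed above is the variance of Petal.Length, as computed in exercise 1, so Petal.Length is the attribute with the highest variance.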
SOLUTION:
Vectors:
To create a vector with more than one element, use the c() function, which combines its
arguments into a vector.
# Create vectors
apple <- c('red', 'green', "yellow")
print(apple)
c1 <- c(2, 3, 5)
print(c1)
c3 <- c("aa", "bb", "cc", "dd", "ee")
# Combining vectors (the numbers are coerced to character strings)
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
c(n, s)
# Arithmetic (the shorter vector is recycled; R warns because 5 is not a multiple of 4)
a = c(1, 3, 5, 7)
b = c(1, 2, 4, 8, 9)
a + b
Matrices:
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to
the matrix function.
# matrix
A = matrix(
  c(2, 4, 3, 1, 5, 7),  # the data elements
  nrow = 2,             # number of rows
  ncol = 3,             # number of columns
  byrow = TRUE)         # fill matrix by rows
# An element at the mth row, nth column of A can be accessed by the expression A[m, n].
A[2, 3]
# We can also extract more than one row or column at a time.
A[, c(1, 3)]
t(A)                    # transpose of A
Arrays:
While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim argument that sets the size of each dimension. In the example
below we create an array with four elements, each of which is a 3x3 matrix.
, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
, , 3
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 4
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
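The call that produced this output is not shown. A call consistent with it is sketched below, on the assumption that a two-element vector was recycled to fill a 3x3x4 array; the variable names are hypothetical.

```r
# Two values recycled to fill a 3 x 3 x 4 array (four 3x3 matrices)
colors <- c("green", "yellow")
a <- array(colors, dim = c(3, 3, 4))
print(a)

# Dimensions of the array
print(dim(a))

# Element in row 1, column 1 of the second matrix
print(a[1, 1, 2])
```

Because each 3x3 matrix holds an odd number of elements, the recycled pattern starts with "green" in the first and third matrices and with "yellow" in the second and fourth.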
Data Frames:
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain
a different mode of data: the first column can be numeric while the second column is
character and the third column is logical. A data frame is a list of vectors of equal length.
BMI <- data.frame( gender = c("Male", "Male","Female"), height = c(152, 171.5, 165),
weight = c(81, 93, 78), Age = c(42, 38, 26))
print(BMI)
View(BMI)
Lists:
A list is an R-object which can contain many different types of elements inside it like
vectors, functions and even another list inside it.
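# Create a list.
# mat2 is a 2x3 character matrix; its definition is inferred from the output below.
mat2 <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2)
eglist <- list(c(2,5), c('red','green',"yellow"), 21.3, BMI, mat2, sin)
print(eglist)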
[[1]]
[1] 2 5
[[2]]
[1] "red" "green" "yellow"
[[3]]
[1] 21.3
[[4]]
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
[[5]]
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
[[6]]
function (x) .Primitive("sin")
z= (X - μ) / σ
where z is the z-score, X is the value of the element, μ is the population mean, and σ is the
standard deviation.
> X1 <- iris$Sepal.Length
> Zscr1 <- (X1 - mean(X1)) / sd(X1)
> Zscr1
> X2 <- iris$Sepal.Width
> Zscr2 <- (X2 - mean(X2)) / sd(X2)
> Zscr2
> X3 <- iris$Petal.Length
> Zscr3 <- (X3 - mean(X3)) / sd(X3)
> Zscr3
> X4 <- iris$Petal.Width
> Zscr4 <- (X4 - mean(X4)) / sd(X4)
> Zscr4
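Base R's scale() function performs the same standardization on all columns in one call. The sketch below, assuming the built-in iris data frame, checks it against the manual formula; the names z_manual and z_all are hypothetical. Note that sd() computes the sample standard deviation (dividing by n - 1), and scale() does the same.

```r
data(iris)

# Manual z-score for one column, as in the formula above
z_manual <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)

# scale() centers each column on its mean and divides by its standard deviation
z_all <- scale(iris[, 1:4])
print(head(z_all))

# The two approaches should agree for Sepal.Length
print(all.equal(as.numeric(z_all[, "Sepal.Length"]), z_manual))

# After standardization each column has mean 0 and standard deviation 1
print(round(colMeans(z_all), 10))
print(apply(z_all, 2, sd))
```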
7. a) Use R to apply linear regression to predict evaporation coefficient in
terms of air velocity using the data given below:
Solution:
lm() Function
This function creates the relationship model between the predictor and the response
variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
formula is a symbolic description of the relation between x and y.
data is the data frame containing the variables named in the formula.
> AV <- data.frame(
+ airvelocity = c(20,60,100,140,180,220,260,300,340,380),
+ evaporationcoefficient = c(0.18,0.37,0.35,0.78,0.56,0.75,1.18,1.36,1.17,1.65)
+ )
> print(AV)
airvelocity evaporationcoefficient
1 20 0.18
2 60 0.37
3 100 0.35
4 140 0.78
5 180 0.56
6 220 0.75
7 260 1.18
8 300 1.36
9 340 1.17
10 380 1.65
> model<-lm(AV$airvelocity~AV$evaporationcoefficient)
> print(model)
Call:
lm(formula = AV$airvelocity ~ AV$evaporationcoefficient)
Coefficients:
(Intercept) AV$evaporationcoefficient
2.564 236.450
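In an R formula the response goes on the left of ~, so the call above models air velocity as a function of evaporation coefficient. To predict evaporation coefficient from air velocity, as the exercise asks, the formula would be written the other way round. This is a sketch using the same data; the name model_evap and the air velocity of 190 used for the prediction are hypothetical.

```r
AV <- data.frame(
  airvelocity = c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380),
  evaporationcoefficient = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)

# Response (evaporation coefficient) on the left, predictor (air velocity) on the right
model_evap <- lm(evaporationcoefficient ~ airvelocity, data = AV)
print(coef(model_evap))

# Predict the evaporation coefficient for a new (hypothetical) air velocity
new_data <- data.frame(airvelocity = 190)
print(predict(model_evap, newdata = new_data))
```

Passing data = AV instead of writing AV$ in the formula keeps the variable names short and lets predict() match the new data by column name.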
Solution:
The Residual Standard Error is a measure of the quality of a linear regression fit. It is
the average amount by which the response variable deviates from the true regression
line.
The R-squared statistic provides a measure of how well the model is fitting the
actual data.
> summary(model)
Call:
lm(formula = AV$airvelocity ~ AV$evaporationcoefficient)
Residuals:
Min 1Q Median 3Q Max
-46.99 -24.88 -17.14 33.74 60.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
> cor(AV$airvelocity,AV$evaporationcoefficient)
[1] 0.9514814
> y<-cor(AV$airvelocity,AV$evaporationcoefficient)
>y
[1] 0.9514814
> x<-log(AV$airvelocity)
>x
> m<-lm(x~AV$evaporationcoefficient)
>m
Call:
lm(formula = x ~ AV$evaporationcoefficient)
Coefficients:
(Intercept) AV$evaporationcoefficient
3.678 1.614
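The residual standard error and R-squared discussed above can also be read directly from the object returned by summary(). The sketch below refits the same model as the original call, written with data = AV; the variable name s is hypothetical.

```r
AV <- data.frame(
  airvelocity = c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380),
  evaporationcoefficient = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)

# Same model as above: air velocity as a function of evaporation coefficient
model <- lm(airvelocity ~ evaporationcoefficient, data = AV)
s <- summary(model)

# Residual standard error and R-squared of the fit
print(s$sigma)
print(s$r.squared)

# For a regression with one predictor, R-squared equals the squared correlation
print(cor(AV$airvelocity, AV$evaporationcoefficient)^2)
```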
Introduction
WEKA is a data mining system developed by the University of Waikato in New Zealand. It is
a collection of machine learning (ML) algorithms for data mining tasks, and the algorithms are
applied directly to a dataset. WEKA implements algorithms for data preprocessing,
classification, regression, clustering, and association rules; it also includes visualization
tools. New machine learning schemes can also be developed with this package. WEKA is open
source software issued under the GNU General Public License.
The goal of this tutorial is to help you learn WEKA Explorer. The tutorial guides you step by
step through the analysis of a simple problem using WEKA Explorer's preprocessing,
classification, clustering, association, attribute selection, and visualization tools. At the
end of each problem there is a representation of the results with explanations side by side.
Each part concludes with an exercise for individual practice. By the time you reach the end of
this tutorial, you will be able to analyze your data with WEKA Explorer using various learning
schemes and interpret the results. Before starting, you should be familiar with data mining
algorithms such as C4.5 (C5), ID3, K-means, and Apriori.
You can launch Weka from the C:\Program Files directory, by selecting its icon on your desktop,
or from the Windows task bar ‘Start’ -> ‘Programs’ -> ‘Weka 3.8’. When the ‘WEKA GUI
Chooser’ window appears on the screen, you can select one of the four options at the bottom of
the window:
1. Simple CLI provides a simple command-line interface and allows direct execution of Weka
commands.
2. Explorer is an environment for exploring data.
3. Experimenter is an environment for performing experiments and conducting statistical tests
between learning schemes.
4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning
experiments.
For the exercises in this tutorial you will use ‘Explorer’. Click on ‘Explorer’ button in the
‘WEKA GUI Chooser’ window.
Discretization
1) Sometimes association rule mining can only be performed on categorical data. This
requires performing discretization on numeric or continuous attributes. In the following
example let us discretize age attribute.
To change the defaults for the filters, click on the box immediately to the right of the
choose button.
We enter the index for the attribute to be discretized. In this case the attribute is age. So
we must enter ‘1’ corresponding to the age attribute.
Enter ‘3’ as the number of bins. Leave the remaining field values as they are.
Click OK button.
Click on Apply in the filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins.
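The effect of the steps above can be illustrated outside Weka. The following R sketch splits a numeric attribute into 3 bins with cut(), assuming equal-width bins. The age values are hypothetical, and the interval labels Weka produces will look different.

```r
# Hypothetical ages standing in for the numeric 'age' attribute
age <- c(23, 35, 47, 52, 61, 29, 44, 38, 56, 67)

# Discretize into 3 equal-width intervals; the result is a categorical factor
age_binned <- cut(age, breaks = 3)

# Interval labels and the number of values falling in each bin
print(levels(age_binned))
print(table(age_binned))
```

After this step the attribute is categorical, which is the form association rule mining expects.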
Dataset test.arff
@relation test
@attribute admissionyear {2005,2006,2007,2008,2009,2010}
@attribute course {cse,mech,it,ece}
@data
%
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
%
The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied to the given dataset.
The following screenshot shows the classification rules that were generated when the J48
algorithm is applied to the given dataset.
The following screenshot shows the classification results that were generated when the Naive
Bayes algorithm is applied to the given dataset.
The following screenshot shows the clusters that were generated when the simple k-means
algorithm is applied to the given dataset.