Exploratory Data Analysis Using R

Uploaded by

Naveen kumar

KNOWLEDGE ENGINEERING LAB

Exploratory Data Analysis using R

Introduction to R Programming
This R programming tutorial introduces R and its use in statistics and data science. It covers
R software editors such as RGui and R-Studio and their components, the features and
applications of R, and how to develop R scripts with the help of examples. Let's begin with the
introduction to R programming.

What is R Programming Language?

R is an open-source language maintained by the R Core Team, a group of volunteer developers
from across the globe. R is used for statistical computing and is available from the R-Project
website, www.r-project.org. R is command-line driven: the user enters commands at the prompt
(> by default), and each command is executed one at a time.

Many routines have been written for R analytics by people all over the world and made freely
available from the R project Website as packages. However, the basic installation (for Linux,
Windows, or Mac) contains a powerful set of tools for most purposes.

R is an integrated environment for performing statistical operations and generating data
analysis reports in graphical or text formats. Commands entered in the console are evaluated
and executed immediately. R cannot parse certain auto-formatted characters, such as en dashes
or smart quotes, so be careful when copying and pasting commands into R from other
applications.

History of R language
Ross Ihaka and Robert Gentleman developed R at the University of Auckland. R is an
implementation of the S programming language, which John Chambers and colleagues created at
Bell Laboratories, combined with lexical scoping semantics inspired by Scheme. R was named
partly after the first names of its two authors. The project was conceived in 1992, with an
initial version released in 1995 and a stable beta version in 2000.

DEPT OF IT, SRKR ENGINEERING COLLEGE(A) Page 1

R-Studio:
R-Studio is an integrated development environment (IDE) for R language. R-Studio is a code
editor and development environment, with some nice features that make code development in R
easy and fun.

a) Features of R-Studio

Code highlighting that gives different colors to keywords and variables, making it easier to read
Automatic bracket matching
Code completion, so as to reduce the effort of typing the commands in full
Easy access to R Help, with additional features for exploring functions and parameters of
functions
R-Studio is available free of charge for Linux, Windows, and Mac devices, which makes it a good
option to use with R. To open R-Studio, click the R-Studio icon in the menu system or on the
desktop.

Components of R-Studio

Source – Top left corner of the screen contains a text editor that lets the user work with source
script files. Multiple lines of code can also be entered here. Users can save R script file to disk
and perform other tasks on the script.

Console – Bottom left corner is the R console window. The console in R-Studio is identical to
the console in RGui. All the interactive work of R programming is performed in this window.

Workspace and History – The top right corner is the R workspace and history window. This
provides an overview of the workspace, where the variables created in the session along with
their values can be inspected. This is also the area where the user can see a history of the
commands issued in R.

Files, Plots, Packages, and Help – The bottom right corner gives access to the following tools:

Files – This is where the user can browse folders and files on a computer.
Plots – This is where R displays the user’s plots.
Packages – This is where the user can view a list of all the installed packages.
Help – This is where you can browse the built-in Help system of R.

1. Load the ‘iris.csv’ file and display the names and type of each column.
Find statistics such as min, max, range, mean, median, variance, and standard
deviation for each column of data.
SOLUTION:

# stringsAsFactors=TRUE keeps Species as a factor (the default became FALSE in R 4.0.0)
iris <- read.csv("C:/Desktop/R LAB PROGRAMS/iris.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)

# print iris

print(iris)

#display col names

colnames( iris )

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

#display each col types

str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

#min, max, range, mean, median, variance, standard deviation

> print (min(iris$Sepal.Length))
[1] 4.3
> print (max(iris$Sepal.Length))
[1] 7.9
> print (range(iris$Sepal.Length))
[1] 4.3 7.9
> print (mean(iris$Sepal.Length))
[1] 5.843333
> print (median(iris$Sepal.Length))
[1] 5.8
> print (var(iris$Sepal.Length))
[1] 0.6856935
> print (sd(iris$Sepal.Length))
[1] 0.8280661

> print (min(iris$Sepal.Width))
[1] 2
> print (max(iris$Sepal.Width))
[1] 4.4
> print (range(iris$Sepal.Width))
[1] 2.0 4.4
> print (mean(iris$Sepal.Width))
[1] 3.057333
> print (median(iris$Sepal.Width))
[1] 3
> print (var(iris$Sepal.Width))
[1] 0.1899794
> print (sd(iris$Sepal.Width))
[1] 0.4358663

> print (min(iris$Petal.Length))
[1] 1
> print (max(iris$Petal.Length))
[1] 6.9
> print (range(iris$Petal.Length))
[1] 1.0 6.9
> print (mean(iris$Petal.Length))
[1] 3.758
> print (median(iris$Petal.Length))
[1] 4.35
> print (var(iris$Petal.Length))
[1] 3.116278
> print (sd(iris$Petal.Length))
[1] 1.765298

> print (min(iris$Petal.Width))
[1] 0.1
> print (max(iris$Petal.Width))
[1] 2.5
> print (range(iris$Petal.Width))
[1] 0.1 2.5
> print (mean(iris$Petal.Width))
[1] 1.199333
> print (median(iris$Petal.Width))
[1] 1.3
> print (var(iris$Petal.Width))
[1] 0.5810063
> print (sd(iris$Petal.Width))
[1] 0.7622377
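
The per-column calls above can be condensed. The following is a minimal sketch, assuming R's built-in iris data set (which has the same columns as the CSV file) and a helper name, col_stats, chosen here only for illustration. It uses sapply() to compute the statistics for every numeric column at once; note that it reports the range as a single width (max - min) rather than the pair of values returned by range().

```r
# Built-in copy of the iris data; assumed to match the CSV used above.
data(iris)

# Keep the numeric columns only (Species is a factor).
num_cols <- iris[sapply(iris, is.numeric)]

# col_stats is a hypothetical helper: one named vector of statistics per column.
col_stats <- function(x) {
  c(min = min(x), max = max(x), range = max(x) - min(x),
    mean = mean(x), median = median(x), var = var(x), sd = sd(x))
}

# sapply() returns a matrix: one row per statistic, one column per variable.
stats <- sapply(num_cols, col_stats)
print(round(stats, 4))
```

Each column of the resulting matrix should agree with the values printed one at a time in the solution above.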

2. Write an R program to normalize the variables to a 0 to 1 scale using min-max
normalization.
SOLUTION:

What is min-max normalization?

Min-max normalization linearly transforms each value x to y = (x - min)/(max - min), where
min and max are the minimum and maximum of the observed values of x. When x = min, y = 0,
and when x = max, y = 1, so every transformed value lies between 0 and 1.

A data frame can be normalized by applying this formula to each value of every feature to be
normalized. The technique is commonly used with K-Nearest Neighbors (KNN) classification,
where features on different scales would otherwise distort the distance calculations. In R,
the formula for a numeric vector x is:

(x-min(x))/(max(x)-min(x))

>x<-iris$Sepal.Length
>normalized = (x-min(x))/(max(x)-min(x))
>print(normalized)

[1] 0.22222222 0.16666667 0.11111111 0.08333333 0.19444444 0.30555556 0.08333333
[8] 0.19444444 0.02777778 0.16666667 0.30555556 0.13888889 0.13888889 0.00000000
[15] 0.41666667 0.38888889 0.30555556 0.22222222 0.38888889 0.22222222 0.30555556
[22] 0.22222222 0.08333333 0.22222222 0.13888889 0.19444444 0.19444444 0.25000000
[29] 0.25000000 0.11111111 0.13888889 0.30555556 0.25000000 0.33333333 0.16666667
[36] 0.19444444 0.33333333 0.16666667 0.02777778 0.22222222 0.19444444 0.05555556
[43] 0.02777778 0.19444444 0.22222222 0.13888889 0.22222222 0.08333333 0.27777778
[50] 0.19444444 0.75000000 0.58333333 0.72222222 0.33333333 0.61111111 0.38888889
[57] 0.55555556 0.16666667 0.63888889 0.25000000 0.19444444 0.44444444 0.47222222
[64] 0.50000000 0.36111111 0.66666667 0.36111111 0.41666667 0.52777778 0.36111111
[71] 0.44444444 0.50000000 0.55555556 0.50000000 0.58333333 0.63888889 0.69444444
[78] 0.66666667 0.47222222 0.38888889 0.33333333 0.33333333 0.41666667 0.47222222
[85] 0.30555556 0.47222222 0.66666667 0.55555556 0.36111111 0.33333333 0.33333333
[92] 0.50000000 0.41666667 0.19444444 0.36111111 0.38888889 0.38888889 0.52777778
[99] 0.22222222 0.38888889 0.55555556 0.41666667 0.77777778 0.55555556 0.61111111
[106] 0.91666667 0.16666667 0.83333333 0.66666667 0.80555556 0.61111111 0.58333333
[113] 0.69444444 0.38888889 0.41666667 0.58333333 0.61111111 0.94444444 0.94444444
[120] 0.47222222 0.72222222 0.36111111 0.94444444 0.55555556 0.66666667 0.80555556
[127] 0.52777778 0.50000000 0.58333333 0.80555556 0.86111111 1.00000000 0.58333333
[134] 0.55555556 0.50000000 0.94444444 0.55555556 0.58333333 0.47222222 0.72222222
[141] 0.66666667 0.72222222 0.41666667 0.69444444 0.66666667 0.66666667 0.55555556
[148] 0.61111111 0.52777778 0.44444444

>x1<-iris$Sepal.Width
>normalized = (x1-min(x1))/(max(x1)-min(x1))
>print(normalized)

[1] 0.62500000 0.41666667 0.50000000 0.45833333 0.66666667 0.79166667 0.58333333 0.58333333
[9] 0.37500000 0.45833333 0.70833333 0.58333333 0.41666667 0.41666667 0.83333333 1.00000000
[17] 0.79166667 0.62500000 0.75000000 0.75000000 0.58333333 0.70833333 0.66666667 0.54166667
[25] 0.58333333 0.41666667 0.58333333 0.62500000 0.58333333 0.50000000 0.45833333 0.58333333
[33] 0.87500000 0.91666667 0.45833333 0.50000000 0.62500000 0.66666667 0.41666667 0.58333333
[41] 0.62500000 0.12500000 0.50000000 0.62500000 0.75000000 0.41666667 0.75000000 0.50000000
[49] 0.70833333 0.54166667 0.50000000 0.50000000 0.45833333 0.12500000 0.33333333 0.33333333
[57] 0.54166667 0.16666667 0.37500000 0.29166667 0.00000000 0.41666667 0.08333333 0.37500000
[65] 0.37500000 0.45833333 0.41666667 0.29166667 0.08333333 0.20833333 0.50000000 0.33333333
[73] 0.20833333 0.33333333 0.37500000 0.41666667 0.33333333 0.41666667 0.37500000 0.25000000
[81] 0.16666667 0.16666667 0.29166667 0.29166667 0.41666667 0.58333333 0.45833333 0.12500000
[89] 0.41666667 0.20833333 0.25000000 0.41666667 0.25000000 0.12500000 0.29166667 0.41666667
[97] 0.37500000 0.37500000 0.20833333 0.33333333 0.54166667 0.29166667 0.41666667 0.37500000
[105] 0.41666667 0.41666667 0.20833333 0.37500000 0.20833333 0.66666667 0.50000000 0.29166667
[113] 0.41666667 0.20833333 0.33333333 0.50000000 0.41666667 0.75000000 0.25000000 0.08333333
[121] 0.50000000 0.33333333 0.33333333 0.29166667 0.54166667 0.50000000 0.33333333 0.41666667
[129] 0.33333333 0.41666667 0.33333333 0.75000000 0.33333333 0.33333333 0.25000000 0.41666667
[137] 0.58333333 0.45833333 0.41666667 0.45833333 0.45833333 0.45833333 0.29166667 0.50000000
[145] 0.54166667 0.41666667 0.20833333 0.41666667 0.58333333 0.41666667

>x2<-iris$Petal.Length
>normalized = (x2-min(x2))/(max(x2)-min(x2))
>print(normalized)

[1] 0.06779661 0.06779661 0.05084746 0.08474576 0.06779661 0.11864407 0.06779661 0.08474576
[9] 0.06779661 0.08474576 0.08474576 0.10169492 0.06779661 0.01694915 0.03389831 0.08474576
[17] 0.05084746 0.06779661 0.11864407 0.08474576 0.11864407 0.08474576 0.00000000 0.11864407
[25] 0.15254237 0.10169492 0.10169492 0.08474576 0.06779661 0.10169492 0.10169492 0.08474576
[33] 0.08474576 0.06779661 0.08474576 0.03389831 0.05084746 0.06779661 0.05084746 0.08474576
[41] 0.05084746 0.05084746 0.05084746 0.10169492 0.15254237 0.06779661 0.10169492 0.06779661
[49] 0.08474576 0.06779661 0.62711864 0.59322034 0.66101695 0.50847458 0.61016949 0.59322034
[57] 0.62711864 0.38983051 0.61016949 0.49152542 0.42372881 0.54237288 0.50847458 0.62711864
[65] 0.44067797 0.57627119 0.59322034 0.52542373 0.59322034 0.49152542 0.64406780 0.50847458
[73] 0.66101695 0.62711864 0.55932203 0.57627119 0.64406780 0.67796610 0.59322034 0.42372881
[81] 0.47457627 0.45762712 0.49152542 0.69491525 0.59322034 0.59322034 0.62711864 0.57627119
[89] 0.52542373 0.50847458 0.57627119 0.61016949 0.50847458 0.38983051 0.54237288 0.54237288
[97] 0.54237288 0.55932203 0.33898305 0.52542373 0.84745763 0.69491525 0.83050847 0.77966102
[105] 0.81355932 0.94915254 0.59322034 0.89830508 0.81355932 0.86440678 0.69491525 0.72881356
[113] 0.76271186 0.67796610 0.69491525 0.72881356 0.76271186 0.96610169 1.00000000 0.67796610
[121] 0.79661017 0.66101695 0.96610169 0.66101695 0.79661017 0.84745763 0.64406780 0.66101695
[129] 0.77966102 0.81355932 0.86440678 0.91525424 0.77966102 0.69491525 0.77966102 0.86440678
[137] 0.77966102 0.76271186 0.64406780 0.74576271 0.77966102 0.69491525 0.69491525 0.83050847
[145] 0.79661017 0.71186441 0.67796610 0.71186441 0.74576271 0.69491525

>x3<-iris$Petal.Width
>normalized = (x3-min(x3))/(max(x3)-min(x3))
>print(normalized)

[1] 0.04166667 0.04166667 0.04166667 0.04166667 0.04166667 0.12500000 0.08333333 0.04166667
[9] 0.04166667 0.00000000 0.04166667 0.04166667 0.00000000 0.00000000 0.04166667 0.12500000
[17] 0.12500000 0.08333333 0.08333333 0.08333333 0.04166667 0.12500000 0.04166667 0.16666667
[25] 0.04166667 0.04166667 0.12500000 0.04166667 0.04166667 0.04166667 0.04166667 0.12500000
[33] 0.00000000 0.04166667 0.04166667 0.04166667 0.04166667 0.00000000 0.04166667 0.04166667
[41] 0.08333333 0.08333333 0.04166667 0.20833333 0.12500000 0.08333333 0.04166667 0.04166667
[49] 0.04166667 0.04166667 0.54166667 0.58333333 0.58333333 0.50000000 0.58333333 0.50000000
[57] 0.62500000 0.37500000 0.50000000 0.54166667 0.37500000 0.58333333 0.37500000 0.54166667
[65] 0.50000000 0.54166667 0.58333333 0.37500000 0.58333333 0.41666667 0.70833333 0.50000000
[73] 0.58333333 0.45833333 0.50000000 0.54166667 0.54166667 0.66666667 0.58333333 0.37500000
[81] 0.41666667 0.37500000 0.45833333 0.62500000 0.58333333 0.62500000 0.58333333 0.50000000
[89] 0.50000000 0.50000000 0.45833333 0.54166667 0.45833333 0.37500000 0.50000000 0.45833333
[97] 0.50000000 0.50000000 0.41666667 0.50000000 1.00000000 0.75000000 0.83333333 0.70833333
[105] 0.87500000 0.83333333 0.66666667 0.70833333 0.70833333 1.00000000 0.79166667 0.75000000
[113] 0.83333333 0.79166667 0.95833333 0.91666667 0.70833333 0.87500000 0.91666667 0.58333333
[121] 0.91666667 0.79166667 0.79166667 0.70833333 0.83333333 0.70833333 0.70833333 0.70833333
[129] 0.83333333 0.62500000 0.75000000 0.79166667 0.87500000 0.58333333 0.54166667 0.91666667
[137] 0.95833333 0.70833333 0.70833333 0.83333333 0.95833333 0.91666667 0.75000000 0.91666667
[145] 1.00000000 0.91666667 0.75000000 0.79166667 0.91666667 0.70833333
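
The four repeated calculations can be wrapped in one function and applied to every numeric column. This is a minimal sketch, assuming the built-in iris data set and a helper name, min_max, introduced here for illustration. It assumes that no column is constant, since max(x) - min(x) would then be zero.

```r
# Built-in copy of the iris data; assumed to match the CSV used above.
data(iris)

# min_max is a hypothetical helper implementing (x - min) / (max - min).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Normalize the four numeric columns and leave Species unchanged.
iris_norm <- iris
iris_norm[1:4] <- lapply(iris[1:4], min_max)

# Every normalized column should now span exactly 0 to 1.
print(sapply(iris_norm[1:4], range))
print(head(iris_norm))
```

Assigning the result of lapply() back into iris_norm[1:4] keeps the result a data frame, so the Species column stays attached to the normalized values.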

3. Generate histograms for any one variable (sepal length/ sepal width/ petal
length/ petal width) and generate scatter plots for every pair of variables
showing each species in different color.
SOLUTION:

A histogram represents the frequencies of the values of a variable bucketed into ranges.

A histogram is similar to a bar chart, but it groups the values into continuous ranges.
The height of each bar represents the number of values that fall in that range.
R creates histograms using the hist() function. This function takes a vector as input and
accepts further parameters that control the plot.
Syntax
The basic syntax for creating a histogram using R is −

hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
‘v’ is a vector containing the numeric values used in the histogram.
‘main’ sets the title of the chart.
‘col’ sets the color of the bars.
‘border’ sets the border color of each bar.
‘xlab’ gives the label of the x-axis.
‘xlim’ specifies the range of values on the x-axis.
‘ylim’ specifies the range of values on the y-axis.
‘breaks’ controls the binning: either a suggested number of bins or a vector of breakpoints.
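
As an illustration of these parameters, here is a minimal sketch that assumes the built-in iris data set; the title, label, colors and number of breaks are arbitrary example values.

```r
# Built-in copy of the iris data; assumed to match the CSV used above.
data(iris)

h <- hist(iris$Sepal.Length,
          main   = "Histogram of sepal length",
          xlab   = "Sepal length (cm)",
          xlim   = c(4, 8),
          col    = "lightblue",
          border = "black",
          breaks = 8)   # a suggestion only; R may pick a nearby "pretty" number of bins

# hist() invisibly returns the breakpoints and the count in each bin.
print(h$breaks)
print(h$counts)
```

Capturing the return value in h is optional; it is done here only to show which bins R actually chose.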


# print iris
> print(iris)

> x1<-iris$Sepal.Length
> x2<-iris$Sepal.Width
> x3<-iris$Petal.Length
> x4<-iris$Petal.Width

> hist(x1)

> hist(x2)

> hist(x3)

> hist(x4)

Scatter Plots
Scatter plots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is chosen in the horizontal axis and another in the
vertical axis.
The simple scatter plot is created using the plot() function.
Syntax: The basic syntax for creating scatter plot in R is −

plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used −
‘x’ is the data set whose values are the horizontal coordinates.
‘y’ is the data set whose values are the vertical coordinates.
‘main’ is the title of the graph.
‘xlab’ is the label in the horizontal axis.
‘ylab’ is the label in the vertical axis.
‘xlim’ is the limits of the values of x used for plotting.
‘ylim’ is the limits of the values of y used for plotting.
‘axes’ indicates whether both axes should be drawn on the plot.

> plot(x = x1, y = x2, col = c("red","blue","green")[iris$Species], xlab = "sepal_length",
ylab = "sepal_width", main = "sepal_length vs sepal_width")

> plot(x = x1, y = x3, col = c("red","blue","green")[iris$Species], xlab = "sepal_length",
ylab = "petal_length", main = "sepal_length vs petal_length")

> plot(x = x1, y = x4, col = c("red","blue","green")[iris$Species], xlab = "sepal_length",
ylab = "petal_width", main = "sepal_length vs petal_width")

> plot(x = x2, y = x3, col = c("red","blue","green")[iris$Species], xlab = "sepal_width",
ylab = "petal_length", main = "sepal_width vs petal_length")

> plot(x = x2, y = x4, col = c("red","blue","green")[iris$Species], xlab = "sepal_width",
ylab = "petal_width", main = "sepal_width vs petal_width")

> plot(x = x3, y = x4, col = c("red","blue","green")[iris$Species], xlab = "petal_length",
ylab = "petal_width", main = "petal_length vs petal_width")
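
The six plots can also be drawn in a single figure. This is a minimal sketch, assuming the built-in iris data set, in which Species is a factor; the variable names species_cols and color_key are introduced here for illustration. It uses the base R pairs() function.

```r
# Built-in copy of the iris data; Species is a factor with three levels.
data(iris)

species_cols <- c("red", "blue", "green")

# pairs() draws a scatter plot for every pair of the four numeric columns.
pairs(iris[1:4],
      col  = species_cols[iris$Species],  # factor codes 1..3 select the color
      pch  = 19,
      main = "Iris: all pairs of variables by species")

# Mapping between species and color, for reading the plot.
color_key <- setNames(species_cols, levels(iris$Species))
print(color_key)
```

Indexing the color vector by a factor works because R uses the factor's integer codes; if Species were read in as plain character strings, the indexing would return NA instead of a color.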

4. Generate box plots for each of the numerical attributes. Identify the
attribute with the highest variance.
SOLUTION:

A boxplot summarizes how the values in a data set are distributed. It is based on the
quartiles, which divide the data into four equal parts, and shows the minimum, first quartile,
median, third quartile and maximum of the data set. With R's default settings, points more
than 1.5 times the interquartile range beyond the box are drawn individually as outliers.
A boxplot is also useful for comparing the distribution of data across data sets by drawing
one boxplot for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax: The basic syntax to create a boxplot in R is −

boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used −
‘x’ is a vector or a formula.
‘data’ is the data frame.
‘notch’ is a logical value. Set as TRUE to draw a notch.
‘varwidth’ is a logical value. Set as TRUE to draw the width of the box proportionate to the
sample size.
‘names’ are the group labels which will be printed under each boxplot.
‘main’ is used to give a title to the graph.

# print iris
>print(iris)

>x1<-iris$Sepal.Length
>x2<-iris$Sepal.Width
>x3<-iris$Petal.Length
>x4<-iris$Petal.Width

>boxplot(x1, xlab = "sepal_length", main = "sepal_length")

>boxplot(x2, xlab = "sepal_width", main = "sepal_width")

>boxplot(x3, xlab = "petal_length", main = "petal_length")

>boxplot(x4, xlab = "petal_width", main = "petal_width")

# Highest Variance
>var1=var(x1)
>var2=var(x2)
>var3=var(x3)
>var4=var(x4)

>max1=max(c(var1,var2,var3,var4))

>print(max1)

[1] 3.116278

# 3.116278 is var3, so x3 (Petal.Length) is the attribute with the highest variance.
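
The maximum alone does not say which attribute it belongs to. The following minimal sketch, assuming the built-in iris data set and illustrative variable names, draws all four boxplots in one figure and reports the name of the attribute with the highest variance.

```r
# Built-in copy of the iris data; assumed to match the CSV used above.
data(iris)

# One boxplot per numeric attribute, side by side on a common scale.
boxplot(iris[1:4], main = "Iris numeric attributes")

# Variance of each numeric column, as a named vector.
variances <- sapply(iris[1:4], var)
print(variances)

# which.max() gives the position of the largest value; names() gives the attribute.
highest <- names(variances)[which.max(variances)]
print(highest)
```

Plotting the attributes side by side also makes the result visible: the attribute with the largest spread has the tallest box and whiskers.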

5. Study of homogeneous and heterogeneous data structures such as vector,
matrix, array, list, data frame in R.

SOLUTION:

1) Vectors 2) Lists 3) Matrices 4) Arrays 5) Data Frames

Vectors:

To create a vector with more than one element, use the c() function, which combines its
arguments into a vector. A vector is homogeneous: all of its elements have the same type.

#Create vectors
apple <- c('red', 'green', "yellow")

print(apple)

# mixing numbers with "a" coerces every element to character
num <- c(1,5,6,1,8,11,"a")

print(num)

c1<-c(2, 3, 5)

c2<-c(TRUE, FALSE, TRUE, FALSE, FALSE)

c3<-c("aa","bb","cc","dd","ee")

length(c("aa", "bb", "cc", "dd", "ee"))

#combining vectors (the numbers are coerced to character)
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
c(n, s)

#arithmetic (a is recycled to the length of b; R warns because 5 is not a multiple of 4)
a = c(1, 3, 5, 7)
b = c(1, 2, 4, 8,9)
a+b

Matrices:
A matrix is a two-dimensional rectangular data set. It can be created by passing a vector to
the matrix() function.
#matrix
A = matrix(
c(2, 4, 3, 1, 5, 7), # the data elements
nrow=2, # number of rows
ncol=3, # number of columns
byrow = TRUE) # fill matrix by rows

# the same matrix written on one line
A = matrix( c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3, byrow = TRUE)

# without byrow = TRUE the matrix is filled column by column (the default)
A = matrix( c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3)

#An element at the mth row, nth column of A can be accessed by the expression A[m, n].

A[2, 3] # element at 2nd row, 3rd column

#The entire mth row of A can be extracted as A[m, ].

A[2, ] # the 2nd row

#Similarly, the entire nth column of A can be extracted as A[ ,n].

A[ ,3] # the 3rd column

#We can also extract more than one row or column at a time.

A[ ,c(1,3)] # the 1st and 3rd columns

t(A) # transpose of A

Arrays:

While matrices are confined to two dimensions, arrays can have any number of dimensions.
The array() function takes a dim argument that sets the size of each dimension. The example
below creates an array of four 3x3 matrices; the two values are recycled to fill all 36 elements.

a <- array(c('green','yellow'),dim = c(3,3,4))

print(a)

, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
, , 3
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 4
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"

Data Frames:

Data frames are tabular data objects. Unlike a matrix, each column of a data frame can hold
a different mode of data: the first column can be numeric while the second is character and
the third is logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

BMI <- data.frame( gender = c("Male", "Male","Female"), height = c(152, 171.5, 165),
weight = c(81, 93, 78), Age = c(42, 38, 26))

print(BMI)

gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26

View(BMI) # opens the data frame in the data viewer

Lists:
A list is an R object that can contain elements of many different types, such as vectors,
functions and even other lists.

# Create a list.
# mat2 is a 2x3 character matrix, defined here so that the list can be built
mat2 <- matrix(c("a","b","c","d","e","f"), nrow = 2)
eglist <- list(c(2,5), c('red','green',"yellow"), 21.3, BMI, mat2, sin)
print(eglist)

[[1]]
[1] 2 5
[[2]]
[1] "red" "green" "yellow"
[[3]]
[1] 21.3
[[4]]
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
[[5]]
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
[[6]]
function (x) .Primitive("sin")
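
To make the homogeneous versus heterogeneous distinction concrete, the following minimal sketch builds one small object of each kind and inspects the types it holds. The object names and values are illustrative only.

```r
v  <- c(1.5, 2, 3)                            # vector
m  <- matrix(1:6, nrow = 2)                   # matrix
a  <- array(1:24, dim = c(2, 3, 4))           # array
l  <- list(1, "a", TRUE)                      # list
df <- data.frame(n = 1:2, s = c("x", "y"))    # data frame

# Vectors, matrices and arrays are atomic: every element has the same type.
print(sapply(list(vector = v, matrix = m, array = a), is.atomic))

# A list can hold a different type in each element.
print(sapply(l, class))

# A data frame is a list of columns, so each column can have its own type.
print(is.list(df))
print(sapply(df, class))
```

In short, vectors, matrices and arrays are the homogeneous structures, while lists and data frames are the heterogeneous ones.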

6. Write an R program using the ‘apply’ group of functions to create and apply a
normalization function on each of the numeric variables/columns of the iris
dataset to transform them into values centered around 0 with z-score
normalization.
SOLUTION:
A z-score (also known as a standard score) indicates how many standard deviations an element
is from the mean. A z-score can be calculated from the following formula.

z = (X - μ) / σ

where z is the z-score, X is the value of the element, μ is the mean, and σ is the
standard deviation. The code below estimates μ and σ from the data with mean() and sd();
sd() returns the sample standard deviation.

>X1<-iris$Sepal.Length
>Zscr1<-(X1 -mean(X1))/sd(X1)
>Zscr1

[1] -0.89767388 -1.13920048 -1.38072709 -1.50149039 -1.01843718 -0.53538397
[7] -1.50149039 -1.01843718 -1.74301699 -1.13920048 -0.53538397 -1.25996379
[13] -1.25996379 -1.86378030 -0.05233076 -0.17309407 -0.53538397 -0.89767388
[19] -0.17309407 -0.89767388 -0.53538397 -0.89767388 -1.50149039 -0.89767388
[25] -1.25996379 -1.01843718 -1.01843718 -0.77691058 -0.77691058 -1.38072709
[31] -1.25996379 -0.53538397 -0.77691058 -0.41462067 -1.13920048 -1.01843718
[37] -0.41462067 -1.13920048 -1.74301699 -0.89767388 -1.01843718 -1.62225369
[43] -1.74301699 -1.01843718 -0.89767388 -1.25996379 -0.89767388 -1.50149039
[49] -0.65614727 -1.01843718 1.39682886 0.67224905 1.27606556 -0.41462067
[55] 0.79301235 -0.17309407 0.55148575 -1.13920048 0.91377565 -0.77691058
[61] -1.01843718 0.06843254 0.18919584 0.30995914 -0.29385737 1.03453895
[67] -0.29385737 -0.05233076 0.43072244 -0.29385737 0.06843254 0.30995914
[73] 0.55148575 0.30995914 0.67224905 0.91377565 1.15530226 1.03453895
[79] 0.18919584 -0.17309407 -0.41462067 -0.41462067 -0.05233076 0.18919584
[85] -0.53538397 0.18919584 1.03453895 0.55148575 -0.29385737 -0.41462067
[91] -0.41462067 0.30995914 -0.05233076 -1.01843718 -0.29385737 -0.17309407
[97] -0.17309407 0.43072244 -0.89767388 -0.17309407 0.55148575 -0.05233076
[103] 1.51759216 0.55148575 0.79301235 2.12140867 -1.13920048 1.75911877
[109] 1.03453895 1.63835547 0.79301235 0.67224905 1.15530226 -0.17309407
[115] -0.05233076 0.67224905 0.79301235 2.24217198 2.24217198 0.18919584
[121] 1.27606556 -0.29385737 2.24217198 0.55148575 1.03453895 1.63835547
[127] 0.43072244 0.30995914 0.67224905 1.63835547 1.87988207 2.48369858
[133] 0.67224905 0.55148575 0.30995914 2.24217198 0.55148575 0.67224905
[139] 0.18919584 1.27606556 1.03453895 1.27606556 -0.05233076 1.15530226
[145] 1.03453895 1.03453895 0.55148575 0.79301235 0.43072244 0.06843254

>X2<-iris$Sepal.Width
>Zscr2<-(X2 -mean(X2))/sd(X2)
>Zscr2

[1] 1.01560199 -0.13153881 0.32731751 0.09788935 1.24503015 1.93331463 0.78617383 0.78617383
[9] -0.36096697 0.09788935 1.47445831 0.78617383 -0.13153881 -0.13153881 2.16274279 3.08045544
[17] 1.93331463 1.01560199 1.70388647 1.70388647 0.78617383 1.47445831 1.24503015 0.55674567
[25] 0.78617383 -0.13153881 0.78617383 1.01560199 0.78617383 0.32731751 0.09788935 0.78617383
[33] 2.39217095 2.62159911 0.09788935 0.32731751 1.01560199 1.24503015 -0.13153881 0.78617383
[41] 1.01560199 -1.73753594 0.32731751 1.01560199 1.70388647 -0.13153881 1.70388647 0.32731751
[49] 1.47445831 0.55674567 0.32731751 0.32731751 0.09788935 -1.73753594 -0.59039513 -0.59039513
[57] 0.55674567 -1.50810778 -0.36096697 -0.81982329 -2.42582042 -0.13153881 -1.96696410 -0.36096697
[65] -0.36096697 0.09788935 -0.13153881 -0.81982329 -1.96696410 -1.27867961 0.32731751 -0.59039513
[73] -1.27867961 -0.59039513 -0.36096697 -0.13153881 -0.59039513 -0.13153881 -0.36096697 -1.04925145
[81] -1.50810778 -1.50810778 -0.81982329 -0.81982329 -0.13153881 0.78617383 0.09788935 -1.73753594
[89] -0.13153881 -1.27867961 -1.04925145 -0.13153881 -1.04925145 -1.73753594 -0.81982329 -0.13153881
[97] -0.36096697 -0.36096697 -1.27867961 -0.59039513 0.55674567 -0.81982329 -0.13153881 -0.36096697
[105] -0.13153881 -0.13153881 -1.27867961 -0.36096697 -1.27867961 1.24503015 0.32731751 -0.81982329
[113] -0.13153881 -1.27867961 -0.59039513 0.32731751 -0.13153881 1.70388647 -1.04925145 -1.96696410
[121] 0.32731751 -0.59039513 -0.59039513 -0.81982329 0.55674567 0.32731751 -0.59039513 -0.13153881
[129] -0.59039513 -0.13153881 -0.59039513 1.70388647 -0.59039513 -0.59039513 -1.04925145 -0.13153881
[137] 0.78617383 0.09788935 -0.13153881 0.09788935 0.09788935 0.09788935 -0.81982329 0.32731751
[145] 0.55674567 -0.13153881 -1.27867961 -0.13153881 0.78617383 -0.13153881

>X3<-iris$Petal.Length
>Zscr3<-(X3 -mean(X3))/sd(X3)
>Zscr3

[1] -1.33575163 -1.33575163 -1.39239929 -1.27910398 -1.33575163 -1.16580868 -1.33575163 -1.27910398
[9] -1.33575163 -1.27910398 -1.27910398 -1.22245633 -1.33575163 -1.50569459 -1.44904694 -1.27910398
[17] -1.39239929 -1.33575163 -1.16580868 -1.27910398 -1.16580868 -1.27910398 -1.56234224 -1.16580868
[25] -1.05251337 -1.22245633 -1.22245633 -1.27910398 -1.33575163 -1.22245633 -1.22245633 -1.27910398
[33] -1.27910398 -1.33575163 -1.27910398 -1.44904694 -1.39239929 -1.33575163 -1.39239929 -1.27910398
[41] -1.39239929 -1.39239929 -1.39239929 -1.22245633 -1.05251337 -1.33575163 -1.22245633 -1.33575163
[49] -1.27910398 -1.33575163 0.53362088 0.42032558 0.64691619 0.13708732 0.47697323 0.42032558
[57] 0.53362088 -0.25944625 0.47697323 0.08043967 -0.14615094 0.25038262 0.13708732 0.53362088
[65] -0.08950329 0.36367793 0.42032558 0.19373497 0.42032558 0.08043967 0.59026853 0.13708732
[73] 0.64691619 0.53362088 0.30703027 0.36367793 0.59026853 0.70356384 0.42032558 -0.14615094
[81] 0.02379201 -0.03285564 0.08043967 0.76021149 0.42032558 0.42032558 0.53362088 0.36367793
[89] 0.19373497 0.13708732 0.36367793 0.47697323 0.13708732 -0.25944625 0.25038262 0.25038262
[97] 0.25038262 0.30703027 -0.42938920 0.19373497 1.27004036 0.76021149 1.21339271 1.04344975
[105] 1.15674505 1.60992627 0.42032558 1.43998331 1.15674505 1.32668801 0.76021149 0.87350679
[113] 0.98680210 0.70356384 0.76021149 0.87350679 0.98680210 1.66657392 1.77986923 0.70356384
[121] 1.10009740 0.64691619 1.66657392 0.64691619 1.10009740 1.27004036 0.59026853 0.64691619
[129] 1.04344975 1.15674505 1.32668801 1.49663097 1.04344975 0.76021149 1.04344975 1.32668801
[137] 1.04344975 0.98680210 0.59026853 0.93015445 1.04344975 0.76021149 0.76021149 1.21339271
[145] 1.10009740 0.81685914 0.70356384 0.81685914 0.93015445 0.76021149

>X4<-iris$Petal.Width
>Zscr4<-(X4 -mean(X4))/sd(X4)
>Zscr4

[1] -1.3110521482 -1.3110521482 -1.3110521482 -1.3110521482 -1.3110521482 -1.0486667950
[7] -1.1798594716 -1.3110521482 -1.3110521482 -1.4422448248 -1.3110521482 -1.3110521482
[13] -1.4422448248 -1.4422448248 -1.3110521482 -1.0486667950 -1.0486667950 -1.1798594716
[19] -1.1798594716 -1.1798594716 -1.3110521482 -1.0486667950 -1.3110521482 -0.9174741184
[25] -1.3110521482 -1.3110521482 -1.0486667950 -1.3110521482 -1.3110521482 -1.3110521482
[31] -1.3110521482 -1.0486667950 -1.4422448248 -1.3110521482 -1.3110521482 -1.3110521482
[37] -1.3110521482 -1.4422448248 -1.3110521482 -1.3110521482 -1.1798594716 -1.1798594716
[43] -1.3110521482 -0.7862814418 -1.0486667950 -1.1798594716 -1.3110521482 -1.3110521482
[49] -1.3110521482 -1.3110521482 0.2632599711 0.3944526477 0.3944526477 0.1320672944
[55] 0.3944526477 0.1320672944 0.5256453243 -0.2615107354 0.1320672944 0.2632599711
[61] -0.2615107354 0.3944526477 -0.2615107354 0.2632599711 0.1320672944 0.2632599711
[67] 0.3944526477 -0.2615107354 0.3944526477 -0.1303180588 0.7880306775 0.1320672944
[73] 0.3944526477 0.0008746178 0.1320672944 0.2632599711 0.2632599711 0.6568380009
[79] 0.3944526477 -0.2615107354 -0.1303180588 -0.2615107354 0.0008746178 0.5256453243
[85] 0.3944526477 0.5256453243 0.3944526477 0.1320672944 0.1320672944 0.1320672944
[91] 0.0008746178 0.2632599711 0.0008746178 -0.2615107354 0.1320672944 0.0008746178
[97] 0.1320672944 0.1320672944 -0.1303180588 0.1320672944 1.7063794137 0.9192233541
[103] 1.1816087073 0.7880306775 1.3128013839 1.1816087073 0.6568380009 0.7880306775
[109] 0.7880306775 1.7063794137 1.0504160307 0.9192233541 1.1816087073 1.0504160307
[115] 1.5751867371 1.4439940605 0.7880306775 1.3128013839 1.4439940605 0.3944526477
[121] 1.4439940605 1.0504160307 1.0504160307 0.7880306775 1.1816087073 0.7880306775
[127] 0.7880306775 0.7880306775 1.1816087073 0.5256453243 0.9192233541 1.0504160307
[133] 1.3128013839 0.3944526477 0.2632599711 1.4439940605 1.5751867371 0.7880306775
[139] 0.7880306775 1.1816087073 1.5751867371 1.4439940605 0.9192233541 1.4439940605
[145] 1.7063794137 1.4439940605 0.9192233541 1.0504160307 1.4439940605 0.7880306775
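
The manual standardization above can be cross-checked against R's built-in scale() function. The following is a minimal sketch; the name Zscale is illustrative and does not appear in the original exercise.

```r
# Standardize Petal.Width two ways and confirm that they agree.
X4 <- iris$Petal.Width
Zscr4 <- (X4 - mean(X4)) / sd(X4)   # manual z-score, as in the text
Zscale <- as.numeric(scale(X4))     # built-in centring and scaling

all.equal(Zscr4, Zscale)            # TRUE when both methods match
c(mean = mean(Zscr4), sd = sd(Zscr4))  # should be 0 and 1 up to rounding
```

After standardization every variable has mean 0 and standard deviation 1, which is why the same check works for the other three iris measurements.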

7.
a) Use R to apply linear regression to predict evaporation coefficient in
terms of air velocity using the data given below:

Air Velocity (cm/sec): 20, 60, 100, 140, 180, 220, 260, 300, 340, 380
Evaporation Coefficient (sqmm/sec): 0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65

Solution:

Regression analysis is a widely used statistical tool for establishing a relationship
model between two variables. One of these variables is called the predictor variable,
whose value is gathered through experiments. The other is called the response
variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the
exponent (power) of both variables is 1. Mathematically, a linear relationship
represents a straight line when plotted as a graph. A non-linear relationship, where
the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is:
y = ax + b
Following is the description of the parameters used:

y is the response variable.
x is the predictor variable.
a and b are constants called the coefficients (a is the slope and b is the intercept).
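
To make the roles of a and b concrete, the coefficients can be computed directly from the least-squares formulas before turning to lm(). This is a sketch using the air velocity data from the exercise; the closed-form expressions are the standard ones for simple linear regression, and the variable names are illustrative.

```r
# Data from the exercise: x is the predictor, y is the response.
x <- c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380)          # air velocity
y <- c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)  # evaporation coefficient

# Least-squares estimates for y = a*x + b
a <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b <- mean(y) - a * mean(x)                                       # intercept

c(slope = a, intercept = b)
```

For these ten points the slope works out to roughly 0.00383 and the intercept to roughly 0.0692.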

lm() Function
This function creates the relationship model between the predictor and the response
variable.

Syntax
The basic syntax for lm() function in linear regression is:
lm(formula, data)
Following is the description of the parameters used:
formula is a symbolic description of the relation between x and y, written as response ~ predictor.
data is the data frame containing the variables named in the formula.
> AV <- data.frame(
+   airvelocity = c(20,60,100,140,180,220,260,300,340,380),
+   evaporationcoefficient = c(0.18,0.37,0.35,0.78,0.56,0.75,1.18,1.36,1.17,1.65)
+ )

> print(AV)
   airvelocity evaporationcoefficient
1           20                   0.18
2           60                   0.37
3          100                   0.35
4          140                   0.78
5          180                   0.56
6          220                   0.75
7          260                   1.18
8          300                   1.36
9          340                   1.17
10         380                   1.65

> model<-lm(AV$airvelocity~AV$evaporationcoefficient)
> print(model)

Call:
lm(formula = AV$airvelocity ~ AV$evaporationcoefficient)

Coefficients:
(Intercept)  AV$evaporationcoefficient
      2.564                    236.450
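
Note that the call above places air velocity on the left of `~`, so it models air velocity as a function of the evaporation coefficient. Since the exercise asks to predict the evaporation coefficient from air velocity, the following sketch fits the model in that direction. The names fit and new_velocity, and the value 190 cm/sec, are illustrative assumptions and are not part of the original solution.

```r
AV <- data.frame(
  airvelocity = c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380),
  evaporationcoefficient = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)

# Response on the left of ~, predictor on the right.
fit <- lm(evaporationcoefficient ~ airvelocity, data = AV)
coef(fit)

# Predict the evaporation coefficient for a hypothetical air velocity.
new_velocity <- data.frame(airvelocity = 190)
pred <- predict(fit, newdata = new_velocity)
pred
```

Passing the data frame through the data argument, instead of writing AV$..., is what allows predict() to accept new values by column name.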

b) Analyze the significance of residual standard-error value, R-squared
value, F-statistic. Find the correlation coefficient for this data and
analyze the significance of the correlation value.

Solution:
Residual Standard Error is a measure of the quality of a linear regression fit. It is
the average amount by which the response will deviate from the true regression line.

The R-squared statistic provides a measure of how well the model fits the actual
data: it is the proportion of the variance in the response that the model explains.

The F-statistic is a good indicator of whether there is a relationship between the
predictor and the response variable.
A question worth considering: can we have an F value indicating a strong relationship
that is non-linear, so that the RSE is high and the R-squared is low?

> summary(model)

Call:
lm(formula = AV$airvelocity ~ AV$evaporationcoefficient)

Residuals:
Min 1Q Median 3Q Max
-46.99 -24.88 -17.14 33.74 60.79

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                  2.564     25.804   0.099    0.923
AV$evaporationcoefficient  236.450     27.035   8.746 2.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.53 on 8 degrees of freedom

Multiple R-squared: 0.9053, Adjusted R-squared: 0.8935
F-statistic: 76.49 on 1 and 8 DF, p-value: 2.286e-05

> cor(AV$airvelocity,AV$evaporationcoefficient)

[1] 0.9514814

> y <- cor(AV$airvelocity, AV$evaporationcoefficient)
> y

[1] 0.9514814
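
The individual statistics discussed above can also be pulled out of the summary object, which makes them easier to compare. This is a sketch that reuses the same model call as the text; the names s, r and p are illustrative, and cor.test() is used here as one possible way to test the significance of the correlation.

```r
AV <- data.frame(
  airvelocity = c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380),
  evaporationcoefficient = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)
model <- lm(AV$airvelocity ~ AV$evaporationcoefficient)  # same call as in the text

s <- summary(model)
s$sigma        # residual standard error
s$r.squared    # multiple R-squared
s$fstatistic   # F value with numerator and denominator degrees of freedom

r <- cor(AV$airvelocity, AV$evaporationcoefficient)
r^2            # with a single predictor this equals the R-squared

# Significance test for the correlation coefficient
p <- cor.test(AV$airvelocity, AV$evaporationcoefficient)$p.value
p
```

Reading the output shown in the text: an R-squared of 0.9053 means the model explains about 90.5% of the variance, the F-statistic p-value of 2.286e-05 is far below 0.05 so the relationship is statistically significant, and a correlation of 0.9515 indicates a strong positive linear association.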

c) Perform a log transformation on the ‘Air Velocity’ column, perform
linear regression again, and analyze all the relevant values.
Solution:

> x <- log(AV$airvelocity)
> x
[1] 2.995732 4.094345 4.605170 4.941642 5.192957 5.393628 5.560682 5.703782
[9] 5.828946 5.940171

> m <- lm(x ~ AV$evaporationcoefficient)
> m
Call:
lm(formula = x ~ AV$evaporationcoefficient)

Coefficients:
(Intercept) AV$evaporationcoefficient
3.678 1.614
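
The exercise also asks for an analysis of the refitted model. The following is a minimal sketch, assuming the evaporation coefficient is treated as the response (the direction the exercise describes, which differs from the call above). The names fit_lin, fit_log and comparison are illustrative.

```r
AV <- data.frame(
  airvelocity = c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380),
  evaporationcoefficient = c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
)

fit_lin <- lm(evaporationcoefficient ~ airvelocity, data = AV)
fit_log <- lm(evaporationcoefficient ~ log(airvelocity), data = AV)

summary(fit_log)  # coefficients, residual standard error, R-squared, F-statistic

# Side-by-side comparison of the two fits
comparison <- data.frame(
  model = c("linear", "log(air velocity)"),
  r_squared = c(summary(fit_lin)$r.squared, summary(fit_log)$r.squared),
  rse = c(summary(fit_lin)$sigma, summary(fit_log)$sigma)
)
comparison
```

A lower R-squared and a higher residual standard error for the log model would suggest that the transformation does not improve the fit for these ten points.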

WEKA Knowledge Extraction toolkit:

Introduction
WEKA is a data mining system developed by the University of Waikato in New Zealand.
It is a collection of machine learning algorithms for data mining tasks and a facility for
developing machine learning (ML) techniques and applying them to real-world data mining
problems. The algorithms are applied directly to a dataset. WEKA implements algorithms for
data preprocessing, classification, regression, clustering and association rules; it also
includes visualization tools. New machine learning schemes can also be developed with this
package. WEKA is open source software issued under the GNU General Public License. The goal
of this tutorial is to help you learn WEKA Explorer. The tutorial will guide you step by step
through the analysis of a simple problem using the WEKA Explorer preprocessing,
classification, clustering, association, attribute selection, and visualization tools. At the
end of each problem there is a representation of the results with explanations side by side.
Each part concludes with an exercise for individual practice. By the time you reach the end
of this tutorial, you will be able to analyze your data with WEKA Explorer using various
learning schemes and interpret the results. Before starting, you should be familiar with data
mining algorithms such as C4.5 (C5), ID3, K-means, and Apriori.

2. Launching WEKA Explorer

You can launch WEKA from the C:\Program Files directory, by selecting its icon on your
desktop, or from the Windows task bar: ‘Start’ -> ‘Programs’ -> ‘Weka 3.8’. When the ‘WEKA GUI
Chooser’ window appears on the screen, you can select one of the four options at the bottom of
the window:

1. Simple CLI provides a simple command-line interface and allows direct execution of Weka
commands.
2. Explorer is an environment for exploring data.
3. Experimenter is an environment for performing experiments and conducting statistical tests
between learning schemes.
4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning
experiments.

For the exercises in this tutorial you will use ‘Explorer’. Click the ‘Explorer’ button in the
‘WEKA GUI Chooser’ window.

The ‘WEKA Explorer’ window appears on the screen.

1. Demonstration of preprocessing on dataset student.arff

Aim: This experiment illustrates some of the basic data preprocessing operations that
can be performed using WEKA Explorer. The sample dataset used for this example is the
student data available in ARFF format.
Step1: Loading the data. We can load the dataset into WEKA by clicking on the open button in
the preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, WEKA will recognize the attributes, and during the scan of
the data WEKA will compute some basic statistics on each attribute. The left panel in the
above figure shows the list of recognized attributes, while the top panel indicates the
names of the base relation or table and the current working relation (which are the same
initially).
Step3: Clicking on an attribute in the left panel will show the basic statistics on that
attribute. For categorical attributes the frequency of each attribute value is shown,
while for continuous attributes we can obtain min, max, mean, standard deviation, etc.
Step4: The visualization in the bottom right panel is in the form of a cross-tabulation across
two attributes.

Note: we can select another attribute using the dropdown list.

Step5: Selecting or filtering attributes
Removing an attribute: When we need to remove an attribute, we can do this by using
the attribute filters in WEKA. In the filter panel, click the choose button. This will
show a popup window with a list of available filters.
Scroll down the list and select the “weka.filters.unsupervised.attribute.Remove” filter.
Step 6: a) Next click the textbox immediately to the right of the choose button. In the
resulting dialog box enter the index of the attribute to be filtered out.
b) Make sure that the invert selection option is set to false. Then click OK. In the filter
box you will see “Remove -R 7”.
c) Click the apply button to apply the filter to this data. This will remove the attribute and
create a new working relation.
d) Save the new working relation as an ARFF file by clicking the save button on the
top (button) panel (student.arff).
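
For comparison with the GUI steps above, removing an attribute by index can be sketched in R on a data frame. The three rows below are taken from student.arff; the index 4 and the names drop_index and reduced are hypothetical choices for illustration only, and this is not how WEKA itself implements the Remove filter.

```r
# A few rows of the student data as an R data frame.
student <- data.frame(
  age = c("<30", "<30", "30-40"),
  income = c("high", "high", "high"),
  student = c("no", "no", "no"),
  credit_rating = c("fair", "excellent", "fair"),
  buyspc = c("no", "no", "yes")
)

drop_index <- 4                                   # hypothetical attribute index
reduced <- student[, -drop_index, drop = FALSE]   # negative index removes that column
names(reduced)
```

As in WEKA, the original data is left untouched and the result is a new working copy without the selected attribute.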

Discretization
1) Sometimes association rule mining can only be performed on categorical data. This
requires performing discretization on numeric or continuous attributes. In the following
example let us discretize the age attribute.

• Let us divide the values of the age attribute into three bins (intervals).
• First load the dataset into WEKA (student.arff).
• Select the age attribute.
• Activate the filter dialog box and select “weka.filters.unsupervised.attribute.Discretize”
from the list.
• To change the defaults for the filter, click on the box immediately to the right of the
choose button.
• We enter the index for the attribute to be discretized. In this case the attribute is age, so
we must enter ‘1’ corresponding to the age attribute.
• Enter ‘3’ as the number of bins. Leave the remaining field values as they are.
• Click the OK button.
• Click on apply in the filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins.
• Save the new working relation in a file called student-data-discretized.arff
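
The idea of dividing a numeric attribute into three equal-width bins can be sketched in R with cut(). The ages below are hypothetical numeric values made up for illustration (they are not taken from student.arff), and the exact bin boundaries WEKA produces may differ slightly from those of cut().

```r
# Hypothetical numeric ages, for illustration only.
age <- c(22, 25, 31, 38, 41, 47, 52, 29, 35, 60)

# breaks = 3 splits the observed range into three intervals of equal width.
age_binned <- cut(age, breaks = 3)

levels(age_binned)   # the three interval labels
table(age_binned)    # number of instances falling in each bin
```

The result is a factor, that is, a categorical attribute, which is the form that association rule mining expects.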

Dataset student.arff

@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%

The following screenshot shows the effect of discretization.

2. Demonstration of Association rule process on dataset test.arff using Apriori algorithm
Aim: This experiment illustrates some of the basic elements of association rule mining
using WEKA. The sample dataset used for this example is test.arff.
Step1: Open the data file in WEKA Explorer. It is presumed that the required data fields
have been discretized. In this example it is the age attribute.
Step2: Clicking on the Associate tab will bring up the interface for the association rule
algorithms.
Step3: We will use the Apriori algorithm. This is the default algorithm.
Step4: In order to change the parameters for the run (for example support, confidence, etc.) we
click on the text box immediately to the right of the choose button.

Dataset test.arff
@relation test
@attribute admissionyear {2005,2006,2007,2008,2009,2010}
@attribute course {cse,mech,it,ece}
@data
%
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
%
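
The support and confidence parameters mentioned in Step4 can be illustrated by computing them by hand for a single rule on the test.arff data. This is a sketch; the rule "admissionyear = 2005 => course = cse" is chosen here only as an example and is not claimed to be one of the rules WEKA reports.

```r
# The twelve instances of test.arff as an R data frame.
test_data <- data.frame(
  admissionyear = c(2005, 2005, 2005, 2006, 2006, 2006,
                    2007, 2007, 2008, 2008, 2009, 2009),
  course = c("cse", "it", "cse", "mech", "it", "ece",
             "it", "cse", "it", "cse", "it", "ece")
)
n <- nrow(test_data)

# Example rule: admissionyear = 2005  =>  course = cse
lhs <- test_data$admissionyear == 2005
rhs <- test_data$course == "cse"

support <- sum(lhs & rhs) / n             # fraction of all instances matching both sides
confidence <- sum(lhs & rhs) / sum(lhs)   # fraction of LHS matches that also match the RHS

c(support = support, confidence = confidence)
```

Here two of the twelve instances match both sides (support 2/12) and two of the three 2005 instances are cse (confidence 2/3). Apriori keeps only rules whose support and confidence exceed the thresholds set in the parameter dialog.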

The following screenshot shows the association rules that were generated when apriori
algorithm is applied on the given dataset.

3. Demonstration of classification rule process on dataset employee.arff using J48 algorithm
Aim: This experiment illustrates the use of the J48 classifier in WEKA. The sample data set
used in this experiment is the “employee” data available in ARFF format. This document
assumes that appropriate data preprocessing has been performed.
Steps involved in this experiment:
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the
“J48” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the
text box to the right of the Choose button. In this example, we accept the default values. The
default version does perform some pruning but does not perform reduced-error pruning.
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don’t have a separate evaluation data set, this is
necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well
as the evaluation statistics will appear in the right panel when the model construction is
complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting the parameters for the
classification).
Step 7: WEKA also lets us view a graphical version of the classification tree. This
can be done by right-clicking the last result set and selecting “Visualize tree” from the
pop-up menu.
Step 8: We will use our model to classify new instances.
Step 9: In the main panel under “Test options”, click the “Supplied test set” radio button
and then click the “Set” button. This will pop up a window which will allow you to open
the file containing test instances.

Data set employee.arff:

@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k,15k,17k,20k,25k,30k,32k,34k,35k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 34k, good
48, 32k, good
%
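
The 10-fold cross-validation chosen in the test options can be sketched in R on the employee data above. Note the assumptions: a trivial majority-class predictor stands in for J48 purely to show the mechanics of the folds, the split is random and not stratified, and the seed value is arbitrary. The resulting accuracy is therefore not comparable to the figure WEKA reports.

```r
employee <- data.frame(
  age = c(25, 27, 27, 28, 29, 30, 29, 30, 35, 48, 48),
  salary = c("10k", "15k", "17k", "17k", "20k", "25k", "25k", "20k", "32k", "34k", "32k"),
  performance = c("poor", "poor", "poor", "poor", "avg", "avg", "avg", "avg",
                  "good", "good", "good")
)

set.seed(1)   # arbitrary seed, for a reproducible split
k <- 10
# Assign every instance to one of k folds, as evenly as possible.
folds <- sample(rep(seq_len(k), length.out = nrow(employee)))

correct <- 0
for (i in seq_len(k)) {
  train <- employee[folds != i, ]      # k - 1 folds for training
  held_out <- employee[folds == i, ]   # remaining fold for testing
  # Stand-in "model": always predict the most frequent class in the training folds.
  majority <- names(which.max(table(train$performance)))
  correct <- correct + sum(held_out$performance == majority)
}

accuracy <- correct / nrow(employee)
accuracy
```

Every instance is used for testing exactly once, which is why cross-validation gives a usable accuracy estimate even without a separate evaluation set.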

The following screenshot shows the classification rules that were generated when the J48
algorithm is applied on the given dataset.

4. Demonstration of classification rule process on dataset employee.arff using naïve Bayes algorithm
Aim: This experiment illustrates the use of the naïve Bayes classifier in WEKA. The sample
data set used in this experiment is the “employee” data available in ARFF format. This
document assumes that appropriate data preprocessing has been performed.
Steps involved in this experiment:
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the
“NaiveBayes” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the
text box to the right of the Choose button. In this example, we accept the default values.
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation
as our evaluation approach. Since we don’t have a separate evaluation data set, this is
necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The text description of the model as well
as the evaluation statistics will appear in the right panel when the model construction is
complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that
more work may be needed (either in preprocessing or in selecting the parameters for the
classification).
Step 7: Naïve Bayes does not build a tree, so the “Visualize tree” option used with J48 is
not available here; the model is read from the text output in the right panel.
Step 8: We will use our model to classify new instances.
Step 9: In the main panel under “Test options”, click the “Supplied test set” radio button
and then click the “Set” button. This will show a pop-up window which will allow you to
open the file containing test instances.

Data set employee.arff:

@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k,15k,17k,20k,25k,30k,32k,34k,35k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 34k, good
48, 32k, good
%
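
To show what a naïve Bayes classifier computes on this data, the following sketch works through the calculation by hand in R. The assumptions are: both attributes are treated as nominal, add-one (Laplace) smoothing is applied over the values observed in the data, and the new instance (age 29, salary 20k) is hypothetical. The helper cond_prob is illustrative, and the numbers are not claimed to match WEKA's output exactly.

```r
employee <- data.frame(
  age = as.character(c(25, 27, 27, 28, 29, 30, 29, 30, 35, 48, 48)),
  salary = c("10k", "15k", "17k", "17k", "20k", "25k", "25k", "20k", "32k", "34k", "32k"),
  performance = c("poor", "poor", "poor", "poor", "avg", "avg", "avg", "avg",
                  "good", "good", "good")
)

# Class priors P(class)
prior <- c(table(employee$performance)) / nrow(employee)

# P(attribute = value | class) with add-one smoothing
cond_prob <- function(attr, value, cls) {
  in_class <- employee$performance == cls
  n_values <- length(unique(employee[[attr]]))
  (sum(employee[[attr]][in_class] == value) + 1) / (sum(in_class) + n_values)
}

# Score a hypothetical new instance: age 29, salary 20k
score <- sapply(names(prior), function(cls) {
  prior[[cls]] * cond_prob("age", "29", cls) * cond_prob("salary", "20k", cls)
})
posterior <- score / sum(score)   # normalize so the class probabilities sum to 1

posterior
names(which.max(posterior))       # predicted class
```

The "naïve" part is the multiplication of the per-attribute probabilities, which assumes the attributes are independent given the class.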

The following screenshot shows the classification rules that were generated when naive
bayes algorithm is applied on the given dataset.

5. Demonstration of clustering rule process on dataset iris.arff using simple k-means
Aim: This experiment illustrates the use of simple k-means clustering with WEKA
Explorer. The sample data set used for this example is based on the iris data available in
ARFF format. This document assumes that appropriate preprocessing has been
performed. This iris dataset includes 150 instances.

Steps involved in this Experiment

Step 1: Run the WEKA Explorer and load the data file iris.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the ‘cluster’ tab in the Explorer and click on
the choose button. This step results in a dropdown list of available clustering algorithms.
Step 3: In this case we select ‘simple k-means’.
Step 4: Next, click the text box to the right of the choose button to get the popup window
shown in the screenshots. In this window we enter six as the number of clusters and
leave the value of the seed as it is. The seed value is used in generating a random
number, which is used for making the initial assignments of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. We must
make sure that, in the ‘cluster mode’ panel, the ‘use training set’ option is
selected, and then we click the ‘start’ button. This process and the resulting window are
shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the
number and the percentage of instances assigned to the different clusters. Here the cluster
centroids are the mean vectors of each cluster. These centroids can be used to characterize
the clusters. For example, the centroid of cluster 1 shows the class iris.versicolor, with a
mean sepal length of 5.4706, sepal width 2.4765, petal width 1.1294 and petal length 3.7941.
Step 7: Another way of understanding the characteristics of each cluster is through
visualization. We can do this by right-clicking the result set in the result list panel and
selecting ‘visualize cluster assignments’.
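
The same kind of clustering can be sketched in R with the built-in kmeans() function and the iris data that ships with R. The seed value and nstart = 10 are arbitrary choices made here for reproducibility; because R's initialization differs from WEKA's SimpleKMeans, the centroids and cluster sizes will generally not match the WEKA output described above.

```r
set.seed(10)                    # arbitrary seed, so the run is reproducible
features <- iris[, 1:4]         # the four measurements, without the Species label

# Six clusters, as in the WEKA example; nstart repeats the random initialization.
km <- kmeans(features, centers = 6, nstart = 10)

km$centers                                   # centroid (mean vector) of each cluster
km$size                                      # number of instances per cluster
round(100 * km$size / nrow(features), 1)     # percentage of instances per cluster

# Cross-tabulate the cluster assignments against the known species.
table(cluster = km$cluster, species = iris$Species)
```

The cross-tabulation plays the role of the ‘visualize cluster assignments’ view: it shows which species dominate each cluster.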

The following screenshot shows the clustering rules that were generated when the simple
k-means algorithm is applied on the given dataset.

Interpretation of the above visualization

From the above visualization, we can understand the distribution of sepal length and petal
length in each cluster. For instance, each cluster is dominated by petal length. In this
case, by changing the color dimension to other attributes we can see their distribution
within each of the clusters.
Step 8: We can also save the resulting dataset, which includes each instance along with its
assigned cluster. To do so, we click the save button in the visualization window and save the
result as iris k-mean. The top portion of this file is shown in the following figure.
