
KNOWLEDGE ENGINEERING LAB

Exploratory Data Analysis using R

Introduction to R Programming
In this R programming tutorial, we are going to learn what is R statistics, introduction to
R Programming, R programming examples, r programming for data science various R software
editors like RGui and R Studio and their components. We will also learn R
Features, Applications of R programming, how to develop R scripts with the help of examples.
You will get a good idea to learn r programming for data science. Let’s begin with the
introduction to R Programming Tutorial.

What is R Programming Language?


The R language is an open-source project maintained by the R core development team, a team of volunteer developers from across the globe. R is used for performing statistical operations and is available from the R Project website, www.r-project.org. R is a command-line driven program: the user enters commands at the prompt (> by default) and each command is executed one at a time.

Many routines have been written for R analytics by people all over the world and made freely
available from the R project Website as packages. However, the basic installation (for Linux,
Windows, or Mac) contains a powerful set of tools for most purposes.

R is a consolidated environment for performing statistical operations and generating R data analysis reports in graphical or text formats. R commands entered in the console are evaluated and executed. R cannot handle certain auto-formatting characters such as en-dashes or smart quotes; therefore, you need to be careful while copying and pasting commands into R from other applications. Let us now learn something about the history of R in this Introduction to R Programming.

History of R language
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is an implementation of the S programming language (developed by John Chambers and colleagues at Bell Laboratories) combined with lexical scoping semantics inspired by Scheme. R was named partly after the first names of its two authors. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000. Let us also understand, in this Introduction to R Programming tutorial, why one should learn R programming.


R-Studio:
R-Studio is an integrated development environment (IDE) for R language. R-Studio is a code
editor and development environment, with some nice features that make code development in R
easy and fun.

a) Features of R-Studio

Code highlighting that gives different colors to keywords and variables, making it easier to read
Automatic bracket matching
Code completion, which reduces the effort of typing commands in full
Easy access to R Help, with additional features for exploring functions and their parameters
R-Studio is available free of charge for Linux, Windows, and Mac, which makes it a good option to use with R. To open R-Studio, click the R-Studio icon in the menu system or on the desktop.

Components of R-Studio

Source – Top left corner of the screen contains a text editor that lets the user work with source
script files. Multiple lines of code can also be entered here. Users can save R script file to disk
and perform other tasks on the script.

Console – Bottom left corner is the R console window. The console in R-Studio is identical to
the console in RGui. All the interactive work of R programming is performed in this window.

Workspace and History – The top right corner is the R workspace and history window. This
provides an overview of the workspace, where the variables created in the session along with
their values can be inspected. This is also the area where the user can see a history of the
commands issued in R.

Files, Plots, Packages, and Help – The bottom right corner gives access to the following tools:

Files – This is where the user can browse folders and files on a computer.
Plots – This is where R displays the user’s plots.
Packages – This is where the user can view a list of all the installed packages.
Help – This is where you can browse the built-in Help system of R.
Next in this Introduction to R Programming is R scripting and sourcing a script in R.


1. Load the ‘iris.csv’ file and display the name and type of each column. Find statistics such as min, max, range, mean, median, variance, and standard deviation for each column of data.
SOLUTION:

iris <- read.csv("C:/Desktop/R LAB PROGRAMS/iris.csv", header=TRUE, sep=",")


# print iris

print(iris)

#display col names


colnames( iris )

[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

#display each col types


str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

#min, max, range, mean, median, variance, standard deviation


print (min(iris$Sepal.Length))

[1] 4.3
> print (max(iris$Sepal.Length))
[1] 7.9
> print (range(iris$Sepal.Length))
[1] 4.3 7.9
> print (mean(iris$Sepal.Length))
[1] 5.843333
> print (median(iris$Sepal.Length))
[1] 5.8
> print (var(iris$Sepal.Length))

[1] 0.6856935
> print (sd(iris$Sepal.Length))
[1] 0.8280661


print (min(iris$Sepal.Width))
[1] 2
> print (max(iris$Sepal.Width))
[1] 4.4
> print (range(iris$Sepal.Width))
[1] 2.0 4.4
> print (mean(iris$Sepal.Width))
[1] 3.057333
> print (median(iris$Sepal.Width))
[1] 3
> print (var(iris$Sepal.Width))
[1] 0.1899794
> print (sd(iris$Sepal.Width))
[1] 0.4358663

print (min(iris$Petal.Length))
[1] 1
> print (max(iris$Petal.Length))
[1] 6.9
> print (range(iris$Petal.Length))
[1] 1.0 6.9
> print (mean(iris$Petal.Length))
[1] 3.758
> print (median(iris$Petal.Length))
[1] 4.35
> print (var(iris$Petal.Length))
[1] 3.116278
> print (sd(iris$Petal.Length))
[1] 1.765298

print (min(iris$Petal.Width))
[1] 0.1
> print (max(iris$Petal.Width))
[1] 2.5
> print (range(iris$Petal.Width))
[1] 0.1 2.5
> print (mean(iris$Petal.Width))
[1] 1.199333
> print (median(iris$Petal.Width))


[1] 1.3
> print (var(iris$Petal.Width))
[1] 0.5810063
> print (sd(iris$Petal.Width))

[1] 0.7622377
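All of these statistics can also be computed for every numeric column in one step; a minimal sketch using sapply (assuming the iris data frame loaded above; num_cols and stats are just illustrative names, and "range" here is taken as max minus min):

# keep only the numeric columns (drops the Species factor)
num_cols <- iris[, sapply(iris, is.numeric)]
# compute each statistic for every column at once
stats <- sapply(num_cols, function(x)
  c(min = min(x), max = max(x), range = diff(range(x)),
    mean = mean(x), median = median(x), var = var(x), sd = sd(x)))
print(round(stats, 4))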

2. Write an R program to normalize the variables into a 0 to 1 scale using min-max normalization.
SOLUTION:

What is min-max normalization?

Min-max normalization is a normalization strategy that linearly transforms x to y = (x - min)/(max - min), where min and max are the minimum and maximum values in X, the set of observed values of x. It is easy to see that when x = min, then y = 0, and when x = max, then y = 1.

A data frame can be normalized using the min-max normalization technique, which applies the following formula to each value of the features to be normalized. This technique is traditionally used with K-Nearest Neighbors (KNN) classification problems.

(x-min(x))/(max(x)-min(x))

>x<-iris$Sepal.Length
>normalized = (x-min(x))/(max(x)-min(x))
>print(normalized)

[1] 0.22222222 0.16666667 0.11111111 0.08333333 0.19444444 0.30555556 0.08333333


[8] 0.19444444 0.02777778 0.16666667 0.30555556 0.13888889 0.13888889 0.00000000
[15] 0.41666667 0.38888889 0.30555556 0.22222222 0.38888889 0.22222222 0.30555556
[22] 0.22222222 0.08333333 0.22222222 0.13888889 0.19444444 0.19444444 0.25000000
[29] 0.25000000 0.11111111 0.13888889 0.30555556 0.25000000 0.33333333 0.16666667
[36] 0.19444444 0.33333333 0.16666667 0.02777778 0.22222222 0.19444444 0.05555556
[43] 0.02777778 0.19444444 0.22222222 0.13888889 0.22222222 0.08333333 0.27777778
[50] 0.19444444 0.75000000 0.58333333 0.72222222 0.33333333 0.61111111 0.38888889

[57] 0.55555556 0.16666667 0.63888889 0.25000000 0.19444444 0.44444444 0.47222222


[64] 0.50000000 0.36111111 0.66666667 0.36111111 0.41666667 0.52777778 0.36111111
[71] 0.44444444 0.50000000 0.55555556 0.50000000 0.58333333 0.63888889 0.69444444
[78] 0.66666667 0.47222222 0.38888889 0.33333333 0.33333333 0.41666667 0.47222222
[85] 0.30555556 0.47222222 0.66666667 0.55555556 0.36111111 0.33333333 0.33333333
[92] 0.50000000 0.41666667 0.19444444 0.36111111 0.38888889 0.38888889 0.52777778


[99] 0.22222222 0.38888889 0.55555556 0.41666667 0.77777778 0.55555556 0.61111111


[106] 0.91666667 0.16666667 0.83333333 0.66666667 0.80555556 0.61111111 0.58333333
[113] 0.69444444 0.38888889 0.41666667 0.58333333 0.61111111 0.94444444 0.94444444
[120] 0.47222222 0.72222222 0.36111111 0.94444444 0.55555556 0.66666667 0.80555556
[127] 0.52777778 0.50000000 0.58333333 0.80555556 0.86111111 1.00000000 0.58333333
[134] 0.55555556 0.50000000 0.94444444 0.55555556 0.58333333 0.47222222 0.72222222
[141] 0.66666667 0.72222222 0.41666667 0.69444444 0.66666667 0.66666667 0.55555556
[148] 0.61111111 0.52777778 0.44444444

>x1<-iris$Sepal.Width
>normalized = (x1-min(x1))/(max(x1)-min(x1))
>print(normalized)

[1] 0.62500000 0.41666667 0.50000000 0.45833333 0.66666667 0.79166667 0.58333333 0.58333333


[9] 0.37500000 0.45833333 0.70833333 0.58333333 0.41666667 0.41666667 0.83333333 1.00000000
[17] 0.79166667 0.62500000 0.75000000 0.75000000 0.58333333 0.70833333 0.66666667 0.54166667
[25] 0.58333333 0.41666667 0.58333333 0.62500000 0.58333333 0.50000000 0.45833333 0.58333333
[33] 0.87500000 0.91666667 0.45833333 0.50000000 0.62500000 0.66666667 0.41666667 0.58333333
[41] 0.62500000 0.12500000 0.50000000 0.62500000 0.75000000 0.41666667 0.75000000 0.50000000
[49] 0.70833333 0.54166667 0.50000000 0.50000000 0.45833333 0.12500000 0.33333333 0.33333333
[57] 0.54166667 0.16666667 0.37500000 0.29166667 0.00000000 0.41666667 0.08333333 0.37500000
[65] 0.37500000 0.45833333 0.41666667 0.29166667 0.08333333 0.20833333 0.50000000 0.33333333
[73] 0.20833333 0.33333333 0.37500000 0.41666667 0.33333333 0.41666667 0.37500000 0.25000000
[81] 0.16666667 0.16666667 0.29166667 0.29166667 0.41666667 0.58333333 0.45833333 0.12500000
[89] 0.41666667 0.20833333 0.25000000 0.41666667 0.25000000 0.12500000 0.29166667 0.41666667
[97] 0.37500000 0.37500000 0.20833333 0.33333333 0.54166667 0.29166667 0.41666667 0.37500000
[105] 0.41666667 0.41666667 0.20833333 0.37500000 0.20833333 0.66666667 0.50000000 0.29166667
[113] 0.41666667 0.20833333 0.33333333 0.50000000 0.41666667 0.75000000 0.25000000 0.08333333
[121] 0.50000000 0.33333333 0.33333333 0.29166667 0.54166667 0.50000000 0.33333333 0.41666667
[129] 0.33333333 0.41666667 0.33333333 0.75000000 0.33333333 0.33333333 0.25000000 0.41666667
[137] 0.58333333 0.45833333 0.41666667 0.45833333 0.45833333 0.45833333 0.29166667 0.50000000
[145] 0.54166667 0.41666667 0.20833333 0.41666667 0.58333333 0.41666667

>x2<-iris$Petal.Length
>normalized = (x2-min(x2))/(max(x2)-min(x2))
>print(normalized)

[1] 0.06779661 0.06779661 0.05084746 0.08474576 0.06779661 0.11864407 0.06779661 0.08474576


[9] 0.06779661 0.08474576 0.08474576 0.10169492 0.06779661 0.01694915 0.03389831 0.08474576
[17] 0.05084746 0.06779661 0.11864407 0.08474576 0.11864407 0.08474576 0.00000000 0.11864407
[25] 0.15254237 0.10169492 0.10169492 0.08474576 0.06779661 0.10169492 0.10169492 0.08474576
[33] 0.08474576 0.06779661 0.08474576 0.03389831 0.05084746 0.06779661 0.05084746 0.08474576


[41] 0.05084746 0.05084746 0.05084746 0.10169492 0.15254237 0.06779661 0.10169492 0.06779661


[49] 0.08474576 0.06779661 0.62711864 0.59322034 0.66101695 0.50847458 0.61016949 0.59322034
[57] 0.62711864 0.38983051 0.61016949 0.49152542 0.42372881 0.54237288 0.50847458 0.62711864
[65] 0.44067797 0.57627119 0.59322034 0.52542373 0.59322034 0.49152542 0.64406780 0.50847458
[73] 0.66101695 0.62711864 0.55932203 0.57627119 0.64406780 0.67796610 0.59322034 0.42372881
[81] 0.47457627 0.45762712 0.49152542 0.69491525 0.59322034 0.59322034 0.62711864 0.57627119
[89] 0.52542373 0.50847458 0.57627119 0.61016949 0.50847458 0.38983051 0.54237288 0.54237288
[97] 0.54237288 0.55932203 0.33898305 0.52542373 0.84745763 0.69491525 0.83050847 0.77966102
[105] 0.81355932 0.94915254 0.59322034 0.89830508 0.81355932 0.86440678 0.69491525 0.72881356
[113] 0.76271186 0.67796610 0.69491525 0.72881356 0.76271186 0.96610169 1.00000000 0.67796610
[121] 0.79661017 0.66101695 0.96610169 0.66101695 0.79661017 0.84745763 0.64406780 0.66101695
[129] 0.77966102 0.81355932 0.86440678 0.91525424 0.77966102 0.69491525 0.77966102 0.86440678
[137] 0.77966102 0.76271186 0.64406780 0.74576271 0.77966102 0.69491525 0.69491525 0.83050847
[145] 0.79661017 0.71186441 0.67796610 0.71186441 0.74576271 0.69491525

>x3<-iris$Petal.Width
>normalized = (x3-min(x3))/(max(x3)-min(x3))
>print(normalized)

[1] 0.04166667 0.04166667 0.04166667 0.04166667 0.04166667 0.12500000 0.08333333 0.04166667


[9] 0.04166667 0.00000000 0.04166667 0.04166667 0.00000000 0.00000000 0.04166667 0.12500000
[17] 0.12500000 0.08333333 0.08333333 0.08333333 0.04166667 0.12500000 0.04166667 0.16666667
[25] 0.04166667 0.04166667 0.12500000 0.04166667 0.04166667 0.04166667 0.04166667 0.12500000
[33] 0.00000000 0.04166667 0.04166667 0.04166667 0.04166667 0.00000000 0.04166667 0.04166667
[41] 0.08333333 0.08333333 0.04166667 0.20833333 0.12500000 0.08333333 0.04166667 0.04166667
[49] 0.04166667 0.04166667 0.54166667 0.58333333 0.58333333 0.50000000 0.58333333 0.50000000
[57] 0.62500000 0.37500000 0.50000000 0.54166667 0.37500000 0.58333333 0.37500000 0.54166667
[65] 0.50000000 0.54166667 0.58333333 0.37500000 0.58333333 0.41666667 0.70833333 0.50000000
[73] 0.58333333 0.45833333 0.50000000 0.54166667 0.54166667 0.66666667 0.58333333 0.37500000
[81] 0.41666667 0.37500000 0.45833333 0.62500000 0.58333333 0.62500000 0.58333333 0.50000000
[89] 0.50000000 0.50000000 0.45833333 0.54166667 0.45833333 0.37500000 0.50000000 0.45833333
[97] 0.50000000 0.50000000 0.41666667 0.50000000 1.00000000 0.75000000 0.83333333 0.70833333
[105] 0.87500000 0.83333333 0.66666667 0.70833333 0.70833333 1.00000000 0.79166667 0.75000000
[113] 0.83333333 0.79166667 0.95833333 0.91666667 0.70833333 0.87500000 0.91666667 0.58333333
[121] 0.91666667 0.79166667 0.79166667 0.70833333 0.83333333 0.70833333 0.70833333 0.70833333
[129] 0.83333333 0.62500000 0.75000000 0.79166667 0.87500000 0.58333333 0.54166667 0.91666667
[137] 0.95833333 0.70833333 0.70833333 0.83333333 0.95833333 0.91666667 0.75000000 0.91666667
[145] 1.00000000 0.91666667 0.75000000 0.79166667 0.91666667 0.70833333
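The four columns can also be normalized in one step by wrapping the formula in a function and applying it to every numeric column; a minimal sketch (assuming the iris data frame from Program 1; minmax and iris_norm are illustrative names):

minmax <- function(x) (x - min(x)) / (max(x) - min(x))    # min-max scaling to [0, 1]
iris_norm <- as.data.frame(sapply(iris[, 1:4], minmax))   # apply to the four numeric columns
head(iris_norm)                                           # first few normalized rows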


3. Generate histograms for any one variable (sepal length/ sepal width/ petal
length/ petal width) and generate scatter plots for every pair of variables
showing each species in different color.
SOLUTION:

A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. The height of each bar in a histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as input and uses some more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is −

hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
‘v’ is a vector containing numeric values used in histogram.
‘main’ indicates title of the chart.
‘col’ is used to set color of the bars.
‘border’ is used to set border color of each bar.
‘xlab’ is used to give description of x-axis.
‘xlim’ is used to specify the range of values on the x-axis.
‘ylim’ is used to specify the range of values on the y-axis.
‘breaks’ is used to control the number of bins (the break points between bars).
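As a quick illustration of these parameters, the sketch below (assuming the iris data frame from Program 1 is loaded) draws a labelled histogram of sepal length; the colours, axis limits, and bin count are arbitrary choices:

hist(iris$Sepal.Length,
     main = "Histogram of Sepal Length",   # chart title
     xlab = "Sepal length (cm)",           # x-axis label
     col = "skyblue", border = "white",    # bar fill and border colours
     xlim = c(4, 8), breaks = 10)          # x-axis range and number of bins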


# print iris
print(iris)

> x1<-iris$Sepal.Length
> x2<-iris$Sepal.Width
> x3<-iris$Petal.Length
> x4<-iris$Petal.Width

>hist(x1)

>hist(x2)


>hist(x3)


>hist(x4)

Scatter Plots
Scatter plots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is chosen in the horizontal axis and another in the
vertical axis.
The simple scatter plot is created using the plot() function.
Syntax: The basic syntax for creating scatter plot in R is −

plot(x, y, main, xlab, ylab, xlim, ylim, axes)


Following is the description of the parameters used −
‘x’ is the data set whose values are the horizontal coordinates.
‘y’ is the data set whose values are the vertical coordinates.
‘main’ is the title of the graph.
‘xlab’ is the label in the horizontal axis.
‘ylab’ is the label in the vertical axis.
‘xlim’ is the limits of the values of x used for plotting.
‘ylim’ is the limits of the values of y used for plotting.


‘axes’ indicates whether both axes should be drawn on the plot.

>plot(x = x1,y = x2,col =


c("red","blue","green")[iris$Species], xlab = "sepal_length",
ylab = "sepal_width", main = "sepal_length vs sepal_width")

>plot(x = x1,y = x3,col =


c("red","blue","green")[iris$Species], xlab = "sepal_length",
ylab = "petal_length", main = "sepal_length vs petal_length")


>plot(x = x1,y = x4,col =


c("red","blue","green")[iris$Species], xlab = "sepal_length",
ylab = "petal_width", main = "sepal_length vs petal_width")

> plot(x = x2,y = x3,col =


c("red","blue","green")[iris$Species], xlab = "sepal_width",
ylab = "petal_length", main = "sepal_width vs petal_length")


> plot(x = x2,y = x4,col =


c("red","blue","green")[iris$Species], xlab = "sepal_width",
ylab = "petal_width", main = "sepal_width vs petal_width")

> plot(x = x3,y = x4,col =


c("red","blue","green")[iris$Species], xlab = "petal_length",
ylab = "petal_width", main = "petal_length vs petal_width")


4. Generate box plots for each of the numerical attributes. Identify the
attribute with the highest variance.
SOLUTION:

Boxplots are a measure of how well the data in a data set are distributed. A boxplot divides the data set into quartiles and represents the minimum, maximum, median, first quartile, and third quartile of the data. It is also useful for comparing the distribution of data across data sets by drawing boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax: The basic syntax to create a boxplot in R is −

boxplot(x, data, notch, varwidth, names, main)


Following is the description of the parameters used −
‘x' is a vector or a formula.
‘data’ is the data frame.
‘notch’ is a logical value. Set as TRUE to draw a notch.
‘varwidth’ is a logical value. Set as TRUE to draw the width of each box proportional to the sample size.
‘names’ are the group labels which will be printed under each boxplot.
‘main’ is used to give a title to the graph.

# print iris


>print(iris)

>x1<-iris$Sepal.Length
>x2<-iris$Sepal.Width
>x3<-iris$Petal.Length
>x4<-iris$Petal.Width

>boxplot(x1, xlab = "sepal_length", main = "sepal_length ")

>boxplot(x2, xlab = "sepal_ width ", main = "sepal_width ")


>boxplot(x3, xlab = " petal_length ", main = " petal_length ")

>boxplot(x4, xlab = " petal_width ", main = " petal_width ")


# Highest Variance
>var1=var(x1)
>var2=var(x2)
>var3=var(x3)
>var4=var(x4)

>max1=max(c(var1,var2,var3,var4))

>print(max1)

[1] 3.116278
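The value 3.116278 equals var3, the variance of Petal.Length (see Program 1), so Petal.Length is the attribute with the highest variance. The attribute name can also be obtained directly; a minimal sketch using sapply and which.max (vars is an illustrative name):

vars <- sapply(iris[, 1:4], var)    # variance of each numeric column
print(vars)
print(names(which.max(vars)))       # name of the attribute with the highest variance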


5. Study of homogeneous and heterogeneous data structures such as vector, matrix, array, list, and data frame in R.

SOLUTION:

1) Vectors 2) Lists 3) Matrices 4) Arrays 5) Data Frames

Vectors:

When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.

#Create a Vectors
apple <- c('red', 'green', "yellow")

print(apple)

num <- c(1,5,6,1,8,11,"a")   # mixing a string with numbers coerces the whole vector to character

print(num)

c1<-c(2, 3, 5)


c2<-c(TRUE, FALSE, TRUE, FALSE, FALSE)

c3<-c("aa","bb","cc","dd","ee")

length(c("aa", "bb", "cc", "dd", "ee"))

#combining vectors
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
c(n, s)

#arithmetic: a and b have different lengths, so R recycles the shorter vector (with a warning)
a = c(1, 3, 5, 7)
b = c(1, 2, 4, 8,9)
a+b

Matrices:
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to
the matrix function.
#matrix
A = matrix(
+ c(2, 4, 3, 1, 5, 7), # the data elements
+ nrow=2, # number of rows
+ ncol=3, # number of columns
+ byrow = TRUE) # fill matrix by rows

A = matrix( c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3, byrow = TRUE)

A = matrix( c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3)

#An element at the mth row, nth column of A can be accessed by the expression A[m, n].

A[2, 3] # element at 2nd row, 3rd column

#The entire mth row A can be extracted as A[m, ].

A[2, ] # the 2nd row

#Similarly, the entire nth column A can be extracted as A[ ,n].

A[ ,3] # the 3rd column


#We can also extract more than one rows or columns at a time.

A[ ,c(1,3)] # the 1st and 3rd columns

t(A) # transpose of A

Arrays:

While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute which creates the required number of dimensions. In the example below we create a 3x3x4 array, i.e., four 3x3 matrices, by recycling the two colour values.

a <- array(c('green','yellow'),dim = c(3,3,4))


print(a)

, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
, , 3
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 4
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"

Data Frames:

Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric while the second column can be character and the third column logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

BMI <- data.frame( gender = c("Male", "Male","Female"), height = c(152, 171.5, 165),
weight = c(81, 93, 78), Age = c(42, 38, 26))


print(BMI)

View(BMI)

gender height weight Age


1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
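Columns of a data frame can be accessed by name with the $ operator, and rows can be filtered with logical conditions; a short sketch using the BMI data frame above:

BMI$height            # extract a single column as a vector
BMI[BMI$Age > 30, ]   # rows where Age is greater than 30
str(BMI)              # structure: column names, types and sample values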

Lists:
A list is an R object which can contain many different types of elements, such as vectors, functions, and even another list.

# Create a list. (mat2 is defined first so the example runs as shown in the output below.)
mat2 <- matrix(c("a","b","c","d","e","f"), nrow = 2, ncol = 3)
eglist <- list(c(2,5), c('red','green',"yellow"), 21.3, BMI, mat2, sin)
print(eglist)

[[1]]
[1] 2 5
[[2]]
[1] "red" "green" "yellow"
[[3]]
[1] 21.3
[[4]]
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
[[5]]
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
[[6]]
function (x) .Primitive("sin")
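Elements of a list are retrieved with double square brackets (or $ for named elements); a short sketch using eglist from above:

eglist[[2]]       # the character vector 'red', 'green', 'yellow'
eglist[[4]]$Age   # the Age column of the BMI data frame stored in the list
length(eglist)    # number of elements in the list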


6. Write an R program using the ‘apply’ group of functions to create and apply a normalization function on each of the numeric variables/columns of the iris dataset, transforming them into values around 0 with z-score normalization.
SOLUTION:
A z-score (also known as a standard score) indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula:

z= (X - μ) / σ

where z is the z-score, X is the value of the element, μ is the population mean, and σ is the
standard deviation.

>X1<-iris$Sepal.Length
>Zscr1<-(X1 -mean(X1))/sd(X1)
>Zscr1

[1] -0.89767388 -1.13920048 -1.38072709 -1.50149039 -1.01843718 -0.53538397


[7] -1.50149039 -1.01843718 -1.74301699 -1.13920048 -0.53538397 -1.25996379
[13] -1.25996379 -1.86378030 -0.05233076 -0.17309407 -0.53538397 -0.89767388
[19] -0.17309407 -0.89767388 -0.53538397 -0.89767388 -1.50149039 -0.89767388
[25] -1.25996379 -1.01843718 -1.01843718 -0.77691058 -0.77691058 -1.38072709
[31] -1.25996379 -0.53538397 -0.77691058 -0.41462067 -1.13920048 -1.01843718
[37] -0.41462067 -1.13920048 -1.74301699 -0.89767388 -1.01843718 -1.62225369
[43] -1.74301699 -1.01843718 -0.89767388 -1.25996379 -0.89767388 -1.50149039
[49] -0.65614727 -1.01843718 1.39682886 0.67224905 1.27606556 -0.41462067
[55] 0.79301235 -0.17309407 0.55148575 -1.13920048 0.91377565 -0.77691058


[61] -1.01843718 0.06843254 0.18919584 0.30995914 -0.29385737 1.03453895


[67] -0.29385737 -0.05233076 0.43072244 -0.29385737 0.06843254 0.30995914
[73] 0.55148575 0.30995914 0.67224905 0.91377565 1.15530226 1.03453895
[79] 0.18919584 -0.17309407 -0.41462067 -0.41462067 -0.05233076 0.18919584
[85] -0.53538397 0.18919584 1.03453895 0.55148575 -0.29385737 -0.41462067
[91] -0.41462067 0.30995914 -0.05233076 -1.01843718 -0.29385737 -0.17309407
[97] -0.17309407 0.43072244 -0.89767388 -0.17309407 0.55148575 -0.05233076
[103] 1.51759216 0.55148575 0.79301235 2.12140867 -1.13920048 1.75911877
[109] 1.03453895 1.63835547 0.79301235 0.67224905 1.15530226 -0.17309407
[115] -0.05233076 0.67224905 0.79301235 2.24217198 2.24217198 0.18919584
[121] 1.27606556 -0.29385737 2.24217198 0.55148575 1.03453895 1.63835547
[127] 0.43072244 0.30995914 0.67224905 1.63835547 1.87988207 2.48369858
[133] 0.67224905 0.55148575 0.30995914 2.24217198 0.55148575 0.67224905
[139] 0.18919584 1.27606556 1.03453895 1.27606556 -0.05233076 1.15530226
[145] 1.03453895 1.03453895 0.55148575 0.79301235 0.43072244 0.06843254

>X2<-iris$Sepal.Width
>Zscr2<-(X2 -mean(X2))/sd(X2)
>Zscr2

[1] 1.01560199 -0.13153881 0.32731751 0.09788935 1.24503015 1.93331463 0.78617383 0.78617383


[9] -0.36096697 0.09788935 1.47445831 0.78617383 -0.13153881 -0.13153881 2.16274279 3.08045544
[17] 1.93331463 1.01560199 1.70388647 1.70388647 0.78617383 1.47445831 1.24503015 0.55674567
[25] 0.78617383 -0.13153881 0.78617383 1.01560199 0.78617383 0.32731751 0.09788935 0.78617383
[33] 2.39217095 2.62159911 0.09788935 0.32731751 1.01560199 1.24503015 -0.13153881 0.78617383
[41] 1.01560199 -1.73753594 0.32731751 1.01560199 1.70388647 -0.13153881 1.70388647 0.32731751
[49] 1.47445831 0.55674567 0.32731751 0.32731751 0.09788935 -1.73753594 -0.59039513 -0.59039513
[57] 0.55674567 -1.50810778 -0.36096697 -0.81982329 -2.42582042 -0.13153881 -1.96696410 -0.36096697
[65] -0.36096697 0.09788935 -0.13153881 -0.81982329 -1.96696410 -1.27867961 0.32731751 -0.59039513
[73] -1.27867961 -0.59039513 -0.36096697 -0.13153881 -0.59039513 -0.13153881 -0.36096697 -1.04925145
[81] -1.50810778 -1.50810778 -0.81982329 -0.81982329 -0.13153881 0.78617383 0.09788935 -1.73753594
[89] -0.13153881 -1.27867961 -1.04925145 -0.13153881 -1.04925145 -1.73753594 -0.81982329 -0.13153881
[97] -0.36096697 -0.36096697 -1.27867961 -0.59039513 0.55674567 -0.81982329 -0.13153881 -0.36096697
[105] -0.13153881 -0.13153881 -1.27867961 -0.36096697 -1.27867961 1.24503015 0.32731751 -0.81982329
[113] -0.13153881 -1.27867961 -0.59039513 0.32731751 -0.13153881 1.70388647 -1.04925145 -1.96696410
[121] 0.32731751 -0.59039513 -0.59039513 -0.81982329 0.55674567 0.32731751 -0.59039513 -0.13153881
[129] -0.59039513 -0.13153881 -0.59039513 1.70388647 -0.59039513 -0.59039513 -1.04925145 -0.13153881
[137] 0.78617383 0.09788935 -0.13153881 0.09788935 0.09788935 0.09788935 -0.81982329 0.32731751
[145] 0.55674567 -0.13153881 -1.27867961 -0.13153881 0.78617383 -0.13153881


>X3<-iris$Petal.Length
>Zscr3<-(X3 -mean(X3))/sd(X3)
>Zscr3

[1] -1.33575163 -1.33575163 -1.39239929 -1.27910398 -1.33575163 -1.16580868 -1.33575163 -1.27910398


[9] -1.33575163 -1.27910398 -1.27910398 -1.22245633 -1.33575163 -1.50569459 -1.44904694 -1.27910398
[17] -1.39239929 -1.33575163 -1.16580868 -1.27910398 -1.16580868 -1.27910398 -1.56234224 -1.16580868
[25] -1.05251337 -1.22245633 -1.22245633 -1.27910398 -1.33575163 -1.22245633 -1.22245633 -1.27910398
[33] -1.27910398 -1.33575163 -1.27910398 -1.44904694 -1.39239929 -1.33575163 -1.39239929 -1.27910398
[41] -1.39239929 -1.39239929 -1.39239929 -1.22245633 -1.05251337 -1.33575163 -1.22245633 -1.33575163
[49] -1.27910398 -1.33575163 0.53362088 0.42032558 0.64691619 0.13708732 0.47697323 0.42032558
[57] 0.53362088 -0.25944625 0.47697323 0.08043967 -0.14615094 0.25038262 0.13708732 0.53362088
[65] -0.08950329 0.36367793 0.42032558 0.19373497 0.42032558 0.08043967 0.59026853 0.13708732
[73] 0.64691619 0.53362088 0.30703027 0.36367793 0.59026853 0.70356384 0.42032558 -0.14615094
[81] 0.02379201 -0.03285564 0.08043967 0.76021149 0.42032558 0.42032558 0.53362088 0.36367793
[89] 0.19373497 0.13708732 0.36367793 0.47697323 0.13708732 -0.25944625 0.25038262 0.25038262
[97] 0.25038262 0.30703027 -0.42938920 0.19373497 1.27004036 0.76021149 1.21339271 1.04344975
[105] 1.15674505 1.60992627 0.42032558 1.43998331 1.15674505 1.32668801 0.76021149 0.87350679
[113] 0.98680210 0.70356384 0.76021149 0.87350679 0.98680210 1.66657392 1.77986923 0.70356384
[121] 1.10009740 0.64691619 1.66657392 0.64691619 1.10009740 1.27004036 0.59026853 0.64691619
[129] 1.04344975 1.15674505 1.32668801 1.49663097 1.04344975 0.76021149 1.04344975 1.32668801
[137] 1.04344975 0.98680210 0.59026853 0.93015445 1.04344975 0.76021149 0.76021149 1.21339271
[145] 1.10009740 0.81685914 0.70356384 0.81685914 0.93015445 0.76021149

>X4<-iris$Petal.Width
>Zscr4<-(X4 -mean(X4))/sd(X4)
>Zscr4

[1] -1.3110521482 -1.3110521482 -1.3110521482 -1.3110521482 -1.3110521482 -1.0486667950


[7] -1.1798594716 -1.3110521482 -1.3110521482 -1.4422448248 -1.3110521482 -1.3110521482
[13] -1.4422448248 -1.4422448248 -1.3110521482 -1.0486667950 -1.0486667950 -1.1798594716
[19] -1.1798594716 -1.1798594716 -1.3110521482 -1.0486667950 -1.3110521482 -0.9174741184
[25] -1.3110521482 -1.3110521482 -1.0486667950 -1.3110521482 -1.3110521482 -1.3110521482
[31] -1.3110521482 -1.0486667950 -1.4422448248 -1.3110521482 -1.3110521482 -1.3110521482
[37] -1.3110521482 -1.4422448248 -1.3110521482 -1.3110521482 -1.1798594716 -1.1798594716
[43] -1.3110521482 -0.7862814418 -1.0486667950 -1.1798594716 -1.3110521482 -1.3110521482
[49] -1.3110521482 -1.3110521482 0.2632599711 0.3944526477 0.3944526477 0.1320672944
[55] 0.3944526477 0.1320672944 0.5256453243 -0.2615107354 0.1320672944 0.2632599711


[61] -0.2615107354 0.3944526477 -0.2615107354 0.2632599711 0.1320672944 0.2632599711


[67] 0.3944526477 -0.2615107354 0.3944526477 -0.1303180588 0.7880306775 0.1320672944
[73] 0.3944526477 0.0008746178 0.1320672944 0.2632599711 0.2632599711 0.6568380009
[79] 0.3944526477 -0.2615107354 -0.1303180588 -0.2615107354 0.0008746178 0.5256453243
[85] 0.3944526477 0.5256453243 0.3944526477 0.1320672944 0.1320672944 0.1320672944
[91] 0.0008746178 0.2632599711 0.0008746178 -0.2615107354 0.1320672944 0.0008746178
[97] 0.1320672944 0.1320672944 -0.1303180588 0.1320672944 1.7063794137 0.9192233541
[103] 1.1816087073 0.7880306775 1.3128013839 1.1816087073 0.6568380009 0.7880306775
[109] 0.7880306775 1.7063794137 1.0504160307 0.9192233541 1.1816087073 1.0504160307
[115] 1.5751867371 1.4439940605 0.7880306775 1.3128013839 1.4439940605 0.3944526477
[121] 1.4439940605 1.0504160307 1.0504160307 0.7880306775 1.1816087073 0.7880306775
[127] 0.7880306775 0.7880306775 1.1816087073 0.5256453243 0.9192233541 1.0504160307
[133] 1.3128013839 0.3944526477 0.2632599711 1.4439940605 1.5751867371 0.7880306775
[139] 0.7880306775 1.1816087073 1.5751867371 1.4439940605 0.9192233541 1.4439940605
[145] 1.7063794137 1.4439940605 0.9192233541 1.0504160307 1.4439940605 0.7880306775
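The same transformation can be written with the ‘apply’ family of functions, as the exercise asks; a minimal sketch (assuming the iris data frame from Program 1; zscore and iris_z are illustrative names) that defines a z-score function and applies it to every numeric column:

zscore <- function(x) (x - mean(x)) / sd(x)             # z-score normalization
iris_z <- as.data.frame(sapply(iris[, 1:4], zscore))    # apply to each numeric column
head(iris_z)
# base R's scale() gives an equivalent result: scale(iris[, 1:4])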

7.
a) Use R to apply linear regression to predict the evaporation coefficient in terms of air velocity using the data given below:

Air Velocity (cm/sec):               20, 60, 100, 140, 180, 220, 260, 300, 340, 380
Evaporation Coefficient (sq mm/sec): 0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65

Solution:

Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other variable is called the response variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −


y is the response variable.


x is the predictor variable.
a and b are constants which are called the coefficients.

lm() Function
This function creates the relationship model between the predictor and the response
variable.

Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
formula is a symbol presenting the relation between x and y.
data is the data frame on which the formula will be applied.
>AV <- data.frame(
+ airvelocity = c(20,60,100,140,180,220,260,300,340,380),
+ evaporationcoefficient = +c(0.18,0.37,0.35,0.78,0.56,0.75,1.18,1.36,1.17,1.65)
+)

> print(AV)
airvelocity evaporationcoefficient
1 20 0.18
2 60 0.37
3 100 0.35
4 140 0.78
5 180 0.56
6 220 0.75
7 260 1.18
8 300 1.36
9 340 1.17
10 380 1.65

> model<-lm(AV$airvelocity~AV$evaporationcoefficient)
> print(model)

Call:
lm(formula = AV$airvelocity ~ AV$evaporationcoefficient)

Coefficients:
(Intercept) AV$evaporationcoefficient


2.564 236.450
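Note that the model above regresses air velocity on the evaporation coefficient, whereas the problem statement asks to predict the evaporation coefficient in terms of air velocity. A minimal sketch of the model in that direction (model2 is an illustrative name; its coefficients will differ from those shown above, although R-squared and the F-statistic are identical for a simple regression in either direction):

model2 <- lm(evaporationcoefficient ~ airvelocity, data = AV)   # evaporation coefficient as the response
print(model2)
summary(model2)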

b) Analyze the significance of the residual standard error value, R-squared value, and F-statistic. Find the correlation coefficient for this data and analyze the significance of the correlation value.

Solution:
The Residual Standard Error is a measure of the quality of a linear regression fit: it is the average amount that the response will deviate from the true regression line.

The R-squared statistic provides a measure of how well the model is fitting the actual data.

The F-statistic is a good indicator of whether there is a relationship between our predictor and the response variables. Note that it is possible to have an F value that indicates a strong relationship which is non-linear, so that the RSE is high and the R-squared is low.

> summary(model)

Call:
lm(formula = AV$airvelocity ~ AV$evaporationcoefficient)

Residuals:
Min 1Q Median 3Q Max
-46.99 -24.88 -17.14 33.74 60.79

Coefficients:
Estimate Std. Error t value Pr(>|t|)


(Intercept) 2.564 25.804 0.099 0.923


AV$evaporationcoefficient 236.450 27.035 8.746 2.29e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.53 on 8 degrees of freedom


Multiple R-squared: 0.9053, Adjusted R-squared: 0.8935
F-statistic: 76.49 on 1 and 8 DF, p-value: 2.286e-05

> cor(AV$airvelocity,AV$evaporationcoefficient)

[1] 0.9514814

> y<-cor(AV$airvelocity,AV$evaporationcoefficient)
>y

[1] 0.9514814

c) Perform a log transformation on the ‘Air Velocity’ column, perform linear regression again, and analyze all the relevant values.
Solution:

> x<-log(AV$airvelocity)
>x

[1] 2.995732 4.094345 4.605170 4.941642 5.192957 5.393628 5.560682 5.703782 5.828946 5.940171

> m<-lm(x~AV$evaporationcoefficient)
>m
Call:
lm(formula = x ~ AV$evaporationcoefficient)

Coefficients:
(Intercept) AV$evaporationcoefficient
3.678 1.614
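The relevant values for the log-transformed model can be inspected with the same diagnostics as in part (b); a minimal sketch:

summary(m)                          # residual standard error, R-squared, F-statistic
cor(x, AV$evaporationcoefficient)   # correlation after the log transformation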


WEKA Knowledge Extraction toolkit:

Introduction
WEKA is a data mining system developed by the University of Waikato in New Zealand
that implements data mining algorithms. WEKA is a state-of-the-art facility for developing
machine learning (ML) techniques and their application to real-world data mining problems. It is
a collection of machine learning algorithms for data mining tasks. The algorithms are applied
directly to a dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering, and association rules; it also includes visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source software
issued under the GNU General Public License. The goal of this Tutorial is to help you to learn
WEKA Explorer. The tutorial will guide you step by step through the analysis of a simple
problem using WEKA Explorer preprocessing, classification, clustering, association, attribute
selection, and visualization tools. At the end of each problem there is a representation of the
results with explanations side by side. Each part is concluded with the exercise for individual
practice. By the time you reach the end of this tutorial, you will be able to analyze your data with
WEKA Explorer using various learning schemes and interpret received results. Before starting
this, you should be familiar with data mining algorithms such as C4.5 (C5), ID3, K-means, and
Apriori.

2. Launching WEKA Explorer

You can launch Weka from the C:\Program Files directory, from your desktop by selecting its icon, or from the Windows task bar via ‘Start’ -> ‘Programs’ -> ‘Weka 3.8’. When the ‘WEKA GUI


Chooser’ window appears on the screen, you can select one of the four options at the bottom of
the window:

1. Simple CLI provides a simple command-line interface and allows direct execution of Weka
commands.
2. Explorer is an environment for exploring data.
3. Experimenter is an environment for performing experiments and conducting statistical tests
between learning schemes.
4. KnowledgeFlow is a Java-Beans-based interface for setting up and running machine learning
experiments.

For the exercises in this tutorial you will use ‘Explorer’. Click on ‘Explorer’ button in the
‘WEKA GUI Chooser’ window.

‘WEKA Explorer’ window appears on a screen.


1. Demonstration of preprocessing on dataset student.arff


Aim: This experiment illustrates some of the basic data preprocessing operations that
can be performed using WEKA-Explorer. The sample dataset used for this example is the
student data available in arff format.
Step1: Loading the data. We can load the dataset into weka by clicking on the Open button in the preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, weka will recognize the attributes, and during the scan of the data weka will compute some basic statistics on each attribute. The left panel in the above figure shows the list of recognized attributes, while the top panel indicates the names of the base relation or table and the current working relation (which are the same initially).
Step3: Clicking on an attribute in the left panel will show the basic statistics of that attribute. For categorical attributes the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc.
Step4: The visualization in the lower-right panel is in the form of a cross-tabulation across two attributes.


Note: we can select another attribute using the dropdown list.


Step5: Selecting or filtering attributes
Removing an attribute – When we need to remove an attribute, we can do this by using the attribute filters in weka. In the Filter panel, click on the Choose button. This will show a popup window with a list of available filters.
Scroll down the list and select the “weka.filters.unsupervised.attribute.Remove” filter.
Step 6: a) Next click the textbox immediately to the right of the Choose button. In the resulting dialog box enter the index of the attribute to be filtered out.
b) Make sure that the invert selection option is set to false. Then click OK; now in the filter box you will see “Remove-R-7”.
c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new working relation.
d) Save the new working relation as an arff file by clicking the Save button on the top panel. (student.arff)

Discretization
1) Sometimes association rule mining can only be performed on categorical data. This
requires performing discretization on numeric or continuous attributes. In the following
example let us discretize age attribute.

 Let us divide the values of age attribute into three bins(intervals).

 First load the dataset into weka(student.arff)

 Select the age attribute.

 Activate the filter dialog box and select “weka.filters.unsupervised.attribute.Discretize” from the list.

 To change the defaults for the filters, click on the box immediately to the right of the
choose button.

 We enter the index for the attribute to be discretized. In this case the attribute is age. So
we must enter ‘1’ corresponding to the age attribute.

 Enter ‘3’ as the number of bins. Leave the remaining field values as they are.


 Click OK button.

 Click on apply in the filter panel. This will result in a new working relation with the
selected attribute partition into 3 bins.

 Save the new working relation in a file called student-data-discretized.arff

Dataset student.arff


@relation student
@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}
@data
%
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no
%


The following screenshot shows the effect of discretization.


2. Demonstration of Association rule process on dataset test.arff using


apriori algorithm
Aim: This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is test.arff.
Step1: Open the data file in Weka Explorer. It is presumed that the required data fields
have been discretized. In this example it is age attribute.
Step2: Clicking on the associate tab will bring up the interface for association rule
algorithm.
Step3: We will use apriori algorithm. This is the default algorithm.
Step4: In order to change the parameters for the run (for example, support, confidence, etc.) we click on the text box immediately to the right of the Choose button.

Dataset test.arff
@relation test
@attribute admissionyear {2005,2006,2007,2008,2009,2010}
@attribute course {cse,mech,it,ece}
@data
%
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
%


The following screenshot shows the association rules that were generated when the Apriori algorithm is applied to the given dataset.


3. Demonstration of classification rule process on dataset employee.arff


using j48 algorithm
Aim: This experiment illustrates the use of the J48 classifier in Weka. The sample data set used in this experiment is the “employee” data available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Steps involved in this experiment:
Step 1: We begin the experiment by loading the data (employee.arff) into Weka.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “J48” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values; the default version does perform some pruning but does not perform error pruning.
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don’t have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%; this indicates that more work may be needed (either in preprocessing or in selecting the current parameters for the classification).
Step 7: Weka also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting “Visualize tree” from the pop-up menu.
Step 8: We will use our model to classify new instances.
Step 9: In the main panel under “Test options”, click the “Supplied test set” radio button and then click the “Set” button. This will pop up a window which will allow you to open the file containing the test instances.

Data set employee.arff:


@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k,34k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor


27, 15k, poor


27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 34k, good
48, 32k,good
%

The following screenshot shows the classification rules that were generated when the J48 algorithm is applied to the given dataset.


4. Demonstration of classification rule process on dataset employee.arff


using naïve bayes algorithm
Aim: This experiment illustrates the use of the naïve Bayes classifier in Weka. The sample data set used in this experiment is the “employee” data available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Steps involved in this experiment:
Step 1: We begin the experiment by loading the data (employee.arff) into Weka.
Step 2: Next we select the “Classify” tab and click the “Choose” button to select the “NaiveBayes” classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values.
Step 4: Under the “Test options” in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don’t have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click “Start” to generate the model. The model description as well as the evaluation statistics will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%; this indicates that more work may be needed (either in preprocessing or in selecting the current parameters for the classification).
Step 7: Weka also lets us visualize the results. This can be done by right-clicking the last result set in the result list and choosing one of the visualization options from the pop-up menu.
Step 8: We will use our model to classify new instances.
Step 9: In the main panel under “Test options”, click the “Supplied test set” radio button and then click the “Set” button. This will show a pop-up window which will allow you to open the file containing the test instances.


Data set employee.arff:


@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k,34k}
@attribute performance {good, avg, poor}
@data
%
25, 10k, poor
27, 15k, poor
27, 17k, poor
28, 17k, poor
29, 20k, avg
30, 25k, avg
29, 25k, avg
30, 20k, avg
35, 32k, good
48, 34k, good
48, 32k, good
%


The following screenshot shows the classification rules that were generated when the naïve Bayes algorithm is applied to the given dataset.


5. Demonstration of clustering rule process on dataset iris.arff using


simple k-means
Aim: This experiment illustrates the use of simple k-means clustering with the Weka Explorer. The sample data set used for this example is based on the iris data available in ARFF format. This document assumes that appropriate preprocessing has been performed. The iris dataset includes 150 instances.

Steps involved in this Experiment


Step 1: Run the Weka Explorer and load the data file iris.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the ‘Cluster’ tab in the Explorer and click on the Choose button. This step results in a dropdown list of available clustering algorithms.
Step 3: In this case we select ‘SimpleKMeans’.
Step 4: Next, click the text box to the right of the Choose button to get the popup window shown in the screenshots. In this window we enter 6 as the number of clusters and leave the value of the seed as it is. The seed value is used in generating a random number, which in turn is used for making the internal assignments of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. In the ‘Cluster mode’ panel we make sure that the ‘Use training set’ option is selected, and then we click the ‘Start’ button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. The cluster centroids are the mean vectors for each cluster, and they can be used to characterize the clusters. For example, the centroid of cluster 1 shows that for the class Iris-versicolor the mean value of the sepal length is 5.4706, sepal width 2.4765, petal width 1.1294, and petal length 3.7941.
Step 7: Another way of understanding the characteristics of each cluster is through visualization; we can do this by right-clicking the result set in the result list panel and selecting ‘Visualize cluster assignments’.


The following screenshot shows the clustering results that were generated when the simple k-means algorithm is applied to the given dataset.


Interpretation of the above visualization


From the above visualization, we can understand the distribution of sepal length and petal length in each cluster. For instance, each cluster is dominated by petal length. By changing the color dimension to other attributes we can see their distribution within each of the clusters.
Step 8: We can save the resulting dataset, which includes each instance along with its assigned cluster. To do so, we click the save button in the visualization window and save the result as iris-k-means. The top portion of this file is shown in the following figure.
