Practical: 01
Aim: Overview of SQL Server 2008/2012 Databases and Analysis Services. Create a sample
database star schema using SQL Server Management Studio. Design, create and process a
cube by identifying measures and dimensions for the star schema for an assigned system;
replace a dimension in the grid, and perform filtering and drill-down using the cube browser.
Software/Hardware: Microsoft Visual Studio 2010, Microsoft SQL Server Management Studio
Theory:
OLAP Operations
OLAP provides a user-friendly environment for interactive data analysis. A number of
OLAP data cube operations exist to materialize different views of the data, permitting
interactive querying and analysis.
The most common user operations on dimensional data are:
Roll up: The roll-up operation (also known as drill-up or aggregation) performs aggregation
on a data cube, either by climbing up a concept hierarchy for a dimension or
by dimension reduction, i.e. removing one or more dimensions.
Roll down: The roll-down operation (also known as drill-down) is the reverse of roll-up. It navigates
from less detailed data to more detailed data. It can be performed either by stepping
down a concept hierarchy for a dimension or by introducing additional dimensions.
Slicing: Slice performs a selection on one dimension of the given cube, resulting in a subcube.
Dicing: The dice operation defines a subcube by performing a selection on two or more dimensions.
Pivot: Pivot, otherwise referred to as rotate, changes the dimensional orientation of the cube, i.e.
rotates the data axes to view the data from different perspectives.
Pivot groups data along different dimensions.
Scoping: Restricting the view of database objects to a specified subset is called scoping. Scoping allows
users to receive and update only those data values they are permitted to receive and update.
Screening: Screening applies a criterion against the data or the members of a dimension in order to
restrict the set of data retrieved.
Drill across: Accesses more than one fact table joined by common dimensions. It combines cubes that
share one or more dimensions.
Drill through: Drills down to the lowest level of a data cube and through to
its underlying relational tables.
CE350- Data Warehousing and Data Mining 16CE068
Steps:
2. Create Database
CUSTOMER table
REGION Table
PRODUCT Table
TIME Table
FACT Table
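The five tables above can be sketched in T-SQL roughly as follows. This is an illustrative sketch only: the actual column names and data types are those shown in the screenshots, and every identifier below (RegionID, SalesAmount, etc.) is our own assumption for a generic sales star schema in which the fact table references all four dimensions.

```sql
-- Hypothetical star schema DDL; column names are illustrative, not the assigned system's.
CREATE TABLE REGION (
    RegionID    INT PRIMARY KEY,
    RegionName  VARCHAR(50)
);
CREATE TABLE CUSTOMER (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(50)
);
CREATE TABLE PRODUCT (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(50),
    UnitPrice   MONEY
);
CREATE TABLE [TIME] (
    TimeID   INT PRIMARY KEY,
    [Year]   INT,
    Quarter  INT,
    [Month]  INT
);
-- The fact table holds the measures and a foreign key to each dimension.
CREATE TABLE FACT (
    CustomerID  INT FOREIGN KEY REFERENCES CUSTOMER(CustomerID),
    RegionID    INT FOREIGN KEY REFERENCES REGION(RegionID),
    ProductID   INT FOREIGN KEY REFERENCES PRODUCT(ProductID),
    TimeID      INT FOREIGN KEY REFERENCES [TIME](TimeID),
    SalesAmount MONEY,
    Quantity    INT
);
```

In a star schema the fact table sits at the center and each dimension table is joined to it by a single key, which is what Analysis Services later uses to identify measures (SalesAmount, Quantity) and dimensions when the cube is designed.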
Select the option as shown in the next figure and click on Next.
Click on Finish.
It will show the data source view name and tables; click on Finish.
Click on finish
Conclusion: In this practical we processed a criminal case database cube using the various OLAP operations.
Practical-2
Aim: Introduction to R programming. R GUI and RStudio: basic working and commands.
Software Requirement: R GUI
Theory:
R is freely available under the GNU General Public License, and pre-compiled binary versions
are provided for various operating systems such as Linux, Windows and Mac. The language was
named R after the first letters of the first names of its two authors (Robert Gentleman and Ross
Ihaka), and partly as a play on the name of the Bell Labs language S.
Evolution of R:
R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the
University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source
code archive.
Features of R:
As stated earlier, R is a programming language and software environment for statistical analysis,
graphics representation and reporting. The following are the important features of R −
R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions, and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on screen or
for printing on paper.
There are many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Commands:
print() – to print any value/string.
class() – to get the data type of a variable.
names() – to assign names to the elements of an object.
charToRaw() – to convert a string into its raw (hexadecimal) bytes.
Vectors: To create a vector with one or more elements, use the c() function.
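A short sketch of these commands in action (the values and names below are our own examples):

```r
# c() combines elements into a vector
v <- c(10, 20, 30, 40)
print(v)
print(class(v))                      # data type of the variable: "numeric"
names(v) <- c("a", "b", "c", "d")    # names() assigns names to the elements
print(v["b"])                        # elements can then be accessed by name
print(charToRaw("A"))                # raw (hexadecimal) bytes of the string: 41
```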
Lists: A list is an R-object which can contain many different types of elements, such as vectors,
functions and even other lists.
Accessing Elements:
- List_name[index]
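A minimal sketch of list creation and indexing (the contents are our own example; note that `[index]` returns a sublist while `[[index]]` returns the element itself):

```r
# A list mixing a string, a numeric vector, and a function
l <- list(name = "R", versions = c(3, 4), square = function(x) x^2)
print(l[1])             # l[1] is a sublist containing the first element
print(l[["versions"]])  # l[["versions"]] is the vector itself
print(l$square(4))      # elements can also be accessed with $; 4^2 = 16
```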
Array: While matrices are confined to two dimensions, arrays can have any number of dimensions.
The array() function takes a dim attribute which creates the required number of dimensions.
Accessing Elements:
- Array_name[row,column,matrix_no]
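A small sketch of the `Array_name[row, column, matrix_no]` indexing (dimensions chosen arbitrarily for illustration):

```r
# A 3-dimensional array: 3 rows, 4 columns, 2 "matrices"; filled column-major from 1:24
a <- array(1:24, dim = c(3, 4, 2))
print(a)
print(a[2, 3, 1])   # row 2, column 3 of the first matrix
</code>
```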
Factor:
Factors are R-objects which are created using a vector. A factor stores the vector along with the
distinct values of its elements as labels. The labels are always characters, irrespective of whether
the input vector is numeric, character or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
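A sketch of factor() and nlevels() (the colour data is our own example):

```r
# A character vector with repeated values
colours <- c("green", "blue", "green", "red", "blue", "green")
f <- factor(colours)   # stores the vector plus its distinct values as levels
print(f)
print(levels(f))       # the distinct labels: "blue" "green" "red"
print(nlevels(f))      # count of levels: 3
```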
Matrix:
Accessing Elements:
Matrix_name[row, column]
Matrix Operations:
Addition:
Multiplication:
Subtraction:
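The three matrix operations can be sketched as follows (the matrices are our own examples; note that `%*%` is true matrix multiplication, while `*` would be element-wise):

```r
m1 <- matrix(1:4, nrow = 2)              # filled column-major: rows (1 3) and (2 4)
m2 <- matrix(c(5, 6, 7, 8), nrow = 2)    # rows (5 7) and (6 8)
print(m1 + m2)     # element-wise addition
print(m1 - m2)     # element-wise subtraction
print(m1 %*% m2)   # matrix multiplication
print(m1[2, 1])    # Matrix_name[row, column] access
```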
Data Frames:
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a
different mode of data. The first column can be numeric while the second column can be character
and the third column can be logical. A data frame is a list of vectors of equal length. Data frames
are created using the data.frame() function.
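A sketch of a data frame whose columns have different modes, as described above (names and values are our own example):

```r
df <- data.frame(
  id     = c(1, 2, 3),                 # numeric column
  name   = c("Asha", "Ravi", "Mira"),  # character column
  passed = c(TRUE, FALSE, TRUE)        # logical column
)
print(df)
print(df$name)   # a single column is a vector
print(nrow(df))  # number of rows: 3
```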
Conclusion: In this practical we have learned the basics of the R language and its implementation
using the R GUI.
Practical-3
Aim: Write a program to perform the various steps of preprocessing on the given
relational database/warehouse/files (Data Preparator).
Data Cleaning
Data cleaning facilities include character removal, text replacement, and date conversion.
Data Import/Export
Data Preparator can be used to import data from a database and export them to a file and
vice versa.
Data Integration
Two operators, Append and Merge, can be used to combine data from different data
sources.
Data Reduction
Data reduction can be achieved using sampling and record selection.
Data Transformation
Data Preparator can be used to pre-process data for data mining. It transforms training
data using a series of transformations and in the process creates a model which can be
used to transform corresponding test/execution data.
Data Preparator provides many operators for transforming data.
Data Visualization
Data visualization can be performed using a variety of statistical plots.
Main Window
The initial window. Open a data source from the Start dialog, then drag operators from the list on
the left to the panel on the right.
Operator Tree
An operator tree is a tree of operators (preprocessing transformations) that are to be applied to the
data. The nodes of the tree represent the operators and the links between the nodes show
dependencies between the operators. The root of the tree --- the Data node --- is created
automatically after opening a data source. With each node is associated an operator dialog which
is displayed when the user clicks on the gray area of the node. Operators are initialized by entering
required details into operator dialogs.
Creating Nodes
To create a new node, drag an operator name from the list of names on the left hand side of the
main window and drop it on the display pane on the right hand side
Connecting Nodes
To link two nodes by an arrow, press the mouse button on the gray area of the first node and drag
the mouse to the second node. Release the mouse button on the gray area of the second node.
Moving Nodes
To move a node to a different location on the display pane, press the mouse button on the coloured
bar at the top of the node and drag the node to the desired location.
Types of Nodes
There are five types of nodes, distinguished by the colour of the bar at the top of the node icon.
Green node is the Data node. It is the root of the operator tree. There can be only one green
node.
Blue nodes are preprocessing nodes that will be included in the corresponding Model Tree.
They represent the transformations that will also be performed on the test or execution data
sets.
Red nodes are output nodes that display or save results. They cannot have descendants.
Yellow nodes are file utils nodes which can only have the Data node or another File Utils
node as the parent node. However, they can have other nodes as descendants.
Gray nodes are preprocessing nodes that will not be included in the corresponding model
tree. There is only one gray node: Sample. Sampling would not be meaningful for test or
execution data sets.
1.Attribute Operator
1) Outlier
Z-Score Method
This operator uses the Z-Score method to handle outliers in numeric attributes, and a frequency
based approach to handle outliers in nominal attributes.
Numeric Attributes
The Z-Score method uses the z-score statistic, defined as
z = (x − mean) / standard deviation
It gives the number of standard deviations a value lies above or below the mean. An outlier is a
value whose z-score is above a specified upper limit or below a specified lower limit.
1. Winsorize (replace outliers with the values corresponding to the specified zscore limits).
2. Remove the records containing outliers from the data set.
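The two numeric options above can be sketched in R (the course's language) with made-up data; the z-score limit of 2 and the sample values are our own assumptions, not the tool's defaults:

```r
x <- c(10, 12, 11, 13, 12, 11, 13, 12, 95)   # 95 is an obvious outlier
z <- (x - mean(x)) / sd(x)                   # z-score of each value
limit <- 2

# Option 1: winsorize - clamp values to the bounds implied by the z-score limits
lo <- mean(x) - limit * sd(x)
hi <- mean(x) + limit * sd(x)
x_wins <- pmin(pmax(x, lo), hi)

# Option 2: remove the records containing outliers
x_clean <- x[abs(z) <= limit]

print(x_wins)
print(x_clean)
```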
Nominal Attributes
In a nominal attribute, a value (label) that has a very low frequency of occurrence is considered to
be an outlier. There are two options for dealing with outliers:
1. Select the attributes for which to handle outliers by checking check boxes in the Select
column.
2. Select options for numeric attributes.
o Enter the minimum and maximum z-scores.
o Specify how to handle outliers in numeric attributes (winsorize or remove from
the data set).
3. Select options for nominal attributes.
o Set the minimum frequency for the values of nominal attributes in the Min count
spinner.
o Specify how to handle outliers in nominal attributes (replace with missing value
symbol or remove from the data set).
4. Compute statistics. Set the number of cases in the Num Cases to Filter spinner and press
the Calc Statistics button. This updates the mean, standard deviation and other statistics
required by the operator.
5. Click OK.
This option allows the user to remove attributes from the data or to move selected attributes
either to the leftmost or to the rightmost position in the data set.
Delete
Move
The range of a numeric attribute is divided into intervals and each interval is given a label. Attribute
values are replaced by the labels of the intervals into which they fall.
Interval width is computed by dividing the attribute range by the number of intervals.
Equal frequency discretization divides the range of a numeric attribute into a given number of
intervals containing equal (or nearly equal) number of values.
The main advantage of this method is that it does not require sorting of attributes. The disadvantage
is that the resulting intervals are only approximate.
This operator allows you to discretize numeric attributes manually by entering the
desired cut points.
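Both equal-width discretization and manual cut points can be sketched with R's `cut()` function (the age data, bin count and cut points are our own illustrative choices, not the tool's):

```r
ages <- c(12, 25, 37, 44, 51, 63, 70)

# Equal-width: the range is divided into 3 intervals of equal width
eq_width <- cut(ages, breaks = 3)
print(table(eq_width))

# Manual discretization: user-supplied cut points with interval labels
manual <- cut(ages, breaks = c(0, 18, 60, 100),
              labels = c("young", "adult", "senior"))
print(manual)
```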
This command provides operators for handling missing values. Currently the following methods
are available:
1. Delete Cases
2. Remove Attributes
3. Impute Values
4. Predict from Model
5. Create Missing Value Patterns
1. Delete Cases
This method removes cases containing missing values from the data set. This is a commonly used
approach referred to as listwise deletion or casewise deletion.
2. Remove Attributes
This operator removes attributes containing missing values from the data set.
3. Impute Values
This operator replaces missing values with imputed values. It uses single-value imputations where
all missing values in an attribute are filled with the same imputed value. The problem with this
approach is that it can lead to bias. Commonly used imputations are the attribute mean or median
for numeric attributes, and the mode for nominal attributes.
4. Predict from Model
This operator replaces missing values with values predicted by a prediction model.
5. Create Missing Value Patterns
This operator adds new attributes (missing value patterns) to the data set. It creates a new two-
valued (dichotomous) variable for each selected attribute containing missing values. The values of
the new variable represent two possible states: "value is present" and "value is missing".
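The single-value imputation of method 3 (fill every missing value of an attribute with the attribute mean) can be sketched in R with made-up data:

```r
x <- c(4, NA, 6, 8, NA, 10)                 # attribute with missing values
x_imputed <- x
# Replace every NA with the mean of the observed values (single-value imputation)
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)
print(x_imputed)   # all NAs filled with the same imputed value, 7
```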
This operator reduces the number of labels of nominal attributes by keeping up to a given number
of most frequent labels and creating a new label from all the remaining labels.
If a nominal attribute has more labels than the specified maximum number m, then the first m-1
labels with the largest frequencies will be retained and one new label will be created out of all the
remaining labels.
7. Scale Numeric Attributes
This command provides operators for scaling (or normalizing) numeric attributes. Scaling is
required for data mining algorithms that accept only attribute values within certain ranges, for
example neural networks, clustering and nearest-neighbour methods. Scaling is also needed to
prevent bias when attributes have very different ranges (e.g., age and salary). Currently the
following scaling methods are provided:
1. Linear
2. Decimal
3. Hyperbolic Tangent
4. Soft-Max
5. Z-Score
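Two of the methods listed above, linear (min-max) scaling and z-score scaling, can be sketched in R with made-up data:

```r
x <- c(18, 25, 40, 60)

# Linear (min-max) scaling to the range [0, 1]
minmax <- (x - min(x)) / (max(x) - min(x))

# Z-score scaling: result has mean 0 and standard deviation 1
zscore <- (x - mean(x)) / sd(x)

print(minmax)
print(round(zscore, 3))
```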
Select Attributes
This command selects a subset of attributes. The following methods are currently provided:
1. Manual Selection
2. Mutual Information Selection
3. Robust Mutual Information Selection
2. Record operator
1. Sample
This operator creates a sample from the data set. The following sampling methods are provided:
1. Random
2. One in K
3. First K
Random sampling selects cases at random according to a given percentage. One-in-K sampling
selects every K-th case. First-K sampling selects the first K cases from the data set.
2.Select Records
This operator selects records (cases) from the data set, based on a specified key and key values.
o If the Include radio button is selected then all the cases satisfying the selection
criteria will be included in the resulting data set.
o If the Remove radio button is selected then all the cases satisfying the selection
criteria will be removed from the resulting data set.
3. Select a key (key attribute) from the list on the left.
4. Select key values.
o For a nominal key attribute, select one or more key values from the list on the right.
(Press ctrl key and click appropriate rows, then click Add to Selected).
o For a numeric key attribute, specify the range of values to be selected by entering
the minimum value in Target Min and the maximum value in Target Max spinners.
5. Press Add to Selected.
Repeat the steps 3, 4 and 5 above for other key attributes as needed.
6. Click OK.
3.Utils
1. File utils
Training File
Test File
Validation File
This may be useful when experimenting with algorithms for handling missing values.
1.3 Append
The Append operator can be used to append cases from a file, a database table or an Excel
worksheet to the end of the current data set.
The appended rows and the current data set must have the same number of attributes and the
corresponding attributes must be of the same type.
1.4 Balance
This operator creates a new file in which the labels of a selected nominal attribute (balancing
attribute) have approximately equal frequencies.
1.5 Merge
The Merge operator merges a sorted data set with another sorted data set into a single file.
1.6 Sort
This operator sorts the data set in ascending or descending order of a specified key attribute and
creates a sorted file.
The Add Columns operator adds (appends) columns from a file, a database table or an Excel
worksheet to the right of the rightmost column of the current file. The number of rows in the
resulting file is equal to the number of rows in the smaller file.
This operator changes the names (identifiers) of selected attributes and/or attribute values (labels).
It creates a new file containing the changed identifiers.
1.10 Encode
A phonetic algorithm encodes words on the basis of their pronunciation. Similarly sounding words
will have similar code.
Soundex Algorithm
Metaphone Algorithm
1.11 Join
Smooth Columns
The Smooth Columns operator reduces the number of distinct values of selected numeric attributes
by replacing the original values with estimated values.
1. Bin Average
2. Bin Boundaries
3. Bin Midpoint
4. Rounding
The Split Columns operator splits a file into two files by a specified attribute (column).
The columns with indices less than the specified column index are written to one file and the
remaining columns to another file.
This operator transforms attribute values using several common mathematical functions:
ln(x), log2(x), log10(x), exp(x), sqrt(x), sin(x), cos(x), tan(x), 1/x, x^2, x^3
4. Output
1. Statistics
This operator displays statistical information for the attributes in the data set.
Minimum
Maximum
Range
Mean
Standard deviation
Number of missing values
Percentage of missing values
2. Table
It reads one page of data at a time. The page size is set to 100 lines. Initially the first page of 100
lines is loaded.
3. File Output
4. Database Output
1. JDBC/ODBC
2. Other
5. Excel Output
This operator saves output to an Excel spreadsheet (.xls file). Only up to 256 columns and 65,535
rows are allowed.
6. Visualize Data
This option provides several commonly used visualization techniques. The charts listed below,
with the exception of Dependency Tree and Parallel Coordinates, are plotted using the open source
library JFreeChart, from http://www.object-refinery.com/.
Numeric Attributes
1. Univariate Plots
2. Bivariate Plots
3. Conditional Plots
4. Matrix Plots
Nominal Attributes
1. Univariate Charts
2. Stacked Bar Charts
Mixed Attributes
1. Dependency Tree
2. Parallel Coordinates
Performed Steps:
1.) The starting screen will look like this.
3.) The ‘Handle outliers’ option can be found on the left side of the tool.
4.) After adding the Statistics node, this screen is shown by clicking on it.
6.) Discretize:
Practical-4
Aim: Describing data and its statistical analysis graphically using R programming.
Perform association rule mining using R programming.
Software Requirement: R GUI
Theory:
R Programming language has numerous libraries to create charts and graphs.
Pie-Chart
A pie chart is a representation of values as slices of a circle with different colors. The slices are
labeled, and the numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie() function, which takes positive numbers as a vector input.
Additional parameters are used to control labels, colors, the title, etc.
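A minimal pie() sketch (the city names and values are our own example data):

```r
# Values and labels for the slices; labels() is a base function, so we use lbls
slices <- c(21, 62, 10, 53)
lbls <- c("London", "New York", "Singapore", "Mumbai")
pie(slices, labels = lbls, main = "City pie chart")
```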
Barcharts
A bar chart represents data in rectangular bars with length of the bar proportional to the value of the variable.
R uses the barplot() function to create bar charts. R can draw both vertical and horizontal bars
in a bar chart. Each of the bars can be given a different color.
library("xlsx")
library(plotrix)
# 'data' is the data frame loaded earlier (the loading step is shown in the screenshot)
x <- table(data$Gender)
barplot(x)
Line Graphs
A line chart is a graph that connects a series of points by drawing line segments between them. These
points are ordered in one of their coordinate (usually the x-coordinate) value. Line charts are usually
used in identifying the trends in data.
The plot() function in R is used to create the line graph.
x <- table(data$Gender)
plot(x, col = "red", xlab = "Number of students", ylab = "Gender", main = "Line Chart")
Boxplots
Boxplots are a measure of how well the data in a data set is distributed. A boxplot divides the data
set into three quartiles. The graph represents the minimum, maximum, median, first quartile and
third quartile of the data set. It is also useful for comparing the distribution of data across data sets,
by drawing a boxplot for each of them.
Boxplots are created in R by using the boxplot() function.
library("xlsx")
library(plotrix)
Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is
similar to a bar chart, but the difference is that it groups the values into continuous ranges. The
height of each bar in a histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as input and uses some
more parameters to plot histograms.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
Examples:
library("xlsx")
library(plotrix)
x <- table(data$Gender)
hist(x, main = "Histogram of v", xlab = "Weight", ylab = "Frequency", col = "green", border = "red")
Association rule mining is the data mining process of finding the rules that may govern associations and
causal objects between sets of items.
So in a given transaction with multiple items, it tries to find the rules that govern how or why such items
are often bought together. For example, peanut butter and jelly are often bought together because a
lot of people like to make PB&J sandwiches.
Also, surprisingly, diapers and beer are bought together because, as it turns out, dads are often
tasked with the shopping while the moms are left with the baby.
The main applications of association rule mining:
Basket data analysis - is to analyze the association of purchased items in a single basket or
single purchase as per the examples given above.
Cross marketing - is to work with other businesses that complement your own, not
competitors. For example, vehicle dealerships and manufacturers have cross marketing
campaigns with oil and gas companies for obvious reasons.
Catalog design - the items in a business' catalog are often selected to complement each
other, so that buying one item will lead to buying another. These items are therefore often
complements or very related.
2) library("arules")
4) summary(groceries)
5) inspect(groceries[1:5])
• inspect() displays the individual transactions (or rules) themselves, so that they can be
examined one by one.
6) itemFrequency(groceries[, 1:3])
8) image(groceries[1:5])
10) apriori(groceries)
14) summary(groceryrules)
15) inspect(groceryrules[1:3])
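The steps above can be sketched end-to-end as follows. This is a hedged sketch that assumes the "arules" package and its bundled Groceries dataset; the support/confidence thresholds are our own illustrative choices. Note that R is case-sensitive: summary() and inspect() must be lower-case.

```r
library(arules)
data("Groceries")                       # transaction data shipped with arules
summary(Groceries)                      # overall statistics of the transactions
inspect(Groceries[1:5])                 # display the first five transactions
itemFrequency(Groceries[, 1:3])         # support of the first three items
# Mine rules with explicit minimum support and confidence
rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25))
summary(rules)
inspect(rules[1:3])                     # examine the first three rules
```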
Conclusion:
From this practical we learnt how different types of charts are formed from data, and how one can
gather knowledge by analyzing these charts.
Practical-5
Aim: Perform different data mining activities using the Weka Explorer tool and the
Experimenter tool (open-source data mining).
Theory:
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It
is also well-suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License.
Step-2: Preprocessing the data via the WEKA Explorer, where the different options give us an
idea of performing data preprocessing tasks.
Step-4: Here the weather-nominal data set is selected, on which various data analysis tasks are to
be performed. The dialog box shows the plot of weather-nominal for the default selection of
attributes in the data set.
Step-5: The relations between different attributes of the data set can be mapped in this tool box,
where all the attributes are shown and the relations amongst them can be explored.
Step-6: Deciding on a classifier is very important in the WEKA Explorer, as it decides on which
basis the data is to be analysed.
Step-7: Results in WEKA can be studied in different windows based on the classifier chosen, and
WEKA gives different types of plots accordingly, such as trees, bar plots, histograms, etc.
Step-8: WEKA gives us the freedom of selecting the algorithm for association of data; by default
it is the Apriori algorithm.
Step-9 : WEKA also allows us to specifically perform data visualization by taking certain
attributes.
Step-10 : Here the figure shows the GNU plot matrix of selected attributes on which it produces
certain different visualizations models.
- WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and their
application to real-world data mining problems.
- It is a collection of machine learning algorithms for data mining tasks. The algorithms are applied
directly to a dataset.
- WEKA implements algorithms for data preprocessing, classification, regression, clustering and
association rules; it also includes visualization tools.
- WEKA expects the data file to be in the Attribute-Relation File Format (ARFF).
Weka Options
1. Weka Explorer: Weka Explorer is an environment for exploring data.
2. Experimenter: Experimenter is an environment for performing experiments and
conducting statistical tests between learning schemes.
3. KnowledgeFlow: Knowledge Flow is a Java-Beans-based interface for setting up and
running machine learning experiments.
WeatherNominal.arff
- Once the data is loaded, WEKA recognizes attributes that are shown in the ‘Attribute’ window.
Left panel of ‘Preprocess’ window shows the list of recognized attributes:
1. No. is a number that identifies the order of the attribute as they are in data file
2. Selection tick boxes allow you to select the attributes for working relation
3. Name is a name of an attribute as it was declared in the data file.
- The ‘Current relation’ box above ‘Attribute’ box displays
1. The base relation (table) name and the current working relation
2. The number of instances
3. The number of attributes
- During the scan of the data, WEKA computes some basic statistics on each attribute.
- The following statistics are shown in ‘Selected attribute’ box on the right panel of ‘Preprocess’ window:
1. Name is the name of an attribute
2. Type is most commonly Nominal or Numeric
3. Missing is the number (percentage) of instances in the data for which this attribute is unspecified
4. Distinct is the number of different values that the data contains for this attribute
5. Unique is the number (percentage) of instances in the data having a value for this
attribute that no other instances have.
Steps:
1. Click on the Preprocess tab. Open file. Open the weathernominal.arff file.
Classify
- Classifiers in WEKA are the models for predicting nominal or numeric quantities.
- Click on the ‘Choose’ button in the ‘Classifier’ box just below the tabs and select the C4.5 classifier:
WEKA → Classifiers → Trees → J48
2. Supplied test set: Evaluates the classifier on how well it predicts a set of instances loaded
from a file. Clicking on the ‘Set…’ button brings up a dialog allowing you to choose the file
to test on.
3. Cross-validation: Evaluates the classifier by cross-validation, using the number of folds that are
entered in the ‘Folds’ text field.
4. Percentage split: Evaluates the classifier on how well it predicts a certain percentage of the
data, which is held out for testing. The amount of data held out depends on the value entered in
the ‘%’ field.
Associate
- WEKA contains an implementation of the Apriori algorithm for learning association rules.
- It works only with discrete data and will identify statistical dependencies between groups of attributes
- Apriori can compute all rules that have a given minimum support and exceed a given confidence.
- The association rule scheme cannot handle numeric values.
WEKA EXPERIMENTER
Step-2: After clicking New, default parameters for the Experiment are defined.
Step-3: Dataset files can be added either with an absolute path or with a relative one. The latter often makes it easier to run experiments on different machines, so one should check Use relative paths before clicking on Add new.... In this example, open the data directory and choose the weather.arff dataset.
Step-4: With the Choose button one can open the GenericObjectEditor and choose another
classifier. Additional algorithms can be added again with the Add new... button, e.g., the J48
decision tree.
Step-5: To run the current experiment, click the Run tab at the top of the Experiment
Environment window. Click Start to run the experiment. If the experiment was defined correctly,
the 3 messages shown above will be displayed in the Log panel.
Step-6: To analyse the current experiment, click on the Analyse tab of the Experiment Environment window.
Step-7: In the Analyse tab, click on the Perform test button; it will analyse the current dataset and generate the result matrix.
Conclusion:
Thus, we have studied the Explorer window of the WEKA tool, and we have also studied the WEKA Experimenter.
Practical 6
AIM: Performing Linear Regression Analysis using R Programming.
HARDWARE REQUIRED: --
THEORY:
Regression analysis is a widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value
is gathered through experiments. The other variable is called the response variable, whose
value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the
exponent (power) of both variables is 1. Mathematically, a linear relationship represents a
straight line when plotted as a graph. A non-linear relationship, where the exponent of a
variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
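The coefficients a and b are the quantities lm() estimates by least squares; as a sanity check, they can also be computed directly from the least-squares formulas (a sketch with made-up data):

```r
# Illustrative data (made up for this sketch)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.2, 5.9, 8.1, 9.8)

# Least-squares estimates of the slope a and intercept b in y = a*x + b
a <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b <- mean(y) - a * mean(x)
cat("a =", a, ", b =", b, "\n")   # a = 1.93, b = 0.23 for these values

# lm() reports the same values as its coefficients
print(coef(lm(y ~ x)))
```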
lm() Function:
This function creates the relationship model between the predictor and the response
variable.
Syntax: The basic syntax of the lm() function in linear regression is −
lm(formula, data)
Following is the description of the parameters used −
formula is a formula object describing the relation between x and y.
data is the data frame on which the formula will be applied (optional when, as below, the variables already exist in the workspace).
predict() Function:
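This function predicts the response for new values of the predictor using the fitted model. Its basic call is predict(object, newdata), where object is the model created by lm() and newdata is a data frame containing the new predictor values. A sketch reusing the grade data from the CODE section (the prediction point s1 = 8.0 is chosen for illustration):

```r
# Same illustrative data as in the CODE section
s1 <- c(6.42,7.05,6.68,8.42,8.41,9.05,9.00,9.10)
s2 <- c(7.4,8.05,7.78,8.0,8.96,9.0,9.0,9.0)
relation <- lm(s2 ~ s1)

# newdata must be a data frame whose column name matches the predictor
new_grade <- data.frame(s1 = 8.0)
result <- predict(relation, new_grade)
print(result)   # predicted s2 for s1 = 8.0 (about 8.39 with these values)
```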
CODE:
s1 <- c(6.42,7.05,6.68,8.42,8.41,9.05,9.00,9.10)  # predictor variable
s2 <- c(7.4,8.05,7.78,8.0,8.96,9.0,9.0,9.0)       # response variable
relation <- lm(s2 ~ s1)   # fit the linear model s2 = a*s1 + b
print(relation)           # fitted coefficients
print(summary(relation))  # detailed fit statistics (R-squared, p-values, ...)
OUTPUT:
OUTPUT:
CONCLUSION:
By this practical, we have learnt to perform linear regression on the data and to generate a
scatter plot of the data.
Practical 7
AIM: Perform Different Data Mining Activities using the Weka Knowledge Flow Tool.
H/W: -- Computer
Theory:
The Knowledge Flow provides an alternative, data-flow inspired interface to WEKA's algorithms. With it, users can:
o process multiple batches or streams in parallel (each separate flow executes in its own thread)
o chain filters together
o visualize performance of incremental classifiers during processing (scrolling plots of classification
accuracy, RMS error, predictions etc.)
o plugin facility for allowing easy addition of new components to the Knowledge Flow
Steps:
1. Select the ArffLoader component.
5. Go to Classifiers → trees → J48 (for evaluation, right-click and choose the batchClassifier connection).
Conclusion:
From this practical, we learnt about different data mining activities using the Weka Knowledge Flow tool.
Practical 8
AIM: Performing Time Series Analysis using RStudio.
HARDWARE REQUIRED: --
THEORY:
Time series is a series of data points in which each data point is associated with a
timestamp. A simple example is the price of a stock in the stock market at different points
of time on a given day. Another example is the amount of rainfall in a region at different
months of the year. R provides many functions to create, manipulate and plot time series
data. The data for a time series is stored in an R object called a time-series object, which
is an R data object like a vector or a data frame.
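The start and frequency arguments of ts() control how timestamps are assigned. For instance, the rainfall example above could be declared as monthly data like this (the rainfall figures are made up for illustration):

```r
# Twelve monthly rainfall readings (illustrative values)
rainfall <- c(79, 58, 76, 82, 99, 181, 226, 203, 159, 101, 52, 44)

# start = c(2012, 1): series begins January 2012; frequency = 12: monthly data
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)             # printed as a year-by-month table
print(frequency(rainfall.timeseries))  # 12
```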
CODE:
grade <- c(6.42,7.05,6.68,8.42,8.41,9.0,8.52,8.44)       # one grade per period
grade.timeseries <- ts(grade, start = 1, frequency = 1)  # series starting at time 1
print(grade.timeseries)
plot(grade.timeseries)
OUTPUT:
CODE:
gradeA <- c(6.42,7.05,6.68,8.42,8.41,9.0,8.52,8.44)   # first series
gradeB <- c(7.4,8.05,7.7,8.0,8.96,9.0,9.0,9.0)        # second series
combined.grade <- matrix(c(gradeA, gradeB), nrow = 8) # two series as columns
grade.timeseries <- ts(combined.grade, start = 1, frequency = 1)  # multivariate time series
print(grade.timeseries)
plot(grade.timeseries)
summary(grade.timeseries)
OUTPUT:
CONCLUSION:
By this practical, we have learnt to create different time series using R programming.
PRACTICAL-9
AIM: Perform Different Data Mining Activities using the XLMiner / Tanagra / Sipina /
RapidMiner / Orange / KNIME / Cluto tool.
Snapshots:
Step 1: Choose “Orange” from Programs. The first interface that appears looks like the
one given below.
Step 2: Select the File widget from the Data group and place it on the layout. Similarly, drag
and drop all the components. Now connect the File widget to the Data Table; this
links the two components. Do the same for all the other components.
Step 3: Now, click on the File widget to load a dataset. We will select the adult sample
dataset; it contains both categorical and numerical data.
Step 4: Now, click on the Data Table to visualize the data. It shows 977 instances with
14 features and a discrete class with 2 values. We can also see the numeric values
by clicking on ‘Visualize numeric values’. Here we perform data cleaning and
handle missing values; the table reports 1.0% missing values on its right side.
From the table we can see that some values are not filled in, so we have to fill
them and reduce the missing values to zero.
Step 5: Next, click on Outliers, which detects outlying instances. It reports 380 inliers
and 597 outliers.
Step 6: After detecting the outliers, click on the Data Table again; it now shows only the
outlier instances.
Step 7: Now we perform the preprocessing tasks to handle the missing values. Drag and
drop the Preprocess widget onto the right side of the canvas and choose Impute
Missing Values.
Step 8: Here we finally get no missing values; all the missing data has been filled in. We
can also replace values randomly by clicking on the second button, which assigns
a value only for the selected attribute. Alternatively, we can remove instances
with missing values, but then we are left with fewer instances.
Step 10: Let us take another preprocessing task: Discretize Continuous Variables, which
applies binning. Select Equal-Width Discretization with an interval of five.
Step 12: Here we get the 10 principal components for the given Y attribute, based on the
data table. Next, click on Distributions; it shows the graph of density vs. PC1 for
the given Y.
CONCLUSION:
Hence, from this practical we learnt to use an open-source data mining tool and performed
different data preprocessing techniques, like cleaning and handling missing values, using
the Orange tool.