DMBI Lab Manual Final
New Delhi
LAB MANUAL
MISSION OF INSTITUTE
DEPARTMENT of COMPUTER SCIENCE and ENGINEERING
VISION OF DEPARTMENT
MISSION OF DEPARTMENT
PROGRAM OUTCOMES (POs)
1. Engineering knowledge: Apply the knowledge acquired in mathematics, science and engineering to the
solution of complex engineering problems.
2. Problem analysis: Identify research gaps, formulate and analyze complex engineering problems,
drawing substantiated conclusions using basic knowledge of mathematics, natural sciences and
engineering sciences.
3. Design/development of solutions: Design solutions for the identified complex engineering
problems as well as develop solutions that meet the specified needs for the public health and safety,
and the cultural, societal and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods, including design of experiments, analysis and interpretation of data and synthesis of
the information to provide valid conclusions.
5. Modern tool usage: Work with the latest technologies, resources and software tools, including prediction
and modelling, for complex engineering activities with an understanding of their limitations.
6. The engineer and society: Apply the basic acquired knowledge to assess societal, health, safety,
legal and cultural issues and identify the consequent responsibilities relevant to professional
engineering practice.
7. Environment and sustainability: Comprehend the impact of professional engineering solutions in
the context of society and environment, and demonstrate the need for and knowledge of sustainable
development.
8. Ethics: Apply ethical principles and commit to the professional ethics, responsibilities and norms of
engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports
and design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning.
TABLE OF CONTENTS
S. No.
1. Course details
1.1 Course objective
1.2 Course outcomes
1.3 CO-PO/PSO mapping
1.4 Evaluation scheme
1.5 Guidelines/Rubrics for continuous assessment
1.6 Lab safety instructions
1.7 Instructions for students while writing Experiment in Lab file
4. Experiment details
1. COURSE DETAILS
1.1 COURSE OBJECTIVE
The objective of the course is to familiarize students with the discipline of data mining and its
techniques, and to help them solve real-world problems using data mining techniques and
algorithms.
1.2 COURSE OUTCOMES
At the end of the course the student will be able to:            PO/PSO
ETCS457.1  Define and understand the business requirements,     PO1, PO2, PO3, PO5,
           ETL process, dimensional analysis and                PSO1, PSO2
           information flow.
ETCS457.6  Demonstrate knowledge of the data mining process,    PO1, PO3, PO4, PO9,
           including data preparation, modelling and            PSO1, PSO2
           evaluation.
1.3 MAPPING OF COURSE OUTCOMES (CO) TO PROGRAM OUTCOMES (PO) /
PROGRAM SPECIFIC OUTCOMES (PSO)
Columns: PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
CO     Non-empty entries, left to right
CO1    3 2 2 2 2 2
CO2    3 3 2 2 3 2
CO3    3 3 2 1 1
CO4    2 3 1 3 2 2 2
CO5    3 2 1 3 2 2 1
CO6    3 2 3 2 2 2 2 3 2
1.4 EVALUATION SCHEME
Laboratory
Components     Internal    External
Marks          40          60
Total Marks    100
1.5 GUIDELINES/RUBRICS FOR CONTINUOUS ASSESSMENT
⮚ File [5 Marks]
⮚ Viva-Voce [5 Marks]
● 2 innovative experiments (content beyond syllabus): 10 marks for 1st & 2nd Semester
● 2 innovative experiments (content beyond syllabus): 5 marks for 3rd, 4th, 5th, 6th, 7th and 8th Semester
● Viva: 5 marks for 3rd, 4th, 5th, 6th, 7th and 8th Semester
The rubrics for experiment execution and lab file + viva voce are given below:

Status   File contents & checked   File contents & checked not    File contents & checked
         timely                    timely (after one week)        after two weeks
Marks    4-5                       2-3                            0-1

Status   Viva (Good)               Viva (Average)                 Viva (Unsatisfactory)
Marks    4-5                       1-3                            0
Note: Viva Voce Questions for each experiment should be related to Course
Outcomes.
1.6 Safety Guidelines/Rules for Laboratory (AS PER SUBJECT/LABORATORY)
1.7 Format for students while writing Experiment in Lab file.
Experiment No: 1
Aim:
Course Outcome:
Software used:
Theory:
Flowchart/Algorithm/Code:
Results:
2. LIST OF EXPERIMENTS AS PER GGSIPU
8. Study of DBMINER tool and study of ARMINER tool (CO6)
3. EXPERIMENTAL SETUP DETAILS FOR THE COURSE
Software Requirements:
Java, Windows, WEKA
1. Introduction
WEKA, formally the Waikato Environment for Knowledge Analysis, is a computer program
developed at the University of Waikato in New Zealand, originally for the purpose of identifying
information from raw data gathered from agricultural domains. WEKA supports many
standard data mining tasks such as data preprocessing, classification, clustering, regression,
visualization and feature selection. The basic premise is that a computer application can be
trained to perform machine learning tasks and derive useful information in the form of trends
and patterns. WEKA is an open source application that is freely available under the GNU General
Public License. Originally written in C, WEKA has been completely rewritten in Java and is
compatible with almost every computing platform. It is user friendly, with a graphical interface
that allows for quick setup and operation. WEKA assumes that the user data is available as a flat
file or relation; that is, each data object is described by a fixed number of attributes, each
usually of a specific type, normally alphanumeric or numeric values. WEKA gives novice users a
tool for identifying hidden information in databases and file systems, with simple-to-use options
and visual interfaces.
Installation of WEKA Tool
Information about the program can be found by searching the Web for WEKA Data
Mining or by going directly to the site at www.cs.waikato.ac.nz/~ml/WEKA . The site has a
large amount of useful information on the program's benefits and background. New users may
benefit from reading the user manual for the program. The main WEKA site links to this
information, as well as to past experiments that can help new users identify the uses that
might be of particular interest to them. When downloading the software, it is best to
select the latest version offered on the site. The application is distributed as a
self-installing package; installation is a simple procedure that places the complete program
on the end user's machine, ready to use once extracted.
Fig 1: Installation
Once the program has been installed on the user's machine, it is opened from the operating
system's program start menu; the exact steps depend on the operating system. Figure 1 is an example
of the initial opening screen on a computer with Windows XP.
Fig 2: GUI
There are four options available on this initial screen.
♦ Simple CLI: lets users execute WEKA commands from a terminal-style window, without the
graphical interface.
♦ Explorer: the graphical interface used to conduct experimentation on raw data.
♦ Experimenter: allows users to run different experimental variations on data sets and
perform statistical manipulation.
♦ Knowledge Flow: essentially the same functionality as Explorer, with a drag-and-drop
interface. The advantage of this option is that it supports incremental
learning from previous results.
While each option can be useful for different applications, the rest of this manual
focuses on the Explorer option, since the tabs described below belong to it.
After selecting the Explorer option, the program starts and provides the user with a separate
graphical interface.
Figure 3 shows the opening screen with the available options. At first, only the Preprocess tab
in the top left corner can be selected, because the data set must be presented to the application
before it can be manipulated. After the data has been loaded, the other tabs
become active.
There are six tabs:
1. Preprocess: used to choose the data file to be used by the application
2. Classify: used to train and test different learning schemes on the preprocessed data file under
experimentation
3. Cluster: used to apply different tools that identify clusters within the data file
4. Associate: used to apply different rules to the data file that identify associations within the
data
5. Select attributes: used to apply different techniques for identifying the most relevant
attributes in the data set
6. Visualize: used to view the results of the various manipulations of the data set in 2D
format, as scatter plot and bar graph output
Once the initial preprocessing of the data set has been completed, the user can move between the
tabs to change the experiment and view the results in real time. This makes it possible to
move from one option to the next, so that when an interesting condition is exposed in one tab it
can immediately be examined visually in another.
A) Preprocessing
In order to experiment with the application, the data set needs to be presented to
WEKA in a format that the program understands. There are rules for the type of data that WEKA
will accept, and three options for loading data into the program.
♦ Open File: allows the user to select files residing on the local machine or on
recorded media
♦ Open URL: provides a mechanism to locate a file or data source at a
different location specified by the user
♦ Open Database: allows the user to retrieve files or data from a database source
provided by the user
There are restrictions on the type of data that can be accepted into the program. Originally the
software was designed to import only ARFF files; newer versions also accept other file types such
as CSV, C4.5 and serialized instance formats. The extensions for these files
include .csv, .arff, .names, .bsi and .data. Figure 3 shows an example of selecting the file
weather.arff.
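To make the ARFF layout concrete, the following minimal Python sketch reads a small ARFF text into a relation name, an attribute list and data rows. It is an illustration only, not WEKA's own loader: the sample data is an abridged, hypothetical stand-in for weather.arff, the function name parse_arff is invented here, and the sketch assumes plain comma-separated values with no quoting or sparse rows.

```python
# Minimal ARFF reader: a sketch, not WEKA's parser.
# Assumes unquoted, comma-separated values and no sparse data.

SAMPLE_ARFF = """\
% abridged, illustrative weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

def parse_arff(text):
    """Return (relation_name, attributes, rows) parsed from ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True                      # everything after this is data
    return relation, attributes, rows

relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(rows))
```

Each data row has exactly one value per declared attribute, which is the "fixed number of attributes" requirement of a flat file.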
OUTPUT
Once the initial data has been selected and loaded, the user can select options for refining the
experimental data. The options in the Preprocess window include optional filters to
apply, and the user can select or remove attributes of the data set as necessary to isolate
specific information. Picking from the available attributes lets users separate
different parts of the data set for clarity in the experimentation. Deselecting attributes
from the original data set changes which relationships among the remaining attributes
are analyzed. Many different filtering options are available within the
Preprocess window, and the user can choose among them based on need and the type of
data present.
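The effect of removing attributes in the Preprocess window can be sketched in a few lines of Python. This is only an illustration of the idea; the function name remove_attributes and the sample columns are hypothetical and do not correspond to any WEKA API.

```python
def remove_attributes(header, rows, to_remove):
    """Return a copy of the data set without the named attributes."""
    keep = [i for i, name in enumerate(header) if name not in to_remove]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows

# Illustrative data: three attributes, two instances.
header = ["outlook", "temperature", "play"]
rows = [["sunny", "85", "no"], ["overcast", "83", "yes"]]

new_header, new_rows = remove_attributes(header, rows, {"temperature"})
print(new_header)
print(new_rows)
```

The original header and rows are left unchanged, so several attribute selections can be compared against the same loaded data.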
B) Classify
The user can apply many different algorithms to the data set, each of which in theory
produces a representation of the information that makes observation easier. It is difficult to
know in advance which option will give the best output for an experiment. The best
approach is to apply several of the available choices independently and see which yields
something close to the desired results. The Classify tab is where the user selects the classifier.
Figure 5 shows some of the categories.
There are several further options inside the Classify tab. The Test options box gives the user
a choice of four test modes to use on the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split
Any or all of the modes can be applied to produce results that the user can compare.
Additionally, the Test options box offers a menu of further settings, such as
saving the results to a file or specifying the random seed value to be applied for the classification.
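As an illustration of the cross-validation test mode, the sketch below splits instance indices into k folds, with each fold serving once as the test set while the remaining folds form the training set. It is a simplified, unstratified version written for this manual; the function name k_fold_indices and the default seed are assumptions, and it makes no claim about how WEKA forms its folds internally.

```python
import random

def k_fold_indices(n, k, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # seeded so runs are repeatable
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Example: 14 instances, 7 folds of 2 instances each.
splits = list(k_fold_indices(14, 7))
for train, test in splits:
    print(len(train), len(test))
```

Every instance is used for testing exactly once, which is why cross-validation gives a less optimistic estimate than evaluating on the training set itself.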
By default, the classifiers in WEKA are trained to predict the last attribute in the data set,
which is treated as the class. To use a different attribute as the class, the user must select it
in the options menu before testing is performed. Once the results have been calculated, they are
shown in the text box on the lower right. They can be saved to a file and retrieved later for
comparison, or viewed within the window after changes have been made and different results derived.
C) Cluster
The Cluster tab is used to identify commonalities, or clusters of occurrences,
within the data set and to produce information for the user to analyze. The cluster window has a
few options similar to those described for the Classify tab: use
training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation,
which measures how well the clusters found match a pre-assigned class within the data. While in
cluster mode, users can choose to ignore some of the attributes in the data set. This can
be useful if specific attributes are causing the results to be out of range, or for large data
sets. Figure 6 shows the Cluster window and some of its options.
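The idea behind classes to clusters evaluation can be sketched as follows: each cluster is labelled with a class, and the instances whose own class differs from their cluster's label are counted as errors. The sketch below uses a simple majority vote per cluster, which is a simplifying assumption and may differ from the mapping WEKA computes; the function name and sample data are hypothetical.

```python
from collections import Counter

def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and return the error rate."""
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1
    # Majority class per cluster (simplification: clusters may share a class).
    mapping = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}
    wrong = sum(1 for c, y in zip(cluster_ids, class_labels) if mapping[c] != y)
    return mapping, wrong / len(class_labels)

# Illustrative cluster assignments and pre-assigned classes for six instances.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["yes", "yes", "no", "no", "no", "no"]

mapping, error_rate = classes_to_clusters(clusters, classes)
print(mapping, error_rate)
```

The class attribute is used only for this comparison; it plays no part in forming the clusters.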
D) Associate
The Associate tab opens a window for selecting the options for finding associations within the
data set. The user selects one of the choices and presses Start to produce the results. There are
few options for this window; they are shown in Figure 7 below.
Fig 7: Associate Tab
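Association rules are commonly judged by two measures, support and confidence. The text above does not name them, so treat the following Python sketch as background under that assumption: support is the fraction of transactions containing an itemset, and the confidence of a rule A => B is support(A and B) divided by support(A). The transactions and function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Illustrative market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here bread and milk appear together in two of four transactions, and two of the three bread transactions also contain milk.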
Experiment 1
CO1: Define and understand the business requirements, ETL process, dimensional
Analysis and Information flow.
You need to load your data warehouse regularly so that it can serve its purpose of facilitating
business analysis. To do this, data from one or more operational systems needs to be extracted
and copied into the data warehouse. The challenge in data warehouse environments is to
integrate, rearrange and consolidate large volumes of data over many systems, thereby providing
a new unified information base for business intelligence.
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL
refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps too
simplistic, because it omits the transportation phase and implies that each of the other phases of
the process is distinct. Nevertheless, the entire process is known as ETL.
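A minimal end-to-end sketch may help fix the three phases in mind. The Python example below extracts rows from CSV text, transforms them (trimming names and dropping rows with a non-numeric amount), and loads them into an in-memory SQLite table that stands in for the data warehouse. The source data, table name and column names are all hypothetical, and SQLite is used only because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical operational-system export.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,n/a
3,carol,7.25
"""

def extract(text):
    """Extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: normalize names, drop rows with invalid amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount is not numeric, so the row is rejected
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean

def load(conn, rows):
    """Loading: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")   # stand-in for the data warehouse
load(conn, transform(extract(SOURCE_CSV)))
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)
```

In practice the phases overlap, as the text notes: some filtering could equally happen during extraction.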
The methodology and tasks of ETL have been well known for many years, and are not
necessarily unique to data warehouse environments: a wide variety of proprietary applications
and database systems are the IT backbone of any enterprise. Data has to be shared between
applications or systems, trying to integrate them, giving at least two applications the same picture
of the world. This data sharing was mostly addressed by mechanisms similar to what we now
call ETL.
Extraction of Data
During extraction, the desired data is identified and extracted from many different sources,
including database systems and applications. Very often it is not possible to identify the specific
subset of interest in advance, so more data than necessary has to be extracted and the relevant
data is identified at a later point in time. Depending on the source system's
capabilities (for example, operating system resources), some transformations may take place
during this extraction process. The size of the extracted data varies from hundreds of kilobytes
up to gigabytes, depending on the source system and the business situation. The same is true for
the time delta between two (logically) identical extractions: the time span may vary between
days/hours and minutes to near real-time. Web server log files, for example, can easily grow to
hundreds of megabytes in a very short period of time.
Transportation of Data
The emphasis in many of the examples in this section is scalability. Many long-time users of
Oracle Database are experts in programming complex data transformation logic using PL/SQL.
These chapters suggest alternatives for many such data manipulation operations, with a particular
emphasis on implementations that take advantage of Oracle's new SQL functionality, especially
for ETL and the parallel query infrastructure.
Designing and maintaining the ETL process is often considered one of the most difficult and
resource-intensive portions of a data warehouse project. Many data warehousing projects use
ETL tools to manage this process. Oracle Warehouse Builder (OWB), for example, provides
ETL capabilities and takes advantage of inherent database abilities. Other data warehouse
builders create their own ETL tools and processes, either inside or outside the database.
Besides the support of extraction, transformation, and loading, there are some other tasks that are
important for a successful ETL implementation as part of the daily operations of the data
warehouse and its support for further enhancements. Besides the support for designing a data
warehouse and the data flow, these tasks are typically addressed by ETL tools such as OWB.
Oracle is not an ETL tool and does not provide a complete solution for ETL. However, Oracle
does provide a rich set of capabilities that can be used by both ETL tools and customized ETL
solutions. Oracle offers techniques for transporting data between Oracle databases, for
transforming large volumes of data, and for quickly loading new data into a data warehouse.
The successive loads and transformations must be scheduled and processed in a specific order.
Depending on the success or failure of the operation or parts of it, the result must be tracked and
subsequent, alternative processes might be started. The control of the progress as well as the
definition of a business workflow of the operations are typically addressed by ETL tools such as
Oracle Warehouse Builder.
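The scheduling and tracking requirement can be sketched without any ETL tool. The Python example below runs named steps in a fixed order, records the outcome of each, and starts an alternative process when a step fails. It is a toy illustration, not how Oracle Warehouse Builder works; the function run_pipeline and the step functions are hypothetical.

```python
def run_pipeline(steps, on_failure=None):
    """Run (name, callable) steps in order; stop at the first failure."""
    log = []
    for name, step in steps:
        try:
            step()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            if on_failure is not None:
                on_failure()                   # alternative process
                log.append(("on_failure", "ok"))
            break                              # later steps depend on this one
    return log

def extract():
    pass  # stands in for a real extraction step

def transform():
    raise ValueError("bad row")  # simulated failure

def load():
    pass  # never reached in this example

def notify():
    pass  # stands in for an alert or rollback

log = run_pipeline(
    [("extract", extract), ("transform", transform), ("load", load)],
    on_failure=notify,
)
print(log)
```

Because the load step is skipped after the failed transformation, the warehouse is never updated from partially processed data.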
Expected outcome attained: ETL processes, dimensional analysis and business requirements
understood properly.
CO1
Q. What is the ETL process?
Q. What is the difference between OLAP and data mining?
Q. What types of tasks are carried out during data mining?
CO2:
Q1
Q2
CO3:
Q1
Q2
Experiment 1
Aim: Study of ETL process and its tools.
Experiment 2
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt
or inaccurate records from a record set, table, or database. It involves identifying incomplete,
incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting
the dirty or coarse data. Data cleansing may be performed interactively with data
wrangling tools, or as batch processing through scripting.
After cleansing, a data set should be consistent with other similar data sets in the system. The
inconsistencies detected or removed may have been originally caused by user entry errors, by
corruption in transmission or storage, or by different data dictionary definitions of similar entities
in different stores. Data cleansing differs from data validation in that validation almost invariably
means data is rejected from the system at entry and is performed at the time of entry, rather than
on batches of data.
The actual process of data cleansing may involve removing typographical errors or validating
and correcting values against a known list of entities. The validation may be strict (such as
rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records
that partially match existing, known records). Some data cleansing solutions clean data by
cross-checking it against a validated data set. A common data cleansing practice is data enhancement,
where data is made more complete by adding related information, for example by appending
to each address any phone numbers related to it. Data cleansing may also involve
activities such as harmonization and standardization of data. An example of harmonization is
expanding short codes (st, rd, etc.) to actual words (street, road, etc.). Standardization of data is a means
of changing a reference data set to a new standard, for example the use of standard codes.
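Three of the techniques just described (harmonization of short codes, strict validation and fuzzy correction against a known list) can be sketched with Python's standard library. The abbreviation table, the list of known cities and the six-digit postal code pattern are assumptions chosen for illustration, as is the similarity cutoff of 0.8.

```python
import difflib
import re

ABBREVIATIONS = {"st": "street", "rd": "road"}     # illustrative short codes
KNOWN_CITIES = ["New Delhi", "Mumbai", "Chennai"]  # illustrative reference list
POSTAL_CODE = re.compile(r"^\d{6}$")               # assumed six-digit format

def harmonize(address):
    """Expand short codes such as 'st' and 'rd' to full words."""
    words = address.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def is_valid_postal_code(code):
    """Strict validation: accept only codes matching the pattern."""
    return bool(POSTAL_CODE.match(code))

def correct_city(name):
    """Fuzzy validation: return the closest known city, or None."""
    match = difflib.get_close_matches(name, KNOWN_CITIES, n=1, cutoff=0.8)
    return match[0] if match else None

print(harmonize("12 Park St."))
print(is_valid_postal_code("110001"), is_valid_postal_code("11000"))
print(correct_city("New Dehli"), correct_city("Paris"))
```

Strict validation rejects a value outright, while fuzzy correction repairs it when a sufficiently close known value exists.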
Experiment 3
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors
and inconsistencies from data in order to improve the quality of data. Data quality problems are
present in single data collections, such as files and databases, e.g., due to misspellings during
data entry, missing information or other invalid data. When multiple data sources need to be
integrated, e.g., in data warehouses, federated database systems or global web-based information
systems, the need for data cleaning increases significantly. This is because the sources often
contain redundant data in different representations. In order to provide access to accurate and
consistent data, consolidation of different data representations and elimination of duplicate
information become necessary.
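The consolidation step can be illustrated with a small Python sketch that merges records from two sources and eliminates duplicates that differ only in representation. The record fields, the two source lists and the normalization rules (case, spacing and phone punctuation) are hypothetical choices for this example.

```python
def normalize(record):
    """Build a comparison key that ignores case, spacing and punctuation."""
    name = " ".join(record["name"].lower().split())
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return (name, phone)

def merge_sources(*sources):
    """Merge record lists, keeping the first record seen for each key."""
    seen, merged = set(), []
    for source in sources:
        for record in source:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Two illustrative sources holding the same customer in different forms.
crm = [{"name": "Asha Rao", "phone": "011-2345 6789"}]
billing = [
    {"name": "asha  rao", "phone": "01123456789"},
    {"name": "Ravi Kumar", "phone": "022 1111 2222"},
]

merged = merge_sources(crm, billing)
print(merged)
```

Real data usually needs looser matching than an exact key comparison, but the structure (normalize, then compare) stays the same.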
Experiment 4
Aim: Introduction to WEKA tool
WEKA, formally the Waikato Environment for Knowledge Analysis, is a computer program
developed at the University of Waikato in New Zealand, originally for the purpose of identifying
information from raw data gathered from agricultural domains. WEKA supports many
standard data mining tasks such as data preprocessing, classification, clustering, regression,
visualization and feature selection. The basic premise is that a computer application can be
trained to perform machine learning tasks and derive useful information in the form of trends
and patterns. WEKA is an open source application that is freely available under the GNU General
Public License. Originally written in C, WEKA has been completely rewritten in Java and is
compatible with almost every computing platform. It is user friendly, with a graphical interface
that allows for quick setup and operation. WEKA assumes that the user data is available as a flat
file or relation; that is, each data object is described by a fixed number of attributes, each
usually of a specific type, normally alphanumeric or numeric values. WEKA gives novice users a
tool for identifying hidden information in databases and file systems, with simple-to-use options
and visual interfaces.
Installation of WEKA Tool
Information about the program can be found by searching the Web for WEKA Data
Mining or by going directly to the site at www.cs.waikato.ac.nz/~ml/WEKA . The site has a
large amount of useful information on the program's benefits and background. New users may
benefit from reading the user manual for the program. The main WEKA site links to this
information, as well as to past experiments that can help new users identify the uses that
might be of particular interest to them. When downloading the software, it is best to
select the latest version offered on the site. The application is distributed as a
self-installing package; installation is a simple procedure that places the complete program
on the end user's machine, ready to use once extracted.
Figure 1: Installation
Once the program has been installed on the user's machine, it is opened from the operating
system's program start menu; the exact steps depend on the operating system. Figure 1 is an example
of the initial opening screen on a computer with Windows XP.
There are four options available on this initial screen.
♦ Simple CLI: lets users execute WEKA commands from a terminal-style window, without the
graphical interface.
♦ Explorer: the graphical interface used to conduct experimentation on raw data.
♦ Experimenter: allows users to run different experimental variations on data sets and
perform statistical manipulation.
♦ Knowledge Flow: essentially the same functionality as Explorer, with a drag-and-drop
interface. The advantage of this option is that it supports incremental
learning from previous results.
Figure 2: GUI
While each option can be useful for different applications, the rest of this manual
focuses on the Explorer option, since the tabs described below belong to it.
After selecting the Explorer option, the program starts and provides the user with a separate
graphical interface.
There are six tabs:
1. Preprocess: used to choose the data file to be used by the application
2. Classify: used to train and test different learning schemes on the preprocessed data file under
experimentation
3. Cluster: used to apply different tools that identify clusters within the
data file
4. Associate: used to apply different rules to the data file that identify
associations within the data
5. Select attributes: used to apply different techniques for identifying the most relevant
attributes in the data set
6. Visualize: used to view the results of the various manipulations of the
data set in 2D format, as scatter plot and bar graph output
Once the initial preprocessing of the data set has been completed, the user can move between the
tabs to change the experiment and view the results in real time. This makes it possible to
move from one option to the next, so that when an interesting condition is exposed in one tab it
can immediately be examined visually in another.
● Preprocessing
In order to experiment with the application, the data set needs to be presented to WEKA
in a format that the program understands. There are rules for the type of data that WEKA
will accept, and three options for loading data into the program.
♦ Open File: allows the user to select files residing on the local machine or on recorded
media
♦ Open URL: provides a mechanism to locate a file or data source at a different location
specified by the user
♦ Open Database: allows the user to retrieve files or data from a database source provided
by the user
There are restrictions on the type of data that can be accepted into the program.
Originally the software was designed to import only ARFF files; newer versions also accept
other file types such as CSV, C4.5 and serialized instance formats. The extensions for
these files include .csv, .arff, .names, .bsi and .data.
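Since both CSV and ARFF files are accepted, it can be instructive to see how the two relate. The Python sketch below converts CSV text into ARFF text by declaring a column numeric when every value parses as a number and nominal otherwise. This type-guessing rule, the function names and the sample data are assumptions made for illustration; WEKA's own CSV import may decide differently.

```python
import csv
import io

def is_number(value):
    """True if the string can be read as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            nominal = ",".join(sorted(set(values)))   # list the distinct values
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"] + [",".join(row) for row in data]
    return "\n".join(lines)

SAMPLE_CSV = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\n"
arff_text = csv_to_arff(SAMPLE_CSV, "weather")
print(arff_text)
```

The ARFF header makes explicit what a CSV file leaves implicit: the name and type of every attribute.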
Once the initial data has been selected and loaded, the user can select options for refining the
experimental data. The options in the Preprocess window include optional filters to
apply, and the user can select or remove attributes of the data set as necessary to isolate
specific information. Picking from the available attributes lets users separate
different parts of the data set for clarity in the experimentation. Deselecting attributes
from the original data set changes which relationships among the remaining attributes
are analyzed. Many different filtering options are available within the
Preprocess window, and the user can choose among them based on need and the type of
data present.
● Classify
The user can apply many different algorithms to the data set, each of which should in theory
produce a representation of the information that makes observation easier. It is difficult to
know in advance which option will provide the best output for the experiment. The best
approach is to apply several of the available choices independently and see which yields
something close to the desired results. The Classify tab is where the user selects the
classifier. Figure 5 shows some of the categories.
The Classify tab again offers several options. The Test options box gives the user a choice of
four test modes for the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split
Any or all of the modes can be applied to produce results that the user can compare. The Test
options box also gives access to further settings that, depending on the choice, control output
options such as saving the results to a file or specifying the random seed used for the
classification. The classifiers in WEKA are trained on the data set to predict the class
attribute, which by default is the last attribute in the data set. To use a different attribute
as the class, the user must select it before testing is performed. Once the results have been
calculated, they are shown in the text box on the lower right. They can be saved to a file and
retrieved later for comparison, or viewed within the window after changes have been made and
different results derived.
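The difference between the Percentage split and Cross-validation test modes can be sketched in a few lines of Python. This is only an illustration of the idea, not WEKA's implementation: the function names, the 66% split, the fold count and the seed are assumptions of the sketch, and it does not stratify the folds by class.

```python
import random

def percentage_split(n_instances, train_percent=66, seed=1):
    """Shuffle instance indices, then cut them into a training and a test part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]

def cross_validation_folds(n_instances, k=10, seed=1):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

if __name__ == "__main__":
    train, test = percentage_split(14)
    print(len(train), "training instances,", len(test), "test instances")
    for train, test in cross_validation_folds(14, k=7):
        print("test fold:", sorted(test))
```

Changing the seed changes which instances land in each part, which is why the random seed setting affects the reported results.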
● Cluster
The Cluster tab opens the process that is used to identify commonalities, or clusters of
occurrences, within the data set and produce information for the user to analyze. Some options
within the Cluster window are similar to those described for the Classify tab: Use training set,
Supplied test set and Percentage split. The fourth option is Classes to clusters evaluation,
which measures how well the clusters match a pre-assigned class within the data. While in
cluster mode, users have the option of ignoring some of the attributes from the data set. This
can be useful if specific attributes are causing the results to be out of range, or for large
data sets. Figure 6 shows the Cluster window and some of its options.
Figure 7: Associate Tab
Pre-processing of ARFF file
Steps
a. Dataset
A set of data items, the dataset, is a very basic concept of machine learning. A dataset is roughly
equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is implemented by
the weka.core.Instances class. A dataset is a collection of examples, each one of class
weka.core.Instance. Each Instance consists of a number of attributes, any of which can be
nominal (= one of a predefined list of values), numeric (= a real or integer number) or a string
(= an arbitrarily long list of characters, enclosed in "double quotes"). Additional types are date
and relational, which are not covered here but in the ARFF chapter. The external representation
of an Instances class is an ARFF file, which consists of a header describing the attribute types
and the data as a comma-separated list.
Here is a short, commented example.
A complete description of the ARFF file format can be found in the WEKA documentation.
@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes
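To make the header/data structure of an ARFF file concrete, the following Python sketch parses a small ARFF string. Only the @data rows come from the example above; the @relation name and the five @attribute declarations are assumptions chosen to fit the five columns. The parser is deliberately minimal (no quoting, missing values or sparse data) and is not WEKA's loader.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([float(v) if kind == "numeric" else v
                         for (_, kind), v in zip(attributes, values)])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal: list of allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            elif spec.lower() in ("numeric", "real", "integer"):
                kind = "numeric"
            else:
                kind = "string"
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Header lines are assumed for illustration; data rows are from the manual.
EXAMPLE = """\
% assumed header
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}
@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes
"""

if __name__ == "__main__":
    relation, attributes, rows = parse_arff(EXAMPLE)
    print(relation, [name for name, _ in attributes])
    print(rows[0])
```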
b. Loading Data
The first four buttons at the top of the Preprocess section enable you to load
data into WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data
file on the local file system.
2. Open URL.... Asks for a Uniform Resource Locator address for where
the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work
you might have to edit the file in weka/experiment/DatabaseUtils.props.)
4. Generate.... Enables you to generate artificial data from a variety of DataGenerators.
Using the Open file... button you can read files in a variety of formats: WEKA’s ARFF format,
CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff
extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized
Instances objects a .bsi extension.
NB: This list of formats can be extended by adding custom file converters to the
weka.core.converters package.
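When a CSV file is opened, its columns must be mapped to ARFF attribute types. The Python sketch below gives a rough idea of what such a conversion involves. The type-inference rule (a column is numeric if every value parses as a number, otherwise nominal) and the function name are assumptions of this sketch, not a description of WEKA's converter.

```python
import csv
import io

def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (first row = column names) into an ARFF string."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        column = [row[col] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute %s numeric" % name)
        else:
            labels = ",".join(sorted(set(column)))  # nominal value list
            lines.append("@attribute %s {%s}" % (name, labels))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    sample = "outlook,temperature,play\nsunny,85,no\nrainy,70,yes\n"
    print(csv_to_arff(sample, relation="weather"))
```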
Experiment 5
Figure 9: Output
Experiment 6
Aim: Implementation of Clustering and Association techniques on ARFF files
using WEKA.
Steps
a. Selecting a Clusterer
By now you will be familiar with the process of selecting and configuring objects. Clicking on the
clustering scheme listed in the Clusterer box at the top of the window brings up a
GenericObjectEditor dialog with which to choose a new clustering scheme.
b. Cluster Modes
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first
three options are the same as for classification (Use training set, Supplied test set and
Percentage split), except that now the data is assigned to clusters instead of being used to
predict a specific class.
The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up
with a pre-assigned class in the data. The drop-down box below this option selects the class, just
as in the Classify panel. An additional option in the Cluster mode box, the Store clusters for
visualization tick box, determines whether or not it will be possible to visualize the clusters once
training is complete. When dealing with datasets that are so large that memory becomes a
problem it may be helpful to disable this option.
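The idea behind Classes to clusters evaluation can be sketched as follows. The sketch labels each cluster with its most frequent class and reports the fraction of instances that disagree with that label. This majority-vote mapping is a simplification assumed for illustration and is not necessarily how WEKA assigns classes to clusters.

```python
from collections import Counter, defaultdict

def classes_to_clusters(cluster_ids, class_labels):
    """Map each cluster to its majority class and return the error rate."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in members.items()}
    errors = sum(1 for c, y in zip(cluster_ids, class_labels) if mapping[c] != y)
    return mapping, errors / len(class_labels)

if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    classes = ["yes", "yes", "no", "no", "no", "yes"]
    mapping, error_rate = classes_to_clusters(clusters, classes)
    print(mapping, error_rate)
```

The class attribute is only used after clustering, to judge the result; it plays no part in forming the clusters.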
Figure 10: Output
c. Learning Associations
Once appropriate parameters for the association rule learner have been set, click the Start button.
When complete, right-clicking on an entry in the result list allows the results to be viewed or
saved.
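The two quantities that usually drive an association rule learner, support and confidence, can be illustrated with a brute-force Python sketch over item pairs. The transactions, thresholds and function names are made up for illustration. Practical learners search far more efficiently and also consider larger itemsets, so this is not the algorithm WEKA runs.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def association_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:  # infrequent pair: no rule possible
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

if __name__ == "__main__":
    baskets = [{"bread", "milk"}, {"bread", "butter"},
               {"bread", "milk", "butter"}, {"milk"}]
    for lhs, rhs, sup, conf in association_rules(baskets):
        print(lhs, "->", rhs, "support", sup, "confidence", conf)
```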
d. Selecting Attributes
Attribute selection involves searching through all possible combinations of attributes in the data
to find which subset of attributes works best for prediction. To do this, two objects must be set
up: an attribute evaluator and a search method. The evaluator determines what method is used to
assign a worth to each subset of attributes. The search method determines what style of search
is performed.
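The split between an attribute evaluator and a search method can be sketched in Python as two separate functions. Here the evaluator scores a subset by the training accuracy of a majority-vote lookup table, and the search is a greedy forward selection. Both are assumptions chosen to keep the example short; they are stand-ins for, not copies of, the evaluators and search methods WEKA provides.

```python
from collections import Counter, defaultdict

def subset_worth(rows, labels, subset):
    """Evaluator: training accuracy of a majority-vote table keyed on subset."""
    table = defaultdict(Counter)
    for row, label in zip(rows, labels):
        table[tuple(row[i] for i in subset)][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in table.values())
    return correct / len(rows)

def forward_selection(rows, labels, evaluator=subset_worth):
    """Search: repeatedly add the attribute that most improves the worth."""
    selected, best = [], evaluator(rows, labels, [])
    remaining = set(range(len(rows[0])))
    while remaining:
        candidates = [(evaluator(rows, labels, selected + [a]), a)
                      for a in sorted(remaining)]
        score, attr = max(candidates, key=lambda c: c[0])
        if score <= best:  # no candidate improves the subset: stop
            break
        selected.append(attr)
        remaining.remove(attr)
        best = score
    return selected, best

if __name__ == "__main__":
    rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
    labels = ["no", "no", "yes", "yes"]
    print(forward_selection(rows, labels))
```

Because the evaluator is passed in as a parameter, the same search can be reused with a different measure of worth, which mirrors the two-object setup described above.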
Figure 11: Output
Experiment 7
Aim: Implementation of Visualization technique on ARFF files using WEKA.
WEKA’s visualization section allows you to visualize 2D plots of the current relation.
a. The scatter plot matrix
When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, colour
coded according to the currently selected class. It is possible to change the size of each
individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured
points). It is also possible to change the attribute used to colour the plots, to select only a
subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note
that changes will only come into effect once the Update button has been pressed.
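Jitter helps because several instances with identical attribute values are drawn on top of each other and look like a single point. A minimal Python sketch of the idea follows. The function name, the amount parameter and the uniform noise are assumptions of this sketch rather than WEKA's exact behaviour.

```python
import random

def jitter(points, amount=0.02, seed=0):
    """Shift each (x, y) point by a small random offset.

    amount is the maximum shift as a fraction of each axis range.
    """
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_span = (max(xs) - min(xs)) or 1.0  # avoid a zero range
    y_span = (max(ys) - min(ys)) or 1.0
    return [
        (x + rng.uniform(-amount, amount) * x_span,
         y + rng.uniform(-amount, amount) * y_span)
        for x, y in points
    ]

if __name__ == "__main__":
    overlapping = [(1.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
    for point in jitter(overlapping, amount=0.05):
        print(point)
```

The jittered coordinates are for display only; the underlying data values are unchanged.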
b. Selecting an individual 2D scatter plot
When you click on a cell in the scatter plot matrix, this will bring up a separate window with a
visualization of the scatter plot you selected. (The same visualization controls are used when
viewing particular results, such as classifier errors, in a separate window.)
Data points are plotted in the main area of the window. At the top are two drop-down list buttons
for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the
one on the right shows which is used for the y-axis.
Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you
to colour the points based on the attribute selected. Below the plot area, a legend describes what
values the colours correspond to. If the values are discrete, you can modify the colour used for
each one by clicking on them and making an appropriate selection in the window that pops up.
Figure 12: Output
Experiment 8
Aim: Study of DBMINER tool and ARMINER tool
DBMiner is a data mining system for interactive mining of multiple-level knowledge in large
relational databases. It is based on years of research on data mining techniques and on
experience in the development of an early system prototype, DBLearn. The system implements a
wide spectrum of data mining functions, including generalization, characterization,
discrimination, association, classification, and prediction. By incorporating several interesting
data mining techniques, including attribute-oriented induction, statistical analysis, progressive
deepening for mining multiple-level knowledge, and meta-rule guided mining, the system provides
a user-friendly, interactive data mining environment with good performance.
Features
● It incorporates several interesting data mining techniques, including attribute-oriented induction,
progressive deepening for mining multiple-level rules and meta-rule guided knowledge mining,
etc., and implements a wide spectrum of data mining functions including generalization,
characterization, association, classification, and prediction.
● It performs interactive data mining at multiple concept levels on any user-specified set of data in
a database using an SQL-like Data Mining Query Language, DMQL, or a graphical user
interface. Users may interactively set and adjust various thresholds, control a data mining
process, perform roll-up or drill-down at multiple concept levels, and generate different forms of
outputs, including generalized relations, generalized feature tables, multiple forms of generalized
rules, visual presentation of rules, charts, curves, etc.
● Efficient implementation techniques have been explored using different data structures, including
generalized relations and multi-dimensional data cubes, integrated with relational database
techniques. The data mining process may utilize user- or expert-defined set-grouping or
schema-level concept hierarchies, which can be specified flexibly, adjusted dynamically based on
data distribution, and generated automatically for numerical attributes.
● Both UNIX and PC (Windows/NT) versions of the system adopt a client/server architecture. The
latter communicates with various commercial database systems for data mining using the ODBC
technology.
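The generalization step behind attribute-oriented induction and roll-up can be illustrated with a small Python sketch: each value of one attribute is replaced by its parent in a concept hierarchy, and tuples that become identical are merged with a count. The hierarchy, the data and the function name are hypothetical. The sketch shows the idea only and does not reproduce DBMiner's implementation or its DMQL syntax.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}

def generalize(tuples, hierarchy, attribute_index):
    """Climb one level of the hierarchy for one attribute and merge equal tuples.

    Returns a dict mapping each generalized tuple to the number of
    original tuples it covers. Values missing from the hierarchy become "ANY".
    """
    counts = Counter()
    for t in tuples:
        value = hierarchy.get(t[attribute_index], "ANY")
        generalized = t[:attribute_index] + (value,) + t[attribute_index + 1:]
        counts[generalized] += 1
    return dict(counts)

if __name__ == "__main__":
    people = [("Vancouver", "student"), ("Toronto", "student"),
              ("Seattle", "student"), ("Toronto", "staff")]
    for row, count in generalize(people, CITY_TO_COUNTRY, 0).items():
        print(row, count)
```

Drill-down is the reverse direction: returning from the merged, higher-level tuples to the more detailed ones.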
ARMiner is a client-server data mining application specialized in finding association rules. The
name ARMiner comes from Association Rules Miner. ARMiner is written in Java. It was developed
at UMass/Boston as a Software Engineering project in Spring 2000.
The last ARMiner server version is 1.0a (12/05/2001) and the last ARMiner client version is 1.0b
(04/05/2001). Both the client and the server were last compiled using the Sun JDK 1.3.1.
Experiment 10
Aim: Comparative analysis of various classification Algorithms using
knowledge flow and generate ROC curve.
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to WEKA’s
core algorithms. The KnowledgeFlow is a work in progress so some of the functionality from the
Explorer is not yet available. On the other hand, there are things that can be done in the
KnowledgeFlow but not in the Explorer.
● Filters
Figure 16 : Filters
● Classifiers
Figure 17 : Classifiers
● Cluster
Figure 18 : Clusters
● Visualization
Figure 19 : Visualization
● DataVisualizer - component that can pop up a panel for visualizing data in a single large
2D scatter plot.
● ScatterPlotMatrix - component that can pop up a panel containing a matrix of small
scatter plots (clicking on a small plot pops up a large scatter plot).
● AttributeSummarizer - component that can pop up a panel containing a matrix of
histogram plots - one for each of the attributes in the input data.
● ModelPerformanceChart - component that can pop up a panel for visualizing threshold
(i.e. ROC style) curves.
● TextViewer - component for showing textual data. Can show data sets, classification
performance statistics etc.
● GraphViewer - component that can pop up a panel for visualizing tree based models.
● StripChart - component that can pop up a panel that displays a scrolling plot of data
(used for viewing the online performance of incremental classifiers).
● Evaluation
Figure 20 : Evaluation
● TrainingSetMaker - make a data set into a training set.
● TestSetMaker - make a data set into a test set.
● CrossValidationFoldMaker - split any data set, training set or test set
into folds.
● TrainTestSplitMaker - split any data set, training set or test set into
a training set and a test set.
● ClassAssigner - assign a column to be the class for any data set,
training set or test set.
● ClassValuePicker - choose a class value to be considered as the “positive”
class. This is useful when generating data for ROC style curves (see
ModelPerformanceChart above and example 2).
● ClassifierPerformanceEvaluator - evaluate the performance of batch
trained/tested classifiers.
● IncrementalClassifierEvaluator - evaluate the performance of incrementally
trained classifiers.
● ClustererPerformanceEvaluator - evaluate the performance of batch
trained/tested clusterers.
● PredictionAppender - append classifier predictions to a test set. For discrete
class problems, can either append predicted class labels or
probability distributions.
Examples
1. Cross-validated J48
Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48
(WEKA’s C4.5 implementation).
Figure 21 : Cross validation using J48
● Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse
pointer will change to cross hairs).
● Next place the ArffLoader component on the layout area by clicking somewhere on the
layout (a copy of the ArffLoader icon will appear on the layout area).
● Next specify an ARFF file to load by first right clicking the mouse over the ArffLoader
icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list
from this menu and browse to the location of your ARFF file.
● Next click the Evaluation tab at the top of the window and choose the ClassAssigner
(allows you to choose which column to be the class) component from the toolbar. Place
this on the layout.
● Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader
and select the dataSet under Connections in the menu. A rubber band line will appear.
Move the mouse over the ClassAssigner component and left click - a red line labeled
dataSet will connect the two components.
● Next right click over the ClassAssigner and choose Configure from the menu. This will
pop up a window from which you can specify which column is the class in your data (last
is the default).
● Next grab a CrossValidationFoldMaker component from the Evaluation toolbar and
place it on the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by
right clicking over ClassAssigner and selecting dataSet from under Connections in the
menu.
● Next click on the Classifiers tab at the top of the window and scroll along the toolbar
until you reach the J48 component in the trees section. Place a J48 component on the
layout.
● Connect the CrossValidationFoldMaker to J48 TWICE by first choosing trainingSet and
then testSet from the pop-up menu for the CrossValidationFoldMaker.
● Next go back to the Evaluation tab and place a ClassifierPerformanceEvaluator
component on the layout. Connect J48 to this component by selecting the batchClassifier
entry from the pop-up menu for J48.
● Next go to the Visualization toolbar and place a TextViewer component on the layout.
Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the text
entry from the pop-up menu for ClassifierPerformanceEvaluator.
● Now start the flow executing by selecting Start loading from the pop-up menu for
ArffLoader. Depending on how big the data set is and how long cross-validation takes,
you will see some animation from some of the icons in the layout (J48’s tree will grow in
the icon and the ticks will animate on the ClassifierPerformanceEvaluator). You will also
see some progress information in the Status bar and Log at the bottom of the window.
● When finished you can view the results by choosing Show results from the pop-up menu
for the TextViewer component.
● Other useful additions to this flow: connect a TextViewer and/or a GraphViewer to
J48 in order to view the textual or graphical representations of the trees produced for each
fold of the cross validation (this is something that is not possible in the Explorer).
2. Plotting multiple ROC curves
The KnowledgeFlow can draw multiple ROC curves in the same plot window, something that
the Explorer cannot do. In this example we use J48 and RandomForest as classifiers.
● Next click on the Classifiers tab at the top of the window and scroll along the toolbar
until you reach the J48 component in the trees section. Place a J48 component on the
layout.
● Connect the CrossValidationFoldMaker to J48 TWICE by first choosing trainingSet and
then testSet from the pop-up menu for the CrossValidationFoldMaker.
● Repeat these two steps with the RandomForest classifier.
● Next go back to the Evaluation tab and place a ClassifierPerformanceEvaluator
component on the layout. Connect J48 to this component by selecting the batchClassifier
entry from the pop-up menu for J48. Add another ClassifierPerformanceEvaluator for
RandomForest and connect them via batchClassifier as well.
● Next go to the Visualization toolbar and place a ModelPerformanceChart component
on the layout. Connect both ClassifierPerformanceEvaluators to the
ModelPerformanceChart by selecting the thresholdData entry from the pop-up menu for
ClassifierPerformanceEvaluator.
● Now start the flow executing by selecting Start loading from the pop-up menu for
ArffLoader. Depending on how big the data set is and how long cross validation takes,
you will see some animation from some of the icons in the layout. You will also see some
progress information in the Status bar and Log at the bottom of the window.
● Select Show plot from the pop-up menu of the ModelPerformanceChart under the
Actions section.
Here are the two ROC curves generated from the UCI dataset credit-g, evaluated on the class
label good:
Figure 23 : Multiple ROC Curve
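How a threshold (ROC) curve is obtained from a classifier's predictions can be sketched in Python as follows. For a chosen positive class, the instances are ranked by predicted probability, and every threshold contributes one point (false positive rate, true positive rate). The scores and labels in the usage example are invented for illustration and are not results from credit-g. The sketch assumes both classes occur at least once.

```python
def roc_points(scores, labels, positive):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low.

    Instances with equal scores are processed together as one threshold.
    """
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]  # predicted P(class = good)
    labels = ["good", "good", "bad", "good", "bad", "bad"]
    curve = roc_points(scores, labels, positive="good")
    print(curve)
    print("AUC:", auc(curve))
```

Plotting the point lists of two classifiers on the same axes gives the kind of comparison that the ModelPerformanceChart displays.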
3. Processing data incrementally
Some classifiers, clusterers and filters in Weka can handle data incrementally in a streaming
fashion. Here is an example of training and testing naive Bayes incrementally. The results are
sent to a TextViewer and predictions are plotted by a StripChart component.
● Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse
pointer will change to cross hairs).
● Next place the ArffLoader component on the layout area by clicking somewhere on the
layout (a copy of the ArffLoader icon will appear on the layout area).
● Next specify an ARFF file to load by first right clicking the mouse over the ArffLoader
icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list
from this menu and browse to the location of your ARFF file.
● Next click the Evaluation tab at the top of the window and choose the ClassAssigner
(allows you to choose which column to be the class) component from the toolbar. Place
this on the layout.
● Now connect the ArffLoader to the ClassAssigner: first right click over the ArffLoader
and select the dataSet under Connections in the menu. A rubber band line will appear.
Move the mouse over the ClassAssigner component and left click - a red line labeled
dataSet will connect the two components.
● Next right click over the ClassAssigner and choose Configure from the menu. This will
pop up a window from which you can specify which column is the class in your data (last
is the default).
● Now grab a NaiveBayesUpdateable component from the bayes section of the
Classifiers panel and place it on the layout.
● Next connect the ClassAssigner to NaiveBayesUpdateable using an instance connection.
● Next place an IncrementalClassifierEvaluator from the Evaluation panel onto the
layout and connect NaiveBayesUpdateable to it using an incrementalClassifier
connection.
● Next place a TextViewer component from the Visualization panel on the layout.
Connect the IncrementalClassifierEvaluator to it using a text connection.
● Next place a StripChart component from the Visualization panel on the layout and
connect IncrementalClassifierEvaluator to it using a chart connection.
● Display the StripChart’s chart by right-clicking over it and choosing Show chart from
the pop-up menu. Note: the StripChart can be configured with options that control how
often data points and labels are displayed.
● Finally, start the flow by right-clicking over the ArffLoader and selecting Start loading
from the pop-up menu.
COURSE EXIT SURVEY