Lab2 DataPreprocessing A1.2

This document provides information about preprocessing data in WEKA, including loading data, filtering attributes, and discretizing numeric attributes. It uses a bank marketing dataset as an example. The key steps are: 1) The bank dataset is loaded into WEKA from a CSV file. 2) The "id" attribute is removed using the Remove filter to eliminate unique identifiers. 3) The "children" attribute is changed from numeric to categorical. 4) The "age" and "income" attributes are discretized into 3 bins each using the Discretize filter.

Uploaded by

Deepak Dahiya

(Ref: http://maya.cs.depaul.edu/classes/ect584/WEKA/preprocess.html -- WEKA 3.4.1)

Data Preprocessing in WEKA
This exercise illustrates some of the basic data preprocessing operations that can be
performed using WEKA. The sample data set used for this example is the "bank data"
available in comma‐separated format (bank‐data.csv).

The data contains the following fields

id a unique identification number
age age of customer in years (numeric)
sex MALE / FEMALE
region inner_city/rural/suburban/town
income income of customer (numeric)
married is the customer married (YES/NO)
children number of children (numeric)
car does the customer own a car (YES/NO)
save_acct does the customer have a saving account (YES/NO)
current_acct does the customer have a current account (YES/NO)
mortgage does the customer have a mortgage (YES/NO)
pep did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"
format files. This is fortunate since many databases or spreadsheet applications can save or
export data into flat files in this format. As can be seen in the sample data file, the first row
contains the attribute names (separated by commas) followed by each data row with
attribute values listed in the same order (also separated by commas). In fact, once loaded
into WEKA, the data set can be saved into ARFF format.
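
As a rough illustration of how the CSV layout described above maps onto ARFF, the following Python sketch converts a few rows into ARFF text. The sample rows, the relation name, and the helper names (is_number, csv_to_arff) are made up for illustration. The type inference used here (numeric if every value parses as a number, otherwise nominal) is a simplifying assumption, not a description of WEKA's own loader.

```python
import csv
import io

# Hypothetical sample in the same layout as bank-data.csv (values are made up).
SAMPLE_CSV = """id,age,sex,income,pep
ID001,48,FEMALE,17546.0,YES
ID002,40,MALE,30085.1,NO
ID003,51,FEMALE,16575.4,NO
"""


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text whose first row holds attribute names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(distinct)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


arff_text = csv_to_arff(SAMPLE_CSV, "bank-data")
print(arff_text)
```

The output starts with the relation name, lists one attribute declaration per column, and then repeats the data rows unchanged after the @data line.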

In this example, we load the data set into WEKA and perform a series of operations using
WEKA's preprocessing filters. While all of these operations can be performed from the
command line, we use the WEKA Explorer GUI.

Initially (in the Preprocess tab) click "open" and navigate to the directory containing the
data file (.csv or .arff). In this case we will open the above data file. This is shown in Figure
p1.

Figure p1

Once the data is loaded, WEKA will recognize the attributes and during the scan of the data
will compute some basic statistics on each attribute. The left panel in Figure p2 shows the
list of recognized attributes, while the top panels indicate the names of the base relation (or
table) and the current working relation (which are the same initially).

Figure p2

Clicking on any attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes, the frequency for each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation, etc. As an example, see
Figures p3 and p4 below which show the results of selecting the "age" and "married"
attributes, respectively.

Figure p3

Figure p4
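
The kind of per-attribute summary described above can be sketched in a few lines of Python. The values below are hypothetical stand-ins for the "age" and "married" columns, and the use of the sample standard deviation is an assumption made for this sketch; it is not a statement about how WEKA computes its figure.

```python
import statistics
from collections import Counter

# Hypothetical values standing in for two columns of the bank data.
ages = [48, 40, 51, 23, 57, 57, 22, 58]
married = ["NO", "YES", "YES", "YES", "YES", "YES", "NO", "YES"]

# Numeric attribute: minimum, maximum, mean, and sample standard deviation.
age_stats = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "stdev": statistics.stdev(ages),
}

# Categorical attribute: how often each value occurs.
married_counts = Counter(married)

print(age_stats)
print(married_counts)
```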

Note that the visualization in the right bottom panel is a form of cross‐tabulation across two
attributes. For example, in Figure p4 above, the default visualization panel cross‐tabulates
"married" with the "pep" attribute (by default the second attribute is the last column of the
data file). You can select another attribute using the drop down list.
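
A cross-tabulation of two categorical attributes is simply a count of each pair of values. The following Python sketch shows the idea on hypothetical ("married", "pep") pairs; the records are made up and do not come from the bank data file.

```python
from collections import Counter

# Hypothetical (married, pep) pairs; the real values come from the data file.
records = [
    ("YES", "NO"), ("YES", "YES"), ("NO", "YES"), ("YES", "NO"),
    ("NO", "YES"), ("NO", "NO"), ("YES", "NO"), ("YES", "YES"),
]

# Count each combination of the two attribute values.
crosstab = Counter(records)

for married in ("YES", "NO"):
    row = {pep: crosstab[(married, pep)] for pep in ("YES", "NO")}
    print("married=%s -> %s" % (married, row))
```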

Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id"
attribute). We need to remove this attribute before the data mining step, since a unique
identifier carries no information that generalizes to other records. We can do this by
(1) simply selecting the attribute and clicking the "Remove" button, as shown in Figure p5
(WEKA 3.6.2), or

Figure p5

(2) using the attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button.
This will show a popup window with a list of available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter, as shown in Figure p6.

Figure p6

Next, click on the text box immediately to the right of the "Choose" button. In the resulting
dialog box, enter the index of the attribute to be filtered out (this can be a range or a list
separated by commas). In this case, we enter 1, which is the index of the "id" attribute (see
the left panel). Make sure that the "invertSelection" option is set to false (otherwise
everything except attribute 1 will be filtered out). Then click "OK" (see Figure p7). Now, in the
filter box you will see "Remove -R 1" (see Figure p8).

Figure p7

Figure p8

Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute
and create a new working relation (whose name now includes the details of the filter that
was applied). The result is depicted in Figure p9:

Figure p9
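
To make the effect of the attribute index and the "invertSelection" option concrete, here is a minimal Python sketch of column removal by 1-based index. It is an illustration of the idea only, not WEKA's implementation; the function names and the two-record extract are hypothetical, and the index parser handles only plain numbers and numeric ranges.

```python
def parse_indices(spec):
    """Expand a 1-based index spec such as "1,3-5" into a set of integers."""
    indices = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            indices.update(range(int(first), int(last) + 1))
        else:
            indices.add(int(part))
    return indices


def remove_attributes(header, rows, spec, invert_selection=False):
    """Drop the listed columns, or keep only them when invert_selection is True."""
    selected = parse_indices(spec)
    keep = [
        i for i in range(len(header))
        if ((i + 1) in selected) == invert_selection
    ]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Hypothetical two-record extract of the bank data.
header = ["id", "age", "sex", "pep"]
rows = [["ID001", "48", "FEMALE", "YES"], ["ID002", "40", "MALE", "NO"]]

without_id = remove_attributes(header, rows, "1")
only_id = remove_attributes(header, rows, "1", invert_selection=True)
print(without_id[0])
print(only_id[0])
```

With the selection inverted, only the "id" column survives, which is why the option should stay set to false in this step.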

It is possible now to apply additional filters to the new working relation. In this example,
however, we will save our intermediate results as separate data files and treat each step as
a separate WEKA session. To save the new working relation as an ARFF file, click on the "Save"
button in the top panel. Here, as shown in the "save" dialog box (see Figure p10), we will
save the new relation in the file "bank-data-R1.arff".

Figure p10

Figure p11 shows the top portion of the newly generated ARFF file (in a text editor).

Figure p11
Note that in the new data set, the "id" attribute and all the corresponding values in the
records have been removed. Also, note that WEKA has automatically determined the
correct types and values associated with the attributes, as listed in the Attributes section of
the ARFF file.

Discretization
Some techniques, such as association rule mining, can only be performed on categorical
data. This requires performing discretization on numeric or continuous attributes. There are
3 such attributes in this data set: "age", "income", and "children". In the case of the
"children" attribute, the only possible values are 0, 1, 2, and 3. In this case, we have
opted for keeping all of these values in the data. This means we can simply discretize by
removing the keyword "numeric" as the type for the "children" attribute in the ARFF file,
and replacing it with the set of discrete values, {0,1,2,3}. We do this directly in our text
editor, as seen in Figure p12. In this case, we have saved the resulting relation in a
separate file, "bank-data2.arff".

Figure p12
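
The same header edit can be scripted instead of done by hand. The Python sketch below rewrites the declaration of one attribute from numeric to a set of discrete values. The ARFF fragment, its relation name, and the helper name make_nominal are hypothetical; the sketch assumes the declaration appears on a line of its own, exactly in the form "@attribute children numeric".

```python
# Hypothetical ARFF fragment; only the "children" declaration matters here.
ARFF_BEFORE = """@relation bank-data-R1

@attribute age numeric
@attribute children numeric
@attribute pep {YES,NO}

@data
48,1,YES
40,3,NO
"""


def make_nominal(arff_text, name, values):
    """Replace the numeric declaration of one attribute with a value set."""
    old = "@attribute %s numeric" % name
    new = "@attribute %s {%s}" % (name, ",".join(values))
    out = []
    for line in arff_text.splitlines():
        # Only the matching declaration line changes; data rows are untouched.
        out.append(new if line.strip() == old else line)
    return "\n".join(out) + "\n"


ARFF_AFTER = make_nominal(ARFF_BEFORE, "children", ["0", "1", "2", "3"])
print(ARFF_AFTER)
```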

We will rely on WEKA to perform discretization on the "age" and "income" attributes. In this
example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can
divide the ranges blindly (for example, into intervals of equal width), or use various
statistical techniques to automatically determine the best way of partitioning the data. In
this case, we will perform simple binning.

First we will load our filtered data set into WEKA by opening the file "bank-data2.arff". The
"open" dialog box is depicted in Figure p13.

Figure p13

If we select the "children" attribute in this new data set, we see that it is now a categorical
attribute with four possible discrete values. This is depicted in Figure p14.

Figure p14

Now, once again we activate the Filter dialog box, but this time, we will select
"weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p15).

Figure p15

Next, to change the defaults for this filter, click on the box immediately to the right of the
"Choose" button. This will open the Discretize filter dialog box. We enter the index of the
attributes to be discretized. In this case, we enter 1, corresponding to the attribute "age". We
also enter 3 as the number of bins (note that it is possible to discretize more than one
attribute at the same time by using a list of attribute indices). Since we are doing simple
binning, all of the other available options are set to "false". The dialog box is depicted in
Figure p16. Clicking on "More" will give you details of each parameter.

Figure p16
Click "Apply" in the Filter panel. This will result in a new working relation with the selected
attribute partitioned into 3 bins (see Figure p17). To examine the results, we save the new
working relation in the file "bank‐data3.arff" as depicted in Figure p18.

Figure p17

Figure p18
Let us now examine the new data set using our text editor. The top portion of the data is
shown in Figure p18. You can observe that WEKA has assigned its own labels to each of the
value ranges for the discretized attribute. For example, the lower range in the "age"
attribute is labeled "(‐inf‐34.333333]" (enclosed in single quotes and escape characters),
while the middle range is labeled "(34.333333‐50.666667]", and so on. These labels now
also appear in the data records where the original age value was in the corresponding
range.
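
The cut points in these labels are what equal-width binning produces: the range from the minimum to the maximum is split into 3 intervals of the same width. The Python sketch below shows the arithmetic. The age values are hypothetical; the minimum of 18 and maximum of 67 are an assumption, chosen because they reproduce the cut points 34.333333 and 50.666667 quoted above. The label format imitates the WEKA labels and is likewise an approximation.

```python
def equal_width_cuts(values, bins):
    """Return the interior cut points that split [min, max] into equal widths."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + width * k for k in range(1, bins)]


def bin_label(value, cuts):
    """Return a WEKA-style range label for the bin that contains value."""
    edges = ["-inf"] + ["%.6f" % c for c in cuts] + ["inf"]
    for k, cut in enumerate(cuts):
        if value <= cut:
            return "(%s-%s]" % (edges[k], edges[k + 1])
    return "(%s-%s)" % (edges[-2], edges[-1])


# Assumed ages: min 18 and max 67 are chosen to match the labels in the text.
ages = [18, 23, 34, 35, 48, 51, 67]
cuts = equal_width_cuts(ages, 3)
labels = [bin_label(a, cuts) for a in ages]
print(cuts)
print(labels)
```

Here the bin width is (67 - 18) / 3, so the cut points fall at 18 + 16.333333 and 18 + 32.666667.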

Next, we apply the same process to discretize the "income" attribute into 3 bins. Again,
WEKA automatically performs the binning and replaces the values in the "income" column
with the appropriate automatically generated labels. We save the new file as
"bank-data3.arff", replacing the older version.

Clearly, the WEKA labels, while readable, leave much to be desired as far as naming
conventions go. We will thus use the global search/replace function of a text editor to
replace these labels with more succinct and readable ones.

Replace all of the WEKA-assigned labels of the "age" and "income" attributes. Note that the
attribute section (the top part) of the ARFF file must be adjusted accordingly.
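
A global search/replace can also be scripted. The Python sketch below applies a label mapping to the whole ARFF text, so the attribute section and the data records change together. The fragment, the replacement names ("young", "middle"), and the helper name rename_labels are hypothetical, and the exact quoting WEKA writes around the labels is an assumption here; adjust the keys of the mapping to match what appears in your file.

```python
# Hypothetical ARFF fragment. The quoting around the labels is an assumption;
# adjust the keys of LABEL_MAP to match the text in your own file.
ARFF_TEXT = r"""@attribute age {'\'(-inf-34.333333]\'','\'(34.333333-50.666667]\''}
@data
'\'(-inf-34.333333]\'',YES
'\'(34.333333-50.666667]\'',NO
"""

# Hypothetical short names; the lab leaves the choice of labels open.
LABEL_MAP = {
    r"'\'(-inf-34.333333]\''": "young",
    r"'\'(34.333333-50.666667]\''": "middle",
}


def rename_labels(arff_text, label_map):
    """Apply every old-label to new-label substitution across the whole text."""
    for old, new in label_map.items():
        arff_text = arff_text.replace(old, new)
    return arff_text


renamed = rename_labels(ARFF_TEXT, LABEL_MAP)
print(renamed)
```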

Figure p19 shows the final result of the transformation and the newly assigned labels for
these attribute values.

Figure p19

We now also change the relation name in the ARFF file to "bank‐data‐final" and save the file
as "bank‐data‐final.arff".

You may also try different numbers of bins. The Discretize filter also has a parameter for
equal-frequency binning; try it and compare the results.
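
For comparison with equal-width binning, the following Python sketch illustrates the idea behind equal-frequency binning: values are ranked, and the ranks are divided so that each bin receives roughly the same number of records. This is a simplified, rank-based illustration, not WEKA's algorithm (in particular, tied values may land in different bins here). The income values and the function name are hypothetical.

```python
from collections import Counter


def equal_frequency_bins(values, bins):
    """Assign each value a bin number so that bins hold roughly equal counts."""
    # Positions of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    assignment = [0] * len(values)
    for rank, index in enumerate(order):
        assignment[index] = rank * bins // len(values)
    return assignment


# Hypothetical income values.
incomes = [17546.0, 30085.1, 16575.4, 20375.4, 50576.3,
           37869.6, 8877.07, 24946.6, 25304.3]
income_bins = equal_frequency_bins(incomes, 3)
bin_sizes = Counter(income_bins)
print(income_bins)
print(bin_sizes)
```

With 9 values and 3 bins, each bin holds 3 records, whereas equal-width bins on the same values could hold very different numbers of records.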

Missing Values
1. Open the file "bank-data.arff".
2. Check whether any attribute has missing values.

3. Edit the data to create some missing values.
4. Delete some values in the "region" (nominal) and "children" (numeric) attributes. Click the
"OK" button when finished.

5. Note the label with the maximum count in "region" and the mean of the "children" attribute.

6. Choose the "ReplaceMissingValues" filter
(weka.filters.unsupervised.attribute.ReplaceMissingValues), then click the "Apply" button.

7. Look at the data. How were the missing values replaced?

8. Edit "bank-data.arff" with a text editor. Make some values missing by replacing them with '?'
(try both nominal and numeric attributes). Save the result as "bank-data-missing.arff".

9. Load "bank-data-missing.arff" into WEKA and observe the data and attribute information.

10. Replace the missing values using the same procedure as before.
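
As a point of comparison for the exercise above, the Python sketch below fills missing values with the most frequent label for a nominal attribute and with the mean for a numeric attribute, which is the strategy the ReplaceMissingValues filter is documented to use. The code is an illustration only, not WEKA's implementation; the column values and function names are hypothetical, and missing numeric values are represented as None for convenience.

```python
import statistics

MISSING = "?"  # ARFF marks a missing value with a question mark


def fill_nominal(values):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v != MISSING]
    mode = statistics.mode(observed)
    return [mode if v == MISSING else v for v in values]


def fill_numeric(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


# Hypothetical columns with some values blanked out.
region = ["town", "rural", "?", "town", "inner_city", "?"]
children = [1, None, 0, 3, None, 2]

region_filled = fill_nominal(region)
children_filled = fill_numeric(children)
print(region_filled)
print(children_filled)
```

Comparing the filled values with the label and mean noted in step 5 is a quick way to check your answer to step 7.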
