Data Mining Lab Notes
Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data from multiple heterogeneous data sources in support of management's decision-making process.
Pentaho Data Integration (PDI, also called Kettle) is the component of Pentaho responsible for the Extract,
Transform and Load (ETL) processes. PDI can also be used for other purposes:
Data preprocessing
Integrating applications
Jobs and Transformations can be stored in either of two ways:
Files
Database repository
If you choose the database repository method, the repository has to be created the first time you execute
Spoon. If you choose the files method, Jobs are saved in files with a .kjb extension and
Transformations in files with a .ktr extension. In this tutorial you'll work with the files method.
Starting Spoon
Start Spoon by executing Spoon.bat (spoon.sh on Linux) from the command prompt.
Hops
A hop connects one transformation step or job entry with another. The direction of the data flow is indicated
with an arrow on the graphical view pane. A hop can be enabled or disabled.
2. Using PDI, find the sales records with an order quantity greater than 75. Use the system CSV data file sales.
3. Using PDI, list the sales records; where the zip code is NULL, look it up in the city_zip_code file
and add the appropriate city. Use the system CSV data files sales and city_zip_code.
4. Create data files for sailor information, boat reservations and boat information. Using PDI, find
the senior-citizen sailors who have reserved at least one boat.
5. Create data files for sailor information, boat reservations and boat information. Using PDI, find
the sailors who are eligible to vote and have reserved at least one red boat.
1. Design multi-dimensional data models namely star, snowflake and fact-constellation schemas.
Multidimensional Data Model:
The multidimensional data model views data in the form of a cube. A data cube contains a set of dimensions, such as
product, customer and time, and facts, such as sales figures (units sold) and profit.
This will open the New Schema 1 window with the schema file name as Schema1.xml. Please refer to the
below screenshot.
Step 7: Click on Schema as shown above and set its required properties, e.g. the name of
the schema. For now, enter the name as SchemaTest.
Step 8: Right click on element Schema and select Add Cube option. This will add a new cube
into the schema.
Step 9: Set the name of the cube as CubeTest. Once this is done, the schema design will look like the screenshot below.
Step 10: Set the fact table for the cube CubeTest. To do so, click on the icon before the cube
image, as marked #2 in the screenshot above. This will expand the cube node as in the image below.
Step 11: Now click on the Table element; this will list the attributes specific to the
Table element. Clicking on the name attribute will display all tables available under the current
datasource (the database we set in Step 5). Select the table CUSTOMERS.
Once you choose the table PUBLIC -> CUSTOMERS, the schema attribute value will be filled in
automatically as PUBLIC.
Step 12: Now add a new dimension called CustomerGeography to the cube by right-clicking the
cube element CubeTest.
Step 13: For the new dimension added, set the required attribute values like name, foreign key, etc.
Set the name of the dimension as CustomerGeography and the foreign key as CUSTOMERNUMBER. Double-click
on the dimension name CustomerGeography; this will expand the node and display the Hierarchy.
Click on the hierarchy in the left-side pane to see the attribute properties for the hierarchy.
Set name -> CustomerGeo; allMemberName -> All Countries
Step 14: Double-clicking the Hierarchy element in the left-side pane will expand the node
further and show the Table element. Click on the Table element to set the dimension table for the
dimension CustomerGeography. This will list the related attributes in the right-side pane. Clicking on
the name attribute's value field will list the tables available in the current schema.
Select CUSTOMERS; this will automatically fill the schema field as PUBLIC.
Step 15: Right-click on the element Hierarchy in the left-side pane and select Add Level.
This will add a new level with the name New Level 0. Refer to the screenshot below.
To rename and set other attributes, set the attribute values (as listed below) for the newly created level in the
right side pane.
Name -> CustomerCountry
Column -> COUNTRY
Type -> string
uniqueMembers -> true
Now we have added a level called CustomerCountry.
Step 16: To add another level, right-click on Hierarchy in the left-side pane (as we did in
Step 15) and select Add Level. This will add a new level with the name New Level 1. To rename it and set
other attribute values, set the attribute values in the right-side pane as below:
Name -> CustomerCity
Step 17: Add a second dimension to the cube by right-clicking the cube element CubeTest (as in Step 12).
This will add a new dimension to the cube with a default name. To rename it and set other attribute values, click
on the newly created dimension in the left-side pane. This will list the attributes for the dimension.
Set name -> CustomerContact; foreign key -> CUSTOMERNUMBER.
Step 18: To add a hierarchy and levels for this dimension, double-click on the dimension name,
which will expand the dimension node CustomerContact. Click on the hierarchy element in the left-side
pane, then set the attribute values below in the right-side pane.
Set name -> ; allMemberName -> All Contacts
Step 19: Double-clicking on the element hierarchy will expand the hierarchy node, where you
can set the dimension table for the dimension CustomerContact.
Click on the Table element and select the table CUSTOMERS.
Step 20: To add a new level for this dimension/hierarchy, right-click on the element
hierarchy and select Add Level. This will add a new level to the hierarchy with the name New Level 0.
We can rename it by changing the attribute values as below:
name -> CustomerNames
Column -> CONTACTFIRSTNAME
Type -> String.
Step 21: To add a new measure to the cube, right-click on the cube CubeTest and select Add
Measure. This will add a new measure with the name New Measure 0. You can rename it by changing the
attribute values.
Step 22: Now click the File -> Save menu to save the cube schema. You can save it in your desired
path, e.g. as TestCube.mondrian.xml.
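For reference, a minimal sketch of roughly what the saved schema file might contain after the steps above. The CustomerCity column, the hasAll flag and the sample measure are assumptions (the lab leaves them unspecified), and exact attributes vary with the Mondrian version:

<Schema name="SchemaTest">
  <Cube name="CubeTest">
    <Table schema="PUBLIC" name="CUSTOMERS"/>
    <Dimension name="CustomerGeography" foreignKey="CUSTOMERNUMBER">
      <Hierarchy name="CustomerGeo" hasAll="true" allMemberName="All Countries">
        <Table schema="PUBLIC" name="CUSTOMERS"/>
        <Level name="CustomerCountry" column="COUNTRY" type="String" uniqueMembers="true"/>
        <Level name="CustomerCity" column="CITY" type="String"/> <!-- column name assumed -->
      </Hierarchy>
    </Dimension>
    <Dimension name="CustomerContact" foreignKey="CUSTOMERNUMBER">
      <Hierarchy hasAll="true" allMemberName="All Contacts">
        <Table schema="PUBLIC" name="CUSTOMERS"/>
        <Level name="CustomerNames" column="CONTACTFIRSTNAME" type="String"/>
      </Hierarchy>
    </Dimension>
    <Measure name="CustomerCount" column="CUSTOMERNUMBER" aggregator="count"/> <!-- sample measure; the lab names it New Measure 0 -->
  </Cube>
</Schema>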
Result:
A new cube has been designed, configured and published to the Pentaho Server, and we viewed the cube via
the Pentaho User Console.
2. The Step configuration window will appear. This differs from the previous Step configuration window in
that it allows you to write JavaScript code. You will use it to build the message "Hello, " concatenated with
each of the names.
3. The main area of the configuration window is for coding. To the left there is a tree with a set of
available functions that you can use in the code; in particular, the last two branches hold
the input and output fields, ready to use in the code. In this example there are two fields: last_name and name.
4. At the bottom you type any variable created in the code. In this case, you have created a variable
named msg. Since you need to send this message to the output file, you have to write the variable name in the
grid. This should be the result:
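As a sketch, the script body for this step could be as simple as the following (PDI's Modified Java Script Value step exposes each input field as a JavaScript variable; msg becomes a new output field once it is added to the grid):

// 'name' arrives as an input field from the CSV file input Step.
// Build the greeting; declare msg as an output field in the grid below.
var msg = 'Hello, ' + name;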
1. Select the Step you just configured in order to check that the new field will leave this Step.
Input Fields are the data columns that reach a Step; Output Fields are the data columns that leave a Step.
Some Steps simply transform the input data, in which case the input and output fields are usually the same.
Other Steps add fields to the Output (Calculator, for example), and still others filter or combine data so
that the Output has fewer fields than the Input (Group by, for example).
2. Select Show Input Fields. You'll see that the Input Fields are last_name and name, which come from
the CSV file input Step.
3. Select Show Output Fields. You'll see that not only do you have the existing fields, but also the
new msg field.
Configuring the XML Output Step
1. Double-click the XML Output Step. The configuration window for this kind of Step will appear. Here
you're going to set the name and location of the output file, and establish which of the fields you want to
include. You may include all or some of the fields that reach the Step.
2. Click Get Fields to fill the grid with the three input fields. In the output file you only want to include the
message, so delete name and last_name.
OLAP supports the following operations on multidimensional data:
Roll-up
Drill-down
Slice
Dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
Drill-down
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider
the following diagram that shows how slice works.
The figure shows slice performed for the dimension "time" using the criterion time = "Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following
diagram that shows the dice operation.
The dice operation shown selects a sub-cube based on selection criteria that involve three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative
presentation of data. Consider the following diagram that shows the pivot operation.
Understand the features of WEKA Toolkit such as Explorer, Knowledge Flow Interface,
Experimenter and command Line Interface.
Once the program has been loaded on the user's machine, it is opened by navigating to the program's start
option; this depends on the user's operating system. The figure is an example of the initial opening
screen on a computer running Windows.
3. Navigate the options available in WEKA (ex. Select attribute panel, Preprocess panel, Classify
panel, Cluster panel, Associate panel and Visualize panel)
The figure shows the opening screen with the available options. At first, only the
Preprocess tab in the top left corner can be selected. This is because the data set must first be presented
to the application before it can be manipulated. After the data has been preprocessed, the other tabs become
active for use. There are six tabs:
1. Preprocess- used to choose the data file to be used by the application
2. Classify- used to test and train different learning schemes on the preprocessed data file
under experimentation
3. Cluster- used to apply different tools that identify clusters within the data file
4. Association- used to apply different rules to the data file that identify association
within the data
5. Select attributes- used to apply different rules to reveal changes based on a selected
attribute's inclusion in or exclusion from the experiment
6. Visualize- used to see what the various manipulations produced on the data set in a 2D
format, as scatter plot and bar graph output
4. Study the arff file format.
Attribute-Relation File Format
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a
set of attributes.
ARFF files have two distinct sections: the Header information, followed by
the Data information.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data
set has its own @attribute statement, which uniquely defines the name of that attribute and its data type.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name
then the entire name must be quoted.
The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:
numeric
<nominal-specification>
string
date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The
keywords numeric, string and date are case insensitive.
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the file. The format is:
@data
Example:
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in
the data), and their types. An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
%    (a) Creator: R.A. Fisher
%    (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%    (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are
case insensitive.
5. Explore the available data sets in WEKA. Load the data sets (e.g. the weather data set, the iris data set).
Load each data set and observe the following:

1. IRIS Dataset
i. List the attribute names and their types:
Sno    Name of attribute    Type of attribute

ii. Name of class:
iii. No. of records:

2. Weather Dataset
i. List the attribute names and their types:
Sno    Name of attribute    Type of attribute

ii. Name of class:
iii. No. of records:

(Repeat the same observation table for each further data set loaded.)
UNIT II
Perform data preprocessing tasks and demonstrate association rule mining.
A. Explore weather dataset for preprocessing the data and apply discretization and resample
filter on dataset.
1. Start Weka. It will open the Weka GUI window.
2. Click on the Explorer button and you get the Weka Knowledge Explorer window.
3. Click on the Open file... button and open the weather.arff file.
4. Click on Choose and select filters/unsupervised/attribute/Discretize. Then click on the area to the right of the
Choose button.
5. Click on the Apply button to do the discretization. Then select one of the original numeric attributes (e.g.
temperature) and see how it is discretized in the Selected attribute window.
Click on Choose and select filters/unsupervised/instance/Resample, then click on the area to the right of the Choose
button.
Click on the Apply button to perform the resampling. Then select one of the attributes (e.g.
temperature) and observe in the Selected attribute window how the instance counts have changed.
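The same two filters can also be applied programmatically through WEKA's Java API. A minimal sketch; the file path and the 50% sample size are assumptions, not part of the lab:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.instance.Resample;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (path is an assumption; adjust to your install).
        Instances data = DataSource.read("weather.arff");

        // Discretize all numeric attributes into equal-width bins.
        Discretize disc = new Discretize();
        disc.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, disc);

        // Resample: draw a random 50% subsample (with replacement by default).
        Resample res = new Resample();
        res.setSampleSizePercent(50.0);
        res.setInputFormat(data);
        Instances resampled = Filter.useFilter(data, res);

        // Show how 'temperature' was binned and how many instances remain.
        System.out.println(discretized.attribute("temperature"));
        System.out.println("Resampled instances: " + resampled.numInstances());
    }
}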
B. Load IRIS dataset into WEKA. Apply discretization filter on Numeric field and run
Apriori algorithm. Study the rules generated.
1. Start Weka. It will open the Weka GUI window.
2. Click on the Explorer button and you get the Weka Knowledge Explorer window.
3. Click on the Open file... button and open the iris.arff file.
4-5. Apply the discretization filter to the numeric attributes as described in part A (Apriori requires nominal attributes).
6. Select the Associate tab from the main menu. Then click on the Choose button and select
weka/associations/Apriori.
7. Click on the Start button to generate frequent itemsets and the different association rules.
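For reference, a minimal sketch of the same discretize-then-Apriori flow using WEKA's Java API (the file path and the number of rules are assumptions):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load iris and discretize it: Apriori works on nominal attributes only.
        Instances data = DataSource.read("iris.arff"); // path assumed
        Discretize disc = new Discretize();
        disc.setInputFormat(data);
        Instances nominal = Filter.useFilter(data, disc);

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);           // report the 10 best rules (assumed)
        apriori.buildAssociations(nominal);
        System.out.println(apriori);       // prints the generated association rules
    }
}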
UNIT III
Demonstrate performing classification on datasets
CLASSIFICATION: It is necessary to provide a clear classification of data mining systems, which
may help users to distinguish between such systems and to identify them. Data mining systems can
be classified in different ways:
1) According to the kinds of databases mined.
2) According to the kinds of knowledge mined, which is based on the
mining functionalities such as characterization, discrimination, etc.
3) According to the kinds of techniques utilized or the
applications adapted.
A. Load the IRIS and German Credit datasets into Weka and run the ID3 and J48 algorithms. Study
the classifier output, entropy values and kappa statistics.
B. Extract if-then rules generated by the classifiers. Observe the confusion matrix and derive
accuracy, F-measure, TPrate, FPrate, Precision and Recall values. Apply cross-validation
strategy with various fold levels and compute the accuracy results.
C. Load the IRIS dataset into Weka and perform Naive Bayes classification and k-nearest neighbor
classification. Interpret the results obtained.
NaiveBayes: Classification of data can be done based on Bayesian theorems. The NaiveBayes classifier is a
simple probabilistic classifier based on applying Bayes' theorem with strong independence
assumptions. In simple terms, the NaiveBayes classifier assumes that the presence or absence of a particular
feature of a class is unrelated to the presence or absence of any other feature. An advantage of the
NaiveBayes classifier is that it requires only a small amount of training data to estimate the parameters
(the means and variances of the variables) necessary for classification.
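A minimal sketch of how tasks A-C could be reproduced with WEKA's Java API, cross-validating Naive Bayes and k-nearest neighbor (the file path, the choice of k = 3 and the random seed are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesKnnDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");  // path assumed
        data.setClassIndex(data.numAttributes() - 1);   // class = last attribute

        // 10-fold cross-validation of Naive Bayes.
        Evaluation nbEval = new Evaluation(data);
        nbEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println("NaiveBayes accuracy: " + nbEval.pctCorrect() + " %");
        System.out.println(nbEval.toMatrixString("NaiveBayes confusion matrix"));

        // 10-fold cross-validation of k-nearest neighbors (k = 3, assumed).
        Evaluation knnEval = new Evaluation(data);
        knnEval.crossValidateModel(new IBk(3), data, 10, new Random(1));
        System.out.println("IBk(3) accuracy: " + knnEval.pctCorrect() + " %");
        System.out.println("IBk(3) kappa:    " + knnEval.kappa());
    }
}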
E. Compare the classification results of J48 and Naive Bayes classification for the IRIS dataset,
deduce which classifier performs best and worst for the IRIS dataset, and justify.
Parameter                           J48        Naive Bayes
Instances
Test mode
Number of leaves
Size of tree
Correctly classified instances
Incorrectly classified instances
Kappa statistic
Confusion matrix
UNIT IV
Demonstrate performing clustering on datasets
CLUSTERING
Clustering is the task of assigning a set of objects into groups called clusters. Clustering is also referred
to as cluster analysis: the objects in the same cluster are more similar to each other than to objects in
other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical
data analysis used in many fields such as machine learning, pattern recognition, image analysis and bioinformatics.
Cluster analysis is not a single algorithm but a general task to be solved. Clustering is of different types, such as
hierarchical clustering, which creates a hierarchy of clusters, partitional clustering, and spectral clustering.
SimpleKMeans: It is a method of cluster analysis called partitional cluster analysis or partitional clustering. K-means
clustering partitions or divides n observations into K clusters, where each observation belongs to the cluster with the
nearest mean. K-means clustering is an algorithm to group objects based on attributes/features into K
groups, where K is a positive integer.
A. Load the IRIS dataset into Weka and run the simple k-means clustering algorithm with different values
of k. Study the clusters formed. Observe the sum of squared errors and centroids, and derive
insights.
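A scripted version of task A, as a sketch against WEKA's Java API (the file path, the seed and the k range 2-5 are assumptions; the class attribute is removed first because clustering is unsupervised):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // path assumed
        // Drop the class attribute before clustering.
        Remove rm = new Remove();
        rm.setAttributeIndices("last");
        rm.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, rm);

        for (int k = 2; k <= 5; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(10); // seed assumed
            km.buildClusterer(noClass);
            // Within-cluster sum of squared errors and the cluster centroids.
            System.out.println("k=" + k + "  SSE=" + km.getSquaredError());
            System.out.println(km.getClusterCentroids());
        }
    }
}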
C. Explore the other clustering algorithms available under weka.clusterers:
1. weka.clusterers.FarthestFirst
2. weka.clusterers.EM
3. weka.clusterers.Cobweb
4. weka.clusterers.FilteredClusterer
5. weka.clusterers.HierarchicalClusterer
6. weka.clusterers.MakeDensityBasedClusterer
6. Explore the visualization features of weka to visualize the clusters. Derive interesting insights and
explain.
% A32 : existing credits paid back duly till now
% A33 : delay in paying off in the past
% A34 : critical account / other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
%   Purpose
%   A40  : car (new)
%   A41  : car (used)
%   A42  : furniture/equipment
%   A43  : radio/television
%   A44  : domestic appliances
%   A45  : repairs
%   A46  : education
%   A47  : (vacation - does not exist?)
%   A48  : retraining
%   A49  : business
%   A410 : others
%
% Attribute 5: (numerical)
%   Credit amount
%
% Attribute 6: (qualitative)
%   Savings account/bonds
%   A61 : ... < 100 DM
%   A62 : 100 <= ... < 500 DM
%   A63 : 500 <= ... < 1000 DM
%   A64 : ... >= 1000 DM
%   A65 : unknown / no savings account
%
% Attribute 7: (qualitative)
%   Present employment since
%   A71 : unemployed
%   A72 : ... < 1 year
%   A73 : 1 <= ... < 4 years
%   A74 : 4 <= ... < 7 years
%   A75 : ... >= 7 years
%
% Attribute 8: (numerical)
%   Installment rate in percentage of disposable income
%
% Attribute 9: (qualitative)
%   Personal status and sex
%   A91 : male : divorced/separated
%   A92 : female : divorced/separated/married
%   A93 : male : single
%   A94 : male : married/widowed
%   A95 : female : single
%
% Attribute 10: (qualitative)
%   Other debtors / guarantors
%   A101 : none
%   A102 : co-applicant
%   A103 : guarantor
%
% Attribute 11: (numerical)
%   Present residence since
%
% Attribute 12: (qualitative)
%   Property
%   A121 : real estate
%   A122 : if not A121 : building society savings agreement / life insurance
%   A123 : if not A121/A122 : car or other, not in attribute 6
%   A124 : unknown / no property
%
% Attribute 13: (numerical)
%   Age in years
%
% Attribute 14: (qualitative)
%   Other installment plans
%   A141 : bank
%   A142 : stores
%   A143 : none
%
% Attribute 15: (qualitative)
%   Housing
%   A151 : rent
%   A152 : own
%   A153 : for free
%
% Attribute 16: (numerical)
%   Number of existing credits at this bank
%
% Attribute 17: (qualitative)
%   Job
%   A171 : unemployed / unskilled - non-resident
%   A172 : unskilled - resident
%   A173 : skilled employee / official
%   A174 : management / self-employed / highly qualified employee / officer
%
% Attribute 18: (numerical)
%   Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
%   Telephone
%   A191 : none
%   A192 : yes, registered under the customer's name
%
% From: A45   To: repairs
% From: A46   To: education
% From: A47   To: vacation
% From: A48   To: retraining
% From: A49   To: business
% From: A410  To: other
%
% Relabeled values in attribute savings_status
% From: A61   To: '<100'
% From: A62   To: '100<=X<500'
% From: A63   To: '500<=X<1000'
% From: A64   To: '>=1000'
% From: A65   To: 'no known savings'
%
% Relabeled values in attribute employment
% From: A71   To: unemployed
% From: A72   To: '<1'
% From: A73   To: '1<=X<4'
% From: A74   To: '4<=X<7'
% From: A75   To: '>=7'
%
% Relabeled values in attribute personal_status
% From: A91   To: 'male div/sep'
% From: A92   To: 'female div/dep/mar'
% From: A93   To: 'male single'
% From: A94   To: 'male mar/wid'
% From: A95   To: 'female single'
%
% Relabeled values in attribute other_parties
% From: A101  To: none
% From: A102  To: 'co applicant'
% From: A103  To: guarantor
%
% Relabeled values in attribute property_magnitude
% From: A121  To: 'real estate'
% From: A122  To: 'life insurance'
% From: A123  To: car
% From: A124  To: 'no known property'
%
% Relabeled values in attribute other_payment_plans
% From: A141  To: bank
% From: A142  To: stores
% From: A143  To: none
%
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good
3.1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment case study given to us, the following attributes are found to be
applicable for credit-risk assessment:
Total Valid Attributes:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker
Categorical or Nominal attributes (which take true/false, yes/no, etc. values):
1. checking_status
2. credit history
3. purpose
4. savings_status
5. employment
6. personal status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign worker
Real valued attributes:
1. duration
2. credit amount
3. installment rate
4. residence
5. age
6. existing credits
7. num_dependents
3.2. What attributes do you think might be crucial in making the credit assessment? Come up with
some simple rules in plain English using your selected attributes.
The following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing credit
Based on the above attributes, we can make a decision whether to give credit or not. For example: if
credit_history is critical and savings_status is less than 100 DM, then credit = bad.
3.3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete
dataset as the training data. Report the model obtained after training.
A decision tree is a flowchart-like tree structure where each internal (non-leaf) node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class
label.
Decision trees can be easily converted into classification rules.
Examples: ID3, C4.5 and CART.
J48 pruned tree
1. Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
2. In the Classify tab, select the Choose option, where a list of different decision trees is available. From that
list select J48.
3. Now under test options, select the Use training set test option.
4. The resulting window in WEKA is as follows:
5. To generate the decision tree, right-click on the result list and select the visualize tree option, by which the
decision tree will be generated.
6. The decision tree obtained for credit risk assessment is too large to fit on the screen.
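The same training run can also be scripted; a minimal sketch using the WEKA Java API (the file name credit-g.arff for the German credit ARFF is an assumption):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48TrainDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path assumed
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();        // C4.5-style pruned tree by default
        tree.buildClassifier(data);  // train on the complete dataset

        System.out.println(tree);    // textual form of the trained tree
        System.out.println("Leaves: " + tree.measureNumLeaves()
            + ", tree size: " + tree.measureTreeSize());
    }
}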
3.4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for
each of the examples in the dataset. What % of examples can you classify correctly? (This is also called
testing on the training set) Why do you think you cannot get 100 % training accuracy?
In the above model we trained on the complete dataset and classified credit good/bad for each of the
examples in the dataset.
For example:

if purpose = vacation then credit = bad
else if purpose = business then credit = good

In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly and the
remaining 14.5% incorrectly. We can't get 100% training accuracy because, out of the
20 attributes, some unnecessary attributes are also analyzed and trained on. Due to this the
accuracy is affected, and hence we can't get 100% training accuracy.
3.5. Is testing on the training set as you did above a good idea? Why or Why not?
As a rule of thumb, for maximum accuracy we should take 2/3 of the dataset as the training set and the
remaining 1/3 as the test set. But in the above model we took the complete dataset as the training set, which
yields only 85.5% accuracy.
This happens because unnecessary attributes, which do not play a crucial role in
credit risk assessment, are also analyzed and trained on; this increases complexity and lowers accuracy. If
part of the dataset is used as a training set and the remainder as a test set, the results are more accurate
and the computation time is lower. This is why we prefer not to take the complete dataset as the training set.
Use Training Set result for the table GermanCreditData:

Correctly Classified Instances      855       85.5 %
Incorrectly Classified Instances    145       14.5 %
Kappa statistic                       0.6251
Mean absolute error                   0.2312
Root mean squared error               0.34
Relative absolute error              55.0377 %
Root relative squared error          74.2015 %
Total Number of Instances          1000
3.6. One approach for solving the problem encountered in the previous question is using cross-validation.
Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report
your results. Does your accuracy increase/decrease? Why?
Cross validation:
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets, or folds,
D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i,
partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That
is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain a first
model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so
on.
1. Select the Classify tab and the J48 decision tree, and in the test options select the cross-validation radio button
with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the data set.
3. A kappa statistic nearing 1 indicates 100% accuracy, with all errors zeroed
out; in reality, however, there is no such training set that gives 100% accuracy.
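A minimal sketch of 10-fold cross-validation of J48 via the WEKA Java API (the file path and the random seed are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path assumed
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation: each fold is held out once for testing.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString("\n10-fold CV results\n", false));
        System.out.println("Kappa: " + eval.kappa());
    }
}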
Cross Validation Result at folds: 10 for the table German Credit Data:

Correctly Classified Instances      705       70.5 %
Incorrectly Classified Instances    295       29.5 %
Kappa statistic                       0.2467
Mean absolute error                   0.3467
Root mean squared error               0.4796
Relative absolute error              82.5233 %
Root relative squared error         104.6565 %
Total Number of Instances          1000
Here there are 1000 instances with 100 instances per partition.
Cross-validation results at other fold levels for the same table:

Correctly Classified Instances      698       69.8 %
Incorrectly Classified Instances    302       30.2 %
Kappa statistic                       0.2264
Mean absolute error                   0.3571
Root mean squared error               0.4883
Relative absolute error              85.0006 %
Root relative squared error         106.5538 %
Total Number of Instances          1000

Correctly Classified Instances      709       70.9 %
Incorrectly Classified Instances    291       29.1 %
Kappa statistic                       0.2538
Mean absolute error                   0.3484
Root mean squared error               0.4825
Relative absolute error              82.9304 %
Root relative squared error         105.2826 %
Total Number of Instances          1000

Correctly Classified Instances      710       71   %
Incorrectly Classified Instances    290       29   %
Kappa statistic                       0.2587
Mean absolute error                   0.3444
Root mean squared error               0.4771
Relative absolute error              81.959  %
Root relative squared error         104.1164 %
Total Number of Instances          1000
Percentage split does not allow 100%; it allows only up to 99.9%.
Percentage split result (500 test instances, i.e. a 50% split):

Correctly Classified Instances      362       72.4 %
Incorrectly Classified Instances    138       27.6 %
Kappa statistic                       0.2725
Mean absolute error                   0.3225
Root mean squared error               0.4764
Relative absolute error              76.3523 %
Root relative squared error         106.4373 %
Total Number of Instances           500
Another percentage-split result (100 test instances):

Kappa statistic                       0.6667
Mean absolute error                   0.6667
Relative absolute error             221.7054 %
Root relative squared error         221.7054 %
Total Number of Instances           100
3.7. Check to see if the data shows a bias against "foreign workers" (attribute 20), or "personal status"
(attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes
from the dataset and see if the decision tree created in those cases is significantly different from the full
dataset case which you have already done. To remove an attribute you can use the Preprocess tab in
Weka's GUI Explorer. Did removing these attributes have any significant effect? Discuss.
This increases the accuracy, because the two attributes foreign workers and personal status are not
very important for training and analysis.
By removing them, the training time is reduced to some extent and the accuracy increases.
The decision tree created on the full dataset is very large compared to the decision tree we have trained
now. This is the main difference between these two decision trees.
If we remove the 9th attribute, the accuracy further increases to 86.6%, which shows that these two attributes are
not significant for training.
3.8. Another question might be, do you really need to input so many attributes to get good results? Maybe
only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class
attribute, naturally). Try out some combinations. (You had removed two attributes in problem
7. Remember to reload the arff data file to get all the attributes initially before you start selecting the ones
you want.)
Select attributes 2,3,5,7,10,17,21 and click on Invert to remove the remaining attributes.
After removing attributes 1,4,6,8,9,11,12,13,14,15,16,18,19 and 20, we select the leftover attributes and
visualize them.
After removing the 14 attributes, the accuracy decreased to 76.4%; hence we can further try random
combinations of attributes to increase the accuracy, as sketched below.
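A sketch of how such an attribute subset can be evaluated programmatically (the file path and the seed are assumptions; the index list matches the question):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class AttributeSubsetDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path assumed

        // Keep only attributes 2,3,5,7,10,17 and the class (21): invert the removal.
        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21");
        keep.setInvertSelection(true);
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        subset.setClassIndex(subset.numAttributes() - 1);

        Evaluation eval = new Evaluation(subset);
        eval.crossValidateModel(new J48(), subset, 10, new Random(1));
        System.out.println("Accuracy on subset: " + eval.pctCorrect() + " %");
    }
}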
3.9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher
than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally
in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do
this by using a cost matrix in Weka. Train your Decision Tree again and report the Decision Tree and
cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal
cost)?
In problem 6 we used equal costs when training the decision tree. But here we consider two cases with
different costs: let us take cost 5 in case 1 and cost 2 in case 2.
When we give such costs in both cases and train the decision tree, we observe that the result is almost equal to
that of the decision tree obtained in problem 6.

               Case 1 (cost 5)    Case 2 (cost 2)
Total cost     3820               1705
Average cost   3.82               1.705

We don't find this cost factor in problem 6, as there we used equal costs. This is the
major difference between the results of problem 6 and problem 9.
The cost matrices we used here:
Case 1:   5  1
          1  5
Case 2:   2  1
          1  2
4. Set classes as 2.
5. Click on Resize and then we'll get the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. Then the confusion matrix will be generated and you can find out the difference
between the good and bad attributes.
8. Check whether the accuracy changes or not.
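A sketch of the equivalent cost-sensitive evaluation via the WEKA Java API (the file path is an assumption; the matrix penalises rejecting a good applicant five times more, assuming the class order good, bad as in the German credit ARFF):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path assumed
        data.setClassIndex(data.numAttributes() - 1);

        // 2x2 cost matrix: rows = actual class, columns = predicted class.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 5.0); // actual good, predicted bad -> cost 5
        costs.setCell(1, 0, 1.0); // actual bad, predicted good -> cost 1

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.println("Total cost:   " + eval.totalCost());
        System.out.println("Average cost: " + eval.avgCost());
    }
}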
3.10. Do you think it is a good idea to prefer simple decision trees instead of having long, complex
decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
When we consider long, complex decision trees, we have many unnecessary attributes in the tree, which
increases the bias of the model. Because of this, the accuracy of the model can also be affected.
This problem can be reduced by considering a simple decision tree: there are fewer attributes, which decreases the
bias of the model. Due to this, the result will be more accurate.
So it is a good idea to prefer simple decision trees to long, complex ones.
1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, select All to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.
4. To generate the decision tree, right-click on the result list and select the visualize tree option, by which
the decision tree will be generated.
3.11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced
Error Pruning - explain this idea briefly. Try reduced-error pruning for training your Decision Trees
using cross-validation (you can do this in Weka) and report the Decision Tree you obtain. Also report
your accuracy using the pruned model. Does your accuracy increase?
Reduced-error pruning: The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is
called reduced-error pruning. The variant described previously prunes a rule immediately after it has been
grown and is called incremental reduced-error pruning.
Another possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding individual tests.
However, this method is much slower. Of course, there are many different ways to assess the worth of a rule
based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the
predicted class from other classes if it were the only rule in the theory, operating under the closed world
assumption.
If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total
of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n
negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total
number of negative instances.
Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set, has been
used to evaluate the success of a rule when using reduced-error pruning.
1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to true (leave the unpruned option as false, since an unpruned
tree cannot be reduced-error pruned).
3. Then press OK and then Start.
4. We find that the accuracy increases when the reduced-error pruning option is selected.
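A sketch of the same experiment through the WEKA Java API (the file path, the seed and the numFolds value are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruningDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path assumed
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setReducedErrorPruning(true); // hold out part of the data for pruning
        tree.setNumFolds(3);               // 1 fold for pruning, the rest for growing

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println("Accuracy with REP: " + eval.pctCorrect() + " %");
    }
}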
3.12. (Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own
small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different
classifiers that output the model in the form of rules - one such classifier in Weka is rules.PART; train
this model and report the set of rules obtained. Sometimes just one attribute can be good enough in
making the decision - yes, just one! Can you predict what attribute that might be in this dataset? The OneR
classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error).
Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In Weka, rules.PART is one of the classifiers that convert decision trees into IF-THEN-ELSE rules.
Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier produces a PART decision list.
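A rough sketch of how the three learners could be compared programmatically (the file path is an assumption; the GUI route described above remains the lab's prescribed method):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleLearnerDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // path assumed
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validate each learner and report its accuracy.
        Classifier[] learners = { new J48(), new PART(), new OneR() };
        for (Classifier c : learners) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(c.getClass().getSimpleName()
                + " accuracy: " + eval.pctCorrect() + " %");
        }

        // Print the single rule that OneR selects.
        OneR oner = new OneR();
        oner.buildClassifier(data);
        System.out.println(oner);
    }
}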