DWDM LAB
EXERCISE:1
Creation of a Data Warehouse:
➢ Build Data Warehouse/Data Mart (using open source tools like Pentaho Data Integration Tool,
Pentaho Business Analytics; or other data warehouse tools like Microsoft-SSIS, Informatica,
Business Objects,etc.,)
➢ Design multi-dimensional data models namely Star, Snowflake and Fact Constellation
schemas for any one enterprise (ex. Banking, Insurance, Finance, Healthcare, manufacturing,
Automobiles, sales etc).
➢ Write ETL scripts and implement using data warehouse tools.
➢ Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.
In this task, we are going to use the MySQL Administrator and SQLyog Enterprise tools for building and identifying tables in a database, and also for populating (filling) the sample data in those tables. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. We build a data warehouse by integrating all the tables in the database and analyzing their data. The figure below shows the MySQL Administrator connection establishment.
There are different options available in MySQL Administrator. After a successful connection is established through MySQL Administrator, we use another tool, SQLyog Enterprise, for building and identifying tables in a database. Below we can see the SQLyog Enterprise window.
In the left-side navigation we can see the different databases and their related tables. Now we are going to build tables and populate their data in the database through SQL queries. These tables can then be used for building the data warehouse.
In the above two windows, we created a database named "sample" and, in that database, two tables named "user_details" and "hockey" through SQL queries.
Now we are going to populate (fill) sample data through SQL queries into those two tables, as represented in the windows below.
Through MySQL Administrator and SQLyog, we can import databases from other sources (.XLS, .CSV, .sql) and also export our databases as backups for further processing. We can connect MySQL to other applications for data analysis and reporting.
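The same steps can also be scripted. Below is a minimal, hedged Python sketch that creates the "sample" database, the two tables and a few rows, assuming a local MySQL server and the mysql-connector-python package; the column names and sample rows are illustrative, since the manual does not list the exact schema.

# Hedged sketch: create the "sample" database, two tables and a few rows.
# Assumes a local MySQL server and the mysql-connector-python package.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="root")  # placeholder credentials
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS sample")
cur.execute("USE sample")

# Illustrative columns; the lab screenshots do not show the exact schema.
cur.execute("""CREATE TABLE IF NOT EXISTS user_details (
                 user_id INT PRIMARY KEY,
                 user_name VARCHAR(50),
                 city VARCHAR(50))""")
cur.execute("""CREATE TABLE IF NOT EXISTS hockey (
                 player_id INT PRIMARY KEY,
                 player_name VARCHAR(50),
                 team VARCHAR(50))""")

# Populate (fill) sample data.
cur.execute("INSERT INTO user_details VALUES (1, 'Ravi', 'Vizag'), (2, 'Sita', 'Hyderabad')")
cur.execute("INSERT INTO hockey VALUES (1, 'Arjun', 'Team A'), (2, 'Kiran', 'Team B')")
conn.commit()
cur.close()
conn.close()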
(ii). Design multi-dimensional data models namely Star, snowflake and Fact constellation
schemas for any one enterprise (ex. Banking, Insurance, Finance, Healthcare,
Manufacturing, Automobile, etc.).
Multi-Dimensional model was developed for implementing data warehouses & it provides both a
mechanism to store data and a way for business analysis. The primary components of dimensional
model are dimensions and facts. There are different types of multi-dimensional data models. They are:
1. Star Schema Model
2. Snow Flake Schema Model
3. Fact Constellation Model.
Now, we are going to design these multi-dimensional models for the Marketing enterprise.
First, we need to build the tables in a database through SQLyog, as shown below.
In the above window, the left-side navigation bar shows a database named "sales_dw" in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" for building the multi-dimensional models.
In the above window, we see Microsoft Visual Studio before creating a project; the right-side navigation bar contains different options like Data Sources, Data Source Views, Cubes, Dimensions, etc.
Through Data Sources, we can connect to our MySQL database named "sales_dw". Then all the tables in that database are automatically retrieved into this tool for creating multi-dimensional models.
Through data source views and cubes, we can see the retrieved tables in multi-dimensional form. We also need to add dimensions through the Dimensions option. In general, multi-dimensional models consist of dimension tables and fact tables.
A star schema model is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but dimension tables are not joined to each other. It is the simplest style of data warehouse schema.
The entity-relationship diagram of this schema resembles a star, with points radiating from a central table, as seen in the window implemented in Visual Studio below.
The snowflake schema is slightly different from the star schema: the dimension tables of a star schema are organized into a hierarchy by normalizing them.
A snowflake schema is represented by a centralized fact table which is connected to multiple dimension tables. Snowflaking affects only dimension tables, not fact tables. We developed a snowflake schema for the sales_dw database with the Visual Studio tool as shown below.
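To make the star-schema join concrete outside the modelling tool, here is a hedged pandas sketch that builds tiny stand-ins for one dimension table and the fact table of sales_dw and joins them on a surrogate key; the column names and values are illustrative assumptions, not the real schema.

# Hedged sketch of a star-schema join: fact table joined to a dimension table on a key.
import pandas as pd

dimproduct = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "ProductName": ["Pen", "Notebook", "Bag"],
})
factproductsales = pd.DataFrame({
    "SalesID": [101, 102, 103, 104],
    "ProductID": [1, 2, 1, 3],        # foreign key to dimproduct
    "QuantitySold": [10, 5, 7, 2],
    "SalesTotal": [50.0, 100.0, 35.0, 120.0],
})

# Primary-key-to-foreign-key join, as in a star schema.
joined = factproductsales.merge(dimproduct, on="ProductID", how="left")
print(joined.groupby("ProductName")["SalesTotal"].sum())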
ETL (Extract-Transform-Load):
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers a process
of how the data are loaded from the source system to the data warehouse. Currently, the ETL
encompasses a cleaning step as a separate step. The sequence is then Extract-Clean- Transform-
Load. Let us briefly describe each step of the ETL process.
Process
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for
further processing. The main objective of the extract step is to retrieve all the required data from
the source system with as little resources as possible. The extract step should be designed in a way
that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a record has
been changed and describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update
has occurred, but they are able to identify which records have been modified and provide
an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that when using a daily extract, we may not be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at all, so a
full extract is the only way one can get the data out of the system. The full extract requires
keeping a copy of the last extract in the same format in order to be able to identify changes.
Full extract handles deletions as well.
When using Incremental or Full extracts, the extract frequency is extremely important.
Particularly for full extracts; the data volumes can be in tens of gigabytes.
Clean:
The cleaning step is one of the most important as it ensures the quality of the data in the data
warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
• Convert null values into standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
Transform:
The transform step applies a set of rules to transform the data from the source to the target. This
includes converting any measured data to the same dimension (i.e. conformed dimension) using
the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, and deriving new calculated values, as well as applying advanced validation rules.
Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as little
resources as possible. The target of the Load process is often a database. In order to make the load
process efficient, it is helpful to disable any constraints and indexes before the load and enable
them back only after the load completes. The referential integrity needs to be maintained by the ETL tool to ensure consistency.
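A compressed illustration of the Extract-Clean-Transform-Load sequence is sketched below using pandas and SQLite; this is only an assumption for illustration (the lab itself uses OpenRefine and GUI tools), and the file name and columns are hypothetical.

# Hedged Extract-Clean-Transform-Load sketch with pandas and SQLite.
# "customers.csv" and its columns are hypothetical.
import pandas as pd
import sqlite3

# Extract: read the source file.
df = pd.read_csv("customers.csv")

# Clean: unify identifiers and standardize missing values.
df["sex"] = df["sex"].replace({"M": "Male", "F": "Female", "Man": "Male", "Woman": "Female"})
df["sex"] = df["sex"].fillna("Unknown")
df["zip"] = df["zip"].astype(str).str.strip().str.zfill(5)

# Transform: derive a new column and an aggregate.
df["full_name"] = df["first_name"].str.strip() + " " + df["last_name"].str.strip()
orders_per_city = df.groupby("city", as_index=False)["order_total"].sum()

# Load: write to the target database.
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("dim_customer", conn, if_exists="replace", index=False)
    orders_per_city.to_sql("fact_city_sales", conn, if_exists="replace", index=False)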
Managing ETL Process:
The ETL process seems quite straightforward. As with every application, there is a possibility that
the ETL process fails. This can be caused by missing extracts from one of the systems, missing
values in one of the reference tables, or simply a connection or power outage. Therefore, it is
necessary to design the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart, at least, some of the phases independently from the others. For
example, if the transformation step fails, it should not be necessary to restart the Extract step. We
can ensure this by implementing proper staging. Staging means that the data is simply dumped to
the location (called the Staging Area) so that it can then be read by the next processing phase. The
staging area is also used during the ETL process to store intermediate results of processing. The staging area should be accessed by the ETL process only; it should never be available to anyone else, particularly not to end users, since it may contain incomplete or in-the-middle-of-the-processing data and is not intended for data presentation to the end user.
When you are about to use an ETL tool, there is a fundamental decision to be made: will the
company build its own data transformation tool or will it use an existing tool?
Building your own data transformation tool (usually a set of shell scripts) is the preferred approach
for a small number of data sources which reside in storage of the same type. The reason for that is
the effort to implement the necessary transformation is little due to similar data structure and
common system architecture. Also, this approach saves licensing cost and there is no need to train
the staff in a new tool. This approach, however, is dangerous from the TCO (total cost of ownership) point of view. If the transformations become more sophisticated over time, or there is a need to integrate other systems, the complexity of such an ETL system grows but the manageability drops significantly. Similarly, implementing your own tool often resembles re-inventing the wheel.
There are many ready-to-use ETL tools on the market. The main benefit of using off-the-shelf ETL
tools is the fact that they are optimized for the ETL process by providing connectors to common
data sources like databases, flat files, mainframe systems, xml, etc. They provide a means to
implement data transformations easily and consistently across various data sources. This includes
filtering, reformatting, sorting, joining, merging, aggregation and other operations ready to use.
The tools also support transformation scheduling, version control, monitoring and unified metadata
management. Some of the ETL tools are even integrated with BI tools.
Some of the Well Known ETL Tools:
The most well-known commercial tools are Ab Initio, IBM InfoSphere DataStage,
Informatica, Oracle Data Integrator, and SAP Data Integrator.
For this lab, we are going to use the OpenRefine 2.8 ETL tool on different sample datasets for extraction, data cleaning, transformation and loading.
Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.
OLAP Operations:
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations:
• Roll-up (Drill-up)
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up (Drill-up):
Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. For example, with the time hierarchy "day < month < quarter < year", rolling up moves from the level of quarter to the level of year. It navigates from more detailed data to less detailed data.
Drill-down:
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.
• Drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to the
levelof month.
• When drill-down is performed, one or more dimensions from the data cube are added.
• It navigates the data from less detailed data to highly detailed data.
Slice:
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube.
Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data.
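Before the Excel walk-through, here is a hedged pandas sketch of the same four operations on a tiny hypothetical sales "cube"; the column names and values are illustrative only.

# Hedged sketch of OLAP-style operations on a small DataFrame cube.
import pandas as pd

sales = pd.DataFrame({
    "year":     [2009, 2009, 2009, 2010, 2010, 2010],
    "quarter":  ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "category": ["Rock", "Jazz", "Rock", "Jazz", "Rock", "Jazz"],
    "sales":    [100, 80, 120, 90, 150, 60],
})

# Roll-up: aggregate from quarter level up to year level.
rollup = sales.groupby("year")["sales"].sum()

# Drill-down: go back to the more detailed quarter level.
drilldown = sales.groupby(["year", "quarter"])["sales"].sum()

# Slice: fix one dimension (year = 2009).
slice_2009 = sales[sales["year"] == 2009]

# Dice: select on two or more dimensions.
dice = sales[(sales["year"].isin([2009, 2010])) & (sales["category"] == "Rock")]

# Pivot (rotate): swap rows and columns of the summary.
pivot = sales.pivot_table(index="category", columns="year", values="sales", aggfunc="sum")
print(rollup, drilldown, slice_2009, dice, pivot, sep="\n\n")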
Now, we are practically implementing all these OLAP Operations using Microsoft
Excel.
We loaded the music.cub data for analyzing the different OLAP operations. First, we performed the drill-down operation as shown below.
Now we are going to perform the roll-up (drill-up) operation. In the window above, the month of January is selected,
so the Drill-up option is automatically enabled at the top. When we click on the Drill-up option, the window below is displayed.
The next OLAP operation, slicing, is performed by inserting a slicer, as shown in the top navigation options. While inserting slicers for the slicing operation, we select two dimensions (e.g. CategoryName and Year) with one measure (e.g. Sum of Sales). After inserting a slicer and adding a filter (CategoryName: AVANT ROCK and BIG BAND; Year: 2009 and 2010), we get the table shown below.
Finally, the pivot (rotate) OLAP operation is performed by swapping rows (Order Date - Year) and
columns (Values - Sum of Quantity and Sum of Sales) through the bottom-right navigation bar, as shown below.
After swapping (rotating), we get the result represented below, with a pie chart for the Classical category and year-wise data.
The window below represents data visualization through the Pentaho Business Analytics tool online (http://www.pentaho.com/hosted-demo) for a sample dataset.
EXERCISE-2
Aim: Explore the machine learning tool WEKA.
Description:
Weka contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to these functions. It is portable, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
With WEKA, machine learning algorithms are readily available to users. ML specialists can use these
methods to extract useful information from high volumes of data. Here, the specialists can create an
environment to develop new machine learning methods and implement them on real data.
Step-1: Download the WEKA installer from the official website.
Step-2: After the download completes, open the file location and double-click on the downloaded file. The Setup wizard will appear. Click on Next.
Step-3: The License Agreement terms will open. Read it thoroughly and click on “I Agree”.
Step-4: According to your requirements, select the components to be installed. Full component
installation is recommended. Click on Next.
Explorer: The WEKA Explorer window shows different tabs, starting with Preprocess. Initially, the Preprocess tab is active, since the data set is first preprocessed before applying algorithms to it and exploring it.
Experimenter: The WEKA Experimenter button allows users to create, run, and modify different schemes in one experiment on a dataset.
The different components available are Datasources, Datasavers, Filters, Classifiers, Clusters,
Evaluation, and Visualization.
Simple CLI:
The command-line interface is a text-based user interface used to run programs, manage computer files and interact with the computer. It is also called a command-line user interface.
When you click on the Explorer button in the Applications selector, the Explorer window opens.
Now we can see the 6 tabs in explorer:
• Preprocess
• Classify
• Cluster
• Associate
• Select Attributes
• Visualize
Under these tabs, there are several pre-implemented machine learning algorithms. Let us look into each
of them in detail now.
• Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first step in machine learning
is to preprocess the data. Thus, in the Preprocess option, you will select the data file, process it and
make it fit for applying the various machine learning algorithms.
• Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your data.
To list a few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector
Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very
exhaustive and provides both supervised and unsupervised machine learning algorithms
• Cluster Tab
There are several clustering algorithms provided - such as SimpleKMeans, FilteredClusterer,
HierarchicalClusterer, and so on.
• Associate Tab
In the Associate tab, we can find Apriori, Filtered Associator and FPGrowth.
EXERCISE-3
Aim: Perform data preprocessing tasks and demonstrate performing association rule mining on data sets.
Description: The data that is collected from the field contains many unwanted things that lead to wrong analysis.
To demonstrate the available features in preprocessing, we will use the weather database that is
provided in the installation.
Step-1: Using the Open file option under the Preprocess tab, select the weather.nominal.arff file.
➢ Applying Filters:
• There are many filters, such as:
• Unsupervised filters
• Supervised filters
• Discretization
• Resample filter, etc.
▪ Supervised filters:
Supervised learning is a machine learning method in which models are trained using labeled data. In
supervised learning, models need to find the mapping function to map the input variable (X) with the
output variable (Y).
Supervised learning can be used for two types of problems: Classification and Regression
▪ Unsupervised filters:
Unsupervised learning is another machine learning method in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns from the data on its own.
▪ Discretization: Once in a while one has numeric data but wants to use a classifier that handles only nominal values. In that case one needs to discretize the data, which can be done with the supervised and unsupervised Discretize filters (weka.filters.supervised.attribute.Discretize and weka.filters.unsupervised.attribute.Discretize).
But since discretization depends on the data presented to the discretization algorithm, one can easily end up with incompatible train and test files.
➢ Load weather.nominal into Weka and run the Apriori algorithm with different support and confidence values.
➢ Apriori Algorithm:
• AIM: To select interesting rules from the set of all possible rules, constraints on various
measures of significance and interest can be used. The best known constraints are minimum
thresholds on support and confidence.
• Description:
The Apriori algorithm is one such algorithm in ML that finds out the probable associations and
creates association rules.
WEKA provides the implementation of the Apriori algorithm. You can define the minimum support
and an acceptable confidence level while computing these rules.
• ALGORITHM:
Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database.
The Apriori algorithm finds the frequent itemsets L in database D:
· Find the frequent itemset Lk-1.
· Join step: Ck is generated by joining Lk-1 with itself.
· Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence is removed.
(Ck: candidate itemset of size k; Lk: frequent itemset of size k)
Apriori pseudocode:
Apriori(T, ε)
    L1 <- {large 1-itemsets that appear in more than ε transactions}
    k <- 2
    while L(k-1) ≠ Φ
        C(k) <- Generate(L(k-1))
        for transactions t ∈ T
            C(t) <- Subset(C(k), t)
            for candidates c ∈ C(t)
                count[c] <- count[c] + 1
        L(k) <- {c ∈ C(k) | count[c] ≥ ε}
        k <- k + 1
    return ∪k L(k)
→Steps for run Apriori algorithm in WEKA :
Output:
• Support count: The support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions.
• Then, for a rule X → Y:
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
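As a small illustration of these definitions, the hedged Python sketch below counts support and confidence for one candidate rule over a toy transaction list; the items are made up for the example.

# Hedged sketch: support and confidence of the rule {bread} -> {butter}
# over a toy transaction list (items are illustrative).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]
n = len(transactions)

X, Y = {"bread"}, {"butter"}
count_X = sum(1 for t in transactions if X <= t)
count_XY = sum(1 for t in transactions if (X | Y) <= t)

support = count_XY / n            # (X U Y).count / n  -> 2/4 = 0.5
confidence = count_XY / count_X   # (X U Y).count / X.count -> 2/3
print(support, confidence)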
EXERCISE-4
4(a) Aim: Load each dataset into Weka, run the ID3 and J48 classification algorithms, and study the classifier output. Compute entropy values and the Kappa statistic.
➢ Description:
→ Steps to run the ID3 and J48 classification algorithms in WEKA:
▪ Open WEKA Tool.
▪ Click on WEKA Explorer.
▪ Click on Preprocessing tab button.
▪ Click on open file button.
▪ Choose WEKA folder in C drive.
▪ Select and Click on data option button.
▪ Choose iris data set and open file.
▪ Click on classify tab and Choose J48 algorithm and select use training set test option.
▪ Click on start button.
▪ Click on classify tab and Choose ID3 algorithm and select use training set test option.
▪ Click on start button.
▪ Output:
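The Weka classifier output window reports the tree, the accuracy, the confusion matrix and the Kappa statistic. As an optional cross-check outside Weka, here is a minimal hedged scikit-learn sketch; note this is an assumption on our part, since scikit-learn does not ship ID3/J48, so DecisionTreeClassifier with the entropy criterion is used as a rough stand-in, evaluated on the training set as in the steps above.

# Hedged sketch: entropy-based decision tree on iris, evaluated on the
# training set, reporting accuracy and the Kappa statistic.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

pred = clf.predict(X)
print("Accuracy:", accuracy_score(y, pred))
print("Kappa   :", cohen_kappa_score(y, pred))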
4(b) Extract if-then rules from the decision tree generated by the classifier, and observe the confusion matrix.
Description: A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node
IF-THEN Rules:
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form: IF condition THEN conclusion. For example, consider a rule R1: IF age = youth AND student = yes THEN buy_computer = yes.
Output:
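For comparison outside Weka, a minimal hedged scikit-learn sketch that prints a fitted tree as nested if-then style rules (each root-to-leaf path reads as an IF ... THEN ... rule) and its confusion matrix:

# Hedged sketch: print if-then style rules of a fitted tree and its confusion matrix.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import confusion_matrix

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(iris.data, iris.target)

# Textual view of the tree; each path is an IF ... THEN ... rule.
print(export_text(clf, feature_names=list(iris.feature_names)))

# Confusion matrix on the training set.
print(confusion_matrix(iris.target, clf.predict(iris.data)))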
→Load each dataset into Weka and perform Naïve-bayes classification and k-Nearest Neighbour
classification. Interpret the results obtained.
Aim: To determine and classify whether the credit in the dataset is good or bad, with an accuracy measure.
Description:
The Naive Bayes classifier assumes that the presence of a particular feature of a class is unrelated to the presence of any other feature. Even though these features may depend on each other, a Naive Bayes classifier considers all of these properties to contribute independently to the probability.
→ 4(c) Steps to run the Naïve Bayes and k-nearest neighbour classification algorithms in WEKA:
▪ Open WEKA Tool
▪ Click on WEKA Explorer.
▪ Click on Preprocessing tab button.
▪ Click on open file button.
▪ Choose WEKA folder in C drive.
▪ Select and Click on data option button
▪ Choose iris data set and open file.
▪ Click on classify tab and Choose Naïve-bayes algorithm
▪ select use training set test option..
▪ Click on start button.
▪ Click on classify tab
▪ Choose k-nearest neighbor
▪ select use training set test option.
▪ Click on start button.
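The same two classifiers can also be run outside Weka. A hedged scikit-learn sketch (GaussianNB standing in for Naïve Bayes and KNeighborsClassifier for k-nearest neighbour, evaluated on the training set as in the steps above):

# Hedged sketch: Naive Bayes and k-nearest neighbour on iris,
# evaluated on the training set.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=3)):
    model.fit(X, y)
    print(type(model).__name__, accuracy_score(y, model.predict(X)))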
Naïve-bayes algorithm:
→ 4(e) Compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for each dataset, deduce which classifier performs best and worst for each dataset, and justify the conclusion.
Aim: To compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for each dataset.
→Description:
Steps to run the ID3 and J48 classification algorithms in WEKA:
* Open WEKA Tool.
* Click on WEKA Explorer.
* Click on Preprocessing tab button.
* Click on open file button.
* Choose WEKA folder in C drive
→Select and Click on data option button
* Choose iris data set and open file.
* Click on classify tab and Choose J48 algorithm and select use training set test option.
* Click on start button.
* Click on classify tab and Choose ID3 algorithm and select use training set test option.
* Click on start button.
* Click on classify tab and Choose Naïve-bayes algorithm and select use training set test option.
* Click on start button.
* Click on classify tab and Choose k-nearest neighbor and select use training set test option.
* Click on start button.
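To compare the four classifiers more systematically than by reading each Weka run, a hedged scikit-learn sketch using 10-fold cross-validation (DecisionTreeClassifier stands in for the Weka-specific ID3/J48):

# Hedged sketch: compare classifiers on iris with 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
models = {
    "DecisionTree (ID3/J48 stand-in)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "NaiveBayes": GaussianNB(),
    "kNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")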
EXERCISE-5
→ 5(a) Demonstrate performing clustering on data sets (Cluster tab).
➢ AIM: To understand selecting and removing attributes, and to reload the ARFF data file to get all the attributes of the data set.
➢ Description:Selecting a Clusterer :-
By now you will be familiar with the process of selecting and configuring objects. Clicking on the
clustering scheme listed in the Clusterer box at the top of the window brings up a
GenericObjectEditor dialog with which to choose a new clustering scheme.
Steps to run the k-means clustering algorithm in WEKA:
• Click on the Cluster tab, choose SimpleKMeans, and select the 'Use training set' test option.
Output:
5(b) Study the clusters formed. Observe the sum of squared errors and the centroids, and derive insights.
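Outside Weka, the same quantities (the within-cluster sum of squared errors and the centroids) can be inspected with a hedged scikit-learn sketch on the iris data; the choice of three clusters is an assumption for illustration.

# Hedged sketch: k-means on iris; inertia_ is the sum of squared errors
# and cluster_centers_ are the centroids.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Sum of squared errors (inertia):", km.inertia_)
print("Centroids:\n", km.cluster_centers_)

# Visualize the clusters on the first two attributes.
plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker="x", c="red")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()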
→ 5(d) Explore the visualization features of Weka to visualize the clusters. Derive interesting insights and explain them.
Aim: To explore visualization features of weka to visualize the clusters.
Description:
Visualize Features: WEKA's visualization allows you to visualize a 2-D plot of the current working relation.
Visualization is very useful in practice; it helps to determine the difficulty of the learning problem.
WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations.
WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.
EXERCISE-6
→Demonstrate knowledge flow application on data sets
6(a)Aim: Develop a knowledge flow layout for finding strong association rules by using Apriori, FP
Growth algorithms
Description: The Knowledge Flow presents a data-flow inspired interface to WEKA. The user can
select WEKA components from a palette, place them on a layout canvas and connect them together in
order to form a knowledge flow for processing and analyzing data. At present, all of WEKA’s
classifiers, filters, clusterers, associators, loaders and savers are available in the Knowledge Flow
along with some extra tools.
6(b)AIM:
Set up the knowledge flow to load an ARFF (batch mode) and perform a cross validation using J48
algorithm
DESCRIPTION:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to WEKA's algorithms. The Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available; on the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer. The Knowledge Flow presents a data-flow interface to WEKA. The user can select WEKA components from a toolbar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data.
PROCEDURE:
Step-1: Open the WEKA tool and open the KnowledgeFlow application.
Step-5: Now select the ClassAssigner and connect it to the CrossValidationFoldMaker.
Step-6: Now connect the CrossValidationFoldMaker to J48 using both the trainingSet and testSet connections.
Step-7: Now, from J48, select the batchClassifier connection and attach it to the ClassifierPerformanceEvaluator.
Step-8: Now connect the ClassifierPerformanceEvaluator to a TextViewer using the text connection.
Step-10: Now right-click on the TextViewer and choose Show results.
Output:
The complete knowledge flow layout is shown above.
6(c) AIM: Demonstrate plotting multiple ROC curves in the same plot window by using the J48 and Random Forest trees.
➢ Steps for plotting multiple ROC curves:
• Open WEKA Tool.
• Click on the Classify tab, select the trees option, select J48 and run it.
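The same comparison can be drawn in one plot outside Weka. A hedged scikit-learn sketch on a binary dataset (breast cancer is used here only because ROC needs two classes, and DecisionTreeClassifier stands in for J48):

# Hedged sketch: ROC curves of a decision tree and a random forest in one plot.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (DecisionTreeClassifier(random_state=0), RandomForestClassifier(random_state=0)):
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, prob)
    plt.plot(fpr, tpr, label=f"{type(model).__name__} (AUC={auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()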
EXERCISE-7
Aim: Demonstrate the ZeroR technique on the iris dataset (using the necessary preprocessing techniques) and share your observations.
Zero R: ZeroR is the simplest classification method which relies on the target and ignores all
predictors. ZeroR classifier simply predicts the majority category (class). Although there is no
predictability power in ZeroR, it is useful for determining a baseline performance as a benchmark for
other classification methods.
Zero R on iris dataset:
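The Weka run is shown in the screenshots. Equivalently, Weka's ZeroR corresponds to a majority-class baseline; a hedged scikit-learn sketch using DummyClassifier on iris:

# Hedged sketch: ZeroR-style majority-class baseline on iris.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
zero_r = DummyClassifier(strategy="most_frequent").fit(X, y)
print("Baseline accuracy:", accuracy_score(y, zero_r.predict(X)))  # about 1/3 on iris (50 of each class)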
EXERCISE-8
Aim: Write a Java program to prepare a simulated dataset with unique instances.
A number is unique if it is a positive integer and there are no repeated digits in the number. In other words, a number is said to be unique if and only if its digits are not duplicated. For example, 20, 56, 9863, 145, etc. are unique numbers, while 33, 121, 900, 1010, etc. are not. In this section, we will create a Java program that checks whether a number is unique.
There are the following ways to check whether a number is unique or not:
• By comparing each digit manually
• Using String
• Using Array
By Comparing Each Digit Manually:
The steps to check whether a number is unique or not are:
1. Read a number from the user.
2. Find the last digit of the number.
3. Compare all digits of the number with the last digit.
4. If the digit is found more than one time, the number is not unique.
5. Else, eliminate the last digit of the number.
6. Repeat steps 2 to 5 until the number becomes zero.
UniqueNumberExample1.java
1. import java.util.Scanner;
2. public class UniqueNumberExample1
3. {
4. public static void main(String args[])
5. {
6. int r1, r2, number, num1, num2, count = 0;
7. Scanner sc = new Scanner(System.in);
8. System.out.print("Enter the number you want to check: ");
9. //reading a number from the user
10. number = sc.nextInt();
40. {
41. System.out.println("The number is not unique.");
42. }
43. }
44. }
Output 1:
Enter the number you want to check: 13895
The number is unique.
Output 2:
Enter the number you want to check: 11100
The number is not unique.
Output 3:
Enter the number you want to check: 10000
The number is not unique.
EXERCISE-9
Aim : Write a Python program to generate frequent item sets / association rules using Apriori
algorithm
Procedure :
Apriori Algorithm is a Machine Learning algorithm utilized to understand the patterns of
relationships among the various products involved. The most popular use of the algorithm is
to suggest products based on the items already in the user's shopping cart. Walmart
specifically has utilized the algorithm in recommending items to its users.
Dataset: Groceries data
Implementation of algorithm in Python:
Step 1: Import the required libraries
1. import numpy as np
2. import pandas as pd
3. from mlxtend.frequent_patterns import apriori, association_rules
Step 2: Load and explore the data
1. # Now, we will load the Data
2. data1 = pd.read_excel('Online_Retail.xlsx')
3. data1.head()
Output:
Input:
# here, we will explore the columns of the data
1. data1.columns
Output:
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
Input:
# Now, we will explore the different regions of transactions
1. data1.Country.unique()
Output:
array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
'European Community', 'Malta', 'RSA'], dtype = object)
Step 3: Clean the Data
1. # here, we will strip the extra spaces in the description
2. data1['Description'] = data1['Description'].str.strip()
3. # Now, drop the rows which does not have any invoice number
4. data1.dropna(axis = 0, subset = ['InvoiceNo'], inplace = True)
5. data1['InvoiceNo'] = data1['InvoiceNo'].astype('str')
6. # Now, we will drop all transactions which were done on credit
7. data1 = data1[~data1['InvoiceNo'].str.contains('C')]
Step 4: Split the data according to the region of transaction
1. # Transactions done in France
2. basket1_France = (data1[data1['Country'] == "France"]
3. .groupby(['InvoiceNo', 'Description'])['Quantity']
4. .sum().unstack().reset_index().fillna(0)
5. .set_index('InvoiceNo'))
6. # Transactions done in the United Kingdom
7. basket1_UK = (data1[data1['Country'] == "United Kingdom"]
8. .groupby(['InvoiceNo', 'Description'])['Quantity']
9. .sum().unstack().reset_index().fillna(0)
10. .set_index('InvoiceNo'))
11. # Transactions done in Portugal
12. basket1_Por = (data1[data1['Country'] == "Portugal"]
13. .groupby(['InvoiceNo', 'Description'])['Quantity']
14. .sum().unstack().reset_index().fillna(0)
15. .set_index('InvoiceNo'))
16.
17. basket1_Sweden = (data1[data1['Country'] == "Sweden"]
18. .groupby(['InvoiceNo', 'Description'])['Quantity']
19. .sum().unstack().reset_index().fillna(0)
20. .set_index('InvoiceNo'))
Step 5: Hot encoding the Data
# Here, we will define the hot encoding function
1. # for making the data suitable
2. # for the concerned libraries
3. def hot_encode1(P):
4. if(P<= 0):
5. return 0
6. if(P>= 1):
7. return 1
8. # Here, we will encode the datasets
9. basket1_encoded = basket1_France.applymap(hot_encode1)
10. basket1_France = basket1_encoded
11.
12. basket1_encoded = basket1_UK.applymap(hot_encode1)
13. basket1_UK = basket1_encoded
14.
15. basket1_encoded = basket1_Por.applymap(hot_encode1)
16. basket1_Por = basket1_encoded
17. basket1_encoded = basket1_Sweden.applymap(hot_encode1)
18. basket1_Sweden = basket1_encoded
Step 6: Build the models and analyse the results
a) France:
1. # Build the model
2. frq_items1 = apriori(basket1_France, min_support = 0.05, use_colnames = True)
3.
4. # Collect the inferred rules in a dataframe
5. rules1 = association_rules(frq_items1, metric = "lift", min_threshold = 1)
6. rules1 = rules1.sort_values(['confidence', 'lift'], ascending = [False, False])
7. print(rules1.head())
Output:
antecedents \
45 (JUMBO BAG WOODLAND ANIMALS)
260 (PLASTERS IN TIN CIRCUS PARADE, RED TOADSTOOL ...
272 (RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI...
302 (SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...
301 (SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...
consequents antecedent support consequent support \
45 (POSTAGE) 0.076531 0.765306
260 (POSTAGE) 0.051020 0.765306
272 (POSTAGE) 0.053571 0.765306
EXERCISE-10
Aim: Write a program to calculate chi-square value using Python. Report your observation.
The Pearson's chi-square statistical hypothesis test is a test for independence between categorical variables. In this exercise, we will perform the test using a mathematical approach and then using Python's SciPy module.
First, let us see the mathematical approach :
The Contingency Table:
A Contingency table (also called crosstab) is used in statistics to summarise the relationship between
several categorical variables. Here, we take a table that shows the number of men and women buying
different types of pets.
          dog     cat     bird    total
men       207     282     241     730
women     234     242     232     708
total     441     524     473     1438
The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related to
each other.
Null hypothesis:
We start by defining the null hypothesis (H0) which states that there is no relation between
the variables. An alternate hypothesis would state that there is a significant relation between
the two.
We can verify the hypothesis by these methods:
Using p-value:
We define a significance factor to determine whether the relation between the variables is of
considerable significance.
Generally a significance factor or alpha value of 0.05 is chosen.
This alpha value denotes the probability of erroneously rejecting H0 when it is true.
A lower alpha value is chosen in cases where we expect more precision. If the p-value for the test
comes out to be strictly greater than the alpha value, then H0 holds true.
Using the chi-square value: If our calculated value of chi-square is less than or equal to the tabular (also called critical) value of chi-square, then H0 holds true.
Expected Values Table: Next, we prepare a similar table of calculated (or expected) values. To do this, we calculate each item in the new table as:
expected value = (row total × column total) / grand total
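The SciPy part mentioned above can be sketched as follows (assuming SciPy is installed), using the same contingency table:

# Hedged sketch: chi-square test of independence on the pets contingency table.
from scipy.stats import chi2_contingency

observed = [[207, 282, 241],    # men:   dog, cat, bird
            [234, 242, 232]]    # women: dog, cat, bird

chi2, p, dof, expected = chi2_contingency(observed)
print("chi-square =", chi2)
print("p-value    =", p)
print("degrees of freedom =", dof)
print("expected frequencies:\n", expected)
# If p > 0.05 we fail to reject H0, i.e. no significant relation between gender and choice of pet.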
EXERCISE-11
Aim: Write a program for Naïve Bayes classification using Python.
Introduction to Naive Bayes :
Naive Bayes is among one of the very simple and powerful algorithms for classification
based on Bayes Theorem with an assumption of independence among the predictors. The
Naive Bayes classifier assumes that the presence of a feature in a class is not related to any other
feature.
Naive Bayes is a classification algorithm for binary and multi-class
classification problems.
Bayes Theorem :
Based on prior knowledge of conditions that may be related to an event, Bayes theorem
describes the probability of the event
•conditional probability can be found this way
•Assume we have a Hypothesis(H) and evidence(E),
According to Bayes theorem, the relationship between the probability of Hypothesis
before getting the evidence represented as P(H) and the probability of the hypothesis
after getting the evidence represented as P(H|E) is:
P(H|E) = P(E|H)*P(H)/P(E)
Prior probability = P(H) is the probability before getting the evidence
•Posterior probability = P(H|E) is the probability after getting evidence In general,
•P(class|data) = (P(data|class) * P(class)) / P(data)
Bayes Theorem Example
Assume we have to find the probability of the randomly picked card to be king given that it is
a face card.
There are 4 Kings in a Deck of Cards which implies that P(King) = 4/52
as all the Kings are face Cards so P(Face|King) = 1
there are 3 Face Cards in a Suit of 13 cards and there are 4 Suits in total so P(Face) = 12/52
Therefore,
P(King|face) = P(face|king)*P(king)/P(face) = 1/3
Source Code : Implementing Naive Bayes algorithm from scratch using Python
# Importing library
import math
import random
import csv
# the categorical class names are changed to numeric data
# eg: yes and no encoded to 1 and 0
def encode_class(mydata):
classes = []
for i in range(len(mydata)):
if mydata[i][-1] not in classes:
classes.append(mydata[i][-1])
for i in range(len(classes)):
for j in range(len(mydata)):
if mydata[j][-1] == classes[i]:
mydata[j][-1] = i
return mydata
# Splitting the data
def splitting(mydata, ratio):
train_num = int(len(mydata) * ratio)
train = []
# initially testset will have all the dataset
test = list(mydata)
while len(train) < train_num:
# index generated randomly from range 0
# to length of testset
index = random.randrange(len(test))
# from testset, pop data rows and put it in train
train.append(test.pop(index))
return train, test
info[classValue] = MeanAndStdDev(instances)
return info
# Calculate Gaussian Probability Density Function
def calculateGaussianProbability(x, mean, stdev):
expo = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
return (1 / (math.sqrt(2 * math.pi) * stdev)) * expo
# Calculate Class Probabilities
def calculateClassProbabilities(info, test):
probabilities = {}
for classValue, classSummaries in info.items():
probabilities[classValue] = 1
for i in range(len(classSummaries)):
mean, std_dev = classSummaries[i]
x = test[i]
probabilities[classValue] *= calculateGaussianProbability(x,
mean, std_dev)
return probabilities
# Make prediction - highest probability is the prediction
def predict(info, test):
probabilities = calculateClassProbabilities(info, test)
bestLabel, bestProb = None, -1
for classValue, probability in probabilities.items():
if bestLabel is None or probability > bestProb:
bestProb = probability
bestLabel = classValue
return bestLabel
# returns predictions for a set of examples
def getPredictions(info, test):
predictions = []
for i in range(len(test)):
correct += 1
return (correct / float(len(test))) * 100.0
# driver code
# add the data path in your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive
bayes\filedata.csv'
# load the file and store it in mydata list
mydata = csv.reader(open(filename, "rt"))
mydata = list(mydata)
mydata = encode_class(mydata)
for i in range(len(mydata)):
mydata[i] = [float(x) for x in mydata[i]]
# split ratio = 0.7
# 70% of data is training data and 30% is test data used for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))
# prepare model
info = MeanAndStdDevForClass(train_data)
# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
Output: Total number of examples are: 200
Out of these, training examples are: 140
Test examples are: 60
Accuracy of your model is: 71.2376788
EXERCISE 12
Aim: Implement a Java program to perform Apriori algorithm
import java.io.*;
import java.util.*;
/** The class encapsulates an implementation of the Apriori algorithm
to compute frequent itemsets.
* Datasets contains integers (>=0) separated by spaces, one transaction by line, e.g.
* 1 2 3
* 0 9
* 1 9
* Usage with the command line :
* $ java mining.Apriori fileName support
* $ java mining.Apriori /tmp/data.dat 0.8
* $ java mining.Apriori /tmp/data.dat 0.8 > frequent-itemsets.txt
*
* For a full library, see SPMF https://www.philippe-fournier-viger.com/spmf/
* @author Martin Monperrus, University of Darmstadt, 2010
* @author Nathan Magnus and Su Yibin, under the supervision of Howard Hamilton,
University of Regina, June 2009.
* @copyright GNU General Public License v3
* No reproduction in whole or part without maintaining this copyright notice
* and imposing this condition on any subsequent users.
*/
public class Apriori extends Observable {
public static void main(String[] args) throws Exception {
Apriori ap = new Apriori(args);
}
/** the list of current itemsets */
private List<int[]> itemsets ;
/** the name of the transaction file */
numTransactions))+" "+support+")");}
}
/** outputs a message in Sys.err if not used as library */
private void log(String message) {
if (!usedAsLibrary) {
System.err.println(message);
}
}
/** computes numItems, numTransactions, and sets minSup */
private void configure(String[] args) throws Exception
{
// setting transafile
if (args.length!=0) transaFile = args[0];
else transaFile = "chess.dat"; // default
// setting minsupport
if (args.length>=2) minSup=(Double.valueOf(args[1]).doubleValue());
else minSup = .8;// by default
if (minSup>1 || minSup<0) throw new Exception("minSup: bad value");
// going through the file to compute numItems and numTransactions
numItems = 0;
numTransactions=0;
BufferedReader data_in = new BufferedReader(new FileReader(transaFile));
while (data_in.ready()) {
String line=data_in.readLine();
if (line.matches("\\s*")) continue; // be friendly with empty lines
numTransactions++;
StringTokenizer t = new StringTokenizer(line," ");
while (t.hasMoreTokens()) {
int x = Integer.parseInt(t.nextToken());
//log(x);
if (x+1>numItems) numItems=x+1;
}
}
outputConfig();
}
/** outputs the current configuration
*/
private void outputConfig() {
//output config info to the user
log("Input configuration: "+numItems+" items, "+numTransactions+" transactions, ");
log("minsup = "+minSup*100+"%");
}
/** puts in itemsets all sets of size 1,
* i.e. all possibles items of the datasets
*/
private void createItemsetsOfSize1() {
itemsets = new ArrayList<int[]>();
for(int i=0; i<numItems; i++)
{
int[] cand = {i};
itemsets.add(cand);
}
}
/**
* if m is the size of the current itemsets,
* generate all possible itemsets of size n+1 from pairs of current itemsets
* replaces the itemsets of itemsets by the new ones
*/
private void createNewItemsetsFromPreviousOnes()
{
}
if (!found){ // Y[s1] is not in X
ndifferent++;
// we put the missing value at the end of newCand
newCand[newCand.length -1] = Y[s1];
}
}
// we have to find at least 1 different, otherwise it means that we have two times the same set
in the existing candidates
assert(ndifferent>0);
if (ndifferent==1) {
// HashMap does not have the correct "equals" for int[] :-(
// so I have to create the hash myself using a String :-(
// I use Arrays.toString to reuse equals and hashcode of String
Arrays.sort(newCand);
tempCandidates.put(Arrays.toString(newCand),newCand);
}
}
}
//set the new itemsets
itemsets = new ArrayList<int[]>(tempCandidates.values());
log("Created "+itemsets.size()+" unique itemsets of size "+(currentSizeOfItemsets+1));
}
/** put "true" in trans[i] if the integer i is in line */
private void line2booleanArray(String line, boolean[] trans) {
Arrays.fill(trans, false);
StringTokenizer stFile = new StringTokenizer(line, " "); //read a line from the file to the
tokenizer
//put the contents of that line into the transaction array
while (stFile.hasMoreTokens())
{
int parsedVal = Integer.parseInt(stFile.nextToken());
trans[parsedVal]=true; //if it is not a 0, assign the value to true
}
}
/** passes through the data to measure the frequency of sets in {@link itemsets},
* then filters those that are under the minimum support (minSup)
*/
private void calculateFrequentItemsets() throws Exception
{
log("Passing through the data to compute the frequency of " + itemsets.size()+ " itemsets of
size "+itemsets.get(0).length);
List<int[]> frequentCandidates = new ArrayList<int[]>(); //the frequent candidates for the
current itemset
boolean match; //whether the transaction has all the items in an itemset
int count[] = new int[itemsets.size()]; //the number of successful matches, initialized by zeros
// load the transaction file
BufferedReader data_in = new BufferedReader(new InputStreamReader(new
FileInputStream(transaFile)));
boolean[] trans = new boolean[numItems];
// for each transaction
for (int i = 0; i < numTransactions; i++) {
// boolean[] trans = extractEncoding1(data_in.readLine());
String line = data_in.readLine();
line2booleanArray(line, trans);
// check each candidate
for (int c = 0; c < itemsets.size(); c++) {
match = true; // start by assuming the transaction matches this candidate
// tokenize the candidate so that we know what items need to be
// present for a match
=+=
FINAL LIST =+=
[2, 3, 5] : 2
Exercise-13
Aim: Write a program to cluster a dataset of your choice using the simple k-means algorithm in Java (JDK).
/*
Simple K means creating 2 partitions with 2-dimensional dataset in
JAVA
By Ngangbam Indrason, 23 Feb 2019
*/
import java.util.*;
class KmeansJ {
int i,j,k=2;
int part1[][] = new int[10][2];
int part2[][] = new int[10][2];
float mean1[][] = new float[1][2];
// Loop till the new mean and previous mean are same
while(!Arrays.deepEquals(mean1, temp1) ||
!Arrays.deepEquals(mean2, temp2)) {
part1[i][1] = 0;
part2[i][0] = 0;
part2[i][1] = 0;
}
i1 = 0; i2 = 0;
i1++;
}
else {
part2[i2][0] = dataset[i][0];
part2[i2][1] = dataset[i][1];
i2++;
}
}
temp1[0][0] = mean1[0][0];
temp1[0][1] = mean1[0][1];
temp2[0][0] = mean2[0][0];
temp2[0][1] = mean2[0][1];
itr++;
}
System.out.println(part2[i][0]+" "+part2[i][1]);
}
System.out.println("\nFinal Mean: ");
System.out.println("Mean1 : "+mean1[0][0]+" "+mean1[0][1]);
System.out.println("Mean2 : "+mean2[0][0]+" "+mean2[0][1]);
System.out.println("\nTotal Iteration: "+itr);
}
}
Here is the output of the above program.
EXERCISE-14
Aim: Demonstrate k-means clustering and the elbow method on a small two-dimensional dataset using scikit-learn and matplotlib.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample two-dimensional points (illustrative values; the original data
# points were not reproduced in the manual).
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = list(zip(x, y))

# Elbow method: fit k-means for k = 1..10 and record the inertia.
inertias = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

OUTPUT: the elbow plot of inertia against the number of clusters.

# Final model with the number of clusters chosen from the elbow plot.
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()

OUTPUT: scatter plot of the points coloured by cluster label.
EXERCISE-15
Aim: Write a program to compute/display dissimilarity matrix (for your own dataset
containing at least four instances with two attributes) using Python
#!/usr/bin/env python
from math import*
from decimal import Decimal
class Similarity():
""" Five similarity measures function """
def euclidean_distance(self,x,y):
""" return euclidean distance between two lists """
return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))
def manhattan_distance(self,x,y):
""" return manhattan distance between two lists """
return sum(abs(a-b) for a,b in zip(x,y))
def minkowski_distance(self,x,y,p_value):
""" return minkowski distance between two lists """
return self.nth_root(sum(pow(abs(a-b),p_value) for a,b in zip(x, y)),p_value)
def nth_root(self,value, n_root):
""" returns the n_root of an value """
root_value = 1/float(n_root)
return round (Decimal(value) ** Decimal(root_value),3)
def cosine_similarity(self,x,y):
""" return cosine similarity between two lists """
numerator = sum(a*b for a,b in zip(x,y))
denominator = self.square_rooted(x)*self.square_rooted(y)
return round(numerator/float(denominator),3)
def square_rooted(self,x):
""" return 3 rounded square rooted value """
return round(sqrt(sum([a*a for a in x])),3)
def jaccard_similarity(self,x,y):
""" returns the jaccard similarity between two lists """
intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
union_cardinality = len(set.union(*[set(x), set(y)]))
return intersection_cardinality/float(union_cardinality)
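The class above only defines the measures. To actually display a dissimilarity matrix for a small dataset (four instances with two attributes, as the aim requires), a hedged usage sketch of the Similarity class defined above, with illustrative data points:

# Hedged usage sketch: Euclidean dissimilarity matrix for four 2-attribute instances.
data = [(1, 2), (3, 5), (2, 0), (4, 5)]   # illustrative dataset
sim = Similarity()
for a in data:
    # one row of the dissimilarity matrix: distances from a to every instance
    print([round(sim.euclidean_distance(a, b), 3) for b in data])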
output:
EXERCISE-16
Aim: Visualize datasets using matplotlib in Python (histogram, box plot, bar chart, pie chart, etc.).
ANS)
Data visualization is the presentation of data in an accessible manner through visual tools like graphs or charts. These visualizations aid the process of communicating insights and relationships within the data, and are an essential part of data analysis. Here we treat Matplotlib, which is the most popular data visualization library within the Python programming language.
Contents
1. Preliminaries
2. Scatter plots
3. Bar charts
4. Histograms
5. Boxplots
1. Preliminaries
Matplotlib is a very well documented package. To make the plotting easier, we make use of the pyplot module, which makes Matplotlib work like MATLAB; essentially all of its functionality is described in the official pyplot documentation. The point of this section is to state its main and most important functions and give examples of how to use pyplot, as the documentation can sometimes be hard to navigate.
To call the package module, we begin our code with import matplotlib.pyplot as plt.
Below, we state some of the most important functions when using pyplot:
plt.title: Set a title, which appears above the plot.
plt.grid: Configure the grid lines in the figure. To enable grid lines in the plot, use
plt.grid(True).
plt.legend: Place a legend in the figure.
plt.xlabel and plt.ylabel: Set labels for the axes. For example, plt.xlabel(“Age”)
sets “Age” as the label for the x-axis.
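As a quick illustration of the functions listed above, here is a minimal sketch; the data values are made up purely for demonstration:

import matplotlib.pyplot as plt

# Hypothetical data, used only to demonstrate the functions above
age = [20, 25, 30, 35, 40, 45]
expenses = [800, 950, 1200, 1400, 1350, 1500]

plt.plot(age, expenses, marker='o', label='Monthly expenses')
plt.title("Basic pyplot functions")   # title shown above the plot
plt.grid(True)                        # enable grid lines
plt.xlabel("Age")                     # x-axis label
plt.ylabel("Expenses (dollars)")      # y-axis label
plt.legend()                          # place a legend using the label above
plt.show()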
2. Scatter Plots
Now that we have seen the basic pyplot functions, we will start with our first type of plot.
A scatter plot displays the values of, typically, two variables for a set of data. Such plots
can be very informative when exploring relationships between pairs of variables.
Consider the following Salary dataset (from Kaggle), which contains 30 observations
consisting of years of working experience and the annual wage (in dollars). To create a
scatter plot, we make use of the plt.scatter function. Then, we can plot these data points
as follows:
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("Salary_data.csv")  # load dataset
X = data["YearsExperience"]
Y = data["Salary"]

plt.scatter(X, Y)
plt.title("Scatter Plot")
plt.xlabel("Working Experience (years)")
plt.ylabel("Annual Wage (dollars)")
plt.show()
We are also able to, for example, distinguish between observations that have more than 5
years of working experience and observations that have less than 5 years of working
experience by using different colors. To do this, we create two scatter plots by using the
relevant data splits and display them in one single plot. The following code results in the
desired plot:
X_1 = X[X > 5]
X_2 = X[X <= 5]
Y_1 = Y[X > 5]
Y_2 = Y[X <= 5]

plt.scatter(X_1, Y_1, label='Years of experience > 5')
plt.scatter(X_2, Y_2, label='Years of experience <= 5')
plt.title("Scatter Plot (split)")
plt.legend()
plt.xlabel("Working Experience (years)")
plt.ylabel("Annual Wage (dollars)")
plt.show()
3. Bar Charts
A bar chart graphically displays categorical data with rectangular bars of different heights,
where the heights or lengths of the bars represent the values of the corresponding measure.
Let us once again consider the Iris dataset, where observations belong to either one of three
iris flower classes. Assume we want to visualize the average value for each feature of the
Setosa iris class. We can do this by using a bar chart, requiring the plt.bar function. The
following code results in the desired bar chart figure:
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_iris = iris.data
Y_iris = iris.target

average = X_iris[Y_iris == 0].mean(axis=0)
plt.bar(iris.feature_names, average)
plt.title("Bar Chart Setosa Averages")
plt.ylabel("Average (in cm)")
plt.show()
Furthermore, we are also able to nicely display the feature averages for all three iris flowers,
by placing the bars next to each other. This takes a bit more effort than the standard bar chart.
By using the following code, we obtain the desired plot:
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
X_iris = iris.data
Y_iris = iris.target
n_classes = 3

averages = [X_iris[Y_iris == i].mean(axis=0) for i in range(n_classes)]
x = np.arange(len(iris.feature_names))
fig = plt.figure()
ax = fig.add_subplot()
bar1 = ax.bar(x - 0.25, averages[0], 0.25, label=iris.target_names[0])
bar2 = ax.bar(x, averages[1], 0.25, label=iris.target_names[1])
bar3 = ax.bar(x + 0.25, averages[2], 0.25, label=iris.target_names[2])
ax.set_xticks(x)
ax.set_xticklabels(iris.feature_names)
plt.legend()
plt.title("Bar Chart Iris Averages")
plt.ylabel("Average")
plt.show()
4. Histograms
A histogram is used to give an approximate representation of the distribution of the data,
based on the sample data at hand. A histogram is constructed by using equally sized ‘bins’
(intervals), and counting the number of data points that belong to each bin. Creating
histograms at the start of a new project is very useful to get familiar with the data, and to get
a rough sense of the density of the underlying distribution. To create a histogram, we make
use of the plt.hist function.
To create a basic histogram on the sepal length of all iris flowers, using 20 equal-length bins,
we use the following code:
from sklearn import datasets
import matplotlib.pyplot as plt

bins = 20
iris = datasets.load_iris()
X_iris = iris.data
X_sepal = X_iris[:, 0]
plt.hist(X_sepal, bins)
plt.title("Histogram Sepal Length")
plt.xlabel(iris.feature_names[0])
plt.ylabel("Frequency")
plt.show()
Instead of plotting the histogram for a single feature, we can plot the histograms for all
features. This can be done by creating separate plots, but here, we will make use of subplots,
so that all histograms are shown in one single plot. For this, we make use of the
plt.subplots function. By using the following code, we obtain the plot containing the four
histograms:
from sklearn import datasets
import matplotlib.pyplot as plt

bins = 20
iris = datasets.load_iris()
X_iris = iris.data

fig, axs = plt.subplots(2, 2)
axs[0, 0].hist(X_iris[:, 0])
axs[0, 1].hist(X_iris[:, 1], color='orange')
axs[1, 0].hist(X_iris[:, 2], color='green')
axs[1, 1].hist(X_iris[:, 3], color='red')
# Label each subplot with the corresponding feature name
for i, ax in enumerate(axs.flat):
    ax.set(xlabel=iris.feature_names[i], ylabel='Frequency')
plt.show()
5. Boxplots
A boxplot is a convenient way of graphically depicting groups of numerical data through their
summary statistics. The interpretation of a boxplot is illustrated in the figure below: the
median is the middle value of the dataset (not to be confused with the mean); the 25th
percentile is the median of the lower half of the dataset, and the 75th percentile is the
median of the upper half. Data points that fall outside the whiskers are plotted individually
as outliers (dots).
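To make these statistics concrete, here is a small sketch (using NumPy, which the original examples do not use for this purpose) that computes the quartiles and the whisker limits, assuming Matplotlib's default whisker rule of 1.5 times the interquartile range:

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
sepal_length = iris.data[:, 0]

q1, median, q3 = np.percentile(sepal_length, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# With Matplotlib's default whis=1.5, points outside these limits are
# drawn as individual outlier dots.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Whisker limits:", lower_limit, upper_limit)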
Once again, consider the Iris flower dataset. A boxplot is created by using the plt.boxplot
function. We will make a boxplot for the sepal length of all iris flowers:
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_iris = iris.data
X_sepal = X_iris[:, 0]
plt.boxplot(X_sepal, labels=[iris.feature_names[0]])
plt.title("Boxplot Sepal Length")
plt.ylabel("cm")
plt.show()
Since all features are measured in the same unit (namely, cm), we can plot the boxplots for
all features next to each other in one single plot:
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_iris = iris.data

plt.boxplot(X_iris, labels=[iris.feature_names[0], iris.feature_names[1],
                            iris.feature_names[2], iris.feature_names[3]])
plt.title("Boxplots Iris features")
plt.ylabel("cm")
plt.show()
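The aim also mentions pie charts, which the sections above do not cover. A minimal sketch using plt.pie on the class counts of the Iris dataset (an illustrative choice of data, not taken from the examples above) could look like this:

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()

# Count how many observations belong to each of the three iris classes
labels, counts = np.unique(iris.target, return_counts=True)

plt.pie(counts, labels=iris.target_names, autopct='%1.1f%%')
plt.title("Pie Chart of Iris Class Counts")
plt.show()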