
AVANTHI’S RESEARCH AND TECHNOLOGICAL ACADEMY

DWDM LAB

EXERCISE-1
Creation of a Data Warehouse:

➢ Build a Data Warehouse/Data Mart (using open source tools like the Pentaho Data Integration Tool, Pentaho Business Analytics; or other data warehouse tools like Microsoft SSIS, Informatica, Business Objects, etc.)
➢ Design multi-dimensional data models, namely Star, Snowflake and Fact Constellation schemas, for any one enterprise (e.g. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobiles, Sales, etc.).
➢ Write ETL scripts and implement using data warehouse tools.
➢ Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.

(i). Identify source tables and populate sample data

In this task, we are going to use the MySQL Administrator and SQLyog Enterprise tools for building and identifying tables in a database, and for populating (filling) those tables with sample data. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. We build a data warehouse by integrating all the tables in the database and analyzing their data. The figure below shows the MySQL Administrator connection establishment.

After a successful login, a new window opens as shown below.


There are different options available in MySQL Administrator. We use another tool, SQLyog Enterprise, for building and identifying tables in a database after a successful connection is established through MySQL Administrator. Below we can see the SQLyog Enterprise window.

In the left-side navigation we can see the different databases and their related tables. Now we are going to build tables and populate them with data through SQL queries. These tables in the database can then be used further for building the data warehouse.


In the above two windows, we created a database named “sample” and, in that database, two tables named “user_details” and “hockey” through SQL queries.
Now, we are going to populate (fill) those two tables with sample data through SQL queries, as represented in the windows below.
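The table creation and population can also be scripted. Below is a minimal sketch using Python with the mysql-connector-python package; the connection credentials and the column layout of user_details are illustrative assumptions, not the exact columns used in the screenshots above.

import mysql.connector  # assumes the mysql-connector-python package is installed

# Connection parameters are placeholders; adjust them to the local MySQL server.
conn = mysql.connector.connect(host="localhost", user="root", password="your_password")
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS sample")
cur.execute("USE sample")

# Hypothetical columns; the lab's actual user_details/hockey columns may differ.
cur.execute("""CREATE TABLE IF NOT EXISTS user_details (
                   user_id INT PRIMARY KEY,
                   user_name VARCHAR(50),
                   city VARCHAR(50))""")

# Populate the table with a few sample rows.
rows = [(1, "Ravi", "Hyderabad"), (2, "Priya", "Vizag")]
cur.executemany("INSERT INTO user_details VALUES (%s, %s, %s)", rows)

conn.commit()
cur.close()
conn.close()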


Through MySQL Administrator and SQLyog, we can import databases from other sources (.xls, .csv, .sql) and also export our databases as backups for further processing. We can connect MySQL to other applications for data analysis and reporting.


(ii). Design multi-dimensional data models, namely Star, Snowflake and Fact Constellation schemas, for any one enterprise (e.g. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobile, etc.).

The multi-dimensional model was developed for implementing data warehouses, and it provides both a mechanism to store data and a way to perform business analysis. The primary components of a dimensional model are dimensions and facts. There are different types of multi-dimensional data models. They are:
1. Star Schema Model
2. Snowflake Schema Model
3. Fact Constellation Model

Now, we are going to design these multi-dimensional models for a marketing enterprise. First, we need to build the tables in a database through SQLyog as shown below.

In the above window, the left-side navigation bar shows a database named “sales_dw” in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.

After creating the tables in the database, we are going to use a tool called “Microsoft Visual Studio 2012 for Business Intelligence” for building the multi-dimensional models.


The above window shows Microsoft Visual Studio before creating a project; the right-side navigation bar contains different options like Data Sources, Data Source Views, Cubes, Dimensions, etc.

Through Data Sources, we can connect to our MySQL database named “sales_dw”. Then all the tables in that database are automatically retrieved into this tool for creating multi-dimensional models.

Through data source views and cubes, we can see the retrieved tables as multi-dimensional models. We also need to add dimensions through the Dimensions option. In general, multi-dimensional models consist of dimension tables and fact tables.

Star Schema Model:

A star schema model is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. It is the simplest style of data warehouse schema.

The entity-relationship diagram of this schema resembles a star, with points radiating from a central table, as seen in the window implemented in Visual Studio below.
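To make the primary key to foreign key joins concrete, here is a small self-contained sketch using Python's built-in sqlite3 module as a stand-in for MySQL; the column names are simplified, assumed versions of the sales_dw tables, not the lab's exact schema.

import sqlite3

conn = sqlite3.connect(":memory:")  # throw-away in-memory database
cur = conn.cursor()

# Simplified dimension and fact tables (illustrative columns).
cur.execute("CREATE TABLE dimproduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT)")
cur.execute("CREATE TABLE dimstores (StoreKey INTEGER PRIMARY KEY, StoreName TEXT)")
cur.execute("""CREATE TABLE factproductsales (
                   SalesKey INTEGER PRIMARY KEY,
                   ProductKey INTEGER REFERENCES dimproduct(ProductKey),
                   StoreKey INTEGER REFERENCES dimstores(StoreKey),
                   SalesAmount REAL)""")

cur.executemany("INSERT INTO dimproduct VALUES (?, ?)", [(1, "Pen"), (2, "Notebook")])
cur.executemany("INSERT INTO dimstores VALUES (?, ?)", [(10, "Store A"), (20, "Store B")])
cur.executemany("INSERT INTO factproductsales VALUES (?, ?, ?, ?)",
                [(100, 1, 10, 25.0), (101, 2, 10, 40.0), (102, 1, 20, 15.0)])

# Star-schema query: the fact table joins to each dimension table,
# but the dimension tables never join to each other.
cur.execute("""SELECT p.ProductName, s.StoreName, SUM(f.SalesAmount)
               FROM factproductsales f
               JOIN dimproduct p ON f.ProductKey = p.ProductKey
               JOIN dimstores  s ON f.StoreKey  = s.StoreKey
               GROUP BY p.ProductName, s.StoreName""")
print(cur.fetchall())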


Snow Flake Schema:

It is slightly different from the star schema in that the dimension tables of a star schema are organized into a hierarchy by normalizing them.
A snowflake schema is represented by a centralized fact table which is connected to multiple dimension tables. Snowflaking affects only dimension tables, not fact tables. We developed a snowflake schema for the sales_dw database with the Visual Studio tool, as shown below.


Fact Constellation Schema:


A fact constellation is a set of fact tables that share some dimension tables. In this schema there are two or more fact tables. We developed a fact constellation in Visual Studio as shown below; the fact tables are labelled in yellow.


2. Write ETL scripts and implement using data warehouse tools

ETL (Extract-Transform-Load):

ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system to the data warehouse. Currently, ETL also encompasses a cleaning step as a separate step, so the sequence is Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.
Process
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a record has
been changed and describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes. A full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts; the data volumes can be in the tens of gigabytes.
Clean:
The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules, such as the following (a small sketch illustrating these rules follows the list):
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to a standard Male/Female/Unknown)
• Convert null values into standardized Not Available/Not Provided value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State, City/ZIP code,
City/Street).
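As a minimal illustration of the unification rules listed above, here is a Python sketch; the mapping table, field names and regular expressions are made-up assumptions for the example, not rules from any specific tool.

import re

# Map the various source encodings of sex to one standard set.
SEX_MAP = {"M": "Male", "Man": "Male", "F": "Female", "Woman": "Female",
           "Not Available": "Unknown", None: "Unknown"}

def clean_record(record):
    cleaned = dict(record)
    # 1. Make identifiers unique / standardized.
    cleaned["sex"] = SEX_MAP.get(record.get("sex"), record.get("sex", "Unknown"))
    # 2. Convert nulls and empty strings into a standardized value.
    for key, value in cleaned.items():
        if value is None or value == "":
            cleaned[key] = "Not Available"
    # 3. Standardize phone numbers: keep digits only.
    if cleaned.get("phone") not in (None, "Not Available"):
        cleaned["phone"] = re.sub(r"\D", "", cleaned["phone"])
    # 4. Standardize street abbreviations at the end of the field.
    if cleaned.get("street") not in (None, "Not Available"):
        cleaned["street"] = re.sub(r"\b(St\.?|Str\.?)$", "Street", cleaned["street"])
    return cleaned

print(clean_record({"sex": "M", "phone": "+91 (040) 123-456", "street": "MG Road St.", "zip": ""}))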
Transform:
The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. a conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, and deriving new calculated values.

Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them again only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
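Putting the steps together, the sketch below runs a toy Extract-Clean-Transform-Load sequence in Python, staging the extract as a CSV file and loading into SQLite; the file name, columns and values are assumptions made up for the illustration.

import csv
import os
import sqlite3
import tempfile

# Extract: pretend these rows came from a source system.
source_rows = [{"name": "Ravi", "sex": "M", "amount": "1200"},
               {"name": "Priya", "sex": "F", "amount": "800"}]

# Stage the extract to a flat file (the "staging area").
staging_file = os.path.join(tempfile.gettempdir(), "stage_customers.csv")
with open(staging_file, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "sex", "amount"])
    writer.writeheader()
    writer.writerows(source_rows)

# Clean + Transform: read the staged data, unify codes, derive a new value.
cleaned = []
with open(staging_file, newline="") as f:
    for row in csv.DictReader(f):
        row["sex"] = {"M": "Male", "F": "Female"}.get(row["sex"], "Unknown")
        row["amount"] = float(row["amount"])                 # type/unit conversion
        row["amount_in_thousands"] = row["amount"] / 1000.0  # derived value
        cleaned.append(row)

# Load: write the transformed rows into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, sex TEXT, amount REAL, amount_k REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)",
                 [(r["name"], r["sex"], r["amount"], r["amount_in_thousands"]) for r in cleaned])
conn.commit()
print(conn.execute("SELECT * FROM customers").fetchall())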
Managing ETL Process:

The ETL process seems quite straightforward. As with every application, there is a possibility that
the ETL process fails. This can be caused by missing extracts from one of the systems, missing
values in one of the reference tables, or simply a connection or power outage. Therefore, it is
necessary to design the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. However, the staging area should be accessed by the ETL process only; it should never be available to anyone else, particularly not to end users, as it may contain incomplete or in-the-middle-of-processing data and is not intended for data presentation to the end user.

ETL Tool Implementation:

When you are about to use an ETL tool, there is a fundamental decision to be made: will the
company build its own data transformation tool or will it use an existing tool?
Building your own data transformation tool (usually a set of shell scripts) is the preferred approach for a small number of data sources which reside in storage of the same type. The reason is that the effort to implement the necessary transformations is small, thanks to the similar data structures and common system architecture. Also, this approach saves licensing costs and there is no need to train the staff in a new tool. This approach, however, is dangerous from the TCO (total cost of ownership) point of view. If the transformations become more sophisticated over time, or there is a need to integrate other systems, the complexity of such an ETL system grows but its manageability drops significantly. Similarly, implementing your own tool often amounts to re-inventing the wheel.
There are many ready-to-use ETL tools on the market. The main benefit of using off-the-shelf ETL
tools is the fact that they are optimized for the ETL process by providing connectors to common
data sources like databases, flat files, mainframe systems, xml, etc. They provide a means to
implement data transformations easily and consistently across various data sources. This includes
filtering, reformatting, sorting, joining, merging, aggregation and other operations ready to use.
The tools also support transformation scheduling, version control, monitoring and unified metadata
management. Some of the ETL tools are even integrated with BI tools.
Some of the Well Known ETL Tools:
The most well-known commercial tools are Ab Initio, IBM InfoSphere DataStage,
Informatica, Oracle Data Integrator, and SAP Data Integrator.

There are also several open source ETL tools, such as OpenRefine, Apatar, CloverETL, Pentaho and Talend.

Of the above tools, we are going to use the OpenRefine 2.8 ETL tool on different sample datasets for extracting, cleaning, transforming and loading data.


Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.

OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations:
• Roll-up (Drill-up)
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up (Drill-up):

Roll-up performs aggregation on a data cube in either of the following ways:

• By climbing up a concept hierarchy for a dimension
• By dimension reduction

For example:
• Roll-up is performed by climbing up the concept hierarchy for the dimension location.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down:

Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension

For example:
• Drill-down is performed by stepping down the concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions are added to the data cube.
• It navigates the data from less detailed data to more highly detailed data.

Slice:

The slice operation selects one particular dimension from a given cube and provides a new sub-cube.

Dice:

Dice selects two or more dimensions from a given cube and provides a new sub-cube.


Pivot (rotate):

The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data.
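Before moving to Excel, the five operations can also be sketched with pandas on a tiny made-up sales cube; the column names and numbers below are purely illustrative assumptions.

import pandas as pd

# A tiny "cube": dimensions = category, year, month, region; measure = sales.
cube = pd.DataFrame({
    "category": ["Electronic", "Electronic", "Classical", "Classical"],
    "year":     [2008, 2008, 2009, 2009],
    "month":    ["Jan", "Feb", "Jan", "Feb"],
    "region":   ["APAC", "EMEA", "APAC", "EMEA"],
    "sales":    [100, 150, 80, 120],
})

# Roll-up: climb the time hierarchy (month -> year), dropping the month dimension.
rollup = cube.groupby(["category", "year"])["sales"].sum()

# Drill-down: step back down to the month level (more detail).
drilldown = cube.groupby(["category", "year", "month"])["sales"].sum()

# Slice: fix one dimension to a single value.
slice_ = cube[cube["category"] == "Electronic"]

# Dice: select values on two or more dimensions.
dice = cube[(cube["category"] == "Electronic") & (cube["year"] == 2008)]

# Pivot: rotate the axes so that years become columns.
pivot = cube.pivot_table(index="category", columns="year", values="sales", aggfunc="sum")

print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")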
Now, we are practically implementing all these OLAP Operations using Microsoft
Excel.

Procedure for OLAP Operations:


1. Open Microsoft Excel, go to the Data tab at the top and click on “Existing Connections”.
2. The Existing Connections window will open; there the “Browse for more” option should be clicked to import a .cub extension file for performing OLAP operations. As a sample, I took the music.cub file.


3. As shown in the above window, select “PivotTable Report” and click “OK”.

4. We get all the music.cub data for analyzing the different OLAP operations. First, we perform the drill-down operation as shown below.

In the above window, we selected the year ‘2008’ in the ‘Electronic’ category; the Drill-Down option is then automatically enabled in the top navigation options. We click on the ‘Drill-Down’ option, and the window below is displayed.

5. Now we are going to perform the roll-up (drill-up) operation. In the above window, we selected the month of January; the Drill-Up option is then automatically enabled on top. We click on the Drill-Up option, and the window below is displayed.


The next OLAP operation, slicing, is performed by inserting a slicer, as shown in the top navigation options. While inserting slicers for the slicing operation, we select only 2 dimensions (e.g. CategoryName & Year) with one measure (e.g. Sum of Sales). After inserting a slicer and adding a filter (CategoryName: AVANT ROCK & BIG BAND; Year: 2009 & 2010), we get the table shown below.


1. The dicing operation is similar to the slicing operation. Here we select 3 dimensions (CategoryName, Year, RegionCode) and 2 measures (Sum of Quantity, Sum of Sales) through the ‘Insert Slicer’ option, and then add a filter for CategoryName, Year & RegionCode as shown below.

2. Finally, the Pivot (rotate) OLAP operation is performed by swapping the rows (Order Date - Year) and columns (Values - Sum of Quantity & Sum of Sales) through the bottom-right navigation bar, as shown below.

After swapping (rotating), we get the result represented below, with a pie chart for the Classical category and year-wise data.

In the below window, we used 3D column charts in Microsoft Excel for analyzing data in the data warehouse.

The window below represents data visualization through the Pentaho Business Analytics online tool (http://www.pentaho.com/hosted-demo) for a sample dataset.


EXERCISE-2
Aim: Explore the machine learning tool WEKA.

Description:

WEKA contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to these functions. It is portable, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.

o A comprehensive collection of data preprocessing and modelling techniques.


o Ease of use due to its graphical user interfaces.
o WEKA supports several standard data mining tasks, specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to WEKA is expected to be formatted according to the Attribute-Relation File Format, in a file with the .arff extension.

Why Use WEKA Machine Learning Tool

With WEKA, machine learning algorithms are readily available to users. ML specialists can use these
methods to extract useful information from high volumes of data. Here, the specialists can create an
environment to develop new machine learning methods and implement them on real data.

How to download the WEKA tool:

Step 1: Check the configuration of the computer system and download the stable version of WEKA from the official WEKA download page.


Step 2: After a successful download, open the file location and double-click on the downloaded file. The Setup wizard will appear. Click on Next.

Step 3: The License Agreement terms will open. Read them thoroughly and click on “I Agree”.

Step 4: According to your requirements, select the components to be installed. Full component installation is recommended. Click on Next.


Step 5: Select the destination folder and click on Next.

Step 6: The installation will then start.

Step 7: Select “Start Weka” and click on Finish.


Step 8: Open the WEKA tool and explore it.

Features of the WEKA tool kit:

The GUI of WEKA gives five options:


Explorer
Experimenter
Knowledge flow
Workbench
Simple CLI.


Explorer: The WEKA Explorer window shows different tabs, starting with Preprocess. Initially the Preprocess tab is active, as the data set is first preprocessed before algorithms are applied to it and the dataset is explored.


Experimenter: The WEKA Experimenter allows users to create, run, and modify different schemes in one experiment on a dataset.

The experimenter has 2 types of configuration: Simple and Advanced.


1. The “New” and “Open” buttons let the user start a new experiment or open an existing one.
2. Results: Set the results destination; ARFF file, JDBC database and CSV file destinations are supported.
3. Experiment Type: The user can choose between cross-validation and a train/test percentage split, and between Classification and Regression, based upon the dataset and classifier used.
4. Datasets: The user can browse and select datasets from here. The relative path checkbox is ticked if working on different machines. The dataset formats supported are ARFF, C4.5, CSV, libsvm, bsi, and XRFF.
5. Iteration: The default iteration number is set to 10. “Data sets first” and “Algorithms first” help in switching between datasets and algorithms so that the algorithms can be run on all datasets.
6. Algorithms: New algorithms are added with the “Add new” button. The user can choose a classifier.
7. Save and run the experiment using the Save button.
• Knowledge Flow: The Knowledge Flow interface is an alternative to the Explorer.

• You lay out filters, classifiers, evaluators and visualizers interactively on a 2D canvas and connect them together.

The different components available are DataSources, DataSavers, Filters, Classifiers, Clusterers, Evaluation, and Visualization.


Simple CLI:
A command-line interface is a text-based user interface used to run programs, manage computer files and interact with the computer. It is also called a command-line user interface.
When you click on the Explorer button in the Applications selector, the Explorer opens.
We can now see the 6 tabs in the Explorer:

• Preprocess
• Classify
• Cluster
• Associate
• Select Attributes
• Visualize
Under these tabs, there are several pre-implemented machine learning algorithms. Let us look into each
of them in detail now.
• Preprocess Tab
Initially as you open the explorer, only the Preprocess tab is enabled. The first step in machine learning
is to preprocess the data. Thus, in the Preprocess option, you will select the data file, process it and
make it fit for applying the various machine learning algorithms.
• Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your data.
To list a few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector

Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is very
exhaustive and provides both supervised and unsupervised machine learning algorithms
• Cluster Tab
There are several clustering algorithms provided - such as SimpleKMeans, FilteredClusterer,
HierarchicalClusterer, and so on.
• Associate Tab
In the Associate tab, we can find Apriori, Filtered Associator and FPGrowth.

➢ Select Attributes Tab


Select Attributes allows feature selection based on several algorithms such as ClassifierSubsetEval, PrincipalComponents, etc.
➢ Visualize Tab
The Visualize option allows you to visualize the processed data for analysis.
WEKA provides several ready-to-use algorithms for testing and building machine learning
applications.
To use WEKA effectively, we must have a sound knowledge of these algorithms, how they work,
which one to choose under what circumstances, what to look for in their processed output, and so on.
In short, you must have a solid foundation in machine learning to use WEKA effectively in building
your apps.


ARFF File format:


ARFF stands for Attribute-Relation File Format.
It is an ASCII text file that describes a list of instances sharing a set of attributes.
ARFF files were developed by the Machine Learning Project at the Department of Computer Science of the University of Waikato, for use with the WEKA machine learning software.

Weka Data Sets:

Some sample WEKA data sets, in ARFF format:


• contact-lens.arff
• cpu.arff
• diabetes.arff
• glass.arff
• ionosphere.arff, etc.


Load each dataset and observe the following:


→List attribute names and types :
E.g. dataset: weather.nominal.arff
The attribute names are:
1. outlook
2. temperature
3. humidity
4. windy
5. play
→ Number of records in the dataset
Ans: 14 records, as listed below.
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes

rainy,mild,high,TRUE,no

→ Plot Histogram
Steps to plot the histogram:

1. Open the WEKA tool.
2. Click on WEKA Explorer.
3. Click on the Visualize button.
4. Click the right mouse button.
5. Select and click on the polyline option.
→ Determine the number of records for each class:
Ans: For the class attribute play there are 9 records with play = yes and 5 records with play = no, as can be counted from the data below.
@relation weather.symbolic
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
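The same counts can be checked programmatically. The sketch below uses scipy.io.arff and assumes SciPy is installed and that weather.nominal.arff sits in the current directory (adjust the path to the WEKA data folder on your machine).

from collections import Counter
from scipy.io import arff

# Load the ARFF file; nominal values come back as byte strings.
data, meta = arff.loadarff("weather.nominal.arff")

print("Attributes:", meta.names())          # outlook, temperature, humidity, windy, play
print("Number of records:", len(data))      # expected: 14
print("Records per class:", Counter(v.decode() for v in data["play"]))  # expected: yes=9, no=5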


→Visualize the data in various dimensions


EXERCISE-3
Aim: Perform data preprocessing tasks and demonstrate performing association rule mining on data sets.

Description: The data that is collected from the field contains many unwanted elements that lead to wrong analysis.
To demonstrate the available preprocessing features, we will use the weather database that is provided in the installation.
Step 1: Using the Open file option under the Preprocess tab, select the weather.nominal.arff file.

Step 2: When we open the file, we see the following:

➢ Applying Filters:
There are many filters, such as:
• Unsupervised filters

• Supervised filters
• Discretization
• Resample filter, etc.
▪ Supervised filters:

Supervised learning is a machine learning method in which models are trained using labeled data. In
supervised learning, models need to find the mapping function to map the input variable (X) with the
output variable (Y).

Supervised learning can be used for two types of problems: Classification and Regression

▪ Unsupervised filters:

Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns from the data on its own.

It can be used for two types of problems: Clustering and Association.


▪ Discretization: Once in a while one has numeric data but wants to use a classifier that handles only nominal values. In that case one needs to discretize the data, which can be done with the following filters:

But since discretization depends on the data presented to the discretization algorithm, one can easily end up with incompatible train and test files.

➢ Load the weather.nominal data set into WEKA and run the Apriori algorithm with different support and confidence values.
➢ Apriori Algorithm:
• AIM: To select interesting rules from the set of all possible rules, constraints on various
measures of significance and interest can be used. The best known constraints are minimum
thresholds on support and confidence.
• Description:
The Apriori algorithm is one such algorithm in ML that finds out the probable associations and
creates association rules.


WEKA provides the implementation of the Apriori algorithm. You can define the minimum support
and an acceptable confidence level while computing these rules.

• ALGORITHM:
Association rule mining is to find out association rules that satisfy the predefined minimum support
and confidence from a given database
The Apriori algorithm finds the frequent itemsets L in database D:
· Find the frequent itemset Lk-1.
· Join step: Ck is generated by joining Lk-1 with itself.
· Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence should be removed.
(Ck: candidate itemset of size k; Lk: frequent itemset of size k.)

Apriori pseudocode:
Apriori(T, ε)
    L1 <- { large 1-itemsets that appear in more than ε transactions }
    k <- 2
    while L(k-1) ≠ Φ
        C(k) <- candidates generated from L(k-1)
        for each transaction t ∈ T
            C(t) <- Subset(C(k), t)   // candidates contained in t
            for each candidate c ∈ C(t)
                count[c] <- count[c] + 1
        L(k) <- { c ∈ C(k) | count[c] ≥ ε }
        k <- k + 1
    return the union of all L(k)
→ Steps to run the Apriori algorithm in WEKA:

• Open the WEKA tool.
• Click on WEKA Explorer.
• Click on the Preprocess tab.
• Click on the Open file button.
• Choose the WEKA folder in the C drive.
• Select and click on the data folder.
• Choose the weather data set and open the file.
• Click on the Associate tab.
• Choose the Apriori algorithm and click on the Start button.

We will apply the Apriori algorithm to the supermarket data provided in the WEKA installation.


Output:

Support and Confidence values:

• Support count: The support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions.

• Then, for a rule X → Y:

support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count

For example, for a rule A → C:

support = support({A ∪ C})
confidence = support({A ∪ C}) / support({A})
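As a small illustration, the following sketch computes these two measures for a made-up list of transactions:

# Each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent):
    """confidence(X -> Y) = support(X u Y) / support(X)."""
    return support(antecedent | consequent) / support(antecedent)

X, Y = {"bread"}, {"milk"}
print("support(bread -> milk)    =", support(X | Y))   # 3/5 = 0.6
print("confidence(bread -> milk) =", confidence(X, Y)) # 0.6 / 0.8 = 0.75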
→ Aim: Apply different discretization filters on numerical attributes and run the Apriori association rule algorithm. Study the rules generated.
E.g. data sets: vote, soybean, supermarket, iris.
Steps to run the Apriori algorithm in WEKA:

• Open WEKA Tool.


• Click on WEKA Explorer.
• Click on Preprocessing tab button.
• Click on open file button.
• Choose WEKA folder in C drive

• Select and Click on data option button


• Choose Weather data set and open file.
• Choose the Filter button, select the unsupervised Discretize option and apply it.
• Click on the Associate tab and choose the Apriori algorithm.
• Click on start button.
• Output:


EXERCISE-4
4(a) Aim: Load each dataset into WEKA and run the ID3 and J48 classification algorithms; study the classifier output. Compute entropy values and the Kappa statistic.
➢ Description:
→ Steps to run the ID3 and J48 classification algorithms in WEKA:
▪ Open WEKA Tool.
▪ Click on WEKA Explorer.
▪ Click on Preprocessing tab button.
▪ Click on open file button.
▪ Choose WEKA folder in C drive.
▪ Select and Click on data option button.
▪ Choose iris data set and open file.
▪ Click on classify tab and Choose J48 algorithm and select use training set test option.
▪ Click on start button.
▪ Click on classify tab and Choose ID3 algorithm and select use training set test option.
▪ Click on start button.
▪ Output:

The Classifier Output Text :


The text in the Classifier output area has scroll bars allowing you to browse the results.
Clicking with the left mouse button into the text area, while holding Alt and Shift, brings up a dialog
that enables you to save the displayed output


4(b) Extract if-then rules from the decision tree generated by the classifier, and observe the confusion matrix.
Description: A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf
node holds a class label. The topmost node in the tree is the root node
IF-THEN Rules:
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form: IF condition THEN conclusion. Let us consider a rule R1: IF age = youth AND student = yes THEN buy_computer = yes.

Output:


→Load each dataset into Weka and perform Naïve-bayes classification and k-Nearest Neighbour
classification. Interpret the results obtained.
Aim: To determine and classify credit as good or bad in the dataset, with accuracy.
Description:
The Naive Bayes classifier assumes that the presence of a particular feature of a class is unrelated to the presence of any other feature. Even if these features depend on the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability.
→ 4(c) Steps to run the Naïve Bayes and k-nearest neighbor classification algorithms in WEKA:
▪ Open WEKA Tool
▪ Click on WEKA Explorer.
▪ Click on Preprocessing tab button.
▪ Click on open file button.
▪ Choose WEKA folder in C drive.
▪ Select and Click on data option button
▪ Choose iris data set and open file.
▪ Click on classify tab and Choose Naïve-bayes algorithm
▪ select use training set test option..
▪ Click on start button.
▪ Click on classify tab
▪ Choose k-nearest neighbor
▪ select use training set test option.
▪ Click on start button.

Naïve-bayes algorithm:


4(d) Plot ROC Curves:

* Steps to plot ROC curves:
1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Visualize button.
4. Click on right click button.
5. Select and Click on polyline option button

→ 4(e) Compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for each dataset, deduce which classifier performs best and which performs worst for each dataset, and justify the result.
Aim: To compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for each dataset.
→ Description:
Steps to run the ID3 and J48 classification algorithms in WEKA:
* Open WEKA Tool.
* Click on WEKA Explorer.
* Click on Preprocessing tab button.
* Click on open file button.
* Choose WEKA folder in C drive
→Select and Click on data option button
* Choose iris data set and open file.
* Click on classify tab and Choose J48 algorithm and select use training set test option.
* Click on start button.


* Click on classify tab and Choose ID3 algorithm and select use training set test option.
* Click on start button.

* Click on classify tab and Choose Naïve-bayes algorithm and select use training set test option.
* Click on start button.


* Click on classify tab and Choose k-nearest neighbor and select use training set test option.
* Click on start button.
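As a cross-check outside WEKA, an analogous comparison can be sketched with scikit-learn on the iris data. Note this is only an approximation of the WEKA setup: scikit-learn has no ID3 or J48, so a DecisionTreeClassifier with the entropy criterion is used as a rough stand-in for both.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

models = {
    "DecisionTree (entropy, ~ID3/J48)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "NaiveBayes (Gaussian)": GaussianNB(),
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

# 10-fold cross-validation accuracy for each classifier.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")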


EXERCISE-5
→ 5(a) Demonstrate performing clustering on data sets (Cluster tab)

➢ AIM: To understand the selected attributes, remove attributes, and reload the ARFF data file to get all the attributes in the data set.
➢ Description: Selecting a Clusterer:
By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.
Steps to run the k-means clustering algorithm in WEKA:

• Open WEKA Tool.

• Click on WEKA Explorer.

• Click on Preprocessing tab button.

• Click on open file button.

• Choose WEKA folder in C drive.

• Select and Click on data option button.

• Choose iris data set and open file.

• Click on cluster tab and Choose k-mean and select use training set test option.

• Click on start button

Output:


5(b)Study the clusters formed. Observe the sum of squared errors and centroids, and derive
insights
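For reference, the same quantities (sum of squared errors and centroids) can be inspected with a short scikit-learn sketch on the iris data; this is an illustration alongside WEKA's SimpleKMeans output, not a reproduction of it.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)

# k = 3 clusters, matching the three iris species.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Within-cluster sum of squared errors:", kmeans.inertia_)
print("Cluster centroids:")
print(kmeans.cluster_centers_)
print("Cluster sizes:", [list(kmeans.labels_).count(c) for c in range(3)])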

→ 5(c) Explore other clustering techniques available in WEKA.

AIM: Clustering Algorithms And Techniques in WEKA


→5(d)Explore visualization features of weka to visualize the clusters. Derive interesting insights and
explain.
Aim: To explore visualization features of weka to visualize the clusters.
Description:
Visualize Features: WEKA's visualization allows you to visualize a 2-D plot of the current working relation.
Visualization is very useful in practice; it helps to determine the difficulty of the learning problem.
WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations.
WEKA has a “Jitter” option to deal with nominal attributes and to detect “hidden” data points.


EXERCISE-6
→Demonstrate knowledge flow application on data sets
6(a)Aim: Develop a knowledge flow layout for finding strong association rules by using Apriori, FP
Growth algorithms
Description: The Knowledge Flow presents a data-flow inspired interface to WEKA. The user can
select WEKA components from a palette, place them on a layout canvas and connect them together in
order to form a knowledge flow for processing and analyzing data. At present, all of WEKA’s
classifiers, filters, clusterers, associators, loaders and savers are available in the Knowledge Flow
along with some extra tools.


6(b)AIM:
Set up the knowledge flow to load an ARFF (batch mode) and perform a cross validation using J48
algorithm
DESCRIPTION:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to WEKA's algorithms. Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available; on the other hand, there are things that can be done in Knowledge Flow but not in the Explorer. Knowledge Flow presents a data-flow interface to WEKA. The user can select WEKA components from a toolbar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing the data.
PROCEDURE:
Step 1: Open the WEKA tool and, from it, open the Knowledge Flow application.

Step 2: Now select the data source for the data set.


Step 3: Under DataSources, select the ArffLoader.

Step 4: Now, under Evaluation, select the ClassAssigner.

Right-click on the ArffLoader, choose Configure, browse to the iris dataset and select it. Then right-click the ArffLoader again and connect its dataSet output to the ClassAssigner.

Step 5: Now select the ClassAssigner and connect its dataSet to the CrossValidationFoldMaker.


Step 6: Now attach the CrossValidationFoldMaker to J48 through both the trainingSet and testSet connections.

Step 7: Now, from J48, select the batchClassifier connection and attach it to the ClassifierPerformanceEvaluator.


Step 8: Now, from the ClassifierPerformanceEvaluator, send the text output to a TextViewer.

Step 9: Now run the flow; the status is shown below.


Step 10: Now right-click on the TextViewer and choose Show results.


Output:
Finally, the complete knowledge flow is shown below.

6(c) AIM: Demonstrate plotting multiple ROC curves in the same plot window by using the J48 and RandomForest trees.
➢ Steps for plotting multiple ROC curves:
• Open WEKA Tool.

• Click on WEKA Explorer.

• Click on Preprocessing tab button.

• Click on open file button.

• Choose WEKA folder in C drive.

• Select and Click on data option button.

• Choose iris data set and open file.

• Click on classify tab button and select trees option and select J48 and run it


Now right-click on the J48 result and choose Visualize threshold curve.


EXERCISE-7
Aim: Demonstrate the ZeroR technique on the iris dataset (by using the necessary preprocessing techniques) and share your observations.
ZeroR: ZeroR is the simplest classification method; it relies only on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although there is no predictive power in ZeroR, it is useful for determining a baseline performance as a benchmark for other classification methods.
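The ZeroR baseline is easy to reproduce outside WEKA as well; scikit-learn's DummyClassifier with the most_frequent strategy plays the same role. The sketch below uses the iris data for illustration.

from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Always predict the majority class, exactly like ZeroR.
zero_r = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(zero_r, X, y, cv=10)

# Iris has 50 instances of each of 3 classes, so the baseline is about 33% accuracy.
print("ZeroR-style baseline accuracy:", scores.mean())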
ZeroR on the iris dataset (WEKA output):



EXERCISE-8
Aim: Write a Java program to prepare a simulated dataset with unique instances.
A number is unique if it is a positive integer and there are no repeated digits in it. In other words, a number is said to be unique if and only if its digits are not duplicated. For example, 20, 56, 9863, 145, etc. are unique numbers, while 33, 121, 900, 1010, etc. are not unique numbers. In this section, we will create a program to check this.
There are the following ways to check whether a number is unique or not:
By comparing each digit manually
Using String
Using Array

By Comparing Each Digit Manually
There are the following steps to check whether a number is unique or not:
1. Read a number from the user.
2. Find the last digit of the number.
3. Compare all digits of the number with the last digit.
4. If a digit is found more than one time, the number is not unique.
5. Else, eliminate the last digit of the number.
6. Repeat steps 2 to 5 until the number becomes zero.

UniqueNumberExample1.java
import java.util.Scanner;

public class UniqueNumberExample1
{
    public static void main(String args[])
    {
        int r1, r2, number, num1, num2, count = 0;
        Scanner sc = new Scanner(System.in);
        System.out.print("Enter the number you want to check: ");
        // reading a number from the user
        number = sc.nextInt();
        // num1 and num2 are temporary variables
        num1 = number;
        num2 = number;
        // iterate over all digits of the number
        while (num1 > 0)
        {
            // determines the last digit of the number
            r1 = num1 % 10;
            while (num2 > 0)
            {
                // finds the last digit
                r2 = num2 % 10;
                // comparing the last digits
                if (r1 == r2)
                {
                    // increments the count variable by 1
                    count++;
                }
                // removes the last digit from the number
                num2 = num2 / 10;
            }
            // removes the last digit from the number
            num1 = num1 / 10;
        }
        if (count == 1)
        {
            System.out.println("The number is unique.");
        }
        else
        {
            System.out.println("The number is not unique.");
        }
    }
}
Output 1:
Enter the number you want to check: 13895
The number is unique.
Output 2:
Enter the number you want to check: 11100
The number is not unique.
Output 3:
Enter the number you want to check: 10000
The number is not unique.


EXERCISE-9
Aim : Write a Python program to generate frequent item sets / association rules using Apriori
algorithm

Procedure :
Apriori Algorithm is a Machine Learning algorithm utilized to understand the patterns of
relationships among the various products involved. The most popular use of the algorithm is
to suggest products based on the items already in the user's shopping cart. Walmart
specifically has utilized the algorithm in recommending items to its users.
Dataset: Groceries data
Implementation of algorithm in Python:
Step 1: Import the required libraries
1. import numpy as np
2. import pandas as pd
3. from mlxtend.frequent_patterns import apriori, association_rules
Step 2: Load and explore the data
1. # Now, we will load the Data
2. data1 = pd.read_excel('Online_Retail.xlsx')
3. data1.head()


Output:


Input:
# here, we will explore the columns of the data
1. data1.columns
Output:
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
Input:
# Now, we will explore the different regions of transactions
1. data1.Country.unique()
Output:
array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
'European Community', 'Malta', 'RSA'], dtype = object)
Step 3: Clean the Data
1. # here, we will strip the extra spaces in the description
2. data1['Description'] = data1['Description'].str.strip()
3. # Now, drop the rows which does not have any invoice number
4. data1.dropna(axis = 0, subset = ['InvoiceNo'], inplace = True)
5. data1['InvoiceNo'] = data1['InvoiceNo'].astype('str')
6. # Now, we will drop all transactions which were done on credit
7. data1 = data1[~data1['InvoiceNo'].str.contains('C')]
Step 4: Split the data according to the region of transaction
1. # Transactions done in France
2. basket1_France = (data1[data1['Country'] == "France"]

3. .groupby(['InvoiceNo', 'Description'])['Quantity']
4. .sum().unstack().reset_index().fillna(0)
5. .set_index('InvoiceNo'))
6. # Transactions done in the United Kingdom
7. basket1_UK = (data1[data1['Country'] == "United Kingdom"]
8. .groupby(['InvoiceNo', 'Description'])['Quantity']
9. .sum().unstack().reset_index().fillna(0)
10. .set_index('InvoiceNo'))
11. # Transactions done in Portugal
12. basket1_Por = (data1[data1['Country'] == "Portugal"]
13. .groupby(['InvoiceNo', 'Description'])['Quantity']
14. .sum().unstack().reset_index().fillna(0)
15. .set_index('InvoiceNo'))
16.
17. basket1_Sweden = (data1[data1['Country'] == "Sweden"]
18. .groupby(['InvoiceNo', 'Description'])['Quantity']
19. .sum().unstack().reset_index().fillna(0)
20. .set_index('InvoiceNo'))
Step 5: Hot encoding the Data
# Here, we will define the hot encoding function
1. # for making the data suitable
2. # for the concerned libraries
3. def hot_encode1(P):
4. if(P<= 0):
5. return 0
6. if(P>= 1):
7. return 1
8. # Here, we will encode the datasets

9. basket1_encoded = basket1_France.applymap(hot_encode1)
10. basket1_France = basket1_encoded
11.
12. basket1_encoded = basket1_UK.applymap(hot_encode1)
13. basket1_UK = basket1_encoded
14.
15. basket1_encoded = basket1_Por.applymap(hot_encode1)
16. basket1_Por = basket1_encoded
17. basket1_encoded = basket1_Sweden.applymap(hot_encode1)
18. basket1_Sweden = basket1_encoded
Step 6: Build the models and analyse the results
a) France:
1. # Build the model
2. frq_items1 = apriori(basket1_France, min_support = 0.05, use_colnames = True)
3.
4. # Collect the inferred rules in a dataframe
5. rules1 = association_rules(frq_items1, metric = "lift", min_threshold = 1)
6. rules1 = rules1.sort_values(['confidence', 'lift'], ascending = [False, False])
7. print(rules1.head())
Output:
antecedents \
45 (JUMBO BAG WOODLAND ANIMALS)
260 (PLASTERS IN TIN CIRCUS PARADE, RED TOADSTOOL ...
272 (RED TOADSTOOL LED NIGHT LIGHT, PLASTERS IN TI...
302 (SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...
301 (SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...
consequents antecedent support consequent support \
45 (POSTAGE) 0.076531 0.765306
260 (POSTAGE) 0.051020 0.765306
272 (POSTAGE) 0.053571 0.765306

302 (SET/6 RED SPOTTY PAPER PLATES) 0.102041 0.127551


301 (SET/6 RED SPOTTY PAPER CUPS) 0.102041 0.137755
support confidence lift leverage conviction
45 0.076531 1.000 1.306667 0.017961 inf
260 0.051020 1.000 1.306667 0.011974 inf
272 0.053571 1.000 1.306667 0.012573 inf
302 0.099490 0.975 7.644000 0.086474 34.897959
301 0.099490 0.975 7.077778 0.085433 34.489796
From the above output, it can be seen that paper cups and paper plates are bought together in France. This is because the French have a culture of having a get-together with their friends and family at least once a week. Also, since the French government has banned the use of plastic in the country, people have to purchase paper-based alternatives.


EXERCISE-10
Aim: Write a program to calculate chi-square value using Python. Report your observation.
Pearson's Chi-Square statistical hypothesis test is a test for independence between categorical variables. In this exercise, we will perform the test using a mathematical approach and then using Python's SciPy module.
First, let us see the mathematical approach :
The Contingency Table:
A Contingency table (also called crosstab) is used in statistics to summarise the relationship between
several categorical variables. Here, we take a table that shows the number of men and women buying
different types of pets.
         dog    cat    bird   total
men      207    282    241    730
women    234    242    232    708
total    441    524    473    1438
The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related to
each other.
Null hypothesis:
We start by defining the null hypothesis (H0) which states that there is no relation between
the variables. An alternate hypothesis would state that there is a significant relation between
the two.
We can verify the hypothesis by these methods:
Using p-value:
We define a significance factor to determine whether the relation between the variables is of
considerable significance.
Generally a significance factor or alpha value of 0.05 is chosen.
This alpha value denotes the probability of erroneously rejecting H0 when it is true.
A lower alpha value is chosen in cases where we expect more precision. If the p-value for the test
comes out to be strictly greater than the alpha value, then H0 holds true.
Using chi-square value: If our calculated value of chi-square is less or equal to the tabular(also
called critical) value of chi-square, then H0 holds true.
Expected Values Table: Next, we prepare a similar table of calculated (or expected) values. To do this we need to calculate each item in the new table as:
expected value = (row total * column total) / grand total

The expected values table:

         dog            cat            bird           total
men      223.87343533   266.00834492   240.11821975   730
women    217.12656467   257.99165508   232.88178025   708
total    441            524            473            1438
Chi-Square Table:
We prepare this table by calculating, for each item, (o - c)^2 / c.

The chi-square table:
observed (o)    calculated (c)    (o-c)^2 / c
207             223.87343533      1.2717579435607573
282             266.00834492      0.9613722161954465
241             240.11821975      0.003238139990850831
234             217.12656467      1.3112758457617977
242             257.99165508      0.991245364156322
232             232.88178025      0.0033387601600580606
Total                             4.542228269825232
From this table, we obtain the total of the last column, which gives us the calculated value of chi-
square. Hence the calculated value of chi-square is 4.542228269825232
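The same test can be performed with SciPy, as mentioned above. The sketch below feeds the observed contingency table to scipy.stats.chi2_contingency, which should reproduce the hand-computed chi-square value of about 4.542, along with the p-value, degrees of freedom and expected frequencies.

from scipy.stats import chi2_contingency

# Observed counts: rows = men, women; columns = dog, cat, bird.
observed = [[207, 282, 241],
            [234, 242, 232]]

chi2, p, dof, expected = chi2_contingency(observed)

print("Chi-square value:", chi2)       # ~4.542
print("p-value:", p)
print("Degrees of freedom:", dof)      # (2-1)*(3-1) = 2
print("Expected frequencies:\n", expected)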


EXERCISE-11
Aim: Write a program for Naive Bayes classification using Python.
Introduction to Naive Bayes:
Naive Bayes is one of the simplest and most powerful algorithms for classification, based on Bayes' theorem with an assumption of independence among the predictors. The Naive Bayes classifier assumes that the presence of a feature in a class is not related to any other feature.
Naive Bayes is a classification algorithm for binary and multi-class classification problems.
Bayes Theorem :
Based on prior knowledge of conditions that may be related to an event, Bayes' theorem describes the probability of the event.
• Conditional probability can be found this way.
• Assume we have a hypothesis (H) and evidence (E). According to Bayes' theorem, the relationship between the probability of the hypothesis before getting the evidence, represented as P(H), and the probability of the hypothesis after getting the evidence, represented as P(H|E), is:

P(H|E) = P(E|H) * P(H) / P(E)

• Prior probability = P(H) is the probability before getting the evidence.
• Posterior probability = P(H|E) is the probability after getting the evidence.
In general, P(class|data) = (P(data|class) * P(class)) / P(data).
Bayes Theorem Example
Assume we have to find the probability that a randomly picked card is a King, given that it is a face card.
There are 4 Kings in a deck of cards, which implies that P(King) = 4/52.
All the Kings are face cards, so P(Face|King) = 1.
There are 3 face cards in a suit of 13 cards and there are 4 suits in total, so P(Face) = 12/52.
Therefore,
P(King|Face) = P(Face|King) * P(King) / P(Face) = (1 * 4/52) / (12/52) = 1/3
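The card example can be verified with a few lines of Python arithmetic:

# P(King), P(Face | King) and P(Face) for a standard 52-card deck.
p_king = 4 / 52
p_face_given_king = 1.0
p_face = 12 / 52

# Bayes' theorem: P(King | Face) = P(Face | King) * P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 0.333... = 1/3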

Source Code : Implementing Naive Bayes algorithm from scratch using Python
# Importing libraries
import math
import random
import csv


# the categorical class names are changed to numeric data
# e.g. yes and no encoded to 1 and 0
def encode_class(mydata):
    classes = []
    for i in range(len(mydata)):
        if mydata[i][-1] not in classes:
            classes.append(mydata[i][-1])
    for i in range(len(classes)):
        for j in range(len(mydata)):
            if mydata[j][-1] == classes[i]:
                mydata[j][-1] = i
    return mydata


# Splitting the data
def splitting(mydata, ratio):
    train_num = int(len(mydata) * ratio)
    train = []
    # initially the test set will have all of the dataset
    test = list(mydata)
    while len(train) < train_num:
        # index generated randomly from range 0
        # to length of testset
        index = random.randrange(len(test))
        # from testset, pop data rows and put them in train
        train.append(test.pop(index))
    return train, test


# Group the data rows under each class yes or
# no in a dictionary, e.g. dict[yes] and dict[no]
def groupUnderClass(mydata):
    dict = {}
    for i in range(len(mydata)):
        if (mydata[i][-1] not in dict):
            dict[mydata[i][-1]] = []
        dict[mydata[i][-1]].append(mydata[i])
    return dict


# Calculating Mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))


# Calculating Standard Deviation
def std_dev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)


def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # e.g. list = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x)/3, mean of 2nd attribute = (b + n + y)/3
    # delete the summary of the last column (the class)
    del info[-1]
    return info


# find Mean and Standard Deviation under each class
def MeanAndStdDevForClass(mydata):
    info = {}
    dict = groupUnderClass(mydata)
    for classValue, instances in dict.items():
        info[classValue] = MeanAndStdDev(instances)
    return info


# Calculate Gaussian Probability Density Function
def calculateGaussianProbability(x, mean, stdev):
    expo = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * expo


# Calculate Class Probabilities
def calculateClassProbabilities(info, test):
    probabilities = {}
    for classValue, classSummaries in info.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, std_dev = classSummaries[i]
            x = test[i]
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities


# Make prediction - the highest probability is the prediction
def predict(info, test):
    probabilities = calculateClassProbabilities(info, test)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel


# returns predictions for a set of examples
def getPredictions(info, test):
    predictions = []
    for i in range(len(test)):
        result = predict(info, test[i])
        predictions.append(result)
    return predictions


# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0


# driver code
# add the data path on your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'

# load the file and store it in the mydata list
mydata = csv.reader(open(filename, "rt"))
mydata = list(mydata)
mydata = encode_class(mydata)
for i in range(len(mydata)):
    mydata[i] = [float(x) for x in mydata[i]]

# split ratio = 0.7
# 70% of the data is training data and 30% is test data used for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))

# prepare model
info = MeanAndStdDevForClass(train_data)

# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
Output:
Total number of examples are: 200
Out of these, training examples are: 140
Test examples are: 60
Accuracy of your model is: 71.2376788
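For comparison (an optional sketch, not part of the original listing), the same 70/30 workflow can be reproduced with scikit-learn's GaussianNB; the file path and the assumption that the CSV is fully numeric with the class label in the last column mirror the script above:

# Illustrative sketch: Gaussian Naive Bayes with scikit-learn on the same kind of CSV data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = np.genfromtxt('filedata.csv', delimiter=',')   # assumed numeric CSV, last column = class
X, y = data[:, :-1], data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Accuracy of sklearn GaussianNB:", accuracy_score(y_test, predictions) * 100)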


EXERCISE 12
Aim: Implement a Java program to perform the Apriori algorithm.
import java.io.*;
import java.util.*;
/** The class encapsulates an implementation of the Apriori algorithm
* to compute frequent itemsets.
* Datasets contain integers (>=0) separated by spaces, one transaction per line, e.g.
* 1 2 3
* 0 9
* 1 9
* Usage with the command line:
* $ java mining.Apriori fileName support
* $ java mining.Apriori /tmp/data.dat 0.8
* $ java mining.Apriori /tmp/data.dat 0.8 > frequent-itemsets.txt
*
* For a full library, see SPMF https://www.philippe-fournier-viger.com/spmf/
* @author Martin Monperrus, University of Darmstadt, 2010
* @author Nathan Magnus and Su Yibin, under the supervision of Howard Hamilton,
* University of Regina, June 2009.
* @copyright GNU General Public License v3
* No reproduction in whole or part without maintaining this copyright notice
* and imposing this condition on any subsequent users.
*/
public class Apriori extends Observable {
public static void main(String[] args) throws Exception {
Apriori ap = new Apriori(args);
}
/** the list of current itemsets */
private List<int[]> itemsets ;
/** the name of the transaction file */


private String transaFile;


/** number of different items in the dataset */
private int numItems;
/** total number of transactions in transaFile */
private int numTransactions;
/** minimum support for a frequent itemset in percentage, e.g. 0.8 */
private double minSup;
/** by default, Apriori is used with the command line interface */
private boolean usedAsLibrary = false;
/** This is the main interface to use this class as a library */
public Apriori(String[] args, Observer ob) throws Exception
{
usedAsLibrary = true;
configure(args);
this.addObserver(ob);
go();
}
/** generates the apriori itemsets from a file
*
* @param args configuration parameters: args[0] is a filename, args[1] the min support (e.g.
0.8 for 80%)
*/
public Apriori(String[] args) throws Exception
{
configure(args);
go();
}
/** starts the algorithm after configuration */
private void go() throws Exception {
//start timer
long start = System.currentTimeMillis();

// first we generate the candidates of size 1


createItemsetsOfSize1();
int itemsetNumber=1; //the current itemset being looked at
int nbFrequentSets=0;
while (itemsets.size()>0)
{
calculateFrequentItemsets();
if(itemsets.size()!=0)
{
nbFrequentSets+=itemsets.size();
log("Found "+itemsets.size()+" frequent itemsets of size " + itemsetNumber + " (with support
"+(minSup*100)+"%)");;
createNewItemsetsFromPreviousOnes();
}
itemsetNumber++;
}
//display the execution time
long end = System.currentTimeMillis();
log("Execution time is: "+((double)(end-start)/1000) + " seconds.");
log("Found "+nbFrequentSets+ " frequents sets for support "+(minSup*100)+"% (absolute
"+Math.round(numTransactions*minSup)+")");
log("Done");
}
/** triggers actions if a frequent item set has been found */
private void foundFrequentItemSet(int[] itemset, int support) {
if (usedAsLibrary) {
this.setChanged();
notifyObservers(itemset);
}
else {System.out.println(Arrays.toString(itemset) + " ("+ ((support / (double)

numTransactions))+" "+support+")");}
}
/** outputs a message in Sys.err if not used as library */
private void log(String message) {
if (!usedAsLibrary) {
System.err.println(message);
}
}
/** computes numItems, numTransactions, and sets minSup */
private void configure(String[] args) throws Exception
{
// setting transafile
if (args.length!=0) transaFile = args[0];
else transaFile = "chess.dat"; // default
// setting minsupport
if (args.length>=2) minSup=(Double.valueOf(args[1]).doubleValue());
else minSup = .8;// by default
if (minSup>1 || minSup<0) throw new Exception("minSup: bad value");
// going through the file to compute numItems and numTransactions
numItems = 0;
numTransactions=0;
BufferedReader data_in = new BufferedReader(new FileReader(transaFile));
while (data_in.ready()) {
String line=data_in.readLine();
if (line.matches("\\s*")) continue; // be friendly with empty lines
numTransactions++;
StringTokenizer t = new StringTokenizer(line," ");
while (t.hasMoreTokens()) {
int x = Integer.parseInt(t.nextToken());
//log(x);

if (x+1>numItems) numItems=x+1;
}
}
outputConfig();
}
/** outputs the current configuration
*/
private void outputConfig() {
// output config info to the user
log("Input configuration: "+numItems+" items, "+numTransactions+" transactions, ");
log("minsup = "+minSup*100+"%");
}
/** puts in itemsets all sets of size 1,
* i.e. all possibles items of the datasets
*/
private void createItemsetsOfSize1() {
itemsets = new ArrayList<int[]>();
for(int i=0; i<numItems; i++)
{
int[] cand = {i};
itemsets.add(cand);
}
}
/**
* if n is the size of the current itemsets,
* generate all possible itemsets of size n+1 from pairs of current itemsets
* and replace the old itemsets with the new ones
*/
private void createNewItemsetsFromPreviousOnes()
{

// by construction, all existing itemsets have the same size


int currentSizeOfItemsets = itemsets.get(0).length;
log("Creating itemsets of size "+(currentSizeOfItemsets+1)+" based on "+itemsets.size()+"
itemsets of size "+currentSizeOfItemsets);
HashMap<String, int[]> tempCandidates = new HashMap<String, int[]>(); //temporary
candidates
// compare each pair of itemsets of size n-1
for(int i=0; i<itemsets.size(); i++)
{
for(int j=i+1; j<itemsets.size(); j++)
{
int[] X = itemsets.get(i);
int[] Y = itemsets.get(j);
assert (X.length==Y.length);
// copy all items of X into the new candidate (the last slot is left for the item of Y that is not in X)
int [] newCand = new int[currentSizeOfItemsets+1];
for(int s=0; s<newCand.length-1; s++) {
newCand[s] = X[s];
}
int ndifferent = 0;
// then we find the missing value
for(int s1=0; s1<Y.length; s1++)
{
boolean found = false;
// is Y[s1] in X?
for(int s2=0; s2<X.length; s2++) {
if (X[s2]==Y[s1]) {
found = true;
break;
}

}
if (!found){ // Y[s1] is not in X
ndifferent++;
// we put the missing value at the end of newCand
newCand[newCand.length -1] = Y[s1];
}
}
// we have to find at least 1 different, otherwise it means that we have two times the same set
in the existing candidates
assert(ndifferent>0);
if (ndifferent==1) {
// HashMap does not have the correct "equals" for int[] :-(
// so the hash key is created manually using a String
// I use Arrays.toString to reuse equals and hashCode of String
Arrays.sort(newCand);
tempCandidates.put(Arrays.toString(newCand),newCand);
}
}
}
//set the new itemsets
itemsets = new ArrayList<int[]>(tempCandidates.values());
log("Created "+itemsets.size()+" unique itemsets of size "+(currentSizeOfItemsets+1));
}
/** put "true" in trans[i] if the integer i is in line */
private void line2booleanArray(String line, boolean[] trans) {
Arrays.fill(trans, false);
StringTokenizer stFile = new StringTokenizer(line, " "); //read a line from the file to the
tokenizer
//put the contents of that line into the transaction array
while (stFile.hasMoreTokens())


{
int parsedVal = Integer.parseInt(stFile.nextToken());
trans[parsedVal]=true; //if it is not a 0, assign the value to true
}
}
/** passes through the data to measure the frequency of sets in {@link itemsets},
* then filters those that are under the minimum support (minSup)
*/
private void calculateFrequentItemsets() throws Exception
{
log("Passing through the data to compute the frequency of " + itemsets.size()+ " itemsets of
size "+itemsets.get(0).length);
List<int[]> frequentCandidates = new ArrayList<int[]>(); //the frequent candidates for the
current itemset
boolean match; //whether the transaction has all the items in an itemset
int count[] = new int[itemsets.size()]; //the number of successful matches, initialized by zeros
// load the transaction file
BufferedReader data_in = new BufferedReader(new InputStreamReader(new
FileInputStream(transaFile)));
boolean[] trans = new boolean[numItems];
// for each transaction
for (int i = 0; i < numTransactions; i++) {
// boolean[] trans = extractEncoding1(data_in.readLine());
String line = data_in.readLine();
line2booleanArray(line, trans);
// check each candidate
for (int c = 0; c < itemsets.size(); c++) {
match = true; // reset match for this candidate
// tokenize the candidate so that we know what items need to be
// present for a match

int[] cand = itemsets.get(c);


//int[] cand = candidatesOptimized[c];
// check each item in the itemset to see if it is present in the
// transaction
for (int xx : cand) {
if (trans[xx] == false) {
match = false;
break;
}
}
if (match) { // if at this point it is a match, increase the count
count[c]++;
//log(Arrays.toString(cand)+" is contained in trans "+i+" ("+line+")")
}
}
}
data_in.close();
for (int i = 0; i < itemsets.size(); i++) {
// if the count% is larger than the minSup%, add the candidate to
// the frequent candidates
if ((count[i] / (double) (numTransactions)) >= minSup) {
foundFrequentItemSet(itemsets.get(i),count[i]);
frequentCandidates.add(itemsets.get(i));
}
//else log("-- Remove candidate: "+ Arrays.toString(candidates.get(i)) + " is: "+ ((count[i] /
(double) numTransactions)));
}
//new candidates are only the frequent candidates
itemsets = frequentCandidates;
}}
Output:

Enter the minimum support (as a floating point value, 0<x<1): 0.5

-+- L1 -+-
[1] : 2
[3] : 3
[2] : 3
[5] : 3

-+- L2 -+-
[2, 3] : 2
[3, 5] : 2
[1, 3] : 2
[2, 5] : 3

-+- L3 -+-
[2, 3, 5] : 2

=+= FINAL LIST =+=
[2, 3, 5] : 2
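As a cross-check of the frequent itemsets above (not part of the original listing), a small brute-force Python sketch can recount the supports; the four transactions below are the classic textbook example and are an assumption, chosen because they reproduce the counts shown in the output:

# Brute-force frequent-itemset counting (assumed example transactions)
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # assumed input data
min_sup = 0.5                                                 # 0.5 * 4 transactions = 2

items = sorted(set().union(*transactions))
min_count = min_sup * len(transactions)

for size in range(1, len(items) + 1):
    frequent = []
    for cand in combinations(items, size):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_count:
            frequent.append((list(cand), count))
    if not frequent:
        break
    print("-+- L" + str(size) + " -+-")
    for itemset, count in frequent:
        print(itemset, ":", count)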


Exercise-13
Aim: Write a program to cluster your choice of data using the simple k-means algorithm in
Java (JDK).
/*
Simple K means creating 2 partitions with 2-dimensional dataset in
JAVA
By Ngangbam Indrason, 23 Feb 2019
*/
import java.util.*;
class KmeansJ {

public static void main(String args[]) {


int dataset[][] = {
{2,1},
{5,2},
{2,2},
{4,1},
{4,3},
{7,5},
{3,6},
{5,7},
{1,4},
{4,1}
};

int i,j,k=2;
int part1[][] = new int[10][2];
int part2[][] = new int[10][2];
float mean1[][] = new float[1][2];


float mean2[][] = new float[1][2];


float temp1[][] = new float[1][2], temp2[][] = new
float[1][2];
int sum11 = 0, sum12 = 0, sum21 = 0, sum22 = 0;
double dist1, dist2;
int i1 = 0, i2 = 0, itr = 0;

// Printing the dataset


System.out.println("Dataset: ");
for(i=0;i<10;i++) {
System.out.println(dataset[i][0]+" "+dataset[i][1]);
}

System.out.println("\nNumber of partitions: "+k);

// Assuming (2,2) and (5,7) are random means


mean1[0][0] = 2;
mean1[0][1] = 2;
mean2[0][0] = 5;
mean2[0][1] = 7;

// Loop till the new mean and previous mean are same
while(!Arrays.deepEquals(mean1, temp1) ||
!Arrays.deepEquals(mean2, temp2)) {

// Emptying the partitions


for(i=0;i<10;i++) {
part1[i][0] = 0;

part1[i][1] = 0;
part2[i][0] = 0;
part2[i][1] = 0;
}

i1 = 0; i2 = 0;

// Finding the distance between each mean and each data point and
// storing the data point in the corresponding partition
for(i=0;i<10;i++) {
dist1 = Math.sqrt(Math.pow(dataset[i][0] -
mean1[0][0],2) + Math.pow(dataset[i][1] - mean1[0][1],2));
dist2 = Math.sqrt(Math.pow(dataset[i][0] -
mean2[0][0],2) + Math.pow(dataset[i][1] - mean2[0][1],2));

if(dist1 < dist2) {


part1[i1][0] = dataset[i][0];
part1[i1][1] = dataset[i][1];

i1++;
}
else {
part2[i2][0] = dataset[i][0];
part2[i2][1] = dataset[i][1];

i2++;
}
}

//Storing the previous mean



temp1[0][0] = mean1[0][0];
temp1[0][1] = mean1[0][1];
temp2[0][0] = mean2[0][0];
temp2[0][1] = mean2[0][1];

//Finding new mean for new partitions


sum11 = 0; sum12 = 0; sum21 = 0; sum22 = 0;
for(i=0;i<i1;i++) {
sum11 += part1[i][0];
sum12 += part1[i][1];
}
for(i=0;i<i2;i++) {
sum21 += part2[i][0];
sum22 += part2[i][1];
}
mean1[0][0] = (float)sum11/i1;
mean1[0][1] = (float)sum12/i1;
mean2[0][0] = (float)sum21/i2;
mean2[0][1] = (float)sum22/i2;

itr++;
}

System.out.println("\nFinal Partition: ");


System.out.println("Part1:");
for(i=0;i<i1;i++) {
System.out.println(part1[i][0]+" "+part1[i][1]);
}
System.out.println("\nPart2:");
for(i=0;i<i2;i++) {

System.out.println(part2[i][0]+" "+part2[i][1]);
}
System.out.println("\nFinal Mean: ");
System.out.println("Mean1 : "+mean1[0][0]+" "+mean1[0][1]);
System.out.println("Mean2 : "+mean2[0][0]+" "+mean2[0][1]);
System.out.println("\nTotal Iteration: "+itr);
}
}
Here is the output of the above program.
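For a quick cross-check (an independent sketch, not the author's program), the same ten points and the same starting means (2, 2) and (5, 7) can be clustered with NumPy:

# Illustrative NumPy re-implementation of the 2-means clustering above
import numpy as np

data = np.array([[2, 1], [5, 2], [2, 2], [4, 1], [4, 3],
                 [7, 5], [3, 6], [5, 7], [1, 4], [4, 1]], dtype=float)
means = np.array([[2, 2], [5, 7]], dtype=float)   # same starting means as the Java code

while True:
    # assign each point to the nearest mean (Euclidean distance)
    dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute the mean of each of the two partitions
    new_means = np.array([data[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_means, means):
        break
    means = new_means

print("Partition 1:\n", data[labels == 0])
print("Partition 2:\n", data[labels == 1])
print("Final means:\n", means)   # expected to be roughly (3.14, 2.0) and (5.0, 6.0)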


EXERCISE-14
Aim: Perform K-means clustering on a small two-dimensional dataset using scikit-learn,
choose the number of clusters with the elbow method, and visualize the results using
matplotlib.
import matplotlib.pyplot as plt

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

plt.scatter(x, y)
plt.show()

OUTPUT:

from sklearn.cluster import KMeans

data = list(zip(x, y))
inertias = []


for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()


kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

plt.scatter(x, y, c=kmeans.labels_)
plt.show()


import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = list(zip(x, y))
print(data)
inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
OUTPUT:

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()
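To inspect the clustering numerically as well (an optional addition, not in the original exercise), the fitted model's centers, labels, and inertia can be printed; the fixed random_state is an added assumption for repeatability:

# Optional numeric inspection of the k=2 model (illustrative addition)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)

print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
print("Inertia:", kmeans.inertia_)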

EXERCISE-15
Aim: Write a program to compute and display a dissimilarity matrix (for your own dataset
containing at least four instances with two attributes) using Python.
#!/usr/bin/env python
from math import *
from decimal import Decimal

class Similarity():
    """ Five similarity measures function """

    def euclidean_distance(self, x, y):
        """ return euclidean distance between two lists """
        return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))

    def manhattan_distance(self, x, y):
        """ return manhattan distance between two lists """
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski_distance(self, x, y, p_value):
        """ return minkowski distance between two lists """
        return self.nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

    def nth_root(self, value, n_root):
        """ returns the n_root of a value """
        root_value = 1 / float(n_root)
        return round(Decimal(value) ** Decimal(root_value), 3)

    def cosine_similarity(self, x, y):
        """ return cosine similarity between two lists """
        numerator = sum(a * b for a, b in zip(x, y))
        denominator = self.square_rooted(x) * self.square_rooted(y)
        return round(numerator / float(denominator), 3)

    def square_rooted(self, x):
        """ return 3 rounded square rooted value """
        return round(sqrt(sum([a * a for a in x])), 3)

    def jaccard_similarity(self, x, y):
        """ returns the jaccard similarity between two lists """
        intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
        union_cardinality = len(set.union(*[set(x), set(y)]))
        return intersection_cardinality / float(union_cardinality)
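The listing above only defines pairwise measures; a small driver (an assumption, since the original driver is not shown) that builds and displays the Euclidean dissimilarity matrix for four two-attribute instances could look like this:

# Illustrative driver: Euclidean dissimilarity matrix for four 2-attribute instances
instances = [(1.0, 2.0), (3.0, 5.0), (2.0, 0.0), (4.0, 4.0)]   # assumed sample dataset

measures = Similarity()
n = len(instances)

# matrix[i][j] = distance between instance i and instance j (zero diagonal, symmetric)
matrix = [[round(measures.euclidean_distance(instances[i], instances[j]), 3)
           for j in range(n)] for i in range(n)]

for row in matrix:
    print(row)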
Output:


EXERCISE-16
Aim: Visualize datasets using matplotlib in Python (histogram, box plot, bar chart, pie
chart, etc.).
ANS)
Data visualization is the presentation of data in an accessible manner through visual tools
like graphs or charts. These visualizations aid the process of communicating insights and
relationships within the data, and are an essential part of data analysis. Here, we treat
Matplotlib, which is the most popular data visualization library for the Python programming
language.
Contents
1. Preliminaries
2. Scatter plots
3. Bar charts
4. Histograms
5. Boxplots
1. Preliminaries
Matplotlib is a very well documented package. To make the plotting easier, we make use of
the pyplot module, that makes Matplotlib work like MATLAB. Essentially, all its
functioning can be found HERE. The point of this article is to state its main and most
important functions and give examples on how to use pyplot, as the documentation
sometimes is pretty hard to navigate through.
To call the package module, we begin our code with import matplotlib.pyplot as plt.
Below, we state some of the most important functions when using pyplot:
plt.title: Set a title, which appears above the plot.
plt.grid: Configure the grid lines in the figure. To enable grid lines in the plot, use
plt.grid(True).
plt.legend: Place a legend in the figure.
plt.xlabel and plt.ylabel: Set labels for the axes. For example, plt.xlabel(“Age”)
sets “Age” as the label for the x-axis.

plt.xlim and plt.ylim: Set the limit ranges for the axes. So, e.g., plt.xlim([-50,
50]) limits the plot to show only x-values between -50 and 50.
plt.show: Use at the end to display everything in the plot.
To demonstrate the usage of these functions, let us consider a simple example where we want
to draw two simple lines. This can be achieved by using the plt.plot function. By using the
following code
import matplotlib.pyplot as plt

x = [1, 2, 3]
x2 = [1, 3, 3.5]
y = [-4, 0, 8]
y2 = [-2, 3, -1]

plt.plot(x, y, label='Line 1')
plt.plot(x2, y2, label='Line 2')
plt.title("Plotting two simple lines")
plt.grid(True)
plt.xlabel("X-label")
plt.ylabel("Y-label")
plt.xlim([0, 4])
plt.ylim([-5, 10])
plt.legend()
plt.show()


2. Scatter Plots
Now that we have seen the basic pyplot functions, we will start with our first type of plot.
The scatter plot displays values of, typically, two variables for a set of data. Such plots can
be very informative when exploring relationships between pairs of variables.
Consider the following Salary dataset (from Kaggle), which contains 30 observations
consisting of years of working experience and the annual wage (in dollars). To create a
scatter plot, we make use of the plt.scatter function. Then, we can plot these data points
as follows:
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("Salary_data.csv")  # load dataset
X = data["YearsExperience"]
Y = data["Salary"]

plt.scatter(X, Y)
plt.title("Scatter Plot")
plt.xlabel("Working Experience (years)")

plt.ylabel("Annual Wage (dollars)")

plt.show()
We are also able to, for example, distinguish between observations that have more than 5
years of working experience and observations that have less than 5 years of working
experience by using different colors. To do this, we create two scatter plots by using the
relevant data splits and display them in one single plot. The following code results in the
desired plot:
X_1 = X[X > 5]
X_2 = X[X <= 5]
Y_1 = Y[X > 5]
Y_2 = Y[X <= 5]

plt.scatter(X_1, Y_1, label='Years of experience > 5')
plt.scatter(X_2, Y_2, label='Years of experience <= 5')
plt.title("Scatter Plot (split)")
plt.legend()
plt.xlabel("Working Experience (years)")
plt.ylabel("Annual Wage (dollars)")
plt.show()


3. Bar Charts
A bar chart graphically displays categorical data with rectangular bars of different heights,
where the heights or lengths of the bars represent the values of the corresponding measure.
Let us now consider the Iris dataset, where each observation belongs to one of three
iris flower classes. Assume we want to visualize the average value for each feature of the
Setosa iris class. We can do this by using a bar chart, requiring the plt.bar function. The
following code results in the desired bar chart figure:
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_iris = iris.data
Y_iris = iris.target

average = X_iris[Y_iris == 0].mean(axis=0)
plt.bar(iris.feature_names, average)
plt.title("Bar Chart Setosa Averages")
plt.ylabel("Average (in cm)")
plt.show()


Furthermore, we are also able to nicely display the feature averages for all three iris flowers,
by placing the bars next to each other. This takes a bit more effort than the standard bar chart.
By using the following code, we obtain the desired plot:
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
X_iris = iris.data
Y_iris = iris.target
n_classes = 3

averages = [X_iris[Y_iris == i].mean(axis=0) for i in range(n_classes)]
x = np.arange(len(iris.feature_names))

fig = plt.figure()
ax = fig.add_subplot()
bar1 = ax.bar(x - 0.25, averages[0], 0.25, label=iris.target_names[0])
bar2 = ax.bar(x, averages[1], 0.25, label=iris.target_names[1])
bar3 = ax.bar(x + 0.25, averages[2], 0.25, label=iris.target_names[2])


ax.set_xticks(x)
ax.set_xticklabels(iris.feature_names)
plt.legend()
plt.title("Bar Chart Iris Averages")
plt.ylabel("Average")
plt.show()


4. Histograms
A histogram is used to give an approximate representation of the distribution of the data,
based on the sample data at hand. A histogram is constructed by using equally sized ‘bins’
(intervals), and counting the number of data points that belong to each bin. Creating
histograms at the start of a new project is very useful to get familiar with the data, and to get
a rough sense of the density of the underlying distribution. To create a histogram, we make
use of the plt.hist function.
To create a basic histogram on the sepal length of all iris flowers, using 20 equal-length bins,
we use the following code:
from sklearn import datasets
import matplotlib.pyplot as plt

bins = 20
iris = datasets.load_iris()
X_iris = iris.data
X_sepal = X_iris[:, 0]

plt.hist(X_sepal, bins)
plt.title("Histogram Sepal Length")
plt.xlabel(iris.feature_names[0])
plt.ylabel("Frequency")
plt.show()


Instead of plotting the histogram for a single feature, we can plot the histograms for all
features. This can be done by creating separate plots, but here, we will make use of subplots,
so that all histograms are shown in one single plot. For this, we make use of the
plt.subplots function. By using the following code, we obtain the plot containing the four
histograms:
from sklearn import datasets
import matplotlib.pyplot as plt

bins = 20
iris = datasets.load_iris()
X_iris = iris.data

fig, axs = plt.subplots(2, 2)
axs[0, 0].hist(X_iris[:, 0])
axs[0, 1].hist(X_iris[:, 1], color='orange')
axs[1, 0].hist(X_iris[:, 2], color='green')
axs[1, 1].hist(X_iris[:, 3], color='red')

i = 0
for ax in axs.flat:
    ax.set(xlabel=iris.feature_names[i], ylabel='Frequency')
    i += 1
plt.show()


5. Boxplots
A boxplot is a convenient way of graphically depicting groups of numerical data through
different statistics. The interpretation of a boxplot is displayed in the figure below:

Here, the median is the middle value of the dataset (not to be confused with the mean); the 25th
percentile is the median of the lower half of the dataset and the 75th percentile is the
median of the upper half of the dataset. The data points that fall outside the whiskers
are plotted as outliers, each shown as a dot.
Once again, consider the Iris flower dataset. A boxplot is created by using the plt.boxplot
function. We will make a boxplot for the sepal length of all iris flowers:
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_iris = iris.data
X_sepal = X_iris[:, 0]

plt.boxplot(X_sepal, labels=[iris.feature_names[0]])
plt.title("Boxplot Sepal Length")
plt.ylabel("cm")
plt.show()


Since all features are of the same measure (namely, in cm), we can plot the boxplots for all
features next to each other in one single plot:
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_iris = iris.data

plt.boxplot(X_iris, labels=[iris.feature_names[0], iris.feature_names[1],
                            iris.feature_names[2], iris.feature_names[3]])
plt.title("Boxplots Iris features")
plt.ylabel("cm")
plt.show()
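The aim also lists pie charts, which the walkthrough above does not reach; a minimal sketch (added for completeness, not taken from the original notes) showing the class distribution of the Iris dataset as a pie chart could be:

# Illustrative pie chart: share of each iris class in the dataset
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
Y_iris = iris.target

# count how many observations belong to each of the three classes
counts = [np.sum(Y_iris == i) for i in range(3)]

plt.pie(counts, labels=iris.target_names, autopct='%1.1f%%')
plt.title("Pie Chart Iris Class Distribution")
plt.show()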
