CCS341 Data Warehousing Lab Manual 2021
INDEX
S.No. | Date | Name of the Experiment | Page No. | Marks | Signature
Introduction
The goal of this lab is to install Weka and become familiar with its interface.
Steps:
2. Open Weka and have a look at the interface. It is an open-source project written in Java
from the University of Waikato.
7. In this lab, we will work with the Iris dataset. To open the Iris dataset, click 'Open file' in the 'Preprocess' tab. From your 'data' folder, select iris.arff and hit Open.
8. To know more about the Iris dataset, open iris.arff in Notepad++ or a similar tool and read the comments.
Class distribution: Iris Setosa (50), Iris Versicolour (50), Iris Virginica (50)
Result:
Thus data exploration and integration using Weka were carried out successfully.
Aim:
To convert a text file to ARFF (Attribute-Relation File Format) using the Weka 3.8.2 tool.
Objectives:
Most of the data collected from public forums is in plain-text format, which the Weka tool cannot read directly. Since Weka (a data mining tool) recognizes data in ARFF format only, the text file must be converted into an ARFF file.
Algorithm:
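The detailed tool steps are not reproduced in this copy. As a rough sketch of what the conversion involves, the Python snippet below writes the ARFF header followed by the comma-separated data rows; the file names input.txt and output.arff are hypothetical and all attributes except the last are assumed to be numeric.

```python
# Minimal sketch: convert a comma-separated text file into ARFF.
# Assumes the first line of input.txt holds attribute names and all
# attributes are numeric except the last one, which is the class.

def text_to_arff(txt_path, arff_path, relation="converted"):
    with open(txt_path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    header, rows = lines[0].split(","), lines[1:]
    classes = sorted({row.split(",")[-1] for row in rows})

    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        for name in header[:-1]:
            out.write(f"@attribute {name} numeric\n")
        out.write(f"@attribute {header[-1]} {{{','.join(classes)}}}\n\n")
        out.write("@data\n")
        out.write("\n".join(rows) + "\n")

# Example call (hypothetical file names):
# text_to_arff("input.txt", "output.arff", relation="iris")
```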
Output:
Result:
Thus, conversion of a text file to ARFF (Attribute-Relation File Format) using the Weka 3.8.2 tool is implemented.
Aim:
To convert ARFF (Attribute-Relation File Format) into text file.
Objectives:
Since the data in the Weka tool is in ARFF file format, we have to convert the ARFF file to text format for further processing.
Algorithm:
1. Open any ARFF file in Weka tool.
2. Save the file in CSV format.
3. Open the CSV file in MS-Excel.
4. Remove any unneeded rows and add a corresponding header to the data.
5. Save it as a text file with the desired delimiter (a scripted equivalent is sketched below).
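For reference, the same conversion can also be scripted outside the GUI. The sketch below is an illustration, not part of the original procedure; employee.arff and employee.txt are hypothetical names.

```python
# Minimal sketch: convert an ARFF file to a delimited text file.
def arff_to_text(arff_path, txt_path, delimiter="\t"):
    attributes, data_started, rows = [], False, []
    with open(arff_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):
                continue                      # skip blanks and comments
            if line.lower().startswith("@attribute"):
                attributes.append(line.split()[1])
            elif line.lower().startswith("@data"):
                data_started = True
            elif data_started:
                rows.append(line.split(","))

    with open(txt_path, "w") as out:
        out.write(delimiter.join(attributes) + "\n")
        for row in rows:
            out.write(delimiter.join(row) + "\n")

# Example call (hypothetical file names):
# arff_to_text("employee.arff", "employee.txt", delimiter=",")
```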
Result:
Thus conversion of ARFF (Attribute-Relation File Format) into a text file is implemented.
Aim:
To apply the concept of Linear Regression for training the given dataset.
Algorithm:
LINEAR REGRESSION:
PROBLEM:
Consider the dataset below, where x is the number of years of working experience of a college graduate and y is the corresponding salary. Build a regression equation and predict the salary of a college graduate whose experience is 10 years.
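The worked dataset appears in the lab as a screenshot and is not reproduced here. As a minimal sketch of the technique, the script below fits y = b0 + b1*x by ordinary least squares on a small made-up experience/salary table (the numbers are assumptions, not the lab's data) and predicts the salary at x = 10.

```python
# Minimal least-squares linear regression sketch (illustrative data).
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # years of experience (assumed)
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]    # salary in thousands (assumed)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# slope b1 = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2),
# intercept b0 = mean_y - b1 * mean_x
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

print(f"Regression equation: y = {b0:.2f} + {b1:.2f}x")
print(f"Predicted salary at 10 years of experience: {b0 + b1 * 10:.2f}")
```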
INPUT:
Output:
Result:
Thus the concept of Linear Regression for training the given dataset is applied and implemented.
Aim:
To apply the Naive Bayes classification for testing the given dataset.
Algorithm:
Example: predict whether a customer will buy a computer or not. Customers are described by two attributes: age and income. X is a 35-year-old customer with an income of 40k. H is the hypothesis that the customer will buy a computer. P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.
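The training records used in the lab are supplied as a screenshot under 'Input Data' below. As a minimal sketch of how the classifier reasons, the script here applies Bayes' theorem to a tiny made-up buys-computer table (the attribute values and counts are assumptions, not the lab's data) and compares the class scores for a customer X.

```python
# Minimal categorical Naive Bayes sketch (illustrative data).
from collections import Counter, defaultdict

# (age_group, income_level) -> buys_computer?   (assumed toy records)
records = [
    ("youth", "high", "no"), ("youth", "medium", "no"),
    ("middle", "high", "yes"), ("senior", "medium", "yes"),
    ("senior", "low", "yes"), ("senior", "low", "no"),
    ("middle", "low", "yes"), ("youth", "medium", "yes"),
    ("middle", "medium", "yes"), ("youth", "low", "no"),
]

classes = Counter(label for *_, label in records)
cond = defaultdict(Counter)          # per-class, per-attribute value counts
for age, income, label in records:
    cond[(0, label)][age] += 1
    cond[(1, label)][income] += 1

def posterior(x, label):
    """P(label) times the product of P(attribute value | label)."""
    p = classes[label] / len(records)
    for i, value in enumerate(x):
        p *= cond[(i, label)][value] / classes[label]
    return p

X = ("middle", "medium")   # e.g. a 35-year-old customer with a 40k income
for label in classes:
    print(f"P({label!r} | X) proportional to {posterior(X, label):.4f}")
```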
Input Data:
Output:
Data:
Result:
Thus the Naive Bayes classification for testing the given dataset is implemented.
Aim:
To perform data preprocessing by handling missing attribute values.
Process: Replacing missing attribute values by the attribute mean. This method is used for data sets with numerical attributes. An example of such a data set is presented in Fig. 4.1.
In this method, every missing value of a numerical attribute is replaced by the arithmetic mean of the known values of that attribute. In the figure, the mean of the known values for Temperature is 99.2, hence all missing Temperature values should be replaced by 99.2. The table with missing attribute values replaced by the mean is presented in the figure. For the symbolic attributes Headache and Nausea, missing values were replaced by the most common value of the attribute, which is what the Replace Missing Values step does.
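A minimal sketch of this replacement rule, written against a small made-up table (the Temperature/Headache/Nausea values are assumptions standing in for the figure's data): numeric gaps get the attribute mean, symbolic gaps get the most common value.

```python
# Minimal sketch: fill missing values with the mean (numeric)
# or the most common value (symbolic), as described above.
from collections import Counter

table = [  # assumed stand-in for the figure's data; None marks a missing value
    {"Temperature": 100.2, "Headache": "yes", "Nausea": "no"},
    {"Temperature": None,  "Headache": "yes", "Nausea": "yes"},
    {"Temperature": 99.6,  "Headache": None,  "Nausea": "no"},
    {"Temperature": 97.8,  "Headache": "no",  "Nausea": None},
]

for attr in table[0]:
    known = [row[attr] for row in table if row[attr] is not None]
    if all(isinstance(v, (int, float)) for v in known):
        fill = sum(known) / len(known)                 # attribute mean
    else:
        fill = Counter(known).most_common(1)[0][0]     # most common value
    for row in table:
        if row[attr] is None:
            row[attr] = fill

print(table)
```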
Result:
Thus preprocessing to handle missing values was performed successfully.
Aim:
To perform data pre-processing by applying filters.
Objectives:
The data collected from public forums has plenty of noise and missing values. Weka provides filters to replace the missing values and remove the noisy data, so that the results will be more accurate.
Algorithm:
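The detailed GUI steps are not reproduced in this copy. As an illustration of one simple noise-filtering rule (not Weka's actual filter implementation), the sketch below drops numeric values that fall outside 1.5 times the interquartile range, a common outlier test; the sample column is made up.

```python
# Illustrative sketch of one simple noise filter: drop numeric values that
# fall outside 1.5 * IQR of the quartiles (a common outlier rule).
import statistics

values = [12, 14, 13, 15, 14, 13, 250, 12, 14, 13]     # assumed noisy column
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = [v for v in values if low <= v <= high]
print("kept:", filtered)          # the value 250 is removed as noise
```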
OUTPUT:
Result:
Thus data pre-processing by applying filters was performed successfully.
OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations:
Roll-up (Drill-up)
Drill-down
Slice and dice
Pivot (rotate)
Roll-up (Drill-up):
Roll-up performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
The data is grouped into countries rather than cities.
When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down:
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When drill-down is performed, one or more dimensions are added to the data cube.
It navigates from less detailed data to more detailed data.
Slice:
The slice operation selects one particular dimension from a given cube and provides a new sub-
cube.
Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data.
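Before the Excel walkthrough, here is a compact sketch of the same operations expressed with pandas on a tiny made-up sales cube; the column names and numbers are assumptions, not the music.cub data used below.

```python
# Minimal pandas sketch of roll-up, drill-down, slice, dice and pivot
# on a tiny assumed sales cube.
import pandas as pd

cube = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "city":    ["Chennai", "Delhi", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "item":    ["mobile", "modem", "mobile", "modem"],
    "sales":   [600, 825, 440, 1560],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = cube.groupby("country")["sales"].sum()

# Drill-down: descend the time hierarchy from quarter to month.
drilldown = cube.groupby(["quarter", "month"])["sales"].sum()

# Slice: fix one dimension (quarter = Q1) to get a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: fix two or more dimensions to get a smaller sub-cube.
dice = cube[(cube["country"] == "India") & (cube["item"] == "mobile")]

# Pivot: rotate the axes, e.g. items as rows and countries as columns.
pivot = cube.pivot_table(index="item", columns="country",
                         values="sales", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")
```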
Now, we will practically implement all these OLAP operations using Microsoft Excel.
4. We got all the music.cub data for analyzing the different OLAP operations. Firstly, we performed the drill-down operation as shown below.
In the above window, we selected the year '2008' in the 'Electronic' category; the Drill-Down option is then automatically enabled in the top navigation options. Clicking the 'Drill-Down' option displays the window below.
Now we are going to perform the roll-up (drill-up) operation. In the above window, we selected the month of January; the Drill-up option is then automatically enabled at the top. Clicking the Drill-up option displays the window below.
The next OLAP operation, slicing, is performed by inserting a slicer, as shown in the top navigation options.
While inserting slicers for the slicing operation, we select two dimensions (e.g. CategoryName and Year) with only one measure (e.g. Sum of Sales). After inserting a slicer and adding a filter (CategoryName: AVANT ROCK and BIG BAND; Year: 2009 and 2010), we get the table shown below.
The dicing operation is similar to slicing. Here we select three dimensions (Category Name, Year, Region Code) and two measures (Sum of Quantity, Sum of Sales) through the 'Insert Slicer' option, and then add a filter for Category Name, Year and Region Code as shown below:
Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order Date - Year) and columns (Values - Sum of Quantity and Sum of Sales) through the navigation bar at the bottom right, as shown below.
Result:
Thus the OLAP operations roll-up, drill-down, slice, dice and pivot are implemented successfully.
ETL (Extract-Transform-Load):
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system into the data warehouse. Currently, ETL includes a cleaning step as a separate step; the sequence is then Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.
PROCESS:
EXTRACT
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a record has
been changed and describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly (a small sketch of this approach follows the list).
• Full extract - some systems are not able to identify which data has been changed at all, so
a full extract is the only way one can get the data out of the system. The full extract requires
keeping a copy of the last extract in the same format in order to be able to identify changes.
Full extract handles deletions as well.
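As a rough sketch of the incremental approach described above (purely illustrative; the customers table, the last_modified column, the source database file and the bookmark file are all assumptions, not part of the lab), an extract job can remember the timestamp it last saw and pull only newer rows on the next run:

```python
# Illustrative incremental-extract sketch: pull only rows changed since
# the previous run, remembering a high-water-mark timestamp in a file.
import sqlite3
from pathlib import Path

BOOKMARK = Path("last_extract.txt")            # assumed bookmark location

def incremental_extract(db_path="source.db"):  # assumed source database
    since = BOOKMARK.read_text().strip() if BOOKMARK.exists() else "1970-01-01 00:00:00"
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT id, name, last_modified FROM customers "
        "WHERE last_modified > ? ORDER BY last_modified",
        (since,),
    ).fetchall()
    con.close()

    if rows:
        BOOKMARK.write_text(rows[-1][-1])      # advance the high-water mark
    return rows

# changed = incremental_extract()
# print(len(changed), "changed records extracted")
```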
When using incremental or full extracts, the extract frequency is extremely important, particularly for full extracts, where the data volumes can be in the tens of gigabytes.
Clean:
The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules such as the ones listed below (a short scripted sketch follows the list):
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to the standard Male/Female/Unknown)
• Converting null values into a standardized Not Available/Not Provided value
• Converting phone numbers and ZIP codes to a standardized form
• Validating address fields and converting them into proper naming, e.g. Street/St/St./Str./Str
• Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street).
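A minimal sketch of the kind of unification rules listed above; the value mappings are assumptions chosen to mirror the examples, not an exhaustive rule set.

```python
# Illustrative cleaning sketch: unify coded values and standardize nulls.
SEX_MAP = {
    "m": "Male", "man": "Male", "male": "Male",
    "f": "Female", "woman": "Female", "female": "Female",
}

def clean_record(record):
    cleaned = {}
    for key, value in record.items():
        if value is None or str(value).strip().lower() in ("", "null", "n/a"):
            cleaned[key] = "Not Available"              # standardized null
        elif key == "sex":
            cleaned[key] = SEX_MAP.get(str(value).strip().lower(), "Unknown")
        elif key == "zip":
            cleaned[key] = str(value).strip().zfill(5)  # pad to 5 digits
        else:
            cleaned[key] = str(value).strip()
    return cleaned

print(clean_record({"name": " Ann ", "sex": "F", "zip": "630", "phone": None}))
```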
Transform:
The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. a conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
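As an illustration of two of these tasks, joining sources and generating surrogate keys, here is a small sketch; the record layouts are made up for the example.

```python
# Illustrative transform sketch: join two assumed sources, assign
# surrogate keys and derive a calculated value.
from itertools import count

orders = [{"order_id": "A1", "cust": "C01", "qty": 3, "unit_price": 250.0}]
customers = {"C01": {"name": "Ann", "country": "IN"}}

surrogate = count(start=1)                      # simple surrogate-key generator
fact_rows = []
for order in orders:
    cust = customers[order["cust"]]             # join the two sources
    fact_rows.append({
        "sales_key": next(surrogate),           # surrogate key
        "customer_name": cust["name"],
        "country": cust["country"],
        "quantity": order["qty"],
        "revenue": order["qty"] * order["unit_price"],   # derived value
    })

print(fact_rows)
```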
Load:
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and to re-enable them only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency.
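A rough sketch of that pattern using SQLite through Python (the table, index and data are assumptions; other databases disable constraints and indexes with their own commands):

```python
# Illustrative load sketch: drop an index before a bulk insert and
# recreate it afterwards, using an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (product TEXT, qty INTEGER)")
con.execute("CREATE INDEX idx_product ON fact_sales (product)")

rows = [("mobile", 3), ("modem", 5), ("mouse", 9)]   # assumed batch

con.execute("DROP INDEX idx_product")                # drop index for the load
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
con.execute("CREATE INDEX idx_product ON fact_sales (product)")  # rebuild after load
con.commit()

print(con.execute("SELECT COUNT(*) FROM fact_sales").fetchone())
```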
Managing ETL Process:
The ETL process seems quite straightforward. As with every application, there is a possibility that the ETL process fails. This can be caused by a missing extract from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. The staging area should be accessed by the ETL process only; it should never be available to anyone else, particularly not to end users, as it may contain incomplete or in-the-middle-of-processing data and is not intended for presenting data to the end user.
There are many ready-to-use ETL tools on the market. The main benefit of using off-the-shelf
ETL tools is the fact that they are optimized for the ETL process by providing connectors to
common data sources like databases, flat files, mainframe systems, xml, etc. They provide a
means to implement data transformations easily and consistently across various data sources.
This includes filtering, reformatting, sorting, joining, merging, aggregation and other
operations ready to use. The tools also support transformation scheduling, version control,
monitoring and unified metadata management. Some of the ETL tools are even integrated with
BI tools.
Some of the Well Known ETL Tools:
The most well-known commercial tools are Ab Initio, IBM InfoSphere DataStage, Informatica, Oracle Data Integrator, and SAP Data Integrator.
Several open-source ETL tools are also available, such as OpenRefine, Apatar, CloverETL, Pentaho and Talend.
Of these tools, we are going to use the OpenRefine 2.8 ETL tool on different sample datasets for extracting, cleaning, transforming and loading data.
Result:
Thus the ETL process is implemented successfully.
AIM:
To design multi-dimensional data models (star schema, snowflake schema and fact constellation) for a marketing enterprise.
Description:
The multi-dimensional model was developed for implementing data warehouses and provides both a mechanism to store data and a way to perform business analysis. The primary components of the dimensional model are dimensions and facts. There are different types of multi-dimensional data models. They are:
Star Schema Model
SnowFlake Schema Model
Fact Constellation Model.
Now, we are going to design these multi-dimensional models for the marketing enterprise. First, we need to build the tables in a database through SQLyog, as shown below.
In the above window, the left-side navigation bar shows a database named sales_dw, in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" for building the multi-dimensional models.
Through Data Sources, we can connect to our MySQL database named "sales_dw". All the tables in that database are then automatically retrieved into this tool for creating multidimensional models.
Through data source views and cubes, we can see the retrieved tables as multi-dimensional models. We also need to add dimensions through the Dimensions option. In general, multidimensional models consist of dimension tables and fact tables.
The entity relationship diagram of a star schema resembles a star, with points radiating from a central table, as seen in the window implemented below in Visual Studio.
Result:
Thus the multidimensional models are created successfully.
EXP.NO: 9
EXPLORE WEKA DATA MINING/MACHINE LEARNING TOOLKIT
DATE:
AIM:
To download, install and explore the WEKA data mining/machine learning toolkit.
(i) Downloading and/or installation of the WEKA data mining toolkit
Procedure:
1. Go to the Weka website, http://www.cs.waikato.ac.nz/ml/weka/, and download the software. On the left-hand side, click on the link that says download.
2. Select the appropriate link corresponding to the version of the software based on your operating system and whether or not you already have a Java VM running on your machine (if you don't know what a Java VM is, then you probably don't).
3. The link will forward you to a site where you can download the software from a mirror site. Save the self-extracting executable to disk and then double-click on it to install Weka. Answer yes or next to the questions during the installation.
4. Click yes to accept the Java agreement if necessary. After you install the program, Weka should appear on your start menu under Programs (if you are using Windows).
5. To run Weka, from the start menu select Programs, then Weka. You will see the Weka GUI Chooser. Select Explorer. The Weka Explorer will then launch.
(ii) Understand the features of the WEKA toolkit, such as the Explorer, Knowledge Flow interface, Experimenter and command-line interface.
(iii) The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, this is provided by an alternative launcher called "Main" (class weka.gui.Main).
The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus.
When the Explorer is first started only the first tab is active; the others are greyed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting to
explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status box, the
log button, and the Weka bird) stays visible regardless of which section you are in.
1. Preprocessing
Loading Data:
The first four buttons at the top of the preprocess section enable you to load data into
WEKA:
1. Open file...: Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL...: Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB...: Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)
4. Generate...: Enables you to generate artificial data from a variety of Data Generators.
Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
2. Classification:
Selecting a Classifier
At the top of the Classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right-click (or Alt+Shift+left-click) you can once again copy the setup string to the clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.
Test Options
The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:
1. Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
A small scripted illustration of two of these modes is given below.
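As a rough illustration of the "use training set" and "percentage split" modes outside the Weka GUI (scikit-learn is used here purely for illustration; it is not part of the lab procedure):

```python
# Illustrative sketch: evaluate a classifier on the training set and
# on a percentage split, mirroring two of the test modes above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# "Use training set": train and evaluate on the same instances.
clf = GaussianNB().fit(X, y)
print("training-set accuracy:", clf.score(X, y))

# "Percentage split": hold out 33% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)
clf = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```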
3. Clustering:
Cluster Modes:
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The
first three options are the same as for classification: Use training set, Supplied test set and Percentage split.
4. Associating:
Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.
5. Selecting Attributes:
6. Visualizing:
WEKA's visualization section allows you to visualize 2D plots of the current relation.
Result:
Thus the tools are explored and analyzed successfully.
AIM:
To design fact and dimension tables.
Fact Table :
A fact table is used in the dimensional model in data warehouse design. A fact table is found at the center of a star schema or snowflake schema, surrounded by dimension tables. A fact table consists of the facts of a particular business process, e.g., sales revenue by month by product. Facts are also known as measurements or metrics. A fact table record captures a measurement or a metric.
1. Choose the business process to model – the first step is to decide what business process to model by gathering and understanding business needs and available data.
2. Declare the grain – declaring the grain means describing exactly what a fact table record represents.
3. Choose the dimensions – once the grain of the fact table is stated clearly, it is time to determine the dimensions for the fact table.
4. Identify the facts – identify carefully which facts will appear in the fact table.
The fact table FACT_SALES has a grain that gives us the number of units sold by date, by store and by product.
All other tables, such as DIM_DATE, DIM_STORE and DIM_PRODUCT, are dimension tables.
This schema is known as the star schema. A sketch of these tables is given below.
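A minimal sketch of that star schema as SQL executed through Python's sqlite3 module; the column lists are a simplified assumption, not the lab's exact design.

```python
# Illustrative star-schema sketch: one fact table referencing three
# dimension tables, created in an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DIM_DATE    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE DIM_STORE   (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE DIM_PRODUCT (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

CREATE TABLE FACT_SALES (
    date_key    INTEGER REFERENCES DIM_DATE(date_key),
    store_key   INTEGER REFERENCES DIM_STORE(store_key),
    product_key INTEGER REFERENCES DIM_PRODUCT(product_key),
    units_sold  INTEGER          -- the measure at the declared grain
);
""")
print(con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```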
Result:
Thus the fact and dimension tables are designed and created successfully.
Aim:
To explore the Knowledge Flow interface of WEKA by creating an employee data set and normalizing it.
Description:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end to WEKA's algorithms. Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available; on the other hand, there are things that can be done in Knowledge Flow but not in the Explorer. Knowledge Flow presents a data-flow interface to WEKA. The user can select WEKA components from a toolbar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data.
Creation of Employee Table:
Procedure:
1) Open Start > Programs > Accessories > Notepad
2) Type the following training data set with the help of Notepad for the Employee table:

@relation employee
@attribute eid numeric
@attribute ename {raj,ramu,anil,sunil,rajiv,sunitha,kavitha,suresh,ravi,ramana,ram,kavya,navya}
@attribute salary numeric
@attribute exp numeric
@attribute address {pdtr,kdp,nlr,gtr}
@data
101,raj,10000,4,pdtr
102,ramu,15000,5,pdtr
103,anil,12000,3,kdp
104,sunil,13000,3,kdp
105,rajiv,16000,6,kdp
106,sunitha,15000,5,nlr
107,kavitha,12000,3,nlr
108,suresh,11000,5,gtr
109,ravi,12000,3,gtr
110,ramana,11000,5,gtr
111,ram,12000,3,kdp
112,kavya,13000,4,kdp
113,navya,14000,5,kdp
Output:
Employee data.
10) Right-click on Normalize and select the Dataset option, then establish a link between Normalize and the Arff Saver.
11) Right-click on the Arff Saver and select the Configure option; a new window will open. Set the path and enter a file name with a .arff extension to save the normalized data.
12) Right-click on the Arff Loader and click the Start Loading option; everything will then be executed one by one.
13) Check whether the output has been created by browsing to the chosen path.
14) Rename the data file as a.arff.
15) Double-click on a.arff; the output will then automatically open in MS-Excel.
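As a rough sketch of what the Normalize step in this flow does (min-max scaling of numeric attributes into the range 0 to 1), the snippet below applies the same idea to the salary column of the employee data; it is an illustration, not a substitute for the Knowledge Flow filter.

```python
# Illustrative min-max normalization sketch for a numeric attribute,
# scaling the salary column of the employee data into [0, 1].
salaries = [10000, 15000, 12000, 13000, 16000, 15000, 12000,
            11000, 12000, 11000, 12000, 13000, 14000]

lo, hi = min(salaries), max(salaries)
normalized = [(s - lo) / (hi - lo) for s in salaries]

for raw, norm in zip(salaries, normalized):
    print(f"{raw} -> {norm:.3f}")
```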
Result:
This program has been successfully executed.