Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

DWDM 2

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

UNIT-2 DATA WAREHOUSE AND OLAP TECHNOLOGY FOR DATA MINING

Data warehousing:
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical
reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves
data cleaning, data integration, and data consolidations.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities −
 Data Extraction − Involves gathering data from multiple heterogeneous sources.
 Data Cleaning − Involves finding and correcting the errors in data.
 Data Transformation − Involves converting the data from legacy format to warehouse
format.
 Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and
building indices and partitions.
 Refreshing − Involves updating from data sources to warehouse.
Metadata:
Metadata is simply defined as data about data. The data that are used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book. In other words, we can say that metadata is the summarized data that leads us to the
detailed data.
In terms of data warehouse, we can define metadata as following −
 Metadata is a road-map to data warehouse.
 Metadata in data warehouse defines the warehouse objects.
 Metadata acts as a directory.
Data Cube:
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and
facts. The dimensions are the entities with respect to which an enterprise preserves the records.
Illustration of Data Cube
Suppose a company wants to keep track of sales records with the help of sales data warehouse
with respect to time, item, branch, and location. These dimensions allow to keep track of
monthly sales and at which branch the items were sold. There is a table associated with each
dimension. This table is known as dimension table. For example, "item" dimension table may
have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of Sales Data for a company with respect to time,
item, and location dimensions.

But here in this 2-D table, we have records with respect to time and item only. The sales for
New Delhi are shown with respect to time, and item dimensions according to type of items sold.
Mr. D GANGADHAR
Associate. Professor
If we want to view the sales data with one more dimension, say, the location dimension, then
the 3-D view would be useful. The 3-D view of the sales data with respect to time, item, and
location is shown in the table below −

The above 3-D table can be represented as 3-D data cube as shown in the following figure −

Data Mart:
Data marts contain a subset of organization-wide data that is valuable to specific groups of
people in an organization. In other words, a data mart contains only those data that is specific to
a particular group. For example, the marketing data mart may contain only data related to items,
customers, and sales. Data marts are confined to subjects.
Points to Remember About Data Marts
 Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
 The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
 The life cycle of data marts may be complex in the long run, if their planning and design
are not organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is departmentally structured data warehouse.
 Data marts are flexible.
The following figure shows a graphical representation of data marts.

Mr. D GANGADHAR
Associate. Professor
Data Warehousing – OLAP:
Online Analytical Processing Server (OLAP) is based on the multidimensional data model. It
allows managers, and analysts to get an insight of the information through fast, consistent, and
interactive access to information. This chapter cover the types of OLAP, operations on OLAP,
difference between OLAP, and statistical databases and OLTP.
Types of OLAP Servers
We have four types of OLAP servers −

 Relational OLAP (ROLAP)


 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers
Relational OLAP:
ROLAP servers are placed between relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses relational or extended-relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.


 Optimization for each DBMS back end.
 Additional tools and services.
Multidimensional OLAP:
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Therefore, many MOLAP server use two levels of data storage representation to handle dense
and sparse data sets.

Mr. D GANGADHAR
Associate. Professor
Hybrid OLAP:
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers higher scalability of
ROLAP and faster computation of MOLAP. HOLAP servers allows to store the large data
volumes of detailed information. The aggregations are stored separately in MOLAP store.
Specialized SQL Servers:
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations:
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −

 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)

Mr. D GANGADHAR
Associate. Professor
Roll-up:
Roll-up performs aggregation on a data cube in any of the following ways −

 By climbing up a concept hierarchy for a dimension


 By dimension reduction
The following diagram illustrates how roll-up works.

Mr. D GANGADHAR
Associate. Professor
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level
of city to the level of country.
 The data is grouped into cities rather than countries.
 When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down:
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −

 By stepping down a concept hierarchy for a dimension


 By introducing a new dimension.
The following diagram illustrates how drill-down works −

Mr. D GANGADHAR
Associate. Professor
 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the level
of month.
 When drill-down is performed, one or more dimensions from the data cube are added.
 It navigates the data from less detailed data to highly detailed data.
Slice:
The slice operation selects one particular dimension from a given cube and provides a new sub-
cube. Consider the following diagram that shows how slice works.

 Here Slice is performed for the dimension "time" using the criterion time = "Q1".
Mr. D GANGADHAR
Associate. Professor
 It will form a new sub-cube by selecting one or more dimensions.
Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three
dimensions.

 (location = "Toronto" or "Vancouver")


 (time = "Q1" or "Q2")
 (item =" Mobile" or "Modem")
Pivot:
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data. Consider the following diagram that shows the pivot
operation.

Mr. D GANGADHAR
Associate. Professor
Mr. D GANGADHAR
Associate. Professor
OLAP vs OLTP
Sr.No. Data Warehouse (OLAP) Operational Database (OLTP)

1 Involves historical processing of Involves day-to-day processing.


information.

2 OLAP systems are used by OLTP systems are used by clerks, DBAs,
knowledge workers such as or database professionals.
executives, managers and analysts.

3 Useful in analyzing the business. Useful in running the business.

4 It focuses on Information out. It focuses on Data in.

5 Based on Star Schema, Snowflake, Based on Entity Relationship Model.


Schema and Fact Constellation
Schema.

6 Contains historical data. Contains current data.

7 Provides summarized and Provides primitive and highly detailed


consolidated data. data.

8 Provides summarized and Provides detailed and flat relational view


multidimensional view of data. of data.

9 Number or users is in hundreds. Number of users is in thousands.

10 Number of records accessed is in Number of records accessed is in tens.


millions.

11 Database size is from 100 GB to 1 Database size is from 100 MB to 1 GB.


TB

12 Highly flexible. Provides high performance.


Data Warehousing – Schemas:
Schema is a logical description of the entire database. It includes the name and description of
records of all record types including all associated data-items and aggregates. Much like a
database, a data warehouse also requires to maintain a schema. A database uses relational
model, while a data warehouse uses Star, Snowflake, and Fact Constellation schema.
Star Schema:
 Each dimension in a star schema is represented with only one-dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

Mr. D GANGADHAR
Associate. Professor
 There is a fact table at the center. It contains the keys to each of four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
Star Schema Definition:
The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows −

define cube sales star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

Snowflake Schema:
 Some dimension tables in the Snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike Star schema, the dimensions table in a snowflake schema are normalized. For
example, the item dimension table in star schema is normalized and split into two
dimension tables, namely item and supplier table.
 Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.

Mr. D GANGADHAR
Associate. Professor
Snowflake Schema Definition:
Snowflake schema can be defined using DMQL as follows −

define cube sales snowflake [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))

Fact Constellation Schema:


 A fact constellation has multiple fact tables. It is also known as galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.

Mr. D GANGADHAR
Associate. Professor
 The sales fact table is same as that in the star schema.
 The shipping fact table has the five dimensions, namely item_key, time_key,
shipper_key, from_location, to_location.
 The shipping fact table also contains two measures, namely dollars sold and units sold.
 It is also possible to share dimension tables between fact tables. For example, time, item,
and location dimension tables are shared between the sales and shipping fact table.
Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
Fact Constellation Schema Definition:
Fact constellation schema can be defined using DMQL as follows −

define cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:

dollars cost = sum(cost in dollars), units shipped = count(*)

Mr. D GANGADHAR
Associate. Professor
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales,
shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales
Three-Tier Data Warehouse Architecture:
Generally a data warehouses adopts a three-tier architecture. Following are the three tiers of the
data warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse database server.
It is the relational database system. We use the back end tools and utilities to feed data
into the bottom tier. These back end tools and utilities perform the Extract, Clean, Load,
and refresh functions.
 Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in
either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional data
to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
 Top-Tier − This tier is the front-end client layer. This layer holds the query tools and
reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of data warehouse

Mr. D GANGADHAR
Associate. Professor
Data Warehouse Models:
From the perspective of data warehouse architecture, we have the following data warehouse
models −

 Virtual Warehouse
 Data mart
 Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build
a virtual warehouse. Building a virtual warehouse requires excess capacity on operational
database servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific
groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales. Data
marts are confined to subjects.
Points to remember about data marts −
 Window-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
 The implementation data mart cycles is measured in short periods of time, i.e., in weeks
rather than months or years.
 The life cycle of a data mart may be complex in long run, if its planning and design are
not organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is departmentally structured data warehouse.
 Data mart are flexible.
Enterprise Warehouse
 An enterprise warehouse collects all the information and the subjects spanning an entire
organization
 It provides us enterprise-wide data integration.
 The data is integrated from operational systems and external information providers.
 This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.

Mr. D GANGADHAR
Associate. Professor

You might also like