Data Warehousing: Data Models and OLAP Operations
Topics Covered
1. Understanding the term “Data Warehousing”

2. Three-tier Decision Support Systems

3. Approaches to OLAP servers

4. Multi-dimensional data model

5. ROLAP

6. MOLAP

7. HOLAP

8. Which to choose: Compare and Contrast

9. Conclusion
Understanding the term Data Warehousing
 Data Warehouse:
Bill Inmon coined the term Data Warehouse in 1990 and defined it this
way: "A warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision
making process". He explained each term in the definition as follows:
 Subject Oriented:
Data that gives information about a particular subject instead of about
a company's ongoing operations.
 Integrated:
Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
 Time-variant:
All data in the data warehouse is identified with a particular time
period.
 Non-volatile:
Data is stable in a data warehouse. More data is added but data is
never removed. This enables management to gain a consistent picture
of the business.
Data Warehouse Architecture
Other important terminology
 Enterprise Data Warehouse
Collects all information about subjects (customers, products, sales, assets,
personnel) that span the entire organization

 Data Mart
Departmental subsets that focus on selected subjects

 Decision Support System (DSS)
Information technology to help the knowledge worker (executive, manager, analyst)
make faster and better decisions

 Online Analytical Processing (OLAP)
An element of decision support systems (DSS)
Three-Tier Decision Support Systems
 Warehouse database server
 Almost always a relational DBMS, rarely flat files
 OLAP servers
 Relational OLAP (ROLAP): extended relational DBMS that maps
operations on multidimensional data to standard relational operators
 Multidimensional OLAP (MOLAP): special-purpose server that
directly implements multidimensional data and operations
 Clients
 Query and reporting tools
 Analysis tools
 Data mining tools
The Complete Decision Support System

[Figure: the complete decision support system. Information sources (operational DBs,
semistructured sources) feed the data warehouse server and data marts (Tier 1) through
extract, transform, load, refresh, etc. The warehouse and data marts serve the OLAP
servers (Tier 2; e.g., MOLAP, ROLAP), which serve the clients (Tier 3): OLAP,
query/reporting, and data mining tools.]
Approaches to OLAP Servers
Three possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
 Relational and specialized relational DBMS to store and manage
warehouse data
 OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
 Array-based storage structures
 Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
 Storing detailed data in RDBMS
 Storing aggregated data in MDBMS
 User access via MOLAP tools
The Multi-Dimensional Data Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”

[Figure: a fact table for measures with columns Prod Code, Time Code, Store Code, Sales, Qty.
Prod Code, Time Code, and Store Code are the key columns joining the fact table to the
dimension tables (Product Info, Time Info, Store Info, ...). Sales and Qty are the
numerical measures.]
ROLAP: Dimensional Modeling Using Relational DBMS
 Special schema design: star, snowflake

 Special indexes: bitmap, multi-table join

 Proven technology (relational model, DBMS); tends to outperform
specialized MDDB, especially on large data sets

 Products
 IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
Star Schema (in RDBMS)
Star Schema Example
The “Classic” Star Schema

 A single fact table, with detail and summary data
 Fact table primary key has only one key column per dimension
 Each key is generated
 Each dimension is a single table, highly de-normalized

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc.,
Region_ID, Region Desc., Regional Mgr., Level
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag,
Resolution, Sequence
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level

Benefits: easy to understand, easy to define hierarchies, reduces the number of physical
joins, low maintenance, very simple metadata
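The star layout above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the schema of any particular product: the snake_case column names are adapted from the slide, only a few attributes per dimension are kept, and every sample row is made up for illustration.

```python
import sqlite3

# In-memory star schema: one fact table whose primary key has exactly
# one generated key column per dimension. All sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE period_dim  (period_key INTEGER PRIMARY KEY, period_desc TEXT, year INTEGER, quarter INTEGER);
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES period_dim(period_key),
    dollars REAL, units INTEGER,
    PRIMARY KEY (store_key, product_key, period_key)
);
INSERT INTO store_dim   VALUES (1, 'Downtown', 'Austin', 'TX'), (2, 'Harbor', 'Boston', 'MA');
INSERT INTO product_dim VALUES (1, 'Cola 1L', 'BrandA'), (2, 'Juice 1L', 'BrandB');
INSERT INTO period_dim  VALUES (1, '1994-Q1', 1994, 1), (2, '1994-Q2', 1994, 2);
INSERT INTO sales_fact  VALUES
    (1, 1, 1, 100.0, 40), (1, 2, 1, 60.0, 20),
    (2, 1, 1,  80.0, 32), (2, 1, 2, 90.0, 36);
""")

# "Sales by store": join the fact table to the one dimension the query
# needs, then group on a dimension attribute.
sales_by_store = conn.execute("""
    SELECT s.store_desc, SUM(f.dollars)
    FROM sales_fact f
    JOIN store_dim s ON s.store_key = f.store_key
    GROUP BY s.store_desc
    ORDER BY s.store_desc
""").fetchall()
print(sales_by_store)
```

Because each dimension is a single de-normalized table, a query touches at most one join per dimension it uses.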
Star Schema with Sample Data
The “Snowflake” Schema
Store Dimension: STORE KEY, Store Description, City, State, District ID, Region_ID,
Regional Mgr.
District (split out of Store): District_ID, District Desc., Region_ID
Region (split out of District): Region_ID, Region Desc., Regional Mgr.
Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
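To see what the snowflake layout costs at query time, here is a small sqlite3 sketch. It assumes a fully normalized store → district → region chain (the slide's Store table still carries some region columns), and all rows and names such as `region_desc` values are hypothetical.

```python
import sqlite3

# Snowflake sketch: the store dimension is normalized into
# store -> district -> region. All sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region   (region_id INTEGER PRIMARY KEY, region_desc TEXT, regional_mgr TEXT);
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT,
                       region_id INTEGER REFERENCES region(region_id));
CREATE TABLE store    (store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT,
                       district_id INTEGER REFERENCES district(district_id));
CREATE TABLE store_fact (store_key INTEGER REFERENCES store(store_key),
                         product_key INTEGER, period_key INTEGER,
                         dollars REAL, units INTEGER, price REAL);
INSERT INTO region   VALUES (1, 'East', 'Manager A'), (2, 'West', 'Manager B');
INSERT INTO district VALUES (10, 'District 10', 1), (20, 'District 20', 2);
INSERT INTO store    VALUES (1, 'Store 1', 'Boston', 'MA', 10),
                            (2, 'Store 2', 'Albany', 'NY', 10),
                            (3, 'Store 3', 'Fresno', 'CA', 20);
INSERT INTO store_fact VALUES (1, 1, 1, 50.0, 5, 10.0),
                              (2, 1, 1, 30.0, 3, 10.0),
                              (3, 1, 1, 70.0, 7, 10.0);
""")

# Rolling dollars up to region now walks the normalized chain:
# three joins instead of the single join a star schema would need.
dollars_by_region = conn.execute("""
    SELECT r.region_desc, SUM(f.dollars)
    FROM store_fact f
    JOIN store    s ON s.store_key   = f.store_key
    JOIN district d ON d.district_id = s.district_id
    JOIN region   r ON r.region_id   = d.region_id
    GROUP BY r.region_desc
    ORDER BY r.region_desc
""").fetchall()
print(dollars_by_region)
```

The trade-off is less redundancy in the dimension tables in exchange for more joins per query.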
Aggregates
 Add up amounts for day 1
 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

Result: 81
Aggregates
 Add up amounts by day
 In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

ans    date  sum
       1     81
       2     48
Another Example
 Add up amounts by day, product
 In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

ans    prodId  date  amt
       p1      1     62
       p2      1     19
       p1      2     48

rollup: from the sale table to the summary
drill-down: from the summary back to the sale table
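The three aggregate queries from these slides can be run as-is against the sample sale table. A minimal sketch using Python's built-in sqlite3 module; the ORDER BY clauses are added here only to make the output order predictable, and the third query selects prodId so that each result row can be told apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
     ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)],
)

# Add up amounts for day 1.
total_day1 = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(total_day1)       # 81
print(by_day)           # [(1, 81), (2, 48)]
print(by_day_product)   # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```

Dropping prodId from the GROUP BY turns the third result into the second, which is the rollup direction; adding it back is the drill-down.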
Points to be noticed about ROLAP
 Defines complex, multi-dimensional data with simple model
 Reduces the number of joins a query has to process
 Allows the data warehouse to evolve with relatively low maintenance
 Can contain both detailed and summarized data.
 ROLAP is based on familiar, proven, and already selected
technologies.
BUT:
 Multi-dimensional manipulation and calculations have to be expressed in SQL.
MOLAP: Dimensional Modeling Using the Multi-Dimensional Model

 MDDB: a special-purpose data model


 Facts stored in multi-dimensional arrays
 Dimensions used to index array
 Sometimes on top of relational DB
 Products
 Pilot, Arbor Essbase, Gentia
The MOLAP Cube

Fact table view:

sale   prodId  storeId  amt
       p1      s1       12
       p2      s1       11
       p1      s3       50
       p2      s2       8

Multi-dimensional cube (dimensions = 2):

       s1   s2   s3
p1     12        50
p2     11   8
3-D Cube

Fact table view:

sale   prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

Multi-dimensional cube (dimensions = 3):

day 1  s1   s2   s3
p1     12        50
p2     11   8

day 2  s1   s2   s3
p1     44   4
p2
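As a rough illustration of "facts stored in multi-dimensional arrays, dimensions used to index the array", the sketch below loads the sample fact rows into a dense 3-D Python list. The names `cube` and `cell` are hypothetical, and real MOLAP servers use far more compact storage than nested lists.

```python
# Fact rows from the slide: (prodId, storeId, date, amt).
SALES = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# Each dimension value is mapped to an array position.
products = sorted({p for p, _, _, _ in SALES})   # ['p1', 'p2']
stores = sorted({s for _, s, _, _ in SALES})     # ['s1', 's2', 's3']
dates = sorted({d for _, _, d, _ in SALES})      # [1, 2]
p_pos = {p: i for i, p in enumerate(products)}
s_pos = {s: i for i, s in enumerate(stores)}
d_pos = {d: i for i, d in enumerate(dates)}

# Dense array cube[date][product][store]; None marks an empty cell.
cube = [[[None] * len(stores) for _ in products] for _ in dates]
for p, s, d, amt in SALES:
    cube[d_pos[d]][p_pos[p]][s_pos[s]] = amt


def cell(prod, store, date):
    """Look a value up by direct array indexing; no join or table scan."""
    return cube[d_pos[date]][p_pos[prod]][s_pos[store]]


filled = sum(v is not None for plane in cube for row in plane for v in row)
total_cells = len(dates) * len(products) * len(stores)
print(cell("p1", "s3", 1), filled, total_cells)   # 50 6 12
```

Only 6 of the 12 cells hold a value, which hints at why sparsity matters for array-based storage.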
Example
[Figure: a sales cube with axes Product (Juice, Milk, Coke, Cream, Soap, Bread),
Store (NY, SF, LA), and Time (M T W Th F S S). Example cell: 56 units of bread sold in LA on M.
Arrows mark roll-up to brand, roll-up to region, and roll-up to week.]

Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store …
Hierarchies:
Product → Brand → …
Day → Week → Quarter
Store → Region → Country
Cube Aggregation: Roll-up
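Example: computing sums

3-D cube:

day 1  s1   s2   s3
p1     12        50
p2     11   8

day 2  s1   s2   s3
p1     44   4
p2

Sum over days:

       s1   s2   s3
p1     56   4    50
p2     11   8

Then sum over products:

       s1   s2   s3
sum    67   12   50

Or sum over stores:

       sum
p1     110
p2     19

Sum over everything: 129

rollup: from the cube toward the sums
drill-down: from the sums back toward the cube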
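The roll-up sums on this slide can be reproduced with a few lines of plain Python. This is a sketch of the idea only; `roll_up` is a hypothetical helper that sums amt over every dimension not listed in `keep`.

```python
from collections import defaultdict

DIMS = ("prodId", "storeId", "date")
SALES = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]


def roll_up(facts, keep):
    """Sum amt grouped by the dimensions in `keep`; all others are summed away."""
    totals = defaultdict(int)
    for *coords, amt in facts:
        named = dict(zip(DIMS, coords))
        totals[tuple(named[d] for d in keep)] += amt
    return dict(totals)


by_prod_store = roll_up(SALES, ("prodId", "storeId"))  # sum over days
by_store = roll_up(SALES, ("storeId",))                # then over products
by_prod = roll_up(SALES, ("prodId",))                  # or over stores
grand_total = roll_up(SALES, ())[()]                   # over everything

print(by_prod_store[("p1", "s1")], by_store, by_prod, grand_total)
```

Drill-down is the reverse move: calling `roll_up` again with more dimensions in `keep`.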
Cube Operators for Roll-up

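[Figure: the 3-D sales cube (day 1, day 2) with its sums over days, products, and stores;
individual cells are labeled with cube operators of the form sale(store, product, date),
where * means "all values".]

sale(s1,*,*) = 67    (store s1, all products, all days)
sale(s2,p2,*) = 8    (store s2, product p2, all days)
sale(*,*,*) = 129    (grand total)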
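The operator notation can be mimicked with a small function in which "*" acts as a wildcard. A sketch only: it scans the fact rows on every call, whereas a cube server would look up a stored aggregate.

```python
# Fact rows from the slides: (prodId, storeId, date, amt).
SALES = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]


def sale(store="*", product="*", date="*"):
    """Sum amt over all facts matching the arguments; '*' matches anything.

    Argument order follows the slide notation sale(store, product, date).
    """
    return sum(
        amt
        for p, s, d, amt in SALES
        if store in ("*", s) and product in ("*", p) and date in ("*", d)
    )


print(sale("s1", "*", "*"))   # 67
print(sale("s2", "p2", "*"))  # 8
print(sale("*", "*", "*"))    # 129
```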
Extended Cube

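Each table gains a * row and a * column holding the sums; the third table (*) sums over days.

day 1  s1   s2   s3   *
p1     12        50   62
p2     11   8         19
*      23   8    50   81

day 2  s1   s2   s3   *
p1     44   4         48
p2
*      44   4         48

*      s1   s2   s3   *
p1     56   4    50   110
p2     11   8         19
*      67   12   50   129

sale(*,p2,*) = 19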
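One way to picture the extended cube is to materialize every aggregate up front, storing "*" as an ordinary coordinate value. The sketch below does this in plain Python; the dictionary `extended` is a hypothetical stand-in for the cube's storage.

```python
from collections import defaultdict
from itertools import product as cartesian

# Fact rows from the slides: (prodId, storeId, date, amt).
SALES = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# Keys are (store, product, date); any coordinate may be "*".
extended = defaultdict(int)
for p, s, d, amt in SALES:
    # Each fact contributes to 2**3 cells: every way of starring its coordinates.
    for mask in cartesian((False, True), repeat=3):
        key = tuple("*" if starred else value
                    for value, starred in zip((s, p, d), mask))
        extended[key] += amt

print(extended[("*", "p2", "*")])   # 19
print(extended[("s1", "*", 1)])     # 23
print(extended[("*", "*", "*")])    # 129
print(len(SALES), len(extended))    # 6 27
```

Six facts become 27 stored values here, a small-scale view of the storage overhead that full pre-aggregation brings.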
Aggregation Using Hierarchies

Hierarchy: store → region → country

3-D cube at store level:

day 1  s1   s2   s3
p1     12        50
p2     11   8

day 2  s1   s2   s3
p1     44   4
p2

Rolled up to region (summed over days):

       region A  region B
p1     56        54
p2     11        8

(store s1 in Region A;
stores s2, s3 in Region B)
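Rolling up along a hierarchy amounts to replacing each store by its parent region before summing. A minimal sketch using the store-to-region assignment given on the slide; the name `REGION_OF` is hypothetical.

```python
from collections import defaultdict

# Fact rows from the slides: (prodId, storeId, date, amt).
SALES = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

# One level of the store -> region -> country hierarchy.
REGION_OF = {"s1": "A", "s2": "B", "s3": "B"}

# Map each store to its region, then sum over stores and days.
by_product_region = defaultdict(int)
for p, s, _date, amt in SALES:
    by_product_region[(p, REGION_OF[s])] += amt

for key in sorted(by_product_region):
    print(key, by_product_region[key])
```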
Points to be noticed about MOLAP

 Pre-calculating or pre-consolidating transactional data improves speed.

BUT
When fully pre-consolidating incoming data, MDDs require an enormous amount
of overhead in both processing time and storage. An input file of 200MB
can easily expand to 5GB.

MDDs are great candidates for departmental data marts under 50GB.

 Rolling up and drilling down through aggregate data.

 With MDDs, application design is essentially the definition of dimensions
and calculation rules, while the RDBMS requires that the database schema be
a star or snowflake.
Cube Definition Syntax (BNF) in DMQL

 Cube Definition (Fact Table)


define cube <cube_name> [<dimension_list>]:
<measure_list>
 Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
 Special Case (Shared Dimension Tables)
 First time as “cube definition”
 define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>

29 Data Mining: Concepts and Techniques November 19, 2019


Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
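The three measures in this cube definition correspond to ordinary SQL aggregates. The sketch below shows one possible relational reading using sqlite3; it is an assumption about how the definition could map onto tables, not part of DMQL itself. The table name `sales_fact` and all rows are hypothetical, and only the item dimension is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Item dimension with the attributes listed in the DMQL definition.
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
-- Hypothetical fact table: one key per dimension plus the raw sales amount.
CREATE TABLE sales_fact (time_key INTEGER,
                         item_key INTEGER REFERENCES item(item_key),
                         branch_key INTEGER, location_key INTEGER,
                         sales_in_dollars REAL);
INSERT INTO item VALUES (1, 'TV', 'BrandA', 'video', 'wholesale'),
                        (2, 'Radio', 'BrandB', 'audio', 'retail');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 400.0),
                              (2, 1, 1, 1, 600.0),
                              (1, 2, 1, 1, 50.0);
""")

# dollars_sold = sum(...), avg_sales = avg(...), units_sold = count(*),
# here grouped by one attribute of the item dimension.
measures_by_brand = conn.execute("""
    SELECT i.brand,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact f
    JOIN item i ON i.item_key = f.item_key
    GROUP BY i.brand
    ORDER BY i.brand
""").fetchall()
print(measures_by_brand)
```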

Hybrid OLAP (HOLAP)
 HOLAP = Hybrid OLAP:

 Best of both worlds

 Storing detailed data in RDBMS

 Storing aggregated data in MDBMS

 User access via MOLAP tools


Data Flow in HOLAP
[Figure: data flow in HOLAP between three components: RDBMS Server, MDBMS Server, and
Client (Multidimensional Viewer, Relational Viewer). Data labels: user data, meta data,
derived data, multidimensional data. Flow labels: multidimensional access, SQL-Read,
SQL-Reach-Through.]
When deciding which technology to go for, consider:

1) Performance:

 How fast will the system appear to the end-user?

 MDD server vendors believe this is a key point in their favor.

2) Data volume and scalability:

 While MDD servers can handle up to 50GB of storage, RDBMS servers can
handle hundreds of gigabytes and terabytes.
An experiment with the Relational and Multidimensional models on a data set

Measure                                              Relational  Multidimensional  Improvement
Disk space requirement (gigabytes)                   17          10                1.7
Retrieve the corporate measures Actual vs Budget,
by month (I/Os)                                      240         1                 240
Calculation of Variance Budget/Actual for the
whole database (I/O time in hours)                   237         2*                110*

* This may include the calculation of many other derived data without any
additional I/O.
What-if analysis
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You’re developing a general-purpose application for inventory movement or assets management
THEN
Consider an MDD/MOLAP solution for your data mart

IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN
Consider an RDBMS/ROLAP solution for your data mart.

IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN
Consider a HOLAP solution for your data mart
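The three IF/THEN checklists can be turned into a rough scoring aid. The sketch below is only one possible reading: the slides do not say how to weigh the conditions, so scoring each option by the fraction of its conditions that a project meets is an assumption made here, and the names `RULES`, `score`, and `suggest` are hypothetical.

```python
# Conditions paraphrased from the three checklists on the slide.
RULES = {
    "MOLAP": {
        "write access", "data under 50 GB", "60-90 day timetable",
        "lowest level already aggregated", "access on aggregated level",
        "general-purpose application",
    },
    "ROLAP": {
        "data over 100 GB", "read-only", "historical data at lowest granularity",
        "detailed access, long-running queries",
        "data assigned to lowest level elements",
    },
    "HOLAP": {
        "OLAP on aggregated and detailed data", "different user groups",
        "ease of use and detailed data",
    },
}


def score(profile):
    """Fraction of each option's conditions found in `profile` (an assumed weighting)."""
    return {name: len(conds & profile) / len(conds) for name, conds in RULES.items()}


def suggest(profile):
    """Return the option with the highest score; ties go to the first listed."""
    scores = score(profile)
    return max(scores, key=scores.get)


print(suggest({"data over 100 GB", "read-only"}))
print(suggest({"write access", "data under 50 GB", "60-90 day timetable"}))
print(suggest({"different user groups", "OLAP on aggregated and detailed data"}))
```

A result from this function is a starting point for discussion, not a substitute for weighing the requirements.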
Examples
 ROLAP
 Telecommunication startup: call data records (CDRs)
 ECommerce Site
 Credit Card Company
 MOLAP
 Analysis and budgeting in a financial department
 Sales analysis
 HOLAP
 Sales department of a multi-national company
 Banks and Financial Service Providers
Tools available
 ROLAP:
 ORACLE 8i
 ORACLE Reports; ORACLE Discoverer
 ORACLE Warehouse Builder
 MicroStrategy’s DSS Server

 MOLAP:
 ORACLE Express Server
 ORACLE Express Clients (C/S and Web)
 Arbor Software’s Essbase
 Platinum Technology’s Platinum InfoBeacon

 HOLAP:
 ORACLE 8i
 ORACLE Express Server
 ORACLE Relational Access Manager
 ORACLE Express Clients (C/S and Web)
Conclusion
 ROLAP: RDBMS -> star/snowflake schema

 MOLAP: MDD -> Cube structures

 ROLAP or MOLAP: Data models used play major role in performance differences

 MOLAP: for summarized and relatively small volumes of data (10-50GB)

 ROLAP: for detailed and larger volumes of data

 Both storage methods have strengths and weaknesses

 The choice is requirement specific, though currently data warehouses are predominantly built
using RDBMSs/ROLAP.
References
 http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
 OLAP, Relational, and Multidimensional Database Systems, by George Colliat,
Arbor Software Corporation

 http://www.donmeyer.com/art3.html
 Data warehousing Services, Data Mining & Analysis, LLC

 http://www.cs.man.ac.uk/~franconi/teaching/2001/CS636/CS636-olap.ppt
 Data Warehouse Models and OLAP Operations, by Enrico Franconi

 http://www.promatis.com/mediacenter/papers
 ROLAP, MOLAP, HOLAP: How to determine which technology is appropriate,
by Holger Frietch, PROMATIS Corporation
