Data Warehousing and OLAP Technology For Data Mining: What Is A Data Warehouse?
“Heterogeneities are everywhere”
[Figure: Personal Databases, World Wide Web, Scientific Databases, Digital Libraries]
Different interfaces
Different data representations
Duplicate and inconsistent information
CS 336 3
Problem: Data Management in Large Enterprises
Goal: Unified Access to Data
[Figure: an Integration System sits over the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
Two Approaches:
Query-Driven (Lazy)
Warehouse (Eager)
The Traditional Research Approach
[Figure: one Wrapper per Source (Wrapper, ..., Wrapper over Source, ..., Source)]
Disadvantages of Query-Driven Approach
The Warehousing Approach
Information integrated in advance
Stored in the warehouse for direct querying and analysis
[Figure: Clients query the Data Warehouse; one Extractor/Monitor per Source feeds it]
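The contrast between the query-driven (lazy) and warehouse (eager) approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the two sources, their field names, the wrapper functions, and the `Warehouse` class are all hypothetical and do not come from the slides.

```python
# Two hypothetical sources that describe the same kind of fact differently.
SOURCE_A = [{"cust": "Ann", "amount_usd": 120.0}, {"cust": "Bob", "amount_usd": 80.0}]
SOURCE_B = [{"customer_name": "ann", "amount_cents": 3000}]


def wrap_a(rows):
    """Hypothetical wrapper: translate source A rows into a common format."""
    return [{"customer": r["cust"].lower(), "amount": r["amount_usd"]} for r in rows]


def wrap_b(rows):
    """Hypothetical wrapper: translate source B rows into a common format."""
    return [{"customer": r["customer_name"].lower(), "amount": r["amount_cents"] / 100}
            for r in rows]


def total_by_customer(rows):
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def query_driven_total(customer):
    """Lazy: translate and combine at query time, touching every source."""
    rows = wrap_a(SOURCE_A) + wrap_b(SOURCE_B)
    return total_by_customer(rows).get(customer, 0.0)


class Warehouse:
    """Eager: integrate in advance, then answer from the stored copy."""

    def __init__(self):
        self.totals = {}

    def refresh(self):
        # Extraction and integration happen here, not at query time.
        self.totals = total_by_customer(wrap_a(SOURCE_A) + wrap_b(SOURCE_B))

    def total(self, customer):
        return self.totals.get(customer, 0.0)


warehouse = Warehouse()
warehouse.refresh()
print(query_driven_total("ann"), warehouse.total("ann"))
```

In this sketch the warehouse answers without contacting the sources, but its answer can be stale until the next `refresh()`; the query-driven function is always current but pays the integration cost on every query.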
Definitions
Data Warehouse
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
Subject-oriented: e.g. customers, patients, students, products
Integrated: consistent naming conventions, formats,
Data Mart
A data warehouse that is limited in scope
Data Warehouse—Subject-Oriented
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
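As a rough illustration of what consistent naming conventions and encodings can mean in practice, the sketch below maps records from two sources onto one attribute name and one code set. The field names, the code table, and the `reconcile` function are hypothetical examples, not part of the slides.

```python
# Hypothetical mapping from source-specific attribute names to warehouse names.
FIELD_MAP = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "sex": "gender",
    "Gender": "gender",
}

# Hypothetical mapping from source-specific encodings to a single code set.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}


def reconcile(record):
    """Rename attributes and normalize encodings for one source record."""
    out = {}
    for name, value in record.items():
        target = FIELD_MAP.get(name, name)
        if target == "gender":
            # Unknown codes become None instead of being guessed.
            value = GENDER_CODES.get(str(value).strip().lower())
        out[target] = value
    return out


print(reconcile({"cust_id": 7, "sex": "male"}))
print(reconcile({"CustomerID": 7, "Gender": 1}))
```

Both calls produce the same warehouse record, which is the point of applying integration before loading.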
Security, no auditing
Not Either-Or Decision
numbers of sources
Clients with unpredictable needs
Definition of Data Warehouse: A Practitioner's Viewpoint
Generic Warehouse Architecture
[Figure labels: Client, Client; Query & Analysis; Warehouse; Metadata; Maintenance; Integrator; Optimization]
Data Warehouse Architectures: Conceptual View
Single-layer
[Figure: operational systems and informational systems both draw on real-time data]
Two-layer
Real-time + derived data
Most commonly used approach in industry today
[Figure: operational systems over real-time data; informational systems over derived data]
Three-layer Architecture: Conceptual View
[Figure: operational systems and informational systems over three data layers]
Derived Data: view level, “particular informational needs”
Reconciled Data: physical implementation of the data warehouse
Real-time data
Data Warehouse: Concepts and Techniques, March 5, 2022
Data Warehousing and OLAP Technology
for Data Mining
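0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier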
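The lattice of cuboids is every subset of the dimension set, grouped by size. A minimal sketch follows; the function name `cuboid_lattice` is an assumption chosen for illustration.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Map k to the list of k-D cuboids (each cuboid is a tuple of dimensions)."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)
```

Level 0 holds the single empty grouping (the apex cuboid, "all") and level 4 holds the base cuboid; for n dimensions without hierarchies there are 2^n cuboids in total.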
Conceptual Modeling of Data Warehouses
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
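To make the star layout concrete, here is a small sketch using Python's built-in `sqlite3` module. It is an illustration under stated assumptions: the table names with a `_dim`/`_fact` suffix are my own convention, the branch dimension and some columns are left out for brevity, and every row of data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    province_or_state TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 5, 1, "Q1", 2022), (2, 9, 4, "Q2", 2022)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV-A", "BrandX", "TV", "wholesale")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "1 Main St", "Vancouver", "BC", "Canada"),
                  (2, "2 Oak Ave", "Chicago", "IL", "USA")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 2, 800.0), (1, 1, 2, 1, 400.0),
                  (2, 1, 1, 3, 1200.0), (1, 1, 1, 1, 400.0)])

# A typical star join: the fact table joined to two dimensions, then grouped.
QUERY = """
SELECT t.quarter, l.country, SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN location_dim AS l ON f.location_key = l.location_key
GROUP BY t.quarter, l.country
ORDER BY t.quarter, l.country
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each dimension is a single denormalized table reached by one join from the fact table, which is what gives the schema its star shape.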
Example of Snowflake Schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
Example of Fact Constellation
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key
Shipping Fact Table: time_key, item_key, shipper_key, from_location
Specification of hierarchies
Schema hierarchy
day < {month < quarter; week} < year
Set_grouping hierarchy
{1..10} < inexpensive
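The two kinds of hierarchy can be sketched as follows. The schema hierarchy maps a day to each coarser level; the set-grouping hierarchy maps value ranges to labels. Only the group `{1..10} < inexpensive` comes from the slide; the function names and the use of ISO week numbers are assumptions made for this illustration.

```python
from datetime import date


def schema_hierarchy_levels(d):
    """Return the value of each level of day < {month < quarter; week} < year.

    Week is a separate branch because a week does not nest inside a month.
    ISO week numbering is an assumption made here for illustration.
    """
    return {
        "day": d.isoformat(),
        "week": d.isocalendar()[1],
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }


# Set-grouping hierarchy: only this one group appears on the slide.
SET_GROUPING = {"inexpensive": range(1, 11)}  # {1..10} < inexpensive


def price_group(price):
    """Return the group label for an integer price, or None if no group matches."""
    for label, values in SET_GROUPING.items():
        if price in values:
            return label
    return None


levels = schema_hierarchy_levels(date(2022, 3, 5))
print(levels)
print(price_group(7), price_group(11))
```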
Multidimensional Data
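Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
[Figure: cube with axes Product, Region, and Month; hierarchy levels visible in the figure include Office, Day, and Month]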
A Sample Data Cube
[Figure: cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
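Every cuboid can be derived from the base cuboid by summing away the dimensions that are not kept. The sketch below shows this with a tiny base cuboid; the cell values are made up for illustration and the `cuboid` function is a hypothetical helper.

```python
from collections import defaultdict

DIMENSIONS = ("product", "date", "country")

# Made-up base cuboid cells: (product, date, country) -> sales.
BASE_CUBOID = {
    ("TV", "1Qtr", "U.S.A."): 10,
    ("TV", "2Qtr", "U.S.A."): 20,
    ("TV", "1Qtr", "Canada"): 3,
    ("PC", "1Qtr", "Canada"): 5,
    ("VCR", "4Qtr", "Mexico"): 7,
}


def cuboid(group_by):
    """Aggregate the base cuboid onto the given subset of dimensions."""
    keep = [DIMENSIONS.index(d) for d in group_by]
    result = defaultdict(int)
    for cell, value in BASE_CUBOID.items():
        result[tuple(cell[i] for i in keep)] += value
    return dict(result)


print(cuboid(()))                      # apex cuboid: one grand total
print(cuboid(("product",)))            # a 1-D cuboid
print(cuboid(("product", "country")))  # a 2-D cuboid
```

In this sketch the cell `("TV", "U.S.A.")` of the product, country cuboid plays the role of "total annual sales of TV in U.S.A." from the sample cube, since the date dimension has been summed out.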
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
Typical OLAP Operations
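[Figure: lines radiating from a center, one per dimension, with abstraction levels marked along each line. Time: ANNUALLY, QTRLY, DAILY. Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE. Other dimensions shown: Location, Promotion, Organization. Other level labels: ORDER, TRUCK, CITY, SALES PERSON, COUNTRY, DISTRICT, REGION, DIVISION.]
Each circle is called a footprint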
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Multi-Tiered Architecture
[Figure: Operational DBs and other sources are extracted, transformed, loaded, and refreshed through a Monitor & Integrator (with Metadata) into the Data Warehouse and Data Marts; an OLAP Server serves Analysis, Query, Reports, and Data mining]
Enterprise warehouse: collects all of the information about subjects spanning the entire organization
Data Mart: a subset of corporate-wide data that is of value to a specific
materialized
Data Warehouse Development: A Recommended Approach
[Figure labels: Multi-Tier Data Warehouse; Distributed Data Marts]
Multi-way Array Aggregation for Cube Computation
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64; chunks 1 to 4 run along A at b0, c0, chunks 13 to 16 at b3, c0, and chunks 61 to 64 at b3, c3]
What is the best traversing order to do multi-way aggregation?
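The idea of multi-way aggregation is to scan the chunks once and update several 2-D aggregates at the same time. Below is a minimal sketch for the 4 x 4 x 4 chunk layout in the figure. It simplifies each chunk to a single number (standing in for the chunk's own total), and the function names are assumptions for illustration; it is not a full implementation of the method.

```python
from collections import defaultdict
from itertools import product

N = 4  # chunks per dimension, matching the 4 x 4 x 4 figure


def chunk_number(a, b, c):
    """Chunk label used in the figure: A varies fastest, then B, then C."""
    return 1 + a + N * b + N * N * c


def multiway_aggregate(chunk_value):
    """Scan chunks in the order 1..64 once, updating AB, AC, and BC together."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    # product() varies its last position fastest, so a changes fastest here.
    for c, b, a in product(range(N), repeat=3):
        v = chunk_value(a, b, c)
        ab[(a, b)] += v  # sums over C
        ac[(a, c)] += v  # sums over B
        bc[(b, c)] += v  # sums over A
    return dict(ab), dict(ac), dict(bc)


# Use the chunk number itself as a stand-in value so results are easy to check.
ab, ac, bc = multiway_aggregate(chunk_number)
print(bc[(0, 0)], ac[(0, 0)], ab[(0, 0)])
```

With this scan order, a BC cell is complete after four consecutive chunks (for example chunks 1 to 4 for b0, c0), while an AB cell is complete only once the last layer of C has been read. How long each partial aggregate must stay in memory therefore depends on the traversal order, which is why the order matters.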
[Figure: Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning and data integration; a Database API with Filtering & Integration above it; Layer 2: MDDB with Meta Data]
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
                   | OLTP                                                | OLAP
users              | clerk, IT professional                              | knowledge worker
function           | day to day operations                               | decision support
DB design          | application-oriented                                | subject-oriented
data               | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage              | repetitive                                          | ad-hoc
access             | read/write; index/hash on prim. key                 | lots of scans
unit of work       | short, simple transaction                           | complex query
# records accessed | tens                                                | millions
# users            | thousands                                           | hundreds
DB size            | 100MB-GB                                            | 100GB-TB
metric             | transaction throughput                              | query throughput, response
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing,