Data Warehouse
Technology
Data Warehouse vs. Heterogeneous DBMS
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
  ■ Major task of traditional relational DBMS
  ■ Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
■ OLAP (on-line analytical processing)
  ■ Major task of data warehouse system
  ■ Data analysis and decision making
■ Distinct features (OLTP vs. OLAP):
  ■ User and system orientation: customer vs. market
  ■ Data contents: current, detailed vs. historical, consolidated
  ■ Database design: ER + application vs. star + subject
  ■ View: current, local vs. evolutionary, integrated
  ■ Access patterns: update vs. read-only but complex queries
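The access-pattern contrast can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are hypothetical and chosen only for illustration; a real warehouse would keep the analytical data in a separate system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for operational data.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
    [("east", 2023, 100.0), ("east", 2024, 150.0),
     ("west", 2023, 80.0), ("west", 2024, 120.0)],
)

# OLTP-style access: a short transaction that updates one current, detailed record.
with conn:
    conn.execute("UPDATE orders SET amount = 110.0 WHERE order_id = 1")

# OLAP-style access: a read-only query that consolidates many records.
olap_result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(olap_result)
```

The update touches a single row by key, while the analytical query scans and aggregates the whole table.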
OLTP vs. OLAP
Why Separate Data Warehouse?
■ High performance for both systems
  ■ DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  ■ Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
■ Different functions and different data:
  ■ Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain
  ■ Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  ■ Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
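As a rough illustration of consolidation and reconciliation, the sketch below merges two hypothetical source extracts that encode countries and sales amounts differently. The record layout, the `COUNTRY_CODES` mapping, and the function names are assumptions made for this example, not part of any particular warehouse tool.

```python
from collections import defaultdict

# Hypothetical extracts from two operational sources with inconsistent encodings.
source_a = [{"country": "USA", "sales": "100.50"}, {"country": "CAN", "sales": "20"}]
source_b = [{"country": "United States", "sales": 50.0}, {"country": "Canada", "sales": 5.5}]

# Assumed mapping that reconciles the different country codes onto one representation.
COUNTRY_CODES = {"USA": "US", "United States": "US", "CAN": "CA", "Canada": "CA"}

def reconcile(record):
    """Map one source record onto the common representation."""
    return COUNTRY_CODES[record["country"]], float(record["sales"])

def consolidate(*sources):
    """Aggregate the reconciled records from all sources by country."""
    totals = defaultdict(float)
    for source in sources:
        for record in source:
            country, sales = reconcile(record)
            totals[country] += sales
    return dict(totals)

totals = consolidate(source_a, source_b)
print(totals)
```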
From Tables and Spreadsheets
to Data Cubes
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
■ 0-D (apex) cuboid: all
■ 1-D cuboids: time; item; location; supplier
■ 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
■ 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
■ 4-D (base) cuboid: time, item, location, supplier
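The lattice of cuboids over time, item, location, and supplier can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The following is a minimal sketch; the function name `cuboids` is chosen here for illustration.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group all cuboids by dimensionality, from the 0-D apex to the base cuboid."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

lattice = cuboids(["time", "item", "location", "supplier"])
for k, group in lattice.items():
    # The empty combination is the apex cuboid, labelled "all".
    print(f"{k}-D:", [",".join(c) or "all" for c in group])

total = sum(len(group) for group in lattice.values())
print(total)  # 2**4 subsets of four dimensions
```

With n dimensions and no concept hierarchies there are 2^n cuboids, which is why the count doubles with each added dimension.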
Conceptual Modeling of
Data Warehouses
■ Modeling data warehouses: dimensions & measures
  ■ Star schema: a fact table in the middle connected to a set of dimension tables
  ■ Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake
  ■ Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema
Example of Star Schema
■ Sales Fact Table
  ■ Keys: time_key, item_key, branch_key, location_key
  ■ Measures: units_sold, dollars_sold, avg_sales
■ Dimension tables
  ■ time: time_key, day, day_of_the_week, month, quarter, year
  ■ item: item_key, item_name, brand, type, supplier_type
  ■ branch: branch_key, branch_name, branch_type
  ■ location: location_key, street, city, province_or_state, country
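A star schema like this one can be sketched with Python's built-in sqlite3 module. Only a subset of the dimension tables and columns is used, and all sample rows (item names, brand, cities, amounts) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "time"   (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one foreign key per dimension, plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES "time"(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO "time" VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item VALUES (1, 'TV', 'Acme'), (2, 'PC', 'Acme');
INSERT INTO location VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales VALUES
    (1, 1, 1, 2, 800.0), (1, 2, 2, 1, 1200.0),
    (2, 1, 2, 3, 1200.0), (2, 1, 1, 1, 400.0);
""")

# Star join: the fact table joins directly to each dimension table it needs.
star_rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN "time"   AS t ON s.time_key = t.time_key
    JOIN item     AS i ON s.item_key = i.item_key
    JOIN location AS l ON s.location_key = l.location_key
    WHERE l.country = 'Canada'
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(star_rows)
```

Every dimension is one join away from the fact table, which keeps analytical queries simple at the cost of some redundancy inside the dimension tables.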
Example of Snowflake Schema
■ Sales Fact Table
  ■ Keys: time_key, item_key, branch_key, location_key
  ■ Measures: units_sold, dollars_sold, avg_sales
■ Dimension tables
  ■ time: time_key, day, day_of_the_week, month, quarter, year
  ■ item: item_key, item_name, brand, type, supplier_key
    ■ supplier: supplier_key, supplier_type
  ■ branch: branch_key, branch_name, branch_type
  ■ location: location_key, street, city_key
    ■ city: city_key, city, province_or_state, country
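The effect of snowflaking the location dimension can be sketched as follows, again with sqlite3. Only the location and city tables and a cut-down fact table are modelled, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The city attributes are normalized out of location into their own table.
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales    (location_key INTEGER REFERENCES location(location_key),
                       dollars_sold REAL);
INSERT INTO city VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1), (12, '3 Lake Dr', 2);
INSERT INTO sales VALUES (10, 100.0), (11, 50.0), (12, 75.0), (12, 25.0);
""")

# Reaching country now takes two joins: fact -> location -> city.
snowflake_rows = conn.execute("""
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    JOIN city     AS c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(snowflake_rows)
```

Normalization stores each city once instead of repeating it per location, but queries that need city-level attributes pay for an extra join.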
Example of Fact Constellation
■ Sales Fact Table
  ■ Keys: time_key, item_key, branch_key, location_key
  ■ Measures: units_sold, dollars_sold, avg_sales
■ Shipping Fact Table
  ■ Keys: time_key, item_key, shipper_key, from_location, to_location
  ■ Measures: dollars_cost, units_shipped
■ Dimension tables
  ■ time: time_key, day, day_of_the_week, month, quarter, year
  ■ item: item_key, item_name, brand, type, supplier_type
  ■ branch: branch_key, branch_name, branch_type
  ■ location: location_key, street, city, province_or_state, country
  ■ shipper: shipper_key, shipper_name, location_key, shipper_type
Measures: Three Categories
■ Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning.
  ■ E.g., count(), sum(), min(), max().
■ Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  ■ E.g., avg(), min_N(), standard_deviation().
■ Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  ■ E.g., median(), mode(), rank().
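The three categories can be checked on a small example. The sketch below uses invented values split into three partitions and compares results combined from per-partition subaggregates with results computed on all the data.

```python
from statistics import median

# Hypothetical data, split into partitions as a warehouse might store it.
partitions = [[1, 2, 9], [3, 4], [5, 100]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the per-partition sums equals the sum over all data.
sum_from_parts = sum(sum(part) for part in partitions)

# Algebraic: avg is derived from two distributive subaggregates, (sum, count).
subaggregates = [(sum(part), len(part)) for part in partitions]
avg_from_parts = sum(s for s, _ in subaggregates) / sum(n for _, n in subaggregates)

# Holistic: combining per-partition medians does not, in general, give the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(sum_from_parts, avg_from_parts, median_of_medians, true_median)
```

Here the sum and the average are recovered exactly from the subaggregates, while the median of the partition medians differs from the median of the full data.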
Multidimensional Data
■ Sales volume as a function of product, month,
and region
■ Dimensions: Product, Location, Time
[Figure: sales cube with one axis per dimension; surviving level labels include Office (location) and Day, Month (time)]
A Sample Data Cube
[Figure: sample data cube of sales with "sum" margins on each axis]
■ Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
■ Product: TV, PC, VCR, sum
■ Country: U.S.A., Canada, Mexico, sum
■ Highlighted cell: total annual sales of TV in U.S.A.
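The "sum" cells on each axis of the sample cube aggregate over that dimension. A minimal sketch of this idea follows; the sales figures and the helper name `cell` are invented for illustration.

```python
# Hypothetical sales figures keyed by (product, country, quarter).
sales = {
    ("TV", "U.S.A.", "1Qtr"): 10, ("TV", "U.S.A.", "2Qtr"): 12,
    ("TV", "U.S.A.", "3Qtr"): 8,  ("TV", "U.S.A.", "4Qtr"): 15,
    ("TV", "Canada", "1Qtr"): 4,  ("PC", "U.S.A.", "1Qtr"): 20,
    ("VCR", "Mexico", "2Qtr"): 3,
}

def cell(product=None, country=None, quarter=None):
    """Sum the cube over every dimension left as None (the 'sum' margins)."""
    return sum(
        value
        for (p, c, q), value in sales.items()
        if product in (None, p) and country in (None, c) and quarter in (None, q)
    )

# Total annual sales of TV in U.S.A.: the date dimension is aggregated away.
tv_usa_annual = cell(product="TV", country="U.S.A.")
# Apex of the cube: all three dimensions are aggregated away.
grand_total = cell()
print(tv_usa_annual, grand_total)
```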
Cuboids Corresponding to the Cube
■ 0-D (apex) cuboid: all
■ 1-D cuboids: product; date; country
■ 2-D cuboids: product,date; product,country; date,country
■ 3-D (base) cuboid: product, date, country
Browsing a Data Cube
■ Visualization
■ OLAP capabilities
■ Interactive manipulation
Typical OLAP Operations
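[Figure: diagram with one line per dimension and abstraction levels marked along each line]
■ Time: ANNUALLY, QTRLY, DAILY
■ Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE
■ Location: CITY, COUNTRY, REGION
■ Organization: SALES PERSON, DISTRICT, DIVISION
■ Promotion
■ Other labels: ORDER, TRUCK
■ Each circle (abstraction level) is called a footprint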
Design of a Data Warehouse: A
Business Analysis Framework
■ Four views regarding the design of a data warehouse
  ■ Top-down view
    ■ allows selection of the relevant information necessary for the data warehouse
  ■ Data source view
    ■ exposes the information being captured, stored, and managed by operational systems
  ■ Data warehouse view
    ■ consists of fact tables and dimension tables
  ■ Business query view
    ■ sees the data in the warehouse from the viewpoint of the end user
Data Warehouse Design Process
Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture; labels in data-flow order]
■ Operational DBs and other sources
■ Extract, Transform, Load, Refresh
■ Monitor & Integrator; Metadata
■ Data Warehouse; Data Marts
■ Serve: OLAP Server
■ Analysis, Query, Reports, Data mining