Data Warehousing: CPS216 Notes 13
CPS216
Notes 13
Shivnath Babu
Warehousing
Growing industry: $8 billion way back in 1998
Range from desktop to huge:
Walmart: 900-CPU, 2,700-disk, 23 TB Teradata system
Lots of buzzwords, hype
slice & dice, rollup, MOLAP, pivot, ...
2
Outline
What is a data warehouse?
Why a warehouse?
Models & operations
Implementing a warehouse
Future directions
3
What is a Warehouse?
Collection of diverse data
subject oriented
aimed at executive, decision maker
often a copy of operational data
with value-added data (e.g., summaries, history)
integrated
time-varying
non-volatile
more
4
What is a Warehouse?
Collection of tools
gathering data
cleansing, integrating, ...
querying, reporting, analysis
data mining
monitoring, administering warehouse
5
Warehouse Architecture
[Figure: warehouse architecture. Labels recovered: Client, Client, Metadata, Warehouse, Integration]
6
Motivating Examples
Forecasting
Comparing performance of units
Monitoring, detecting fraud
Visualization
7
Why a Warehouse?
Two Approaches:
Query-Driven (Lazy)
Warehouse (Eager)
[Figure: both approaches sit over the same data sources. Labels recovered: Source, Source]
8
Query-Driven Approach
[Figure: query-driven architecture. Labels recovered: Client, Client, Mediator]
9
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify, summarize (store aggregates)
Add historical information
10
Advantages of Query-Driven
No need to copy data
less storage
no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources
11
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites
OLAP: On Line Analytical Processing
Describes processing at warehouse
12
OLTP vs. OLAP
OLTP                                   OLAP
Mostly updates                         Mostly reads
Many small transactions                Queries long, complex
MB-GB of data                          GB-TB of data
Raw data                               Summarized, consolidated data
Clerical users                         Decision-makers, analysts as users
Up-to-date data
Consistency, recoverability critical
13
Data Marts
Smaller warehouses
Spans part of organization
e.g., marketing (customers, products, sales)
Do not require enterprise-wide consensus
but long term integration problems?
14
Warehouse Models & Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
15
Star
product  prodId  name  price
         p1      bolt  10
         p2      nut   5

store    storeId  city
         c1       nyc
         c2       sfo
         c3       la
16
Star Schema
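sale:     orderId, date, custId, prodId, storeId, qty, amt
product:  prodId, name, price
customer: custId, name, address, city
store:    storeId, city
sale.prodId references product; sale.custId references customer; sale.storeId references store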
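The schema above can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module; the product and store rows come from the Star slide, while the customer row, the order ids, and the qty values are hypothetical and only there to make the query run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE customer (custId TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Central table: each row points at one product, one customer, one store.
CREATE TABLE sale (
    orderId TEXT, date INTEGER,
    custId  TEXT REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store   VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
-- Hypothetical customer and order rows, for illustration only.
INSERT INTO customer VALUES ('k1', 'ann', '1 main st', 'nyc');
INSERT INTO sale VALUES ('o1', 1, 'k1', 'p1', 'c1', 1, 12),
                        ('o2', 1, 'k1', 'p2', 'c1', 1, 11),
                        ('o3', 1, 'k1', 'p1', 'c3', 1, 50);
""")

# A typical star query: join the central table to one dimension, then aggregate.
by_city = conn.execute("""
    SELECT st.city, SUM(s.amt)
    FROM sale s JOIN store st ON s.storeId = st.storeId
    GROUP BY st.city ORDER BY st.city
""").fetchall()
print(by_city)
```

Every query follows the same shape: start from sale, join outward along one or more of the three references, and aggregate.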
17
Terms
Fact table: sale (orderId, date, custId, prodId, storeId, qty, amt)
Dimension tables: product (prodId, name, price), customer (custId, name, address, city), store (storeId, city)
Measures: qty, amt
18
Dimension Hierarchies
Hierarchy: store -> sType; store -> city -> region

sType   tId  size   location
        t1   small  downtown
        t2   large  suburbs

store   storeId  cityId  tId  mgr
        s5       sfo     t1   joe
        s7       sfo     t2   fred
        s9       la      t1   nancy

city    cityId  pop  regId
        sfo     1M   north
        la      5M   south

region  regId  name
        north  cold region
        south  warm region

snowflake schema
constellations
19
Cube
dimensions = 2
20
3-D Cube
dimensions = 3
21
ROLAP vs. MOLAP
ROLAP:
Relational On-Line Analytical Processing
MOLAP:
Multi-Dimensional On-Line Analytical
Processing
22
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result: 81
23
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   date  sum
      1     81
      2     48
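Both aggregate queries can be checked against the six SALE rows. This is a small sketch using Python's built-in sqlite3 module; the ORDER BY clause is an addition so that the result order is predictable.

```python
import sqlite3

# The six SALE rows from the slide: (prodId, storeId, date, amt).
SALE_ROWS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALE_ROWS)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total)  # 81
print(by_day)      # [(1, 81), (2, 48)]
```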
24
Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

rollup: from sums by (day, product) to sums by day
drill-down: from sums by day back to sums by (day, product)
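Rollup does not need to go back to the fact table: the by-day sums can be obtained by re-adding the finer by-(day, product) sums. The sketch below illustrates that with Python's sqlite3 module; the variable names `fine` and `coarse` are made up for this example.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Drill-down level: one sum per (day, product).
fine = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Rollup: drop the product column and add the partial sums per day.
coarse = defaultdict(int)
for prod_id, date, total in fine:
    coarse[date] += total

print(fine)          # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
print(dict(coarse))  # {1: 81, 2: 48}
```

This works because sum is distributive; an operator such as median could not be rolled up from partial results this way.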
25
Aggregates
Operators: sum, count, max, min, median, avg
“Having” clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
26
Cube Aggregation
Example: computing sums
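Base cube (amt by product, store, day):

day 1    c1   c2   c3
  p1     12        50
  p2     11   8

day 2    c1   c2   c3
  p1     44   4
  p2

Sum over days:
         c1   c2   c3
  p1     56   4    50
  p2     11   8

Sum over days and products:
         c1   c2   c3
  sum    67   12   50

Sum over days and stores:
  p1     110
  p2     19

Sum over everything: 129

rollup: move toward the coarser sums
drill-down: move back toward the finer cells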
27
Cube Operators
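day 1    c1   c2   c3
  p1     12        50
  p2     11   8

day 2    c1   c2   c3
  p1     44   4
  p2

Sum over days:
         c1   c2   c3
  p1     56   4    50
  p2     11   8

Sum over days and products:
         c1   c2   c3
  sum    67   12   50

Sum over days and stores: p1 110, p2 19
Sum over everything: 129

Cells are written sale(store, product, date), with * meaning all values:
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129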
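The sale(...) notation can be read as a function that sums every base cell matching its arguments. A minimal sketch, assuming the cube is stored as a Python dict keyed by (store, product, date); the dict layout and the name `SALES` are choices made for this example.

```python
# Base cube cells from the slide, keyed by (store, product, date).
SALES = {
    ("c1", "p1", 1): 12, ("c1", "p2", 1): 11, ("c3", "p1", 1): 50,
    ("c2", "p2", 1): 8,  ("c1", "p1", 2): 44, ("c2", "p1", 2): 4,
}

def sale(store, product, date):
    """Sum all cells that match; "*" matches every value of that dimension."""
    total = 0
    for (s, p, d), amt in SALES.items():
        if store in ("*", s) and product in ("*", p) and date in ("*", d):
            total += amt
    return total

print(sale("c1", "*", "*"))   # 67
print(sale("c2", "p2", "*"))  # 8
print(sale("*", "*", "*"))    # 129
```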
28
Extended Cube
* = all values of that dimension

all days (date = *):
         c1   c2   c3   *
  p1     56   4    50   110
  p2     11   8         19
  *      67   12   50   129

day 2:
         c1   c2   c3   *
  p1     44   4         48
  p2
  *      44   4         48

day 1:
         c1   c2   c3   *
  p1     12        50   62
  p2     11   8         19
  *      23   8    50   81

sale(*,p2,*) = 19
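The extended cube can be filled in one pass over the base cells: each fact contributes to every cell obtained by replacing any subset of its coordinates with *. The following is a sketch under that reading; the function name `extended_cube` and the dict layout are assumptions of this example.

```python
from collections import defaultdict
from itertools import product

# Base cube cells, keyed by (store, product, date).
FACTS = {
    ("c1", "p1", 1): 12, ("c1", "p2", 1): 11, ("c3", "p1", 1): 50,
    ("c2", "p2", 1): 8,  ("c1", "p1", 2): 44, ("c2", "p1", 2): 4,
}

def extended_cube(facts):
    cube = defaultdict(int)
    for key, amt in facts.items():
        # For each coordinate choose either its value or "*": 2^3 = 8 cells per fact.
        for cell in product(*[(value, "*") for value in key]):
            cube[cell] += amt
    return dict(cube)

cube = extended_cube(FACTS)
print(cube[("*", "*", "*")])   # 129
print(cube[("*", "p1", 1)])    # 62
print(cube[("c1", "*", 1)])    # 23
```

With n dimensions each fact touches 2^n cells, which is one reason the later slides ask what is worth materializing.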
29
Aggregation Using Hierarchies
day 1    c1   c2   c3
  p1     12        50
  p2     11   8

day 2    c1   c2   c3
  p1     44   4
  p2

Hierarchy: customer -> region -> country

Sums by region (over both days):
         region A   region B
  p1     56         54
  p2     11         8

(customer c1 in Region A;
customers c2, c3 in Region B)
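Aggregating along a hierarchy amounts to mapping each value to its parent and summing. A small sketch with the customer-to-region mapping stated on the slide; the names `REGION_OF` and `by_region` are made up for this example.

```python
from collections import defaultdict

# Base cube cells, keyed by (customer, product, date).
FACTS = {
    ("c1", "p1", 1): 12, ("c1", "p2", 1): 11, ("c3", "p1", 1): 50,
    ("c2", "p2", 1): 8,  ("c1", "p1", 2): 44, ("c2", "p1", 2): 4,
}

# One level of the hierarchy: customer -> region.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

by_region = defaultdict(int)
for (customer, prod, _date), amt in FACTS.items():
    by_region[(prod, REGION_OF[customer])] += amt

print(dict(by_region))
```

The result reproduces the region table: p1 is 56 in region A and 54 in region B, p2 is 11 and 8.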
30
Pivoting
Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 1    c1   c2   c3
  p1     12        50
  p2     11   8

day 2    c1   c2   c3
  p1     44   4
  p2
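Pivoting between the two views is a change of layout, not of content. The sketch below turns fact rows into a nested dict indexed by day, product, and store; it assumes at most one fact row per (product, store, day), as in the slide's data, and the name `pivot` is a choice made for this example.

```python
# Fact table view: (prodId, storeId, date, amt).
SALE_ROWS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def pivot(rows):
    """Return cube[day][product][store] = amt; empty cells are simply absent."""
    cube = {}
    for prod, store, day, amt in rows:
        cube.setdefault(day, {}).setdefault(prod, {})[store] = amt
    return cube

cube = pivot(SALE_ROWS)
print(cube[1])  # {'p1': {'c1': 12, 'c3': 50}, 'p2': {'c1': 11, 'c2': 8}}
print(cube[2])  # {'p1': {'c1': 44, 'c2': 4}}
```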
31
Derived Data
Derived Warehouse Data
indexes
aggregates
materialized views (next slide)
When to update derived data?
Incremental vs. refresh
32
Materialized Views
Define new warehouse relations using SQL expressions

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

product  id  name  price
         p1  bolt  10
         p2  nut   5
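As an illustration of a materialized view and of the incremental vs. refresh choice, here is a sketch using Python's sqlite3 module. SQLite has no materialized views, so the view is emulated by storing the query result in an ordinary table. The view name `sales_by_product`, the helper names, and the extra sale row are hypothetical; the incremental path assumes the product already has a row in the view.

```python
import sqlite3

VIEW_QUERY = """
    SELECT p.name AS name, SUM(s.amt) AS total
    FROM sale s JOIN product p ON s.prodId = p.id
    GROUP BY p.name
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
INSERT INTO sale VALUES ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
                        ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
""")
# Materialize: store the result of the defining query as a table.
conn.execute("CREATE TABLE sales_by_product AS " + VIEW_QUERY)

def totals():
    return dict(conn.execute("SELECT name, total FROM sales_by_product"))

def add_sale_incremental(prod_id, store_id, date, amt):
    """Insert a sale and adjust only the affected row of the view."""
    conn.execute("INSERT INTO sale VALUES (?, ?, ?, ?)", (prod_id, store_id, date, amt))
    conn.execute(
        "UPDATE sales_by_product SET total = total + ? "
        "WHERE name = (SELECT name FROM product WHERE id = ?)",
        (amt, prod_id),
    )

def refresh():
    """Throw the view contents away and recompute them from scratch."""
    conn.execute("DELETE FROM sales_by_product")
    conn.execute("INSERT INTO sales_by_product " + VIEW_QUERY)

before = totals()                         # bolt 110, nut 19
add_sale_incremental("p2", "c3", 3, 7)    # hypothetical new sale
after_incremental = totals()
refresh()
after_refresh = totals()
print(before, after_incremental, after_refresh)
```

Both paths end in the same state; the incremental one touches a single row, while the refresh rescans the whole fact table.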
33
Processing
ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms
[Figure: warehouse architecture. Labels recovered: Client, Client, Query & Analysis, Metadata, Warehouse, Integration]
34
ROLAP Server
Relational OLAP Server
[Figure labels: tools, ROLAP server, relational DBMS]

sale  prodId  date  sum
      p1      1     62
      p2      1     19
      p1      2     48
35
MOLAP Server
Multi-Dimensional OLAP Server
[Figure: Sales cube with dimensions City (A, B), Product (milk, soda, eggs, soap), Date (1, 2, 3, 4)]
M.D. tools
utilities
multi-dimensional server (could also sit on relational DBMS)
36
Index Structures
Traditional Access Methods
B-trees, hash tables, R-trees, grids, …
Popular in Warehouses
inverted lists
bit map indexes
join indexes
text indexes
37
Inverted Lists
[Figure: an age index (18, 19, ...) whose entries point to inverted lists, which point to data records]
38
Using Inverted Lists
Query:
Get people with age = 20 and name = “fred”
List for age = 20: r4, r18, r34, r35
List for name = “fred”: r18, r52
Answer is intersection: r18
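The query above can be sketched with plain Python sets. The two lists are the ones given on the slide; the nested-dict layout of `INVERTED` and the helper name `lookup` are assumptions of this example.

```python
# Inverted lists: attribute -> value -> record ids (lists taken from the slide).
INVERTED = {
    "age":  {20: ["r4", "r18", "r34", "r35"]},
    "name": {"fred": ["r18", "r52"]},
}

def lookup(attribute, value):
    """Return the set of record ids listed for attribute = value."""
    return set(INVERTED[attribute].get(value, []))

# A conjunctive query is answered by intersecting the lists.
answer = lookup("age", 20) & lookup("name", "fred")
print(answer)  # {'r18'}
```

Only the records in the intersection need to be fetched from the data file.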
39
Bit Maps
[Figure: an age index (18, 19, 20, 21, 22, 23, 25, 26, ...) whose entries point to bit maps, with one bit per data record]

data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26

bit map for age = 20: 11011000
bit map for age = 21: 00100010
40
Using Bit Maps
Query:
Get people with age = 20 and name = “fred”
Bit map for age = 20: 1101100000
Bit map for name = “fred”: 0100000001
Answer is intersection (bitwise AND): 0100000000
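A bit map index can be sketched with Python integers used as bit strings. The example below uses the eight data records from the Bit Maps slide rather than the ten-bit strings above; the helper names `build_bitmaps` and `show` are made up for this illustration.

```python
# Data records from the Bit Maps slide: (id, name, age).
RECORDS = [
    (1, "joe", 20), (2, "fred", 20), (3, "sally", 21), (4, "nancy", 20),
    (5, "tom", 20), (6, "pat", 25), (7, "dave", 21), (8, "jeff", 26),
]
N = len(RECORDS)

def build_bitmaps(records, column):
    """Map each distinct value of a column to an integer bit map."""
    bitmaps = {}
    for pos, rec in enumerate(records):
        # Record 0 is the leftmost (most significant) bit.
        bit = 1 << (len(records) - 1 - pos)
        bitmaps[rec[column]] = bitmaps.get(rec[column], 0) | bit
    return bitmaps

def show(bitmap):
    return format(bitmap, f"0{N}b")

age_maps = build_bitmaps(RECORDS, 2)
name_maps = build_bitmaps(RECORDS, 1)

# The conjunction is a single bitwise AND of the two bit maps.
hit = age_maps[20] & name_maps["fred"]
matches = [rec for pos, rec in enumerate(RECORDS) if (hit >> (N - 1 - pos)) & 1]

print(show(age_maps[20]))       # 11011000
print(show(name_maps["fred"]))  # 01000000
print(show(hit), matches)       # 01000000 [(2, 'fred', 20)]
```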
41
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

product  id  name  price
         p1  bolt  10
         p2  nut   5
42
Join Indexes
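join index: for each PRODUCT row, the ids of the matching SALE records
(r1 to r6 are the SALE rows in the order listed on the Join slide)

product  id  name  price  jIndex
         p1  bolt  10     r1,r3,r5,r6
         p2  nut   5      r2,r4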
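The jIndex column can be derived mechanically from the SALE rows. A minimal sketch, assuming the record ids r1 to r6 number the SALE rows in the order they are listed; the function name `build_join_index` is hypothetical.

```python
# SALE records keyed by record id: (prodId, storeId, date, amt).
SALE = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11), "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),  "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}

def build_join_index(sale):
    """Map each product id to the ids of the SALE records that join with it."""
    index = {}
    for rid, row in sale.items():
        index.setdefault(row[0], []).append(rid)
    return index

j_index = build_join_index(SALE)
print(j_index)  # {'p1': ['r1', 'r3', 'r5', 'r6'], 'p2': ['r2', 'r4']}

# Joining product p2 with SALE now needs no scan of the SALE table.
p2_sales = [SALE[rid] for rid in j_index["p2"]]
print(p2_sales)  # [('p2', 'c1', 1, 11), ('p2', 'c2', 1, 8)]
```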
43
What to Materialize?
Store in warehouse results useful for
common queries
Example:
total sales
day 1    c1   c2   c3
  p1     12        50
  p2     11   8

day 2    c1   c2   c3
  p1     44   4
  p2

Candidates to materialize:

sum over days:
         c1   c2   c3
  p1     56   4    50
  p2     11   8

sum over days and products:
         c1   c2   c3
  sum    67   12   50

sum over days and stores: p1 110, p2 19
grand total: 129
44
Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost
45
Cube Aggregates Lattice
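[Figure: lattice of cube aggregates, from the base cube up to the grand total. Levels recovered:]
all (grand total: 129)
city    product    date    (e.g., sum by city: c1 67, c2 12, c3 50)
city, product, date    (base cube: the day 1 and day 2 tables)

use greedy algorithm to decide what to materialize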
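The slide does not say which greedy algorithm is meant. The sketch below shows one common benefit-based formulation, which may differ from the variant intended here: the base cube is always stored, a query on a view is charged the row count of the smallest stored view that can answer it, and each round picks the view that saves the most. View sizes are computed from the six SALE rows; all function names are made up for this example.

```python
from itertools import combinations

# SALE rows: (product, city, date, amt); column positions per dimension.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
DIMS = {"product": 0, "city": 1, "date": 2}

def view_size(dims):
    """Number of rows in the aggregate grouped by the given dimensions."""
    return len({tuple(row[DIMS[d]] for d in sorted(dims)) for row in SALE})

def all_views():
    names = sorted(DIMS)
    return [frozenset(c) for r in range(len(names) + 1) for c in combinations(names, r)]

def greedy(k):
    """Pick k views to materialize in addition to the base cube."""
    views = all_views()
    size = {v: view_size(v) for v in views}
    top = frozenset(DIMS)                      # base cube, always stored
    cost = {v: size[top] for v in views}       # current cost of answering each view
    chosen = []
    for _ in range(k):
        def benefit(v):
            # v can answer every view w whose dimensions are a subset of v's.
            return sum(max(0, cost[w] - size[v]) for w in views if w <= v)
        candidates = [v for v in views if v != top and v not in chosen]
        best = max(candidates, key=benefit)
        chosen.append(best)
        for w in views:
            if w <= best:
                cost[w] = min(cost[w], size[best])
    return chosen

chosen = greedy(2)
print([sorted(v) for v in chosen])  # [['date', 'product'], ['city']]
```

On this tiny data set the first pick is the (product, date) aggregate, which has 3 rows instead of 6 and serves four of the eight views; the second is the by-city aggregate.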
46
Dimension Hierarchies
all
city
47
Dimension Hierarchies
all
state
city, product, date
state, date
state, product
48
Interesting Hierarchy
[Figure: hierarchy over the levels days, weeks, months, quarters, years, all]

conceptual dimension table:

time  day  week  month  quarter  year
      1    1     1      1        2000
      2    1     1      1        2000
      3    1     1      1        2000
      4    1     1      1        2000
      5    1     1      1        2000
      6    1     1      1        2000
      7    1     1      1        2000
      8    2     1      1        2000
49