Data Warehousing: CPS216 Notes 13

Shivnath Babu
Warehousing
• Growing industry: $8 billion way back in 1998
• Systems range from desktop to huge
  - Walmart: 900-CPU, 2,700-disk, 23 TB Teradata system
• Lots of buzzwords, hype
  - slice & dice, rollup, MOLAP, pivot, ...
2
Outline
• What is a data warehouse?
• Why a warehouse?
• Models & operations
• Implementing a warehouse
• Future directions
3
What is a Warehouse?
• Collection of diverse data
  - subject oriented
    - aimed at executives, decision makers
  - often a copy of operational data
    - with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile
4
What is a Warehouse?
• Collection of tools
  - gathering data
  - cleansing, integrating, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering warehouse
5
Warehouse Architecture
[Figure: layered architecture, top to bottom: Clients; Query & Analysis; Warehouse with its Metadata; Integration; multiple Sources.]
6
Motivating Examples
• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization
7
Why a Warehouse?
• Two approaches:
  - Query-driven (lazy)
  - Warehouse (eager)
[Figure: two data sources]
8
Query-Driven Approach
[Figure: Clients send queries to a Mediator, which sits on top of one Wrapper per source.]
9
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information
10
Advantages of Query-Driven
• No need to copy data
  - less storage
  - no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources
11
OLTP vs. OLAP
• OLTP: On-Line Transaction Processing
  - Describes processing at operational sites
• OLAP: On-Line Analytical Processing
  - Describes processing at warehouse
12
OLTP vs. OLAP
OLTP                                      OLAP
• Mostly updates                          • Mostly reads
• Many small transactions                 • Queries long, complex
• MB-GB of data                           • GB-TB of data
• Raw data                                • Summarized, consolidated data
• Clerical users                          • Decision-makers, analysts as users
• Up-to-date data
• Consistency, recoverability critical
13
Data Marts
• Smaller warehouses
• Span part of an organization
  - e.g., marketing (customers, products, sales)
• Do not require enterprise-wide consensus
  - but long-term integration problems?
14
Warehouse Models & Operators
• Data Models
  - relations
  - stars & snowflakes
  - cubes
• Operators
  - slice & dice
  - roll-up, drill down
  - pivoting
  - other
15
Star
product  prodId  name  price
         p1      bolt  10
         p2      nut   5

store  storeId  city
       c1       nyc
       c2       sfo
       c3       la

sale  orderId  date    custId  prodId  storeId  qty  amt
      o100     1/7/97  53      p1      c1       1    12
      o102     2/7/97  53      p2      c1       2    11
      o105     3/8/97  111     p1      c3       5    50

customer  custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la
16
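The star tables above can be loaded into a relational engine to see how one fact row joins to its dimensions. The following is a minimal sketch using Python's built-in sqlite3 module; the column types, the variable names, and the spelling o105 for the third order id are assumptions, not part of the slides.

```python
import sqlite3

# In-memory database holding the star schema from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
-- Fact table: one row per sale, with a foreign key into each dimension.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT    REFERENCES product(prodId),
    storeId TEXT    REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product  VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store    VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                            (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO sale VALUES
    ('o100','1/7/97',53,'p1','c1',1,12),
    ('o102','2/7/97',53,'p2','c1',2,11),
    ('o105','3/8/97',111,'p1','c3',5,50);  -- 'o' prefix assumed
""")

# Star join: the fact table joined to every dimension table.
star_join = conn.execute("""
    SELECT c.name, p.name, s.city, f.amt
    FROM sale f
    JOIN customer c ON c.custId  = f.custId
    JOIN product  p ON p.prodId  = f.prodId
    JOIN store    s ON s.storeId = f.storeId
    ORDER BY f.orderId
""").fetchall()

for row in star_join:
    print(row)
```

Each output row pairs one sale with the customer, product, and store it refers to, which is the access pattern a star schema is laid out for.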
Star Schema
[Figure: star schema. The table sale(orderId, date, custId, prodId, storeId, qty, amt) sits at the center and links to customer(custId, name, address, city), product(prodId, name, price), and store(storeId, city).]
17
Terms
• Fact table
• Dimension tables
• Measures
[Figure: the star schema again. sale(orderId, date, custId, prodId, storeId, qty, amt) is the fact table; customer, product, and store are the dimension tables; qty and amt are the measures.]
18
Dimension Hierarchies
[Figure: hierarchy over the store dimension, with nodes store, sType, city, and region.]

store  storeId  cityId  tId  mgr
       s5       sfo     t1   joe
       s7       sfo     t2   fred
       s9       la      t1   nancy

sType  tId  size   location
       t1   small  downtown
       t2   large  suburbs

city  cityId  pop  regId
      sfo     1M   north
      la      5M   south

region  regId  name
        north  cold region
        south  warm region

• snowflake schema
• constellations
19
Cube
Fact table view:                    Multi-dimensional cube:

sale  prodId  storeId  amt                c1   c2   c3
      p1      c1       12            p1   12        50
      p2      c1       11            p2   11   8
      p1      c3       50
      p2      c2       8

dimensions = 2
20
3-D Cube
Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 1      c1   c2   c3
      p1   12        50
      p2   11   8

day 2      c1   c2   c3
      p1   44   4
      p2

dimensions = 3
21
ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing
22
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result: 81
23
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans  date  sum
     1     81
     2     48
24
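Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result:
sale  prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

• rollup: go from a finer grouping to a coarser one (e.g., drop prodId to get totals by day)
• drill-down: the reverse, from a coarser grouping to a finer one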
25
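The three aggregate queries above can be run as-is against a small relational engine. Below is a minimal sketch with Python's built-in sqlite3 module; the table layout follows the slides, while the column types and variable names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# A single total for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Rolled up to one row per day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Drilled down to one row per (day, product).
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total)
print(by_day)
print(by_day_product)
```

Moving from by_day_product to by_day is a rollup; moving in the other direction is a drill-down.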
Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)
26
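As a small illustration of several aggregate operators and of the HAVING clause, the sketch below runs two queries over the sale table used in the earlier slides, via Python's sqlite3 module. The threshold of 20 in the HAVING clause and the variable names are assumptions chosen for the example; median is left out because SQLite has no built-in median function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Several aggregate operators computed per product in one pass.
per_product = conn.execute(
    "SELECT prodId, sum(amt), count(*), max(amt), min(amt), avg(amt) "
    "FROM sale GROUP BY prodId ORDER BY prodId"
).fetchall()

# HAVING filters whole groups after aggregation:
# keep only stores whose total exceeds 20 (threshold is hypothetical).
big_stores = conn.execute(
    "SELECT storeId, sum(amt) FROM sale GROUP BY storeId "
    "HAVING sum(amt) > 20 ORDER BY storeId"
).fetchall()

print(per_product)
print(big_stores)
```

WHERE would filter individual rows before grouping; HAVING is needed here because the condition is on the group total.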
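Cube Aggregation
Example: computing sums

Base cube:

day 1      c1   c2   c3
      p1   12        50
      p2   11   8

day 2      c1   c2   c3
      p1   44   4
      p2

Sum over date:

           c1   c2   c3
      p1   56   4    50
      p2   11   8

Sum over date and product:

           c1   c2   c3
      sum  67   12   50

Sum over date and store:

      p1   110
      p2   19

Sum over everything: 129

rollup moves toward the coarser sums; drill-down moves back toward the base cube.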
27
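Cube Operators
[Figure: the cube and its sums from the Cube Aggregation slide, with three cells labeled in the notation sale(store, product, date), where * means "summed over this dimension".]

sale(c1,*,*)  = 67   (column c1 of the sum over date and product)
sale(c2,p2,*) = 8    (cell c2, p2 of the sum over date)
sale(*,*,*)   = 129  (grand total)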
28
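The sale(...) notation can be read as a function over the fact table in which * leaves a dimension unconstrained. The sketch below is one way to express that in Python; the function signature and the use of the string "*" as the wildcard are assumptions made for illustration.

```python
# Fact rows as (prodId, storeId, date, amt), copied from the slides.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def sale(store="*", product="*", date="*"):
    """Sum amt over all facts matching the coordinates; '*' matches anything.

    Arguments follow the slide's order: sale(store, product, date).
    """
    total = 0
    for prod_id, store_id, day, amt in SALE:
        if (store in ("*", store_id)
                and product in ("*", prod_id)
                and date in ("*", day)):
            total += amt
    return total


print(sale("c1", "*", "*"))
print(sale("c2", "p2", "*"))
print(sale("*", "*", "*"))
```

This version scans the fact table on every call; a MOLAP server would instead look the answer up in precomputed cells.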
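Extended Cube
Each dimension gets an extra value * that holds the sum over that dimension.

date = *     c1   c2   c3   *
      p1     56   4    50   110
      p2     11   8         19
      *      67   12   50   129

day 2        c1   c2   c3   *
      p1     44   4         48
      p2
      *      44   4         48

day 1        c1   c2   c3   *
      p1     12        50   62
      p2     11   8         19
      *      23   8    50   81

Example: sale(*,p2,*) = 19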
29
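One straightforward way to fill every cell of the extended cube is to let each fact row contribute to all combinations in which a coordinate is either kept or replaced by *. The sketch below assumes that approach; it is not necessarily how a real OLAP server computes the cube, and the key order (store, product, date) is chosen to match the sale(...) notation.

```python
from collections import defaultdict
from itertools import product

# Fact rows as (prodId, storeId, date, amt), copied from the slides.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def extended_cube(rows):
    """Return {(store, product, date): sum of amt}, including '*' cells."""
    cube = defaultdict(int)
    for prod_id, store_id, day, amt in rows:
        # Each fact feeds 2**3 cells: every coordinate is kept or set to '*'.
        for key in product((store_id, "*"), (prod_id, "*"), (day, "*")):
            cube[key] += amt
    return dict(cube)


cube = extended_cube(SALE)
print(cube[("*", "p2", "*")])
print(cube[("c1", "*", 1)])
print(cube[("*", "*", "*")])
```

Empty cells of the slide (for example p2 on day 2) simply have no key in the dictionary.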
Aggregation Using Hierarchies
Base cube:

day 1      c1   c2   c3
      p1   12        50
      p2   11   8

day 2      c1   c2   c3
      p1   44   4
      p2

Hierarchy: customer → region → country

Rolled up to region (summed over date):

           region A   region B
      p1   56         54
      p2   11         8

(customer c1 in Region A; customers c2, c3 in Region B)
30
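Rolling up along a hierarchy amounts to mapping each base value to its parent and summing. The sketch below does this for the customer-to-region step; the REGION_OF mapping encodes the assignment stated on the slide, and the function and variable names are assumptions.

```python
from collections import defaultdict

# Fact rows as (prodId, custId, date, amt); c1..c3 are customers on this slide.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# One level of the hierarchy: customer -> region.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}


def rollup_to_region(rows):
    """Sum amt per (product, region), collapsing customers and dates."""
    totals = defaultdict(int)
    for prod_id, cust_id, _day, amt in rows:
        totals[(prod_id, REGION_OF[cust_id])] += amt
    return dict(totals)


by_region = rollup_to_region(SALE)
print(by_region)
```

A further step to country would apply a second mapping from region to country to these totals in the same way.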
Pivoting
Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 1      c1   c2   c3
      p1   12        50
      p2   11   8

day 2      c1   c2   c3
      p1   44   4
      p2

Pivot turns unique values from one column into unique columns in the output:

           c1   c2   c3
      p1   56   4    50
      p2   11   8
31
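A pivot can be sketched in a few lines: collect the distinct values of the pivot column, make one output column per value, and aggregate the measure into the matching cell. The version below pivots on storeId and sums amt over dates; the function name and the use of None for empty cells are assumptions.

```python
from collections import defaultdict

# Fact rows as (prodId, storeId, date, amt), copied from the slides.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def pivot_by_store(rows):
    """One output row per product, one column per distinct store."""
    stores = sorted({store_id for _p, store_id, _d, _a in rows})
    # Every product row starts with an empty (None) cell for each store.
    table = defaultdict(lambda: dict.fromkeys(stores))
    for prod_id, store_id, _day, amt in rows:
        current = table[prod_id][store_id]
        table[prod_id][store_id] = amt if current is None else current + amt
    return stores, dict(table)


columns, pivoted = pivot_by_store(SALE)
print(columns)
for prod_id in sorted(pivoted):
    print(prod_id, pivoted[prod_id])
```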
Derived Data
• Derived Warehouse Data
  - indexes
  - aggregates
  - materialized views (next slide)
• When to update derived data?
• Incremental vs. refresh
32
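The difference between a refresh and an incremental update can be made concrete with a stored aggregate such as sum(amt) per day. The sketch below is a simplified illustration that handles insertions only; the function names are assumptions, and deletions or operators such as max would need more bookkeeping.

```python
from collections import defaultdict


def refresh(sales):
    """Full refresh: recompute sum(amt) per date from all fact rows."""
    totals = defaultdict(int)
    for _prod_id, _store_id, day, amt in sales:
        totals[day] += amt
    return dict(totals)


def apply_insertions(totals, new_rows):
    """Incremental update: fold only the newly inserted rows into totals."""
    for _prod_id, _store_id, day, amt in new_rows:
        totals[day] = totals.get(day, 0) + amt
    return totals


# Day-1 facts are already in the warehouse; day-2 facts arrive later.
base = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50), ("p2", "c2", 1, 8)]
delta = [("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

view = refresh(base)            # computed once from the full data
apply_insertions(view, delta)   # touches only the two new rows
print(view)
```

Both paths end in the same stored totals; the incremental one reads only the delta, which is the reason to prefer it when the base data is large.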
Materialized Views
• Define new warehouse relations using SQL expressions

sale  prodId  storeId  date  amt        product  id  name  price
      p1      c1       1     12                  p1  bolt  10
      p2      c1       1     11                  p2  nut   5
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

joinTb  prodId  name  price  storeId  date  amt
        p1      bolt  10     c1       1     12
        p2      nut   5      c1       1     11
        p1      bolt  10     c3       1     50
        p2      nut   5      c2       1     8
        p1      bolt  10     c1       2     44
        p1      bolt  10     c2       2     4

joinTb does not exist at any source.
33
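The joinTb relation can be produced by storing the result of a join as a table. SQLite has no materialized-view statement, so the sketch below assumes CREATE TABLE ... AS SELECT as a stand-in; the join condition on prodId = id is inferred from the example rows, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
INSERT INTO sale VALUES
    ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
    ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);

-- Store the join result as a real table: a hand-rolled materialized view.
CREATE TABLE joinTb AS
    SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
    FROM sale s JOIN product p ON s.prodId = p.id;
""")

rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId"
).fetchall()

# Later queries read the stored result and skip the join entirely.
bolt_total = conn.execute(
    "SELECT sum(amt) FROM joinTb WHERE name = 'bolt'"
).fetchone()[0]

print(rows)
print(bolt_total)
```

Because joinTb is a copy, it goes stale when sale or product changes, which is where the incremental versus refresh question comes in.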
Processing
• ROLAP servers vs. MOLAP servers
• Index Structures
• What to Materialize?
• Algorithms
[Figure: the warehouse architecture again: Clients; Query & Analysis; Warehouse with its Metadata; Integration.]
34
ROLAP Server
• Relational OLAP Server

sale  prodId  date  sum
      p1      1     62
      p2      1     19
      p1      2     48

[Figure: tools and utilities sit on a ROLAP server, which sits on a relational DBMS.]
Special indices, tuning; schema is “denormalized”
35
MOLAP Server
• Multi-Dimensional OLAP Server
[Figure: a Sales cube with dimensions Product (milk, soda, eggs, soap), Date (1, 2, 3, 4), and City (A, B). M.D. tools and utilities sit on a multi-dimensional server, which could also sit on a relational DBMS.]
36
Index Structures
• Traditional Access Methods
  - B-trees, hash tables, R-trees, grids, …
• Popular in Warehouses
  - inverted lists
  - bit map indexes
  - join indexes
  - text indexes
37
Inverted Lists
data records:

rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26

[Figure: an age index (18, 19, 20, 21, 22, 23, 25, 26, ...) points to inverted lists of record ids, which in turn point to the data records. For example, the list for age 20 is r4, r18, r34, r35.]
38
Using Inverted Lists
• Query:
  - Get people with age = 20 and name = “fred”
  - List for age = 20: r4, r18, r34, r35
  - List for name = “fred”: r18, r52
  - Answer is intersection: r18
39
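The lookup above can be sketched with one dictionary of sets per indexed attribute. The records are the ones shown on the Inverted Lists slide; record r52 is not shown there, so its age below is a made-up placeholder, and the function name is an assumption.

```python
from collections import defaultdict

# rId -> (name, age), from the slide.
RECORDS = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26),
    "r52": ("fred", 30),  # hypothetical: the slide lists r52 but not its age
}


def build_inverted_index(records, field):
    """Map each value of the given field to the set of record ids holding it."""
    index = defaultdict(set)
    for rid, rec in records.items():
        index[rec[field]].add(rid)
    return index


by_name = build_inverted_index(RECORDS, 0)
by_age = build_inverted_index(RECORDS, 1)

# Conjunctive query: intersect the two inverted lists.
answer = by_age[20] & by_name["fred"]
print(sorted(by_age[20]))
print(sorted(by_name["fred"]))
print(answer)
```

Only the record ids are touched until the intersection is known, so the data records themselves are fetched once, for the final answer.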
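Bit Maps
data records:

id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26

[Figure: an age index (18, 19, 20, 21, 22, 23, 25, 26, ...) points to one bit map per age value; bit i of a map is 1 when record i has that age. For example, the bit map for age 20 starts 11011, since records 1, 2, 4, and 5 have age 20.]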
40
Using Bit Maps
• Query:
  - Get people with age = 20 and name = “fred”
  - Bit map for age = 20: 1101100000
  - Bit map for name = “fred”: 0100000001
  - Answer is intersection (bitwise AND): 0100000000
• Good if domain cardinality small
• Bit vectors can be compressed
41
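Bit map intersection is a single bitwise AND. The sketch below stores each bit map in a Python integer and uses only the eight records shown on the Bit Maps slide, so its vectors are 8 bits long rather than the 10 bits in the slide's example; the function names are assumptions.

```python
# (name, age) for record ids 1..8, from the slide.
PEOPLE = [
    ("joe", 20), ("fred", 20), ("sally", 21), ("nancy", 20),
    ("tom", 20), ("pat", 25), ("dave", 21), ("jeff", 26),
]


def build_bitmaps(rows, field):
    """One integer per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[field]] = bitmaps.get(row[field], 0) | (1 << i)
    return bitmaps


def to_bits(bitmap, n):
    """Render a bit map as a string with record 1 leftmost, as on the slide."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(n))


by_name = build_bitmaps(PEOPLE, 0)
by_age = build_bitmaps(PEOPLE, 1)

# age = 20 AND name = "fred" is one bitwise AND of the two maps.
hits = by_age[20] & by_name["fred"]
matching_ids = [i + 1 for i in range(len(PEOPLE)) if (hits >> i) & 1]

print(to_bits(by_age[20], len(PEOPLE)))
print(to_bits(by_name["fred"], len(PEOPLE)))
print(to_bits(hits, len(PEOPLE)), matching_ids)
```

The index needs one bit map per distinct value, which is why the approach suits attributes with small domain cardinality.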
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

sale  prodId  storeId  date  amt        product  id  name  price
      p1      c1       1     12                  p1  bolt  10
      p2      c1       1     11                  p2  nut   5
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

joinTb  prodId  name  price  storeId  date  amt
        p1      bolt  10     c1       1     12
        p2      nut   5      c1       1     11
        p1      bolt  10     c3       1     50
        p2      nut   5      c2       1     8
        p1      bolt  10     c1       2     44
        p1      bolt  10     c2       2     4
42
Join Indexes
The join index is stored with each product tuple as a list of matching sale record ids.

product  id  name  price  jIndex
         p1  bolt  10     r1, r3, r5, r6
         p2  nut   5      r2, r4

sale  rId  prodId  storeId  date  amt
      r1   p1      c1       1     12
      r2   p2      c1       1     11
      r3   p1      c3       1     50
      r4   p2      c2       1     8
      r5   p1      c1       2     44
      r6   p1      c2       2     4
43
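A join index precomputes, for each product, which sale records it joins with, so the join becomes a lookup instead of a scan. The sketch below builds and uses such an index with plain dictionaries; the data is from the slide, while the function names are assumptions.

```python
# product id -> (name, price), from the slide.
PRODUCT = {"p1": ("bolt", 10), "p2": ("nut", 5)}

# sale rId -> (prodId, storeId, date, amt), from the slide.
SALE = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}


def build_join_index(sale):
    """Map each product id to the list of sale record ids that reference it."""
    index = {}
    for rid, (prod_id, *_rest) in sale.items():
        index.setdefault(prod_id, []).append(rid)
    return index


JINDEX = build_join_index(SALE)


def sales_of(prod_id):
    """Join one product tuple with its sales by following the join index."""
    name, price = PRODUCT[prod_id]
    return [(prod_id, name, price) + SALE[rid][1:]
            for rid in JINDEX.get(prod_id, [])]


print(JINDEX)
print(sales_of("p2"))
```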
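What to Materialize?
• Store in warehouse results useful for common queries
• Example: total sales

Base cube:

day 1      c1   c2   c3
      p1   12        50
      p2   11   8

day 2      c1   c2   c3
      p1   44   4
      p2

Candidate aggregates to materialize:

Sum over date:               p1: 56, 4, 50    p2: 11, 8    (columns c1, c2, c3)
Sum over date and product:   c1: 67, c2: 12, c3: 50
Sum over date and store:     p1: 110, p2: 19
Grand total:                 129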
44
Materialization Factors
• Type/frequency of queries
• Query response time
• Storage cost
• Update cost
45
Cube Aggregates Lattice
Lattice of group-bys, coarsest at the top:

                          all
          city          product          date
   city, product     city, date     product, date
                 city, product, date

Example contents:
  all:                  129
  city:                 c1: 67, c2: 12, c3: 50
  city, product:        p1: 56, 4, 50    p2: 11, 8    (columns c1, c2, c3)
  city, product, date:  the base cube (day 1: p1 12, 50; p2 11, 8; day 2: p1 44, 4)

Use a greedy algorithm to decide what to materialize.
46
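The slide does not say which greedy algorithm is meant. One common formulation, assumed here, repeatedly materializes the view with the largest benefit, where a view's benefit is the total reduction in rows scanned for all group-bys it can answer. In the sketch below the view sizes are the cell counts of the six-row example cube, and the function names and the budget of two extra views are assumptions.

```python
from itertools import combinations

DIMS = ("city", "product", "date")

# Number of non-empty cells of each group-by in the six-row example cube.
SIZE = {
    frozenset(DIMS): 6,
    frozenset({"city", "product"}): 5,
    frozenset({"city", "date"}): 5,
    frozenset({"product", "date"}): 3,
    frozenset({"city"}): 3,
    frozenset({"product"}): 2,
    frozenset({"date"}): 2,
    frozenset(): 1,  # the single "all" cell
}


def all_views(dims):
    """Every node of the lattice: all subsets of the dimensions."""
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]


def greedy_select(size, dims, k):
    """Pick k views to materialize in addition to the base cube."""
    views = all_views(dims)
    chosen = [frozenset(dims)]  # the base cube is always stored

    def cost(w):
        # Rows scanned to answer w: smallest stored view that contains w.
        return min(size[v] for v in chosen if w <= v)

    def benefit(v):
        # Total saving over every group-by that v could answer.
        return sum(max(0, cost(w) - size[v]) for w in views if w <= v)

    for _ in range(k):
        candidates = [v for v in views if v not in chosen]
        chosen.append(max(candidates, key=benefit))
    return chosen[1:]


picks = greedy_select(SIZE, DIMS, 2)
print([sorted(v) for v in picks])
```

On this tiny cube the first pick is (product, date), because it is small and answers four group-bys; with realistic sizes the same loop trades storage against query cost in the way the Materialization Factors slide lists.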
[Figure: dimension hierarchy for cities, with levels city, state, and all.]

cities  city  state
        c1    CA
        c2    NY
47
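[Figure: the cube aggregates lattice extended with the state level of the city hierarchy. Nodes shown:]

  all
  city; product; date; state
  city, product; city, date; product, date; state, product; state, date
  city, product, date; state, product, date

not all arcs shown...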
48
Interesting Hierarchy
[Figure: time hierarchy with nodes all, years, quarters, months, weeks, and days.]

Conceptual dimension table:

time  day  week  month  quarter  year
      1    1     1      1        2000
      2    1     1      1        2000
      3    1     1      1        2000
      4    1     1      1        2000
      5    1     1      1        2000
      6    1     1      1        2000
      7    1     1      1        2000
      8    2     1      1        2000
49
