
Data Warehousing

CPS216
Notes 13

Shivnath Babu
Warehousing
 Growing industry: $8 billion way back in 1998
 Range from desktop to huge:
    Walmart: 900-CPU, 2,700-disk, 23 TB Teradata system
 Lots of buzzwords, hype
    slice & dice, rollup, MOLAP, pivot, ...

2
Outline
 What is a data warehouse?
 Why a warehouse?
 Models & operations
 Implementing a warehouse
 Future directions

3
What is a Warehouse?
 Collection of diverse data
 subject oriented
 aimed at executive, decision maker
 often a copy of operational data
 with value-added data (e.g., summaries, history)
 integrated
 time-varying
 non-volatile
more
4
What is a Warehouse?
 Collection of tools
 gathering data
 cleansing, integrating, ...
 querying, reporting, analysis
 data mining
 monitoring, administering warehouse

5
Warehouse Architecture
Client Client

Query & Analysis

Metadata Warehouse

Integration

Source Source Source

6
Motivating Examples
 Forecasting
 Comparing performance of units
 Monitoring, detecting fraud
 Visualization

7
Why a Warehouse?
 Two Approaches:
 Query-Driven (Lazy)
 Warehouse (Eager)

Source Source

8
Query-Driven Approach

Client Client

Mediator

Wrapper Wrapper Wrapper

Source Source Source

9
Advantages of Warehousing
 High query performance
 Queries not visible outside warehouse
 Local processing at sources unaffected
 Can operate when sources unavailable
 Can query data not stored in a DBMS
 Extra information at warehouse
 Modify, summarize (store aggregates)
 Add historical information
10
Advantages of Query-Driven
 No need to copy data
 less storage
 no need to purchase data
 More up-to-date data
 Query needs can be unknown
 Only query interface needed at sources
 May be less draining on sources

11
OLTP vs. OLAP
 OLTP: On Line Transaction Processing
 Describes processing at operational sites
 OLAP: On Line Analytical Processing
 Describes processing at warehouse

12
OLTP vs. OLAP
OLTP                                     OLAP
 Mostly updates                          Mostly reads
 Many small transactions                 Queries long, complex
 MB-GB of data                           GB-TB of data
 Raw data                                Summarized, consolidated data
 Clerical users                          Decision-makers, analysts as users
 Up-to-date data
 Consistency, recoverability critical

13
Data Marts
 Smaller warehouses
 Spans part of organization
 e.g., marketing (customers, products, sales)
 Do not require enterprise-wide consensus
 but long term integration problems?

14
Warehouse Models & Operators
 Data Models
 relations
 stars & snowflakes
 cubes
 Operators
 slice & dice
 roll-up, drill down
 pivoting
 other
15
Star
product  prodId  name  price
         p1      bolt  10
         p2      nut   5

store  storeId  city
       c1       nyc
       c2       sfo
       c3       la

sale  orderId  date    custId  prodId  storeId  qty  amt
      o100     1/7/97  53      p1      c1       1    12
      o102     2/7/97  53      p2      c1       2    11
      o105     3/8/97  111     p1      c3       5    50

customer  custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la
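The star layout above can be tried out in any relational engine. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The column types, the foreign-key declarations, and the final query are assumptions made for illustration; the notes give only the table contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables (column types are assumed, not given in the notes).
cur.execute("CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER)")
cur.execute("CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT)")

# Fact table: one row per sale, with a key into each dimension table.
cur.execute("""
    CREATE TABLE sale (
        orderId TEXT PRIMARY KEY,
        date    TEXT,
        custId  INTEGER REFERENCES customer(custId),
        prodId  TEXT REFERENCES product(prodId),
        storeId TEXT REFERENCES store(storeId),
        qty     INTEGER,
        amt     INTEGER
    )
""")

cur.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [("p1", "bolt", 10), ("p2", "nut", 5)])
cur.executemany("INSERT INTO store VALUES (?, ?)",
                [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
cur.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                [(53, "joe", "10 main", "sfo"),
                 (81, "fred", "12 main", "sfo"),
                 (111, "sally", "80 willow", "la")])
# Order ids are assumed to follow the oNNN pattern throughout.
cur.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                 ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                 ("o105", "3/8/97", 111, "p1", "c3", 5, 50)])

# A typical star join: total amount per store city and product name.
rows = cur.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
print(rows)
```

The query touches the large table once and reaches each small table through a single key, which is the access pattern the star layout is meant to support.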

16
Star Schema

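sale:      orderId, date, custId, prodId, storeId, qty, amt
customer:  custId, name, address, city
product:   prodId, name, price
store:     storeId, city

sale.custId  -> customer.custId
sale.prodId  -> product.prodId
sale.storeId -> store.storeId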

17
Terms
 Fact table
 Dimension tables
 Measures
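
sale (fact table):           orderId, date, custId, prodId, storeId, qty, amt
customer (dimension table):  custId, name, address, city
product (dimension table):   prodId, name, price
store (dimension table):     storeId, city
measures:                    qty, amt (in sale)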

18
Dimension Hierarchies
Hierarchy: store -> sType; store -> city -> region

sType  tId  size   location
       t1   small  downtown
       t2   large  suburbs

store  storeId  cityId  tId  mgr
       s5       sfo     t1   joe
       s7       sfo     t2   fred
       s9       la      t1   nancy

city  cityId  pop  regId
      sfo     1M   north
      la      5M   south

region  regId  name
        north  cold region
        south  warm region

 snowflake schema
 constellations

19
Cube

Fact table view:

sale  prodId  storeId  amt
      p1      c1       12
      p2      c1       11
      p1      c3       50
      p2      c2       8

Multi-dimensional cube:

        c1   c2   c3
p1      12        50
p2      11   8

dimensions = 2

20
3-D Cube

Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

dimensions = 3
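To make the fact-table and cube views concrete, the following is a small sketch that stores the cube sparsely as a Python dictionary keyed by coordinates. The names `cube` and `day_slice` are made up for this example, and real MOLAP servers use far more compact array layouts.

```python
# Fact table rows: (prodId, storeId, date, amt), copied from the slide.
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# A sparse cube: one cell per (prodId, storeId, date) coordinate.
cube = {(p, s, d): amt for p, s, d, amt in sale}

def day_slice(cube, day):
    """Return the 2-D (prodId, storeId) grid for one value of the date dimension."""
    return {(p, s): amt for (p, s, d), amt in cube.items() if d == day}

day1 = day_slice(cube, 1)
day2 = day_slice(cube, 2)

print(day1[("p1", "c3")])      # cell for product p1, store c3 on day 1
print(("p2", "c3") in day1)    # empty cells are simply absent from the dict
print(day2)
```

Fixing one dimension to a single value, as `day_slice` does, is the "slice" half of slice & dice.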

21
ROLAP vs. MOLAP
 ROLAP: Relational On-Line Analytical Processing
 MOLAP: Multi-Dimensional On-Line Analytical Processing

22
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result: 81

23
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans  date  sum
     1     81
     2     48

24
Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

sale  prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

rollup: from the detailed table to the aggregated table
drill-down: from the aggregated table back to the detailed table
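The two GROUP BY queries can be run as-is against a small in-memory database. The sketch below uses Python's sqlite3 module (the ORDER BY clauses are added only to make the output order predictable) and then rolls the by-day-and-product totals up to by-day totals without going back to the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Roll-up: the coarser totals can be derived from the finer ones,
# because SUM can be re-aggregated.
rolled_up = {}
for prod_id, date, total in by_day_product:
    rolled_up[date] = rolled_up.get(date, 0) + total

print(by_day)
print(by_day_product)
print(rolled_up)
```

Drill-down goes the other way and needs the finer-grained data; it cannot be derived from the by-day totals alone.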

25
Aggregates
 Operators: sum, count, max, min, median, avg
 “Having” clause
 Using dimension hierarchy
    average by region (within store)
    maximum by month (within date)

26
Cube Aggregation
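Example: computing sums

day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

Summed over date:

        c1   c2   c3
p1      56   4    50
p2      11   8

Summed over date and product:

        c1   c2   c3
sum     67   12   50

Summed over date and store:

        sum
p1      110
p2      19

Summed over everything: 129

rollup: toward the coarser sums
drill-down: back toward the detailed cube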
27
Cube Operators

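day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

Summed over date:

        c1   c2   c3
p1      56   4    50
p2      11   8

Summed over date and product:

        c1   c2   c3
sum     67   12   50

Summed over date and store:

        sum
p1      110
p2      19

Summed over everything: 129

sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129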

28
Extended Cube

day 1   c1   c2   c3   *
p1      12        50   62
p2      11   8         19
*       23   8    50   81

day 2   c1   c2   c3   *
p1      44   4         48
p2
*       44   4         48

*       c1   c2   c3   *
p1      56   4    50   110
p2      11   8         19
*       67   12   50   129

Example: sale(*,p2,*) = 19
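One way to read the extended cube is that every fact contributes to eight cells, one for each choice of "actual value or *" in each of the three dimensions. The sketch below precomputes all of those cells; the dictionary `extended` and the helper `sale` are illustrative names, and the argument order (store, product, date) follows the sale(c1,*,*) notation used on the slides.

```python
from itertools import product as cartesian

# Fact table rows: (prodId, storeId, date, amt).
sale_rows = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

ALL = "*"
extended = {}
for prod, store, date, amt in sale_rows:
    # Each fact is added to 2^3 = 8 cells: value-or-* in every dimension.
    for key in cartesian((store, ALL), (prod, ALL), (date, ALL)):
        extended[key] = extended.get(key, 0) + amt

def sale(store, prod, date):
    """Look up one cell of the extended cube; missing cells count as 0."""
    return extended.get((store, prod, date), 0)

print(sale("c1", ALL, ALL))   # all sales at c1
print(sale("c2", "p2", ALL))  # p2 at c2, over all dates
print(sale(ALL, "p2", ALL))   # all sales of p2
print(sale(ALL, ALL, ALL))    # grand total
```

Storing every * cell trades space and update cost for lookup speed, which is the trade-off the later slides on materialization discuss.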

29
Aggregation Using Hierarchies

Hierarchy: customer -> region -> country

day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

Rolled up to region (summed over date):

        region A   region B
p1      56         54
p2      11         8

(customer c1 in Region A;
customers c2, c3 in Region B)
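Rolling up along a hierarchy amounts to replacing each key by its parent before summing. A minimal sketch, assuming the c1/c2/c3-to-region mapping stated on the slide; `region_of` and `totals` are illustrative names.

```python
# Fact table rows: (prodId, c-id, date, amt). The slide calls c1..c3 customers here.
sale_rows = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# One level of the dimension hierarchy, as stated on the slide.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

totals = {}
for prod, cid, _date, amt in sale_rows:
    key = (prod, region_of[cid])          # replace the leaf by its parent
    totals[key] = totals.get(key, 0) + amt

for key in sorted(totals):
    print(key, totals[key])
```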

30
Pivoting
Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

Pivot turns unique values from one column into unique columns in the output:

        c1   c2   c3
p1      56   4    50
p2      11   8
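The pivot shown above can be sketched in a few lines of plain Python. The function name `pivot` and the nested-dictionary result are choices made for this example; spreadsheet tools and some SQL dialects offer pivoting as a built-in operation.

```python
# Fact table rows: (prodId, storeId, date, amt).
sale_rows = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def pivot(rows):
    """Turn distinct storeId values into columns, summing amt per (prodId, storeId)."""
    table = {}
    for prod, store, _date, amt in rows:
        row = table.setdefault(prod, {})
        row[store] = row.get(store, 0) + amt
    return table

pivoted = pivot(sale_rows)
columns = sorted({store for row in pivoted.values() for store in row})

# Print one output row per product, one output column per store.
print("     " + "  ".join(f"{c:>3}" for c in columns))
for prod in sorted(pivoted):
    cells = "  ".join(f"{pivoted[prod].get(c, ''):>3}" for c in columns)
    print(f"{prod:<5}{cells}")
```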

31
Derived Data
 Derived Warehouse Data
 indexes
 aggregates
 materialized views (next slide)
 When to update derived data?
 Incremental vs. refresh

32
Materialized Views
 Define new warehouse relations using SQL expressions

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

product  id  name  price
         p1  bolt  10
         p2  nut   5

joinTb (does not exist at any source)

joinTb  prodId  name  price  storeId  date  amt
        p1      bolt  10     c1       1     12
        p2      nut   5      c1       1     11
        p1      bolt  10     c3       1     50
        p2      nut   5      c2       1     8
        p1      bolt  10     c1       2     44
        p1      bolt  10     c2       2     4
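The following sketch illustrates joinTb and the "incremental vs. refresh" question from the previous slide. It uses Python's sqlite3 module; since SQLite has no materialized views, the view is emulated by storing the query result in an ordinary table. The new sale row is hypothetical, and the incremental rule shown handles only insertions into sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id TEXT, name TEXT, price INTEGER)")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

JOIN_QUERY = """
    SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
    FROM sale s JOIN product p ON s.prodId = p.id
"""

# Materialize: store the query result as a table in the warehouse.
conn.execute("CREATE TABLE joinTb AS " + JOIN_QUERY)

# A new sale arrives from a source (hypothetical row, not in the notes).
new_sale = ("p2", "c3", 2, 7)
conn.execute("INSERT INTO sale VALUES (?, ?, ?, ?)", new_sale)

# Incremental update: join only the new row against product and append it.
conn.execute(
    "INSERT INTO joinTb SELECT ?, name, price, ?, ?, ? FROM product WHERE id = ?",
    (new_sale[0], new_sale[1], new_sale[2], new_sale[3], new_sale[0]),
)

# Refresh would instead recompute the whole join; both must agree.
incremental = sorted(conn.execute("SELECT * FROM joinTb").fetchall())
refreshed = sorted(conn.execute(JOIN_QUERY).fetchall())
print(incremental == refreshed)
```

The incremental path does work proportional to the change, while a refresh does work proportional to the whole table.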

33
Processing
 ROLAP servers vs. MOLAP servers
 Index Structures
 What to Materialize?
 Algorithms

Client Client

Query & Analysis

Metadata Warehouse

Integration

Source Source Source

34
ROLAP Server
 Relational OLAP Server

tools

ROLAP server (utilities)

relational DBMS

Special indices, tuning; schema is “denormalized”

sale  prodId  date  sum
      p1      1     62
      p2      1     19
      p1      2     48

35
MOLAP Server
 Multi-Dimensional OLAP Server

M.D. tools

multi-dimensional server (utilities)
   could also sit on relational DBMS

Sales cube dimensions:
   Product: milk, soda, eggs, soap
   Date: 1, 2, 3, 4
   City: A, B

36
Index Structures
 Traditional Access Methods
 B-trees, hash tables, R-trees, grids, …
 Popular in Warehouses
 inverted lists
 bit map indexes
 join indexes
 text indexes

37
Inverted Lists
age index:       18, 19, 20, 21, 22, 23, 25, 26, ...

inverted lists:  20 -> r4, r18, r34, r35
                 21 -> r5, r19, r37, r40

data records:

rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...
38
Using Inverted Lists
 Query:
 Get people with age = 20 and name = “fred”
 List for age = 20: r4, r18, r34, r35
 List for name = “fred”: r18, r52
 Answer is intersection: r18
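The lookup above can be sketched with one dictionary of record-id sets per indexed field. The record r52 appears in the slide's list for "fred" but not among the records shown, so its age here (30) is a made-up placeholder; the function name is also illustrative.

```python
from collections import defaultdict

# Records keyed by record id: (name, age).
records = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26),
    "r52": ("fred", 30),  # hypothetical: only the id r52 comes from the slide
}

def build_inverted_index(records, field):
    """Map each distinct value of one field to the set of record ids holding it."""
    index = defaultdict(set)
    for rid, rec in records.items():
        index[rec[field]].add(rid)
    return index

name_index = build_inverted_index(records, 0)
age_index = build_inverted_index(records, 1)

# age = 20 AND name = "fred": intersect the two inverted lists.
answer = age_index[20] & name_index["fred"]
print(sorted(answer))
```

Only the records in the intersection need to be fetched from the data file.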

39
Bit Maps
age index:  18, 19, 20, 21, 22, 23, 25, 26, ...

bit maps (one bit per record, in record order):
   20 -> 1101100000
   21 -> 0010001011

data records:

id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...

40
Using Bit Maps
 Query:
 Get people with age = 20 and name = “fred”
 List for age = 20: 1101100000
 List for name = “fred”: 0100000001
 Answer is intersection: 0100000000

 Good if domain cardinality small


 Bit vectors can be compressed
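A bit map query reduces to a bitwise AND. The sketch below uses Python integers as uncompressed bit vectors over the eight records shown on the previous slide (so the vectors are 8 bits long rather than 10); `build_bitmaps` and `as_bits` are illustrative helper names.

```python
# Data records in storage order: (id, name, age).
records = [
    (1, "joe", 20), (2, "fred", 20), (3, "sally", 21), (4, "nancy", 20),
    (5, "tom", 20), (6, "pat", 25), (7, "dave", 21), (8, "jeff", 26),
]

def build_bitmaps(records, column):
    """Map each distinct value to an int whose bit i is set when record i has that value."""
    bitmaps = {}
    for pos, rec in enumerate(records):
        value = rec[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << pos)
    return bitmaps

def as_bits(bitmap, n):
    """Render a bitmap as a string with the first record on the left."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(n))

name_maps = build_bitmaps(records, 1)
age_maps = build_bitmaps(records, 2)

# age = 20 AND name = "fred" is a single bitwise AND of two bit vectors.
hits = age_maps[20] & name_maps["fred"]
matching_ids = [records[i][0] for i in range(len(records)) if (hits >> i) & 1]

print(as_bits(age_maps[20], len(records)))
print(as_bits(hits, len(records)))
print(matching_ids)
```

One bit vector is needed per distinct value, which is why the approach suits columns with small domain cardinality.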

41
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

product  id  name  price
         p1  bolt  10
         p2  nut   5

joinTb  prodId  name  price  storeId  date  amt
        p1      bolt  10     c1       1     12
        p2      nut   5      c1       1     11
        p1      bolt  10     c3       1     50
        p2      nut   5      c2       1     8
        p1      bolt  10     c1       2     44
        p1      bolt  10     c2       2     4

42
Join Indexes
join index

product  id  name  price  jIndex
         p1  bolt  10     r1, r3, r5, r6
         p2  nut   5      r2, r4

sale  rId  prodId  storeId  date  amt
      r1   p1      c1       1     12
      r2   p2      c1       1     11
      r3   p1      c3       1     50
      r4   p2      c2       1     8
      r5   p1      c1       2     44
      r6   p1      c2       2     4
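A join index precomputes, for each product, which sale records join with it. A minimal sketch over the tables above; `join_index` and `sales_of` are names chosen for this example.

```python
# Dimension table: product id -> (name, price).
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}

# Fact table keyed by record id: (prodId, storeId, date, amt).
sale = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}

# Build the jIndex column: product id -> list of matching sale record ids.
join_index = {}
for rid, row in sale.items():
    join_index.setdefault(row[0], []).append(rid)

def sales_of(prod_id):
    """Answer the join for one product by following the index, with no scan of sale."""
    name, price = product[prod_id]
    return [(prod_id, name, price) + sale[rid][1:] for rid in join_index.get(prod_id, [])]

print(join_index)
print(sales_of("p2"))
```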

43
What to Materialize?
 Store in warehouse results useful for common queries
 Example: total sales

day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

Candidates to materialize:

        c1   c2   c3
p1      56   4    50
p2      11   8

        c1   c2   c3
sum     67   12   50

        sum
p1      110
p2      19

total   129

44
Materialization Factors
 Type/frequency of queries
 Query response time
 Storage cost
 Update cost

45
Cube Aggregates Lattice
Lattice of group-bys, coarsest at the top:

all
city    product    date
city, product    city, date    product, date
city, product, date

Example contents:

all: 129

city:
        c1   c2   c3
sum     67   12   50

city, product:
        c1   c2   c3
p1      56   4    50
p2      11   8

city, product, date:
day 1   c1   c2   c3
p1      12        50
p2      11   8

day 2   c1   c2   c3
p1      44   4
p2

use greedy algorithm to decide what to materialize
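The slide says only to "use greedy algorithm". The sketch below is one common formulation, offered as an assumption rather than as the specific algorithm these notes intend: the cost of answering a query is taken to be the row count of the smallest materialized view that contains its attributes, and each step picks the view that saves the most rows in total. Row counts are computed from the six-row fact table, and the attribute names prodId/storeId/date stand in for product/city/date.

```python
from itertools import combinations

# Fact table rows: (prodId, storeId, date, amt).
sale_rows = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
DIMS = ("prodId", "storeId", "date")

def view_size(dims):
    """Rows in the group-by on `dims`: the number of distinct value combinations."""
    idx = [DIMS.index(d) for d in dims]
    return len({tuple(row[i] for i in idx) for row in sale_rows})

# All 2^3 lattice nodes, each a frozenset of grouping attributes.
views = [frozenset(c) for r in range(len(DIMS) + 1) for c in combinations(DIMS, r)]
size = {v: view_size(v) for v in views}

def cost(query, materialized):
    """Rows read to answer `query` from the cheapest materialized superset."""
    return min(size[m] for m in materialized if query <= m)

def benefit(candidate, materialized):
    """Total rows saved, over all queries the candidate can answer, if it is added."""
    return sum(max(0, cost(q, materialized) - size[candidate])
               for q in views if q <= candidate)

def greedy(k):
    """Always keep the finest view, then add k more views by largest benefit."""
    chosen = [frozenset(DIMS)]
    for _ in range(k):
        candidates = [v for v in views if v not in chosen]
        chosen.append(max(candidates, key=lambda v: benefit(v, chosen)))
    return chosen

picked = greedy(1)
print(sorted(picked[1]), benefit(picked[1], picked[:1]))
```

On this tiny example the first pick is the (product, date) view: it has only 3 rows and can answer four of the eight group-bys.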

46
Dimension Hierarchies

Hierarchy: city -> state -> all

cities  city  state
        c1    CA
        c2    NY

47
Dimension Hierarchies
Lattice nodes, by number of grouping attributes:

0:  all
1:  city    product    date    state
2:  city, product    city, date    product, date    state, date    state, product
3:  city, product, date    state, product, date

not all arcs shown...

48
Interesting Hierarchy
Hierarchy: days -> weeks -> years -> all
           days -> months -> quarters -> years -> all

conceptual dimension table:

time  day  week  month  quarter  year
      1    1     1      1        2000
      2    1     1      1        2000
      3    1     1      1        2000
      4    1     1      1        2000
      5    1     1      1        2000
      6    1     1      1        2000
      7    1     1      1        2000
      8    2     1      1        2000
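What presumably makes this hierarchy interesting is that weeks and months are separate branches: a week does not nest inside a month, so week totals cannot be rolled up into month totals. The short sketch below checks this with Python's datetime module. It uses ISO week numbering, which is an assumption and differs from the simple "days 1-7 are week 1" numbering in the table; `time_row` is an illustrative helper.

```python
from datetime import date

def time_row(d):
    """Build one row of a time dimension table: (day, ISO week, month, quarter, year)."""
    return (d.day, d.isocalendar()[1], d.month, (d.month - 1) // 3 + 1, d.year)

a = date(2000, 1, 31)   # a Monday
b = date(2000, 2, 1)    # the Tuesday that follows it

same_week = a.isocalendar()[1] == b.isocalendar()[1]
same_month = a.month == b.month

print(time_row(a))
print(time_row(b))
print(same_week, same_month)
```

Aggregates by week and by month therefore both have to be computed from day-level data.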

49
