Data Warehouse: Fact Table
Lecture 2
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
What is a Data Warehouse?
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous
data sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is
converted.
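The conversion step can be sketched as follows. This is only an illustration of the hotel price example: the exchange rates, the flat breakfast value, the field names, and the target convention (USD, tax included, breakfast excluded) are all assumptions made for the sketch, not part of the lecture.

```python
from dataclasses import dataclass

# Hypothetical exchange rates and breakfast value, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "CAD": 0.75}
BREAKFAST_USD = 12.0


@dataclass
class SourcePrice:
    amount: float             # price as recorded by the source system
    currency: str             # source currency code
    tax_included: bool        # does the source price already include tax?
    tax_rate: float           # e.g. 0.10 for 10%
    breakfast_included: bool  # does the source price cover breakfast?


def to_warehouse_price(p: SourcePrice) -> float:
    """Convert to the assumed warehouse convention: USD, tax in, breakfast out."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1.0 + p.tax_rate
    if p.breakfast_included:
        usd -= BREAKFAST_USD
    return round(usd, 2)


# Two sources describing a room in different conventions
price_a = to_warehouse_price(SourcePrice(100.0, "USD", True, 0.10, False))
price_b = to_warehouse_price(SourcePrice(100.0, "EUR", False, 0.10, True))
```

After conversion, prices from both sources are expressed in one convention and can be compared or aggregated directly.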
Data Warehouse—Time Variant
Data Warehouse—Non-Volatile
Data Warehouse vs. Heterogeneous DBMS
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
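The difference in access patterns can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch: the orders table and its rows are hypothetical, and a single in-memory table stands in for both systems only for brevity (in practice the warehouse is a separate store).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "year INTEGER, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "alice", 2022, "East", 120.0),
        (2, "bob", 2022, "West", 80.0),
        (3, "alice", 2023, "East", 200.0),
        (4, "carol", 2023, "West", 50.0),
    ],
)

# OLTP-style access: a short transaction that updates one current record.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (90.0, 2))

# OLAP-style access: a read-only query that consolidates historical data.
olap_result = conn.execute(
    "SELECT year, region, SUM(amount) FROM orders "
    "GROUP BY year, region ORDER BY year, region"
).fetchall()
```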
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
Decision support requires historical data which
operational DBs do not typically maintain
Decision Support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
Different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
A Multi-Dimensional Data Model
[Figure: data cube with a Location dimension (values include Canada and Mexico) and "sum" cells holding the totals]
4-D Data Cube
[Figure: a 4-D data cube shown as a series of 3-D cubes, one each for Supplier 1, Supplier 2, and Supplier 3]
Cube: A Lattice of Cuboids
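0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)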
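The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions. The sketch below assumes no concept hierarchies, so that n dimensions give exactly 2^n cuboids; the function name is illustrative.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_lattice(dims):
    """Return {k: [cuboids with k dimensions]} for every subset of dims."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    # k = 0 is the apex cuboid ("all"); k = len(DIMENSIONS) is the base cuboid
    print(f"{k}-D cuboids: {cuboids}")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

For the four dimensions above, the count is 1 + 4 + 6 + 4 + 1 = 16 cuboids.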
Conceptual Modeling of Data Warehouses
[Figure: snowflake schema for sales]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province, country
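A reduced version of this schema can be tried out with Python's built-in sqlite3 module. The sketch keeps only the item, supplier, location, and city tables (time and branch are left out for brevity), and all rows are made-up sample data. It shows how a query must join through the normalized dimension tables (item to supplier, location to city) to reach supplier_type and country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, province TEXT,
                   country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key),
                    location_key INTEGER REFERENCES location(location_key),
                    units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical sample rows
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [(1, "wholesale"), (2, "retail")])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)",
                 [(10, "tv", "BrandA", 1), (11, "phone", "BrandB", 2)])
conn.executemany("INSERT INTO city VALUES (?, ?, ?, ?)",
                 [(100, "Vancouver", "BC", "Canada"),
                  (101, "Monterrey", "NL", "Mexico")])
conn.executemany("INSERT INTO location VALUES (?, ?, ?)",
                 [(1000, "1 Main St", 100), (1001, "2 Oak Ave", 101)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [(10, 1000, 2, 800.0), (11, 1000, 1, 300.0),
                  (10, 1001, 1, 400.0)])

# dollars_sold by country and supplier type: two joins per snowflaked dimension
by_country_and_type = conn.execute("""
    SELECT c.country, s.supplier_type, SUM(f.dollars_sold)
    FROM sales f
    JOIN item i ON f.item_key = i.item_key
    JOIN supplier s ON i.supplier_key = s.supplier_key
    JOIN location l ON f.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.country, s.supplier_type
    ORDER BY c.country, s.supplier_type
""").fetchall()
```

In a star schema the supplier and city attributes would be folded into the item and location tables, removing two of the four joins at the cost of some redundancy.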
Example of Fact Constellation
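[Figure: fact constellation schema; the Sales and Shipping fact tables share the time and item dimensions]
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key, ...
Shipping Fact Table: time_key, item_key, shipper_key, from_location, ...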
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
Defining a Snowflake Schema in DMQL
Measures: Three Categories
Measure: a function evaluated on aggregated data
corresponding to given dimension-value pairs.
Measures can be:
distributive: if applying the function to the partial results of n
partitions gives the same value as applying it to all the data.
E.g., count(), sum(), min(), max().
algebraic: if it can be computed from arguments obtained
by applying distributive aggregate functions.
E.g., avg()=sum()/count(), min_N(), standard_deviation().
holistic: if it is not algebraic.
E.g., median(), mode(), rank().
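The three categories can be checked on a small partitioned data set. The sketch below uses made-up numbers and splits them into three partitions, as they might be spread over sub-cubes.

```python
from statistics import median

# Hypothetical data, split into three partitions
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: combine per-partition results directly.
total = sum(sum(part) for part in partitions)      # sum of sums
count = sum(len(part) for part in partitions)      # sum of counts
largest = max(max(part) for part in partitions)    # max of maxes

# Algebraic: avg() needs only two distributive results, sum() and count().
average = total / count

# Holistic: partial medians are not enough to recover the overall median.
median_of_medians = median([median(part) for part in partitions])
true_median = median(all_values)
```

Here the median of the partition medians is 3 while the median of all values is 3.5, so median() cannot be computed from a bounded set of per-partition summaries the way avg() can.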
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
A Concept Hierarchy
• Concept hierarchies allow data to be handled
at varying levels of abstraction
[Figure: example concept hierarchies; surviving level labels: Office, Day, Month]
Typical OLAP Operations (Fig 2.10)
Roll up (drill-up): summarize data
by climbing up concept hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
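The operations above can be sketched on a tiny cube stored as a Python dictionary. Everything here is illustrative: the cell values, the city-to-country hierarchy, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, city, item) -> units sold
cells = {
    ("Q1", "Vancouver", "tv"): 10,
    ("Q1", "Toronto", "tv"): 7,
    ("Q1", "Monterrey", "phone"): 4,
    ("Q2", "Vancouver", "phone"): 5,
    ("Q2", "Monterrey", "tv"): 3,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Monterrey": "Mexico"}


def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), units in cube.items():
        out[(quarter, CITY_TO_COUNTRY[city], item)] += units
    return dict(out)


def roll_up_drop_item(cube):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (quarter, loc, _item), units in cube.items():
        out[(quarter, loc)] += units
    return dict(out)


def slice_quarter(cube, quarter):
    """Slice: fix one value on the time dimension."""
    return {(loc, item): u for (q, loc, item), u in cube.items() if q == quarter}


def dice(cube, quarters, items):
    """Dice: select on two dimensions at once."""
    return {k: u for k, u in cube.items() if k[0] in quarters and k[2] in items}


def pivot(table):
    """Pivot a 2-D table by swapping its two axes."""
    return {(b, a): u for (a, b), u in table.items()}


by_country = roll_up_location(cells)
by_quarter_country = roll_up_drop_item(by_country)
q1 = slice_quarter(cells, "Q1")
q1_tv = dice(cells, {"Q1"}, {"tv"})
q1_pivoted = pivot(q1)
```

Drill-down is the reverse of roll_up_location: going from by_country back to cells requires the more detailed data to be available, since the city-level values cannot be recovered from the country totals.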
Querying Using a Star-Net Model
[Figure: star-net model. Each radial line is a dimension: Customer Orders, Shipping Method, Customer, Time, Product, Location, Organization, Promotion. Abstraction levels marked on the lines include CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, ANNUALLY, QTRLY, DAILY, PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM, CITY, COUNTRY, REGION, SALES PERSON, DISTRICT, DIVISION.]
Each circle is called a footprint
Data Warehouse Design Process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Multi-Tiered Architecture
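[Figure: multi-tiered architecture. Operational DBs and other sources are extracted, transformed, loaded, and refreshed (through a monitor and integrator, with metadata) into the data warehouse and data marts; an OLAP server serves the front-end tools for analysis, query, reports, and data mining.]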
OLAP Server Architectures
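Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and
manage warehouse data, with OLAP middleware providing
additional tools and services
greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse
matrix techniques)
fast indexing to pre-computed summarized data
Specialized SQL servers
specialized support for SQL queries over star/snowflake
schemas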
Efficient Data Cube Computation
Multiway Array Aggregation for MOLAP
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64; chunks 1 to 4 run along A at b0, c0]
What is the best traversing order to do multi-way aggregation?
After scan {1,2,3,4}:
• b0c0 chunk is computed
• a0c0 and a0b0 are not computed
After scan 1-13:
• a0c0 and b0c0 chunks are computed
• a0b0 is not computed (we will need to scan 1-49)
To compute all three planes in this order, we need to keep in memory:
• a single b-c chunk
• 4 a-c chunks
• 16 a-b chunks
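These counts can be reproduced by simulating the scan. The sketch below assumes the chunk numbering of the figure (A varies fastest, then B, then C, with 4 partitions per dimension) and only tracks which 2-D chunks are partially aggregated at each step; it does not aggregate real data, and the function names are illustrative.

```python
N = 4  # partitions per dimension, as in the 4 x 4 x 4 figure


def chunk_id(a, b, c):
    """Chunk number in the figure: a varies fastest, then b, then c."""
    return 1 + a + N * b + N * N * c


def first_and_last(plane):
    """For each 2-D chunk of `plane`, the first and last 3-D chunk feeding it."""
    first, last = {}, {}
    for c in range(N):
        for b in range(N):
            for a in range(N):
                key = {"bc": (b, c), "ac": (a, c), "ab": (a, b)}[plane]
                cid = chunk_id(a, b, c)
                first.setdefault(key, cid)  # scan order is increasing cid
                last[key] = cid
    return first, last


def peak_live(plane):
    """Largest number of 2-D chunks of `plane` held in memory at once."""
    first, last = first_and_last(plane)
    peak = 0
    for cid in range(1, N ** 3 + 1):
        live = sum(1 for k in first if first[k] <= cid <= last[k])
        peak = max(peak, live)
    return peak


bc_done = first_and_last("bc")[1][(0, 0)]  # b0c0 complete after this chunk
ac_done = first_and_last("ac")[1][(0, 0)]  # a0c0 complete after this chunk
ab_done = first_and_last("ab")[1][(0, 0)]  # a0b0 complete after this chunk
memory_chunks = {plane: peak_live(plane) for plane in ("bc", "ac", "ab")}
```

Because the a-b plane stays in memory the longest, a scan order that makes the smallest plane the one kept whole reduces the memory requirement.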
Indexing OLAP Data: Bitmap Index
Suitable for low cardinality domains
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value
for the indexed column
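A bitmap index can be sketched with Python integers used as bit vectors. The two columns and their values are hypothetical sample data; because Python integers have arbitrary length, the leading zero bits of each vector are implicit rather than stored.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int used as a bit vector (bit i = row i)."""
    index = defaultdict(int)
    for i, value in enumerate(column):
        index[value] |= 1 << i
    return dict(index)


def rows_of(bits):
    """Decode a bit vector into the list of matching row numbers."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical base table with two low-cardinality columns
region = ["Asia", "Europe", "Asia", "America", "Europe"]
kind = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]
region_idx = build_bitmap_index(region)
kind_idx = build_bitmap_index(kind)

# region = 'Europe' AND kind = 'Dealer' becomes one bitwise AND
europe_dealers = rows_of(region_idx["Europe"] & kind_idx["Dealer"])

# region = 'Asia' OR region = 'America' becomes one bitwise OR
asia_or_america = rows_of(region_idx["Asia"] | region_idx["America"])
```

Each distinct value needs its own bit vector, which is why the technique suits low-cardinality columns.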
Online Aggregation
Efficient Processing of OLAP Queries
Discovery-Driven Exploration of Data Cubes
Hypothesis-driven: exploration by user, huge search space
Discovery-driven (Sarawagi et al.’98)
pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation
Exception: significantly different from the value anticipated,
based on a statistical model
Visual cues such as background color are used to reflect the
degree of exception of each cell
Computation of exception indicator can be overlapped with cube
construction
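The idea of an exception indicator can be illustrated with a deliberately simplified stand-in. The sketch below is not the statistical model of Sarawagi et al.; it assumes a plain additive model (row mean + column mean - grand mean) for the anticipated value of each cell in a 2-D table and flags cells whose standardized residual exceeds an arbitrary threshold. The table values and the threshold of 1.5 are made up.

```python
from statistics import mean, pstdev


def exception_scores(table):
    """Standardized residuals under a simple additive row + column model."""
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    grand = mean([v for row in table for v in row])
    residuals = [
        [v - (row_means[i] + col_means[j] - grand) for j, v in enumerate(row)]
        for i, row in enumerate(table)
    ]
    spread = pstdev([r for row in residuals for r in row])
    if spread == 0:  # perfectly additive table: nothing is exceptional
        return [[0.0 for _ in row] for row in table]
    return [[r / spread for r in row] for row in residuals]


# Hypothetical sales table; cell (1, 1) is unusually high
sales = [
    [0.0, 1.0, 2.0],
    [10.0, 20.0, 12.0],
    [20.0, 21.0, 22.0],
]
scores = exception_scores(sales)
THRESHOLD = 1.5  # arbitrary cut-off for this illustration
exceptions = [
    (i, j)
    for i, row in enumerate(scores)
    for j, s in enumerate(row)
    if abs(s) > THRESHOLD
]
```

In a browsing tool, the score of each cell would drive the visual cue, for example a darker background for a larger absolute score.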
Examples: Discovery-Driven Data Cubes
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
Differences among the three tasks
From On-Line Analytical Processing
to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data