Data Mining: Concepts and Techniques
— Unit 2 —
Unit 2: Data Warehousing and On-line Analytical
Processing
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from
the organization’s operational database
Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data
sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
OLTP vs. OLAP
                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date,          historical,
                     detailed, flat relational,    summarized, multidimensional,
                     isolated                      integrated, consolidated
usage                repetitive                    ad hoc
access               read/write,                   lots of scans
                     index/hash on primary key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100 MB to GB                  100 GB to TB
metric               transaction throughput        query throughput, response time
Why a Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
missing data: Decision support requires historical data which
operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis
directly on relational databases
Data Warehouse: A Multi-Tiered Architecture
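[Figure: multi-tiered architecture. Operational DBs and other sources are extracted, cleaned, transformed, loaded, and refreshed into the data warehouse and data marts, overseen by a monitor & integrator and described by metadata. The warehouse serves an OLAP server, which supports analysis, query, reports, and data mining.]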
Extraction, Transformation, and Loading (ETL)
Data extraction
get data from multiple, heterogeneous, and external
sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse
format
Load
sort, summarize, consolidate, compute views, check
warehouse
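The ETL steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the two in-memory sources (`CRM_ROWS`, `SHOP_ROWS`), every field name, and the helpers `extract`, `clean`, `transform`, `load`, and `run_etl` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical source data: two heterogeneous sources with different conventions.
CRM_ROWS = [
    {"cust": "alice", "country": "USA", "amount": "120.50"},
    {"cust": "bob", "country": "U.S.A.", "amount": "n/a"},  # unparseable amount
]
SHOP_ROWS = [
    {"customer_name": "Carol", "ctry": "Canada", "total_cents": 9900},
]


def extract():
    """Data extraction: gather rows from every source, tagged with their origin."""
    for row in CRM_ROWS:
        yield "crm", row
    for row in SHOP_ROWS:
        yield "shop", row


def clean(source, row):
    """Data cleaning: detect errors; return None when a row cannot be rectified."""
    if source == "crm":
        try:
            float(row["amount"])
        except ValueError:
            return None
    return row


def transform(source, row):
    """Data transformation: convert each source format to one warehouse format."""
    country_codes = {"USA": "US", "U.S.A.": "US", "Canada": "CA"}
    if source == "crm":
        return {"customer": row["cust"].title(),
                "country": country_codes[row["country"]],
                "amount": float(row["amount"])}
    return {"customer": row["customer_name"].title(),
            "country": country_codes[row["ctry"]],
            "amount": row["total_cents"] / 100}


def load(rows):
    """Load: sort the detail rows and compute one summary view per country."""
    detail = sorted(rows, key=lambda r: r["customer"])
    summary = defaultdict(float)
    for r in detail:
        summary[r["country"]] += r["amount"]
    return detail, dict(summary)


def run_etl():
    cleaned = ((s, clean(s, r)) for s, r in extract())
    transformed = [transform(s, r) for s, r in cleaned if r is not None]
    return load(transformed)


detail, summary = run_etl()
```

In this sketch the row with the unparseable amount is dropped during cleaning, and the two spellings of the same country are reconciled to one code during transformation, which is the kind of inconsistency the integration step is meant to remove.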
Metadata Repository
Metadata is the data that defines warehouse objects. It stores:
Description of the structure of the data warehouse
schema, views, dimensions, hierarchies, derived data definitions, data mart
locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
Chapter 4: Data Warehousing and On-line Analytical
Processing
From Tables and Spreadsheets to
Data Cubes
A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
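To make the base and apex cuboids concrete, the sketch below aggregates a tiny fact table at different levels. The fact rows, the three dimension names, and the `cuboid` helper are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, item, location, dollars_sold).
FACTS = [
    ("Q1", "TV", "Toronto", 100.0),
    ("Q1", "PC", "Toronto", 250.0),
    ("Q2", "TV", "Vancouver", 300.0),
    ("Q2", "TV", "Toronto", 50.0),
]
DIMENSIONS = ("time", "item", "location")


def cuboid(group_by):
    """Sum dollars_sold grouped by the chosen subset of dimensions."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(float)
    for row in FACTS:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the measure is the last field of each row
    return dict(totals)


base = cuboid(DIMENSIONS)    # 3-D base cuboid: one cell per distinct combination
by_item = cuboid(("item",))  # a 1-D cuboid: totals per item
apex = cuboid(())            # 0-D apex cuboid: a single grand total
```

Grouping by every dimension leaves the data unsummarized, while grouping by none collapses it to one cell keyed by the empty tuple.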
January 19, 2022 Data Mining: Concepts and Techniques 16
Cube: A Lattice of Cuboids
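[Figure: lattice of cuboids over the dimensions time, item, location, and supplier]
0-D (apex) cuboid: all
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)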
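The lattice can be enumerated directly: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids when concept hierarchies are ignored. The sketch below assumes exactly that simplification; the `lattice` helper is a hypothetical name.

```python
from itertools import combinations


def lattice(dimensions):
    """Map each level k to the k-D cuboids, i.e. the k-element subsets."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


levels = lattice(("time", "item", "location", "supplier"))
total = sum(len(cuboids) for cuboids in levels.values())

for k, cuboids in levels.items():
    print(f"{k}-D cuboids: {cuboids}")
```

Level 0 holds only the empty grouping (the apex cuboid) and level 4 holds the single base cuboid.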
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of the star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to a snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, state_or_province, country)
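A star schema like this one can be tried out with Python's built-in `sqlite3` module. The sketch below is trimmed to two of the four dimensions for brevity; the table names `time_dim`, `item_dim`, and `sales_fact` and all inserted rows are assumptions, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
-- The fact table holds the measures plus one key per dimension table.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2022), (2, 'Q2', 2022);
INSERT INTO item_dim VALUES (10, 'TV', 'home entertainment'), (11, 'PC', 'computer');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 1, 1200.0), (2, 10, 3, 1200.0);
""")

# A typical star join: the fact table joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for quarter, item_type, dollars in rows:
    print(quarter, item_type, dollars)
```

Every query follows the same pattern: start at the fact table, join outward along the keys, and group by attributes of the dimension tables.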
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key)
supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, state_or_province, country)
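The effect of normalizing the location hierarchy can be sketched with two small lookup tables. The rows and the `denormalize_location` helper are hypothetical; the point is only that a snowflake schema needs an extra hop (location to city) to rebuild the wide row that a star schema stores directly.

```python
# Hypothetical rows for the normalized location hierarchy.
LOCATION = {1: {"street": "1 Main St", "city_key": 100}}
CITY = {
    100: {"city": "Vancouver",
          "state_or_province": "British Columbia",
          "country": "Canada"},
}


def denormalize_location(location_key):
    """Follow location -> city to rebuild the wide star-schema row."""
    loc = LOCATION[location_key]
    city = CITY[loc["city_key"]]
    return {"location_key": location_key, "street": loc["street"], **city}


row = denormalize_location(1)
```

The normalized form stores each city's attributes once, at the cost of one more join per query that needs them.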
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, ...
Shipping Fact Table: time_key, item_key, shipper_key, from_location, ...
Shared dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
Multidimensional Data
[Figure: dimension hierarchies; surviving level labels: Street, Month]
A Sample Data Cube
[Figure: a sample data cube whose axes include Product (TV, PC, VCR, sum) and Country (U.S.A, Canada, Mexico, sum)]
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product, date, country
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
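The operations listed above can be sketched on a small in-memory cube. Everything here is an assumption made for illustration: the cell values, the `COUNTRY_OF` hierarchy, and the helper names. The cube is a plain dictionary keyed by (city, quarter, item type).

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter, item_type) -> dollars_sold.
CUBE = {
    ("Toronto", "Q1", "computer"): 10,
    ("Toronto", "Q2", "computer"): 20,
    ("Vancouver", "Q1", "computer"): 30,
    ("Vancouver", "Q1", "phone"): 5,
    ("Chicago", "Q1", "computer"): 40,
    ("Chicago", "Q2", "phone"): 15,
}
COUNTRY_OF = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cube.items():
        out[(COUNTRY_OF[city], quarter, item)] += value
    return dict(out)


def roll_up_drop_time(cube):
    """Roll up by dimension reduction: aggregate the time dimension away."""
    out = defaultdict(int)
    for (city, _quarter, item), value in cube.items():
        out[(city, item)] += value
    return dict(out)


def slice_time(cube, quarter):
    """Slice: select on one dimension, yielding a 2-D (city, item) table."""
    return {(c, i): v for (c, q, i), v in cube.items() if q == quarter}


def dice(cube, cities, quarters, items):
    """Dice: select on several dimensions at once, keeping all three axes."""
    return {k: v for k, v in cube.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}


def pivot(table):
    """Pivot: rotate a 2-D table by swapping its two axes."""
    return {(b, a): v for (a, b), v in table.items()}


by_country = roll_up_location(CUBE)
no_time = roll_up_drop_time(CUBE)
q1 = slice_time(CUBE, "Q1")
q1_rotated = pivot(q1)
canada_computers = dice(CUBE, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"computer"})
```

Drill-down has no helper here because it cannot be computed from a summarized cube: it requires going back to the more detailed data, which is why it is described as the reverse of roll-up.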
Fig. 3.10 Typical OLAP Operations
[Figure: a central cube with dimensions location (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: home entertainment, computer, phone, security), surrounded by the result of each operation:
roll-up on location (from cities to countries): the location axis becomes countries (USA, Canada)
drill-down on time (from quarters to months): the time axis becomes months (January to December)
dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer")
slice for time = "Q1": a 2-D table of location (cities) by item (types)
pivot: the item (types) and location (cities) axes of the slice are rotated]
Figure made by Radmilo Pesic & Branko Golubovic.
Design of Data Warehouse: A Business
Analysis Framework
Four views regarding the design of a data warehouse
Top-down view
allows selection of the relevant information necessary for the
data warehouse
Data source view
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
consists of fact tables and dimension tables
Business query view
sees the data in the warehouse from the viewpoint
of the end user
Data Warehouse Design Process
Difference between OLAP and Data Mining
OLAP: How much did the bank lose from loan defaults within the past 2 years?
Data mining: What are the characteristics of the customers most likely to default on their loans before the year is over?
OLAP: What were the highest-selling fashion items in our stores?
Data mining: What additional products are most likely to be sold to customers who buy shirts?
OLAP: Which store/location made the highest sales in the past year?
Data mining: In which area should we open a new store next year?