
Lect 5


Data Warehousing

— Chapter 5—

Data Mining: Concepts and Techniques
24/12/5
Data Warehousing and OLAP Technology: An Overview

• What is a data warehouse?
• A multi-dimensional data model
• From data warehousing to data mining

What Is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing:
  - The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented

• Organized around major subjects, such as customer, product, sales
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  - Relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  - E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.
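
The hotel price example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source layouts, the field names, the `HotelPrice` record, and the exchange rates in `USD_PER_UNIT` are all hypothetical and do not come from the slides. The point is that each source is converted to one agreed representation (currency, tax treatment, breakfast flag) at the moment it is moved into the warehouse.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; real rates would come from a reference table.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10}


@dataclass(frozen=True)
class HotelPrice:
    """The single warehouse representation: USD, tax included, boolean breakfast flag."""
    hotel: str
    usd_per_night: float
    breakfast_included: bool


def from_source_a(row: dict) -> HotelPrice:
    """Source A (assumed layout): price in EUR, tax excluded, breakfast as 'Y'/'N'."""
    gross_eur = row["price_eur"] * (1 + row["tax_rate"])
    return HotelPrice(
        hotel=row["name"],
        usd_per_night=round(gross_eur * USD_PER_UNIT["EUR"], 2),
        breakfast_included=(row["bkfst"] == "Y"),
    )


def from_source_b(row: dict) -> HotelPrice:
    """Source B (assumed layout): price in USD, tax included, breakfast as bool."""
    return HotelPrice(
        hotel=row["hotel_name"],
        usd_per_night=round(row["total_usd"], 2),
        breakfast_included=row["breakfast"],
    )


# Conversion happens once, on the way into the warehouse.
warehouse_rows = [
    from_source_a({"name": "Alpha", "price_eur": 100.0, "tax_rate": 0.10, "bkfst": "Y"}),
    from_source_b({"hotel_name": "Beta", "total_usd": 95.5, "breakfast": False}),
]

for row in warehouse_rows:
    print(row)
```

After conversion, every row can be compared directly, which is what "consistency in encoding structures and attribute measures" buys the analyst.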
Data Warehouse: Time-Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  - Contains an element of time, explicitly or implicitly
  - But the key of operational data may or may not contain a "time element"
Data Warehouse: Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration: a query-driven approach
  - Build wrappers/mediators on top of heterogeneous databases
  - When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  - Involves complex information filtering, and queries compete for resources
• Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
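
The contrast between the two approaches can be made concrete with a toy sketch. Everything here is assumed for illustration: the two in-memory "sources", their field names, and the function names are hypothetical. The query-driven function translates and merges at query time, while the update-driven version integrates once in advance and then answers from the stored result.

```python
# Two hypothetical sources that name the same concepts differently.
SOURCE_A = [{"cust": "ann", "amt": 10}, {"cust": "bob", "amt": 5}]
SOURCE_B = [{"customer": "ann", "total": 7}]


def query_driven_total(customer: str) -> int:
    """Mediator style: translate the query for each source every time it is asked."""
    total_a = sum(r["amt"] for r in SOURCE_A if r["cust"] == customer)
    total_b = sum(r["total"] for r in SOURCE_B if r["customer"] == customer)
    return total_a + total_b


def build_warehouse() -> dict:
    """Update-driven style: integrate all sources once, in advance."""
    warehouse: dict = {}
    for r in SOURCE_A:
        warehouse[r["cust"]] = warehouse.get(r["cust"], 0) + r["amt"]
    for r in SOURCE_B:
        warehouse[r["customer"]] = warehouse.get(r["customer"], 0) + r["total"]
    return warehouse


WAREHOUSE = build_warehouse()


def update_driven_total(customer: str) -> int:
    """Answer directly from the pre-integrated store; the sources are not touched."""
    return WAREHOUSE.get(customer, 0)


print(query_driven_total("ann"), update_driven_total("ann"))
```

Both functions return the same answer; the difference is where the integration cost is paid, at every query or once at load time.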

Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  - Major task of data warehouse system
  - Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional, integrated,
                    isolated                      consolidated
usage               repetitive                    ad hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response

Why a Separate Data Warehouse?
• High performance for both systems
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
  - Missing data: decision support requires historical data which operational DBs do not typically maintain
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis directly on relational databases
Chapter 3: Data Warehousing and OLAP Technology: An Overview

• What is a data warehouse?
• A multi-dimensional data model
• From data warehousing to data mining

From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Multidimensional Data

• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
[Figure: a three-dimensional cube with axes Product, Month, and Region]
Cuboids Corresponding to the Cube

• 0-D (apex) cuboid: all
• 1-D cuboids: product; date; country
• 2-D cuboids: (product, date); (product, country); (date, country)
• 3-D (base) cuboid: (product, date, country)

Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
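
The lattice above is the set of all subsets of the dimension list, so it can be enumerated mechanically. The following is a minimal sketch; the function name `cuboid_lattice` is hypothetical, and it assumes each dimension has a single level (no concept hierarchies), in which case n dimensions yield 2^n cuboids.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: [cuboids with k dimensions]} for every subset of the dimensions.

    The single 0-D entry is the empty tuple, i.e. the apex cuboid ("all").
    """
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")

# Total number of cuboids across all levels: 2 ** 4 for four dimensions.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```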
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
• Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
  - Measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, state_or_province, country
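
A cut-down version of this star schema can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions, not the schema from the slide verbatim: the table names `time_dim`, `item_dim`, and `sales_fact` are hypothetical, the branch and location dimensions are omitted for brevity, and the rows are made-up sample data. It shows the typical star query shape: join the fact table to its dimensions, then group by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables (subset of the attributes on the slide).
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                           month INTEGER, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT);

    -- Fact table: foreign keys to the dimensions plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );

    -- Made-up sample rows.
    INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024), (2, 20, 4, 'Q2', 2024);
    INSERT INTO item_dim VALUES (1, 'TV-A', 'Acme', 'TV'), (2, 'PC-B', 'Bolt', 'PC');
    INSERT INTO sales_fact VALUES (1, 1, 2, 800.0), (1, 2, 1, 1200.0),
                                  (2, 1, 3, 1200.0), (2, 1, 1, 400.0);
    """
)

# Star join: aggregate the measures by quarter and item type.
rows = conn.execute(
    """
    SELECT t.quarter, i.type, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
    """
).fetchall()

for row in rows:
    print(row)
```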
Example of Snowflake Schema
• Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
  - Measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
  - supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
  - city: city_key, city, state_or_province, country
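
The step from the star schema's location dimension to the snowflake's location and city tables is a normalization. A minimal sketch of that step follows; the sample rows and the function name `snowflake_location` are hypothetical, and surrogate `city_key` values are simply assigned in order of first appearance.

```python
# Denormalized location dimension, as in the star schema (made-up rows).
star_location = [
    {"location_key": 1, "street": "1 Main St", "city": "Vancouver",
     "state_or_province": "BC", "country": "Canada"},
    {"location_key": 2, "street": "9 Oak Ave", "city": "Vancouver",
     "state_or_province": "BC", "country": "Canada"},
    {"location_key": 3, "street": "5 Pine Rd", "city": "Chicago",
     "state_or_province": "IL", "country": "USA"},
]


def snowflake_location(rows):
    """Split location rows into a slim location table and a separate city table."""
    city_keys = {}  # (city, state_or_province, country) -> surrogate city_key
    location_table, city_table = [], []
    for row in rows:
        city_id = (row["city"], row["state_or_province"], row["country"])
        if city_id not in city_keys:
            city_keys[city_id] = len(city_keys) + 1
            city_table.append({
                "city_key": city_keys[city_id],
                "city": row["city"],
                "state_or_province": row["state_or_province"],
                "country": row["country"],
            })
        location_table.append({
            "location_key": row["location_key"],
            "street": row["street"],
            "city_key": city_keys[city_id],
        })
    return location_table, city_table


location_table, city_table = snowflake_location(star_location)
print(location_table)
print(city_table)
```

The city attributes are now stored once per city instead of once per street address, at the cost of one extra join when querying.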
Example of Fact Constellation
• Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
  - Measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
Multidimensional Data

• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month or Week > Day
[Figure: a three-dimensional cube with axes Product, Month, and Region]
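
Summarizing along one of these paths amounts to re-keying each cell by its parent in the hierarchy and adding up the values. The sketch below climbs part of the Location path (City to Country to Region); the mapping tables and sales figures are made up, and `roll_up` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical fragments of the Location hierarchy.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}
COUNTRY_TO_REGION = {"Canada": "North America", "USA": "North America"}

# Made-up sales volume at the City level.
sales_by_city = {"Vancouver": 100, "Toronto": 150, "Chicago": 200}


def roll_up(cells, parent_of):
    """Aggregate each member's value into its parent one level up the hierarchy."""
    totals = defaultdict(int)
    for member, value in cells.items():
        totals[parent_of[member]] += value
    return dict(totals)


sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)

print(sales_by_country)
print(sales_by_region)
```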
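A Sample Data Cube
[Figure: a 3-D sales cube with a "sum" plane along each axis; the highlighted cell is the total annual sales of TV in U.S.A.]
• Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
• Product: TV, PC, VCR, sum
• Country: U.S.A, Canada, Mexico, sum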
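
The "sum" cells in the figure are the aggregated cells of the cube. One simple way to see where they come from is to add every fact into each combination of "its own value or sum" per dimension. This is a sketch with made-up numbers, and `full_cube` is a hypothetical helper; it materializes all cells naively and is not meant as an efficient cube computation method.

```python
from collections import defaultdict
from itertools import product

# Made-up base facts: (date, product, country, sales).
facts = [
    ("1Qtr", "TV", "U.S.A", 10),
    ("2Qtr", "TV", "U.S.A", 20),
    ("1Qtr", "PC", "Canada", 5),
    ("3Qtr", "VCR", "Mexico", 7),
]

ALL = "sum"  # marker for "aggregated over this dimension"


def full_cube(rows):
    """Add each fact into all 2**3 cells it contributes to."""
    cube = defaultdict(int)
    for date, prod, country, sales in rows:
        for cell in product((date, ALL), (prod, ALL), (country, ALL)):
            cube[cell] += sales
    return dict(cube)


cube = full_cube(facts)

# Total annual sales of TV in U.S.A.: date is aggregated, product and country fixed.
print(cube[(ALL, "TV", "U.S.A")])
# Apex cell: everything aggregated.
print(cube[(ALL, ALL, ALL)])
```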

Cuboids Corresponding to the Cube

• 0-D (apex) cuboid: all
• 1-D cuboids: product; date; country
• 2-D cuboids: (product, date); (product, country); (date, country)
• 3-D (base) cuboid: (product, date, country)

Typical OLAP Operations
• Roll up (drill-up): summarize data
  - By climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - From higher-level summary to lower-level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  - Reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  - Drill across: involving (across) more than one fact table
  - Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
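
The core operations can be illustrated on a tiny in-memory cube. The sketch below is an assumption-laden toy: the cell layout, the sample values, and the function names (`roll_up`, `slice_`, `dice`, `pivot`) are hypothetical, and roll-up is shown only in its dimension-reduction form. Drill-down is the reverse: call `roll_up` with more dimensions kept.

```python
from collections import defaultdict

# Made-up base cells of a 3-D cube (quarter, product, country) with one measure.
cells = [
    {"quarter": "Q1", "product": "TV", "country": "USA", "sales": 10},
    {"quarter": "Q1", "product": "PC", "country": "USA", "sales": 4},
    {"quarter": "Q2", "product": "TV", "country": "Canada", "sales": 6},
    {"quarter": "Q2", "product": "PC", "country": "USA", "sales": 8},
]


def roll_up(rows, keep):
    """Roll up by dimension reduction: aggregate over every dimension not in `keep`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in keep)] += r["sales"]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: select on a single dimension."""
    return [r for r in rows if r[dim] == value]


def dice(rows, criteria):
    """Dice: select on two or more dimensions; `criteria` maps dimension -> allowed values."""
    return [r for r in rows if all(r[d] in allowed for d, allowed in criteria.items())]


def pivot(rows, row_dim, col_dim):
    """Pivot: lay two chosen dimensions out as a 2-D table (nested dict)."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r["sales"]
    return {k: dict(v) for k, v in table.items()}


by_country = roll_up(cells, ["country"])
q1_only = slice_(cells, "quarter", "Q1")
usa_pcs = dice(cells, {"country": {"USA"}, "product": {"PC"}})
product_by_quarter = pivot(cells, "product", "quarter")

print(by_country)
print(product_by_quarter)
```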
Fig. 3.10 Typical OLAP Operations

Chapter 3: Data Warehousing and OLAP Technology: An Overview

• What is a data warehouse?
• A multi-dimensional data model
• From data warehousing to data mining

Data Warehouse Usage
• Three kinds of data warehouse applications
  - Information processing
    - Supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  - Analytical processing
    - Multidimensional analysis of data warehouse data
    - Supports basic OLAP operations: slice-dice, drilling, pivoting
  - Data mining
    - Knowledge discovery from hidden patterns
    - Supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
• Why online analytical mining?
  - High quality of data in data warehouses
    - DW contains integrated, consistent, cleaned data
  - Available information processing structure surrounding data warehouses
    - ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  - OLAP-based exploratory data analysis
    - Mining with drilling, dicing, pivoting, etc.
  - On-line selection of data mining functions
    - Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM System Architecture
[Figure: four-layer OLAM architecture]
• Layer 4, User Interface: User GUI API (takes a mining query, returns a mining result)
• Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, accessed through the Data Cube API
• Layer 2, MDDB: MDDB and Meta Data, fed through the Database API with filtering and integration
• Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning and data integration
Chapter 3: Data Warehousing and OLAP Technology: An Overview

• What is a data warehouse?
• A multi-dimensional data model
• From data warehousing to data mining
• Summary

Summary: Data Warehouse and OLAP Technology

• Why data warehousing?
• A multi-dimensional model of a data warehouse
  - Star schema, snowflake schema, fact constellations
  - A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting
• Efficient computation of data cubes
  - Partial vs. full vs. no materialization
  - Indexing OLAP data: bitmap index and join index
  - OLAP query processing
• From OLAP to OLAM (on-line analytical mining)
