
Data Warehouse


Data Warehousing and OLAP Technology

* 1
Data Warehouse vs. Heterogeneous DBMS

■ Traditional heterogeneous DB integration:


■ Build wrappers/mediators on top of heterogeneous databases
■ Query-driven approach
■ When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results are
integrated into a global answer set
■ Requires complex information filtering and competes for
resources at the local sources
■ Data warehouse: update-driven, high performance
■ Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
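
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their field names, and the wrapper functions are hypothetical and do not come from the slides.

```python
# Two hypothetical sources with inconsistent representations (assumed data).
source_a = [{"cust": "ann", "amount_usd": 120.0}]
source_b = [{"customer_name": "Ann", "amount_cents": 3000}]

def wrap_a(rec):
    """Wrapper for source A: normalize name casing, keep dollars."""
    return {"customer": rec["cust"].title(), "amount": rec["amount_usd"]}

def wrap_b(rec):
    """Wrapper for source B: convert cents to dollars."""
    return {"customer": rec["customer_name"], "amount": rec["amount_cents"] / 100}

def query_driven_total(customer):
    # Query-driven: every query is translated and sent to every source,
    # and the partial results are integrated at query time.
    records = [wrap_a(r) for r in source_a] + [wrap_b(r) for r in source_b]
    return sum(r["amount"] for r in records if r["customer"] == customer)

# Update-driven: integrate once, in advance, and store the result.
warehouse = [wrap_a(r) for r in source_a] + [wrap_b(r) for r in source_b]

def warehouse_total(customer):
    # Queries run directly against the integrated store.
    return sum(r["amount"] for r in warehouse if r["customer"] == customer)

print(query_driven_total("Ann"), warehouse_total("Ann"))
```

Both functions return the same answer; the difference is when the translation and integration work is done.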

* 2
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
■ Major task of traditional relational DBMS
■ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
■ OLAP (on-line analytical processing)
■ Major task of data warehouse system
■ Data analysis and decision making
■ Distinct features (OLTP vs. OLAP):
■ User and system orientation: customer vs. market
■ Data contents: current, detailed vs. historical, consolidated
■ Database design: ER + application vs. star + subject
■ View: current, local vs. evolutionary, integrated
■ Access patterns: update vs. read-only but complex queries
* 3
OLTP vs. OLAP

* 4
Why Separate Data Warehouse?
■ High performance for both systems
■ DBMS: tuned for OLTP (access methods, indexing,
concurrency control, recovery)
■ Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
■ Different functions and different data:
■ missing data: Decision support requires historical data
which operational DBs do not typically maintain
■ data consolidation: decision support requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
■ data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
* 5
From Tables and Spreadsheets
to Data Cubes

■ A data warehouse is based on a multidimensional data model which
views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed
in multiple dimensions
■ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
* 6
Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
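
The lattice can be enumerated mechanically: each cuboid is one subset of the dimension set, so n dimensions give 2^n cuboids. A minimal sketch, assuming no concept hierarchies within dimensions (the function name cuboid_lattice is made up for this example):

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: [cuboids with k dimensions]} for k = 0..n."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, conventionally named "all".
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```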
* 7
Conceptual Modeling of
Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to a snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
* 8
Example of Star Schema
■ Sales Fact Table keys: time_key, item_key, branch_key, location_key
■ Measures: dollars_sold, units_sold, avg_sales
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, province_or_state, country
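
A star schema like this one can be tried out with Python's built-in sqlite3 module. The sketch below keeps only a few of the columns, and every row of data is invented for illustration. The query shows the typical star-join pattern: the fact table is joined to the dimension tables it references and the measure is aggregated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO time VALUES (?, ?, ?)",
                 [(1, "Q1", 2024), (2, "Q2", 2024)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)",
                 [(1, "TV-A", "Acme", "TV"), (2, "PC-B", "Bolt", "PC")])
conn.executemany("INSERT INTO location VALUES (?, ?, ?)",
                 [(1, "Chicago", "USA"), (2, "Toronto", "Canada")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 500.0, 5),
    (1, 2, 2, 900.0, 1),
    (2, 1, 1, 300.0, 3),
    (2, 1, 2, 200.0, 2),
])

# Star join: fact table joined to two dimensions, measure summed per group.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold) AS dollars_sold
    FROM sales s
    JOIN time t ON s.time_key = t.time_key
    JOIN item i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```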

* 9
Example of Snowflake Schema
■ Sales Fact Table keys: time_key, item_key, branch_key, location_key
■ Measures: units_sold, dollars_sold, avg_sales
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_key
■ supplier: supplier_key, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city_key
■ city: city_key, city, province_or_state, country
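
The practical effect of snowflaking is an extra join per normalized level. A small sqlite3 sketch, assuming a cut-down version of the schema and invented rows: rolling sales up to country now goes through location and then city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    city_key INTEGER PRIMARY KEY, city TEXT,
    province_or_state TEXT, country TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
CREATE TABLE sales (
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER
);
-- Hypothetical sample rows.
INSERT INTO city VALUES (1, 'Chicago', 'Illinois', 'USA'),
                        (2, 'Toronto', 'Ontario', 'Canada');
INSERT INTO location VALUES (10, '1 Main St', 1),
                            (11, '2 Lake Ave', 1),
                            (12, '3 King St', 2);
INSERT INTO sales VALUES (10, 5), (11, 3), (12, 2);
""")

# Two joins are needed to reach country: sales -> location -> city.
rows = conn.execute("""
    SELECT c.country, SUM(s.units_sold)
    FROM sales s
    JOIN location l ON s.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(rows)
```

In the star version the same query needs only one join, because city and country sit directly in the location table.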

* 10
Example of Fact Constellation
■ Sales Fact Table keys: time_key, item_key, branch_key, location_key
■ Sales measures: units_sold, dollars_sold, avg_sales
■ Shipping Fact Table keys: time_key, item_key, shipper_key, from_location, to_location
■ Shipping measures: dollars_cost, units_shipped
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, province_or_state, country
■ shipper: shipper_key, shipper_name, location_key, shipper_type
* 11
Measures: Three Categories
■ distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning.
■ E.g., count(), sum(), min(), max().
■ algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each
of which is obtained by applying a distributive aggregate
function.
■ E.g., avg(), min_N(), standard_deviation().
■ holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank().
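
The three categories can be checked on a small partitioned data set. In this sketch the numbers are arbitrary: sum() is combined from per-partition sums, avg() is derived from two distributive aggregates (sum and count), and median() shows why a holistic measure cannot in general be rebuilt from per-partition results.

```python
from statistics import median

# Arbitrary sample data, split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg() is computed from a bounded number (here 2) of
# distributive aggregates, sum() and count().
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```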
* 12
Multidimensional Data
■ Sales volume as a function of product, month,
and region
■ Dimensions: Product, Location, Time
■ Hierarchical summarization paths
■ Product: Industry > Category > Product
■ Location: Region > Country > City > Office
■ Time: Year > Quarter > Month or Week > Day
[Figure: cube with axes Product, Region, and Month]
* 13
A Sample Data Cube
[Figure: 3-D sales cube with a sum plane on each axis]
Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
Product: TV, PC, VCR, sum
Country: U.S.A., Canada, Mexico, sum
Highlighted cell: total annual sales of TV in U.S.A.

* 14
Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product,date,country

* 15
Browsing a Data Cube

■ Visualization
■ OLAP capabilities
■ Interactive manipulation
* 16
Typical OLAP Operations

■ Roll up (drill-up): summarize data


■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or detailed
data, or introducing new dimensions
■ Slice and dice:
■ project and select
■ Pivot (rotate):
■ reorient the cube for visualization, e.g., 3D to a series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
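
The main operations can be imitated on a tiny fact list using plain Python. This is a sketch only: the rows are invented, and the helper aggregate() is a made-up name. Slice is taken here as a selection on one dimension and dice as a selection on two or more.

```python
from collections import defaultdict

# Hypothetical facts: (city, country, quarter, product, units_sold).
COLUMNS = ("city", "country", "quarter", "product", "units_sold")
facts = [
    ("Chicago", "USA", "Q1", "TV", 5),
    ("Chicago", "USA", "Q2", "TV", 3),
    ("New York", "USA", "Q1", "PC", 4),
    ("Toronto", "Canada", "Q1", "TV", 2),
    ("Toronto", "Canada", "Q2", "PC", 6),
]
rows = [dict(zip(COLUMNS, f)) for f in facts]

def aggregate(rows, dims):
    """Sum units_sold grouped by the given dimension names."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["units_sold"]
    return dict(totals)

# Detailed view, and roll-up by climbing the hierarchy city -> country.
by_city = aggregate(rows, ["city", "quarter"])
by_country = aggregate(rows, ["country", "quarter"])

# Roll-up by dimension reduction: drop the quarter dimension.
by_country_only = aggregate(rows, ["country"])

# Drill-down is the reverse step: from by_country back to by_city.

# Slice: select on one dimension. Dice: select on two or more.
q1_slice = [r for r in rows if r["quarter"] == "Q1"]
dice = [r for r in rows if r["quarter"] == "Q1" and r["country"] == "USA"]

# Pivot: reorient the axes from (country, quarter) to (quarter, country).
pivoted = {(q, c): v for (c, q), v in by_country.items()}

print(by_country)
print(by_country_only)
```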
* 17
A Star-Net Query Model
[Figure: star-net with one radial line per dimension and footprints along each line]
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Other radial lines: Customer, Promotion
Each circle (abstraction level) is called a footprint
* 18
Design of a Data Warehouse: A
Business Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the
data warehouse
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the data in the warehouse from the viewpoint of the
end user
* 19
Data Warehouse Design Process

■ Top-down, bottom-up approaches or a combination of both


■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, with short
turnaround time
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record
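
The four choices in the typical design process can be written down as a small record before any table is created. The class name FactTableDesign and the example values are assumptions for illustration, not part of the slides.

```python
from dataclasses import dataclass, field

@dataclass
class FactTableDesign:
    """One record per design decision in the four-step process."""
    business_process: str                            # step 1
    grain: str                                       # step 2
    dimensions: list = field(default_factory=list)   # step 3
    measures: list = field(default_factory=list)     # step 4

    def columns(self):
        # A fact record holds one key per dimension plus the measures.
        return [f"{d}_key" for d in self.dimensions] + self.measures

design = FactTableDesign(
    business_process="sales",
    grain="one row per item per branch per day",  # hypothetical grain
    dimensions=["time", "item", "branch", "location"],
    measures=["dollars_sold", "units_sold"],
)

print(design.columns())
```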

* 20
Multi-Tiered Architecture
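[Figure: multi-tiered architecture, left to right]
Data sources: operational DBs, other sources
Integration: extract, transform, load, refresh (monitor & integrator, metadata)
Data storage: data warehouse, data marts (serve the OLAP server)
OLAP engine: OLAP server
Front-end tools: analysis, query, reports, data mining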


* 21
Data Warehouse Development:
A Recommended Approach
[Figure: recommended development path]
Define a high-level corporate data model
Model refinement: data marts and the enterprise data warehouse
Distributed data marts
Multi-tier data warehouse


* 22
OLAP Server Architectures
■ Relational OLAP (ROLAP)
■ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware to support missing pieces
■ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
■ greater scalability
■ Multidimensional OLAP (MOLAP)
■ Array-based multidimensional storage engine (sparse matrix
techniques)
■ fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP)
■ Flexibility, e.g., low-level data in relational storage, high-level summaries in arrays
■ Specialized SQL servers
■ specialized support for SQL queries over star/snowflake schemas
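
The storage difference between ROLAP and MOLAP can be illustrated with plain Python structures. This is a toy sketch with invented data: the ROLAP side keeps fact rows and aggregates by scanning them, while the MOLAP side keeps a dense array addressed by dimension position plus pre-computed totals. A real MOLAP engine would also compress sparse regions, which is not shown here.

```python
products = ["TV", "PC", "VCR"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# ROLAP-style: relational fact rows (product, quarter, units_sold).
fact_rows = [("TV", "Q1", 5), ("TV", "Q2", 3), ("PC", "Q1", 4), ("VCR", "Q4", 1)]

def rolap_total(product):
    # Aggregate at query time by scanning the fact rows.
    return sum(units for p, _, units in fact_rows if p == product)

# MOLAP-style: dense 2-D array indexed by dimension positions.
p_index = {p: i for i, p in enumerate(products)}
q_index = {q: j for j, q in enumerate(quarters)}
cube = [[0] * len(quarters) for _ in products]
for p, q, units in fact_rows:
    cube[p_index[p]][q_index[q]] += units

# Pre-computed summary along the quarter dimension.
product_totals = [sum(row) for row in cube]

def molap_total(product):
    # Direct index lookup into the pre-computed summary.
    return product_totals[p_index[product]]

print(rolap_total("TV"), molap_total("TV"))
```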
* 23
