What Is a Data Warehouse?
Data warehousing:
Data Warehouse—Subject-Oriented
Focusing on the modeling and analysis of data for decision makers, not on daily operations or
transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not
useful in the decision support process
Data Warehouse—Integrated
Time-variant: the time horizon for the data warehouse is significantly longer than that of operational systems
The key of operational data, by contrast, may or may not contain a “time element”
Data Warehouse—Nonvolatile
Operational update of data does not occur in the data warehouse environment
Does not require transaction processing, recovery, and concurrency control mechanisms
When a query is posed to a client site, a meta-dictionary is used to translate the query into
queries appropriate for individual heterogeneous sites involved, and the results are
integrated into a global answer set
                    OLTP                           OLAP
users               clerk, IT professional         knowledge worker
function            day-to-day operations          decision support
DB design           application-oriented           subject-oriented
data                current, up-to-date,           historical, summarized,
                    detailed, flat relational,     multidimensional,
                    isolated                       integrated, consolidated
usage               repetitive                     ad-hoc
access              read/write,                    lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction      complex query
# records accessed  tens                           millions
# users             thousands                      hundreds
DB size             100MB-GB                       100GB-TB
metric              transaction throughput         query throughput, response
Why Separate Data Warehouse?
High performance for both systems
DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
missing data: Decision support requires historical data which operational DBs do not
typically maintain
data quality: different sources typically use inconsistent data representations, codes and
formats which have to be reconciled
Note: There are more and more systems which perform OLAP analysis directly on relational
databases
A data warehouse is based on a multidimensional data model which views data in the form of a
data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type) or time (day, week, month,
quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D
cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of
cuboids forms a data cube.
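The lattice of cuboids can be enumerated mechanically. The following is a minimal sketch, assuming three dimensions with no concept hierarchies; the dimension names and the helper `cuboid_lattice` are illustrative choices, not part of the original material.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every cuboid (subset of the dimensions) by its dimensionality."""
    lattice = {}
    for k in range(len(dimensions), -1, -1):
        lattice[k] = [tuple(c) for c in combinations(dimensions, k)]
    return lattice

dims = ["item", "time", "location"]      # assumed dimension names
lattice = cuboid_lattice(dims)

base_cuboid = lattice[len(dims)][0]      # the n-D base cuboid
apex_cuboid = lattice[0][0]              # the 0-D apex cuboid, written as ()
total_cuboids = sum(len(level) for level in lattice.values())
```

With n dimensions and no hierarchies the lattice holds 2^n cuboids, so three dimensions give eight.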
A schema is a collection of database objects (for present purposes, tables) associated
with one particular database username. This username is called the schema owner, or the owner
of the related group of objects. A database may contain one or multiple schemas.
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized.
Star schema: A fact table in the middle connected to a set of dimension tables
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact constellation
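A star schema can be tried out with an in-memory SQLite database. The sketch below is an assumption-laden illustration: the table names (`item_dim`, `time_dim`, `sales_fact`) and all sample rows are hypothetical, while the column names follow the item and time dimensions and the dollars_sold measure mentioned earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
-- The fact table holds the measure plus one key per dimension table.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    time_key INTEGER REFERENCES time_dim(time_key),
    dollars_sold REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?)",
                 [(1, "TV-A", "BrandX", "tv"), (2, "Phone-B", "BrandY", "phone")])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, 5, 1, "Q1", 2004), (2, 9, 4, "Q2", 2004)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (1, 2, 50.0), (2, 1, 30.0)])

# A typical star join: fact table in the middle, dimensions around it.
rows = conn.execute("""
SELECT i.brand, t.year, SUM(f.dollars_sold)
FROM sales_fact f
JOIN item_dim i ON f.item_key = i.item_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY i.brand, t.year
ORDER BY i.brand, t.year
""").fetchall()
```

A fact constellation would add a second fact table (for example, shipping) that reuses the same dimension tables.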
[Schema diagram: branch dimension table with attributes branch_key, branch_name, branch_type]
Cube Definition Syntax (BNF) in DMQL
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
Multidimensional Data
Drill down (roll down):
from higher level summary to lower level summary or detailed data, or introducing new
dimensions
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill through: through the bottom level of the cube to its back-end relational tables (using
SQL)
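The OLAP operations can be illustrated on a tiny cube held in a dictionary. This is a minimal sketch under stated assumptions: the cell values, the city-to-country hierarchy, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, item) -> units sold
cube = {
    ("Vancouver", "Q1", "TV"): 10,
    ("Vancouver", "Q2", "TV"): 20,
    ("Toronto", "Q1", "TV"): 5,
    ("Toronto", "Q1", "PC"): 7,
    ("Chicago", "Q1", "PC"): 3,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up(cells, axis, mapping):
    """Climb a concept hierarchy on one axis and re-aggregate the measure."""
    out = defaultdict(int)
    for key, value in cells.items():
        new_key = key[:axis] + (mapping[key[axis]],) + key[axis + 1:]
        out[new_key] += value
    return dict(out)

def slice_(cells, axis, value):
    """Fix one value on an axis and drop that axis from the result."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cells.items() if k[axis] == value}

def pivot(cells, order):
    """Reorder the axes; the measure values are unchanged."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

by_country = roll_up(cube, 0, city_to_country)   # city -> country
q1 = slice_(cube, 1, "Q1")                       # 3-D cube -> 2-D plane
rotated = pivot(q1, (1, 0))                      # (city, item) -> (item, city)
```

Drill down is the inverse of `roll_up`; it cannot be computed from the summarized cells alone and has to go back to the more detailed cuboid.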
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short turnaround
time
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning the entire organization
Data Mart
a subset of corporate-wide data that is of value to specific groups of users. Its scope is
confined to specific, selected groups, such as a marketing data mart
Virtual warehouse
A set of views over operational databases
Model refinement
Data Warehouse Back-End Tools and Utilities
Data extraction
Data cleaning
Data transformation
Load
sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
Refresh
Metadata Repository
schema, view, dimensions, hierarchies, derived data definitions, data mart locations and
contents
Operational meta-data
data lineage (history of migrated data and transformation path), currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports,
audit trails)
Business data
Use relational or extended-relational DBMS to store and manage warehouse data and
OLAP middleware
Greater scalability
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et
al.’96)
FROM SALES
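What a cube by operator computes can be emulated in a few lines: one group-by for every subset of the listed dimensions. The sketch below is only an illustration; the rows, the column names, and the use of the string "ALL" for an aggregated-away dimension are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def cube_by(rows, dims, measure):
    """Sum `measure` for every subset of `dims`.

    A dimension that has been aggregated away is shown as "ALL".
    """
    result = defaultdict(float)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                result[key] += row[measure]
    return dict(result)

# Hypothetical contents of a SALES table.
sales = [
    {"item": "TV", "city": "Vancouver", "amount": 100.0},
    {"item": "TV", "city": "Toronto", "amount": 50.0},
    {"item": "PC", "city": "Vancouver", "amount": 30.0},
]
totals = cube_by(sales, ["item", "city"], "amount")
```

Here `totals[("ALL", "ALL")]` is the apex cell and `totals[("TV", "ALL")]` is the TV subtotal over all cities. The sketch rescans the rows once per group-by, which real systems avoid by deriving coarser cuboids from finer ones.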
The i-th bit is set if the i-th row of the base table has the value for the indexed column
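A bitmap index of this kind can be sketched with Python integers as bit vectors. The two columns and the helper names below are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose i-th bit is 1 iff row i has that value."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

def rows_of(bitmap):
    """Decode a bit vector back into the list of row positions it marks."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical base table columns (row i of each list belongs to the same tuple).
region = ["Asia", "Europe", "Asia", "America", "Europe"]
rtype = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]

region_idx = build_bitmap_index(region)
type_idx = build_bitmap_index(rtype)

# Selections on several columns become bitwise operations on the vectors.
asia_dealer = region_idx["Asia"] & type_idx["Dealer"]
asia_or_europe = region_idx["Asia"] | region_idx["Europe"]
```

Because comparison, join, and aggregation conditions reduce to bit arithmetic, the approach suits low-cardinality columns, where there are only a few vectors to keep.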
In data warehouses, a join index relates the values of the dimensions of a star schema to rows in
the fact table.
E.g. fact table: Sales and two dimensions city and product
A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city
Join indices can span multiple dimensions
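The Sales example can be sketched as follows. The fact rows, the R-ID labels, and the helper `build_join_index` are hypothetical; the sketch covers both a single-dimension index and one spanning two dimensions.

```python
from collections import defaultdict

# Hypothetical Sales fact table: R-ID -> (city, product, dollars_sold)
sales = {
    "R1": ("Vancouver", "TV", 100.0),
    "R2": ("Toronto", "TV", 50.0),
    "R3": ("Vancouver", "PC", 30.0),
    "R4": ("Vancouver", "TV", 20.0),
}

def build_join_index(fact, positions):
    """Map each combination of dimension values to the R-IDs of matching fact rows."""
    index = defaultdict(list)
    for rid, row in fact.items():
        index[tuple(row[p] for p in positions)].append(rid)
    return dict(index)

city_index = build_join_index(sales, [0])             # join index on city
city_product_index = build_join_index(sales, [0, 1])  # spans two dimensions

# Answer "total sales in Vancouver" by touching only the indexed rows.
vancouver_total = sum(sales[rid][2] for rid in city_index[("Vancouver",)])
```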
Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice =
selection + projection
Let the query to be processed be on {brand, province_or_state} with the condition “year
= 2004”, and there are 4 materialized cuboids available:
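Deciding which materialized cuboids can answer such a query amounts to a level check on each dimension. The following is a rough sketch: the concept hierarchies are assumed, and the candidate list is illustrative only, since the four cuboids of the original example are not shown in the text.

```python
# Assumed concept hierarchies, ordered from finest to coarsest level.
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}

def dimension_of(level):
    """Return the dimension that a hierarchy level belongs to."""
    for dim, levels in HIERARCHIES.items():
        if level in levels:
            return dim
    raise KeyError(level)

def can_answer(cuboid, needed_levels):
    """A cuboid is usable if it stores every needed dimension at the same or a finer level."""
    stored = {dimension_of(level): level for level in cuboid}
    for level in needed_levels:
        dim = dimension_of(level)
        if dim not in stored:
            return False
        levels = HIERARCHIES[dim]
        if levels.index(stored[dim]) > levels.index(level):
            return False  # stored level is coarser than what the query needs
    return True

# Query on {brand, province_or_state} with a selection on year.
needed = ["brand", "province_or_state", "year"]

# Illustrative candidates (hypothetical).
candidates = [
    ("year", "item_name", "city"),
    ("year", "brand", "country"),
    ("year", "brand", "province_or_state"),
    ("brand", "city"),
]
usable = [c for c in candidates if can_answer(c, needed)]
```

Among usable cuboids, the coarsest one is usually the cheapest to scan, although available indices can change that choice.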
Explore indexing structures and compressed vs. dense array structs in MOLAP
Data generalization and summarization-based characterization
– Approaches:
Data cube approach (OLAP approach)
Data Warehousing/Mining 2
Characterization: Data Cube Approach
(without using Attribute-Oriented Induction)
Perform computations and store results in data cubes
Strength
– An efficient implementation of data generalization
– Computation of various kinds of measures
e.g., count( ), sum( ), average( ), max( )
– Generalization and specialization can be performed on a data cube by
roll-up and drill-down
Limitations
– Handles only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
– Lacks intelligent analysis: cannot tell which dimensions should be
used or what level the generalization should reach
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How is it done?
– Collect the task-relevant data (the initial relation) using a relational
database query
– Perform generalization by attribute removal or attribute
generalization.
– Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
– Interactive presentation with users.
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions,
and the result is the initial relation.
Attribute-removal: remove attribute A if there is a large set
of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher level concepts are
expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A.
Attribute-threshold control: typically 2-8, user-specified or default.
Generalized relation threshold control: control the final
relation/rule size.
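The principles above can be put together in a short sketch. It is a simplification under stated assumptions: each hierarchy is a flat value-to-concept mapping, generalization climbs only one level, and the sample tuples and the function name are hypothetical.

```python
from collections import Counter

def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize an initial relation and merge identical tuples with counts.

    rows: list of dicts (the initial relation)
    hierarchies: attribute -> {value: higher-level concept}
    threshold: maximum number of distinct values an attribute may keep
    """
    kept = []
    generalized = [dict(r) for r in rows]
    for attr in rows[0]:
        distinct = {r[attr] for r in rows}
        if len(distinct) <= threshold:
            kept.append(attr)                  # few values: keep as is
        elif attr in hierarchies:
            for r in generalized:              # attribute generalization
                r[attr] = hierarchies[attr][r[attr]]
            kept.append(attr)
        # otherwise: attribute removal (many values, no generalization operator)
    merged = Counter(tuple(r[a] for a in kept) for r in generalized)
    return kept, merged

# Hypothetical initial relation and concept hierarchies.
students = [
    {"name": "Ann", "major": "CS", "birth_place": "Vancouver"},
    {"name": "Bob", "major": "Physics", "birth_place": "Toronto"},
    {"name": "Cho", "major": "Math", "birth_place": "Seattle"},
    {"name": "Dev", "major": "CS", "birth_place": "Montreal"},
]
hierarchies = {
    "major": {"CS": "Science", "Physics": "Science", "Math": "Science"},
    "birth_place": {"Vancouver": "Canada", "Toronto": "Canada",
                    "Montreal": "Canada", "Seattle": "Foreign"},
}
kept, merged = attribute_oriented_induction(students, hierarchies, threshold=2)
```

Here `name` is removed because it has many distinct values and no hierarchy, while `major` and `birth_place` are generalized. A fuller implementation would keep climbing each hierarchy until the attribute threshold is met, and would also enforce the generalized relation threshold.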
                Birth_Region
Gender    Canada    Foreign    Total
M         16        14         30
F         10        22         32
Total     26        36         62
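The totals in the crosstab follow directly from the four inner cells. A minimal sketch, with the cell counts taken from the table above and a hypothetical helper name:

```python
# Inner cells of the crosstab: (Gender, Birth_Region) -> count
counts = {
    ("M", "Canada"): 16, ("M", "Foreign"): 14,
    ("F", "Canada"): 10, ("F", "Foreign"): 22,
}

def margins(cells):
    """Return row totals, column totals, and the grand total of a 2-D crosstab."""
    row_totals, col_totals = {}, {}
    for (row, col), n in cells.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n
    return row_totals, col_totals, sum(cells.values())

row_totals, col_totals, grand_total = margins(counts)
```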
Attribute Relevance Analysis
Why?
– Which dimensions should be included?
– How high a level of generalization?
– Automatic vs. interactive
– Reduce # attributes; easy to understand patterns
What?
– statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
How?
– Data Collection
– Analytical Generalization
Use information gain analysis (e.g., entropy or other
measures) to identify highly relevant dimensions and
levels.
– Relevance Analysis
Sort and select the most relevant dimensions and levels.
– Attribute-oriented Induction for class description
On selected dimension/level
– OLAP operations (e.g. drilling, slicing) on relevance
rules
Relevance Measures
Information-Theoretic Approach
Decision tree
– each internal node tests an attribute
– each branch corresponds to attribute value
– each leaf node assigns a classification
ID3 algorithm
– build decision tree based on training objects with
known class labels to classify testing objects
– rank attributes with information gain measure
– minimal height
the least number of tests to classify an object
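Ranking attributes by information gain can be sketched as follows. The four training rows are a hypothetical toy set, chosen so that one attribute predicts the class perfectly and the other carries no information.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the class labels minus the expected entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training objects with known class labels.
rows = [
    {"wind": "strong", "humidity": "high", "play": "no"},
    {"wind": "strong", "humidity": "normal", "play": "no"},
    {"wind": "weak", "humidity": "high", "play": "yes"},
    {"wind": "weak", "humidity": "normal", "play": "yes"},
]
ranked = sorted(["wind", "humidity"],
                key=lambda a: information_gain(rows, a, "play"),
                reverse=True)
```

ID3 would test the top-ranked attribute at the root and repeat the ranking within each branch; for relevance analysis, the same scores serve to filter out weakly relevant attributes.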
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rain: Wind
    strong: no
    weak: yes
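The PlayTennis tree, with Outlook at the root, Humidity under sunny, Wind under rain, and overcast leading straight to yes, can be encoded and applied as below. The nested-tuple representation and the function name `classify` are illustrative choices.

```python
# Internal node: (attribute, {attribute value: subtree}); leaf: class label.
tree = ("Outlook", {
    "sunny": ("Humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("Wind", {"strong": "no", "weak": "yes"}),
})

def classify(node, example):
    """Walk from the root to a leaf, testing one attribute per internal node."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

sunny_humid = classify(tree, {"Outlook": "sunny", "Humidity": "high", "Wind": "weak"})
overcast_day = classify(tree, {"Outlook": "overcast", "Humidity": "high", "Wind": "strong"})
rainy_calm = classify(tree, {"Outlook": "rain", "Humidity": "normal", "Wind": "weak"})
```

Temperature never appears in the tree, so it plays no part in classification.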
Example: Analytical comparison
Task
– Compare graduate and undergraduate students using
discriminant rule.
– DMQL query
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Given
– attributes name, gender, major, birth_place,
birth_date, residence, phone# and gpa
– Gen(ai) = concept hierarchies on attributes ai
– Ui = attribute analytical thresholds for attributes ai
– Ti = attribute generalization thresholds for
attributes ai
– R = attribute relevance threshold
Example: Analytical comparison (3)
1. Data collection
– target and contrasting classes
3. Synchronous generalization
– controlled by user-specified dimension thresholds
– prime target and contrasting class(es) relations/cuboids
5. Presentation
– as generalized relations, crosstabs, bar charts, pie charts, or
rules
– contrasting measures to reflect comparison between target
and contrasting classes
e.g. count%
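A contrasting measure such as count% can be computed per class once both classes have been generalized to the same level. In this sketch the generalized tuples and their counts are hypothetical.

```python
def count_percent(generalized_counts):
    """Share (in percent) of a class's tuples covered by each generalized tuple."""
    total = sum(generalized_counts.values())
    return {key: 100.0 * n / total for key, n in generalized_counts.items()}

# Hypothetical prime relations after synchronous generalization: (major, birth_place) -> count
graduate = {("Science", "Canada"): 30, ("Engineering", "Foreign"): 10}
undergraduate = {("Science", "Canada"): 45, ("Engineering", "Foreign"): 15,
                 ("Business", "Canada"): 40}

grad_pct = count_percent(graduate)
undergrad_pct = count_percent(undergraduate)

# Side-by-side comparison: generalized tuple -> (target count%, contrasting count%)
comparison = {
    key: (grad_pct.get(key, 0.0), undergrad_pct.get(key, 0.0))
    for key in sorted(set(graduate) | set(undergraduate))
}
```

A generalized tuple with a high count% in the target class and a low one in the contrasting class is a good candidate for a discriminant rule.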