What is Data Warehouse?

 Defined in many different ways, but not rigorously.

 A decision support database that is maintained separately from the organization’s operational database

 Supports information processing by providing a solid platform of consolidated, historical data for analysis

 “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon

 Data warehousing: the process of constructing and using data warehouses

Data Warehouse—Subject-Oriented

 Organized around major subjects, such as customer, product, sales

 Focusing on the modeling and analysis of data for decision makers, not on daily operations or
transaction processing

 Provides a simple and concise view around particular subject issues by excluding data that are not
useful in the decision support process

Data Warehouse—Integrated

 Constructed by integrating multiple, heterogeneous data sources

 relational databases, flat files, on-line transaction records

 Data cleaning and data integration techniques are applied.

 Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

 E.g., Hotel price: currency, tax, breakfast covered, etc.

 When data is moved to the warehouse, it is converted.

Data Warehouse—Time Variant

 The time horizon for the data warehouse is significantly longer than that of operational systems

 Operational database: current value data

 Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

 Every key structure in the data warehouse

 Contains an element of time, explicitly or implicitly

 But the key of operational data may or may not contain “time element”

Data Warehouse—Nonvolatile

 A physically separate store of data transformed from the operational environment

 Operational update of data does not occur in the data warehouse environment

 Does not require transaction processing, recovery, and concurrency control mechanisms

 Requires only two operations in data accessing:

 initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS

 Traditional heterogeneous DB integration: A query driven approach

 Build wrappers/mediators on top of heterogeneous databases

 When a query is posed to a client site, a meta-dictionary is used to translate the query into
queries appropriate for individual heterogeneous sites involved, and the results are
integrated into a global answer set

 Complex information filtering; queries compete for resources

 Data warehouse: update-driven, high performance

 Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS

 OLTP (on-line transaction processing)

 Major task of traditional relational DBMS

 Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

 OLAP (on-line analytical processing)

 Major task of data warehouse system

 Data analysis and decision making

 Distinct features (OLTP vs. OLAP):

 User and system orientation: customer vs. market

 Data contents: current, detailed vs. historical, consolidated

 Database design: ER + application vs. star + subject

 View: current, local vs. evolutionary, integrated

 Access patterns: update vs. read-only but complex queries

                    OLTP                           OLAP
users               clerk, IT professional         knowledge worker
function            day-to-day operations          decision support
DB design           application-oriented           subject-oriented
data                current, up-to-date;           historical; summarized,
                    detailed, flat relational;     multidimensional;
                    isolated                       integrated, consolidated
usage               repetitive                     ad-hoc
access              read/write; index/hash         lots of scans
                    on primary key
unit of work        short, simple transaction      complex query
# records accessed  tens                           millions
# users             thousands                      hundreds
DB size             100 MB-GB                      100 GB-TB
metric              transaction throughput         query throughput, response time
Why Separate Data Warehouse?
 High performance for both systems

 DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery

 Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

 Different functions and different data:

 missing data: Decision support requires historical data which operational DBs do not
typically maintain

 data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

 data quality: different sources typically use inconsistent data representations, codes and
formats which have to be reconciled

 Note: There are more and more systems which perform OLAP analysis directly on relational
databases

Data Warehousing and OLAP Technology: An Overview

 What is a data warehouse?

 A multi-dimensional data model

 Data warehouse architecture

 Data warehouse implementation

 From data warehousing to data mining

From Tables and Spreadsheets to Data Cubes

 A data warehouse is based on a multidimensional data model which views data in the form of a
data cube

 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

 Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)

 Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables

 In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D
cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of
cuboids forms a data cube.

Cube: A Lattice of Cuboids

[Figure: the lattice of cuboids over the four dimensions time, item, location, supplier]

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
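The lattice above can be enumerated mechanically: the cuboids of an n-dimensional cube are exactly the subsets of its dimension set. A minimal Python sketch (illustrative only):

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate all cuboids (dimension subsets) of a data cube,
    from the 0-D apex cuboid to the n-D base cuboid."""
    lattice = []
    for k in range(len(dimensions) + 1):
        for combo in combinations(dimensions, k):
            lattice.append(combo)
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)
print(len(lattice))  # 16 cuboids (2^4)
print(lattice[0])    # () -- the apex cuboid
print(lattice[-1])   # ('time', 'item', 'location', 'supplier') -- the base cuboid
```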
Data Warehouse Schema

 A schema is a collection of database objects (for our purposes, tables) associated
with one particular database username. This username is called the schema owner, the owner
of the related group of objects. A database may contain one or more schemas.

 A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized

Conceptual Modeling of Data Warehouses

 Modeling data warehouses: dimensions & measures

 Star schema: A fact table in the middle connected to a set of dimension tables

 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

 Fact constellations: Multiple fact tables share dimension tables; viewed as a collection of stars, therefore called a galaxy schema or fact constellation

Example of Star Schema

[Figure: sales star schema: a central sales fact table connected to the time, item, branch, and location dimension tables; the branch dimension, for example, has attributes branch_key, branch_name, branch_type]
Cube Definition Syntax (BNF) in DMQL

 Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

 Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

 Special Case (Shared Dimension Tables)

 First time as “cube definition”

 define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

Defining Fact Constellation in DMQL

define cube sales [time, item, branch, location]:


dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales
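The DMQL star schema above can be mimicked with plain Python structures to show how a star-join query resolves fact-table keys through dimension tables. All table contents below are invented for illustration:

```python
# Hypothetical miniature star schema: one fact table, two dimension tables.
item_dim = {
    1: {"item_name": "laptop", "brand": "Acme",  "type": "electronics"},
    2: {"item_name": "tablet", "brand": "Acme",  "type": "electronics"},
    3: {"item_name": "chair",  "brand": "Sitco", "type": "furniture"},
}
time_dim = {10: {"quarter": "Q1"}, 11: {"quarter": "Q2"}}

# Fact rows hold foreign keys into the dimensions plus measures.
sales_fact = [
    {"time_key": 10, "item_key": 1, "dollars_sold": 1200.0},
    {"time_key": 10, "item_key": 3, "dollars_sold": 300.0},
    {"time_key": 11, "item_key": 2, "dollars_sold": 800.0},
]

# A star-join: resolve each fact row's key, then group by brand.
totals = {}
for row in sales_fact:
    brand = item_dim[row["item_key"]]["brand"]
    totals[brand] = totals.get(brand, 0.0) + row["dollars_sold"]

print(totals)  # {'Acme': 2000.0, 'Sitco': 300.0}
```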

A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for location: street < city < province_or_state < country < all]
Multidimensional Data

 Sales volume as a function of product, month, and region


Typical OLAP Operations

 Roll up (drill-up): summarize data

 by climbing up hierarchy or by dimension reduction

 Drill down (roll down): reverse of roll-up

 from higher level summary to lower level summary or detailed data, or introducing new
dimensions

 Slice and dice: project and select

 Pivot (rotate):

 reorient the cube for visualization, e.g., turning a 3-D view into a series of 2-D planes

 Other operations

 drill across: involving (across) more than one fact table

 drill through: through the bottom level of the cube to its back-end relational tables (using
SQL)
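The roll-up and slice operations described above can be sketched over a toy cube held as a Python dict; the data and the city-to-country hierarchy are invented:

```python
# Toy cube as a dict: (city, product) -> sales.
cube = {
    ("Vancouver", "TV"): 100, ("Vancouver", "PC"): 150,
    ("Toronto", "TV"): 80,    ("Toronto", "PC"): 120,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up(cube, hierarchy):
    """Roll up the first dimension one level (city -> country)."""
    out = {}
    for (city, product), v in cube.items():
        key = (hierarchy[city], product)
        out[key] = out.get(key, 0) + v
    return out

def slice_op(cube, product):
    """Slice: fix the product dimension to a single value."""
    return {city: v for (city, p), v in cube.items() if p == product}

print(roll_up(cube, city_to_country))  # {('Canada', 'TV'): 180, ('Canada', 'PC'): 270}
print(slice_op(cube, "TV"))            # {'Vancouver': 100, 'Toronto': 80}
```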

Data Warehouse Design Process

 Top-down, bottom-up approaches or a combination of both

 Top-down: Starts with overall design and planning (mature)

 Bottom-up: Starts with experiments and prototypes (rapid)


 From software engineering point of view

 Waterfall: structured and systematic analysis at each step before proceeding to the next

 Spiral: rapid generation of increasingly functional systems, with short turnaround time

 Typical data warehouse design process

 Choose a business process to model, e.g., orders, invoices, etc.

 Choose the grain (atomic level of data) of the business process

 Choose the dimensions that will apply to each fact table record

 Choose the measure that will populate each fact table record
Three Data Warehouse Models

 Enterprise warehouse

 collects all of the information about subjects spanning the entire organization

 Data Mart

 a subset of corporate-wide data that is of value to a specific group of users; its scope is
confined to specific selected groups, such as a marketing data mart

 Independent vs. dependent (directly from warehouse) data mart

 Virtual warehouse
 A set of views over operational databases

 Only some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach

[Figure: recommended development process, with iterative model refinement]
Data Warehouse Back-End Tools and Utilities

 Data extraction

 get data from multiple, heterogeneous, and external sources

 Data cleaning

 detect errors in the data and rectify them when possible

 Data transformation

 convert data from legacy or host format to warehouse format

 Load

 sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

 Refresh

 propagate the updates from the data sources to the warehouse
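The extract-clean-transform-load steps above can be sketched end to end. The pipe-delimited legacy format and the records are invented; a real loader would also build indexes and check integrity:

```python
# A minimal, hypothetical sketch of the back-end pipeline:
# extract -> clean -> transform -> load.
raw_records = [
    "2004-01-15|laptop| 1200 ",
    "2004-01-16|tablet|",          # missing amount: flagged by cleaning
    "2004-01-17|chair|300",
]

def extract(lines):
    return [line.split("|") for line in lines]

def clean(rows):
    # Detect and drop rows with a missing or non-numeric measure.
    return [r for r in rows if r[2].strip().isdigit()]

def transform(rows):
    # Convert the legacy text format into warehouse records.
    return [{"date": d, "item": i, "amount": int(a)} for d, i, a in rows]

def load(records):
    # "Load": here just summarize; a real loader would also sort,
    # build indexes, and check integrity.
    return sum(r["amount"] for r in records)

records = transform(clean(extract(raw_records)))
print(len(records), load(records))  # 2 1500
```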

Metadata Repository

 Metadata is the data defining warehouse objects. The metadata repository stores:

 Description of the structure of the data warehouse

 schema, view, dimensions, hierarchies, derived data defn, data mart locations and
contents

 Operational meta-data

 data lineage (history of migrated data and transformation path), currency of data (active,
archived, or purged), monitoring information (warehouse usage statistics, error reports,
audit trails)

 The algorithms used for summarization

 The mapping from operational environment to the data warehouse


 Data related to system performance

 warehouse schema, view and derived data definitions

 Business data

 business terms and definitions, ownership of data, charging policies

OLAP Server Architectures

 Relational OLAP (ROLAP)

 Use relational or extended-relational DBMS to store and manage warehouse data and
OLAP middle ware

 Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services

 Greater scalability

 Multidimensional OLAP (MOLAP)

 Sparse array-based multidimensional storage engine

 Fast indexing to pre-computed summarized data

 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

 Flexibility, e.g., low level: relational, high-level: array

 Specialized SQL servers (e.g., Redbricks)

 Specialized support for SQL queries over star/snowflake schemas

Efficient Data Cube Computation

 Data cube can be viewed as a lattice of cuboids

 The bottom-most cuboid is the base cuboid

 The top-most cuboid (apex) contains only one cell

 How many cuboids are there in an n-dimensional cube where dimension i has Li hierarchy levels? T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)
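For an n-dimensional cube where dimension i has Li hierarchy levels, the total number of cuboids is the product of (Li + 1) over all dimensions: each dimension can appear at any of its Li levels or be rolled up to "all". A small sketch with invented level counts:

```python
from math import prod

def total_cuboids(levels):
    """Total cuboids for a cube whose i-th dimension has levels[i]
    hierarchy levels (excluding the virtual 'all' level): prod(Li + 1)."""
    return prod(l + 1 for l in levels)

# A cube with 3 dimensions of 4, 4, and 3 levels:
print(total_cuboids([4, 4, 3]))  # 5 * 5 * 4 = 100

# With no hierarchies (1 level per dimension) this reduces to 2^n:
print(total_cuboids([1] * 4))    # 16
```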

 Materialization of data cube

 Materialize every cuboid (full materialization), no cuboid (no materialization), or some cuboids (partial materialization)

 Selection of which cuboids to materialize

 Based on size, sharing, access frequency, etc.


Cube Operation

 Cube definition and computation in DMQL

define cube sales[item, city, year]: sum(sales_in_dollars)

compute cube sales

 Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et
al.’96)

SELECT item, city, year, SUM (amount)

FROM SALES

CUBE BY item, city, year

 Need to compute the following group-bys:

(item, city, year),

(item, city), (item, year), (city, year),

(item), (city), (year), ()
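The cube-by computation can be sketched directly: aggregate the fact rows once per subset of the dimension list. The sample rows are invented:

```python
from itertools import combinations

def compute_cube(rows, dims, measure):
    """Compute SUM(measure) for every group-by in the cube lattice,
    i.e. every subset of dims (the CUBE BY of Gray et al.)."""
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            agg = {}
            for row in rows:
                key = tuple(row[d] for d in group)
                agg[key] = agg.get(key, 0) + row[measure]
            result[group] = agg
    return result

rows = [
    {"item": "TV", "city": "Vancouver", "year": 2004, "amount": 100},
    {"item": "TV", "city": "Toronto",   "year": 2004, "amount": 80},
    {"item": "PC", "city": "Vancouver", "year": 2005, "amount": 150},
]
cube = compute_cube(rows, ["item", "city", "year"], "amount")
print(len(cube))        # 8 group-bys for 3 dimensions
print(cube[()])         # {(): 330} -- the apex (grand total)
print(cube[("item",)])  # {('TV',): 180, ('PC',): 150}
```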

Indexing OLAP Data: Bitmap Index

 Index on a particular column

 Each value in the column has a bit vector: bit-op is fast

 The length of the bit vector: # of records in the base table

 The i-th bit is set if the i-th row of the base table has the value for the indexed column

 not suitable for high cardinality domains
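A bitmap index as described above can be sketched with Python integers as bit vectors; the region column is invented:

```python
def bitmap_index(column):
    """Build one bit vector per distinct value; bit i is set when
    row i holds that value."""
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << i
    return index

region = ["Asia", "Europe", "Asia", "America", "Europe"]
idx = bitmap_index(region)

# Bit-ops are fast: rows where region is Asia OR Europe.
mask = idx["Asia"] | idx["Europe"]
rows = [i for i in range(len(region)) if mask >> i & 1]
print(rows)  # [0, 1, 2, 4]
```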

Indexing OLAP Data: Join Indices

 Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)

 Traditional indices map the values to a list of record ids

 It materializes relational join in JI file and speeds up relational join

 In data warehouses, a join index relates the values of the dimensions of a star schema to rows in
the fact table.

 E.g. fact table: Sales and two dimensions city and product

 A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city
 Join indices can span multiple dimensions
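A join index as described above can be sketched as a mapping from dimension values to fact-table R-IDs; the sales rows are invented:

```python
def build_join_index(fact_rows, dim_attr):
    """Map each distinct dimension value to the list of fact-table
    row ids (R-IDs) whose tuples reference that value."""
    ji = {}
    for rid, row in enumerate(fact_rows):
        ji.setdefault(row[dim_attr], []).append(rid)
    return ji

sales = [
    {"city": "Vancouver", "product": "TV", "amount": 100},
    {"city": "Toronto",   "product": "PC", "amount": 120},
    {"city": "Vancouver", "product": "PC", "amount": 150},
]
city_ji = build_join_index(sales, "city")
print(city_ji["Vancouver"])  # [0, 2]

# The join index avoids a full scan: fetch only the matching fact rows.
total = sum(sales[rid]["amount"] for rid in city_ji["Vancouver"])
print(total)  # 250
```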

Efficient Processing OLAP Queries

 Determine which operations should be performed on the available cuboids

 Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice =
selection + projection

 Determine which materialized cuboid(s) should be selected for OLAP op.

 Let the query to be processed be on {brand, province_or_state} with the condition “year
= 2004”, and there are 4 materialized cuboids available:

1) {year, item_name, city}

2) {year, brand, country}

3) {year, brand, province_or_state}

4) {item_name, province_or_state} where year = 2004

Which should be selected to process the query?

 Explore indexing structures and compressed vs. dense array structures in MOLAP
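Cuboid selection for the query above can be sketched as a derivability test: a materialized cuboid is usable when every queried attribute equals, or can be computed from, one of the cuboid's attributes via the concept hierarchies. The hierarchies assumed here are the usual textbook ones (item_name under brand; city under province_or_state under country):

```python
# Which attributes are finer (lower-level) than a given attribute,
# i.e. can be rolled up to it.
finer_than = {
    "brand": {"item_name"},
    "province_or_state": {"city"},
    "country": {"city", "province_or_state"},
}

def can_answer(cuboid, query_attrs):
    def derivable(attr):
        return any(a == attr or a in finer_than.get(attr, ()) for a in cuboid)
    return all(derivable(q) for q in query_attrs)

query = {"brand", "province_or_state"}
cuboids = {
    1: {"year", "item_name", "city"},
    2: {"year", "brand", "country"},
    3: {"year", "brand", "province_or_state"},
    4: {"item_name", "province_or_state"},
}
usable = [cid for cid, c in cuboids.items() if can_answer(c, query)]
print(usable)  # [1, 3, 4] -- cuboid 2 fails: country cannot derive province_or_state
```

Among the usable cuboids, the one requiring the least further aggregation (here, cuboid 3) would normally be preferred.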
 Data generalization and summarization-based characterization

 Analytical characterization: Analysis of attribute relevance

 Mining class comparisons: Discriminating between different classes

Data Generalization and Summarization-based Characterization

 Data generalization
– A process which abstracts a large set of task-relevant data in a
database from low conceptual levels to higher ones.

[Figure: conceptual levels 1 (lowest) through 5 (highest)]

– Approaches:
 Data cube approach(OLAP approach)

 Attribute-oriented induction approach

Data Warehousing/Mining 2
Characterization: Data Cube Approach (without using Attribute-Oriented Induction)
 Perform computations and store results in data cubes
 Strength
– An efficient implementation of data generalization
– Computation of various kinds of measures
 e.g., count( ), sum( ), average( ), max( )
– Generalization and specialization can be performed on a data cube by
roll-up and drill-down
 Limitations
– handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
– Lack of intelligent analysis: cannot tell which dimensions should be
used or what level the generalization should reach


Attribute-Oriented Induction
 Proposed in 1989 (KDD ‘89 workshop)
 Not confined to categorical data nor particular measures.
 How is it done?
– Collect the task-relevant data (initial relation) using a relational
database query
– Perform generalization by attribute removal or attribute
generalization.
– Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
– Interactive presentation with users.
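The generalize-then-merge steps above can be sketched in a few lines; the student tuples, the concept-hierarchy mappings, and the GPA bands are all invented for illustration:

```python
def attribute_oriented_induction(rows, generalize, removed):
    """Generalize each tuple attribute-by-attribute, then merge
    identical generalized tuples and accumulate their counts."""
    counts = {}
    for row in rows:
        key = tuple(
            generalize.get(a, lambda v: v)(v)
            for a, v in row.items() if a not in removed
        )
        counts[key] = counts.get(key, 0) + 1
    return counts

students = [
    {"name": "Jim",   "major": "CS",      "gpa": 3.67},
    {"name": "Scott", "major": "CS",      "gpa": 3.70},
    {"name": "Laura", "major": "Physics", "gpa": 3.83},
]
generalize = {
    "major": lambda m: "Science",  # climb the concept hierarchy one level
    "gpa":   lambda g: "excellent" if g >= 3.75 else "very_good",
}
summary = attribute_oriented_induction(students, generalize, removed={"name"})
print(summary)  # {('Science', 'very_good'): 2, ('Science', 'excellent'): 1}
```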

Basic Principles of Attribute-Oriented Induction
 Data focusing: task-relevant data, including dimensions,
and the result is the initial relation.
 Attribute-removal: remove attribute A if there is a large set
of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher level concepts are
expressed in terms of other attributes.
 Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A.
 Attribute-threshold control: typically 2-8; user-specified or default.
 Generalized relation threshold control: control the final
relation/rule size.

Basic Algorithm for Attribute-Oriented Induction
 InitialRel: Query processing of task-relevant data, deriving
the initial relation.
 PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
 PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
 Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
Example
 DMQL: Describe general characteristics of graduate
students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence,
phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }


Class Characterization: An Example

Initial relation:

Name            Gender  Major    Birth_Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus};
Birth_Place to country; Birth_date to age range; Residence to city; Phone # removed;
GPA to {Excl, VG, …}

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Crosstab (count by Gender and Birth_region):

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
Attribute Relevance Analysis

 Why?
– Which dimensions should be included?
– How high a level of generalization?
– Automatic vs. interactive
– Reduce # attributes; easy to understand patterns
 What?
– statistical method for preprocessing data
 filter out irrelevant or weakly relevant attributes
 retain or rank the relevant attributes

– relevance related to dimensions and levels


– analytical characterization, analytical comparison


Attribute relevance analysis (cont’d)

 How?
– Data Collection
– Analytical Generalization
 Use information gain analysis (e.g., entropy or other
measures) to identify highly relevant dimensions and
levels.
– Relevance Analysis
 Sort and select the most relevant dimensions and levels.
– Attribute-oriented Induction for class description
 On selected dimension/level
– OLAP operations (e.g. drilling, slicing) on relevance
rules

Relevance Measures

 Quantitative relevance measure: determines the classifying power of an attribute within a set of data.
 Methods
– information gain (ID3)
– gain ratio (C4.5)
– gini index
– χ² contingency table statistics
– uncertainty coefficient


Information-Theoretic Approach

 Decision tree
– each internal node tests an attribute
– each branch corresponds to attribute value
– each leaf node assigns a classification
 ID3 algorithm
– build decision tree based on training objects with
known class labels to classify testing objects
– rank attributes with information gain measure
– minimal height
 the least number of tests to classify an object
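The information gain measure used by ID3 can be sketched directly from its definition: the entropy of the class labels minus the expected entropy after splitting on an attribute. The four training tuples are invented:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum(
        (c / n) * log2(c / n)
        for c in {l: labels.count(l) for l in set(labels)}.values()
    )

def information_gain(rows, attr, target):
    """Expected reduction in entropy from splitting on attr (ID3)."""
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]
# Splitting on outlook separates the classes perfectly, so the
# gain equals the full entropy of the labels (1 bit here).
print(round(information_gain(rows, "outlook", "play"), 3))  # 1.0
```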

Top-Down Induction of Decision Tree

Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}

[Figure: decision tree]
Outlook = sunny    → test Humidity: high → no; normal → yes
Outlook = overcast → yes
Outlook = rain     → test Wind: strong → no; weak → yes

Mining Class Comparisons

 Comparison: Comparing two or more classes.


 Method:
– Partition the set of relevant data into the target class and the
contrasting class(es)
– Generalize both classes to the same high level concepts
– Compare tuples with the same high level descriptions
– Present for every tuple its description and two measures:
 support - distribution within single class
 comparison - distribution between classes
– Highlight the tuples with strong discriminant features
 Relevance Analysis:
– Find attributes (features) which best distinguish
different classes.

Example: Analytical comparison
 Task
– Compare graduate and undergraduate students using
discriminant rule.
– DMQL query

use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student


Example: Analytical comparison (2)

 Given
– attributes name, gender, major, birth_place,
birth_date, residence, phone# and gpa
– Gen(ai) = concept hierarchies on attributes ai
– Ui = attribute analytical thresholds for attributes ai
– Ti = attribute generalization thresholds for
attributes ai
– R = attribute relevance threshold

Example: Analytical comparison (3)

 1. Data collection
– target and contrasting classes

 2. Attribute relevance analysis
– remove attributes name, gender, major, phone#

 3. Synchronous generalization
– controlled by user-specified dimension thresholds
– prime target and contrasting class(es) relations/cuboids


Example: Analytical comparison (4)

Prime generalized relation for the target class: Graduate students

Birth_country  Age_range  Gpa        Count%
Canada         20-25      Good       5.53%
Canada         25-30      Good       2.32%
Canada         Over_30    Very_good  5.86%
…              …          …          …
Other          Over_30    Excellent  4.68%

Prime generalized relation for the contrasting class: Undergraduate students

Birth_country  Age_range  Gpa        Count%
Canada         15-20      Fair       5.53%
Canada         15-20      Good       4.53%
…              …          …          …
Canada         25-30      Good       5.02%
…              …          …          …
Other          Over_30    Excellent  0.68%
Example: Analytical comparison (5)

 4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust the levels of abstraction of the resulting descriptions

 5. Presentation
– as generalized relations, crosstabs, bar charts, pie charts, or
rules
– contrasting measures to reflect comparison between target
and contrasting classes
 e.g. count%

