7 Data Warehousing - 1
• Data warehousing:
– The process of constructing and using data warehouses.
• Data warehouses provide on-line analytical processing (OLAP) tools for the
interactive analysis of multidimensional data of varied granularities, which facilitates
effective data generalization and data mining.
– Many other data mining functions, such as association, classification, prediction,
and clustering, can be integrated with OLAP operations to enhance interactive
mining of knowledge at multiple levels of abstraction.
Major Features of a Data Warehouse
Four major features of a data warehouse:
• A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.
Subject-Oriented:
• A data warehouse is organized around major subjects, such as customer, product,
and sales.
• A data warehouse focuses on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing.
• A data warehouse provides a simple and concise view around particular subject issues
by excluding data that are not useful in the decision support process.
Integrated:
• A data warehouse is constructed by integrating multiple heterogeneous data sources,
such as relational databases, flat files, and on-line transaction records.
• Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, etc.
– When data is moved to the warehouse from operational databases, it is converted.
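The conversion step can be sketched as follows. This is a minimal illustration, not a prescribed method: the two source systems, their field names (`cust_id`, `customerNo`, `sex`, `height_in`), and the target layout are all hypothetical, chosen only to show differences in naming, encoding, and attribute measures being reconciled.

```python
# Hypothetical source records: two operational systems describe the same
# kind of entity with different field names, encodings, and units.
crm_rows = [{"cust_id": 1, "gender": "M", "height_cm": 180.0}]
billing_rows = [{"customerNo": 2, "sex": "female", "height_in": 65.0}]

# One agreed encoding and one agreed unit for the warehouse (assumed here).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}
CM_PER_INCH = 2.54


def from_crm(row):
    """Map a CRM record onto the warehouse layout."""
    return {
        "customer_id": row["cust_id"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "height_cm": row["height_cm"],
    }


def from_billing(row):
    """Map a billing record onto the warehouse layout, converting inches to cm."""
    return {
        "customer_id": row["customerNo"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "height_cm": round(row["height_in"] * CM_PER_INCH, 1),
    }


# After conversion, every row uses the same names, codes, and units.
warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each source gets its own small mapping function, so adding another source means adding one function rather than changing the warehouse layout.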
Time-Variant:
• The time horizon for a data warehouse is significantly longer than that of operational
systems.
– Operational database: current-value data
– Data warehouse data: provides information from a historical perspective (e.g., the past 5-10
years)
• Every key structure in the data warehouse contains an element of time, explicitly or
implicitly.
– The key structure of operational data, by contrast, may or may not contain a time element.
Non-volatile:
• A data warehouse is a physically separate store of data transformed from the
operational environment.
• Operational update of data does not occur in a data warehouse environment.
– Does not require transaction processing, recovery, and concurrency control
mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
Operational Database Systems and Data Warehouses
• The major task of on-line operational database systems is to perform on-line
transaction and query processing.
– These systems are called on-line transaction processing (OLTP) systems.
– They cover most of the day-to-day operations of an organization, such as
purchasing, inventory, banking, payroll, registration, and accounting.
• Data warehouse systems, on the other hand, serve users or knowledge workers in the
role of data analysis and decision making.
– Such systems can organize and present data in various formats in order to
accommodate the diverse needs of the different users.
– These systems are known as on-line analytical processing (OLAP) systems.
OLTP vs. OLAP
Users and system orientation:
• An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and IT professionals.
• An OLAP system is market-oriented and is used for data analysis by knowledge
workers, including managers, executives, and analysts.
Data contents:
• An OLTP system manages current data that, typically, are too detailed to be easily
used for decision making.
• An OLAP system manages large amounts of historical data and provides facilities for
summarization and aggregation.
Database design:
• An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design.
• An OLAP system typically adopts either a star or snowflake model and a
subject-oriented database design.
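To make the star model concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_item`, `dim_location`) and the sample rows are assumptions made for illustration; the point is only the shape: one central fact table holding measures, with a foreign key to each dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per item / location, with descriptive attributes.
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Fact table: numeric measures plus one key per dimension.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_item VALUES (1, 'laptop', 'computer'), (2, 'phone', 'mobile');
INSERT INTO dim_location VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO fact_sales VALUES (1, 1, 3, 3000.0), (2, 1, 5, 2500.0), (1, 2, 2, 2000.0);
""")

# A typical subject-oriented question: total sales per item category.
rows = conn.execute("""
    SELECT i.category, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON f.item_key = i.item_key
    GROUP BY i.category
    ORDER BY i.category
""").fetchall()
```

A snowflake model would differ only in normalizing the dimension tables further, for example by moving `country` out of `dim_location` into its own table.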
View:
• OLTP focuses on a current, local view of the data, whereas
• OLAP often spans multiple versions of a database schema, owing to the evolutionary process of the
enterprise.
Access patterns:
• OLTP access consists mainly of updates, whereas
• OLAP access is mostly read-only but involves complex queries.
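The contrast in access patterns can be illustrated on a single toy table. This is a sketch under assumptions: the `account` table and its contents are invented, and a real warehouse would hold the analytical data in a separate store rather than in the operational table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, "north", 100.0), (2, "north", 250.0), (3, "south", 75.0)],
)

# OLTP-style access: a short transaction that updates one row,
# located through the primary key.
with conn:
    conn.execute("UPDATE account SET balance = balance - 20.0 WHERE id = ?", (1,))

# OLAP-style access: a read-only query that scans the table and
# aggregates over many rows.
summary = conn.execute(
    "SELECT branch, COUNT(*), SUM(balance) FROM account GROUP BY branch ORDER BY branch"
).fetchall()
```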
Feature             OLTP                           OLAP
users               clerk, IT professional         knowledge worker
function            day-to-day operations          decision support
DB design           application-oriented           subject-oriented
data                current, up-to-date;           historical;
                    detailed, flat relational;     summarized, multidimensional;
                    isolated                       integrated, consolidated
usage               repetitive                     ad hoc
access              read/write;                    lots of scans
                    index/hash on primary key
unit of work        short, simple transaction      complex query
# records accessed  tens                           millions
# users             thousands                      hundreds
DB size             100 MB-GB                      100 GB-TB
metric              transaction throughput         query throughput, response
A Three-Tier Data Warehouse Architecture
Enterprise Warehouse
• Collects all of the information about subjects spanning the entire organization
Data Mart
• A subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to specific, selected groups, such as a marketing data mart.
Virtual Warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
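A virtual warehouse can be sketched with ordinary SQL views. The example below uses Python's built-in `sqlite3` module; the `orders` table, the view name, and the snapshot table are hypothetical. SQLite has no native materialized views, so materialization is imitated here by copying the view's result into a table, which is an assumption of this sketch rather than a general recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table (hypothetical).
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'east', 15.0), (3, 'west', 7.5);

-- Virtual summary: computed from the operational data on every query.
CREATE VIEW v_sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- "Materialized" summary: the view's result stored as a table at this moment.
CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;

-- A later operational insert.
INSERT INTO orders VALUES (4, 'west', 2.5);
""")

view_rows = conn.execute(
    "SELECT region, total FROM v_sales_by_region ORDER BY region"
).fetchall()
snapshot_rows = conn.execute(
    "SELECT region, total FROM m_sales_by_region ORDER BY region"
).fetchall()
```

The view reflects the later insert while the stored copy does not, which is the trade-off between the two: a view is always current but recomputed on each query, whereas a materialized summary is fast to read but must be refreshed.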
Multidimensional Data Model: Data Cube
• Data warehouses and OLAP tools are based on a multidimensional data model.
• Pivot (rotate):
– reorients the cube for visualization, for example by presenting a 3-D cube as a series of 2-D planes
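As a rough illustration of pivoting, the sketch below stores a tiny cube as a Python dictionary keyed by (time, item, location). The sample cells and the helper names `planes` and `pivot` are invented for this example; OLAP tools implement the operation on their own storage structures.

```python
from collections import defaultdict

# A tiny 3-D cube: (time, item, location) -> units sold (hypothetical data).
cube = {
    ("Q1", "laptop", "Toronto"): 10,
    ("Q1", "phone", "Toronto"): 4,
    ("Q2", "laptop", "Chicago"): 7,
}


def planes(cube, fixed_axis):
    """Split a 3-D cube into 2-D planes, one per value of the fixed axis."""
    result = defaultdict(dict)
    for coords, measure in cube.items():
        rest = tuple(c for i, c in enumerate(coords) if i != fixed_axis)
        result[coords[fixed_axis]][rest] = measure
    return dict(result)


def pivot(plane):
    """Rotate a 2-D plane by swapping its two axes."""
    return {(b, a): v for (a, b), v in plane.items()}


by_quarter = planes(cube, fixed_axis=0)   # one (item, location) plane per quarter
rotated_q1 = pivot(by_quarter["Q1"])      # same cells, viewed as (location, item)
```

Pivoting changes only the orientation of the axes; every measure value is preserved.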
OLAP Operation: Roll-up
• The roll-up operation (also called the drill-up
operation) performs aggregation on a data cube,
either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
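Both forms of roll-up can be sketched on a small two-dimensional cube. The data, the city-to-country hierarchy, and the function names below are assumptions made for illustration only.

```python
from collections import defaultdict

# Cells of a 2-D cube: (city, quarter) -> units sold (hypothetical data).
sales = {
    ("Toronto", "Q1"): 10,
    ("Vancouver", "Q1"): 5,
    ("Chicago", "Q1"): 8,
    ("Chicago", "Q2"): 2,
}

# One level of a concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    totals = defaultdict(int)
    for (city, quarter), units in cells.items():
        totals[(CITY_TO_COUNTRY[city], quarter)] += units
    return dict(totals)


def roll_up_drop_time(cells):
    """Roll up by dimension reduction: remove the time dimension entirely."""
    totals = defaultdict(int)
    for (city, _quarter), units in cells.items():
        totals[city] += units
    return dict(totals)


by_country = roll_up_location(sales)
by_city = roll_up_drop_time(sales)
```

In the first case the cube keeps both dimensions but location becomes coarser; in the second the cube loses a dimension. Either way the result holds fewer, more aggregated cells.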
OLAP Operation: Drill-down