DM Chapter 2
machine learning
(INSY3051)
Chapter 2
Data warehousing and OLAP Technology for data
mining
Introduction to Data Mining
Contents
OLAP technology, attribute-oriented induction
What is a data warehouse?
A multidimensional data model
Data cube computation
Data warehouse architecture
Data warehouse implementation
What Is a Data Warehouse?
A data warehouse has been defined in many different ways, but
none of the definitions is rigorous.
It generalizes and consolidates data in multidimensional space.
It provides architectures and tools for business executives to
systematically organize, understand, and use their data to
make strategic decisions.
It is a database that is maintained separately from an
organization’s operational databases.
A short and more comprehensive definition is
given by Inmon as:
“A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision making process”.
Subject-oriented: A data warehouse is organized
around major subjects, such as customer, supplier,
product, and sales.
It focuses on the modeling and analysis of data for
decision makers.
Integrated: A data warehouse is usually constructed by integrating
multiple heterogeneous sources, such as relational
databases, flat files, and on-line transaction records.
Time-variant: Data are stored to provide information
from a historical perspective (e.g., the past 5–10 years).
Nonvolatile: A data warehouse is always a physically
separate store of data transformed from the application
data found in the operational environment.
Differences between Operational Database Systems and Data
Warehouses
Operational database systems perform on-line transaction
processing (OLTP): day-to-day transactions and queries.
On-line analytical processing (OLAP) systems, on the other
hand, serve users or knowledge workers in the role of data
analysis and decision making.
Such systems can organize and present data in
various formats in order to accommodate the
diverse needs of different users.
Comparison between OLTP and OLAP systems.
• A 2-D view of sales data according to the dimensions time and item, where the
sales are from branches located in the city of Vancouver. The measure displayed is
dollars sold (in thousands).
Although we usually think of cubes as 3-D
geometric structures, in data warehousing the data
cube is n-dimensional.
A 3-D view of sales data for AllElectronics, according to the dimensions time, item, and
location. The measure displayed is dollars sold (in thousands).
A 3-D data cube representation of the data in the table above, according to the dimensions
time, item, and location. The measure displayed is dollars sold (in thousands).
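The relationship between the 3-D cube and the 2-D view for a single city can be sketched in a few lines of Python. This is only an illustration: the cell values below are made up (they are not the figures from the slide's table), and the names `cube_3d` and `view_2d` are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells: (time, item, location) -> dollars sold (in thousands).
# These values are invented for illustration, not taken from the slide's table.
cube_3d = {
    ("Q1", "computer", "Vancouver"): 800,
    ("Q1", "phone", "Vancouver"): 20,
    ("Q2", "computer", "Vancouver"): 900,
    ("Q2", "phone", "Vancouver"): 30,
    ("Q1", "computer", "Toronto"): 700,
    ("Q1", "phone", "Toronto"): 10,
    ("Q2", "computer", "Toronto"): 750,
    ("Q2", "phone", "Toronto"): 15,
}


def view_2d(cube, location):
    """Return the time x item table for one fixed location."""
    table = defaultdict(dict)
    for (time, item, loc), dollars in cube.items():
        if loc == location:
            table[time][item] = dollars
    return dict(table)


vancouver = view_2d(cube_3d, "Vancouver")
for time in sorted(vancouver):
    print(time, vancouver[time])
```

Adding a fourth dimension (for example, supplier) would only lengthen the key tuple, which is why the cube is n-dimensional rather than strictly 3-D.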
Thus:
▶ In data warehousing literature, the n-dimensional (n-D) base cube
is called the base cuboid.
▶ The base cuboid holds the lowest level of summarization: the
measure is grouped by all n dimensions.
▶ The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid.
▶ The apex cuboid holds the total over all dimensions, so it is not
grouped by any attribute.
▶ The lattice of cuboids forms a data cube.
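A minimal sketch of the lattice, assuming three dimensions with no concept hierarchies: each cuboid corresponds to one subset of the dimensions, so there are 2^3 = 8 cuboids, from the 0-D apex to the 3-D base. The sample cells and the helper names `cuboids` and `aggregate` are hypothetical.

```python
from itertools import combinations

dimensions = ("time", "item", "location")


def cuboids(dims):
    """List every cuboid as a tuple of group-by dimensions, apex (0-D) first."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


def aggregate(facts, dims, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = {}
    for key, value in facts.items():
        cell = tuple(key[i] for i in idx)
        out[cell] = out.get(cell, 0) + value
    return out


# Hypothetical base-cuboid cells: (time, item, location) -> dollars sold.
facts = {
    ("Q1", "computer", "Vancouver"): 800,
    ("Q1", "phone", "Vancouver"): 20,
    ("Q2", "computer", "Toronto"): 750,
}

lattice = cuboids(dimensions)                    # 8 cuboids for 3 dimensions
apex = aggregate(facts, dimensions, ())          # one cell: the grand total
base = aggregate(facts, dimensions, dimensions)  # same cells as `facts`
by_item = aggregate(facts, dimensions, ("item",))  # a 1-D cuboid

print(len(lattice), apex, by_item)
```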
Cube: A Lattice of Cuboids
Example of Star Schema
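As a rough sketch of what a star schema looks like in practice, the snippet below builds one central fact table that references three dimension tables, using Python's built-in `sqlite3` module. The table names, columns, and rows are assumptions chosen for illustration; they are not taken from the slide's diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical columns): one row per member.
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

-- Made-up sample rows.
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'laptop', 'computer'), (2, 'handset', 'phone');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 800.0, 4), (1, 2, 1, 20.0, 1), (2, 1, 2, 750.0, 3);
""")

# A typical star-join query: join the fact table to its dimensions, then group.
rows = conn.execute("""
    SELECT l.city, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON i.item_key = f.item_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.city, i.type
    ORDER BY l.city, i.type
""").fetchall()
print(rows)
```

In a snowflake schema, a dimension such as `location_dim` would itself be normalized into further tables (for example, a separate city table).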
Example of Snowflake Schema
Example of Fact Constellation
Typical OLAP Operations
▶ In the multidimensional model, data are organized into
multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies.
▶ This organization provides users with the flexibility to view
and conduct BA/DM investigations from different
perspectives.
▶ Different OLAP data cube operations exist to materialize
these views. The basic ones are:
– Roll up (drill-up) and Drill down (roll down)
– Slice and dice
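The operations listed above can be sketched on a small in-memory cube. This is a simplified illustration under stated assumptions: the cells, the city-to-country hierarchy, and the function names `roll_up`, `slice_`, and `dice` are all hypothetical, and a real OLAP server would implement these far more generally.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

# Hypothetical cells: (quarter, item, city) -> dollars sold (in thousands).
cube = {
    ("Q1", "computer", "Vancouver"): 800,
    ("Q1", "computer", "Toronto"): 700,
    ("Q2", "computer", "Chicago"): 900,
    ("Q2", "phone", "Vancouver"): 30,
}


def roll_up(cells):
    """Climb the location hierarchy from city to country, summing the measure."""
    out = defaultdict(int)
    for (quarter, item, city), dollars in cells.items():
        out[(quarter, item, CITY_TO_COUNTRY[city])] += dollars
    return dict(out)


def slice_(cells, quarter):
    """Fix the time dimension to one value, leaving an item x city sub-cube."""
    return {(i, c): v for (q, i, c), v in cells.items() if q == quarter}


def dice(cells, quarters, items):
    """Select on two dimensions at once; the result keeps all three dimensions."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in items}


by_country = roll_up(cube)
q1_slice = slice_(cube, "Q1")
diced = dice(cube, {"Q2"}, {"computer", "phone"})
print(by_country, q1_slice, diced, sep="\n")
```

Drill-down is the reverse of roll-up: it moves from `by_country` back to the more detailed city-level cells, which is only possible because the detailed `cube` is still stored.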
OLAP Operations: Roll-up and Drill-down
OLAP Operations: Slicing and Dicing
Design of a Data Warehouse: A Business Analysis
Framework
▶ The design process of a data warehouse mainly involves business
analysis.
▶ It involves answering the question “What can business analysts
gain from having a data warehouse?”
▶ It may provide a competitive advantage by presenting relevant
information.
▶ It may enhance business productivity, because it makes it possible to
quickly and efficiently gather information that accurately describes the organization.
▶ It may facilitate customer relationship management by providing a
consistent view of customers and items across all lines of business, all
departments, and all markets.
▶ Four views should be considered in the design of a data
warehouse within a business analysis framework:
– Top-down view: allows selection of the relevant
information (subjects) necessary for the data warehouse
– Data source view: exposes the information being
captured, stored, and managed by operational systems
– Data warehouse view: consists of the fact tables and
dimension tables
– Business query view: presents the data in the
warehouse from the viewpoint of the end user
Data Warehouse Design Process
▶ A data warehouse can be built using a top-down approach, a
bottom-up approach, or a combination of both
▶ Top-down: starts with overall design and planning
▶ Requires a large investment and commitment; appropriate when
the technology is mature and well known
▶ Bottom-up: starts with experiments and prototypes
▶ Appropriate in the early stage of business modeling and technology
development; enables the business to move forward at considerably
less expense and to evaluate the benefits of the technology before
making a significant commitment
▶ From a software engineering point of view, the process can follow:
▶ Waterfall
▶ Spiral/Agile
Architectural representation
▶ Data warehouses often adopt a three-tier architecture
▶ Warehouse database server (the bottom tier)
▶ Almost always a relational DBMS, rarely flat files
▶ Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources
▶ The tools and utilities perform data extraction, cleaning, and transformation, as
well as load and refresh functions to update the warehouse
▶ OLAP servers (the middle tier)
▶ Implemented either as Relational OLAP (ROLAP) or Multidimensional OLAP
(MOLAP)
▶ ROLAP: an extended relational DBMS that maps operations on multidimensional
data to standard relational operators
▶ MOLAP: a special-purpose server that directly
implements multidimensional data and operations
▶ Clients (the top tier)
▶ Query and reporting tools, analysis tools, data mining tools
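To illustrate what "mapping multidimensional operations to standard relational operators" can mean for a ROLAP server, the sketch below expresses a roll-up as a `GROUP BY` and a slice as a `WHERE` clause, using Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical, and a production ROLAP engine would generate such SQL from a richer metadata model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (quarter TEXT, item TEXT, city TEXT, dollars_sold REAL)"
)
# Made-up rows standing in for warehouse data.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Q1", "computer", "Vancouver", 800.0),
        ("Q1", "computer", "Toronto", 700.0),
        ("Q2", "phone", "Vancouver", 30.0),
    ],
)

# Roll-up that drops the city dimension: GROUP BY the remaining dimensions.
rolled = conn.execute(
    "SELECT quarter, item, SUM(dollars_sold) FROM sales "
    "GROUP BY quarter, item ORDER BY quarter, item"
).fetchall()

# Slice on time = 'Q1': a selection (WHERE) on one dimension.
sliced = conn.execute(
    "SELECT item, city, dollars_sold FROM sales "
    "WHERE quarter = ? ORDER BY city",
    ("Q1",),
).fetchall()

print(rolled)
print(sliced)
```

A MOLAP server, by contrast, would typically store the cells in an array-like structure and answer the same requests without going through relational operators.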
The Complete Data Warehouse System
Three Data Warehouse Models: Implementation
Perspective
▶ From the implementation point of view, there are three DW models
▶ Enterprise warehouse: collects all information about subjects that span
the entire organization (customers, products, sales, assets, personnel)
▶ Requires extensive business modeling (may take years to design and
build)
▶ Data mart: a subset of corporate-wide data that is of value to a specific
group of users
▶ Its scope is confined to specific, selected subjects. For example, a
marketing data mart may confine its subjects to customer, product, and
sales
▶ Virtual warehouse: a set of views over operational databases
▶ Only some of the possible summary views may be materialized
▶ Easy to build but requires excess capacity on operational database
servers
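The virtual warehouse idea can be sketched as a summary view defined directly over an operational table. The example uses Python's built-in `sqlite3` module; the `orders` table, the `sales_by_product` view, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational (OLTP) table with made-up rows.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    product  TEXT,
    amount   REAL
);
INSERT INTO orders VALUES
    (1, 'alice', 'laptop', 1200.0),
    (2, 'bob',   'phone',   300.0),
    (3, 'alice', 'phone',   350.0);

-- The "virtual warehouse": a summary view computed on demand, not stored.
CREATE VIEW sales_by_product AS
    SELECT product, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY product;
""")

summary = conn.execute(
    "SELECT product, n_orders, total FROM sales_by_product ORDER BY product"
).fetchall()
print(summary)
```

Because the view is recomputed against `orders` on every query, analytical queries compete with operational transactions for the same server, which is why excess capacity is needed.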
▶ Describe three possible conceptual data models for a data warehouse.
▶ Explain slicing and roll-up as OLAP operations.
▶ Enumerate at least 5 differences between OLAP and OLTP.
▶ Describe how a dimensional model (DM) differs from an
Entity–Relationship (ER) model.
▶ Present a diagrammatic representation of a typical star schema.