
Data Mining: Concepts and Techniques (3rd ed.)
Chapter 4

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Chapter 4: Data Warehousing and On-line Analytical Processing
• Data Warehouse: Basic Concepts
• Data Warehouse Modeling: Data Cube and OLAP
• Data Warehouse Design and Usage
• Data Warehouse Implementation
• Data Generalization by Attribute-Oriented Induction
• Summary

What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization’s operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
  - The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    - E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.

Data Warehouse: Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  - Contains an element of time, explicitly or implicitly
  - But the key of operational data may or may not contain a “time element”

Data Warehouse: Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data

OLTP vs. OLAP

Feature            | OLTP                                                      | OLAP
users              | clerk, IT professional                                    | knowledge worker
function           | day to day operations                                     | decision support
DB design          | application-oriented                                      | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated  | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                                | ad-hoc
access             | read/write, index/hash on prim. key                       | lots of scans
unit of work       | short, simple transaction                                 | complex query
# records accessed | tens                                                      | millions
# users            | thousands                                                 | hundreds
DB size            | 100MB-GB                                                  | 100GB-TB
metric             | transaction throughput                                    | query throughput, response

Why a Separate Data Warehouse?
• High performance for both systems
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
  - missing data: decision support requires historical data, which operational DBs do not typically maintain
  - data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis directly on relational databases

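Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture]
• Data Sources: operational DBs and other sources, fed into storage by extract, transform, load, and refresh
• Data Storage: data warehouse and data marts, with metadata and a monitor & integrator
• OLAP Engine: OLAP server, which serves the front-end tools
• Front-End Tools: analysis, query, reports, data mining
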
Three Data Warehouse Models
• Enterprise warehouse
  - Collects all of the information about subjects spanning the entire organization
  - Contains detailed data as well as summarized data, and can range in size from gigabytes to hundreds of gigabytes, terabytes, or beyond
  - Typically implemented on traditional mainframes, computer superservers, or parallel architecture platforms
• Data mart
  - A subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  - Independent vs. dependent (sourced directly from the warehouse) data marts
  - Typically runs on Unix/Linux- or Windows-based servers
• Virtual warehouse
  - A set of views over operational databases
  - For efficient query processing, only some of the possible summary views may be materialized

August 26, 2020 Data Mining: Concepts and Techniques 12


Extraction, Transformation, and Loading (ETL)
• Data extraction
  - get data from multiple, heterogeneous, and external sources
• Data cleaning
  - detect errors in the data and rectify them when possible
• Data transformation
  - convert data from legacy or host format to warehouse format
• Load
  - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh
  - propagate the updates from the data sources to the warehouse

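The ETL stages above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production pipeline: the two source record layouts, the field names (`cust`, `amount_usd`, `customer`, `amt`, `cur`), and the fixed exchange rate are all hypothetical.

```python
# Hypothetical source records in two inconsistent formats (assumed for illustration).
source_a = [{"cust": "Ann", "amount_usd": "10.50"}, {"cust": "Bob", "amount_usd": "bad"}]
source_b = [{"customer": "ann", "amt": 20.0, "cur": "CAD"}]

CAD_TO_USD = 0.75  # assumed fixed rate, illustration only


def extract():
    """Data extraction: gather records from multiple heterogeneous sources."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]


def clean_and_transform(tagged):
    """Data cleaning and transformation: reject bad rows, convert to one warehouse format."""
    rows = []
    for origin, r in tagged:
        try:
            if origin == "a":
                rows.append({"customer": r["cust"].lower(), "usd": float(r["amount_usd"])})
            else:
                rows.append({"customer": r["customer"].lower(), "usd": r["amt"] * CAD_TO_USD})
        except ValueError:
            continue  # unparseable amount: rejected here (a real system might log or repair it)
    return rows


def load(rows):
    """Load: sort and summarize (total per customer) into the warehouse table."""
    warehouse = {}
    for row in sorted(rows, key=lambda r: r["customer"]):
        warehouse[row["customer"]] = warehouse.get(row["customer"], 0.0) + row["usd"]
    return warehouse


warehouse = load(clean_and_transform(extract()))
print(warehouse)
```

A refresh step would rerun the same functions on new or changed source records and merge the result into the existing warehouse table.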
Metadata Repository
• Metadata are the data that define warehouse objects. The repository stores:
• Description of the structure of the data warehouse
  - schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
• Operational metadata
  - data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from the operational environment to the data warehouse
  - gateway descriptions, data partitions, data extraction, cleaning, and transformation rules
• Data related to system performance
  - warehouse schema, view, and derived data definitions
• Business metadata
  - business terms and definitions, ownership of data, charging policies



From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)

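The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The sketch below, which assumes no dimension has a concept hierarchy, groups the subsets by size and counts them; the variable names are illustrative only.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group every subset of the dimensions by its size (the cuboid's dimensionality).
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    names = [", ".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")

# With n dimensions and no hierarchies there are 2**n cuboids in total.
total = sum(len(c) for c in lattice.values())
print(total)
```

For the four dimensions here the counts per level are 1, 4, 6, 4, and 1.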
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  - Star schema: A fact table in the middle connected to a set of dimension tables
  - Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema
[Figure: star schema for sales]
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, state_or_province, country

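A star schema is queried by joining the fact table to the dimension tables it references and then grouping on dimension attributes. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the schema: only two dimensions and a few attributes are kept, the table names `time_dim`, `item_dim`, and `sales_fact` are assumptions, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two dimension tables and one fact table holding keys plus measures.
cur.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2010), (2, 'Q2', 2010);
INSERT INTO item_dim VALUES (1, 'TV-A', 'TV'), (2, 'PC-B', 'PC');
INSERT INTO sales_fact VALUES (1, 1, 2, 800.0), (1, 2, 1, 1200.0), (2, 1, 3, 1200.0);
""")

# Join the fact table to its dimensions, then aggregate a measure per (type, quarter).
rows = cur.execute("""
    SELECT i.type, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY i.type, t.quarter
    ORDER BY i.type, t.quarter
""").fetchall()
print(rows)
conn.close()
```

In a snowflake schema the same query would need one more join per normalized dimension level, which is the usual trade-off against the reduced redundancy.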
Example of Snowflake Schema
[Figure: snowflake schema for sales]
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, state_or_province, country

Example of Fact Constellation
[Figure: fact constellation with two fact tables sharing dimension tables]
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type

A Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy tree for location, from most general to most specific]
• all: all
• region: Europe ... North_America
• country: Germany ... Spain; Canada ... Mexico
• city: Frankfurt ...; Vancouver ... Toronto
• office: L. Chan ... M. Wind

Data Cube Measures: Three Categories
• Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  - E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  - E.g., avg(), min_N(), standard_deviation()
• Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  - E.g., median(), mode(), rank()

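The three categories can be checked on a small example. The sketch below uses made-up numbers and an arbitrary partitioning; it shows that sum() can be combined from partial sums, that avg() follows from two distributive sub-aggregates (sum and count), and that combining per-partition medians does not in general give the true median.

```python
from statistics import median

data = [4, 8, 15, 16, 23, 42, 7]              # made-up values
partitions = [data[:3], data[3:5], data[5:]]  # arbitrary split for illustration

# Distributive: the sum of per-partition sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
sum_from_parts = sum(partial_sums)

# Algebraic: avg is derived from two distributive sub-aggregates (sum, count).
partial_counts = [len(p) for p in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of per-partition medians is generally not the true median.
median_of_medians = median([median(p) for p in partitions])
true_median = median(data)

print(sum_from_parts, sum(data))
print(avg_from_parts, sum(data) / len(data))
print(median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to precompute across cuboids, while holistic ones usually need the underlying data or an approximation.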
View of Warehouses and Hierarchies
Specification of hierarchies
• Schema hierarchy
  day < {month < quarter; week} < year
• Set_grouping hierarchy
  {1..10} < inexpensive

Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month, Week > Day
[Figure: cube with axes Product, Region, and Month]

A Sample Data Cube
[Figure: sample 3-D data cube of sales]
• Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
• Product axis: TV, PC, VCR, sum
• Country axis: U.S.A, Canada, Mexico, sum
• Highlighted cell: total annual sales of TVs in U.S.A.

Cuboids Corresponding to the Cube
[Figure: lattice of cuboids for the dimensions product, date, country]
• 0-D (apex) cuboid: all
• 1-D cuboids: product; date; country
• 2-D cuboids: (product, date); (product, country); (date, country)
• 3-D (base) cuboid: (product, date, country)

Typical OLAP Operations
1) Roll up (drill-up): summarize data
  - by climbing up a concept hierarchy or by dimension reduction
  - Roll-up is also known as "consolidation" or "aggregation." It can be performed in two ways:
    - reducing dimensions
    - climbing up a concept hierarchy (a concept hierarchy groups values into ordered levels, e.g., city < country < region)
2) Drill down (roll down): reverse of roll-up
  - from higher-level summary to lower-level summary or detailed data, or by introducing new dimensions
  - Drill-down breaks the data into finer-grained parts. It can be performed in two ways:
    - moving down the concept hierarchy
    - adding a dimension
3) Slice: performs a selection on one dimension of the cube, producing a new sub-cube
4) Dice: defines a sub-cube by performing a selection on two or more dimensions
5) Pivot (rotate): rotates the data axes to provide an alternative presentation of the data
  - reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  - drill across: executes queries involving (across) more than one fact table
  - drill through: drills through the bottom level of the cube to its back-end relational tables (using SQL)

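The operations above can be illustrated on a tiny cube held in a Python dictionary. This is a sketch under stated assumptions rather than how an OLAP server works internally: the cell values, the `city_to_country` mapping, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, item) -> dollars_sold
cube = {
    ("Vancouver", "Q1", "TV"): 100, ("Vancouver", "Q2", "TV"): 150,
    ("Toronto", "Q1", "PC"): 200, ("Frankfurt", "Q1", "TV"): 50,
}
# One level of an assumed location concept hierarchy: city < country
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}


def roll_up_location(cells):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(city_to_country[city], quarter, item)] += value
    return dict(out)


def roll_up_drop_item(cells):
    """Roll-up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (loc, quarter, _item), value in cells.items():
        out[(loc, quarter)] += value
    return dict(out)


def slice_(cells, quarter):
    """Slice: select a single value on one dimension (time)."""
    return {k: v for k, v in cells.items() if k[1] == quarter}


def dice(cells, cities, items):
    """Dice: select on two dimensions (location and item)."""
    return {k: v for k, v in cells.items() if k[0] in cities and k[2] in items}


def pivot(cells):
    """Pivot: reorder the axes, here (city, quarter, item) -> (item, quarter, city)."""
    return {(item, quarter, city): v for (city, quarter, item), v in cells.items()}


by_country = roll_up_location(cube)
print(by_country)
print(roll_up_drop_item(by_country))
print(slice_(cube, "Q1"))
print(dice(cube, {"Vancouver", "Toronto"}, {"TV"}))
print(pivot(cube))
```

Drill-down is not shown as a function because it cannot be computed from the summarized cells alone; it amounts to going back to the more detailed cuboid (here, from `by_country` back to `cube`).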
Fig. 3.10 Typical OLAP Operations

A Star-Net Query Model
[Figure: star-net query model; each radial line is a dimension, and each circle on a line is called a footprint]
• Customer Orders: CONTRACTS, ORDER
• Shipping Method: AIR-EXPRESS, TRUCK
• Time: ANNUALLY, QTRLY, DAILY
• Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
• Location: CITY, COUNTRY, REGION
• Organization: SALES PERSON, DISTRICT, DIVISION
• Customer
• Promotion

Browsing a Data Cube
• Visualization
• OLAP capabilities
• Interactive manipulation

OLAP Server Architectures
• Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  - Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
• Multidimensional OLAP (MOLAP)
  - Sparse array-based multidimensional storage engine
  - Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  - Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Red Brick)
  - Specialized support for SQL queries over star/snowflake schemas

Attribute-Oriented Induction
• Proposed in 1989 (KDD '89 workshop)
• Not confined to categorical data nor to particular measures
• How is it done?
  - Collect the task-relevant data (initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - Interaction with users for knowledge presentation

Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate students in the University database
• Step 1. Fetch the relevant set of data using an SQL statement, e.g.,
    -- i.e., all attributes (SELECT *)
    SELECT name, gender, major, birth_place, birth_date, residence, phone#, gpa
    FROM student
    WHERE student_status IN ('Msc', 'MBA', 'PhD')
• Step 2. Perform attribute-oriented induction
• Step 3. Present results in generalized relation, cross-tab, or rule forms

Class Characterization: An Example

Initial relation (last row shows how each attribute is generalized):
Name           | Gender   | Major         | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M        | CS            | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M        | CS            | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F        | Physics       | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …        | …             | …                     | …          | …                        | …        | …
Removed        | Retained | Sci, Eng, Bus | Country               | Age range  | City                     | Removed  | Excl, VG, ..

Prime generalized relation:
Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Cross-tab of counts by Gender and Birth_Region:
Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62

Basic Principles of Attribute-Oriented Induction
• Data focusing: select the task-relevant data, including dimensions; the result is the initial relation
• Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes
• Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
• Attribute-threshold control: typically 2-8 distinct values, user-specified or default
• Generalized relation threshold control: control the final relation/rule size

Attribute-Oriented Induction: Basic Algorithm
• InitialRel: Query processing of task-relevant data, deriving the initial relation.
• PreGen: Based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal? or how high to generalize?
• PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.
• Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.

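The PrimeGen step (generalize each tuple, then merge identical generalized tuples while accumulating counts) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the rows, the `major_to_field` mapping, and the GPA cut-off are assumptions, and the PreGen plan is hard-coded instead of being derived from distinct-value counts.

```python
from collections import Counter

# Hypothetical initial relation (name, gender, major, birth_country, gpa); values are illustrative.
initial = [
    ("Jim", "M", "CS", "Canada", 3.67),
    ("Scott", "M", "CS", "Canada", 3.70),
    ("Laura", "F", "Physics", "USA", 3.83),
    ("Ana", "F", "Biology", "Brazil", 3.90),
]

# Assumed generalization operator (concept hierarchy) for the major attribute.
major_to_field = {"CS": "Science", "Physics": "Science", "Biology": "Science"}


def generalize(row):
    """Apply the (hard-coded) generalization plan to one tuple."""
    _name, gender, major, country, gpa = row              # attribute removal: name is dropped
    field = major_to_field[major]                          # attribute generalization: major -> field
    region = "Canada" if country == "Canada" else "Foreign"
    grade = "Excellent" if gpa >= 3.75 else "Very-good"    # assumed cut-off
    return (gender, field, region, grade)


# Merge identical generalized tuples, accumulating their counts.
prime_relation = Counter(generalize(r) for r in initial)
for tup, count in prime_relation.items():
    print(tup, count)
```

The four input tuples collapse into two generalized tuples with a count of 2 each, which is the shape of the prime generalized relation in the class characterization example.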
Presentation of Generalized Results
• Generalized relation:
  - Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
• Cross tabulation:
  - Mapping results into cross tabulation form (similar to contingency tables).
• Visualization techniques:
  - Pie charts, bar charts, curves, cubes, and other visual forms.
• Quantitative characteristic rules:
  - Mapping generalized results into characteristic rules with quantitative information associated with them, e.g.,
    $$\mathit{grad}(x) \wedge \mathit{male}(x) \Rightarrow \mathit{birth\_region}(x) = \text{``Canada''}\,[t: 53\%] \;\vee\; \mathit{birth\_region}(x) = \text{``foreign''}\,[t: 47\%]$$

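The percentages in the rule are t-weights: the count of a generalized tuple divided by the total count for the class. The sketch below recomputes them from the Gender by Birth_Region cross-tab of the class characterization example; the function name `t_weights` and the rounding to whole percentages are assumptions.

```python
# Counts from the Gender x Birth_Region cross-tab in the class characterization example.
crosstab = {
    "M": {"Canada": 16, "Foreign": 14},
    "F": {"Canada": 10, "Foreign": 22},
}


def t_weights(row):
    """t-weight of each generalized tuple: its count over the class total, in percent."""
    total = sum(row.values())
    return {region: round(100 * count / total) for region, count in row.items()}


male = t_weights(crosstab["M"])
female = t_weights(crosstab["F"])
print(male)
print(female)
```

For male graduate students this gives 16/30 and 14/30, which round to the 53% and 47% shown in the rule.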
Mining Class Comparisons
• Comparison: Comparing two or more classes
• Method:
  - Partition the set of relevant data into the target class and the contrasting class(es)
  - Generalize both classes to the same high level concepts
  - Compare tuples with the same high level descriptions
  - Present for every tuple its description and two measures
    - support: distribution within single class
    - comparison: distribution between classes
  - Highlight the tuples with strong discriminant features
• Relevance Analysis:
  - Find attributes (features) which best distinguish different classes

Concept Description vs. Cube-Based OLAP
• Similarity:
  - Data generalization
  - Presentation of data summarization at multiple levels of abstraction
  - Interactive drilling, pivoting, slicing and dicing
• Differences:
  - OLAP has systematic preprocessing, is query independent, and can drill down to a rather low level
  - AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions
  - AOI also works on data that are not in relational form

Summary
• Data warehousing: A multi-dimensional model of a data warehouse
  - A data cube consists of dimensions & measures
  - Star schema, snowflake schema, fact constellations
  - OLAP operations: drilling, rolling, slicing, dicing and pivoting
• Data Warehouse Architecture, Design, and Usage
  - Multi-tiered architecture
  - Business analysis design framework
  - Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)
• Implementation: Efficient computation of data cubes
  - Partial vs. full vs. no materialization
  - Indexing OLAP data: Bitmap index and join index
  - OLAP query processing
  - OLAP servers: ROLAP, MOLAP, HOLAP
• Data generalization: Attribute-oriented induction

References (I)
• S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96
• D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97
• R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97
• S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997
• E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993
• J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997
• A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999
• J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998
• V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96
• J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97

References (II)
• C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003
• W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
• R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd ed. John Wiley, 2002
• P. O'Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8-11, Sept. 1995
• P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
• Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998
• S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
• A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS'00
• D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using views. VLDB'96
• P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987
• J. Widom. Research problems in data warehousing. CIKM'95
• K. Wu, E. Otoo, and A. Shoshani. Optimal Bitmap Indices with Efficient Compression. ACM Trans. on Database Systems (TODS), 31(1):1-38, 2006

Surplus Slides

Compression of Bitmap Indices
• Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage, since the majority of the bits are 0s
• Two compression schemes:
  - Byte-aligned Bitmap Code (BBC)
  - Word-Aligned Hybrid (WAH) code
• Time and space required to operate on a compressed bitmap are proportional to the total size of the bitmap
• Optimal on attributes of low cardinality as well as those of high cardinality.
• WAH outperforms BBC by about a factor of two
