
Data Mining

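Liu Ying (刘莹), Ph.D., Professor

School of Computer Science and Technology, University of Chinese Academy of Sciences
Data Mining and High Performance Computing Laboratory, University of Chinese Academy of Sciences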
Knowledge Discovery (KDD) Process
▪ Data mining: core of the knowledge discovery process
[Figure: KDD pipeline, bottom to top: Databases and flat files → Data Cleaning and Integration → Data Warehouse → Selection and Transformation → Data Mining → Pattern Evaluation]

2022-03-02 2
Data Warehouse
◼ What is a data warehouse?
◼ A multi-dimensional data model
◼ Data warehouse architecture
◼ From data warehousing to data mining

What Is a Data Warehouse?
◼ “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process.” — W. H.
Inmon
◼ Defined in many different ways, but not rigorously
▪ A decision support database that is maintained separately from
the organization’s operational database
▪ Supports information processing by providing a solid platform of
consolidated, historical data for analysis

Data Warehouse
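◼ A data warehouse brings together business data scattered across
different information islands in the enterprise network and stores it
in a single integrated relational database. This integrated information
makes access easier for users and lets decision makers analyze
historical data over a period of time to study how things are
trending. (Informix)
◼ A data warehouse is a management technology that aims to achieve
effective decision support through smooth, sound, and comprehensive
information management. (SAS Institute)
◼ A data warehouse is a repository of integrated information that is
available for querying or analysis. (Stanford University)
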
Example
◼ Customer relationship management

◼ Banking decision support system


◼ Insurance decision support system
Example
◼ Weather forecasting
▪ Air pressure, temperature, longitude/latitude,
humidity, time, etc.
▪ Slice, drill down, roll up, etc.
▪ Query
▪ Multi-dimensional visualization

Data Warehouse—Subject-Oriented
◼ Organized around major subjects, such as customer,
product, sales
◼ Focus on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
◼ Provide a simple and concise view around particular
subject issues by excluding data that are not useful
in the decision support process

Data Warehouse—Integrated
◼ Constructed by integrating multiple, heterogeneous
data sources
▪ relational databases, flat files, on-line transaction records
◼ Data cleaning and data integration techniques are
applied
▪ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
▪ When data is moved to the warehouse, it is converted

Data Warehouse—Time Variant
◼ The time horizon for the data warehouse is
significantly longer than that of operational systems
▪ Operational database: current value data
▪ Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
◼ Every key structure in the data warehouse
▪ Contains an element of time, explicitly or implicitly
▪ But the key of operational data may or may not contain
“time element”

Data Warehouse—Nonvolatile
◼ A physically separate store of data transformed
from the operational environment
◼ Operational update of data does not occur in the
data warehouse environment
▪ Does not require transaction processing, recovery,
and concurrency control mechanisms
▪ Requires only two operations in data accessing:
• initial loading of data and access of data

Data Warehouse vs. Operational DBMS
◼ OLTP (on-line transaction processing)
▪ Major task of traditional relational DBMS
▪ Day-to-day operations: e.g. purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
◼ OLAP (on-line analytical processing)
▪ Major task of data warehouse system
▪ Data analysis and decision making
◼ Distinct features (OLTP vs. OLAP):
▪ User and system orientation: customer vs. market
▪ Data contents: current, detailed vs. historical, consolidated
▪ View: current, local vs. evolutionary, integrated
▪ Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response

From Tables and Spreadsheets to Data Cubes

◼ A data warehouse is based on a
multidimensional data model which views data
in the form of a data cube
◼ A data cube allows data to be modeled and
viewed in multiple dimensions
▪ Dimension tables, such as item (item_name, brand,
type), or time (day, week, month, quarter, year)
▪ Fact table contains measures (such as dollars_sold)
and keys to each of the related dimension tables
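
The relationship between a fact table and its dimension tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the key values, and the helper name `cube_view` are hypothetical, and only `item_name`, `brand`, `type`, the time levels, and `dollars_sold` come from the slide.

```python
from collections import defaultdict

# Hypothetical dimension tables, keyed by surrogate key.
item_dim = {
    1: {"item_name": "TV-100", "brand": "Acme", "type": "TV"},
    2: {"item_name": "PC-200", "brand": "Bolt", "type": "PC"},
}
time_dim = {
    10: {"day": "2022-01-15", "month": "2022-01", "quarter": "2022-Q1", "year": 2022},
    11: {"day": "2022-04-02", "month": "2022-04", "quarter": "2022-Q2", "year": 2022},
}

# Fact table: one key per related dimension table plus the measure dollars_sold.
sales_fact = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 500.0},
    {"item_key": 1, "time_key": 11, "dollars_sold": 300.0},
    {"item_key": 2, "time_key": 10, "dollars_sold": 900.0},
]

def cube_view(item_attr, time_attr):
    """Sum dollars_sold for every (item_attr, time_attr) cell of the cube."""
    cells = defaultdict(float)
    for row in sales_fact:
        item_value = item_dim[row["item_key"]][item_attr]
        time_value = time_dim[row["time_key"]][time_attr]
        cells[(item_value, time_value)] += row["dollars_sold"]
    return dict(cells)

by_type_quarter = cube_view("type", "quarter")
by_type_year = cube_view("type", "year")
```

Choosing a different attribute of the same dimension (quarter versus year) yields a view of the same facts at a different level of detail.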
Conceptual Modeling of Data Warehouses

◼ Modeling data warehouses: dimensions &
measures
▪ Star schema: A fact table in the middle connected to a
set of dimension tables
▪ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
▪ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
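
As a rough illustration of how a snowflake schema refines a star schema, the sketch below normalizes a denormalized location dimension into a smaller location table plus a city table. The sample rows and the function name `snowflake` are assumptions made for this example; the attribute names follow the location dimension used in these slides.

```python
# Star-schema location dimension: one denormalized row per location (hypothetical data).
star_location = [
    {"location_key": 1, "street": "1 Main St", "city": "Vancouver",
     "state_or_province": "British Columbia", "country": "Canada"},
    {"location_key": 2, "street": "9 Oak Ave", "city": "Vancouver",
     "state_or_province": "British Columbia", "country": "Canada"},
    {"location_key": 3, "street": "5 King Rd", "city": "Toronto",
     "state_or_province": "Ontario", "country": "Canada"},
]

def snowflake(rows):
    """Split the city-level attributes out into their own dimension table."""
    city_keys = {}        # (city, state_or_province, country) -> city_key
    city_table = []
    location_table = []
    for row in rows:
        city_id = (row["city"], row["state_or_province"], row["country"])
        if city_id not in city_keys:
            city_keys[city_id] = len(city_keys) + 1
            city_table.append({
                "city_key": city_keys[city_id],
                "city": row["city"],
                "state_or_province": row["state_or_province"],
                "country": row["country"],
            })
        location_table.append({
            "location_key": row["location_key"],
            "street": row["street"],
            "city_key": city_keys[city_id],
        })
    return location_table, city_table

location_table, city_table = snowflake(star_location)
```

The normalized form stores each city once, at the cost of an extra join when a query needs the city or country of a location.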
Example of Star Schema
▪ Sales Fact Table
• keys: time_key, item_key, branch_key, location_key
• measures: units_sold, dollars_sold, avg_sales
▪ Dimension time: time_key, day, day_of_the_week, month, quarter, year
▪ Dimension item: item_key, item_name, brand, type, supplier_type
▪ Dimension branch: branch_key, branch_name, branch_type
▪ Dimension location: location_key, street, city, state_or_province, country
Example of Snowflake Schema
▪ Sales Fact Table
• keys: time_key, item_key, branch_key, location_key
• measures: units_sold, dollars_sold, avg_sales
▪ Dimension time: time_key, day, day_of_the_week, month, quarter, year
▪ Dimension item: item_key, item_name, brand, type, supplier_key
• Dimension supplier: supplier_key, supplier_type
▪ Dimension branch: branch_key, branch_name, branch_type
▪ Dimension location: location_key, street, city_key
• Dimension city: city_key, city, state_or_province, country
Example of Fact Constellation
▪ Sales Fact Table
• keys: time_key, item_key, branch_key, location_key
• measures: units_sold, dollars_sold, avg_sales
▪ Shipping Fact Table
• keys: time_key, item_key, shipper_key, from_location, to_location
• measures: dollars_cost, units_shipped
▪ Dimension time: time_key, day, day_of_the_week, month, quarter, year
▪ Dimension item: item_key, item_name, brand, type, supplier_type
▪ Dimension branch: branch_key, branch_name, branch_type
▪ Dimension location: location_key, street, city, province_or_state, country
▪ Dimension shipper: shipper_key, shipper_name, location_key, shipper_type
Cube Definition Syntax in DMQL

◼ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
◼ Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
◼ Special Case (Shared Dimension Tables)
▪ The first time, the dimension is defined as part of a “cube definition”
▪ Afterwards, it is referenced from other cubes:
define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold, avg_sales, units_sold
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)

Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
dollars_sold, avg_sales, units_sold
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key, street,
city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold, avg_sales, units_sold
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost, units_shipped
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Exercise
1. Suppose that a data warehouse consists of three
dimensions time, doctor, and patient, and two
measures count and charge, where charge is the
fee that a doctor charges a patient for a visit.

(1) Draw a schema diagram for the data warehouse.

How to Generate a Specified Data Cube?
◼ A DMQL specification is translated into an SQL query
define cube sales_star [time, item, branch, location]:
dollars_sold, units_sold

translator (DMQL to SQL):
select s.time_key, s.item_key, s.branch_key, s.location_key,
sum(s.number_of_units_sold*s.price), sum(s.number_of_units_sold)
from time t, item i, branch b, location l, sales s
where s.time_key = t.time_key and s.item_key = i.item_key
and s.branch_key = b.branch_key and s.location_key = l.location_key
group by s.time_key, s.item_key, s.branch_key, s.location_key
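
To see what such a translated query returns, the sketch below runs it against a tiny in-memory SQLite database using Python's standard `sqlite3` module. The table contents are invented, and the dimension tables are renamed with a `_dim` suffix purely as a precaution in this example; the columns `number_of_units_sold` and `price` are taken from the query on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    number_of_units_sold INTEGER, price REAL
);
INSERT INTO time_dim VALUES (1, 2022);
INSERT INTO item_dim VALUES (1, 'TV');
INSERT INTO item_dim VALUES (2, 'PC');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Vancouver');
INSERT INTO sales VALUES (1, 1, 1, 1, 2, 100.0);
INSERT INTO sales VALUES (1, 1, 1, 1, 3, 100.0);
INSERT INTO sales VALUES (1, 2, 1, 1, 1, 900.0);
""")

# Same shape as the translated query: join the fact table to every
# dimension table, then group by all four dimension keys.
query = """
SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
       SUM(s.number_of_units_sold * s.price) AS dollars_sold,
       SUM(s.number_of_units_sold) AS units_sold
FROM time_dim t, item_dim i, branch_dim b, location_dim l, sales s
WHERE s.time_key = t.time_key AND s.item_key = i.item_key
  AND s.branch_key = b.branch_key AND s.location_key = l.location_key
GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
ORDER BY s.item_key
"""
base_cuboid = conn.execute(query).fetchall()
```

Each returned row is one cell of the base cuboid: the four dimension keys followed by the two aggregated measures.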

A Concept Hierarchy: Dimension (location)
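[Figure: concept hierarchy tree for location, one level per line]
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (under Europe); Canada, ..., Mexico (under North_America)
city: Frankfurt, ... (under Germany); Vancouver, ..., Toronto (under Canada)
office: L. Chan, ..., M. Wind
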
A Concept Hierarchy: Dimension (time)
[Figure: lattice over the levels of the time dimension: day → month → quarter → year, and day → week → year]

A Concept Hierarchy for Numeric Values
$0…$1000
▪ $0…$200: $0…$100, $100…$200
▪ $200…$400: $200…$300, $300…$400
▪ $400…$600: $400…$500, $500…$600
▪ $600…$800: $600…$700, $700…$800
▪ $800…$1000: $800…$900, $900…$1000
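
A numeric concept hierarchy like this one can be generated by repeated bucketing. The sketch below is a minimal illustration: the function name `price_ranges` is hypothetical, and it assumes half-open intervals [low, high), since the slide does not say which range a boundary value such as $200 belongs to.

```python
def price_ranges(price, widths=(1000, 200, 100)):
    """Return the range containing price at each level, coarsest first.

    Assumes half-open intervals [low, high); widths match the three
    levels shown on the slide.
    """
    if not 0 <= price < widths[0]:
        raise ValueError("price must lie in [0, %d)" % widths[0])
    ranges = []
    for width in widths:
        low = (price // width) * width
        ranges.append((low, low + width))
    return ranges

path_for_250 = price_ranges(250)
```

Reading the returned list from right to left corresponds to rolling a value up from the finest range to the coarsest one.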

Multidimensional Data
◼ Sales volume as a function of product, month,
and region
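▪ Dimensions: Product, Location, Time
▪ Hierarchical summarization paths
• Product: Industry → Category → Product
• Location: Region → Country → City → Office
• Time: Year → Quarter → Month → Day, and Year → Week → Day
[Figure: data cube of sales volume; surviving axis labels: Product, time]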
Typical OLAP Operations
◼ Roll up (drill-up): summarize data
▪ by climbing up hierarchy or by dimension
reduction
◼ Drill down (roll down): reverse of roll-up
▪ from higher level summary to lower level
summary or detailed data, or introducing new
dimensions
◼ Slice and dice: project and select
◼ Pivot (rotate):
▪ reorient the cube, visualization, 3D to series
of 2D planes
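
The operations above can be sketched on a toy cube stored as a Python dictionary. Everything here is illustrative: the cell values, the `city_to_country` mapping, and the function names are assumptions for the example, not part of any OLAP product.

```python
from collections import defaultdict

# Base cuboid: (city, quarter, item) -> dollars_sold (hypothetical values).
cube = {
    ("Vancouver", "Q1", "TV"): 100,
    ("Vancouver", "Q2", "TV"): 150,
    ("Toronto", "Q1", "TV"): 200,
    ("Toronto", "Q1", "PC"): 300,
    ("Frankfurt", "Q2", "PC"): 250,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}

def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(city_to_country[city], quarter, item)] += value
    return dict(out)

def roll_up_drop_item(cells):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (location, quarter, _item), value in cells.items():
        out[(location, quarter)] += value
    return dict(out)

def slice_quarter(cells, quarter):
    """Slice: fix one dimension to a single value and drop it."""
    return {(loc, item): v for (loc, q, item), v in cells.items() if q == quarter}

def dice(cells, locations, quarters):
    """Dice: select a sub-cube by restricting two dimensions."""
    return {k: v for k, v in cells.items() if k[0] in locations and k[1] in quarters}

def pivot(table):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table.items()}

by_country = roll_up_location(cube)
q1_slice = slice_quarter(cube, "Q1")
```

Drill-down has no function of its own in this sketch: it means going back from `by_country` to the finer-grained `cube`, which is only possible because the detailed cells are still stored.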
A Sample Data Cube
[Figure: sample data cube with three dimensions: Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), products (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). Callouts mark the cells "Total annual sales of TV in U.S.A." and "Total annual sales of TV".]
OLAP Operations

◼ Other operations
▪ drill across: involving (across) more than one fact
table
▪ drill through: through the bottom level of the cube
to its back-end relational tables (using SQL)
▪ rank top N or bottom N items in lists
▪ Compute average, variance, deviation
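
A small sketch of three of these operations follows. The two fact tables are reduced to per-item totals with invented values, the function names are hypothetical, and "deviation" is read here as the population standard deviation, which is an assumption.

```python
import statistics

# Two fact tables that share the item dimension (hypothetical totals).
sales = {"TV": 800, "PC": 900, "VCR": 120}       # item -> dollars_sold
shipping = {"TV": 60, "PC": 45, "VCR": 30}       # item -> dollars_cost

def drill_across(fact_a, fact_b):
    """Combine measures from two fact tables on their shared dimension value."""
    return {key: (fact_a[key], fact_b[key]) for key in fact_a.keys() & fact_b.keys()}

def top_n(cells, n):
    """Rank cells by measure value and keep the n largest."""
    return sorted(cells.items(), key=lambda kv: kv[1], reverse=True)[:n]

combined = drill_across(sales, shipping)
best_two = top_n(sales, 2)

values = list(sales.values())
average = statistics.mean(values)
variance = statistics.pvariance(values)    # population variance
deviation = statistics.pstdev(values)      # population standard deviation
```

Drill-through is not shown; it would issue an SQL query against the back-end relational tables rather than operate on the cube.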

Exercise
1. Suppose that a data warehouse consists of three
dimensions time, doctor, and patient, and two
measures count and charge, where charge is the fee
that a doctor charges a patient for a visit.

(2) Starting with the base cuboid [day, doctor, patient],
what OLAP operations should be performed in order
to list the total fee collected by each doctor in 1999?

Data Warehouse: A Three-Layer Architecture

[Figure: three-layer data warehouse architecture, left to right]
▪ Data sources: operational DBs and other sources, fed in through Extract, Transform, Load, Refresh
▪ Bottom layer (data storage): data warehouse and data marts, with metadata and a monitor & integrator
▪ Middle layer (OLAP engine): OLAP server, which serves the top layer
▪ Top layer (front-end tools): analysis, query, reports, data mining
Data Warehouse Back-End Tools and Utilities
◼ Data extraction
▪ get data from multiple, heterogeneous, and external sources
◼ Data cleaning
▪ detect errors in the data and rectify them when possible
◼ Data transformation
▪ convert data from legacy or host format to warehouse format
◼ Load
▪ sort, summarize, consolidate, compute views, check integrity
◼ Refresh
▪ propagate the updates from the data sources to the
warehouse
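
The five back-end steps can be lined up as a small pipeline. This is a sketch under stated assumptions: the source rows, the field names `customer` and `amount`, and the cleaning rules are all invented for illustration, and real tools do far more at each step.

```python
from collections import defaultdict

def extract(*sources):
    """Extraction: gather rows from several heterogeneous sources."""
    return [dict(row) for source in sources for row in source]

def clean(rows):
    """Cleaning: repair what can be repaired, drop what cannot."""
    cleaned = []
    for row in rows:
        if row.get("amount") in (None, ""):
            continue  # no measure value, so the row cannot be rectified
        cleaned.append(dict(row, customer=row["customer"].strip().title()))
    return cleaned

def transform(rows):
    """Transformation: convert source formats to the warehouse format."""
    return [dict(row, amount=float(row["amount"])) for row in rows]

def load(warehouse, rows):
    """Load: consolidate rows into per-customer totals."""
    for row in rows:
        warehouse[row["customer"]] += row["amount"]
    return warehouse

def refresh(warehouse, *changed_sources):
    """Refresh: propagate new source rows into the warehouse."""
    return load(warehouse, transform(clean(extract(*changed_sources))))

crm_rows = [{"customer": " alice ", "amount": "10.5"}, {"customer": "BOB", "amount": ""}]
billing_rows = [{"customer": "Alice", "amount": 4}]

warehouse = load(defaultdict(float), transform(clean(extract(crm_rows, billing_rows))))
refresh(warehouse, [{"customer": "bob", "amount": "7"}])
```

After the initial load only Alice is present, because the row for BOB has no amount; the refresh then adds Bob from the later update.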
Three Data Warehouse Models
◼ Enterprise warehouse
▪ collect all of the information about subjects spanning the
entire organization
◼ Data mart
▪ a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data
mart
◼ Virtual warehouse
▪ A set of views over operational databases
▪ Only some of the possible summary views may be
materialized
Data Mart
▪ Credit scoring
C_id | sex | age | income | edu | # credit cards | Payment ratio per month | # loans | Payment ratio per month | …
12 | 0 | 34 | 50K | BS. | 1 | 100% | 1 | 100% | …
14 | 1 | 29 | 60K | BS. | 2 | 20% | 1 | 50% | …
135 | 1 | 46 | 100K | MS. | 4 | 100% | 2 | 100% | …
… | … | … | … | … | … | … | … | … | …

▪ Utility mining
C_id | T_id | A | Profit(A) | B | Profit(B) | C | Profit(C) | D | Profit(D) | …
12 | 01 | 0 | 0 | 4 | 5.2 | 1 | 0.9 | 3 | 5.7 | …
14 | 123 | 3 | 6.0 | 0 | 0 | 1 | 0.9 | 2 | 3.8 | …
135 | 12 | 1 | 2.0 | 1 | 1.3 | 2 | 1.8 | 1 | 1.9 | …
… | … | … | … | … | … | … | … | … | … | …
Metadata Repository
◼ Metadata is data about data. It contains:
▪ Description of the structure of the data warehouse
• schema, view, dimensions, hierarchies, derived data
definition, data mart locations and contents
▪ Operational metadata
• data lineage (history of migrated data and transformation
path), currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error
reports, audit trails)

Metadata Repository (continued)
▪ The algorithms used for summarization
▪ The mapping from operational environment to the
data warehouse
▪ Data related to system performance
• warehouse schema, view and derived data definitions
▪ Business data
• business terms and definitions, ownership of data,
charging policies

OLAP Server Architectures

◼ Relational OLAP (ROLAP)
▪ Use relational or extended-relational DBMS to
store and manage warehouse data and OLAP
middleware
▪ Include optimization of DBMS backend,
implementation of aggregation navigation logic,
and additional tools and services
▪ Use parallel computing, bitmap indexing, etc.

OLAP Server Architectures (continued)

◼ Multidimensional OLAP (MOLAP)
▪ Sparse array-based multidimensional storage
engine
▪ Fast indexing to pre-computed summarized data
▪ Sparse matrix compression technique
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft
SQL Server)
▪ Flexibility, e.g., low level: relational, high-level:
array
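
The idea behind sparse array-based storage can be sketched as follows. This is not how any particular MOLAP engine is implemented: the class `SparseCube`, its methods, and the choice of a dictionary as the sparse structure are assumptions made for the example.

```python
from math import prod

class SparseCube:
    """Multidimensional array that stores only its non-empty cells."""

    def __init__(self, shape):
        self.shape = shape
        self.cells = {}          # coordinates -> measure value
        self.item_totals = {}    # pre-computed summary over the first axis

    def set(self, coords, value):
        if len(coords) != len(self.shape) or any(
            not 0 <= c < n for c, n in zip(coords, self.shape)
        ):
            raise IndexError("coordinates outside the cube")
        old = self.cells.get(coords, 0)
        self.cells[coords] = value
        # Keep the summary in step with the cell, so reads need no scan.
        self.item_totals[coords[0]] = self.item_totals.get(coords[0], 0) + value - old

    def get(self, coords):
        """Empty cells read as 0 without being stored."""
        return self.cells.get(coords, 0)

    def density(self):
        """Fraction of cells that actually hold a value."""
        return len(self.cells) / prod(self.shape)

# 1000 items x 365 days x 50 branches, with only two populated cells.
sparse = SparseCube((1000, 365, 50))
sparse.set((3, 10, 2), 5)
sparse.set((3, 11, 2), 7)
sparse.set((3, 10, 2), 6)   # overwrite an existing cell
```

A dense array of this shape would reserve over 18 million cells, while the sparse form keeps two entries plus one summary value that can be read directly.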

Data Warehouse Usage
◼ Three kinds of data warehouse applications
▪ Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
▪ Analytical processing
• supports basic OLAP operations, slice-dice, drilling, pivoting
▪ Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools

From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
◼ Why online analytical mining?
▪ High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
▪ Available information processing structure surrounding data
warehouses
• ODBC, OLE DB, Web access, service facilities,
reporting and OLAP tools
▪ OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
▪ On-line selection of data mining functions
• Integration and swapping of multiple mining functions,
algorithms, and tasks
An OLAM System Architecture
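[Figure: four-layer OLAM architecture, top to bottom]
▪ Layer 4, User Interface: User GUI API; accepts mining queries and returns mining results
▪ Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, on top of the Data Cube API
▪ Layer 2, MDDB: multidimensional database with Meta Data, built through filtering and integration over the Database API
▪ Layer 1, Data Repository: Databases and Data Warehouse, populated via data cleaning and data integration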
Summary

◼ Why data warehousing?


◼ A multi-dimensional model of a data warehouse
▪ Star schema, snowflake schema, fact constellations
▪ A data cube consists of dimensions & measures
◼ OLAP operations: drilling, rolling, slicing, dicing and
pivoting
◼ Data warehouse architecture
◼ OLAP servers: ROLAP, MOLAP, HOLAP
◼ From OLAP to OLAM (on-line analytical mining)
