CSE 592 Data Mining: Instructor: Pedro Domingos
Data Mining
Instructor: Pedro Domingos
1
Today’s Program
2
Logistics
Office: TBA
Web: www.cs.washington.edu/592
Mailing list: cse574@cs
3
Assignments
Two projects
Groups of two
Three homeworks
Individual
12.5% each
4
Source Materials
5
What is Data Mining?
6
Related Disciplines
Machine learning
Databases
Statistics
Information retrieval
Visualization
High-performance computing
Etc.
7
Applications of Data Mining
E-commerce
Marketing and retail
Finance
Telecoms
Drug design
Process control
Space and earth sensing
Etc.
8
The Data Mining Process
Classification
Regression
Probability estimation
Clustering
Association detection
Summarization
Trend and deviation detection
Etc.
10
Inductive Learning
11
Widely-used Approaches
Decision trees
Rule induction
Bayesian learning
Neural networks
Genetic algorithms
Instance-based learning
Etc.
12
Requirements for a Data Mining System
Statistically sound
Ergonomically sound
13
Components of a Data Mining System
Representation
Evaluation
Search
Data management
User interface
14
Topics for this Quarter (Slide 1 of 2)
15
Topics for this Quarter (Slide 2 of 2)
Model ensembles
Instance-based learning
Learning theory
Association rules
Clustering
16
Data Warehousing and OLAP
17
What is a Data Warehouse?
19
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous
data sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is
converted.
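As a rough illustration of the conversion step, the sketch below normalizes hotel prices from differently encoded sources into one convention. The warehouse convention (USD, tax included, breakfast excluded), the record fields, and the exchange rates are all hypothetical and not from the lecture.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; a real system would look these up.
USD_PER_UNIT = {"USD": 1.0, "CAD": 0.75}

@dataclass
class HotelPrice:
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float
    breakfast_included: bool
    breakfast_cost: float  # in the source currency, same tax basis as amount

def to_warehouse_price(p: HotelPrice) -> float:
    """Convert to the assumed convention: USD, tax included, no breakfast."""
    amount = p.amount
    if p.breakfast_included:
        amount -= p.breakfast_cost      # strip the breakfast component
    if not p.tax_included:
        amount *= 1 + p.tax_rate        # add the missing tax
    return round(amount * USD_PER_UNIT[p.currency], 2)

source_a = HotelPrice(100.0, "USD", False, 0.10, False, 0.0)
source_b = HotelPrice(160.0, "CAD", True, 0.0, True, 20.0)
print(to_warehouse_price(source_a), to_warehouse_price(source_b))
```

Once both records follow the same convention, they can be compared or aggregated directly.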
20
Data Warehouse—Time Variant
21
Data Warehouse—Non-Volatile
22
Data Warehouse vs. Heterogeneous DBMS
23
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
24
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical,
                    detailed, flat relational,    summarized, multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
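The contrast in unit of work can be sketched with Python's built-in sqlite3 module. The account table and its rows are hypothetical; the point is only the shape of the two workloads, not a realistic system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, region TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, "west", 100.0), (2, "west", 250.0), (3, "east", 400.0)],
)

# OLTP-style unit of work: a short transaction touching one record by primary key.
with conn:
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = ?", (1,))

# OLAP-style unit of work: a read-only query that scans and summarizes many records.
totals = conn.execute(
    "SELECT region, SUM(balance) FROM account GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```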
25
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
missing data: Decision support requires historical data
which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use inconsistent
data representations, codes and formats which have to
be reconciled
26
Data Warehousing and OLAP
27
From Tables and Spreadsheets to Data Cubes
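[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
0-D (apex) cuboid: all
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)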
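The lattice of cuboids is the set of all subsets of the dimensions, so it can be enumerated mechanically. A minimal sketch, using the four dimension names from the slide (the function name is an assumption):

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboids(dims):
    """Yield (k, group-by attributes) for every cuboid, from apex (k=0) to base."""
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            yield k, subset

lattice = list(cuboids(dimensions))
for k, subset in lattice:
    # The empty group-by is the apex cuboid, labeled "all" on the slide.
    print(f"{k}-D cuboid: {', '.join(subset) or 'all'}")
```

With four dimensions and no hierarchies, this yields 2^4 = 16 cuboids, four of which are 3-D.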
29
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
30
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, province_or_state, country
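A star schema of this kind can be sketched in SQL through Python's sqlite3 module. The version below keeps only two of the dimensions and a few attributes; the table names time_dim, item_dim, and sales_fact and all of the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2000), (2, 'Q2', 2000);
INSERT INTO item_dim VALUES (10, 'TV', 'A'), (11, 'PC', 'B');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 1, 1200.0), (2, 10, 3, 1200.0);
""")

# A typical star join: the fact table joined to its dimensions, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(rows)
```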
31
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_key
        supplier: supplier_key, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city_key
        city: city_key, city, province_or_state, country
32
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, ...
Shipping Fact Table: time_key, item_key, shipper_key, from_location, ...
Shared dimension tables:
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
35
Specification of Hierarchies
Schema hierarchy
day < {month < quarter; week} < year
Set_grouping hierarchy
{1..10} < inexpensive
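Both kinds of hierarchy can be written down as plain data structures. In the sketch below, the partial order day < {month < quarter; week} < year and the group {1..10} < inexpensive come from the slide; the "moderate" price range, the function names, and the "ungrouped" fallback are hypothetical.

```python
# Schema hierarchy: each level maps to its immediate parent levels.
SCHEMA_HIERARCHY = {
    "day": ["month", "week"],
    "month": ["quarter"],
    "quarter": ["year"],
    "week": ["year"],
    "year": [],
}

def ancestors(level):
    """Return every level that `level` can be rolled up to."""
    seen = []
    queue = list(SCHEMA_HIERARCHY[level])
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            queue.extend(SCHEMA_HIERARCHY[parent])
    return seen

# Set-grouping hierarchy: ranges of integer prices mapped to a label.
PRICE_GROUPS = [(range(1, 11), "inexpensive"), (range(11, 101), "moderate")]

def price_group(price):
    """Return the group label for an integer price, or 'ungrouped'."""
    for values, label in PRICE_GROUPS:
        if price in values:
            return label
    return "ungrouped"

print(ancestors("day"), price_group(7))
```

Note that week rolls up to year but not to month or quarter, which is why the hierarchy is a partial order rather than a single chain.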
36
Multidimensional Data
Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
[Figure: hierarchical summarization paths; surviving level labels: Office, Day, Month]
37
A Sample Data Cube
[Figure: 3-D data cube with sum margins]
Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
Product axis: TV, PC, VCR, sum
Country axis: U.S.A., Canada, Mexico, sum
Highlighted cell: total annual sales of TV in U.S.A.
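A cube with "sum" margins can be built by letting each record contribute to every cell in which each coordinate is either its own value or "sum". A minimal sketch; the sales records below are hypothetical, and only the axis labels follow the slide.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sales records: (product, quarter, country, dollars).
sales = [
    ("TV", "1Qtr", "U.S.A.", 100), ("TV", "2Qtr", "U.S.A.", 150),
    ("PC", "1Qtr", "U.S.A.", 200), ("TV", "1Qtr", "Canada", 50),
]

cube = defaultdict(int)
for item, qtr, country, dollars in sales:
    # Each record feeds 2^3 = 8 cells: value-or-"sum" on each of three dimensions.
    for cell in product((item, "sum"), (qtr, "sum"), (country, "sum")):
        cube[cell] += dollars

# The cell highlighted on the slide: total annual sales of TV in U.S.A.
print(cube[("TV", "sum", "U.S.A.")])
```

The cell ("sum", "sum", "sum") is the apex cuboid: a single grand total.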
38
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids for the dimensions product, date, country]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
3-D (base) cuboid: (product, date, country)
39
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
40
Typical OLAP Operations
[Figure: star-net query model; each circle is called a footprint]
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE
Other axes: Location, Promotion, Organization
Other footprint labels: ORDER, TRUCK, CITY, COUNTRY, REGION, SALES PERSON, DISTRICT, DIVISION
42
Data Warehousing and OLAP
43
Data Warehouse Design Process
44
Multi-Tiered Architecture
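[Figure: multi-tiered architecture]
Data sources: operational DBs, other sources
Back end: Extract, Transform, Load, Refresh; Monitor & Integrator; Metadata
Data storage: Data Warehouse, Data Marts
OLAP Server: serves the front-end tools
Front-end tools: Analysis, Query, Reports, Data mining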
46
Data Warehouse Development: A Recommended Approach
[Figure labels: Multi-Tier Data Warehouse; Distributed Data Marts; Enterprise Data Warehouse; Data Mart; Data Mart]
techniques)
fast indexing to pre-computed summarized data
48
Data Warehousing and OLAP
49
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
$$T = \prod_{i=1}^{n} (L_i + 1)$$
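The count is the product over all n dimensions of (L_i + 1), where L_i is the number of levels of dimension i and the extra 1 accounts for the virtual top level "all". A short worked example; the level counts are hypothetical.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1); the +1 is the top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Hypothetical cube: time has 4 levels, item has 3, location has 4.
print(total_cuboids([4, 3, 4]))      # 5 * 4 * 5 = 100

# With one level per dimension the count reduces to 2^n.
print(total_cuboids([1, 1, 1, 1]))   # 2^4 = 16
```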
()
(city, item, year)
51
Efficient Processing of OLAP Queries
52
Metadata Repository
Metadata is the data that defines warehouse objects. It comes in the following kinds:
Description of the structure of the warehouse
53
Data Warehouse Back-End Tools and Utilities
Data extraction:
get data from multiple, heterogeneous, and external
sources
Data cleaning:
detect errors in the data and rectify them when
possible
Data transformation:
convert data from legacy or host format to warehouse
format
Load:
sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh:
propagate the updates from the data sources to the
warehouse
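The five back-end steps can be chained as plain functions. This is a toy sketch under stated assumptions: the sources are in-memory lists, the record fields (city, amount, dollars_sold) are hypothetical, and the warehouse is a dictionary of per-city totals.

```python
def extract(sources):
    """Gather raw records from several (here: in-memory) sources."""
    return [record for source in sources for record in source]

def clean(records):
    """Drop records with an invalid amount; real cleaning is far richer."""
    return [r for r in records
            if isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0]

def transform(records):
    """Convert source fields to the assumed warehouse format."""
    return [{"city": r["city"].strip().title(), "dollars_sold": float(r["amount"])}
            for r in records]

def load(warehouse, records):
    """Consolidate records into a summary keyed by city."""
    for r in records:
        warehouse[r["city"]] = warehouse.get(r["city"], 0.0) + r["dollars_sold"]
    return warehouse

def refresh(warehouse, new_sources):
    """Propagate new source records into the warehouse."""
    return load(warehouse, transform(clean(extract(new_sources))))

source_a = [{"city": "seattle ", "amount": 10}, {"city": "Tacoma", "amount": -5}]
source_b = [{"city": "SEATTLE", "amount": 2.5}]
warehouse = refresh({}, [source_a, source_b])
print(warehouse)
```

The two spellings of Seattle are reconciled in the transform step, and the negative amount is rejected in the clean step.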
54
Data Warehousing and OLAP
55
Discovery-Driven Exploration of Data Cubes
57
Complex Aggregation at Multiple Granularities: Multi-Feature Cubes
59
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
Differences among the three tasks
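The basic OLAP operations named under analytical processing (slice, dice, and drilling up by rolling a dimension away) can be sketched over a cube stored as a dictionary of cells. The cell values, the dimension names, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells of a (product, quarter, country) cube.
cells = {
    ("TV", "1Qtr", "U.S.A."): 100, ("TV", "2Qtr", "U.S.A."): 150,
    ("PC", "1Qtr", "U.S.A."): 200, ("TV", "1Qtr", "Canada"): 50,
}
DIMS = ("product", "quarter", "country")

def slice_(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: keep cells whose coordinates fall in the allowed value sets."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}

def roll_up(cube, keep):
    """Roll-up by dimension reduction: sum away every dimension not in `keep`."""
    out = defaultdict(int)
    for k, v in cube.items():
        out[tuple(k[DIMS.index(d)] for d in keep)] += v
    return dict(out)

print(slice_(cells, "country", "Canada"))
print(dice(cells, product={"TV"}, quarter={"1Qtr"}))
print(roll_up(cells, ("product",)))
```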
60
From Online Analytical Processing to Online Analytical Mining (OLAM)
[Figure: OLAM architecture; surviving labels: Layer2 (MDDB), Meta Data]
63