
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 4 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

1
Chapter 4: Data Warehousing and On-line Analytical
Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary

2
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Support information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
■ Data warehousing:
■ The process of constructing and using data warehouses

3
Data Warehouse—Subject-Oriented

■ Organized around major subjects, such as customer,


product, sales
■ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
■ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process

4
Data Warehouse—Integrated

■ Constructed by integrating multiple, heterogeneous data


sources
■ relational databases, flat files, on-line transaction

records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different


data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.

5
Data Warehouse—Time Variant

■ The time horizon for the data warehouse is significantly


longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
■ Contains an element of time, explicitly or implicitly
■ But the key of operational data may or may not
contain “time element”

6
Data Warehouse—Nonvolatile
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data

7
OLTP vs. OLAP: Online Transaction Processing (OLTP)
vs. Online Analytical Processing (OLAP)

8
Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
9
Data Warehouse: A Multi-Tiered Architecture

(Figure: the multi-tiered warehouse architecture, left to right.)
■ Data Sources: operational DBs and other sources, fed into the warehouse via
Extract / Transform / Load / Refresh, coordinated by a monitor & integrator
■ Data Storage: the data warehouse and data marts, with a metadata repository
■ OLAP Engine: OLAP server(s) serving the stored data
■ Front-End Tools: analysis, query/reports, data mining
10
Three Data Warehouse Models
■ Enterprise warehouse
■ collects all of the information about subjects spanning

the entire organization


■ Data Mart
■ a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, e.g., a marketing data mart
■ Independent vs. dependent (directly from warehouse) data mart
■ Virtual warehouse
■ A set of views over operational databases

■ Only some of the possible summary views may be

materialized
11
Extraction, Transformation, and Loading (ETL)
■ Data extraction
■ get data from multiple, heterogeneous, and external
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible

■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the
warehouse
12
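To make the ETL steps above concrete, here is a minimal Python/pandas sketch of an extract–clean–transform–load flow. The file names, column names, and target table are hypothetical, not from the slides.

import pandas as pd
import sqlite3

# Extract: get data from multiple, heterogeneous sources (assumed files).
orders = pd.read_csv("orders_legacy.csv")           # flat file
customers = pd.read_json("customers_export.json")   # external source

# Clean: detect simple errors and rectify them where possible.
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders = orders.dropna(subset=["amount", "customer_id"])

# Transform: convert from the legacy format to the warehouse format.
orders["order_date"] = pd.to_datetime(orders["order_date"])
fact = orders.merge(customers, on="customer_id", how="left")

# Load: summarize/consolidate and write into the warehouse store.
daily_sales = fact.groupby(fact["order_date"].dt.date)["amount"].sum().reset_index()
with sqlite3.connect("warehouse.db") as con:
    daily_sales.to_sql("daily_sales", con, if_exists="replace", index=False)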
Metadata Repository
■ Metadata is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
■ Operational meta-data
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
■ The algorithms used for summarization
■ The mapping from operational environment to the data warehouse
■ Data related to system performance
■ warehouse schema, view and derived data definitions
■ Business data
■ business terms and definitions, ownership of data, charging policies
13
Chapter 4: Data Warehousing and On-line Analytical
Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary

14
From Tables and Spreadsheets to
Data Cubes
■ A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
■ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.

15
Cube: A Lattice of Cuboids

all                                                          0-D (apex) cuboid

time      item      location      supplier                   1-D cuboids

time,item    time,location    time,supplier
item,location    item,supplier    location,supplier          2-D cuboids

time,item,location    time,item,supplier
time,location,supplier    item,location,supplier             3-D cuboids

time,item,location,supplier                                  4-D (base) cuboid

16
Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation

17
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Dimension tables:
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, state_or_province, country

18
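To illustrate how a star schema is queried, the following hedged pandas sketch builds tiny in-memory versions of the tables above (hypothetical rows) and performs the star join plus an aggregation of dollars_sold by brand and state:

import pandas as pd

# Tiny hypothetical dimension and fact tables following the star schema above.
item = pd.DataFrame({"item_key": [1, 2], "brand": ["Acme", "Zen"], "type": ["TV", "PC"]})
location = pd.DataFrame({"location_key": [10, 20],
                         "city": ["Chicago", "Vancouver"],
                         "state_or_province": ["Illinois", "British Columbia"]})
sales = pd.DataFrame({"item_key": [1, 1, 2], "location_key": [10, 20, 10],
                      "units_sold": [5, 3, 7], "dollars_sold": [2500.0, 1500.0, 6300.0]})

# Star join: the fact table is joined to each dimension table on its key.
joined = sales.merge(item, on="item_key").merge(location, on="location_key")

# Aggregate the measure along two dimension attributes.
print(joined.groupby(["brand", "state_or_province"])["dollars_sold"].sum())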
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Dimension tables (with hierarchies normalized into smaller tables):
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_key → supplier: supplier_key, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city_key → city: city_key, city, state_or_province, country

19
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales (measures)

Shipping Fact Table: time_key, item_key, shipper_key, from_location,
to_location, dollars_cost, units_shipped

Shared dimension tables:
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, province_or_state, country
■ shipper: shipper_key, shipper_name, location_key, shipper_type
20
A Concept Hierarchy:
Dimension (location)

all        all

region     Europe ... North_America

country    Germany ... Spain        Canada ... Mexico

city       Frankfurt ...            Vancouver ... Toronto

office                              L. Chan ... M. Wind

21
Data Cube Measures: Three Categories

■ Distributive: if the result derived by applying the function


to n aggregate values is the same as that derived by
applying the function on all the data without partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank()
22
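For example, avg() is algebraic because it can be assembled from the two distributive measures sum() and count() computed per partition. A minimal sketch with made-up partitions:

# avg() is algebraic: combine the distributive sum() and count() of each partition.
partitions = [[3, 5, 7], [10, 2], [4, 4, 4, 4]]

partial = [(sum(p), len(p)) for p in partitions]   # distributive aggregates per chunk
total_sum = sum(s for s, _ in partial)
total_cnt = sum(c for _, c in partial)

all_values = [v for p in partitions for v in p]
assert total_sum / total_cnt == sum(all_values) / len(all_values)
print(total_sum / total_cnt)   # same result as averaging all the data at once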
View of Warehouses and Hierarchies

Specification of hierarchies
■ Schema hierarchy
day < {month < quarter;
week} < year
■ Set_grouping hierarchy
{1..10} < inexpensive

23
Multidimensional Data

■ Sales volume as a function of product, month,


and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:

Product:  Industry → Category → Product
Location: Region → Country → City → Office
Time:     Year → Quarter → Month / Week → Day
24
A Sample Data Cube

(Figure: a 3-D sales data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum),
Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum).)
■ The cell (TV, U.S.A.) summed over all quarters gives the total annual sales of TVs in the U.S.A.
■ The (1Qtr, U.S.A.) cells show how many products were sold in the U.S.A. in the 1st quarter;
the (all countries, all products, all quarters) cell holds the grand total (the apex).
25
Cuboids Corresponding to the Cube

all                                                0-D (apex) cuboid

product      date      country                     1-D cuboids

product,date    product,country    date,country    2-D cuboids

product,date,country                               3-D (base) cuboid
26
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)

27
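The pandas sketch below imitates these operations on a toy sales table (hypothetical data): roll-up climbs the time hierarchy, drill-down goes back to the finer level, slice/dice select, and pivot reorients the view.

import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "item":    ["TV", "PC", "TV", "PC"],
    "city":    ["Chicago", "Chicago", "Toronto", "Toronto"],
    "dollars_sold": [400.0, 900.0, 350.0, 1100.0]})

# Roll-up: climb the time hierarchy from month to quarter.
rollup = sales.groupby(["quarter", "item"])["dollars_sold"].sum()

# Drill-down: the reverse, back down to the month level.
drilldown = sales.groupby(["quarter", "month", "item"])["dollars_sold"].sum()

# Slice: select on one dimension (a plane of the cube).
tv_slice = sales[sales["item"] == "TV"]

# Dice: select on two or more dimensions.
dice = sales[(sales["item"] == "TV") & (sales["city"] == "Chicago")]

# Pivot: reorient the view, e.g., items as rows and cities as columns.
pivot = sales.pivot_table(values="dollars_sold", index="item",
                          columns="city", aggfunc="sum")
print(rollup, pivot, sep="\n\n")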
Fig. 3.10 Typical OLAP
Operations

28
A Star-Net Query Model
Each radial line of the star-net represents a dimension, and each circle
(abstraction level) on a line is called a footprint:
■ Customer Orders: CONTRACTS, ORDER
■ Shipping Method: AIR-EXPRESS, TRUCK
■ Time: ANNUALLY, QTRLY, DAILY
■ Product: PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM
■ Location: REGION, COUNTRY, CITY
■ Organization: DIVISION, DISTRICT, SALES PERSON
■ Promotion
29
Browsing a Data Cube

■ Visualization
■ OLAP capabilities
■ Interactive manipulation
30
Chapter 4: Data Warehousing and On-line Analytical
Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary

31
Design of Data Warehouse: A Business
Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the
data warehouse
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view
of end-user

32
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record

33
Data Warehouse Development: A
Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts

Enterprise
Data Data
Data
Mart Mart
Warehouse

Model refinement Model refinement

Define a high-level corporate data model


34
Data Warehouse Usage
■ Three kinds of data warehouse applications
■ Information processing
■ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
■ Analytical processing
■ multidimensional analysis of data warehouse data
■ supports basic OLAP operations, slice-dice, drilling, pivoting
■ Data mining
■ knowledge discovery from hidden patterns
■ supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools

35
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
■ Why online analytical mining?
■ High quality of data in data warehouses

■ DW contains integrated, consistent, cleaned data

■ Available information processing structure surrounding

data warehouses
■ ODBC, OLEDB, Web accessing, service facilities,

reporting and OLAP tools


■ OLAP-based exploratory data analysis

■ Mining with drilling, dicing, pivoting, etc.

■ On-line selection of data mining functions

■ Integration and swapping of multiple mining

functions, algorithms, and tasks


36
Chapter 4: Data Warehousing and On-line Analytical
Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary

37
Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids
■ The bottom-most cuboid is the base cuboid
■ The top-most cuboid (apex) contains only one cell
■ How many cuboids are there in an n-dimensional cube where dimension i
has Li levels?   T = ∏ i=1..n (Li + 1)
(the +1 accounts for the virtual top level all of each dimension)
■ Materialization of data cube


■ Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
■ Selection of which cuboids to materialize
■ Based on size, sharing, access frequency, etc.
38
The “Compute Cube” Operator
■ Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
■ Transform it into an SQL-like language (with a new operator cube by,
introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
■ Need to compute the following group-bys:
(item, city, year), (item, city), (item, year), (city, year),
(item), (city), (year), ()
39
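A hedged sketch of what compute cube has to do: enumerate every subset of the dimensions and aggregate the measure for each group-by, giving 2^3 = 8 cuboids for (item, city, year). The table is hypothetical.

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({"item": ["TV", "TV", "PC"],
                      "city": ["Chicago", "Toronto", "Chicago"],
                      "year": [2011, 2011, 2012],
                      "amount": [400.0, 350.0, 900.0]})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims), -1, -1):           # from the base cuboid down to the apex
    for group in combinations(dims, k):
        if group:                            # e.g., (item, city), (year), ...
            cuboids[group] = sales.groupby(list(group))["amount"].sum()
        else:                                # (): the apex cuboid, one grand total
            cuboids[group] = sales["amount"].sum()

print(len(cuboids), "cuboids computed")      # 8 = 2**3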
Indexing OLAP Data: Bitmap Index
■ Index on a particular column
■ Each value in the column has a bit vector: bit-op is fast
■ The length of the bit vector: # of records in the base table
■ The i-th bit is set if the i-th row of the base table has the value for
the indexed column
■ not suitable for high cardinality domains
■ A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al. TODS’06]
(Figure: a base table with Region and Type columns, and the corresponding
bitmap indices on Region and on Type.)

40
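A minimal sketch of the idea using NumPy boolean vectors: one bit vector per distinct value of the indexed column, and a fast bit-wise AND/OR to answer selections. The rows are hypothetical.

import numpy as np

region = np.array(["Asia", "Europe", "Asia", "America", "Europe"])
rtype  = np.array(["Retail", "Dealer", "Dealer", "Retail", "Dealer"])

# One bit vector per distinct value; length = number of rows in the base table.
region_index = {v: region == v for v in np.unique(region)}
type_index   = {v: rtype == v for v in np.unique(rtype)}

# Answer "region = 'Asia' AND type = 'Dealer'" with a fast bit-wise AND.
hits = region_index["Asia"] & type_index["Dealer"]
print(np.nonzero(hits)[0])    # row ids satisfying both predicates -> [2]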
Indexing OLAP Data: Join Indices

■ Join index: JI(R-id, S-id) where R (R-id, …)


S (S-id, …)
■ Traditional indices map the values to a list of
record ids
■ It materializes relational join in JI file and
speeds up relational join
■ In data warehouses, join index relates the values
of the dimensions of a start schema to rows in
the fact table.
■ E.g. fact table: Sales and two dimensions city
and product
■ A join index on city maintains for each

distinct city a list of R-IDs of the tuples


recording the Sales in the city
■ Join indices can span multiple dimensions

41
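As a rough illustration (hypothetical row ids), a join index on city can be kept as a mapping from each distinct city value to the list of fact-table R-IDs recording sales in that city:

# Hypothetical fact rows: (R-ID, city, product, dollars_sold)
sales_rows = [(0, "Vancouver", "TV", 400.0),
              (1, "Toronto",   "PC", 900.0),
              (2, "Vancouver", "PC", 1200.0)]

# Join index on the city dimension: city value -> list of fact-table row ids.
join_index_city = {}
for rid, city, _, _ in sales_rows:
    join_index_city.setdefault(city, []).append(rid)

print(join_index_city)   # {'Vancouver': [0, 2], 'Toronto': [1]}
print([sales_rows[r] for r in join_index_city["Vancouver"]])   # sales in Vancouver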
Efficient Processing OLAP Queries
■ Determine which operations should be performed on the available cuboids
■ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
■ Determine which materialized cuboid(s) should be selected for OLAP op.
■ Suppose the query to be processed is on {brand, province_or_state} with the
condition “year = 2004”, and that 4 materialized cuboids are available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
■ Explore indexing structures and compressed vs. dense array structures in MOLAP

42
OLAP Server Architectures

■ Relational OLAP (ROLAP)


■ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
■ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
■ Greater scalability
■ Multidimensional OLAP (MOLAP)
■ Sparse array-based multidimensional storage engine
■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
■ Flexibility, e.g., low level: relational, high-level: array
■ Specialized SQL servers (e.g., Redbricks)
■ Specialized support for SQL queries over star/snowflake schemas

43
Chapter 4: Data Warehousing and On-line Analytical
Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary

44
Attribute-Oriented Induction

■ Proposed in 1989 (KDD ‘89 workshop)


■ Not confined to categorical data nor particular measures
■ How is it done?
■ Collect the task-relevant data (initial relation) using a
relational database query
■ Perform generalization by attribute removal or attribute
generalization
■ Apply aggregation by merging identical, generalized
tuples and accumulating their respective counts
■ Interaction with users for knowledge presentation

45
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate
students in the University database
■ Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
■ Step 2. Perform attribute-oriented induction
■ Step 3. Present results in generalized relation, cross-tab,
or rule forms

46
Class Characterization: An Example

Initial
Relation

Prime
Generalized
Relation

47
Basic Principles of Attribute-Oriented Induction

■ Data focusing: task-relevant data, including dimensions,


and the result is the initial relation
■ Attribute-removal: remove attribute A if there is a large set
of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher level concepts are
expressed in terms of other attributes
■ Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A
■ Attribute-threshold control: typical 2-8, specified/default
■ Generalized relation threshold control: control the final
relation/rule size
48
Attribute-Oriented Induction: Basic Algorithm

■ InitialRel: Query processing of task-relevant data, deriving


the initial relation.
■ PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
■ PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
■ Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.

49
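A hedged sketch of the core loop, with made-up concept hierarchies and tuples: generalize attribute values to higher-level concepts, then merge identical generalized tuples while accumulating their counts.

from collections import Counter

# Hypothetical concept hierarchies used for attribute generalization.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}
major_to_area   = {"CS": "Sciences", "Math": "Sciences", "History": "Arts"}

# Task-relevant initial relation: (major, birth_place, gpa)
initial = [("CS", "Vancouver", 3.8), ("Math", "Toronto", 3.6),
           ("CS", "Chicago", 3.9), ("History", "Toronto", 3.2)]

def generalize(t):
    # Climb each attribute's hierarchy; generalize gpa into a range label.
    major, city, gpa = t
    return (major_to_area[major], city_to_country[city],
            "excellent" if gpa >= 3.5 else "good")

# Prime generalized relation: identical generalized tuples merged, with counts.
prime_relation = Counter(generalize(t) for t in initial)
for tup, count in prime_relation.items():
    print(tup, "count =", count)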
Presentation of Generalized Results
■ Generalized relation:
■ Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
■ Cross tabulation:
■ Mapping results into cross tabulation form (similar to contingency
tables).
■ Visualization techniques:
■ Pie charts, bar charts, curves, cubes, and other visual forms.
■ Quantitative characteristic rules:
■ Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,

50
Mining Class Comparisons

■ Comparison: Comparing two or more classes


■ Method:
■ Partition the set of relevant data into the target class and the
contrasting class(es)
■ Generalize both classes to the same high level concepts
■ Compare tuples with the same high level descriptions
■ Present for every tuple its description and two measures
■ support - distribution within single class
■ comparison - distribution between classes
■ Highlight the tuples with strong discriminant features
■ Relevance Analysis:
■ Find attributes (features) which best distinguish different classes

51
Concept Description vs. Cube-Based OLAP
■ Similarity:
■ Data generalization
■ Presentation of data summarization at multiple levels of
abstraction
■ Interactive drilling, pivoting, slicing and dicing
■ Differences:
■ OLAP has systematic preprocessing, query independent,

and can drill down to rather low level


■ AOI has automated desired level allocation, and may

perform dimension relevance analysis/ranking when


there are many relevant dimensions
■ AOI works on the data which are not in relational forms

52
Chapter 4: Data Warehousing and On-line Analytical
Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary

53
Summary
■ Data warehousing: A multi-dimensional model of a data warehouse
■ A data cube consists of dimensions & measures
■ Star schema, snowflake schema, fact constellations
■ OLAP operations: drilling, rolling, slicing, dicing and pivoting
■ Data Warehouse Architecture, Design, and Usage
■ Multi-tiered architecture
■ Business analysis design framework
■ Information processing, analytical processing, data mining, OLAM (Online
Analytical Mining)
■ Implementation: Efficient computation of data cubes
■ Partial vs. full vs. no materialization
■ Indexing OLAP data: Bitmap index and join index
■ OLAP query processing
■ OLAP servers: ROLAP, MOLAP, HOLAP
■ Data generalization: Attribute-oriented induction

54
References (I)
■ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
■ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
■ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
■ S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
■ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July
1993.
■ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab
and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
■ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
■ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
■ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
■ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97

55
References (II)
■ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
■ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
■ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2ed. John Wiley, 2002
■ P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record,
24:8–11, Sept. 1995.
■ P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
■ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
■ S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
■ A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
■ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
■ P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
■ J. Widom. Research problems in data warehousing. CIKM’95
■ K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006

56
Surplus Slides

57
Compression of Bitmap Indices
■ Bitmap indexes must be compressed to reduce I/O costs
and minimize CPU usage—majority of the bits are 0’s
■ Two compression schemes:
■ Byte-aligned Bitmap Code (BBC)
■ Word-Aligned Hybrid (WAH) code
■ Time and space required to operate on compressed
bitmap is proportional to the total size of the bitmap
■ Optimal on attributes of low cardinality as well as those of
high cardinality.
■ WAH outperforms BBC by about a factor of two
58
Data Mining:
Concepts and Techniques
(3rd ed.)

— Module 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Data Objects and Attribute Types
 Data sets are made up of data objects.
 A data object represents an entity—
Examples
 in a sales database, the objects may be customers, store items,

and sales;
 in a medical database, the objects may be patients;

 in a university database, the objects may be students, professors,

and courses.
Data objects are typically described by attributes.

2
 Data objects can also be referred to as samples, examples, instances,
data points, or objects.
 If the data objects are stored in a database, they are data tuples.
 That is,
 the rows of a database correspond to the data objects,
 and the columns correspond to the attributes.

3
 An attribute is a data field, representing a characteristic or feature
of a data object.
 The nouns attribute, dimension, feature, and variable are often
used interchangeably.
 The term dimension is commonly used in data warehousing.

 Machine learning literature tends to use the term feature

 statisticians prefer the term variable.

E.g., customer ID, name, and address

4
Types of attributes

Types
 nominal
 binary
 ordinal
 Numeric
 Interval scaled

 Ratio scaled

5
Nominal Attributes
 Nominal means “relating to names.” The values of a nominal
attribute are symbols or names of things.
 Each value represents some kind of category, code, or state, and so

nominal attributes are also referred to as categorical.


 The values do not have any meaningful order

Examples
 hair color and marital status are two attributes describing person
objects.
 hair color → black, brown, blond, red, auburn, gray, and white.
 marital status → single, married, divorced, and widowed.

6
Nominal Attributes

 it is possible to represent such symbols or “names” with numbers.


 0 for black, 1 for brown, and so on.

 the numbers are not intended to be used quantitatively.


 That is, mathematical operations on values of nominal attributes
are not meaningful
nominal attribute values do not have any meaningful order about
them and are not quantitative

what central tendency measure is most appropriate for this attribute?

7
 Binary Attributes
A binary attribute is a nominal attribute with only two categories or
states: 0 or 1,
 Where 0 typically means that the attribute is absent, and 1

means that it is present.


 Binary attributes are referred to as Boolean if the two states

correspond to true and false.


Examples
 attribute smoker describing a patient object,

 1 indicates that the patient smokes, while 0 indicates that the patient
does not
8
Binary Attributes

 Symmetric
 A binary attribute is symmetric if both of its states are equally

valuable and carry the same weight; that is, there is no


preference on which outcome should be coded as 0 or 1.
 example : attribute gender having the states male and female.
 Asymmetric
 A binary attribute is asymmetric if the outcomes of the states

are not equally important


 Example:- medical test (positive vs negative)

9
Ordinal Attributes

 Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.

 Size = {small, medium, large}


 Grades={A+, A, A-, B+}
 Professional ranks = {assistant, associate, professor}

10
 Ordinal attributes are useful for registering subjective assessments
of qualities that cannot be measured objectively; thus ordinal
attributes are often used in surveys for ratings.
 Customer satisfaction had the following ordinal categories:

 0: very dissatisfied,
 1: somewhat dissatisfied,
 2: neutral,
 3: satisfied, and
 4: very satisfied
The central tendency of an ordinal attribute can be represented by its
mode and its median but the mean cannot be defined

11
Numeric Attributes

 A numeric attribute is quantitative; that is, it is a measurable


quantity, represented in integer or real values.
 Numeric attributes can be interval-scaled or ratio-scaled.
 Interval-scaled attributes

 measured on a scale of equal-size units


 The values of interval-scaled attributes have order and can be positive,
0, or negative.
 No true zero point
 In addition to the median and mode measures of central tendency,
mean value can also be computed
 E.g., calendar dates, temperature in °C and °F

12
Numeric Attributes

 Ratio scaled
 A ratio-scaled attribute is a numeric attribute with an

inherent zero-point.
 In addition, the values are ordered, and we can also compute

the difference between values, as well as the mean, median, and


mode.

13
 Nominal, binary, and ordinal attributes are qualitative; that is, they
describe a feature of an object without giving an actual size or quantity.
 The values of such qualitative attributes are typically words

representing categories. If integers are used, they represent


computer codes for the categories, as opposed to measurable
quantities
(e.g., 0 for small drink size, 1 for medium, and 2 for large)

14
Basic Statistical Descriptions

04.02.2022
Measures of Central Tendency

 Measures of central tendency include the mean, median, mode,


and
midrange.
The most common and effective numeric measure of the “center” of
a set of data is the (arithmetic) mean.
 Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X, like salary. The mean of this set of
values is

x̄ = (x1 + x2 + ... + xN) / N = (1/N) Σ xi

16
17
 Median. Let’s find the median of the data from Example 2.6. The
data are already sorted in increasing order. There is an even
number of observations (i.e., 12); therefore, the median is not
unique. It can be any value within the two middlemost values of 52
and 56 (that is, within the sixth and seventh values in the list).
average of the two middlemost values as the median.

 Suppose that we had only the first 11 values in the list. Given an
odd number of values, the median is the middlemost value. This is
the sixth value in this list, which has a value of $52,000.
18
 The mode is another measure of central tendency. The mode for a
set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative
attributes
 Mode. The data from Example 2.6 are bimodal. The two modes are
$52,000 and $70,000.
 The midrange can also be used to assess the central tendency of
a numeric data set.
It is the average of the largest and smallest values in the set.
 The midrange of the data of Example 2.6 is

19
20
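A small sketch computing these measures with Python's standard statistics module; the salary list (in $1000s) is illustrative, not the book's data.

import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # illustrative, in $1000s

mean     = statistics.mean(salaries)
median   = statistics.median(salaries)        # average of the two middle values here
modes    = statistics.multimode(salaries)     # bimodal data returns both modes (Python 3.8+)
midrange = (min(salaries) + max(salaries)) / 2

print(mean, median, modes, midrange)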
Measuring the Dispersion of Data: Range, Quartiles,
Variance, Standard Deviation, and Interquartile Range

The measures include range, quantiles, quartiles, percentiles, and the


interquartile range. The five-number summary, which can be
displayed as a boxplot, is useful in identifying outliers. Variance and
standard deviation also indicate the spread of a data distribution.
Range, Quartiles, and Interquartile Range

 Range

Let x1,x2,:::,xN be a set of observations for some numeric


attribute, X.
The range of the set is the difference between the largest
(max()) and smallest (min()) values.
Quantiles

Suppose that the data for attribute X are sorted in increasing numeric order. Imagine
that we can pick certain data points so as to split the data distribution into equal-size
consecutive sets, as in Figure 2.2. These data points are called quantiles.

Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially
equal size consecutive sets.
Interquartile Range

 The quartiles give an indication of a distribution’s center, spread, and shape. The first
quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data.

 The third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or
highest 25%) of the data.
 The second quartile is the 50th percentile. As the median, it gives the center of the data
distribution.

 The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as

IQR = Q3 - Q1
 Semi-interquartile range = IQR / 2
 Mean = sum of all observations / number of observations
 Mean deviation about the mean = sum of the absolute deviations from the
mean / number of observations
 Mean deviation about the median = sum of the absolute deviations from the
median / number of observations
 Variance = sum of the squared deviations from the mean / number of
observations
 Standard deviation = square root of the variance
 Coefficient of variation = ratio of the standard deviation to the mean
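A minimal NumPy sketch of these dispersion measures on an illustrative sample:

import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)

data_range = x.max() - x.min()
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr        = q3 - q1
semi_iqr   = iqr / 2                              # quartile deviation
variance   = x.var()                              # population variance (divide by N)
std_dev    = x.std()
mad_mean   = np.mean(np.abs(x - x.mean()))        # mean deviation about the mean
cv         = std_dev / x.mean()                   # coefficient of variation

print(data_range, iqr, semi_iqr, variance, std_dev, mad_mean, cv)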


Data Mining:
Concepts and Techniques
(3rd ed.)

— Module 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 2: Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
2
Data Quality: Why Preprocess the Data?

 Measures for data quality: A multidimensional view


 Accuracy: correct vs. wrong or noisy, i.e., values deviating from the
expected ones
 Completeness: not recorded, unavailable, overlooked, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how trustable the data are correct?
 Interpretability: how easily the data can be understood?

3
 Inaccurate, incomplete, and inconsistent
data are common-place properties of large real-
world databases and data warehouses.
 inaccurate data or having incorrect attribute values
 The data collection instruments used may be

faulty.
 There may have been human or computer errors
occurring at data entry.
 Disguised missing data (e.g., a default value entered in place of a real one)
 Errors in data transmission can also occur
4
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files, different
names (cid, cust_id), inferable attributes (avoid redundancy)
 Data reduction
 Dimensionality reduction
 Numerosity reduction (replacing data by alternate smaller
representations)
 Data compression
 Data transformation and data discretization
 Normalization, aggregation
 Concept hierarchy generation 5
Unit II: Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
6
7
 In summary, real-world data tend to be dirty,
incomplete, and inconsistent.
 Data preprocessing techniques can improve
data quality, thereby helping to improve the
accuracy
and efficiency of the subsequent mining
process.
 Data preprocessing is an important step
in the knowledge discovery process, because
quality decisions must be based on quality
data.
8
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., instrument faulty, human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers, faulty data collection
instruments, human or computer error, technology limitations…
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday? 9
Data cleaning (or data
cleansing) routines attempt to fill in
missing values, smooth out noise
while identifying outliers, and
correct inconsistencies in the
data.

10
Incomplete (Missing) Data

 Data is not always available


 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 not register history or changes of the data
 Missing data may need to be inferred
11
How to Handle Missing Data?
 Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean or median
 the attribute mean for all samples belonging to the
same class: smarter
 the most probable value: inference-based such as
Bayesian formula or decision tree
12
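A hedged pandas sketch of the automatic fill-in options, using a hypothetical table with one missing income value:

import pandas as pd
import numpy as np

df = pd.DataFrame({"cls":    ["A", "A", "B", "B"],
                   "income": [50.0, np.nan, 80.0, 90.0]})

# Global constant (or a label such as "unknown").
const_fill = df["income"].fillna(-1)

# Attribute mean / median over all tuples.
mean_fill = df["income"].fillna(df["income"].mean())

# Attribute mean restricted to samples of the same class (smarter).
class_fill = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

print(mean_fill.tolist(), class_fill.tolist())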
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

13
How to Handle Noisy Data?

 Binning
 first sort data and partition into (equal-frequency) bins

 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.


 Regression
 smooth by fitting the data into regression functions

 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human (e.g.,

deal with possible outliers)

14
Binning Methods

15
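The figure for this slide did not survive extraction; as a stand-in, here is a small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, on an illustrative sorted price list:

# Sorted data, partitioned into equal-frequency bins of size 3 (illustrative values).
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max of its bin.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # bins smoothed to their means: 9, 22, 29
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]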
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)

 Check field overloading

 Check unique rule, consecutive rule and null rule

 Use commercial tools

 Data scrubbing tool: use simple domain knowledge (e.g., postal

code, spell-check) to detect errors and make corrections


 Data auditing tool: by analyzing data to discover rules and

relationship to detect violators (e.g., correlation and clustering


to find outliers)
 Data migration and integration
 Data migration tools: allow transformations to be specified

 ETL (Extraction/Transformation/Loading) tools: allow users to


specify transformations through a graphical user interface
 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)

16
Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
17
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Integrate metadata from different sources
 Entity identification problem:
 How to identify equivalent real world entities from multiple data
sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales, e.g.,
metric vs. British units

18
Data Integration
 Entity identification problem:
 Examples of metadata for each attribute include the name,
meaning, data type, and range of values permitted for the
attribute, and null rules for handling blank, zero, or null values.
Such metadata can be used to help avoid errors in schema
integration.
 The metadata may also be used to help transform the data (e.g.,
where data codes for pay type in one database may be “H” and “S”
but 1 and 2 in another).

19
Handling Redundancy in Data Integration

 An attribute may be redundant if it can be “derived” from


another attribute or set of attributes.
 Redundant data occur often when integration of multiple
databases
 Object identification: The same attribute or object
may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
20
Handling Redundancy in Data Integration

 Redundant attributes may be able to be detected by


correlation analysis and covariance analysis
 Some redundancies can be detected by correlation
analysis. Given two attributes, such analysis can measure
how strongly one attribute implies the other, based on
the available data.

21
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:

χ² = Σ (Observed − Expected)² / Expected
 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population

22
Chi-Square Calculation: An Example

                            Play chess    Not play chess    Sum (row)
Like science fiction          250 (90)         200 (360)          450
Not like science fiction       50 (210)       1000 (840)         1050
Sum (col.)                         300              1200         1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories):

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 507.93
 It shows that like_science_fiction and play_chess are
correlated in the group
23
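The same calculation in a short sketch (expected counts derived from the row and column totals, then the χ² sum):

# Observed contingency table: rows = science-fiction preference, cols = plays chess.
observed = [[250, 200],
            [50, 1000]]

row_tot = [sum(r) for r in observed]            # 450, 1050
col_tot = [sum(c) for c in zip(*observed)]      # 300, 1200
grand   = sum(row_tot)                          # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / grand   # e.g., 450*300/1500 = 90
        chi2 += (obs - exp) ** 2 / exp

print(chi2)    # ≈ 507.93, matching the slide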
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product


moment coefficient)

i1 (ai  A)(bi  B) 


n n
(ai bi )  n AB
rA, B   i 1
(n  1) A B (n  1) A B

where n is the number of tuples, A and B are the respective


means of A and B, σA and σB are the respective standard deviation
of A and B, and Σ(aibi) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
 rA,B = 0: independent; rAB < 0: negatively correlated

24
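A minimal NumPy check of the formula against np.corrcoef, on hypothetical values:

import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(a)
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy  = np.corrcoef(a, b)[0, 1]

print(r_manual, r_numpy)   # the two agree; r > 0 means positive correlation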
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.

25
Correlation (viewed as linear relationship)
 Correlation measures the linear relationship
between objects
 To compute correlation, we standardize data
objects, A and B, and then take their dot product

a′k = (ak − mean(A)) / std(A)

b′k = (bk − mean(B)) / std(B)

correlation(A, B) = A′ · B′

26
Covariance (Numeric Data)
 Covariance is similar to correlation:

Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σi (ai − Ā)(bi − B̄)

Correlation coefficient: r_A,B = Cov(A, B) / (σA σB)

where n is the number of tuples, Ā and B̄ are the respective mean or
expected values of A and B, and σA and σB are the respective standard
deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger
than their expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence27
Co-Variance: An Example

 It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄

 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

 Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?

 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
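Checking the stock example numerically (same numbers as above):

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])      # stock A prices over one week
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])   # stock B prices

cov_simplified = (A * B).mean() - A.mean() * B.mean()   # E(A·B) − Ā·B̄
cov_numpy      = np.cov(A, B, bias=True)[0, 1]          # population covariance

print(cov_simplified, cov_numpy)   # both give 4.0, so A and B tend to rise together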


Tuple Duplication
 In addition to detecting redundancies between attributes, duplication
should also be detected at the tuple level.

 Duplicates cause inconsistency.

 The use of denormalized tables is another source of data redundancy.


Data Value Conflict Detection and
Resolution
 For the same real-world entity, attribute values from different sources
may differ. This may be due to differences in representation, scaling,
or encoding. Ex. Weight in kilos or pounds

 For a hotel chain, the price of rooms in different cities may involve not
only different currencies (Rupees, Dollars, Euro etc.)

 Price may include also different services (e.g., free breakfast) and
taxes.
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
31
31
Data Reduction Strategies
 When the dataset is very huge, complex data analysis and mining is
very time consuming.
 Data Reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.

32
Data Reduction Strategies
 Dimensionality reduction: the process of reducing the number of random
variables or attributes under consideration.
 Wavelet transforms

 Principal Components Analysis (PCA)

 Feature subset selection, feature creation

 Numerosity reduction: Numerosity reduction techniques replace the


original data volume by alternative, smaller forms of data
representation.
 Regression and Log-Linear Models

 Histograms, clustering, sampling

 Data cube aggregation

 Data compression: Transformations are applied to obtain a reduced or


“compressed” representation of the original data. If the original data
can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless. If we can
reconstruct only an approximation of the original data, then the data
reduction is called lossy.
33
Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

34
What Is Wavelet Transform?
 The discrete wavelet transform (DWT) is a linear signal
processing technique.
 When DWT is applied to a data vector X, it is transformed
to a numerically different vector, X’, of wavelet
coefficients.
 X and X’ are of the same length. When applying this
technique to data reduction, we consider each tuple as an
n-dimensional data vector, that is X = (x1,x2, …, xn).
 If X and X’ are of same length, how data reduction is
achieved?
 ..because the wavelet transformed data can be truncated.

35
Data reduction via Wavelet Transform
 A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet
coefficients.
 For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are
set to 0.
 The resulting data representation is therefore very sparse
(small, infrequent, scattered), so that operations that can
take advantage of data sparsity are computationally very fast
if performed in wavelet space.
 The technique also works to remove noise without smoothing
out the main features of the data, making it effective for data
cleaning as well. An approximation of the original data can be
constructed by applying the inverse of the DWT used.
36
What Is Wavelet Transform?
 Decomposes a signal into
different frequency subbands
 Applicable to n-
dimensional signals
 Data are transformed to
preserve relative distance
between objects at different
levels of resolution
 Allow natural clusters to
become more distinguishable
 Used for image compression

37
Wavelet Transformation
 DWT is closely related to the discrete Fourier transform (DFT), but gives
better lossy compression, i.e., DWT provides a more accurate approximation
of the original data.
 For an equivalent approximation, the DWT requires less
space than the DFT.
 Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
 Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two set of data of length L/2
 Applies the two functions recursively, until it reaches the desired length
38
Wavelet Decomposition
 Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
 S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ =
[2¾, −1¼, ½, 0, 0, −1, −1, 0]
 Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained

39
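A small sketch of the repeated averaging-and-differencing (a simple Haar-style decomposition) that produces S^ from S above:

def haar_decompose(signal):
    # Repeated pairwise averaging and differencing (a simple Haar-style DWT).
    coeffs = []
    s = list(signal)
    while len(s) > 1:
        averages    = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        differences = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = differences + coeffs        # prepend so coarser-level details come first
        s = averages
    return s + coeffs                        # overall average first

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]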
Why Wavelet Transform?
 Use hat-shape filters
 Emphasize region where points cluster

 Suppress weaker information in their boundaries

 Effective removal of outliers


 Insensitive to noise, insensitive to input order

 Multi-resolution
 Detect arbitrary shaped clusters at different scales

 Efficient
 Complexity O(N)

 Only applicable to low dimensional data

40
Principal Component Analysis (PCA)
 This is a lossy compression method applied on numerical data for
identifying patterns. This is another way of dimensionality reduction
 If there are n attributes/dimensions for a dataset that has to be
reduced, PCA searches for k n-dimensional orthogonal vectors that
can best be used to represent data.
 The original data are projected onto a much smaller space, resulting
in dimensionality reduction.
 The data are then plotted against the new axes, resulting in the
dimension reduction.

41
Principal Component Analysis (Steps)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Calculate the co-variance matrix nXn
 Calculate the eigen vectors and eigen values of co-variance matrix.
The eigen vectors should be unit eigen vectors and are
perpendicular to each other
 Data is plotted (normalized data) against the eigen vectors of the
covariance matrix
 Order the eigenvectors (the principal components) in decreasing strength,
i.e., by their eigenvalues. The principal components serve as a new set of
axes providing important information about the variance. This helps in
identifying groups.
42
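A hedged NumPy sketch following these steps (normalize, covariance matrix, eigen-decomposition, project onto the top-k components); the data are random, for illustration only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 data vectors, n = 5 dimensions
k = 2                                         # keep the top-k principal components

# 1. Normalize each attribute (zero mean, unit variance).
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (n x n) of the normalized data.
C = np.cov(Xn, rowvar=False)

# 3. Unit eigenvectors and eigenvalues of the covariance matrix (C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Order components by decreasing eigenvalue (strength) and project the data.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:k]]
X_reduced = Xn @ components                   # data in the reduced k-D space

print(X_reduced.shape)                        # (100, 2)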
Principal Component Analysis
 Advantages of PCA
 Inexpensive
 It combines the essence of all attributes even during dimensionality
reduction
 Can handle sparse and skewed data
 Can be applied to ordered and unordered attributes
 Multidimension can be reduced to 2 dimensions.

43
Attribute Subset Selection
 Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes
 Redundant attributes
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
44
Attribute Selection

 The goal of attribute subset selection is to find a minimum


set of attributes
 The “best” and “worst” attributes are typically determined
using tests of statistical Significance
 1. Stepwise forward selection: The procedure starts
with an empty set of attributes as the reduced set. The
best of the original attributes is determined and added to
the reduced set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the set.
 2. Stepwise backward elimination: The procedure
starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.

45
Attribute Selection

 3. Combination of forward selection and backward


elimination: The stepwise forward selection and backward
elimination methods can be combined so that, at each step,
the procedure selects the best attribute and removes the
worst from among the remaining attributes.
 4. Decision tree induction: Decision tree algorithms
(e.g., ID3, C4.5, and CART) were originally intended for
classification. Decision tree induction constructs a
flowchartlike structure where each internal (nonleaf) node
denotes a test on an attribute, each branch corresponds to
an outcome of the test, and each external (leaf) node
denotes a class prediction. At each node, the algorithm
chooses the “best” attribute to partition the data into
individual classes.
46
Data Reduction 2: Numerosity Reduction
 Numerosity means number of distinct values in a dataset.
 Numerosity reduction is reducing volume of huge dataset by
choosing alternative, smaller forms of data representation
 There are two techniques, parametric and non-parametric
 In parametric, a model is used to represent the data. Here
only model parameters need to be stored.
 In non-parametric, no model is being used. A reduced
representation of data is stored.

47
Numerosity Reduction-Parametric Methods
 Parametric methods (e.g., regression)
 Parametric methods are used when data can be

related using an equation or a constant.


 This type of data may fit into a model and model

parameters can be estimated. And store only the


parameters, and discard the actual data. This is how
numerosity reduction works.
 The models used in this technique are Regression and

Log-linear models

48
Parametric Data Reduction: Regression
and Log-Linear Models
 Regression are of two types: Linear and Multiple
 Linear regression
 In linear regression, data are modeled to fit in a straight

line and can be represented by an equation.


 For example, a random variable, y (called a response
variable), can be modeled as a linear function of another
random variable, x (called a predictor variable), with the
equation y=mx+c, where m, c are constants known as
regression coefficients.

49
Regression Analysis
(Figure: data points such as (X1, Y1) and their estimates (X1, Y1′) on the
fitted line y = x + 1.)

 The coefficients can be solved for using the ‘least squares method’, which
minimizes the error between the actual data points and the estimated line.
 The parameters are estimated so as to give a “best fit” of the data
 Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships

50
Parametric Data Reduction: Regression
and Log-Linear Models
 Multiple regression
 Multiple linear regression extends linear regression by involving more than one predictor variable. It allows a response variable y to be modeled as a linear function of a multidimensional feature vector, e.g., y = m1x1 + m2x2 + c
 Log-linear model
 Log-linear models postulate a linear relationship between the independent variables and the logarithm of the dependent variable: log(y) = a0 + a1x1 + a2x2 + … + anxn
 This model helps estimate the probability of each tuple for a set of discretized attributes, and thus helps in numerosity reduction.
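A minimal hedged sketch of multiple regression with two predictors (the data values are invented); the three parameters m1, m2, c are all that need to be stored:

# Hedged sketch: estimate y = m1*x1 + m2*x2 + c by least squares.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])  # x1, x2
y = np.array([5.1, 4.9, 11.2, 10.8, 15.0])                                  # responses

A = np.column_stack([X, np.ones(len(X))])       # extra column of 1s for the intercept c
(m1, m2, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"m1={m1:.2f}, m2={m2:.2f}, c={c:.2f}")   # store only these three parameters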
51
Summary of Parametric models
 Linear regression: Y = mX + c
 Two regression coefficients, m and c, specify the line and are to be
estimated by using the data at hand
 Obtained by applying the least squares criterion to the known values of Y1, Y2, … and X1, X2, …
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
 Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
 Useful for dimensionality reduction and data smoothing
52
Comparison of Regression and Log-Linear Models
 Both can be applied to sparse data
 Regression also works well on skewed data
 Log-linear models are better suited to high-dimensional data, whereas regression becomes computationally intensive in that setting.

53
Numerosity Reduction-nonParametric Methods
 Non-parametric methods
 Major families: histograms, clustering, sampling, …
 These methods are applied to data that do not fit any model.
 1. Histogram
 A popular data reduction technique that distributes the data into disjoint subsets known as buckets (or bins). A bucket may represent a single attribute–value pair or a continuous range of values of a given attribute.

54
Histogram Analysis
 There are different partitioning rules to determine bucket size:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): equal number of values per bucket
 Advantages: practical, closely reflect the data distribution, and are effective.
 [Figure: equal-width histogram of prices; buckets span 10,000–100,000 and counts range from 0 to 40]
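A small hedged sketch (using the price list from the binning example later in this chapter) of equal-width and equal-frequency bucketing with NumPy:

# Hedged sketch: equal-width vs. equal-frequency (equal-depth) buckets.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, edges = np.histogram(prices, bins=3)             # 3 equal-width buckets
print("equal-width edges:", edges, "counts:", counts)

quantile_edges = np.quantile(prices, [0, 1/3, 2/3, 1])   # 3 equal-depth buckets
print("equal-depth edges:", quantile_edges)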
55
Clustering
 Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
 Can be very effective if data is clustered but not if data
is “smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms
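A hedged sketch of clustering-based reduction with k-means (the choice k = 3 and the synthetic data are assumptions): only each cluster's centroid and a rough diameter are stored in place of the member tuples.

# Hedged sketch: store cluster centroids (and diameters) instead of the raw tuples.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))                   # 300 illustrative 2-D tuples

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
centroids = km.cluster_centers_                    # 3 representatives replace 300 tuples
diameters = [2 * np.max(np.linalg.norm(data[km.labels_ == i] - centroids[i], axis=1))
             for i in range(3)]                    # rough diameter per cluster
print(centroids)
print(diameters)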

56
Clustering

57
Sampling
 Sampling: obtaining a small sample s to represent the whole data set N
 Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
 Key principle: choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified sampling
 Note: sampling may not reduce database I/Os (page at a time)
58
Types of Sampling
 Simple random sampling
 There is an equal probability of selecting any particular item
 Sampling without replacement (SRSWOR)
 Once an object is selected, it is removed from the population
 Sampling with replacement (SRSWR)
 A selected object is not removed from the population
 Cluster sampling
 Stratified sampling
 Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
 Used in conjunction with skewed data
59
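A short hedged sketch (synthetic data; stratum proportions chosen arbitrarily) of SRSWOR, SRSWR, and stratified sampling with pandas:

# Hedged sketch of SRSWOR, SRSWR, and stratified sampling.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"value": rng.normal(size=1000),
                   "stratum": rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1])})

srswor = df.sample(n=50, replace=False, random_state=42)   # without replacement
srswr = df.sample(n=50, replace=True, random_state=42)     # with replacement

# Stratified: draw the same fraction (5%) from every stratum.
stratified = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=42))
print(stratified["stratum"].value_counts())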
Sampling: With or without Replacement

Raw Data
60
Sampling: Cluster or Stratified Sampling

Raw Data Cluster/Stratified Sample

61
62
Sampling
 In cluster sampling, the tuples are grouped into mutually disjoint "clusters" (for example, the pages on which the tuples of a database are stored), and the sample consists of elements from the selected clusters only. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
 In stratified sampling, the data set is divided into mutually disjoint strata, and the sample contains tuples drawn from every stratum.

63


Stratified Sampling

64


65
Cluster sampling

66


67
Data Cube Aggregation

 The lowest level of a data cube (base cuboid)


 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
68
Imagine that you have collected the data for your
analysis. These data consist of the AllElectronics sales
per quarter, for the years 2008 to 2010.
You are, however, interested in the annual sales (total
per year), rather than the total per quarter.
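A hedged pandas sketch of exactly this roll-up from quarterly to annual totals (the sales figures are invented):

# Hedged sketch: aggregate quarterly sales to annual totals (figures are invented).
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2008]*4 + [2009]*4 + [2010]*4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 301, 456, 392, 610, 280, 421, 377, 598],
})

annual = quarterly.groupby("year", as_index=False)["sales"].sum()   # the roll-up
print(annual)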
Data Reduction 3: Data Compression
 String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless, but only limited manipulation is possible without expansion
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
 Time sequence is not audio
 Typically short and varies slowly with time
 Dimensionality and numerosity reduction may also be considered as forms of data compression
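A tiny hedged sketch of lossless string compression using Python's standard zlib module: decompression recovers the data exactly, but the compressed bytes cannot be manipulated directly without expansion.

# Hedged sketch: lossless compression round-trip with zlib.
import zlib

text = ("data warehouse " * 200).encode("utf-8")
compressed = zlib.compress(text, level=9)

print(len(text), "->", len(compressed), "bytes")   # size reduction
assert zlib.decompress(compressed) == text         # exact reconstruction (lossless)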
70
Data Compression

[Figure: Original Data → Compressed Data (lossless; the original can be reconstructed exactly) vs. Original Data → Approximated Data (lossy)]
71
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
72
Data Transformation
 In data transformation, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
 A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values

73
Data Transformation
 Data Transformation Methods
 Smoothing: Remove noise from data (Binning, regression,
clustering are the techniques to achieve it)
 Attribute/feature construction - New attributes constructed from
the given set of attributes to help the mining process
 Aggregation: Summarization (from daily to monthly or yearly),
data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: raw values of numeric attributes replaced by interval labels or conceptual labels
 Concept hierarchy generation: attributes such as street can be generalized to higher-level concepts, like city or country
74
Data Transformation
 Data Transformation Methods
 Discretization: Ex: the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior).
 Concept hierarchy generation: Concept hierarchy climbing
(attributes such as street can be generalized to higher-level
concepts, like city or country)

75
Normalization
 Normalizing the data attempts to give all attributes an equal weight.
 Expressing an attribute in smaller units will lead to a larger range for
that attribute, and thus tend to give such an attribute greater effect
or “weight.”
 To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized.
 This involves transforming the data to fall within a smaller or
common range such as [-1, 1] or [0.0, 1.0].
 We consider 3 methods for data normalization, namely, min-max
normalization, z-score normalization, and normalization by decimal
scaling.

76
Normalization
 For our discussion, let A be a numeric attribute with n observed
values, v1, v2, … , vn.
 Min-max normalization performs a linear transformation on the
original data.

77
Normalization
 Min-max normalization: to [new_min_A, new_max_A]
   v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ_A: mean, σ_A: standard deviation of A):
   v' = (v − μ_A) / σ_A
 Ex. Let μ_A = 54,000, σ_A = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

78
Normalization
 Normalization by decimal scaling
   v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
 Suppose that the recorded values of A range from −986 to 917.
 The maximum absolute value of A is 986.
 To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
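A short hedged sketch applying all three normalizations; the helper function names are ours, and only the income example above is taken from the slides.

# Hedged sketch of min-max, z-score, and decimal-scaling normalization.
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    # smallest integer j such that max_abs / 10**j < 1 (assumes max_abs >= 1)
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))    # ~0.716
print(z_score(73600, 54000, 16000))    # 1.225
print(decimal_scaling(917, 986))       # 0.917 (j = 3, since max |v| = 986)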

79
Discretization
 Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification

80
Data Discretization Methods
 Typical methods: All the methods can be applied recursively
 Binning -Top-down split, Binning does not use class information
and is therefore an unsupervised discretization technique. (The
sorted values are distributed into a number of “buckets,” or bins. 1.
smoothing by bin means, 2. smoothing by bin boundaries)
 Histogram analysis
 Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ2) analysis (supervised, bottom-up merge)
81
Simple Discretization: Binning

 Equal-width (distance) partitioning


 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well

 Equal-depth (frequency) partitioning


 Divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
82
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
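A hedged NumPy sketch reproducing the equal-frequency bins and the smoothing by bin means shown above:

# Hedged sketch: equal-frequency binning followed by smoothing by bin means.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # already sorted

bins = np.array_split(prices, 3)                  # 3 equal-depth bins of 4 values each
smoothed = [np.full(len(b), int(round(b.mean()))) for b in bins]

print([b.tolist() for b in bins])       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print([s.tolist() for s in smoothed])   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]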
83
Discretization Without Using Class Labels
(Binning vs. Clustering)

[Figure panels: original data; equal interval width (binning); equal frequency (binning); K-means clustering, which leads to better results]

84
Discretization by Classification &
Correlation Analysis
 Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Details to be covered in Chapter 7
 Correlation analysis (e.g., Chi-merge: χ2-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
 Merge performed recursively, until a predefined stopping condition

85
Concept Hierarchy Generation

 Concept hierarchy organizes concepts (i.e., attribute values)


hierarchically and is usually associated with each dimension in a data
warehouse
 Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
 Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.

86
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
87
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
 The attribute with the most distinct values is placed at
the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

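A hedged pandas sketch of this heuristic: attributes are ordered by their number of distinct values (the tiny table below is invented for illustration).

# Hedged sketch: order location attributes by distinct-value counts (invented data).
import pandas as pd

df = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA"],
    "state":   ["BC", "BC", "IL", "IL", "NY"],
    "city":    ["Vancouver", "Vancouver", "Chicago", "Urbana", "New York"],
    "street":  ["Main St", "Fort St", "Lake St", "Green St", "5th Ave"],
})

order = df.nunique().sort_values()   # fewest distinct values -> top of the hierarchy
print(order)                         # country < state < city < street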
88
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality

 Major Tasks in Data Preprocessing

 Data Cleaning

 Data Integration

 Data Reduction

 Data Transformation and Data Discretization

 Summary
89
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem

 Remove redundancies

 Detect inconsistencies

 Data reduction
 Dimensionality reduction

 Numerosity reduction

 Data compression

 Data transformation and data discretization


 Normalization

 Concept hierarchy generation

90