Data Warehouse
— Chapter 4 —
1
Chapter 4: Data Warehousing and On-line Analytical
Processing
2
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Support information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
■ Data warehousing:
■ The process of constructing and using data warehouses
3
Data Warehouse—Subject-Oriented
4
Data Warehouse—Integrated
■ Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, online transaction records
■ Data cleaning and data integration techniques are applied.
■ Ensure consistency in naming conventions, encoding structures, and attribute measures among the different sources
5
Data Warehouse—Time Variant
6
Data Warehouse—Nonvolatile
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
7
OLTP vs. OLAP
■ Online transaction processing (OLTP) vs. online analytical processing (OLAP)
8
Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
9
Data Warehouse: A Multi-Tiered Architecture
[Figure: a multi-tiered data warehouse architecture. Bottom tier: operational DBs and other sources are extracted, transformed, loaded, and refreshed into the data warehouse and data marts, coordinated by a monitor & integrator with a metadata repository; some views are materialized. Middle tier: OLAP server. Top tier: front-end tools for query/reports, analysis, and data mining.]
11
Extraction, Transformation, and Loading (ETL)
■ Data extraction
■ get data from multiple, heterogeneous, and external
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the
warehouse
12
Metadata Repository
■ Meta data is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
■ Operational meta-data
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
■ The algorithms used for summarization
■ The mapping from operational environment to the data warehouse
■ Data related to system performance
■ warehouse schema, view and derived data definitions
■ Business data
■ business terms and definitions, ownership of data, charging policies
13
Chapter 4: Data Warehousing and On-line Analytical
Processing
14
From Tables and Spreadsheets to
Data Cubes
■ A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
■ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
15
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids for the dimensions time, item, location, and supplier — from the 0-D apex cuboid (all), through the 1-D and 2-D cuboids, to the 3-D cuboids such as (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier), down to the 4-D base cuboid.]
16
Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
17
Example of Star Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Measures: units_sold, dollars_sold, avg_sales
18
Example of Snowflake Schema
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Measures: units_sold, dollars_sold, avg_sales
19
Example of Fact Constellation
Two fact tables, the Sales Fact Table and the Shipping Fact Table, share dimension tables such as time and item.
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key, …
Shipping Fact Table: time_key, item_key, shipper_key, from_location, …
21
Data Cube Measures: Three Categories
■ Distributive: the result of applying the function to n aggregate values is the same as applying it to all the data without partitioning, e.g., count(), sum(), min(), max()
■ Algebraic: computable by an algebraic function with a bounded number of arguments, each obtained by a distributive aggregate function, e.g., avg(), min_N(), standard_deviation()
■ Holistic: no constant bound on the storage needed to describe a subaggregate, e.g., median(), mode(), rank()
Specification of hierarchies
■ Schema hierarchy
day < {month < quarter; week} < year
■ Set_grouping hierarchy
{1..10} < inexpensive
23
Multidimensional Data
[Figure: sales data arranged along multiple dimensions, e.g., office (location), day, and month.]
24
A Sample Data Cube
[Figure: a sample sales data cube with dimensions product (TV, PC, VCR), time (1Qtr–4Qtr), and country (U.S.A., Canada, Mexico), including sum cells along each dimension. For example, one cell answers “how many products were sold in the U.S.A. in the 1st quarter?”, while the apex cell aggregates over all countries, all products, and all quarters.]
25
Cuboids Corresponding to the Cube
[Figure: the lattice of cuboids for the dimensions product, date, and country — the 0-D apex cuboid (all), the 1-D cuboids (product), (date), (country), the 2-D cuboids, and the 3-D base cuboid (product, date, country).]
26
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
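The operations above can be sketched on a toy fact table in plain Python; the records and field names below are hypothetical, chosen only to illustrate slice and roll-up:

from collections import defaultdict

# toy fact table: (item, city, quarter, units_sold)
facts = [
    ("TV", "Chicago", "Q1", 100),
    ("TV", "Chicago", "Q2", 150),
    ("TV", "Toronto", "Q1", 80),
    ("PC", "Chicago", "Q1", 200),
]

# slice: fix quarter = "Q1"
q1 = [f for f in facts if f[2] == "Q1"]

# roll-up: aggregate the city (location) dimension away,
# i.e., group the sliced data by (item, quarter) only
rollup = defaultdict(int)
for item, city, quarter, units in q1:
    rollup[(item, quarter)] += units

print(dict(rollup))   # {('TV', 'Q1'): 180, ('PC', 'Q1'): 200}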
27
Fig. 3.10 Typical OLAP
Operations
28
A Star-Net Query Model
[Figure: star-net query model for a sales example. Each radial line is a dimension and each circle on a line is called a footprint, representing an abstraction level: Customer Orders (CONTRACTS, ORDER, CUSTOMER), Shipping Method (AIR-EXPRESS, TRUCK), Time (DAILY, QTRLY, ANNUALLY), Product (PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE), Location and Organization (CITY, DISTRICT, COUNTRY, REGION, DIVISION, SALES PERSON), and Promotion.]
29
Browsing a Data Cube
■ Visualization
■ OLAP capabilities
■ Interactive manipulation
30
Chapter 4: Data Warehousing and On-line Analytical
Processing
31
Design of Data Warehouse: A Business
Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the
data warehouse
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view
of end-user
32
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, short
turnaround time, quick turnaround
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record
33
Data Warehouse Development: A
Recommended Approach
[Figure: recommended development approach — build distributed data marts and an enterprise data warehouse, and integrate them into a multi-tier data warehouse.]
35
From On-Line Analytical Processing (OLAP)
to On Line Analytical Mining (OLAM)
■ Why online analytical mining?
■ High quality of data in data warehouses
■ Available information-processing infrastructure surrounding data warehouses
■ ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
37
Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids
■ The bottom-most cuboid is the base cuboid
■ The top-most cuboid (apex) contains only one cell
■ How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
■ T = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1), counting one extra level per dimension for “all”
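A quick check of that count in Python, with hypothetical hierarchy depths for four dimensions:

from math import prod

# levels (excluding "all") for, e.g., time, item, location, supplier
levels = [4, 3, 4, 2]                     # hypothetical hierarchy depths
total_cuboids = prod(l + 1 for l in levels)
print(total_cuboids)                      # 5 * 4 * 5 * 3 = 300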
40
Indexing OLAP Data: Join Indices
41
Efficient Processing OLAP Queries
■ Determine which operations should be performed on the available cuboids
■ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
■ Determine which materialized cuboid(s) should be selected for OLAP op.
■ Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
■ Explore indexing structures and compressed vs. dense array structs in MOLAP
42
OLAP Server Architectures
■ Relational OLAP (ROLAP): a relational or extended-relational DBMS stores and manages the warehouse data
■ Multidimensional OLAP (MOLAP): an array-based multidimensional storage engine (sparse-matrix techniques)
■ Hybrid OLAP (HOLAP): combines both, e.g., detailed data kept relational, aggregations kept in a MOLAP store
43
Chapter 4: Data Warehousing and On-line Analytical
Processing
44
Attribute-Oriented Induction
45
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate
students in the University database
■ Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where student_status in ('MSc', 'MBA', 'PhD')
■ Step 2. Perform attribute-oriented induction
■ Step 3. Present results in generalized relation, cross-tab,
or rule forms
46
Class Characterization: An Example
[Figure: the initial working relation fetched from the database and the prime generalized relation produced by attribute-oriented induction.]
47
Basic Principles of Attribute-Oriented Induction
49
Presentation of Generalized Results
■ Generalized relation:
■ Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
■ Cross tabulation:
■ Mapping results into cross tabulation form (similar to contingency
tables).
■ Visualization techniques:
■ Pie charts, bar charts, curves, cubes, and other visual forms.
■ Quantitative characteristic rules:
■ Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
50
Mining Class Comparisons
51
Concept Description vs. Cube-Based OLAP
■ Similarity:
■ Data generalization
■ Presentation of data summarization at multiple levels of
abstraction
■ Interactive drilling, pivoting, slicing and dicing
■ Differences:
■ OLAP performs systematic, query-independent preprocessing, whereas concept description (attribute-oriented induction) performs generalization online, driven by the user's query
52
Chapter 4: Data Warehousing and On-line Analytical
Processing
53
Summary
■ Data warehousing: A multi-dimensional model of a data warehouse
■ A data cube consists of dimensions & measures
■ Star schema, snowflake schema, fact constellations
■ OLAP operations: drilling, rolling, slicing, dicing and pivoting
■ Data Warehouse Architecture, Design, and Usage
■ Multi-tiered architecture
■ Business analysis design framework
■ Information processing, analytical processing, data mining, OLAM (Online
Analytical Mining)
■ Implementation: Efficient computation of data cubes
■ Partial vs. full vs. no materialization
■ Indexing OLAP data: Bitmap index and join index
■ OLAP query processing
■ OLAP servers: ROLAP, MOLAP, HOLAP
■ Data generalization: Attribute-oriented induction
54
References (I)
■ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
■ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
■ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
■ S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
■ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July
1993.
■ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab
and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
■ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
■ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
■ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
■ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
55
References (II)
■ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
■ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
■ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2ed. John Wiley, 2002
■ P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record,
24:8–11, Sept. 1995.
■ P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
■ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
■ S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
■ A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
■ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
■ P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
■ J. Widom. Research problems in data warehousing. CIKM’95
■ K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006
56
Surplus Slides
57
Compression of Bitmap Indices
■ Bitmap indexes must be compressed to reduce I/O costs
and minimize CPU usage—majority of the bits are 0’s
■ Two compression schemes:
■ Byte-aligned Bitmap Code (BBC)
■ Word-Aligned Hybrid (WAH) code
■ Time and space required to operate on compressed
bitmap is proportional to the total size of the bitmap
■ Optimal on attributes of low cardinality as well as those of
high cardinality.
■ WAH outperforms BBC by about a factor of two
58
Data Mining:
Concepts and Techniques
(3rd ed.)
— Module 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Data Objects and Attribute Types
Data sets are made up of data objects.
A data object represents an entity—
Examples
in a sales database, the objects may be customers, store items,
and sales;
in a medical database, the objects may be patients;
and in a university database, the objects may be students, professors, and courses.
Data objects are typically described by attributes.
2
Data objects can also be referred to as samples, examples, instances,
data points, or objects.
If the data objects are stored in a database, they are data tuples.
That is,
the rows of a database table correspond to the data objects, and the columns correspond to the attributes.
3
An attribute is a data field, representing a characteristic or feature
of a data object.
The nouns attribute, dimension, feature, and variable are often
used interchangeably.
The term dimension is commonly used in data warehousing.
4
Types of attributes
Types:
Nominal
Binary
Ordinal
Numeric
Interval-scaled
Ratio-scaled
5
Nominal Attributes
Nominal means “relating to names.” The values of a nominal
attribute are symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical.
Examples
hair color and marital status are two attributes describing person
objects.
hair color: black, brown, blond, red, auburn, gray, and white.
6
Nominal Attributes
7
Binary Attributes
A binary attribute is a nominal attribute with only two categories or
states: 0 or 1,
where 0 typically means that the attribute is absent and 1 means that it is present.
For example, for the attribute smoker describing a patient,
1 indicates that the patient smokes, while 0 indicates that the patient does not.
8
Binary Attributes
Symmetric
A binary attribute is symmetric if both of its states are equally valuable and carry the same weight, e.g., gender. It is asymmetric if the outcomes are not equally important, e.g., the positive and negative results of a medical test.
9
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.
10
Ordinal attributes are useful for registering subjective assessments
of qualities that cannot be measured objectively; thus ordinal
attributes are often used in surveys for ratings.
Customer satisfaction had the following ordinal categories:
0: very dissatisfied,
1: somewhat dissatisfied,
2: neutral,
3: satisfied, and
4: very satisfied
The central tendency of an ordinal attribute can be represented by its
mode and its median but the mean cannot be defined
11
Numeric Attributes
Ratio scaled
A ratio-scaled attribute is a numeric attribute with an
inherent zero-point.
In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
13
Nominal, binary, and ordinal attributes are qualitative.
That is, they describe a feature of an object without giving an actual
size or quantity.
The values of such qualitative attributes are typically words representing categories.
14
Basic Statistical Descriptions
Measures of Central Tendency
16
17
Median. Let’s find the median of the data from Example 2.6. The
data are already sorted in increasing order. There is an even
number of observations (i.e., 12); therefore, the median is not
unique. It can be any value within the two middlemost values of 52
and 56 (that is, within the sixth and seventh values in the list).
By convention, we take the average of the two middlemost values as the median.
Suppose that we had only the first 11 values in the list. Given an
odd number of values, the median is the middlemost value. This is
the sixth value in this list, which has a value of $52,000.
18
The mode is another measure of central tendency. The mode for a
set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative
attributes
Mode. The data from Example 2.6 are bimodal. The two modes are
$52,000 and $70,000.
The midrange can also be used to assess the central tendency of
a numeric data set.
It is the average of the largest and smallest values in the set.
The midrange of the data of Example 2.6 is
19
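A small Python sketch of these central-tendency measures; the salary list is assumed for illustration (in $1000s), not taken from the slides:

from statistics import mean, median, multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # assumed values, $1000s

print(mean(salaries))                       # arithmetic mean -> 58
print(median(salaries))                     # average of the two middle values -> 54.0
print(multimode(salaries))                  # all most-frequent values -> [52, 70] (bimodal)
print((min(salaries) + max(salaries)) / 2)  # midrange -> 70.0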
20
Measuring the Dispersion of Data: Range, Quartiles,
Variance, Standard Deviation, and Interquartile Range
Range: the difference between the largest (max) and smallest (min) values of the attribute.
Suppose that the data for attribute X are sorted in increasing numeric order. Imagine
that we can pick certain data points so as to split the data distribution into equal-size
consecutive sets, as in Figure 2.2. These data points are called quantiles.
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially
equal size consecutive sets.
Interquartile Range
The quartiles give an indication of a distribution’s center, spread, and shape. The first
quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data.
The third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or
highest 25%) of the data.
The second quartile is the 50th percentile. As the median, it gives the center of the data
distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3 - Q1
Semi-interquartile range = IQR / 2
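A short sketch of quartiles, IQR, and the semi-interquartile range using numpy percentiles on an assumed data list:

import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # assumed values

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # 25th, 50th (median), 75th percentiles
iqr = q3 - q1                                   # spread of the middle half of the data
semi_iqr = iqr / 2
print(q1, q2, q3, iqr, semi_iqr)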
Chapter 2: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
2
Data Quality: Why Preprocess the Data?
3
Inaccurate, incomplete, and inconsistent data are commonplace
properties of large real-world databases and data warehouses.
Inaccurate data (i.e., data having incorrect attribute values) may have several causes:
The data collection instruments used may be
faulty.
There may have been human or computer
errors
occurring at data entry.
Disguised missing data (e.g., users deliberately entering default or incorrect values)
Errors in data transmission can also occur
4
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files, different
names (cid, cust_id), inferable attributes (avoid redundancy)
Data reduction
Dimensionality reduction
Numerosity reduction (replacing data by alternate smaller
representations)
Data compression
Data transformation and data discretization
Normalization, aggregation
Concept hierarchy generation
5
Unit II: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
6
7
In summary, real-world data tend to be dirty,
incomplete, and inconsistent.
Data preprocessing techniques can improve
data quality, thereby helping to improve the
accuracy
and efficiency of the subsequent mining
process.
Data preprocessing is an important step
in the knowledge discovery process, because
quality decisions must be based on quality
data.
8
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., instrument faulty, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers, faulty data collection
instruments, human or computer error, technology limitations…
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
9
Data cleaning (or data
cleansing) routines attempt to fill in
missing values, smooth out noise
while identifying outliers, and
correct inconsistencies in the
data.
10
Incomplete (Missing) Data
Missing values may stem from technology limitations, incomplete data collection, or inconsistent data that was discarded
13
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
Clustering
detect and remove outliers
14
Binning Methods
15
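The slide's binning details are not reproduced above; as a stand-in, here is a minimal sketch of equal-frequency partitioning followed by smoothing by bin means, on an assumed sorted price list:

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])   # assumed values

# equal-frequency (equal-depth) partitioning into 3 bins of 3 values each
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]
# -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# smoothing by bin means: replace each value in a bin by the bin's mean
smoothed = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
print(smoothed)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]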
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
16
Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
17
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Integrate metadata from different sources
Entity identification problem:
How to identify equivalent real world entities from multiple data
sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
18
Data Integration
Entity identification problem:
Examples of metadata for each attribute include the name,
meaning, data type, and range of values permitted for the
attribute, and null rules for handling blank, zero, or null values.
Such metadata can be used to help avoid errors in schema
integration.
The metadata may also be used to help transform the data (e.g.,
where data codes for pay type in one database may be “H” and “S”
but 1 and 2 in another).
19
Handling Redundancy in Data Integration
21
Correlation Analysis (Nominal Data)
Χ2 (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
The larger the Χ2 value, the more likely the variables are
related
The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
22
Chi-Square Calculation: An Example
24
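The worked example itself is not reproduced above; the following sketch computes χ² for a hypothetical 2×2 contingency table (e.g., gender vs. preferred reading):

# observed counts for a hypothetical 2x2 table
#                fiction  non-fiction
# male              250          200
# female             50         1000
observed = [[250, 200], [50, 1000]]

row_totals = [sum(r) for r in observed]          # [450, 1050]
col_totals = [sum(c) for c in zip(*observed)]    # [300, 1200]
n = sum(row_totals)                              # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 1))   # ~507.9: large value, so the two attributes are likely correlated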
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
25
Correlation (viewed as linear relationship)
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, A and B, and then take their dot product
correlation(A, B) = (A′ · B′) / n, where A′ and B′ are the standardized (z-scored) versions of A and B and n is the number of observations
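A quick sketch, reusing the stock values from the covariance example a few slides below, showing that the dot product of the standardized objects divided by n equals the Pearson correlation:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# standardize (z-score) using the population standard deviation
A_std = (A - A.mean()) / A.std()
B_std = (B - B.mean()) / B.std()

r = np.dot(A_std, B_std) / len(A)   # Pearson correlation coefficient
print(round(r, 3), round(np.corrcoef(A, B)[0, 1], 3))   # both print 0.941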
26
Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σ (aᵢ − Ā)(bᵢ − B̄)
Correlation coefficient: r(A, B) = Cov(A, B) / (σA σB)
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
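A sketch that answers the question with the given values; a positive covariance means the two stocks tend to rise and fall together:

A = [2, 3, 5, 4, 6]        # stock A prices over the week
B = [5, 8, 10, 11, 14]     # stock B prices over the week
n = len(A)

mean_a, mean_b = sum(A) / n, sum(B) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n

print(cov)   # 4.0 > 0, so the two stocks tend to rise (and fall) together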
For a hotel chain (a data value conflict example), the price of rooms in different cities may involve not
only different currencies (rupees, dollars, euros, etc.) but may
also include different services (e.g., free breakfast) and taxes.
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
31
Data Reduction Strategies
When the data set is very large, complex data analysis and mining are
very time-consuming.
Data Reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
32
Data Reduction Strategies
Dimensionality reduction: the process of reducing the number of random
variables or attributes under consideration, e.g.,
Wavelet transforms
34
What Is Wavelet Transform?
The discrete wavelet transform (DWT) is a linear signal
processing technique.
When DWT is applied to a data vector X, it is transformed
to a numerically different vector, X’, of wavelet
coefficients.
X and X’ are of the same length. When applying this
technique to data reduction, we consider each tuple as an
n-dimensional data vector, that is X = (x1,x2, …, xn).
If X and X’ are of same length, how data reduction is
achieved?
Because the wavelet-transformed data can be truncated.
35
Data reduction via Wavelet Transform
A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet
coefficients.
For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are
set to 0.
The resulting data representation is therefore very sparse
(small, infrequent, scattered), so that operations that can
take advantage of data sparsity are computationally very fast
if performed in wavelet space.
The technique also works to remove noise without smoothing
out the main features of the data, making it effective for data
cleaning as well. An approximation of the original data can be
constructed by applying the inverse of the DWT used.
36
What Is Wavelet Transform?
Decomposes a signal into
different frequency subbands
Applicable to n-
dimensional signals
Data are transformed to
preserve relative distance
between objects at different
levels of resolution
Allow natural clusters to
become more distinguishable
Used for image compression
37
Wavelet Transformation
DWT is closely related to discrete Fourier transform (DFT),
but gives better lossy compression, i.e., DWT provides a more
accurate approximation of the original data.
For an equivalent approximation, the DWT requires less
space than the DFT.
Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
Method:
Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
Each transform has 2 functions: smoothing, difference
Applies to pairs of data, resulting in two set of data of length L/2
Applies the two functions recursively, until it reaches the desired length
38
Wavelet Decomposition
Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]
Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
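A small sketch of the pairwise smoothing/differencing (Haar-style) decomposition, which reproduces the coefficients quoted above:

def haar_decompose(signal):
    # returns [overall average, detail coefficients ...] for a length-2^k signal
    output = list(signal)
    length = len(signal)
    while length > 1:
        averages = [(output[i] + output[i + 1]) / 2 for i in range(0, length, 2)]
        details = [(output[i] - output[i + 1]) / 2 for i in range(0, length, 2)]
        output[:length] = averages + details
        length //= 2
    return output

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0], i.e., [2¾, −1¼, ½, 0, 0, −1, −1, 0]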
39
Why Wavelet Transform?
Use hat-shape filters
Emphasize region where points cluster
Multi-resolution
Detect arbitrary shaped clusters at different scales
Efficient
Complexity O(N)
40
Principal Component Analysis (PCA)
This is a lossy compression method applied on numerical data for
identifying patterns. This is another way of dimensionality reduction
If there are n attributes/dimensions for a dataset that has to be
reduced, PCA searches for k n-dimensional orthogonal vectors that
can best be used to represent data.
The original data are projected onto a much smaller space, resulting
in dimensionality reduction.
[Figure: data points plotted in the (x1, x2) plane, illustrating projection onto the principal axes.]
41
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Calculate the n × n covariance matrix
Calculate the eigenvectors and eigenvalues of the covariance matrix;
the eigenvectors should be unit vectors and are perpendicular to each other
Project (plot) the normalized data onto the eigenvectors of the covariance matrix
Order the eigenvectors (the principal components) by decreasing eigenvalue;
the principal components serve as a new set of axes that capture the most
important information about variance, which helps in identifying groups
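A compact numpy sketch of these steps on a hypothetical 2-D data set: center the data, take the covariance matrix, eigen-decompose it, and project onto the top k components:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # hypothetical data

# 1. normalize (center) the data
Xc = X - X.mean(axis=0)

# 2. covariance matrix (here 2 x 2)
cov = np.cov(Xc, rowvar=False)

# 3-4. eigenvectors/eigenvalues, ordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for a symmetric matrix
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. project onto the first k principal components
k = 1
X_reduced = Xc @ eigvecs[:, :k]
print(eigvals, X_reduced.shape)             # variance captured per component, (100, 1)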
42
Principal Component Analysis
Advantages of PCA
Inexpensive
It combines the essence of all attributes even during dimensionality
reduction
Can handle sparse and skewed data
Can be applied to ordered and unordered attributes
Multidimensional data can be reduced to two dimensions.
43
Attribute Subset Selection
Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
44
Attribute Selection
45
Attribute Selection
47
Numerosity Reduction-Parametric Methods
Parametric methods (e.g., regression)
Parametric methods assume the data fits some model: estimate the
model parameters and store only the parameters instead of the actual data
48
Parametric Data Reduction: Regression
and Log-Linear Models
Regression is of two types: linear and multiple
Linear regression
In linear regression, the data are modeled to fit a straight line, e.g., y = wx + b, where the regression coefficients w and b are estimated from the data (e.g., by least squares)
49
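A minimal least-squares sketch of this idea; the x and y values are assumed for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # assumed predictor values
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])     # assumed response values

w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
print(w, b)   # slope ~1.94, intercept ~0.3: the line y = wx + b replaces the raw data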
Regression Analysis
[Figure: data points and a fitted straight line in the x–y plane.]
50
Parametric Data Reduction: Regression
and Log-Linear Models
Multiple regression
Multiple linear regression extends linear regression to allow a response variable y to be modeled as a linear function of two or more predictor variables
53
Numerosity Reduction-nonParametric Methods
Non-parametric methods
Do not assume a model; the data are not fit into a model.
Major families: histograms, clustering, sampling, …
1. Histogram
A popular data reduction technique that distributes the data into
disjoint subsets known as buckets. A bucket may hold a single
attribute value or a continuous range of values of a given attribute.
54
Histogram Analysis
There are different partitioning rules to determine bucket size:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
Advantages: practical, closely approximate the actual data distribution, and effective.
[Figure: example histogram of an attribute (values roughly 10,000–100,000) with counts on the vertical axis.]
55
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
56
Clustering
57
Sampling
Cluster sampling: first partition (cluster) the data set, then draw a
simple random sample of the clusters
Stratified sampling:
Partition the data set into strata, and draw samples from each
stratum (proportionally, i.e., approximately the same
percentage of the data from each stratum)
Used in conjunction with skewed data
59
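A rough sketch of proportional stratified sampling; the records and strata are hypothetical:

import random

random.seed(42)
# hypothetical customer records: (customer_id, age_group)
data = [(i, "youth" if i % 4 == 0 else "adult") for i in range(1, 101)]

# group records by stratum
strata = {}
for record in data:
    strata.setdefault(record[1], []).append(record)

# draw roughly 10% from every stratum so skewed groups stay represented
fraction = 0.10
sample = []
for group in strata.values():
    k = max(1, round(len(group) * fraction))
    sample.extend(random.sample(group, k))

print(len(sample), {g: len(v) for g, v in strata.items()})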
Sampling: With or without Replacement
Raw Data
60
Sampling: Cluster or Stratified Sampling
61
62
Sampling
[Figure: the original data and an approximated (sampled) version of it.]
71
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
72
Data Transformation
In data transformation, the data are transformed or consolidated so that the resulting mining
process may be more efficient, and the patterns found may be easier
to understand.
A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be
identified with one of the new values
73
Data Transformation
Data Transformation Methods
Smoothing: Remove noise from data (Binning, regression,
clustering are the techniques to achieve it)
Attribute/feature construction - New attributes constructed from
the given set of attributes to help the mining process
Aggregation: Summarization (from daily to monthly or yearly),
data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing (attributes such as
street can be generalized to higher-level concepts, like city or
country)
74
Data Transformation
Data Transformation Methods
Discretization: Ex: the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior).
Concept hierarchy generation: Concept hierarchy climbing
(attributes such as street can be generalized to higher-level
concepts, like city or country)
75
Normalization
Normalizing the data attempts to give all attributes an equal weight.
Expressing an attribute in smaller units will lead to a larger range for
that attribute, and thus tend to give such an attribute greater effect
or “weight.”
To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized.
This involves transforming the data to fall within a smaller or
common range such as [-1, 1] or [0.0, 1.0].
We consider 3 methods for data normalization, namely, min-max
normalization, z-score normalization, and normalization by decimal
scaling.
76
Normalization
For our discussion, let A be a numeric attribute with n observed
values, v1, v2, … , vn.
Min-max normalization performs a linear transformation on the
original data.
77
Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μA: mean, σA: standard deviation of A):
v' = (v − μA) / σA
Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225
78
Normalization
Normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
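The three normalization methods as a short Python sketch, reusing the income figures from the examples above:

import math

values = [12000, 54000, 73600, 98000]
v = 73600

# min-max normalization to [0.0, 1.0]
min_a, max_a = min(values), max(values)
minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0   # ~0.716

# z-score normalization (mean and std taken from the slide's example)
mu, sigma = 54000, 16000
zscore = (v - mu) / sigma                                    # 1.225

# decimal scaling: divide by 10^j so the largest magnitude falls below 1
j = math.ceil(math.log10(max(abs(x) for x in values)))
decimal_scaled = v / 10 ** j                                 # 73600 / 10^5 = 0.736
print(round(minmax, 3), zscore, decimal_scaled)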
79
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
80
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning -Top-down split, Binning does not use class information
and is therefore an unsupervised discretization technique. (The
sorted values are distributed into a number of “buckets,” or bins. 1.
smoothing by bin means, 2. smoothing by bin boundaries)
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (supervised, bottom-up
merge)
81
Simple Discretization: Binning
84
Discretization by Classification &
Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details to be covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
Merge performed recursively, until a predefined stopping condition is met
85
Concept Hierarchy Generation
86
Concept Hierarchy Generation
for Nominal Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit
data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
87
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at
the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
89
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
90