Data Quality and Preprocessing Concepts ETL
Boxplot: the ends of the box are the quartiles (Q1 and Q3), the median is marked,
and outliers are plotted individually
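A minimal sketch of the numbers a boxplot draws. The slide does not say how outliers are chosen, so the common 1.5 x IQR fence is assumed here, and the helper name boxplot_stats is hypothetical.

```python
import statistics

def boxplot_stats(values):
    """Return the quartiles and the points a boxplot would plot individually."""
    data = sorted(values)
    # method="inclusive" treats the smallest and largest values as the
    # 0th and 100th percentiles of the data.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed rule (not from the slide): points more than 1.5 * IQR
    # beyond the box are treated as outliers.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low or x > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 50])
print(stats)
```

Here the box spans 3 to 7 with the median at 5, and only the value 50 falls outside the assumed fences.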
Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$
Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)
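The algebraic form of the variance is what makes it scalable: only the count, the sum, and the sum of squares need to be kept, so the data can be scanned once. The sketch below illustrates this, with the hypothetical function name one_pass_variance. It assumes at least two values. The sum-of-squares form can lose precision on large values that are close together, which this sketch does not address.

```python
import math
import statistics

def one_pass_variance(values):
    """Return (sample variance, population variance) from a single scan."""
    n = 0
    total = 0.0      # running sum of x_i
    total_sq = 0.0   # running sum of x_i squared
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    sample_var = (total_sq - total * total / n) / (n - 1)
    mean = total / n
    population_var = total_sq / n - mean * mean
    return sample_var, population_var

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
s2, sigma2 = one_pass_variance(prices)
# Compare with the two-pass results from the standard library.
print(math.isclose(s2, statistics.variance(prices)))
print(math.isclose(sigma2, statistics.pvariance(prices)))
print(math.sqrt(s2), math.sqrt(sigma2))  # standard deviations s and sigma
```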
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing.
Fill in the missing value manually: tedious and often infeasible
Fill it in automatically with
a global constant: e.g., "unknown" (a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: use inference-based formula such as
Bayesian formula or decision tree
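As an illustration of the "attribute mean for all samples belonging to the same class" option, here is a minimal sketch. The record layout, the field names income and risk, and the function name impute_by_class_mean are all made up for the example.

```python
from collections import defaultdict

def impute_by_class_mean(rows, attr, label):
    """Fill missing attr values with the mean of attr within the same class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[attr] is not None:
            sums[row[label]] += row[attr]
            counts[row[label]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        cls = new_row[label]
        if new_row[attr] is None and counts[cls] > 0:
            new_row[attr] = sums[cls] / counts[cls]
        filled.append(new_row)
    return filled

customers = [
    {"income": 30.0, "risk": "low"},
    {"income": 50.0, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90.0, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = impute_by_class_mean(customers, "income", "risk")
print(filled)
```

A class with no observed values is left unfilled here; falling back to the overall attribute mean would be one reasonable alternative.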
Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data to regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and have a human check them (e.g., deal with
possible outliers)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
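The worked example above can be reproduced with a short sketch. The function names are hypothetical. Bin means are rounded to the nearest integer to match the slide, and a value equally close to both boundaries is assumed to go to the lower one.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins  # assumes the length divides evenly
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if x - low <= high - x else high for x in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```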
Regression
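[Figure: data points in the x-y plane with the fitted line y = x + 1; the observed value Y1 at X1 is smoothed to its value on the line]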
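A minimal sketch of smoothing by regression, assuming a straight-line model fitted by ordinary least squares (the slides do not fix a particular fitting method). The data points are invented so that the fitted line comes out as y = x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 1, 2, 3]
ys = [1.5, 1.5, 2.5, 4.5]  # noisy observations (made-up values)
slope, intercept = fit_line(xs, ys)
# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept, smoothed)
```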
Cluster Analysis
Data Cleaning as a Process
Data discrepancy detection (causes: respondents not wanting to give details, outdated
addresses, poorly designed forms, too many options for questions)
Use any available knowledge, such as metadata (e.g., domain, range, dependency,
distribution), or write your own scripts.
Check for field overloading and inconsistent formats (e.g., 2004/12/25 vs. 25/12/2004)
Check the uniqueness rule, consecutive rule, and null rule (how zeros, blanks, and
refusals to answer are recorded)
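The three rules can be checked with a small script of the "write your own" kind. This is only a sketch: the record layout, the field names order_no and phone, the list of null markers, and the function name check_rules are assumptions for illustration.

```python
def check_rules(records, id_field, checked_field, null_markers=("", None, "N/A")):
    """Report violations of the uniqueness, consecutive, and null rules."""
    ids = [r[id_field] for r in records]
    seen, duplicates = set(), set()
    for value in ids:
        if value in seen:
            duplicates.add(value)  # uniqueness rule: each id appears once
        seen.add(value)
    # Consecutive rule: no missing ids between the lowest and highest.
    gaps = [i for i in range(min(ids), max(ids) + 1) if i not in seen]
    # Null rule: which markers stand for "no value" in the checked field.
    nulls = [r[id_field] for r in records if r[checked_field] in null_markers]
    return {"duplicates": sorted(duplicates), "gaps": gaps, "nulls": nulls}

orders = [
    {"order_no": 101, "phone": "555-0100"},
    {"order_no": 102, "phone": ""},
    {"order_no": 104, "phone": "N/A"},
    {"order_no": 104, "phone": "555-0104"},
]
report = check_rules(orders, "order_no", "phone")
print(report)
```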
Data cleaning contd.
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections (using
parsing and fuzzy matching techniques)
Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., correlation and
clustering to find outliers)
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
(www.control.cs.berkeley.edu.abc)
Iterative and interactive (e.g., Potter's Wheel)
Work in progress: writing declarative languages using
SQL for data cleaning
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales,
e.g., metric vs. British units
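A small sketch of schema integration and value-conflict resolution for two hypothetical sources: source A stores cust-id with weight in kilograms, source B stores cust-# with weight in pounds. The field names, the weight attribute, and the unified layout are assumptions made for the example.

```python
LB_TO_KG = 0.45359237  # international pound, exact by definition

def from_source_a(row):
    """Source A already uses the target names' meaning and metric units."""
    return {"cust_id": row["cust-id"], "weight_kg": row["weight_kg"]}

def from_source_b(row):
    """Source B calls the key cust-# and records weight in pounds."""
    return {"cust_id": row["cust-#"], "weight_kg": row["weight_lb"] * LB_TO_KG}

# Both sources end up in one coherent store with one schema and one scale.
store = [
    from_source_a({"cust-id": 7, "weight_kg": 80.0}),
    from_source_b({"cust-#": 8, "weight_lb": 200.0}),
]
print(store)
```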
ETL
Ralph Speaks
Technical Design Challenges Posed
By The Data Warehouse Evolution
Timeliness
Data Volumes
Response Times
[Figure labels: BI, Legacy Systems, B2C, B2B, CRM, ...]
The Big Picture!
Which Approach Do We Take?
Data Extraction and Preparation
Stage I: Extract
Stage II: Analyze, Clean and Transform
Stage III: Data Movement and Load
Periodic Refresh/Update
The ETL Process
Access data dictionaries defining source files
Build logical and physical data models for target
data
Survey existing systems to identify sources of
data
Specify business and technical rules for data
extraction, conversion and transformation
Perform data extraction and transformation
Load target databases
[Figure: ETL process flow. Labels: Metadata Repository; Data Definitions; Source Databases; Data Modeling Tool; RDBMS; MDDB; Define/Code Extraction Rules; Extract Program Generation; Run Extract Programs; Load Data Warehouse; Source Metadata; Target Metadata; Raw Data; Clean Data]
The ETL Process
[Figure: multiple OLTP systems -> Extract (Stage I) -> staging area -> Transform (Stage II) -> Load (Stage III) -> data warehouse]
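The three stages can be sketched end to end in a few lines. This is a toy illustration, not how any particular ETL tool works. An in-memory SQLite database stands in for the warehouse, and the table name sales, the source rows, and the cleaning rules (drop rows with a missing amount, tidy customer names) are all assumptions.

```python
import sqlite3

def extract(oltp_rows):
    """Stage I: copy raw rows from a source system into the staging area."""
    return [dict(row) for row in oltp_rows]

def transform(staged):
    """Stage II: clean and standardize the staged rows."""
    clean = []
    for row in staged:
        if row["amount"] is None:  # assumed rule: skip incomplete rows
            continue
        clean.append((row["order_id"],
                      row["customer"].strip().title(),
                      round(float(row["amount"]), 2)))
    return clean

def load(conn, rows):
    """Stage III: move the clean rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

source = [
    {"order_id": 1, "customer": " alice smith ", "amount": "19.990"},
    {"order_id": 2, "customer": "BOB JONES", "amount": None},
    {"order_id": 3, "customer": "carol wu", "amount": "5"},
]
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM sales ORDER BY order_id").fetchall()
print(loaded)
```

Using INSERT OR REPLACE means a periodic refresh can re-run the same load without creating duplicate rows for an existing order_id.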
The ETL Process
Data Extraction - Simplified
ETL Tools - Classification
First-generation
Code-generation products
Generate the source code
Second-generation
Engine-driven products
Generate directly executable code
ETL Tools - Classification
Owing to a more efficient architecture, second-generation
tools have a significant advantage over
first-generation tools
ETL Tools - First-Generation
Strengths
Tools are mature
Programmers are familiar
with code generation in
COBOL or C
Limitations
High cost of products
Complex training
Extract programs have to be
compiled from source
Many transformations have
to be coded manually
Lack of parallel execution
support
Most metadata has to be
generated manually
Characterized by the Generation and Deployment of Multiple Codes
ETL Tools - Second-Generation
Extraction/Transformation/Load runs on server
Data directly extracted from source and processed on server
Data transformation in memory and written directly to warehouse
database. High throughput since intermediate files are not used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing,
transformation, load, index, aggregation, metadata
Characterized by the Transformation Engine
ETL Tools - Second-Generation
PowerCenter/PowerMart from Informatica
Data Mart Solution from Sagent Technology
DataStage from Ascential
ETL Tools - Selection
Support to retrieve, cleanse, transform,
summarize, aggregate, and load data
Engine-driven products for fast, parallel operation
Generate and manage central metadata repository
Open metadata exchange architecture
Provide end-users with access to metadata in
business terms
Support development of logical and physical data
models
Data Loading - First Time Loads
First load is a complex
exercise
Data extracted from
tapes, files, archives etc.
First time load might take
several days to complete
Extract, Clean, Transform, etc.
Source: www.survey.com
ETL Trends
DWH market is growing at 40-45% p.a.
Meta data management is shaping the market
Real time CRM requires real time DWH
E-comm and E-business are fuelling DWH & BI
ERP Data Warehousing is in demand
Major Trends
Source: Cutter Report - May 2000
ETL Trends
ETL technology built into other BI products
XML enabled platform independent data traffic
Near Real Time Data Warehouses using
middleware
Vendors have evolved their products into data
mart/analytical platforms
The Data Mart Strategy
The most common approach
Begins with a single mart and architected marts are added
over time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data warehousing
Can perpetuate the silos of information problem
Can postpone difficult decisions and activities
Requires an overall integration plan
Data Sources and Types
Primarily from legacy, operational systems
Almost exclusively numerical data at the present
time
External data may be included, often purchased
from third-party sources
Technology exists for storing unstructured data,
and this is expected to become more important over
time
Extraction, Transformation,
and Loading (ETL) Processes
The plumbing work of data warehousing
Data are moved from source to target databases
A very costly, time consuming part of data
warehousing
Recent Development:
More Frequent Updates
Updates can be done in bulk and trickle modes
Business requirements, such as trading partner
access to a Web site, require current data
For international firms, there is no good time to
load the warehouse
Recent Development:
Clickstream Data
Results from clicks at web sites
A dialog manager handles user interactions. An
ODS (operational data store in the data staging
area) helps to custom tailor the dialog
The clickstream data is filtered and parsed and
sent to a data warehouse where it is analyzed
Software is available to analyze the clickstream
data
Data Extraction
Often performed by COBOL routines
(not recommended because of high program
maintenance and no automatically generated meta
data)
Sometimes source data is copied to the target
database using the replication capabilities of a
standard RDBMS (not recommended because of
dirty data in the source systems)
Increasingly performed by specialized ETL software
Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird
Communications
Reasons for Dirty Data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys,
Non-Unique Identifiers
Data Integration Problems
Data Cleansing
Source systems contain dirty data that must be
cleansed
ETL software contains rudimentary data cleansing
capabilities
Specialized data cleansing software is often used.
Important for performing name and address
correction and householding functions
Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and Firstlogic
(i.d.Centric)
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
Examples include parsing the first, middle, and
last name; street number and street name; and
city and state.
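A minimal sketch of parsing, assuming simple "First Middle Last" names and addresses written as "number street, city, ST". Real cleansing tools handle far more variation. The pattern and function names are hypothetical.

```python
import re

def parse_name(full_name):
    """Split a full name into first, middle, and last components."""
    parts = full_name.split()
    return {"first": parts[0],
            "middle": " ".join(parts[1:-1]),
            "last": parts[-1]}

# Assumed layout: "<street number> <street name>, <city>, <2-letter state>"
ADDRESS_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+),\s*(?P<city>[^,]+),"
    r"\s*(?P<state>[A-Za-z]{2})\s*$")

def parse_address(text):
    """Return the isolated address elements, or None if the layout differs."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None

name = parse_name("William Jefferson Clinton")
address = parse_address("1600 Pennsylvania Ave, Washington, DC")
print(name)
print(address)
```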
Correcting
Corrects parsed individual data components using
sophisticated data algorithms and secondary data
sources.
Examples include replacing a vanity address and
adding a zip code.
Standardizing
Standardizing applies conversion routines to
transform data into its preferred (and consistent)
format using both standard and custom business
rules.
Examples include adding a name prefix (e.g., a title), replacing a
nickname, and using a preferred street name.
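Standardizing is often table-driven. The sketch below assumes two tiny lookup tables (nicknames and street suffixes) that a real system would take from business rules or reference data. All names and table entries are illustrative.

```python
# Hypothetical lookup tables standing in for custom business rules.
NICKNAMES = {"bill": "William", "bob": "Robert"}
STREET_SUFFIXES = {"st": "Street", "st.": "Street",
                   "ave": "Avenue", "ave.": "Avenue"}

def standardize_first_name(name):
    """Replace a known nickname; otherwise just normalize the casing."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned.title())

def standardize_street(street):
    """Normalize casing and expand an abbreviated street suffix."""
    words = [w.title() for w in street.split()]
    if words and words[-1].lower() in STREET_SUFFIXES:
        words[-1] = STREET_SUFFIXES[words[-1].lower()]
    return " ".join(words)

print(standardize_first_name("Bill"))
print(standardize_street("pennsylvania ave."))
```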
Matching
Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
Examples include identifying similar names and
addresses.
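One simple way to identify similar names and addresses is a string-similarity score with a cutoff. The sketch below uses difflib.SequenceMatcher from the Python standard library. The 0.85 threshold and the choice to compare name and address as one string are assumptions standing in for the predefined business rules.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(records, threshold=0.85):
    """Return index pairs of records that look like the same entity."""
    keys = [f"{r['name']} {r['address']}" for r in records]
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if similarity(keys[i], keys[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = [
    {"name": "William Clinton", "address": "1600 Pennsylvania Avenue"},
    {"name": "Wiliam Clinton", "address": "1600 Pennsylvania Avenue"},  # typo
    {"name": "Zed Quux", "address": "9 Oak Road"},
]
matches = find_matches(records)
print(matches)
```

Comparing every pair is quadratic in the number of records, so larger data sets usually need some form of blocking (for example, comparing only records that share a postal code).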
Consolidating
Analyzing and identifying relationships between
matched records and consolidating/merging them
into ONE representation.
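A minimal sketch of consolidating matched records into one representation. The rule used here, first non-empty value wins, is only one possible assumption. Real tools let you configure which source or which record takes precedence.

```python
def consolidate(matched_records):
    """Merge records judged to be the same entity into a single record."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            # Keep the first non-empty value seen for each field.
            if field not in merged or (not merged[field] and value):
                merged[field] = value
    return merged

golden = consolidate([
    {"name": "William Clinton", "zip": ""},
    {"name": "Bill Clinton", "zip": "20500"},
])
print(golden)
```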
Data Staging
Often used as an interim step between data extraction and
later steps
Accumulates data from asynchronous sources using
native interfaces, flat files, FTP sessions, or other
processes
At a predefined cutoff time, data in the staging file is
transformed and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
Data Transformation
Transforms the data in accordance with the
business rules and standards that have been
established
Examples include: format changes, deduplication,
splitting up fields, replacement of codes, derived
values, and aggregates
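The listed transformation types can each be shown in a line or two. The sketch below uses an invented order record layout. The field names, the gender code table, and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code table

def apply_transformations(rows):
    seen, out = set(), []
    for row in rows:
        if row["order_id"] in seen:  # deduplication
            continue
        seen.add(row["order_id"])
        # Splitting up fields: "City, ST" becomes two columns.
        city, state = [p.strip() for p in row["city_state"].split(",", 1)]
        out.append({
            "order_id": row["order_id"],
            # Format change: DD/MM/YYYY to ISO 8601.
            "order_date": datetime.strptime(
                row["order_date"], "%d/%m/%Y").date().isoformat(),
            "city": city,
            "state": state,
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # code replacement
            "total": row["quantity"] * row["unit_price"],  # derived value
        })
    return out

def total_by_state(rows):
    """Aggregate: sum of order totals per state."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["state"]] += row["total"]
    return dict(totals)

raw = [
    {"order_id": 1, "order_date": "25/12/2004", "city_state": "Austin, TX",
     "gender": "F", "quantity": 2, "unit_price": 10.0},
    {"order_id": 1, "order_date": "25/12/2004", "city_state": "Austin, TX",
     "gender": "F", "quantity": 2, "unit_price": 10.0},
    {"order_id": 2, "order_date": "01/01/2005", "city_state": "Dallas, TX",
     "gender": "M", "quantity": 1, "unit_price": 5.0},
]
clean = apply_transformations(raw)
print(clean)
print(total_by_state(clean))
```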
Data Loading
Data are physically moved to the data warehouse
The loading takes place within a load window
The trend is to near real time updates of the data
warehouse as the warehouse is increasingly used
for operational applications
Meta Data
Data about data
Needed by both information technology personnel
and users
IT personnel need to know data sources and
targets; database, table and column names;
refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions;
reports/query tools available; report distribution
information; help desk contact information, etc.
Recent Development:
Meta Data Integration
A growing realization that meta data is critical to
data warehousing success
Progress is being made on getting vendors to
agree on standards and to incorporate the sharing
of meta data among their tools
Vendors like Microsoft, Computer Associates, and
Oracle have entered the meta data marketplace
with significant product offerings
That's a lot of ETL.
Back to basics: let us revise