
bi-unit-3


What Is ETL?

Extract Transform Load

ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data cleaning, data transportation, data transformation, data consolidation, data integration and data loading.
ETL
• ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers the process of how data are loaded from the source system into the data warehouse. Currently, ETL also includes a cleaning step as a separate step. The sequence is then Extract-Clean-Transform-Load.
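As a concrete illustration (not from the slides), the minimal sketch below strings the four steps together over a hypothetical CSV source file; the file names, columns and per-step logic are assumptions for illustration only.

    # Minimal Extract-Clean-Transform-Load sketch; all file names and
    # columns (qty, price) are illustrative assumptions.
    import csv

    def extract(path):
        """Read raw rows from a source file."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def clean(rows):
        """Apply basic unification rules (trim text, standardize nulls)."""
        return [{k: (v.strip() if v and v.strip() else "Not Available")
                 for k, v in row.items()} for row in rows]

    def transform(rows):
        """Derive a calculated value (line total) where the inputs parse."""
        for row in rows:
            try:
                row["total"] = float(row["qty"]) * float(row["price"])
            except (KeyError, ValueError):
                row["total"] = None
        return rows

    def load(rows, path):
        """Write the processed rows to the target file."""
        if rows:
            with open(path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)

    if __name__ == "__main__":
        load(transform(clean(extract("sales_source.csv"))), "sales_target.csv")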
Data Quality
Data quality is a critical factor for the success of data warehousing projects. If
data is of inadequate quality, then the knowledge workers who query the
data warehouse and the decision makers who receive the information cannot
trust the results.
Data quality has to be maintained for individual records, or even small bits of information, to ensure the accuracy of the complete database.
Three powerful forces have converged to put data quality concerns near the top
of the list for executives:
• First, the long-term cultural trend that says, “If only I could see the data,
then I could manage my business better” continues to grow; today’s
knowledge workers believe instinctively that data is a crucial requirement for
them to function in their jobs.
• Second, most organizations understand their data sources are profoundly
distributed, typically around the world, and that effectively integrating a
myriad of disparate data sources is required.
• And third, the sharply increased demands for compliance mean careless
handling of data will not be overlooked or excused.
Data Profiling
• Data profiling is the technical analysis of data to
describe its content, consistency, and structure.
• In some sense, any time you perform a SELECT
DISTINCT investigative query on a database field,
you are doing data profiling.
• There are a variety of tools specifically designed to do powerful profiling.
• Data profiling plays two distinct roles: strategic
and tactical.
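A minimal profiling sketch in the spirit of the SELECT DISTINCT investigation mentioned above: it summarizes one column's distinct values, null rate and value lengths. The sample gender column is made up for illustration.

    from collections import Counter

    def profile_column(values):
        """Return a simple content/consistency profile for one column."""
        non_null = [v for v in values if v not in (None, "", "null")]
        counts = Counter(non_null)
        return {
            "row_count": len(values),
            "null_count": len(values) - len(non_null),
            "distinct_count": len(counts),
            "top_values": counts.most_common(3),
            "min_length": min((len(str(v)) for v in non_null), default=0),
            "max_length": max((len(str(v)) for v in non_null), default=0),
        }

    if __name__ == "__main__":
        genders = ["M", "F", "Male", "female", None, "F", "", "M"]
        print(profile_column(genders))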
Data enrichment
• Data enrichment is adding or combining external data values and rules to enrich the information already existing in the data.
• For example, if we can get a list that relates Zip Code, City and State, then an address field with Zip code 06905 can safely be enriched by a lookup on this table: Zip Code 06905 –> City Stamford –> State CT.
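A small sketch of this ZIP-code enrichment; the in-memory lookup dictionary stands in for the external reference list.

    # ZIP -> (City, State) reference data; an illustrative stand-in for the
    # external list described above.
    ZIP_LOOKUP = {
        "06905": ("Stamford", "CT"),
        "10001": ("New York", "NY"),
    }

    def enrich_address(address):
        """Add city and state from the ZIP lookup when they are missing."""
        zip_code = address.get("zip")
        if zip_code in ZIP_LOOKUP and not address.get("city"):
            address["city"], address["state"] = ZIP_LOOKUP[zip_code]
        return address

    if __name__ == "__main__":
        print(enrich_address({"street": "1 Main St", "zip": "06905"}))
        # -> {'street': '1 Main St', 'zip': '06905', 'city': 'Stamford', 'state': 'CT'}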
Data Deduplication
• Data Deduplication is a process of removing redundant data. The result of Data Deduplication is a unique list of values which can be used to populate data warehouse dimension tables.
• The typical business example would be creating a unique list of customers.
• Data deduplication is a technique for reducing the
amount of storage space an organization needs to
save its data.
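A minimal deduplication sketch producing a unique customer list; the matching key (lower-cased name plus ZIP) is an illustrative assumption, since real matching rules are usually richer.

    def deduplicate_customers(customers):
        """Return one record per (name, zip) key, keeping the first occurrence."""
        seen = {}
        for cust in customers:
            key = (cust["name"].strip().lower(), cust.get("zip", ""))
            if key not in seen:
                seen[key] = cust
        return list(seen.values())

    if __name__ == "__main__":
        feed = [
            {"name": "Acme Corp", "zip": "06905"},
            {"name": "ACME CORP ", "zip": "06905"},
            {"name": "Widget Inc", "zip": "10001"},
        ]
        print(deduplicate_customers(feed))  # two unique customers remain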
ETL architecture
Extract System
• The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:

• Update notification - if the source system is able to provide a notification that a record has been changed and can describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify these changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly. (A sketch of an incremental extract follows this list.)
• Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out
of the system. The full extract requires keeping a copy of the last extract
in the same format in order to be able to identify changes. Full extract
handles deletions as well.
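The sketch below shows an incremental extract, assuming the source table carries a last-updated timestamp column; the sqlite schema and the watermark value are illustrative only.

    import sqlite3

    def incremental_extract(conn, last_extracted_at):
        """Return rows changed since the previous extract (the watermark)."""
        cur = conn.execute(
            "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
            (last_extracted_at,),
        )
        return cur.fetchall()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
        conn.executemany(
            "INSERT INTO customers VALUES (?, ?, ?)",
            [(1, "Acme", "2024-01-01"), (2, "Widget", "2024-03-15")],
        )
        # Only the row updated after the watermark is extracted; note that
        # deletions in the source are not detected this way.
        print(incremental_extract(conn, "2024-02-01"))  # [(2, 'Widget', '2024-03-15')]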
Clean
• The cleaning step is one of the most important as it ensures the
quality of the data in the data warehouse. Cleaning should perform
basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown,
M/F/null, Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
• Convert null values into standardized Not Available/Not Provided
value
• Convert phone numbers, ZIP codes to a standardized form
• Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
• Validate address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).
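A sketch applying some of the unification rules listed above (standardizing sex categories, nulls and ZIP codes); the mapping table and field names are illustrative assumptions.

    import re

    # Illustrative mapping of source sex codes to the standard categories.
    GENDER_MAP = {
        "m": "Male", "male": "Male", "man": "Male",
        "f": "Female", "female": "Female", "woman": "Female",
    }

    def clean_record(rec):
        """Apply basic unification rules to one source record."""
        out = dict(rec)

        # Sex categories translated to standard Male/Female/Unknown.
        sex = (out.get("sex") or "").strip().lower()
        out["sex"] = GENDER_MAP.get(sex, "Unknown")

        # Null-like values become a standardized "Not Available".
        for field in ("phone", "city"):
            if not out.get(field):
                out[field] = "Not Available"

        # ZIP codes reduced to a standardized 5-digit form when possible.
        digits = re.sub(r"\D", "", out.get("zip", ""))
        out["zip"] = digits[:5] if len(digits) >= 5 else "Not Available"
        return out

    if __name__ == "__main__":
        print(clean_record({"sex": "M", "zip": "06905-1234", "phone": None}))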
Transform
• The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
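A small sketch of such a transform: conforming amounts to one currency, generating surrogate keys and aggregating. The tables, keys and exchange rates are made up for illustration.

    from collections import defaultdict
    from itertools import count

    surrogate_keys = {}          # natural key -> surrogate key
    next_key = count(1)

    def surrogate_key(natural_key):
        """Assign a stable surrogate key for each natural key."""
        if natural_key not in surrogate_keys:
            surrogate_keys[natural_key] = next(next_key)
        return surrogate_keys[natural_key]

    def transform(sales, fx_rates):
        """Convert amounts to one currency (conformed units) and aggregate."""
        totals = defaultdict(float)
        for sale in sales:
            key = surrogate_key(sale["customer_id"])
            usd = sale["amount"] * fx_rates[sale["currency"]]
            totals[key] += usd
        return dict(totals)

    if __name__ == "__main__":
        sales = [
            {"customer_id": "C-1", "amount": 100.0, "currency": "EUR"},
            {"customer_id": "C-1", "amount": 50.0, "currency": "USD"},
        ]
        # Aggregated USD totals keyed by surrogate key.
        print(transform(sales, {"USD": 1.0, "EUR": 1.1}))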
Load
• During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the Load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. The referential integrity needs to be maintained by the ETL tool to ensure consistency.
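A minimal load sketch using sqlite3, dropping the index before the bulk insert and recreating it afterwards; sqlite is used here only to keep the example self-contained, since warehouse platforms handle constraints and indexes differently.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_sales (customer_key INTEGER, amount REAL)")
    conn.execute("CREATE INDEX ix_fact_customer ON fact_sales (customer_key)")

    rows = [(1, 110.0), (1, 50.0), (2, 75.0)]

    # Disable (drop) the index before the load ...
    conn.execute("DROP INDEX ix_fact_customer")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()
    # ... and enable (recreate) it only after the load completes.
    conn.execute("CREATE INDEX ix_fact_customer ON fact_sales (customer_key)")

    print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone())  # (3,)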
Change Data Capture System
• During the data warehouse’s initial historic load,
capturing source data content changes is not
important because you load all data from a point
in time forward. However, many data warehouse
tables are so large that they cannot be refreshed
during every ETL cycle.
• The idea behind CDC is to transfer just the data that has changed since the last load. But building a good CDC system is not as easy as it sounds.
The key goals for the change data capture subsystem are:

• Isolate the changed source data to allow selective processing rather than a complete refresh.
• Capture all changes (deletions, edits, and insertions)
made to the source data, including changes made
through nonstandard interfaces.
• Tag changed data with reason codes to distinguish
error corrections from true updates.
• Support compliance tracking with additional metadata.
• Perform the CDC step as early as possible, preferably
before a bulk data transfer to the data warehouse.
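One simple CDC approach, sketched below for a source that cannot report its own changes, is to compare the current full extract against the previous snapshot to isolate inserts, updates and deletes. Keys and snapshot contents are illustrative.

    def capture_changes(previous, current):
        """Diff two snapshots keyed by natural key."""
        inserts = [current[k] for k in current.keys() - previous.keys()]
        deletes = [previous[k] for k in previous.keys() - current.keys()]
        updates = [current[k] for k in current.keys() & previous.keys()
                   if current[k] != previous[k]]
        return {"inserts": inserts, "updates": updates, "deletes": deletes}

    if __name__ == "__main__":
        previous = {1: {"id": 1, "name": "Acme"}, 2: {"id": 2, "name": "Widget"}}
        current = {1: {"id": 1, "name": "Acme Corp"}, 3: {"id": 3, "name": "Globex"}}
        print(capture_changes(previous, current))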
Data Mart
• Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.
Points to Remember About Data Marts
• Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short
periods of time, i.e., in weeks rather than months or years.
• The life cycle of data marts may be complex in the long run, if
their planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.
Data Cube
• A data cube helps us represent data in
multiple dimensions. It is defined by
dimensions and facts. The dimensions are the
entities with respect to which an enterprise
preserves the records.
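A tiny sketch of a cube: the sales amount measure is aggregated over every combination (cuboid) of three assumed dimensions. The fact rows are made up.

    from collections import defaultdict
    from itertools import combinations

    DIMENSIONS = ("time", "item", "location")

    facts = [
        {"time": "Q1", "item": "phone", "location": "Delhi", "amount": 100},
        {"time": "Q1", "item": "phone", "location": "Mumbai", "amount": 80},
        {"time": "Q2", "item": "laptop", "location": "Delhi", "amount": 150},
    ]

    def build_cube(rows):
        """Aggregate the amount measure for each subset of dimensions (cuboid)."""
        cube = defaultdict(float)
        for row in rows:
            for r in range(len(DIMENSIONS) + 1):
                for dims in combinations(DIMENSIONS, r):
                    key = (dims, tuple(row[d] for d in dims))
                    cube[key] += row["amount"]
        return cube

    if __name__ == "__main__":
        cube = build_cube(facts)
        print(cube[((), ())])               # grand total: 330.0
        print(cube[(("time",), ("Q1",))])   # Q1 total: 180.0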
Load Manager
• This component performs the operations required by the extract and load process.
• The size and complexity of the load manager varies
between specific solutions from one data warehouse to
another.
Load Manager Architecture
The load manager performs the following functions:
• Extract the data from the source system.
• Fast-load the extracted data into a temporary data store.
• Perform simple transformations into a structure similar to the one in the data warehouse.
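A sqlite-based sketch covering these three load-manager functions; the table names and the simple transformations are assumptions for illustration.

    import sqlite3

    def load_manager(source_rows):
        conn = sqlite3.connect(":memory:")

        # Fast-load the extracted data into a temporary data store.
        conn.execute("CREATE TEMP TABLE staging_sales (cust TEXT, amount REAL)")
        conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", source_rows)

        # Perform simple transformations into a warehouse-like structure.
        conn.execute("CREATE TABLE fact_sales (cust TEXT, amount REAL)")
        conn.execute(
            "INSERT INTO fact_sales SELECT UPPER(cust), ROUND(amount, 2) "
            "FROM staging_sales"
        )
        conn.commit()
        return conn.execute("SELECT * FROM fact_sales").fetchall()

    if __name__ == "__main__":
        print(load_manager([("acme", 10.456), ("widget", 7.1)]))
        # [('ACME', 10.46), ('WIDGET', 7.1)]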
Late Arriving facts
• Data warehouses are usually built around the ideal
assumption that measured activity (fact records)
arrive in the data warehouse at the same time as the
context of the activity (dimension records).
• When you have both the fact records and the correct
contemporary dimension rows, you have the luxury of
first maintaining the dimension keys and then using
these up-to-date keys in the accompanying fact rows.
However, for a variety of reasons, the ETL system may
need to process late arriving fact or dimension data.
• In some environments, there may need to be special
modifications to the standard processing procedures to deal
with late arriving facts, namely fact records that come into
the warehouse very much delayed.
• Late arriving dimensions occur when the activity
measurement (fact record) arrives at the data warehouse
without its full context
• For example, the identification of the new customer may
come in a separate feed delayed by several hours; you may
just be able to wait until the dependency is resolved. But in
many situations, especially real-time environments, this
delay is not acceptable. You cannot suspend the rows and
wait for the dimension updates to occur; the business
requirements demand that you make the fact row visible
before knowing the dimensional context.
• The ETL system needs additional capabilities to support this.
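One common way to support this, sketched below with illustrative names, is to assign a placeholder surrogate key to the early-arriving fact so it becomes visible immediately, and to correct the key once the dimension row finally arrives.

    UNKNOWN_KEY = -1

    dimension = {}        # natural customer id -> surrogate key
    pending_facts = []    # facts still carrying the placeholder key
    fact_table = []

    def load_fact(fact):
        """Load a fact now, even if its dimension row has not arrived yet."""
        key = dimension.get(fact["customer_id"], UNKNOWN_KEY)
        row = {"customer_key": key, **fact}
        fact_table.append(row)
        if key == UNKNOWN_KEY:
            pending_facts.append(row)

    def load_dimension(customer_id, surrogate_key):
        """When the late dimension arrives, fix up any pending facts."""
        dimension[customer_id] = surrogate_key
        for row in list(pending_facts):
            if row["customer_id"] == customer_id:
                row["customer_key"] = surrogate_key
                pending_facts.remove(row)

    if __name__ == "__main__":
        load_fact({"customer_id": "C-9", "amount": 42.0})  # dimension not there yet
        load_dimension("C-9", 101)                          # arrives hours later
        print(fact_table)  # customer_key is now 101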
Staging

• It should be possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the Staging Area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. However, the staging area should be accessed by the ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-processing data.
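A minimal staging sketch: each phase dumps its output to a file in the staging area so a later phase can restart from it without re-running the Extract. Paths and formats are illustrative assumptions.

    import json
    import os

    STAGING_DIR = "staging_area"

    def stage_write(phase, rows):
        """Dump a phase's output to the staging area."""
        os.makedirs(STAGING_DIR, exist_ok=True)
        with open(os.path.join(STAGING_DIR, f"{phase}.json"), "w") as f:
            json.dump(rows, f)

    def stage_read(phase):
        """Read a previous phase's output back from the staging area."""
        with open(os.path.join(STAGING_DIR, f"{phase}.json")) as f:
            return json.load(f)

    if __name__ == "__main__":
        stage_write("extract", [{"id": 1, "amount": 10}])
        # If the transform fails later, it can be rerun from this file
        # without touching the source system again.
        print(stage_read("extract"))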
Look ups
• Lookup transformation - returns values from a
database table or a flat file associated with a
given input value.
• It can be used to look up data in a flat file, relational table, view or synonym.
• There are the following Lookup transformation
types:
• Dynamic Lookup (connected)
• Unconnected lookup
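A minimal lookup sketch: return a value from a reference table for a given input value, with a default when no match is found. The in-memory country table stands in for the flat file or relational reference source.

    # Illustrative reference data standing in for a flat file or table.
    COUNTRY_LOOKUP = {"IN": "India", "US": "United States", "CZ": "Czechia"}

    def lookup(table, value, default="Unknown"):
        """Return the reference value associated with the given input value."""
        return table.get(value, default)

    if __name__ == "__main__":
        for row in [{"cust": "Acme", "country_code": "US"},
                    {"cust": "Widget", "country_code": "XX"}]:
            row["country_name"] = lookup(COUNTRY_LOOKUP, row["country_code"])
            print(row)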
Time-lag
• The Time-lag (or Backlog) is one of the key performance indicators of ETL. It is defined here as the difference between today's date and the date of the latest loaded data.
• This Time-lag is due to the time needed to collect, load and process the data in the ETL procedure.
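The KPI defined above can be computed directly; the dates below are illustrative.

    from datetime import date

    def time_lag(latest_loaded, today=None):
        """Return the time-lag (backlog) in days: today minus latest loaded date."""
        today = today or date.today()
        return (today - latest_loaded).days

    if __name__ == "__main__":
        print(time_lag(date(2024, 3, 10), today=date(2024, 3, 14)))  # 4 days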
Scheduling and Dependency Matrix
• The Dependency Scheduling Matrix (DSM) has been identified as a powerful tool to plan activity sequences and to identify and manage information exchanges. However, its application in scheduling is very limited. So far, DSM has been used to enable critical path calculations by assigning the amount of effort/work as the duration of the activities.
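A small sketch in this spirit: ETL jobs, their dependencies and assumed durations are ordered topologically and the earliest finish time (the critical-path length) is computed. Job names and durations are made up.

    from graphlib import TopologicalSorter  # Python 3.9+

    durations = {"extract": 2, "clean": 1, "transform": 3, "load": 1}
    depends_on = {
        "clean": {"extract"},
        "transform": {"clean"},
        "load": {"transform"},
    }

    def schedule(durations, depends_on):
        """Return jobs in dependency order with their earliest finish times."""
        order = list(TopologicalSorter(depends_on).static_order())
        finish = {}
        for job in order:
            start = max((finish[d] for d in depends_on.get(job, ())), default=0)
            finish[job] = start + durations[job]
        return order, finish

    if __name__ == "__main__":
        order, finish = schedule(durations, depends_on)
        print(order)           # ['extract', 'clean', 'transform', 'load']
        print(finish["load"])  # 7 (critical-path length, e.g. in hours)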
