Data Warehousing ETL Checklist
INTRODUCTION
ETL (Extract, Transform and Load) is the process by which data from multiple
systems is consolidated, typically in a data warehouse, so that executives can obtain a
comprehensive picture of business functions, e.g. the relationships between marketing
campaigns, sales, production and distribution. It has been estimated that in any data
integration project, ETL will consume 70 to 80 percent of the time and resources.
The process is complicated and has been the subject of numerous books. Anyone
undertaking an ETL project for the first time will have to do some serious research. Here
is a high-level checklist of the important topics.
CHECKLIST
• SCOPE. The conventional wisdom is, "Don't try to boil the ocean." It's more
important to deliver business results than it is to have a comprehensive program.
In some cases, it may be better for the business as a whole to leave certain data
sources untouched.
• TARGET SYSTEM CONTENT. The content of the target system will drive the
whole project, and will of course be determined by business needs. Specifically,
target system content will determine which source systems will be involved.
• DATA SOURCES (SOURCE SYSTEMS). The first step in the process is
identifying the systems from which the data will be extracted. There are two
broad categories:
Internal data. This is data from within your organization. From a technical point
of view, sources can range from ERP, CRM and legacy applications to flat files
and even Excel spreadsheets. It's important to become very familiar with the
data in all internal sources as part of the planning process.
External data. Often, when a database (e.g. a data warehouse) is to be used for
decision support, its usefulness can be greatly enhanced when the internal data
is supplemented with external data such as demographic information on
customers.
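For illustration, here is a minimal sketch (in Python, since the checklist prescribes no
tooling) of pulling one internal flat-file source and one external demographic file into a
common structure. The file names and the customer_id key are hypothetical.

    import csv

    def load_csv(path, key):
        """Read a flat-file source into a dict keyed on the given column."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    # "customers.csv" (internal) and "demographics.csv" (purchased external
    # data) are placeholder names.
    internal = load_csv("customers.csv", "customer_id")
    external = load_csv("demographics.csv", "customer_id")

    # Supplement each internal record with external demographic fields,
    # where a match exists.
    for cust_id, record in internal.items():
        record.update(external.get(cust_id, {}))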
• OWNERSHIP. It's critical to determine who will take responsibility for the data
in its new "home," and who will be responsible for maintaining the update
process (which includes taking responsibility for the data's accuracy).
• MODE OF EXECUTION. There are three options for executing the ETL process:
Home grown. Writing code in-house used to be the most common approach to
ETL. This approach is often the easiest for small projects, and has the advantage
of being able to handle the idiosyncrasies of unusual data formats. On the
negative side, home grown code requires maintenance over time and often has
scalability problems.
3rd party bolt-on. Bolt-on modules for existing systems are often a convenient
approach; however, they often mandate data formats that are not very flexible
and can cause trouble when it comes to accommodating data from other sources.
Packaged systems. Systems from pure-play data integration companies offer
flexibility and relative ease of use, but may be costly and require more training
than the other two solutions.
• DATA PROFILING. This process provides metrics on the quality of the data in
source systems prior to beginning the project and can help predict the difficulties
that will be involved in data re-use.
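A profiling pass can be as simple as per-column null and distinct-value counts. The
sketch below assumes a CSV extract named source_extract.csv; the file name and the
metrics chosen are illustrative, not prescribed by the checklist.

    import csv
    from collections import Counter, defaultdict

    def profile(path):
        """Print null counts and distinct-value counts for each column."""
        nulls, distinct, total = Counter(), defaultdict(set), 0
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                total += 1
                for col, val in row.items():
                    if val is None or not val.strip():
                        nulls[col] += 1        # empty cell counts as null
                    else:
                        distinct[col].add(val)
            for col in reader.fieldnames:
                print(f"{col}: {nulls[col]}/{total} null, "
                      f"{len(distinct[col])} distinct")

    profile("source_extract.csv")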
• DATA TRANSFORMATION
Cleansing. Source system data typically must be cleansed. The most important
aspect of cleansing is usually "de-duping": the removal of multiple files
identifying the same item or person, e.g. J. Smith and John Smith, both with the
same address (see the sketch after this list). Cleansing also involves removal
(or correction) of files with incorrect data, e.g. an address in the name field,
and establishing default values.
Reformatting. The data must be standardized in terms of nomenclature and
format.
Enhancement. Data associated with marketing is often enhanced via external
sources, which creates a requirement for additional fields beyond those
associated with the internal data.
Aggregation/Calculation. If there are to be aggregated or calculated fields, it's
necessary to determine at what point in the process the aggregation/calculation
will take place. This is an issue during the initial population of the new database
and in its ongoing maintenance.
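Here is a hedged sketch of the de-duping step described above: records that agree on a
normalized (surname, address) key are flagged as candidate duplicates. The field names
and the matching rule are illustrative assumptions; real name-and-address matching
usually needs far more sophisticated logic.

    from collections import defaultdict

    def match_key(record):
        """Crude duplicate-candidate key: surname plus standardized address.
        "J. Smith" and "John Smith" share the surname "smith"."""
        surname = record["name"].split()[-1].lower()
        address = " ".join(record["address"].lower().split())
        return (surname, address)

    records = [
        {"name": "J. Smith",   "address": "12 Oak St"},
        {"name": "John Smith", "address": "12  Oak St"},
        {"name": "Ann Jones",  "address": "5 Elm Ave"},
    ]

    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    # Any group with more than one record is reviewed or merged.
    duplicates = [g for g in groups.values() if len(g) > 1]
    print(duplicates)   # the two Smith records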
• NOMENCLATURE. Naming conventions can have a disproportionate effect on
user satisfaction, and users should by all means be involved in decisions involving
names.
Field names. One issue is what to name the new fields, e.g. "Sex" vs. "Gender"
vs. "Mr./Mrs./Ms." Whenever possible, the field names in the target system
should match (or be derived from) the field names in the source systems. It is
easier for all involved if the data model column names also match the target
system field names.
Data names. The same issue exists with data names. The paint called "Red 38"
by the ERP may be called "Hot Crimson" in the marketing database.
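One way to make such naming decisions enforceable is to keep them in a single explicit
mapping that the load code consults. Everything in this sketch (system names, field
names, the paint values) is hypothetical.

    # (source system, source field) -> agreed target field name
    FIELD_NAME_MAP = {
        ("erp", "sex"):     "gender",
        ("crm", "gender"):  "gender",
        ("erp", "cust_nm"): "customer_name",
    }

    # Data values need the same treatment as field names.
    DATA_NAME_MAP = {
        ("erp", "Red 38"): "Hot Crimson",
    }

    def target_field(system, field):
        """Fall back to the lower-cased source name when no decision exists."""
        return FIELD_NAME_MAP.get((system, field.lower()), field.lower())

    print(target_field("erp", "Sex"))   # -> gender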
• METADATA. There are two types of metadata to be considered.
Technical metadata is concerned with data types, lengths, source mappings and
other details that are relevant to developers.
Business metadata is information that would be potentially useful to end users,
such as valid values, mandatory vs. optional indications, and source systems.
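The two kinds of metadata can be recorded side by side for each target column, as in
this illustrative (and entirely hypothetical) structure:

    column_metadata = {
        "column": "gender",
        "technical": {                       # aimed at developers
            "type": "CHAR(1)",
            "length": 1,
            "source_mapping": "erp.customer.sex",
        },
        "business": {                        # aimed at end users
            "valid_values": ["M", "F", "U"],
            "mandatory": False,
            "source_system": "ERP",
        },
    }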
• SECURITY. Security is extremely important with sensitive data, e.g. customer
records that include personal data like date of birth or financial information such
as credit card numbers.
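A common precaution is to mask or pseudonymize sensitive fields before they leave the
controlled environment. The sketch below only illustrates the idea; a real deployment
needs proper key management and access control, and the salt shown is a placeholder.

    import hashlib

    def pseudonymize(value, salt="per-project-secret"):
        """Replace an identifying value with a stable, irreversible token."""
        return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

    def mask_card(number):
        """Keep only the last four digits of a card number."""
        return "*" * (len(number) - 4) + number[-4:]

    print(pseudonymize("1980-04-12"))      # date of birth -> pseudonym
    print(mask_card("4111111111111111"))   # -> ************1111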
• TIMING. Once the initial data has been loaded into the target system, it's
necessary to determine how often it will be refreshed. This depends primarily on
business needs. Do managers need to track a number (sales, inventory, hours, etc.)
on a quarterly, monthly, weekly or daily basis? Other considerations involve the
quantity of data to be transferred and the speed of the process.
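The quantity-of-data concern is often addressed by refreshing incrementally from a
"last successful load" watermark rather than reloading everything. A minimal sketch,
assuming a source table with an updated_at column (all names hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_orders (id INTEGER, updated_at TEXT)")
    conn.executemany("INSERT INTO source_orders VALUES (?, ?)",
                     [(1, "2024-01-01"), (2, "2024-02-01")])

    def incremental_refresh(conn, watermark):
        """Pull only rows changed since the last load; return the new watermark."""
        rows = conn.execute(
            "SELECT id, updated_at FROM source_orders WHERE updated_at > ?",
            (watermark,)).fetchall()
        # ... transform and load `rows` into the target system here ...
        return max((updated for _, updated in rows), default=watermark)

    print(incremental_refresh(conn, "2024-01-15"))   # -> 2024-02-01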
• TRAINING. If you choose a 3rd party bolt-on or a packaged system, the
developers involved will most likely need training. They may also need training in
a database reporting tool and, if it's new, the scheduling system.