
Error Handling

Identifying errors and defining an error handling strategy is a critical part of any ETL design.

The two types of errors in an ETL process are Data Errors and Process Errors.

To handle data errors we can use the Row Error Logging feature. The errors are captured into error tables, where they can be analysed, corrected and reprocessed.

To handle process errors we can configure an email task to send a notification whenever a session fails.

Row Error Logging: When a session is configured with this option, the Integration Service logs error information to relational tables or to an error log file. On the first run it creates the table or file; from then on it appends to the existing table or file. The error log contains information such as the source name, row ID, row data and transformation error code, which can be used to determine the cause and source of an error.

By default the Integration Service does not write the dropped rows to the session log or create a reject file, so verbose tracing can be enabled to write them to the session log. Performance decreases because rows are processed one at a time.
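Once the row errors land in relational tables, they can be analysed with plain SQL. The sketch below uses a hypothetical, simplified error table called ETL_ROW_ERRORS with illustrative column names; in Informatica the actual relational error log tables are the PMERR_* tables, whose exact structure should be taken from the product documentation.

    -- Summarise logged row errors by source and error code, so the most
    -- frequent causes can be analysed and corrected first.
    SELECT source_name,
           transformation_error_code,
           COUNT(*) AS error_rows
    FROM   etl_row_errors            -- illustrative table name
    GROUP BY source_name, transformation_error_code
    ORDER BY error_rows DESC;

    -- Pull the full row data for one error code so the rows can be
    -- corrected and reprocessed.
    SELECT row_id, row_data, error_message
    FROM   etl_row_errors
    WHERE  transformation_error_code = :err_code;   -- bind the code under investigation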

Archive for the ‘ETL Exception & Error Handling’ Category

For every rule there is an exception; for each exception there are more exceptions…

To implement an ETL process there are many steps to follow. One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target, and the rules of data transformation.

For example: the table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, ranges of attributes, transformation rules, etc.

Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document and starts developing, and in a few days the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What's happening? The analysis was done properly; rules were chalked out and implemented according to the mapping document. But why are issues popping up? Was something missed?
Maybe not! Isn't it normal to have more issues in the initial lifetime of a process?
 
Maybe yes! You have surely missed 'Source System Data Profiling'. The business analyst has told you the rules about how the data is structured in the source system and how it is supposed to behave; but he or she has not told you the 'ifs and buts', the EXCEPTIONS to those rules.
 
To be realistic, it is not possible for anyone to simply recite all the rules and exceptions to you like a parrot. You have to collaborate and dig out the truth. The choice is yours: either do data profiling on the source system and try to break every rule the analyst has told you, or wait for the process to go live and then wake up every night as the load fails. If you are lucky, you only have to deal with an unhappy user every morning when you get to the office.
 
Make the right choice; don't skip 'source system data profiling' before actually writing a single line of code. Question every rule. Try to find exceptions to the rules. Suppose there are at least 20 tables; each table has 30 columns on average, and each column holds around 100k values on average. The matrix of tables * columns * data values gives you the number of ways your assumptions may be wrong. It is like unit testing the source data even before loading it. There is a reason why machines alone cannot do your job; there is a reason why IT jobs pay more.
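A few ad hoc queries go a long way here. A minimal profiling sketch in SQL, assuming a hypothetical source table CUSTOMER and a COUNTRY_CODE column that the analyst claims is never null and always two characters:

    -- How many rows actually break the 'never null, always two characters' rule?
    SELECT COUNT(*)                                                   AS total_rows,
           SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END)      AS null_codes,
           SUM(CASE WHEN LENGTH(country_code) <> 2 THEN 1 ELSE 0 END) AS bad_lengths
    FROM   customer;

    -- Which distinct values occur, and how often?
    SELECT country_code, COUNT(*) AS occurrences
    FROM   customer
    GROUP BY country_code
    ORDER BY occurrences DESC;

Run queries like these for every rule in the mapping document; every non-zero violation count is an exception you would otherwise have discovered in production.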
 
Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’

Posted in Data Profiling, Data Quality, ETL Exception & Error Handling, Source System
Analysis, Uncategorized | 1 Comment »

ETL Strategy to store data validation rules

Every time there is movement of data, the results have to be tested against the expected results. For every ETL process, test conditions for validating data are defined before or during the design and development phase itself. Some that are missed can be added later on.

Various test conditions are used to validate data when the ETL process is migrated from DEV to QA to PRD. These test conditions may exist only in the developer's or tester's mind, or be documented in Word or Excel. With time the test conditions get lost, ignored or scattered around too much to be really useful.

In production, the ETL process running successfully without error is a good thing, but it does not really mean much by itself. You still need rules to validate the data processed by the ETL. At this point you need the data validation rules again!

A better ETL strategy is to store the ETL business rules in a RULES table, keyed by target table and source system. The rules can be stored as SQL text. This creates a repository of all the rules in a single location, which can be called by any ETL process or auditor at any phase of the project life cycle.

There is also no need to re-write or rethink rules. Any or all of these rules can be made optional, tolerances can be defined, and the rules can be called immediately after the process runs or the data can be audited at leisure.
This data validation/auditing system will basically contain:
a table that contains the rules,
a process to call them dynamically, and
a table to store the results from the execution of the rules (a minimal sketch follows below).
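A minimal sketch of such a system in SQL, using illustrative (hypothetical) table and column names; the real design would be adapted to the warehouse at hand:

    -- Repository of validation rules, stored as SQL text per source system / target table.
    CREATE TABLE etl_validation_rules (
        rule_id        INTEGER PRIMARY KEY,
        source_system  VARCHAR(30),
        target_table   VARCHAR(30),
        rule_desc      VARCHAR(200),
        rule_sql       VARCHAR(4000),      -- query that returns the number of violations
        tolerance      INTEGER DEFAULT 0,  -- violations allowed before the rule fails
        is_active      CHAR(1) DEFAULT 'Y'
    );

    -- Results from each execution of the rules.
    CREATE TABLE etl_validation_results (
        rule_id        INTEGER,
        run_date       DATE,
        violation_cnt  INTEGER,
        passed_flag    CHAR(1)
    );

    -- Example rule: the orders fact loaded from the CRM feed must never
    -- contain a null customer key.
    INSERT INTO etl_validation_rules
    VALUES (1, 'CRM', 'FACT_ORDERS',
            'No null customer keys after load',
            'SELECT COUNT(*) FROM fact_orders WHERE customer_key IS NULL',
            0, 'Y');

The calling process simply reads the active rules, executes each rule_sql dynamically, compares the returned count against the tolerance and writes a row into etl_validation_results.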

Benefits:

Rules can be added dynamically with no change to the code.

Rules are stored permanently.

Tolerance levels can be changed without ever changing the code.

Business rules can be added or validated by business experts without worrying about the ETL code.

NOTE: This post is applicable to all ETL tools and databases, such as Informatica, DataStage, Syncsort DMExpress, Sunopsis, Ab Initio, SQL Server Integration Services (SSIS)/DTS, Oracle, Sybase, MS SQL Server, RDB, etc.

Posted in ETL Exception & Error Handling, ETL Testing, Uncategorized | 1 Comment »

Introduction to error and exception management.

Monday, July 3rd, 2006


ETL is all about the transportation, transformation and organization of data. Any time something moves (in fact, even if you are perfectly stationary and things around you move), accidents are bound to happen. So any ETL specialist who believes that their code is perfect and nothing can happen obviously lives in a fool's paradise.

The next obvious thing is to design for managing accidents, like making a safer car or factory. As an ETL specialist, if you don't do it you are no different than anyone else. As in any country, there are laws for accidents and for accidents caused by criminal negligence; the latter being the worse.

How many times have I seen people put ETL code into production without actually designing processes to prevent, manage or report accidents. Writing code is one thing; writing production-worthy code is another. Do ask yourself or your developers, "Is the code production worthy?"

Next, the basic definitions:

ERRORS: A programmatic error that causes the program to fail or makes it run for an uncontrolled length of time.
EXCEPTIONS: Program code written to handle expected or unexpected errors gracefully, so that the program either continues to run by logging the error and bypassing the erroneous condition, or logs the error and exits gracefully with an error message.
A more detailed description will come with the topic 'Unhandled exceptions result in Errors'.
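As a set-based SQL illustration of "logging the error and bypassing the erroneous condition", assuming hypothetical STG_ORDERS, FACT_ORDERS and ETL_LOAD_ERRORS tables:

    -- Log the rows that would break the load (here: missing or negative amounts)...
    INSERT INTO etl_load_errors (src_table, row_key, error_desc, logged_at)
    SELECT 'STG_ORDERS', order_id, 'NULL OR NEGATIVE AMOUNT', CURRENT_TIMESTAMP
    FROM   stg_orders
    WHERE  order_amount IS NULL OR order_amount < 0;

    -- ...and load only the rows that satisfy the rule, so the process keeps
    -- running gracefully instead of failing on the bad data.
    INSERT INTO fact_orders (order_id, order_amount)
    SELECT order_id, order_amount
    FROM   stg_orders
    WHERE  order_amount IS NOT NULL AND order_amount >= 0;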

Note: The topic of errors and exceptions is relevant to Informatica, DataStage, Ab Initio, Oracle Warehouse Builder, PL/SQL, SQL*Loader (SQLLDR), Transact-SQL or any other ETL tools.

Posted in ETL Exception & Error Handling, ETL Strategy | No Comments »

Multiple executions of an ETL process against the same set of data.

Every ETL designer, developer and tester should always ask this question: "What will happen if I run the ETL process multiple times against the same data set?"

Answer 1: I get the same result set.

Answer 2: I get multiple result sets.

If you go back to the original article, What is ETL & What ETL is not!, you will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.

Why would the process run more than once against the same set of data? For many reasons, the most common being an operator's mistake, an accidental kickoff, an old data file remaining in the directory, a staging table loaded more than once, an intentional rerun of the ETL process after correcting some data in the source data set, and so on. Without going into further detail, I would advise ETL folks to always build into the process ways to prevent this, using one or a combination of the following methods:
1. Identify the primary key (logical/physical) and use update-else-insert logic (see the sketch after this list).
2. Delete the target data set before processing again (based on the logical/physical primary key).
3. Prevent multiple runs by flagging processed dates.
4. Mark processed records with processed flags after commit.
5. Prevent multiple loads into the staging area.
6. Identify duplicate records in the staging area before the data gets processed.
7. more…
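Method 1, update-else-insert, maps naturally onto a SQL MERGE. A minimal sketch, assuming hypothetical STG_CUSTOMER and DIM_CUSTOMER tables keyed on customer_id:

    -- Rerunning the load against the same staging data updates the existing
    -- rows instead of inserting duplicates.
    MERGE INTO dim_customer tgt
    USING stg_customer src
    ON (tgt.customer_id = src.customer_id)
    WHEN MATCHED THEN
        UPDATE SET tgt.customer_name = src.customer_name,
                   tgt.last_update   = CURRENT_TIMESTAMP
    WHEN NOT MATCHED THEN
        INSERT (customer_id, customer_name, last_update)
        VALUES (src.customer_id, src.customer_name, CURRENT_TIMESTAMP);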

So do this experiment in the development or test environment: run the ETL process more than once and check the result! If you get result 2 (copies of rows, with no way to distinguish or retire the old rows), the designer or the developer is wrong, and if the process has passed QA or testing, then the tester is wrong as well.
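A quick way to check for result 2 after the repeated run is a duplicate count on the business key (using the hypothetical dim_customer table from the sketch above):

    -- Any business key appearing more than once, with no way to tell the
    -- copies apart, means the process is not rerun-safe.
    SELECT customer_id, COUNT(*) AS copies
    FROM   dim_customer
    GROUP BY customer_id
    HAVING COUNT(*) > 1;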

Bottom line:
A test case that checks multiple runs is a must in the life cycle of an ETL process.
