
ETL Testing


ETL testing (Extract, Transform, and Load).

This article gives an overview of ETL testing and of what we do to test the ETL process.

Independent Verification and Validation is gaining significant market potential, and many companies now see it as a prospective source of business. Customers are offered a range of service offerings, spread across many areas based on technology, process and solutions. ETL or data warehouse testing is one of the offerings that is developing rapidly and successfully.

Why do organizations need Data Warehouse?

Organizations with organized IT practices are looking to move to the next level of technology transformation. They are trying to make themselves more operational with data that is easy to interoperate. Data is the most important part of any organization, whether it is everyday data or historical data. Data is the backbone of any report, and reports are the baseline on which all vital management decisions are taken. Most companies are taking a step forward by constructing a data warehouse to store and monitor real-time data as well as historical data. Crafting an efficient data warehouse is not an easy job. Many organizations have distributed departments with different applications running on distributed technology. An ETL tool is employed to integrate the different data sources from the different departments. The ETL tool works as an integrator: it extracts data from the different sources, transforms it into the preferred format based on the business transformation rules, and loads it into a cohesive database known as the Data Warehouse.

A well-planned, well-defined and effective testing scope helps ensure a smooth transition of the project to production. A business gains real confidence once the ETL processes are verified and validated by an independent group of experts, who make sure that the data warehouse is concrete and robust.

ETL or Data warehouse testing is categorized into four different engagements irrespective of technology or ETL tools used.

New Data Warehouse Testing: A new DW is built and verified from scratch. Data input is taken from customer requirements and different data sources, and the new data warehouse is built and verified with the help of ETL tools.

Migration Testing: In this type of project the customer has an existing DW and ETL performing the job, but is looking to adopt a new tool in order to improve efficiency.

Change Request: In this type of project new data is added from different sources to an existing DW. Also, there might be a condition where the customer needs to change an existing business rule or integrate a new rule.

Report Testing: Reports are the end result of any Data Warehouse and the basic purpose for which the DW is built. Reports must be tested by validating the layout, the data in the report and the calculations.

ETL Testing Techniques:


1) Verify that data is transformed correctly according to the various business requirements and rules.

2) Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.

3) Make sure that the ETL application appropriately rejects invalid data, replaces it with default values and reports it.

4) Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm improved performance and scalability.

Apart from these 4 main ETL testing methods, other testing methods such as integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.
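As a rough illustration of the transformation check and the reject/default-value check listed above, the sketch below applies a made-up business rule to a few records. The field names (`customer_id`, `country`, `amount`), the default value and the helper functions are all hypothetical; a real project would take its rules from the mapping document.

```python
from typing import Optional

DEFAULT_COUNTRY = "UNKNOWN"  # assumed default value for a missing country


def transform_record(record: dict) -> Optional[dict]:
    """Apply a hypothetical business rule; return None to reject the record."""
    # Reject: a record without a customer id cannot be loaded.
    if not record.get("customer_id"):
        return None
    # Reject: the amount must be numeric.
    try:
        amount = float(record["amount"])
    except (KeyError, TypeError, ValueError):
        return None
    # Replace with default: a missing country becomes DEFAULT_COUNTRY.
    country = (record.get("country") or DEFAULT_COUNTRY).strip().upper()
    return {
        "customer_id": record["customer_id"],
        "country": country,
        "amount": round(amount, 2),
    }


def run_transform(records: list) -> tuple:
    """Split the input into transformed rows and rejected rows."""
    loaded, rejected = [], []
    for rec in records:
        out = transform_record(rec)
        if out is None:
            rejected.append(rec)
        else:
            loaded.append(out)
    return loaded, rejected


source = [
    {"customer_id": "C1", "country": "in", "amount": "10.5"},
    {"customer_id": "C2", "country": None, "amount": "7"},
    {"customer_id": "", "country": "us", "amount": "3"},
    {"customer_id": "C4", "country": "us", "amount": "abc"},
]
loaded, rejected = run_transform(source)

# Every source record must be accounted for: either loaded or reported as rejected.
assert len(loaded) + len(rejected) == len(source)
```

The final assertion is the important part of the test: no record may silently disappear between source and target.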

ETL Testing Process:


Like any other testing that falls under Independent Verification and Validation, ETL testing goes through the same phases.

- Business and requirement understanding
- Validating
- Test estimation
- Test planning based on the inputs from test estimation and business requirements
- Designing test cases and test scenarios from all the available inputs
- Once all the test cases are ready and approved, the testing team proceeds to perform pre-execution checks and test data preparation
- Lastly, execution is performed until the exit criteria are met
- Upon successful completion, a summary report is prepared and the closure process is done

It is necessary to define a test strategy, mutually accepted by the stakeholders, before actual testing starts. A well-defined test strategy makes sure that the correct approach is followed and that the testing objectives are met. ETL testing may require the testing team to write SQL statements extensively, or to tailor the SQL provided by the development team. In either case the testing team must know what results they are trying to get from those SQL statements.
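The kind of SQL a tester might write can be sketched as follows, with the expected result stated before each query is run. This is only an illustration using Python's built-in `sqlite3` module; the table names `src_orders` and `tgt_orders` and their columns are assumptions, not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source and target tables with the same three columns.
cur.execute("CREATE TABLE src_orders (order_id INTEGER, customer TEXT, amount REAL)")
cur.execute("CREATE TABLE tgt_orders (order_id INTEGER, customer TEXT, amount REAL)")

rows = [(1, "Asha", 120.0), (2, "Bruno", 75.5), (3, "Chen", 310.25)]
cur.executemany("INSERT INTO src_orders VALUES (?, ?, ?)", rows)
# Simulate a load that copied every row.
cur.execute("INSERT INTO tgt_orders SELECT * FROM src_orders")


def scalar(sql: str):
    """Run a query that returns a single value and return that value."""
    return cur.execute(sql).fetchone()[0]


# Expected result: the row counts match, so no records were lost.
src_count = scalar("SELECT COUNT(*) FROM src_orders")
tgt_count = scalar("SELECT COUNT(*) FROM tgt_orders")

# Expected result: the totals match, so amounts were not altered during the load.
src_total = scalar("SELECT SUM(amount) FROM src_orders")
tgt_total = scalar("SELECT SUM(amount) FROM tgt_orders")

print("counts match:", src_count == tgt_count)
print("totals match:", src_total == tgt_total)
```

Writing down the expected result next to each query keeps the test from degenerating into "run something and see what comes back".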

ETL Testing Basics


ETL stands for Extract, Transform, Load: the process in which you extract data from source tables, transform it into the desired format based on certain rules and finally load it into target tables. Numerous tools help with the ETL process. Informatica is a notable ETL tool, and job schedulers such as Control-M are often used to run the ETL jobs.
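The three steps can be sketched in a few lines of Python. This is a toy illustration, not how an ETL tool works internally: an in-memory CSV stands in for the source, the transformation rules are invented, and the `target_customer` table is a hypothetical name.

```python
import csv
import io
import sqlite3

# An in-memory CSV stands in for a source table or file.
SOURCE_CSV = "id,name,joined\n1, alice ,2012-03-29\n2,BOB,2012-04-02\n"


def extract(text: str) -> list:
    """Extract: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Transform: assumed rules are cast id to int, trim and title-case the name."""
    return [(int(r["id"]), r["name"].strip().title(), r["joined"]) for r in rows]


def load(conn: sqlite3.Connection, rows: list) -> None:
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE target_customer (id INTEGER PRIMARY KEY, name TEXT, joined TEXT)"
    )
    conn.executemany("INSERT INTO target_customer VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
result = conn.execute(
    "SELECT id, name, joined FROM target_customer ORDER BY id"
).fetchall()
print(result)
```

Keeping the three steps as separate functions makes each one testable on its own, which mirrors how ETL testing checks each stage of the flow.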

In ETL Testing, the following are validated:

- Data file loads from the source system onto the source tables
- The transform process that is designed to extract data from the source tables and move it to the staging tables
- Data validation of all mapping rules/transformation rules within the staging tables
- Data validation within the target tables to ensure data is present in the required format and there is no data loss from source to target tables

So ETL testing implies testing this entire process, using a tool or at table level, with the help of test cases and the rules mapping document.
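A table-level check of mapping rules and data loss can be sketched with a set difference between what the rules say the target should contain and what it actually contains. The tables `stg_product` and `tgt_product` and the mapping rules below are assumptions made for the example; SQLite's `EXCEPT` plays the role that `MINUS` plays in some other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_product (product_no TEXT, description TEXT, price_cents INTEGER);
    CREATE TABLE tgt_product (product_no TEXT, description TEXT, price_cents INTEGER);

    INSERT INTO stg_product VALUES
        ('P1', ' widget', 1999), ('P2', 'gadget', 500), ('P3', 'gizmo', 1250);
    -- Simulated target load: P2 has a wrong price and P3 is missing.
    INSERT INTO tgt_product VALUES ('P1', 'WIDGET', 1999), ('P2', 'GADGET', 50);
    """
)

# Assumed mapping rules: description is trimmed and upper-cased, price is a straight move.
EXPECTED = "SELECT product_no, UPPER(TRIM(description)), price_cents FROM stg_product"
ACTUAL = "SELECT product_no, description, price_cents FROM tgt_product"

# Rows the rules say should exist but the target does not contain.
missing_or_wrong = conn.execute(f"{EXPECTED} EXCEPT {ACTUAL} ORDER BY 1").fetchall()
# Rows in the target that the rules cannot explain.
unexpected = conn.execute(f"{ACTUAL} EXCEPT {EXPECTED} ORDER BY 1").fetchall()

print("missing or wrong:", missing_or_wrong)
print("unexpected:", unexpected)
```

Running the difference in both directions matters: one direction finds lost or mis-transformed rows, the other finds rows that should not be in the target at all.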

Typically, data that is loaded into a data warehouse is derived from diverse sources of operational data, which may consist of data from databases, feeds, application files or flat files. The data must be extracted from these diverse sources, transformed to a common format, and loaded into the data warehouse. Typically, it is further aggregated into a data mart for efficient reporting. The ETL (Extract, transform and load) process is a critical step in any data warehouse implementation, and continues to be an area of major significance whenever the ETL code is updated.

An effective data warehouse testing strategy focuses on the main structures within the data warehouse architecture:

1. The ETL layer

2. The data warehouse itself

3. Associated data marts

4. The front-end business intelligence/reporting applications

ETL Testing is categorized into four different engagements:

New Data Warehouse Testing: A new data warehouse is built from the ground up, gathering inputs from the customer and extracting from different data sources. This is verified with the help of ETL tools.

Migration Testing: In this type of engagement, the customer migrates from the current ETL tool to a better option to improve efficiency.

Change Request: In this type of project new data is added from different sources to an existing DW. Also, there might be a condition where the customer needs to change an existing business rule or integrate a new rule.

Report Testing: Validating the report layout, the data in the report and the calculations.

ETL Testing Challenges

Some of the ETL testing challenges include:

- Environment instability
- Response time of the queries executed, failure of the jobs, the data setup required for FIT testing, volume testing
- Data selection from multiple source systems, and the analysis that follows, pose a great challenge
- Volume and complexity of the data
- Inconsistent and redundant data in a data warehouse
- Inconsistent and inaccurate reports
- Non-availability of history data

ETL testing Fundamentals


by HariprasadT on March 29, 2012 in Are You Being Served

Introduction:

Comprehensive testing of a data warehouse at every point throughout the ETL (extract, transform, and load) process is becoming increasingly important as more data is being collected and used for strategic decision-making. Data warehouse or ETL testing is often initiated as a result of mergers and acquisitions, compliance and regulations, data consolidation, and the increased reliance on data-driven decision making (use of Business Intelligence tools, etc.). ETL testing is commonly implemented either manually or with the help of a tool (functional testing tool, ETL tool, proprietary utilities). Let us understand some of the basic ETL concepts.

BI / Data Warehousing testing projects can be divided into ETL (Extract, Transform, Load) testing followed by report testing.

Extract Transform Load is the process that enables businesses to consolidate their data while moving it from place to place, i.e. moving data from source systems into the data warehouse. The data can arrive from any source:

Extract - It can be defined as extracting the data from numerous heterogeneous systems.

Transform - Applying the business logic as specified by the business on the data derived from the sources.

Load - Pumping the data into the final warehouse after completing the above two processes.

The ETL part of the testing mainly deals with how, when, from where and what data we carry in our data warehouse, from which the final reports are supposed to be generated. Thus, ETL testing spreads across each and every stage of data flow in the warehouse, starting from the source databases to the final target warehouse.

Star Schema

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table and the points of the star are the dimension tables. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table. A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them. A typical fact table contains keys and measures. For example, in the sample schema, the fact table sales contains the measures quantity sold, amount and average, and the keys time key, item key, branch key and location key. The dimension tables are time, branch, item and location.

Snow-Flake Schema

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a location dimension table in a star schema might be normalized into a location table and a city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. The figure above presents a graphical representation of a snowflake schema.

When to use star schema and snowflake schema?
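To make the trade-off concrete, the sketch below stores the same location data once as a single star dimension and once as a snowflaked location plus city pair, then asks the same question of both. The `sales`, location and `city` tables follow the examples in the text, but the column names and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star: one denormalized location dimension.
    CREATE TABLE location_star (location_key INTEGER PRIMARY KEY, street TEXT, city_name TEXT);
    -- Snowflake: location normalized into location + city.
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE location_snow (location_key INTEGER PRIMARY KEY, street TEXT,
                                city_key INTEGER REFERENCES city(city_key));
    -- Fact table shared by both layouts.
    CREATE TABLE sales (location_key INTEGER, quantity_sold INTEGER, amount INTEGER);

    INSERT INTO city VALUES (1, 'Pune'), (2, 'Chennai');
    INSERT INTO location_snow VALUES (10, 'MG Road', 1), (11, 'FC Road', 1), (12, 'Anna Salai', 2);
    INSERT INTO location_star VALUES (10, 'MG Road', 'Pune'), (11, 'FC Road', 'Pune'),
                                     (12, 'Anna Salai', 'Chennai');
    INSERT INTO sales VALUES (10, 2, 200), (11, 1, 150), (12, 5, 400), (10, 3, 300);
    """
)

# Star query: the fact table joins directly to one dimension table.
star_sql = """
    SELECT d.city_name, SUM(f.amount)
    FROM sales f JOIN location_star d ON f.location_key = d.location_key
    GROUP BY d.city_name ORDER BY d.city_name
"""

# Snowflake query: the same question needs one extra foreign key join.
snow_sql = """
    SELECT c.city_name, SUM(f.amount)
    FROM sales f
    JOIN location_snow l ON f.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.city_name ORDER BY c.city_name
"""

star_result = conn.execute(star_sql).fetchall()
snow_result = conn.execute(snow_sql).fetchall()
print(star_result)
print(snow_result)
```

Both queries return the same totals per city. The star layout repeats the city name on every location row, while the snowflake layout stores it once at the cost of an additional join.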

When we refer to Star and Snowflake schemas, we are talking about a dimensional model for a Data Warehouse or a Datamart. The Star schema model gets its name from the design appearance, because there is one central fact table surrounded by many dimension tables. The relationship between the fact and dimension tables is created by a PK -> FK relationship, and the keys are generally surrogate to the natural or business key of the dimension tables. All data for any given dimension is stored in the one dimension table. Thus, the design of the model could potentially look like a STAR. On the other hand, the Snowflake schema model breaks the dimension data into multiple tables for the purpose of making the data more easily understood or for reducing the width of the dimension table. An example of this type of schema might be a dimension with Product data of multiple levels. Each level in the Product Hierarchy might have multiple attributes that are meaningful only to that level. Thus, one would break the single dimension table into multiple tables in a hierarchical fashion, with the highest level tied to the fact table. Each table in the dimension hierarchy would be tied to the level above by a natural or business key, whereas the highest level would be tied to the fact table by a surrogate key. As you can imagine, the appearance of this schema design could resemble the appearance of a snowflake.

Types of Dimensions Tables

Type 1: This is a straightforward refresh. The fields are constantly overwritten and history is not kept for the column. For example, should a description change for a Product number, the old value will be overwritten by the new value.

Type 2: This is known as a slowly changing dimension, as history can be kept. The column(s) where the history is captured has to be defined. In our example of the Product description changing for a product number, if the slowly changing attribute captured is the product description, a new row of data will be created showing the new product description. The old description will still be contained in the old row.

Type 3: This is also a slowly changing dimension. However, instead of a new row, in the example, the old product description will be moved to an "old value" column in the dimension, while the new description will overwrite the existing column. In addition, a date stamp column exists to say when the value was updated. Although there will be no full history here, the previous value prior to the update is captured. No new rows will be created for history as the attribute is measured for the slowly changing value.

Types of fact tables:

Transactional: Most facts will fall into this category. The transactional fact will capture transactional data such as sales lines or stock movement lines. The measures for these facts can be summed together.

Snapshot: A snapshot fact will capture the current data at a point in time for a day. For example, all the current stock positions (where items are, and in which branch) at the end of a working day can be captured.

Snapshot fact measures can be summed for this day, but cannot be summed across more than one snapshot day, as the result will be incorrect.

Accumulative: An accumulative snapshot will sum data up for an attribute, and is not based on time. For example, to get the accumulative sales quantity for a sale of a particular product, the row of data will be calculated for this row each night, giving an "accumulative" value.

Key hit-points in ETL testing are:

There are several levels of testing that can be performed during data warehouse testing, and they should be defined as part of the testing strategy in the different phases (Component Assembly, Product) of testing. Some examples include:

1. Constraint Testing: During constraint testing, the objective is to validate unique constraints, primary keys, foreign keys, indexes, and relationships. The test script should include these validation points. Some ETL processes can be developed to validate constraints during the loading of the warehouse. If the decision is made to add constraint validation to the ETL process, the ETL code must validate all business rules and relational data requirements. In automation, it should be ensured that the setup is done correctly and maintained throughout the ever-changing requirements process for effective testing. An alternative to automation is to use manual queries. Queries are written to cover all test scenarios and executed manually.

2. Source to Target Counts: The objective of the count test scripts is to determine if the record counts in the source match the record counts in the target. Some ETL processes are capable of capturing record count information such as records read, records written, records in error, etc. If the ETL process used can capture that level of detail and create a list of the counts, allow it to do so. This will save time during the validation process. It is always a good practice to use queries to double check the source to target counts.

3. Source to Target Data Validation: No ETL process is smart enough to perform source to target field-to-field validation. This piece of the testing cycle is the most labor intensive and requires the most thorough analysis of the data. There are a variety of tests that can be performed during source to target validation. Below is a list of tests that are best practices:

4. Transformation and Business Rules: Tests to verify all possible outcomes of the transformation rules, default values and straight moves, as specified in the Business Specification document. As a special mention, boundary conditions must be tested on the business rules.

5. Batch Sequence & Dependency Testing: ETLs in a DW are essentially a sequence of processes that execute in a particular order. Dependencies do exist among the various processes, and maintaining them is critical to the integrity of the data. Executing the sequences in a wrong order might result in inaccurate data in the warehouse. The testing process must include at least 2 iterations of the end-to-end execution of the whole batch sequence. Data must be checked for its integrity during this testing. The most common types of errors caused by an incorrect sequence are referential integrity failures, incorrect end-dating (if applicable), rejected records, etc.

6. Job restart Testing: In a real production environment, the ETL jobs/processes fail for a number of reasons (for example, database related failures, connectivity failures, etc). The jobs can fail half/partly executed. A good design always allows for the jobs to be restarted from the failure point. Although this is more of a design suggestion/approach, it is suggested that every ETL job is built and tested for restart capability.

7. Error Handling: Understanding that a script might fail during data validation may confirm that the ETL process is working through process validation. During process validation the testing team will work to identify additional data cleansing needs, as well as identify consistent error patterns that could possibly be diverted by modifying the ETL code. It is the responsibility of the validation team to identify any and all records that seem suspect. Once a record has been both data and process validated and the script has passed, the ETL process is functioning correctly. Conversely, if suspect records have been identified and documented during data validation and those are not supported through process validation, the ETL process is not functioning correctly.

8. Views: Views created on the tables should be tested to ensure that the attributes mentioned in the views are correct and that the data loaded in the target table matches what is being reflected in the views.

9. Sampling: Sampling will involve creating predictions out of a representative portion of the data that is to be loaded into the target table; these predictions will be matched with the actual results obtained from the data loaded for Business Analyst testing. The comparison will be verified to ensure that the predictions match the data loaded into the target table.

10. Process Testing: The testing of intermediate files and processes to ensure the final outcome is valid and that performance meets the system/business need.

11. Duplicate Testing: Duplicate testing must be performed at each stage of the ETL process and in the final target table. This testing involves checks for duplicate rows and also checks for multiple rows with the same primary key, neither of which can be allowed.

12. Performance: It is the most important aspect after data validation. Performance testing should check if the ETL process is completing within the load window.

13. Volume: Verify that the system can process the maximum expected quantity of data for a given cycle in the time expected.

14. Connectivity Tests: As the name suggests, this involves testing the upstream and downstream interfaces and intra-DW connectivity. It is suggested that the testing represents the exact transactions between these interfaces. For example, if the design approach is to extract the files from the source system, we should actually test extracting a file out of the system and not just the connectivity.

15. Negative Testing: Negative testing checks whether the application fails, and fails where it should, with invalid inputs and out-of-boundary scenarios, and checks the behavior of the application.

16. Operational Readiness Testing (ORT): This is the final phase of testing, which focuses on verifying the deployment of software and the operational readiness of the application. The main areas of testing in this phase include:

Deployment Test

   1. Tests the deployment of the solution
   2. Tests the overall technical deployment "checklist" and timeframes
   3. Tests the security aspects of the system, including user authentication and authorization, and user-access levels

Conclusion

Evolving needs of the business and changes in the source systems will drive continuous change in the data warehouse schema and the data being loaded. Hence, it is necessary that development and testing processes are clearly defined, followed by impact analysis and strong alignment between development, operations and the business.
