ETL Testing
By - SrilakshmiSudhaker
Data warehouses are not the optimal environment for unstructured data. Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data. Over their life, data warehouses can have high costs: maintenance costs are high, warehouses can get outdated relatively quickly, and there is a cost to delivering suboptimal information to the organization. There is often a fine line between data warehouses and operational systems, so duplicate, expensive functionality may be developed. Alternatively, functionality may be built in the data warehouse that, in retrospect, should have been built in the operational systems, and vice versa.
ETL Concept:
ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.
E - Extracting data from source operational or archive systems, which are the primary sources of data for the data warehouse.
T - Transforming the data, which may involve cleaning, filtering, validating and applying business rules.
L - Loading the data into the data warehouse or any other database or application that houses the data.
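To make the three stages concrete, here is a minimal illustrative sketch in Python. The CSV source, column names and SQLite target are all hypothetical; real pipelines would typically use dedicated ETL tooling.

```python
import csv
import sqlite3

def extract(path):
    """E - read raw records from a source file (here, a CSV export)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """T - clean, validate and apply simple business rules."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):        # reject records missing the key
            continue
        row["name"] = row["name"].strip().title()
        row["amount"] = round(float(row["amount"]), 2)
        cleaned.append(row)
    return cleaned

def load(rows, db_path):
    """L - write the transformed records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, name TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(r["customer_id"], r["name"], r["amount"]) for r in rows])
    con.commit()
    con.close()

load(transform(extract("sales.csv")), "warehouse.db")
```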
ETL Process:
The ETL process involves three stages: extraction, transformation and loading.
Extraction:
The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even outside sources fetched through web spidering or screen-scraping. Extraction converts the data into a format suitable for transformation processing. An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure; if not, the data may be rejected entirely or in part.
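As a sketch of that parsing step, the check below accepts or rejects extracted rows against an expected pattern. The field names and formats are assumptions for illustration only.

```python
import re

# Hypothetical expected pattern for an extracted record: numeric ID,
# non-empty name, ISO-format date. Rows that fail the check are rejected.
ID_RE = re.compile(r"^\d+$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def parse_extracted(rows):
    accepted, rejected = [], []
    for row in rows:
        if (ID_RE.match(row.get("id", ""))
                and row.get("name")
                and DATE_RE.match(row.get("order_date", ""))):
            accepted.append(row)
        else:
            rejected.append(row)   # kept aside for audit / error reporting
    return accepted, rejected

good, bad = parse_extracted([
    {"id": "101", "name": "Acme", "order_date": "2011-03-15"},
    {"id": "abc", "name": "", "order_date": "15/03/2011"},   # fails the checks
])
print(len(good), "accepted,", len(bad), "rejected")
```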
Transformation:
Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is extracted, you still have to worry about its format and structure, because it will often not be in the format needed for the target; for example, the grain level or data types may differ. The data cannot be used as it is: rules and functions must be applied to transform it. One of the purposes of ETL is to consolidate the data in a central repository, bringing it to one logical or physical place. Data can be consolidated from similar systems, different subject areas, and so on. ETL must support data integration for data coming from multiple sources and arriving at different times. This has to be a seamless operation; otherwise it risks overwriting existing data, creating duplicate data or, even worse, simply being unable to load the data into the target.
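A minimal transformation sketch, assuming hypothetical column names: it aligns data types with the target, changes the grain from daily dates to monthly buckets, and drops duplicates so that repeated or overlapping loads do not create duplicate data.

```python
from datetime import datetime

def transform_for_target(rows):
    seen = set()
    out = []
    for row in rows:
        key = (row["customer_id"], row["order_date"])
        if key in seen:            # skip duplicates so reloads stay idempotent
            continue
        seen.add(key)
        out.append({
            "customer_id": int(row["customer_id"]),   # align data type with target
            # change the grain: daily source dates -> monthly target buckets
            "order_month": datetime.strptime(
                row["order_date"], "%d/%m/%Y").strftime("%Y-%m"),
            "amount": round(float(row["amount"]), 2),
        })
    return out
```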
Loading:
The loading process is critical to integration and consolidation. It decides the modality of how data is added to the warehouse, or simply rejected. Methods such as insertion, updating or deletion are executed at this step. What happens to the existing data? Should the old data be deleted because of new information? Should the old data be archived? Or should the new data be treated as additional to the existing data? Data therefore has to be loaded into the data warehouse with the utmost care, and only a data auditing process can establish the necessary confidence level. This auditing normally happens after the data is loaded.
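The insert-versus-update decision is often expressed as an upsert. Below is a minimal sketch using SQLite's ON CONFLICT clause (available in SQLite 3.24+); the table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")

incoming = [(1, "Acme Corp"), (2, "Globex")]

# The ON CONFLICT clause implements the insert-or-update decision:
# new keys are inserted, existing keys are updated in place, so repeated
# loads neither duplicate nor lose data.
con.executemany(
    """INSERT INTO dim_customer (customer_id, name) VALUES (?, ?)
       ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name""",
    incoming,
)
con.commit()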
Some common ETL tools and their vendors:
ETL4ALL - IKAN
DB2 Warehouse Edition - IBM
Pentaho Data Integration - Pentaho
Adeptia Integration Server - Adeptia
ETL Testing:
Following are some common goals for testing an ETL application:
Data completeness - To ensure that all expected data is loaded.
Data quality - To ensure that the ETL application correctly rejects, substitutes default values for, corrects and reports invalid data.
Data transformation - To ensure that all data is correctly transformed according to business rules and design specifications.
Performance and scalability - To ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing - To ensure that the ETL process functions well with other upstream and downstream applications.
User-acceptance testing - To ensure the solution fulfills the users' current expectations and anticipates their future expectations.
Regression testing - To keep the existing functionality intact each time a new release of code is completed.
Basically, data warehouse testing is divided into two categories: back-end testing and front-end testing. The former, which is ETL testing, compares the source systems' data to the end-result data in the loaded area. The latter refers to users checking the data by comparing their MIS reports with the data displayed by the end-user tools.
Data Validation: Data completeness is one of the basic forms of data validation. It verifies that all expected data loads into the data warehouse, including validating that all records and fields are loaded and that the full contents of each field arrive intact.
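A minimal completeness check, assuming hypothetical SQLite databases and a shared table name, compares source and target row counts:

```python
import sqlite3

def assert_counts_match(source_db, target_db, table):
    """Basic completeness check: the row count loaded into the target
    should match the row count extracted from the source."""
    src = sqlite3.connect(source_db).execute(
        f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    tgt = sqlite3.connect(target_db).execute(
        f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    assert src == tgt, f"{table}: source has {src} rows, target has {tgt}"

# Example usage (paths are hypothetical):
# assert_counts_match("source.db", "warehouse.db", "sales")
```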
Data Transformation: Validating that the data is transformed correctly based on business rules can be one of the most complex parts of testing an ETL application with significant transformation logic. One approach is to pick some sample records and compare them manually to validate the transformation, but this method requires manual testing steps and testers who have a good amount of experience and understanding of the ETL logic.
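One way to automate that sample-record approach is to hard-code a few source rows alongside independently computed expected outputs. The rule, field names and values below are invented for illustration:

```python
def apply_business_rule(row):
    """The transformation under test (illustrative rule: full name becomes
    'LAST, First' and amounts are converted from cents to dollars)."""
    return {
        "full_name": f"{row['last'].upper()}, {row['first'].title()}",
        "amount": row["amount_cents"] / 100,
    }

# Hand-picked sample records with independently computed expected output.
samples = [
    ({"first": "ada", "last": "lovelace", "amount_cents": 1250},
     {"full_name": "LOVELACE, Ada", "amount": 12.50}),
]

for source_row, expected in samples:
    actual = apply_business_rule(source_row)
    assert actual == expected, f"expected {expected}, got {actual}"
print("all sample transformations verified")
```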
An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise. The QA team must test for:
i. Data completeness - match source to target counts
ii. Data aggregations - match aggregated data against staging tables (see the sketch after this list)
iii. Granularity of data is as per specifications
iv. Error logs and audit tables are generated and populated properly
v. Notifications to IT and/or business are generated in proper format
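A sketch of the aggregation check in item ii, assuming hypothetical staging and aggregate tables in the same database:

```python
import sqlite3

def check_aggregation(db_path):
    """Compare an aggregate in the warehouse against the same aggregate
    recomputed from the staging table (table/column names hypothetical)."""
    con = sqlite3.connect(db_path)
    staged = con.execute(
        "SELECT region, SUM(amount) FROM stg_sales GROUP BY region").fetchall()
    loaded = con.execute(
        "SELECT region, total_amount FROM agg_sales_by_region").fetchall()
    assert sorted(staged) == sorted(loaded), "aggregates diverge from staging"
```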
Regression testing: A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for existing code whenever a new piece of code is introduced. However, a better strategy can be to preserve earlier test input data and result sets, run the tests again, and compare the new results against the older ones to ensure proper functionality (see the sketch after this section).
Integration testing: This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and ensure data integrity across the flow. Our test strategy should include testing for:
i. Sequence of jobs to be executed, with job dependencies and scheduling
ii. Re-startability of jobs in case of failures
iii. Generation of error logs
iv. Cleanup scripts for the environment, including the database
This activity is a combined responsibility, and participation of experts from all related applications is a must in order to avoid misinterpretation of results.
Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges of whether the application works as they expect. However, business users may not have proper ETL knowledge, so the development and test teams should be ready to answer questions about the ETL processes that relate to data population. The test team must have sufficient business knowledge to translate the results in terms of business. Also, the load windows, the refresh period for the DW and the views created should be signed off by the users.
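A minimal sketch of the preserve-and-compare regression strategy, assuming baseline results were saved as JSON in a hypothetical baseline_results.json during the previous release:

```python
import json

def regression_check(current_results, baseline_path="baseline_results.json"):
    """Re-run the preserved test inputs, then compare the new result set
    against the stored baseline from the previous release."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    diffs = [
        (key, baseline.get(key), current_results.get(key))
        for key in set(baseline) | set(current_results)
        if baseline.get(key) != current_results.get(key)
    ]
    if diffs:
        raise AssertionError(
            f"regression detected in {len(diffs)} result(s): {diffs[:5]}")
```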
Performance testing: In addition to the above tests, a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust; therefore, when it goes into the production environment, it should not cause performance problems. Here we must test the system with a huge volume of data and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate your code for optimization.
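A sketch of a load-window check, assuming a hypothetical four-hour window and a load_fn callable standing in for the actual load job:

```python
import time

LOAD_WINDOW_SECONDS = 4 * 60 * 60   # assumed 4-hour nightly load window

def timed_load(load_fn, rows):
    """Run the load against a production-sized volume and verify it
    completes within the agreed load window."""
    start = time.monotonic()
    load_fn(rows)
    elapsed = time.monotonic() - start
    assert elapsed <= LOAD_WINDOW_SECONDS, (
        f"load took {elapsed:.0f}s, exceeding the {LOAD_WINDOW_SECONDS}s window")
    return elapsed
```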
Summary:
Testing a DW application should be done with a sense of utmost responsibility. A bug in a DW traced at a later stage results in unpredictable losses, and the task is even more difficult in the absence of any single end-to-end testing tool. So the strategies for testing should be methodically developed, refined and streamlined. This is all the more true because the requirements of a DW often change dynamically. Under such circumstances, repeated discussions with the development team and users are of utmost importance to the test team. Another area of concern is test coverage, which has to be reviewed multiple times to ensure completeness of testing. Always remember: a DW tester must go the extra mile to ensure near defect-free solutions.