DW
A fact table is often located at the centre of a star schema or a snowflake schema, surrounded by dimension tables. Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store", so that each record in this fact table is uniquely defined by a day, a product and a store. Other dimensions might be members of this fact table (such as location/region), but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).

Measure types
Additive - measures that can be added across all dimensions.
Non-additive - measures that cannot be added across any dimension.
Semi-additive - measures that can be added across some dimensions but not across others.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). Special care must be taken when handling ratios and percentages. One good design rule[1] is to never store percentages or ratios in fact tables but to calculate them only in the data access tool. Store only the numerator and denominator in the fact table; these can be aggregated, and the aggregated values can then be used to calculate the ratio or percentage in the data access tool.

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "factless fact tables", or "junction tables". Factless fact tables can, for example, be used to model many-to-many relationships or to capture events.[1]

Types of fact tables
There are basically three fundamental measurement events, which characterize all fact tables.[2]

Transactional
A transactional fact table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt. A transactional fact table typically holds data at the most detailed level, causing it to have a great number of dimensions associated with it.
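Before moving on to the snapshot types, here is a minimal sketch, in generic SQL with hypothetical table and column names, of the day-by-product-by-store SALES fact described above. It also illustrates the numerator/denominator rule: the fact table stores discount and gross amounts, and the ratio is derived only at query time.

CREATE TABLE sales_fact (
    date_key        INTEGER NOT NULL,   -- FK to the date dimension
    product_key     INTEGER NOT NULL,   -- FK to the product dimension
    store_key       INTEGER NOT NULL,   -- FK to the store dimension
    sales_qty       INTEGER,            -- additive measure
    gross_amount    DECIMAL(12,2),      -- additive measure (denominator)
    discount_amount DECIMAL(12,2),      -- additive measure (numerator)
    -- deliberately no discount_pct column: ratios are never stored
    PRIMARY KEY (date_key, product_key, store_key)
);

-- The discount percentage is calculated only after aggregation, in the access layer:
SELECT d.calendar_month,
       SUM(f.discount_amount) / SUM(f.gross_amount) AS discount_ratio
FROM   sales_fact f
JOIN   date_dim d ON d.date_key = f.date_key
GROUP  BY d.calendar_month;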
Periodic snapshots
The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.

Accumulating snapshots
This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. It is therefore important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time the row is created.

********************************************************************************

Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.

********************************************************************************

What is ODS?

1. ODS means Operational Data Store.
2. A collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to assure that summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational database.

********************************************************************************

What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only textual attributes.

********************************************************************************

Why should you put your data warehouse on a different system than your OLTP system?

Answer1:
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model). That is why we design a separate, subject-oriented OLAP system. Moreover, if a complex query is fired at an OLTP system, it causes a heavy overhead on the OLTP server that directly affects the day-to-day business.

********************************************************************************

What are Aggregate tables?

An aggregate table contains a summary of the existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the actual table, which may have millions of records, takes more time and also affects server performance. To avoid this we can aggregate the table to the required level and use that instead. These tables reduce the load on the database server, improve query performance and return results very quickly.

********************************************************************************

Dimensional Modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated. A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints.

********************************************************************************

What is data mining?

Data mining is a process of extracting hidden trends within a data warehouse. For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.

********************************************************************************

What is ETL?

ETL stands for extraction, transformation and loading. ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.
Extraction
Take data from an external source and move it to the warehouse pre-processor database.
Transformation
The transform data task allows point-to-point generating, modifying and transforming of data.
Loading
The load data task adds records to a database table in the warehouse.
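To make the three ETL steps concrete, here is a minimal SQL sketch of an extract-transform-load flow between a hypothetical staging table and a warehouse dimension; all table, column and sequence names are invented for illustration, the cleansing rule shown is just one example, and an Oracle-style sequence is assumed for the surrogate key.

-- Extraction: land raw rows from the source system in a staging table
INSERT INTO stg_customer (customer_id, customer_name, country_code)
SELECT customer_id, customer_name, country_code
FROM   src_customer;                     -- source extract (e.g. via a DB link or a flat-file load)

-- Transformation + Loading: cleanse, standardize and key the rows while
-- inserting them into the warehouse dimension table
INSERT INTO dw_customer_dim (customer_key, customer_id, customer_name, country_name, load_date)
SELECT seq_customer_key.NEXTVAL,         -- surrogate key (Oracle-style sequence assumed)
       s.customer_id,
       UPPER(TRIM(s.customer_name)),     -- simple cleansing/standardization rule
       c.country_name,                   -- code-to-description lookup
       CURRENT_DATE
FROM   stg_customer s
JOIN   ref_country  c ON c.country_code = s.country_code;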
********************************************************************************

What is the Difference between OLTP and OLAP?

The main differences between OLTP and OLAP are:
1. User and System Orientation
OLTP: customer-oriented, used for transaction processing and querying by clerks, clients and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data Contents
OLTP: manages current data, very detail-oriented.
OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.
3. Database Design
OLTP: adopts an entity-relationship (ER) model and an application-oriented database design.
OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.
4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

********************************************************************************

What is SCD1, SCD2, SCD3?

SCD stands for Slowly Changing Dimensions.
SCD1: only the updated values are maintained. Ex: when a customer address is modified, we update the existing record with the new address.
SCD2: historical and current information are maintained by using
A) Effective dates
B) Versions
C) Flags
or a combination of these.
SCD3: historical and current information are maintained by adding new columns to the target table.
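A common way to implement SCD2 with effective dates and a current-record flag is sketched below in SQL; the dimension layout, column names and sequence are assumptions for illustration (an Oracle-style sequence is used), and real ETL tools typically generate equivalent logic.

-- Hypothetical dimension with SCD2 housekeeping columns:
-- customer_dim(customer_key, customer_id, address, eff_start_date, eff_end_date, current_flag)

-- Step 1: expire the current row for the changed customer
UPDATE customer_dim
SET    eff_end_date = CURRENT_DATE,
       current_flag = 'N'
WHERE  customer_id  = 1001
AND    current_flag = 'Y';

-- Step 2: insert a new row carrying the new address under a fresh surrogate key
INSERT INTO customer_dim
       (customer_key, customer_id, address, eff_start_date, eff_end_date, current_flag)
VALUES (seq_customer_key.NEXTVAL, 1001, '12 New Street', CURRENT_DATE, DATE '9999-12-31', 'Y');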
********************************************************************************

Why are OLTP database designs not generally a good idea for a Data Warehouse?

Because OLTP tables are normalised, query response is slow for the end user, and an OLTP system does not contain years of data and hence cannot be analysed.

********************************************************************************

What is BUS Schema?

A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

********************************************************************************

What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?

1. Normalization is the process of assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.
2. Normalization is the analysis of functional dependency between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.
1NF: repeating groups must be eliminated, dependencies can be identified, all key attributes are defined, and there are no repeating groups in the table.
2NF: the table is already in 1NF and includes no partial dependencies (no attribute depends on only a portion of the primary key). It may still exhibit transitive dependency: attributes may be functionally dependent on non-key attributes.
3NF: the table is already in 2NF and contains no transitive dependencies.

********************************************************************************

What is Fact table?
A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the fact table. The fact table also contains the foreign keys to the dimension tables.

********************************************************************************

What are conformed dimensions?

Answer1:
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.
Ex: a Date dimension is connected to all facts, such as Sales facts, Inventory facts, etc.

********************************************************************************

What are the Different methods of loading Dimension tables?

Conventional load:
Before loading the data, all the table constraints are checked against the data.
Direct load (faster loading):
All the constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints and the bad data is not indexed.

********************************************************************************

What is conformed fact?

Conformed facts are measures that have the same definition and units in every fact table in which they appear, so that they can be compared and combined across multiple data marts.
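As a small illustration of the conformed dimension and conformed fact questions above, the sketch below (generic SQL, hypothetical names) shows a single date dimension shared by fact tables in different data marts, so that results from both line up on identical date attributes.

CREATE TABLE date_dim (
    date_key       INTEGER PRIMARY KEY,
    calendar_date  DATE,
    calendar_month CHAR(7),        -- e.g. '2024-01'
    calendar_year  INTEGER
);

-- An inventory fact in another data mart references the very same dimension
CREATE TABLE inventory_fact (
    date_key         INTEGER REFERENCES date_dim (date_key),
    product_key      INTEGER,
    quantity_on_hand INTEGER
);
-- The sales fact sketched earlier would reference the same date_dim, e.g.:
-- ALTER TABLE sales_fact ADD FOREIGN KEY (date_key) REFERENCES date_dim (date_key);

-- Because the dimension is conformed, a monthly rollup of inventory ...
SELECT d.calendar_month, SUM(i.quantity_on_hand) AS on_hand
FROM   inventory_fact i
JOIN   date_dim d ON d.date_key = i.date_key
GROUP  BY d.calendar_month;
-- ... groups by exactly the same calendar_month attribute as the sales rollup,
-- so the two result sets can be compared or combined directly.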
********************************************************************************

What are Data Marts?

Data Marts are designed to help managers make strategic decisions about their business. Data Marts are subsets of the corporate-wide data that are of value to a specific group of users. There are two types of Data Marts:
1. Independent data marts: sourced from data captured from OLTP systems, from external providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.

********************************************************************************

Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data at the level of each transaction. The level of granularity then describes how much detail you are willing to keep for each fact: product sales recorded for every individual transaction, or sales aggregated up to the minute (or some coarser level) and stored at that level.

********************************************************************************

How are the Dimension tables designed?

Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF.

********************************************************************************

What are non-additive facts?

Non-additive: non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

********************************************************************************

What type of Indexing mechanism do we need to use for a typical datawarehouse?

On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes. To my knowledge, SQL Server does not let you create bitmap indexes explicitly; Oracle does support them.
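For example, in Oracle (the syntax below is Oracle-specific, and the table and column names are the hypothetical ones used earlier), bitmap indexes are typically created on the low-cardinality foreign key columns of the fact table:

-- Bitmap indexes on fact table foreign keys (Oracle syntax)
CREATE BITMAP INDEX sales_fact_date_bix    ON sales_fact (date_key);
CREATE BITMAP INDEX sales_fact_product_bix ON sales_fact (product_key);
CREATE BITMAP INDEX sales_fact_store_bix   ON sales_fact (store_key);

-- A star query that filters on several dimensions can then combine these bitmaps;
-- Oracle may apply a "star transformation" when the relevant bitmap indexes exist.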
********************************************************************************

What is a Snowflake Schema?

In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

********************************************************************************

What is real time data-warehousing?

Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

********************************************************************************

What are slowly changing dimensions?

SCD stands for Slowly Changing Dimensions. Slowly changing dimensions are of three types:
SCD1: only the updated values are maintained. Ex: when a customer address is modified, we update the existing record with the new address.
SCD2: historical and current information are maintained by using
A) Effective dates
B) Versions
C) Flags
or a combination of these.
SCD3: historical and current information are maintained by adding new columns to the target table.

********************************************************************************

What are semi-additive and factless facts and in which scenario will you use such kinds of fact tables?

Snapshot facts are semi-additive; when we maintain aggregated facts we go for semi-additive measures. Ex: average daily balance.
A fact table without numeric fact columns is called a factless fact table. Ex: promotion facts, which record the promotion events of a transaction (e.g. product samples), because this table does not contain any measures.

********************************************************************************

Differences between star and snowflake schemas?

Star schema - all dimensions are linked directly to the fact table.
Snowflake schema - dimensions may be interlinked or may have one-to-many relationships with other tables.

********************************************************************************

What is a Star Schema?

A star schema is a way of organising the tables such that we can retrieve results from the database easily and quickly in the warehouse environment. A star schema usually consists of one or more dimension tables arranged around a fact table, which makes it look like a star, and that is how it got its name.
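A typical star-schema query joins the fact table to each dimension it needs and groups by descriptive attributes. The sketch below is generic SQL over the hypothetical sales tables used in the earlier examples (product_dim and store_dim are likewise assumed):

SELECT d.calendar_month,
       p.product_category,
       s.region,
       SUM(f.sales_qty)    AS total_qty,
       SUM(f.gross_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim    d ON d.date_key    = f.date_key
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   store_dim   s ON s.store_key   = f.store_key
WHERE  d.calendar_year = 2024
GROUP  BY d.calendar_month, p.product_category, s.region;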
********************************************************************************
What is ER Diagram?

The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects. Since Chen wrote his paper the model has been extended, and today it is commonly used for database design.
For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.

********************************************************************************

Which columns go to the fact table and which columns go the dimension table?

The primary key columns of the source tables (entities) go into the dimension tables as natural key attributes, and the primary key columns of the dimension tables go into the fact table as foreign keys, alongside the numeric measures.

********************************************************************************

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.
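Such a loading program can be written in SQL itself. The sketch below uses an Oracle row generator (CONNECT BY LEVEL); other databases have equivalents such as generate_series, and the column names are hypothetical.

-- Populate one row per day for roughly 100 years of dates (Oracle syntax)
INSERT INTO date_dim (date_key, calendar_date, day_of_week, calendar_month, calendar_year)
SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')) AS date_key,     -- e.g. 20000101
       d                                 AS calendar_date,
       TO_CHAR(d, 'DY')                  AS day_of_week,
       TO_CHAR(d, 'YYYY-MM')             AS calendar_month,
       EXTRACT(YEAR FROM d)              AS calendar_year
FROM  (SELECT DATE '2000-01-01' + LEVEL - 1 AS d
       FROM   dual
       CONNECT BY LEVEL <= 36600);                        -- about 100 years of days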
********************************************************************************

What is VLDB?

Answer 1:
VLDB stands for Very Large Database. It is an environment or storage space, managed by a relational database management system (RDBMS), consisting of vast quantities of information.

Answer 2:
VLDB doesn't refer to the size of the database or the vast amount of information stored; it refers to the window of opportunity to take a backup of the database. The window of opportunity is the available time interval: if the DBA is unable to take a backup within the specified time, the database is considered a VLDB.

********************************************************************************

What are Data Marts?

A data mart is a focused subset of a data warehouse that deals with a single area of data (such as a single department) and is organized for quick analysis.

********************************************************************************

What is the difference between E-R Modeling and Dimensional Modeling?

The basic difference is that E-R modeling has both a logical and a physical model, while the dimensional model has only a physical model. E-R modeling is used for normalizing the OLTP database design; dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

********************************************************************************

Why fact table is in normal form?

Basically the fact table consists of the index keys of the dimension/lookup tables and the measures. Whenever a table contains only keys and measures, that itself implies that the table is in normal form.

********************************************************************************

What is degenerate dimension table?

Degenerate dimensions: if a table contains values which are neither dimensions nor measures, these are called degenerate dimensions. Ex: invoice id, empno.

********************************************************************************

What is Dimensional Modelling?

Dimensional Modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

********************************************************************************
What is the main difference between a schema in an RDBMS and schemas in a Data Warehouse?

RDBMS schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot solve extract and complex problems
* Poorly suited for analytical modelling

DWH schema
* Used for OLAP systems
* New-generation schema
* De-normalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model for analysis

********************************************************************************

What is hybrid slowly changing dimension?

Hybrid SCDs are a combination of both SCD 1 and SCD 2. It may happen that in a table some columns are important and we need to track changes for them, i.e. capture the historical data, whereas for some other columns, even if the data changes, we don't care. For such tables we implement hybrid SCDs, where some columns are Type 1 and some are Type 2.

********************************************************************************

What is junk dimension? What is the difference between junk dimension and degenerated dimension?

Junk dimension: grouping of random flags and text attributes in a dimension and moving them to a separate sub-dimension.
Degenerate dimension: keeping the control information on the fact table.
Ex: consider a dimension table with fields like order number and order line number having a 1:1 relationship with the fact table. In this case the dimension is removed and the order information is stored directly in the fact table, in order to eliminate unnecessary joins when retrieving order information.
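Both ideas can be shown in one small sketch (generic SQL, hypothetical names): several low-cardinality flags are combined into a junk dimension, the fact table carries a single junk key instead of many flag columns, and the order number stays on the fact table as a degenerate dimension.

-- One row per observed combination of the miscellaneous flags
CREATE TABLE order_junk_dim (
    junk_key        INTEGER PRIMARY KEY,
    payment_type    VARCHAR(10),   -- e.g. 'CASH', 'CARD'
    gift_wrap_flag  CHAR(1),       -- 'Y' / 'N'
    rush_order_flag CHAR(1)        -- 'Y' / 'N'
);

CREATE TABLE order_fact (
    date_key     INTEGER,
    product_key  INTEGER,
    junk_key     INTEGER REFERENCES order_junk_dim (junk_key),
    order_number VARCHAR(20),      -- degenerate dimension: kept on the fact table itself
    order_amount DECIMAL(12,2)
);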
********************************************************************************

Differences between star and snowflake schemas?

Star schema: a single fact table with N dimension tables joined directly to it.
Snowflake schema: dimensions with further extended (normalized) dimension tables are known as a snowflake schema.

********************************************************************************

A star schema contains the dimension tables mapped around one or more fact tables. It is a denormalised model, so there is no need to use complicated joins and queries return results quickly.

A snowflake schema is the normalised form of the star schema. It contains deeper joins, because the tables are split into many pieces. We can easily make modifications directly in the tables, but we have to use complicated joins since we have more tables, and there will be some delay in processing the query.

********************************************************************************

View - stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.

Materialized view - stores the result of the SQL in table form in the database. The SQL statement executes only once, and after that, every time you run the query the stored result set is used. Pros include quick query results.

********************************************************************************

What is aggregate table and aggregate fact table ... any examples of both?

An aggregate table contains summarised data. Materialized views are aggregated tables. For example, in sales we may have only date-level transactions; if we want to create a report like sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we use the @aggregate function.
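For instance, in Oracle (the syntax below is Oracle-specific, and the table names are the hypothetical ones from earlier sketches) a monthly sales aggregate can be built as a materialized view over the detail fact table:

CREATE MATERIALIZED VIEW sales_month_agg
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT d.calendar_year,
       d.calendar_month,
       f.product_key,
       SUM(f.gross_amount) AS total_sales,
       SUM(f.sales_qty)    AS total_qty
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
GROUP  BY d.calendar_year, d.calendar_month, f.product_key;

-- Reports can now read sales_month_agg instead of scanning the detail fact table.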
********************************************************************************

What is the difference between Data Warehousing and Business Intelligence?

Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc.
Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of its business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term business intelligence is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business, including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.

********************************************************************************

What is fact less fact table? Where have you used it in your project?

A factless fact table means only the keys are available in the fact table; there are no measures.

********************************************************************************

What is snapshot?

You can disconnect a report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.

********************************************************************************

Partitioning Concept

By default, Source Qualifiers and Targets are partition points (points where we can tell the Informatica server to add partitions).
Source Qualifier - (partition point) - Transformations - (partition point) - Target
That said, if we do not add additional partitions, then by default there are only 3 threads doing the ETL:
1 - to pull from the source
2 - to transform the data
3 - to write into the target
If you add partitions at partition points (any partition points, including the default source and target points), you increase the number of threads performing an operation between the partition points.
http://informatica-tech-talk.blogspot.com/
To give an example: if your source database can handle parallel read operations, then instead of using a single thread to pull millions of rows, you can add partitions so that each thread (= partition) pulls its designated data (based on the partitioning type). Assume a source table is list-partitioned in the database based on Sales Channel, and there are 5 partitions for each of 5 channels (eservice, ivr, sales reps, call to customer etc...). Here in Informatica too, you can add 5 partitions of list type at the source qualifier,
and use the same partition key values for each partition. This way the read operation will be multi-threaded, each thread pulling data from one partition. The best part of this example is that your database and Informatica partitions are of the same type and in perfect sync, which means you avoid the huge read contention on the source database that there would otherwise be. Hope it helps!

Performance tuning at session level is applicable to remove bottlenecks in the ETL data load. Session partitioning means "splitting the ETL data load into multiple parallel pipeline threads". It is helpful on an RDBMS like Oracle but not so effective for Teradata or Netezza (it conflicts with their auto-parallel-aware architecture). The different types of partitioning supported by Informatica are:
1. Pass-Through (default)
2. Round-robin
3. Database partitioning
4. Hash auto-keys
5. Hash user keys
6. Key range
Open Workflow Manager, go to the session properties, Mapping tab, and select the Partition hyperlink. Here we can add/delete/view partitions, set partition points, add the number of partitions and then set the partition type.
Pass-Through (default): all rows in a single partition; no data distribution. An additional staging area can give better performance.
Round-robin: equal data distribution among all partitions using a round-robin algorithm. Each partition has almost the same number of rows.
Hash auto-keys: a system-generated partition key based on grouped ports at the transformation level. When a new set of logical keys exists, the Integration Service generates a hash key using a hash map and puts the row into the appropriate partition. Popularly used for Rank, Sorter and unsorted Aggregator transformations.
Hash user keys: a user-defined group of ports is used as the partition key. For the key value, the system generates a hash value using a hashing algorithm, and the row is put into a certain partition based on that hash value.
Key range: each port in the key-range partition needs to be assigned a range of values. The key value and the range decide which partition holds the current row. Popularly used at the source and target level.
A system-level partitioning key is generated for hash auto-keys, round-robin, or pass-through partitioning.
Session partitioning enables parallel processing of the ETL load. It enhances performance by using multiprocessing/grid processing for the ETL load.

********************************************************************************

Tuning:
This is the first of a number of articles in the series on Data Warehouse application performance tuning, scheduled to come every week. This one is on Informatica performance tuning.

Source Query / General Query Tuning
1.1 Calculate the original query cost
1.2 Can the query be re-written to reduce cost? (See the SQL sketch after this checklist.)
- Can an IN clause be changed to EXISTS?
- Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
- Is there a redundant table join that can be avoided?
- Can we include an additional WHERE clause to further limit data volume?
- Is there a redundant column used in GROUP BY that can be removed?
- Is there a redundant column selected in the query but not used anywhere in the mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause) are indexed
- Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduces query cost
- Check if a parallel hint improves performance and reduces cost
1.6 Recalculate the query cost
- If the query cost is reduced, use the changed query
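As a sketch of items 1.1 and 1.2 above (Oracle syntax for the cost check; the table and column names are hypothetical):

-- 1.1 Check the query cost before and after rewriting
EXPLAIN PLAN FOR
SELECT o.order_id, o.order_amount
FROM   orders o
WHERE  o.customer_id IN (SELECT c.customer_id FROM customers c WHERE c.region = 'WEST');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- 1.2 Possible rewrite: IN changed to EXISTS (the optimizer often treats these the same,
--     but on some databases/versions the EXISTS form costs less)
SELECT o.order_id, o.order_amount
FROM   orders o
WHERE  EXISTS (SELECT 1
               FROM   customers c
               WHERE  c.customer_id = o.customer_id
               AND    c.region = 'WEST');

-- 1.2 Possible rewrite: UNION replaced with UNION ALL when duplicate removal is not needed
SELECT order_id FROM orders_2023
UNION ALL
SELECT order_id FROM orders_2024;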
Tuning Informatica LookUp
2.1 Redundant Lookup transformation
- Is there a lookup which is no longer used in the mapping?
- If there are consecutive lookups, can those be replaced inside a single lookup override?
2.2 LookUp conditions
- Are all the lookup conditions indexed in the database? (uncached lookup only)
- An unequal condition should always be mentioned after an equal condition
2.3 LookUp override query
- Should follow all guidelines from the Source Query part above
2.4 There is no unnecessary column selected in the lookup (to reduce cache size)
2.5 Cached/Uncached
- Carefully consider whether the lookup should be cached or uncached
- General guidelines:
- Generally don't use a cached lookup if the lookup table size is > 300 MB
- Generally don't use a cached lookup if the lookup table row count is > 20,000,000
- Generally don't use a cached lookup if the driving table (source table) row count is < 1,000
2.6 Persistent cache
- If you find that the same lookup is cached and used in different mappings, consider a persistent cache
2.7 Lookup cache building
- Consider "Additional Concurrent Pipelines" in the session properties to build caches concurrently
- "Prebuild Lookup Cache" should be enabled only if the lookup is surely called in the mapping

Tuning Informatica Joiner
3.1 Unless unavoidable, join database tables in the database only (homogeneous join) and don't use a Joiner
3.2 If an Informatica Joiner is used, always use sorted rows and try to sort in the SQ query itself using ORDER BY (if a Sorter transformation is used, make sure the Sorter has enough cache to perform a 1-pass sort)
3.3 The smaller of the two joining tables should be the master

Tuning Informatica Aggregator
4.1 When possible, sort the input for the Aggregator at the database end (ORDER BY clause)
4.2 If the input is not already sorted, use a Sorter. If possible, use the SQ query to sort the records.

Tuning Informatica Filter
5.1 Unless unavoidable, filter in the source query in the Source Qualifier
5.2 Apply the filter as near to the source as possible

Tuning Informatica Sequence Generator
6.1 Cache the sequence generator
Setting Correct Informatica Session Level Properties
7.1 Disable "High Precision" if it is not required (high precision allows decimals up to 28 digits)
7.2 Use "Terse" mode for the tracing level
7.3 Enable pipeline partitioning (thumb rule: maximum no. of partitions = no. of CPUs / 1.2)
(Also remember that increasing partitions will multiply the cache memory requirement accordingly)

Tuning Informatica Expression
8.1 Use variables to reduce redundant calculations
8.2 Remove the default value "ERROR('transformation error')" for output columns
8.3 Try to reduce code complexity such as nested IFs
8.4 Try to reduce unnecessary type conversions in calculations

********************************************************************************

Pushdown Optimization, which is a new concept in Informatica PowerCenter, allows developers to balance data transformation load among servers. This article describes pushdown techniques.

What is Pushdown Optimization?
Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance. Veteran ETL developers often come across issues when they need to determine the appropriate place to perform ETL logic. Suppose an ETL process needs to filter out data based on some condition. One can either do it in the database, by using a WHERE condition in the SQL query, or inside Informatica, by using an Informatica Filter transformation. Sometimes we can even "push" some transformation logic to the target database instead of doing it on the source side (especially in the case of EL-T rather than ETL). Such optimization is crucial for overall ETL performance.

How does Push-Down Optimization work?
One can push transformation logic to the source or target database using pushdown optimization. The Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the source or the target database, which executes the SQL queries to process the transformations. The amount of transformation logic one can push to the database depends on the database, the transformation logic, and the mapping and session configuration. The Integration Service analyzes the transformation logic it can push to the database, executes the generated SQL statements against the source or target tables, and itself processes any transformation logic that it cannot push to the database.

What is Pushdown Optimization and things to consider
The process of pushing transformation logic to the source or target database by the Informatica Integration Service is known as Pushdown Optimization. When a session is configured to run with Pushdown Optimization, the Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the database. The source or target database executes the SQL queries to process the transformations.

How does Pushdown Optimization (PO) work?
The Integration Service generates native SQL statements when a native database driver is used.
In the case of ODBC drivers, the Integration Service cannot detect the database type and generates ANSI SQL. The Integration Service can usually push more transformation logic to a database if a native driver is used instead of an ODBC driver.
For any SQL override, the Integration Service creates a view (PM_*) in the database while executing the session task and drops the view after the task completes. Similarly, it also creates sequences (PM_*) in the database. The database schema (SQ connection, LKP connection) should have the Create View / Create Sequence privilege, otherwise the session will fail.

A few benefits of using PO
    There is no memory or disk space required to manage caches in the Informatica server for Aggregator, Lookup, Sorter and Joiner transformations, as the transformation logic is pushed to the database.
    The SQL generated by the Informatica Integration Service can be viewed before running the session through the Pushdown Optimization Viewer, making it easier to debug.
    When inserting into targets, the Integration Service does row-by-row processing using bind variables (only a soft parse, i.e. only processing time, no parsing time). With Pushdown Optimization, the statement is executed only once.

Without using Pushdown Optimization:
INSERT INTO EMPLOYEES(ID_EMPLOYEE, EMPLOYEE_ID, FIRST_NAME, LAST_NAME, EMAIL, PHONE_NUMBER, HIRE_DATE, JOB_ID, SALARY, COMMISSION_PCT, MANAGER_ID, MANAGER_NAME, DEPARTMENT_ID) VALUES (:1, :2, :3, :4, :5, :6, :7, :8, :9, :10, :11, :12, :13)
executes 7012352 times

With Pushdown Optimization:
INSERT INTO EMPLOYEES(ID_EMPLOYEE, EMPLOYEE_ID, FIRST_NAME, LAST_NAME, EMAIL, PHONE_NUMBER, HIRE_DATE, JOB_ID, SALARY, COMMISSION_PCT, MANAGER_ID, MANAGER_NAME, DEPARTMENT_ID) SELECT CAST(PM_SJEAIJTJRNWT45X3OO5ZZLJYJRY.NEXTVAL AS NUMBER(15, 2)), EMPLOYEES_SRC.EMPLOYEE_ID, EMPLOYEES_SRC.FIRST_NAME, EMPLOYEES_SRC.LAST_NAME, CAST((EMPLOYEES_SRC.EMAIL || '@gmail.com') AS VARCHAR2(25)), EMPLOYEES_SRC.PHONE_NUMBER, CAST(EMPLOYEES_SRC.HIRE_DATE AS date), EMPLOYEES_SRC.JOB_ID, EMPLOYEES_SRC.SALARY, EMPLOYEES_SRC.COMMISSION_PCT, EMPLOYEES_SRC.MANAGER_ID, NULL, EMPLOYEES_SRC.DEPARTMENT_ID FROM (EMPLOYEES_SRC LEFT OUTER JOIN EMPLOYEES PM_Alkp_emp_mgr_1 ON (PM_Alkp_emp_mgr_1.EMPLOYEE_ID = EMPLOYEES_SRC.MANAGER_ID)) WHERE ((EMPLOYEES_SRC.MANAGER_ID = (SELECT PM_Alkp_emp_mgr_1.EMPLOYEE_ID FROM EMPLOYEES PM_Alkp_emp_mgr_1 WHERE (PM_Alkp_emp_mgr_1.EMPLOYEE_ID = EMPLOYEES_SRC.MANAGER_ID))) OR (0=0))
executes 1 time

Things to note when using PO
There are cases where the Integration Service and Pushdown Optimization can produce different result sets for the same transformation logic. This can happen during data type conversion, handling of null values, case sensitivity, sequence generation, and sorting of data. The database and the Integration Service produce different output when the following settings and conversions differ:
    Nulls treated as the highest or lowest value: while sorting the data, the Integration Service can treat null values as the lowest, but the database may treat null values as the highest value in the sort order.
    SYSDATE built-in variable: the built-in variable SYSDATE in the Integration Service returns the current date and time for the node running the service process. In the database, however, SYSDATE returns the current date and time for the machine hosting the database. If the time zone of the machine hosting the database is not the same as the time zone of the machine running the Integration Service process, the results can vary.
    Date conversion: the Integration Service converts all dates before pushing transformations to the database, and if the format is not supported by the database, the session fails.
    Logging: when the Integration Service pushes transformation logic to the database, it cannot trace all the events that occur inside the database server. The statistics the Integration Service can trace depend on the type of pushdown optimization. When the Integration Service runs a session configured for full pushdown optimization and an error occurs, the database handles the errors. When the database handles errors, the Integration Service does not write reject rows to the reject file.

********************************************************************************

Informatica performance tuning: Identifying bottlenecks
Written by nisheetsingh
Posted February 14, 2010 at 8:47 am

In our last post on Informatica performance tuning, we discussed it only very briefly. Today, let's discuss it in detail. Performance tuning is similar to a chain, which can only be as strong as its weakest link, and Informatica performance tuning is not so straightforward. As I said in the last post, there are four crucial domains that require attention:
System
Network
Database
and the Informatica coding
Since the first three come under administrator territory, we will discuss the Informatica coding. Informatica performance tuning is an iterative process. At each iteration, we identify the biggest bottlenecks and then remove those bottlenecks. From a developer perspective, we should proceed in the following order:
Source
Target
Mapping transformations
Session
Another way to get a detailed log is to run the session with the "Collect performance data" option enabled.
To identify whether there is any problem with the source or the target, replace the source (or target) with a flat file containing similar data. If performance improves drastically, there is some problem with the source (or target) tables, and we then have to check the table statistics at the database level.

Session logs
The PowerCenter 8 session log provides very detailed information that can be used to establish a baseline and will identify potential problems. Very useful for the developer are the detailed thread statistics, which help in benchmarking your actions. The thread statistics show whether the bottlenecks occur while transforming data or while reading/writing. Always focus attention on the thread with the highest busy percentage first. For every thread, detailed information on the run and idle time is presented. The busy percentage is calculated as (run time - idle time) * 100 / run time; for example, a thread with a run time of 11.76 secs and an idle time of 0 secs is (11.76 - 0) * 100 / 11.76 = 100% busy.
Each session has a minimum of three threads:
reader thread
transformation thread
writer thread

An example:
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] *****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_NST_CUST] has completed: Total Run Time = [41.984171] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000].
Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_NST_CUST] has completed: Total Run Time = [0.749966] secs, Total Idle Time = [0.334257] secs, Busy Percentage = [34.364003].
Thread [WRITER_1_*_1] created for [the write stage] of partition point [NST_CST_1] has completed: Total Run Time = [47.668825] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000].

In this example, the obvious bottlenecks are the source and target in the database; the transformations take very little time compared to the source and target. If you see the reader or writer thread 100% busy, then to improve performance we can opt for partitioning of the source/target: it opens several connections to the database and reads from/writes to the database in parallel.

Similarly, we can see another example below:
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] *****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_NST_EMP] has completed. The total run time was insufficient for any meaningful statistics.
Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_NST_EMP] has completed: Total Run Time = [37.403830] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000].
Thread [WRITER_1_*_1] created for [the write stage] of partition point [T_NST_EMP] has completed: Total Run Time = [28778821] secs, Total Idle Time = [25.223456] secs, Busy Percentage = [34.23602355].

Here we can see that the transformation thread is 100% busy, hence the bottleneck of this execution is the transformation. As of now, I am still not done with performance optimization; the only thing I have discussed is how to identify the bottleneck of any mapping/session/workflow. In the next post, I will discuss resolving those bottlenecks. Cheers!!!
8 Responses to this post

George Parappuram on July 17th, 2010:
Nisheet, I have a case where:
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] *****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_TED_HANDSET_REPAIR_DETAIL_V] has completed. Total Run Time = [30583.228133] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000]
Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_TED_HANDSET_REPAIR_DETAIL_V] has completed. Total Run Time = [30583.137429] secs, Total Idle Time = [30386.113959] secs, Busy Percentage = [0.644223]. Transformation-specific statistics for this thread were not accurate enough to report.
Thread [WRITER_1_*_1] created for [the write stage] of partition point [HANDSET_REPAIR_DETAIL_STG] has completed. Total Run Time = [30582.229525] secs, Total Idle Time = [27009.644572] secs, Busy Percentage = [11.681898]
I do not have any exposure to Informatica, but I am an Oracle DBA. The SQL runs in 2 hrs on the local Oracle DB server, and 4 hrs across the network, but the Informatica session takes 17 hrs. What are the parameters which can improve network performance here?

Nisheet Singh on July 17th, 2010:
Since PowerCenter is taking around 8.5 hrs while reading the data from the source, there could be a problem with the source table/partition. Try to analyze / gather stats for that partition/table; that could work. Please let me know if you still face the issue. Thanks!!

Ganguly on September 14th, 2010:
Can you please optimize it? It is SQ -> target, no transformation in between. Source (Oracle), target (flat file). The SQ query is:

select lpad(AS1, 7, '0'),
       lpad(AS2, 3, 0),
       case when a.p = 1 then 'A'
            when a.p = 2 then 'B'
            when a.p = 6 then 'C'
            when a.p = 7 then 'E'
       end,
       case when a.p = 2 then ((256*substr(b.fkr_86, 2, 8) + substr(b.fkr_86, 10, 3))) - 4294967296
            else ((256*substr(b.fkr_86, 2, 8) + substr(b.fkr_86, 10, 3)))
       end,
       case when a.p in (6, 7) then ((256*substr(b.p_k, 2, 8) + substr(b.p_k, 10, 3)))
            else ((256*substr(b.p_k, 2, 8) + substr(b.p_k, 10, 3)) - 4294967296)
       end,
       trim(ARH), ARH1, lpad(trim(ARH2), 2, 0), trim(ARH3), lpad(trim(ARH4), 2, 0), trim(ARH5),
       arh_date_eff_txn, trim(arh_code_fee),
       case when arh_timestamp_date is null then null
            else to_date(to_char(arh_timestamp_date, 'MM/DD/YYYY') || ' ' ||
                 to_char(case when nvl(arh_timestamp_hr, 00) > 23 then 0 else nvl(arh_timestamp_hr, 00) end) || ':' ||
                 to_char(case when nvl(arh_timestamp_min, 00) > 59 then 0 else nvl(arh_timestamp_min, 00) end) || ':' ||
                 to_char(case when nvl(arh_timestamp_sec, 00) > 59 then 0 else nvl(arh_timestamp_sec, 00) end),
                 'MM/DD/RRRR HH24:MI:SS')
       end,
       arh_amt_adj, trim(arh_code_adj), ARH_PCT_INDEX, ARH_PCT_INDEX_MAX, ARH_PCT_INDEX_MIN,
       trim(lpad(ARH6, 3, '0')), trim(lpad(ARH_CODE_INDEX, 3, '0')), ARH_PCT_FLOAT_BASE, trim(ARH_COMMENT),
       lpad(ARH_NBR_VOUCHER, 7, '0'), trim(ARH_REASON_CODE), trim(ARH_PAYABLE_IND), sysdate,
       nvl((256*substr(b.fk_r7588_r7584, 2, 8) + substr(b.fk_r7588_r7584, 10, 3)), 0)
from r7270_acct_sched a, R7584_AR_hist b
where a.p_k = b.fk_r7270_arhist
  and a.p = b.p
  and b.p in ('1', '2', '6', '7')
  and b.fkr_86 is not null

***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] *****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_JCT0007_ARHIST] has completed. Total Run Time = [5850.163300] secs, Total Idle Time = [1.300887] secs, Busy Percentage = [99.977763]
Thread [WRITER_1_*_1] created for [the write stage] of partition point [JCT0007_ARHIST_F] has completed. Total Run Time = [5795.195516] secs, Total Idle Time = [3362.131688] secs, Busy Percentage = [41.984154]

Nisheet Singh on September 15th, 2010:
Hi Ganguly, just by seeing this SQL, predicting the bottleneck is a little tricky. Right now I can only suggest that you analyze or gather stats for the tables (or partitions) r7270_acct_sched and R7584_AR_hist. If the problem still persists, then please send me the execution plan of the above SQL. Thanks!!!

Amit on October 17th, 2010:
Hello Nisheet, I am Amit and I am new to Informatica technology. I am interested in learning more about Informatica performance tuning, so can you please share some basic PPTs or case studies for Informatica performance tuning on my mail id so that I can understand better. Thanking you in advance.

Deva on December 28th, 2010:
Hello Nisheet, this is Devendra. I am new to this technology. Could you please explain briefly about session partitioning and performance tuning? I am looking forward to your valuable reply. Thanks in advance.

Mark Connelly on May 19th, 2011:
Hello Nisheet, I like your blog and think that you are doing a good job here. I wanted to let you know about HIPPO, which is our specialist Informatica performance profiling and capacity planning tool that we have developed and that has been highly rated by the Informatica Marketplace. I would be happy to arrange a free trial of the software for you to get your opinion and feedback. I think you might find it interesting because we focus not just on PowerCenter itself but on how PowerCenter is interacting with the database, network infrastructure, CPU and memory usage on the server, and much more. You can drill down from folders, through workflows and mappings, all the way to individual transformations and view their performance. Let me know if I can set that up for you, as it would be great to get your thoughts. Many thanks, Mark

Anish on June 19th, 2011:
Hi, I have a session whose source has around 8 million records. Both source and target are Oracle. There is no source qualifier query. In the mapping, each record is split into 40 records, so around 350 million records are loaded into the target.
The session runs for over 14 hours.The source through put is between 20 to 3 0. In the session log, i can see that the busy percent in the target is 100%. So i assume that the bottleneck is on the target (Writer thread). I think since the target is acting as a bottleneck here, it prevents the sou rce from having good throughput. I cannot use bulk load, because it is not possible to disable constraints of the target tables in the production enviroment. I tried increasing the COMMIT INTERVAL and the DTM BUFFER SIZE, but both did not work. I even tried NO LOGGING in the target table. Can anyone suggest a way to increase the session performance? Thanks, Anish ******************************************************************************** ********************************************************************* Aborting DTM process due to memory allocation failure Hello All, When I am running one of my workflows, the session in it fails with memory alloc ation failure (out of virtual memory). The DTM process gets aborted.
We have 2GB of RAM and my session is a simple pass through and does not contain any transformations. We have increased our RAM from 512MB to 2GB when we started seeing this error. We have also increased our virtual memory to 2GB. However, e ven after increasing the RAM and virtual memory, the problem is not resolved yet .
Is there anything we are missing? DO I need to change any setting in PowerCenter Admin console after increasing virtual memory and RAM on the system?
narra
56 posts since Oct 12, 2007 Oct 12, 2008 10:29 PM (in response to Subhashini Narra) Please, increase the size of Please, increase the size of the DTM buffer where it existed in Properties o f the failed session in the failed workflow. Like (0) Subhashini Narra Newbie Subhashini Narra 2 posts since Sep 2, 2008 Oct 13, 2008 9:31 AM (in response to VIJAYA KUMAR KAVALAKUNTLA) Aborting DTM process due to memory allocation failure Thanks for the response. I have increased the DTM buffer size. However, I get a new error now, which says to increase DTM buffer size to 4.78GB. I am tr ying to run a simple pass through session which loads 150 rows from SQL Server 2 000 DB to Oracle 10g DB. I can load 1024 rows from a different SQL Server 2000 D B to the mentioned Oracle 10g DB with no problem and the amount of memory consum ed is just .96MB.
Thanks narra Like (0) VIJAYA KUMAR KAVALAKUNTLA Newbie VIJAYA KUMAR KAVALAKUNTLA 56 posts since Oct 12, 2007 Oct 13, 2008 10:39 PM (in response to Subhashini Narra) Hi! Please, check the size Hi!
Please, check the size of the increased DTM buffer size once again. Because, for 1024 rows, the amount of memory consumed is just .96MB. Now, you ar e trying to load only 150 rows. So, you have to check with DTM Buffer size once again.
Karthi Keyan, Oct 14, 2008 11:31 PM (in response to VIJAYA KUMAR KAVALAKUNTLA)
DTM Failure: Please check in your mapping whether any of the columns has a datatype width greater than the specified default width. If so, reduce the width of that column to below the default width. I faced the same problem and this is what fixed it for me. I hope it works for you.
Thanks, Karthikeyan
********************************************************************************
Improving Mapping Performance in Informatica
1/3/2005 by ITtoolbox Popular Q&A Team for ITtoolbox, as adapted from the Informatica-L discussion group
Summary: How can I improve the performance of the mapping tool in Informatica?
Full Article:
Disclaimer: Contents are not reviewed for correctness and are not endorsed or recommended by ITtoolbox or any vendor. Popular Q&A contents include summarized information from the Informatica-L discussion unless otherwise noted.
Adapted from a response by Jonathan on Tuesday, December 21, 2004

Mapping optimization

When to optimize mappings
-------------------------
The best time in the development cycle is after system testing. Focus on mapping-level optimization only after optimizing the target and source databases.

Use the session log to identify whether the source, target or transformations are the performance bottleneck
-------------------------------------------------------------------------------------------------------------
The session log contains thread summary records:

MASTER> PETL_24018 Thread [READER_1_1_1] created for the read stage of partition point [SQ_test_all_text_data] has completed: Total Run Time = [11.703201] secs, Total Idle Time = [9.560945] secs, Busy Percentage = [18.304876].
MASTER> PETL_24019 Thread [TRANSF_1_1_1_1] created for the transformation stage of partition point [SQ_test_all_text_data] has completed: Total Run Time = [11.764368] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000].
MASTER> PETL_24022 Thread [WRITER_1_1_1] created for the write stage of partition point(s) [test_all_text_data1] has completed: Total Run Time = [11.778229] secs, Total Idle Time = [8.889816] secs, Busy Percentage = [24.523321].

If one thread has a busy percentage close to 100% and the others have significantly lower values, the thread with the high busy percentage is the bottleneck. In the example above, the session is transformation bound.
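As a rough illustration of this rule, the following Python sketch (my own addition, not part of the original article) pulls the Busy Percentage values out of the PETL thread summary records in a session log and flags the busiest thread. The regular expression, the log file name and the 90% threshold are illustrative assumptions, not anything mandated by Informatica.

# busy_threads.py - a minimal sketch, assuming the session log is a plain text file.
import re

# Matches lines such as:
# MASTER> PETL_24018 Thread [READER_1_1_1] ... Busy Percentage = [18.304876].
THREAD_RE = re.compile(
    r"Thread \[(?P<thread>[A-Z_0-9]+)\].*?Busy Percentage = \[(?P<busy>[0-9.]+)\]"
)

def busy_percentages(log_path):
    """Return {thread_name: busy_percentage} parsed from the thread summary records."""
    result = {}
    with open(log_path) as log:
        for line in log:
            match = THREAD_RE.search(line)
            if match:
                result[match.group("thread")] = float(match.group("busy"))
    return result

if __name__ == "__main__":
    stats = busy_percentages("session.log")          # hypothetical log file name
    for thread, busy in sorted(stats.items(), key=lambda item: -item[1]):
        print(f"{thread:20s} busy {busy:6.2f}%")
    if stats:
        hottest = max(stats, key=stats.get)
        if stats[hottest] > 90:                      # illustrative threshold
            print(f"Likely bottleneck: {hottest}")

Run against the example records above, this would report TRANSF_1_1_1_1 at 100% as the likely bottleneck, matching the "transformation bound" conclusion.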
Identifying Target Bottlenecks
------------------------------
The most common performance bottleneck occurs when the Informatica Server writes to a target database. You can identify target bottlenecks by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck.
Consider performing the following tasks to increase performance:
* Drop indexes and key constraints.
* Increase checkpoint intervals.
* Use bulk loading.
* Use external loading.
* Increase the database network packet size.
* Optimize the target database.

Identifying Source Bottlenecks
------------------------------
If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks:
* Filter Transformation - measure the time taken to process a given amount of data, then add an always-false filter transformation in the mapping after each source qualifier so that no data is processed past the filter transformation. You have a source bottleneck if the new session runs in about the same time.
* Read Test Session - compare the time taken to process a given set of data using the original session with that of a session based on a copy of the mapping with all transformations after the source qualifier removed and the source qualifiers connected to file targets. You have a source bottleneck if the new session runs in about the same time.
* Database query - extract the query from the session log and run it in a query tool. Measure the time taken to return the first row and the time to return all rows. If there is a significant difference between the two, you may be able to use an optimizer hint to eliminate the source bottleneck. A rough way to take these timings is sketched below.
Consider performing the following tasks to increase performance:
* Optimize the query.
* Use conditional filters.
* Increase the database network packet size.
* Connect to Oracle databases using the IPC protocol.

Identifying Mapping Bottlenecks
-------------------------------
If you determine that you do not have a source bottleneck...
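To illustrate the database-query technique above, here is a minimal Python sketch of timing "first row" versus "all rows" with a standard DB-API driver. It is my own illustration rather than part of the ITtoolbox article; the cx_Oracle driver, the connection credentials and the query text are placeholders you would replace with your own.

# query_timing.py - a sketch of timing first row vs all rows for a source query
# extracted from the session log. Driver, DSN and SQL text are placeholders.
import time
import cx_Oracle  # any DB-API 2.0 driver would do

SOURCE_QUERY = "SELECT * FROM r7270_acct_sched"   # hypothetical extracted query

def time_query(connection, sql):
    cursor = connection.cursor()
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchone()                              # time to first row
    first_row = time.perf_counter() - start
    cursor.fetchall()                              # drain the remaining rows
    all_rows = time.perf_counter() - start
    cursor.close()
    return first_row, all_rows

if __name__ == "__main__":
    conn = cx_Oracle.connect("scott", "tiger", "localhost/ORCL")  # placeholder credentials
    first, total = time_query(conn, SOURCE_QUERY)
    print(f"First row after {first:.2f}s, all rows after {total:.2f}s")
    # A large gap between the two times suggests the query itself (rather than
    # the mapping) is slow, and an optimizer hint or query rewrite may help.
    conn.close()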
********************************************************************************
Informatica: What is the Data Transformation Manager (DTM) process? How many threads does it create to process data? Explain each thread briefly.

When the workflow reaches a session, the Load Manager starts the DTM process. The DTM process is the process associated with the session task. The Load Manager creates one DTM process for each session in the workflow. The DTM process performs the following tasks:
Reads session information from the repository.
Expands the server and session variables and parameters.
Creates the session log file.
Validates source and target code pages.
Verifies connection object permissions.
Runs pre-session shell commands, stored procedures and SQL.
Creates and runs mapping, reader, writer, and transformation threads to extract, transform, and load data.
Runs post-session stored procedures, SQL, and shell commands.
Sends post-session email.

The DTM allocates process memory for the session and divides it into buffers; this is also known as buffer memory. The default memory allocation is 12,000,000 bytes. The DTM uses multiple threads to process data. The main DTM thread is called the master thread. The master thread creates and manages the other threads; for a session it can create mapping, pre-session, post-session, reader, transformation, and writer threads.
Mapping thread - one thread for each session. Fetches session and mapping information, compiles the mapping, and cleans up after session execution.
Pre- and post-session threads - one thread each to perform pre- and post-session operations.
Reader thread - one thread for each partition for each source pipeline. Reads from sources. Relational sources use relational reader threads, and file sources use file reader threads.
Transformation thread - one or more transformation threads for each partition. Processes data according to the transformation logic in the mapping.
Writer thread - one thread for each partition, if a target exists in the source pipeline. Writes to targets. Relational targets use relational writer threads, and file targets use file writer threads.
********************************************************************************
Performance improvement can be achieved on different levels: there is database tuning, improvement of the mapping, work at session level and, of course, at workflow level. One aspect of tuning is the different memory settings at session level. There are two settings regarding memory in a session, namely:
DTM Buffer Pool Size;
Default Buffer Block Size.
DTM Buffer Pool Size defines the amount of memory that the Integration Service uses as Data Transformation Manager buffer memory. The buffer is used to swap data into and out of the Integration Service. Setting this buffer pool size correctly can improve performance during momentary slowdowns. There is an optimum for this setting: increasing it will initially lead to a performance improvement, but the effect levels off. Informatica recommends a default DTM buffer size of 12,000,000 bytes. Workflow Workbench can now automatically calculate the optimum for this memory setting, in correspondence with the buffer block size, and will only change the setting when the recommended amount of buffer memory is more than the default recommended by Informatica.
Buffer Block Size depends on the record size of the different source and target tables that are used in the mapping. Ideally the buffer can transport 100 rows at once. Workflow Workbench will go through all the source and target tables in a mapping and calculate your ideal buffer block size based on the largest record size definition within both sources and targets. Again, the recommended size of 64,000 bytes is taken into account: the buffer block size is only increased when the optimal size exceeds this recommended size. Workflow Workbench will calculate these memory settings for you and can export them to the Informatica Workflow Manager. A sketch of this sizing logic follows.
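The Python sketch below mirrors the sizing logic described above; it is an illustration of the stated rules of thumb (100 rows of the largest record per block, never going below the 64,000-byte and 12,000,000-byte defaults), not Workflow Workbench code. The row sizes and source/target counts are made-up inputs, and the DTM formula is a commonly quoted sizing heuristic rather than anything stated in this article, so treat it as an assumption.

# buffer_sizing.py - a sketch of the buffer sizing rules described above.
# The row sizes, source/target counts and the DTM heuristic are assumptions.

DEFAULT_BLOCK_SIZE = 64_000        # bytes, Informatica's recommended default
DEFAULT_DTM_BUFFER = 12_000_000    # bytes, Informatica's recommended default
ROWS_PER_BLOCK = 100               # "ideally the buffer can transport 100 rows at once"

def buffer_block_size(row_sizes_bytes):
    """Block size large enough for 100 rows of the largest source/target record,
    but never smaller than the recommended default."""
    ideal = max(row_sizes_bytes) * ROWS_PER_BLOCK
    return max(ideal, DEFAULT_BLOCK_SIZE)

def dtm_buffer_size(block_size, source_count, target_count):
    """A commonly quoted heuristic (an assumption here, not from this article):
    roughly (sources + targets) * 2 buffer blocks, with a 0.9 usable factor,
    and never smaller than the recommended default."""
    blocks_needed = (source_count + target_count) * 2
    return max(int(blocks_needed * block_size / 0.9), DEFAULT_DTM_BUFFER)

if __name__ == "__main__":
    # Made-up record sizes (bytes) for the sources and targets of one mapping.
    row_sizes = [512, 800, 1_250]
    block = buffer_block_size(row_sizes)
    print(f"Buffer block size:    {block:,} bytes")
    print(f"DTM buffer pool size: {dtm_buffer_size(block, source_count=1, target_count=2):,} bytes")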
********************************************************************************
How to Tune Performance of Informatica Aggregator Transformation
DWBIGuru

Tuning Aggregator Transformation
Like the Joiner, the basic rule for tuning the aggregator is to avoid the Aggregator transformation altogether unless:
You really cannot do the aggregation in the source qualifier SQL query (e.g. a flat file source);
Fields used for aggregation are derived inside the mapping.
If you have to do the aggregation using the Informatica Aggregator, then ensure that all the columns used in the group by are sorted in the same order as the group by and that the Sorted Input option is checked in the aggregator properties. Ensuring the input data is sorted is an absolute must in order to achieve better performance, and we will soon see why.
Other things that need to be checked to increase aggregator performance are:
Check whether the Case-Sensitive String Comparison option is really required. Keeping this option checked (the default) slows down the aggregator.
Enough memory (RAM) is available to do the aggregation in memory. See the section below for details.
The aggregator cache is partitioned.

How to (and when to) set the aggregator Data and Index cache sizes
As I mentioned before, my advice is to leave the Aggregator Data Cache Size and Aggregator Index Cache Size options as Auto (the default) at the transformation level and, if required, set either of the following at the session level (under the Config Object tab) to allow Informatica to allocate enough memory automatically for the transformation:
Maximum Memory Allowed For Auto Memory Attributes
Maximum Percentage of Total Memory Allowed For Auto Memory Attributes
However, if you do have to set the Data Cache / Index Cache sizes yourself, please note that the value you set here is actually a RAM requirement (not a disk space requirement), and hence your mapping will fail if Informatica cannot allocate the entire amount in RAM at session initialization. And yes, this can happen often, because you never know what other jobs are running on the server and how much RAM those jobs are occupying while you run this one.
Having understood the risk, let us now see the benefit of manually configuring the index and data cache sizes. If you leave the index and data cache sizes on auto and Informatica does not get enough memory during the session run, your job will not fail; instead Informatica will page the data out to hard disk. Since the I/O performance of a hard disk drive is roughly 1000 times slower than RAM, paging out to disk carries a performance penalty.
By setting the data and index cache sizes manually, you can ensure that Informatica reserves this memory at the beginning of the session run, so that the cache is not paged out to disk and the entire aggregation actually takes place in RAM. Do this at your own risk. Manually configuring the index and data cache sizes can be beneficial if consistent session performance is a higher priority than session stability and operational steadiness; essentially, you risk your operations (since it creates a higher chance of session failure) to obtain optimized performance.
The best way to determine the data and index cache sizes is to check the session log of an already executed session: the session log clearly shows these sizes in bytes. But the sizes depend on the row count, so keep some headroom (around 20% in most cases) on top of these values and use the result for the configuration (a small sketch of this calculation appears below).
The other way to determine the index and data cache sizes is, of course, to use the built-in cache-size calculator accessible at the session level.
Fig. Aggregator - Cache Calculator
Using the Informatica Aggregator cache size calculator is a bit difficult (and rather inaccurate). The reason is that to calculate the cache size properly you need to know the number of groups the aggregator is going to process. The definition of the number of groups is:
No. of Groups = product of the cardinalities of the group-by columns
This means that if you group by store and product, and there are 150 distinct stores and 10 distinct products in total, then the number of groups will be 150 x 10 = 1500. This is inaccurate because, in most cases, you cannot ascertain how many distinct stores and products will actually arrive on each load. You might have 150 stores and 10 products, but there is no guarantee that every product will appear in every load. Hence the cache size you determine with this method is quite approximate.
You can, however, calculate the cache size with both of the methods discussed here and take the maximum of the two values to be on the safe side.
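As a small illustration of the two estimation methods above, here is a Python sketch, mine rather than the article author's, that (a) adds roughly 20% headroom to the data and index cache sizes reported in a previous session log and (b) estimates the number of groups as the product of the group-by column cardinalities. All input numbers are invented.

# aggregator_cache_estimate.py - a sketch of the two estimation methods above.
# All input numbers are invented for illustration.
from math import prod

HEADROOM = 1.20   # ~20% buffer on top of the sizes seen in the session log

def padded_cache_sizes(data_cache_bytes, index_cache_bytes, headroom=HEADROOM):
    """Method 1: take the sizes reported in an earlier session log and add headroom."""
    return int(data_cache_bytes * headroom), int(index_cache_bytes * headroom)

def number_of_groups(cardinalities):
    """Method 2: No. of Groups = product of the group-by column cardinalities."""
    return prod(cardinalities)

if __name__ == "__main__":
    # Sizes (in bytes) as reported in the log of a previous run - made-up values.
    data_cache, index_cache = padded_cache_sizes(45_000_000, 6_000_000)
    print(f"Configure data cache  ~ {data_cache:,} bytes")
    print(f"Configure index cache ~ {index_cache:,} bytes")

    # Grouping by store (150 distinct) and product (10 distinct) gives 1500 groups,
    # the figure you would feed into the built-in cache-size calculator.
    print(f"Estimated number of groups: {number_of_groups([150, 10])}")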
******************************************************************************** *********************************************************************