Ab Initio Training
3/5/2014
DAY ONE
Introduction to Data Warehouse, ETL, and Ab Initio
Ab Initio Features
Architecture
GDE
Co>Operating System
EME
Setting up the Environment
Dataset Types and Components
Data Types and DML
Input File, Output File, Intermediate File, and Lookup File
Filter by Expression, Replicate, Reformat, and Redefine
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
ETL
Extract: reading the source data. Transform: applying business, transformation, and technical rules. Load: loading the data into the target.
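These extract/transform/load steps can be sketched as a minimal Python pipeline. All names, fields, and rules below are hypothetical illustrations, not part of any real graph:

```python
# Minimal ETL sketch: extract from an in-memory "source", apply rules, load.
# The source rows, field names, and rules are all hypothetical.

def extract():
    """Read the source data (an in-memory list standing in for a file)."""
    return [
        {"id": 1, "amount": "100.50", "country": "us"},
        {"id": 2, "amount": "-3.00", "country": "de"},
    ]

def transform(rows):
    """Apply business and technical rules: cast, filter, standardize."""
    out = []
    for row in rows:
        amount = float(row["amount"])      # technical rule: cast to number
        if amount < 0:                     # business rule: drop negative amounts
            continue
        out.append({"id": row["id"],
                    "amount": amount,
                    "country": row["country"].upper()})  # standardize casing
    return out

def load(rows, target):
    """Load the transformed data into the target (here, a list)."""
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```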
AbInitio
"Ab initio" is Latin for "from the beginning."
Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
Data warehousing
Batch processing
Click-stream analysis
Data movement
Data transformation
AbInitio Features
Transformation of disparate sources
Aggregation and other processing
Referential integrity checking
Database loading
Extraction for external processing
Aggregation and loading of data marts
Processing of just about any form and volume of data
Parallel sort/merge processing
Data transformation
Re-hosting of corporate data
Parallel execution of existing applications
Architecture
User Application
Development Environment: GDE, Shell
Component Library: built-in components, user-defined components, 3rd-party components
Ab Initio Co>Operating System
Native Operating System
EME
GDE (Graphical Development Environment)
Co>Operating System
Parallel and distributed application execution
Control
Data transport
Transactional semantics at the application level
Checkpointing
Monitoring and debugging
Parallel file management
Metadata-driven components
Co>Operating System
The Ab Initio Co>Operating System runs on:
Sun Solaris
IBM AIX
Hewlett-Packard HP-UX
Siemens Pyramid Reliant UNIX
IBM DYNIX/ptx
Silicon Graphics IRIX
Red Hat Linux
Windows NT 4.0 (x86)
Windows 2000 (x86)
Compaq Tru64 UNIX
IBM OS/390
NCR MP-RAS
EME
The EME (Enterprise Meta>Environment) is Ab Initio's repository.
Setting up the Environment
Data Types
Base: void, number, string, date, datetime
Compound: vector, record, union
DML
DML (Data Manipulation Language) is used to define the complete record structure. A record format can be defined either in grid mode or in text mode, and can be stored in a file that is referenced multiple times, or embedded directly in a component.
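A DML record format describes the layout of each record, field by field. As an illustration only (this is Python, not DML), the sketch below parses a fixed-width record the way a hypothetical format like `record string(6) name; decimal(4) amount; date("YYYYMMDD") dob; end` would describe it:

```python
# Parse a fixed-width record whose hypothetical layout mirrors a simple
# DML record format: string(6) name; decimal(4) amount; date("YYYYMMDD") dob;
from datetime import datetime

def parse_record(raw):
    return {
        "name": raw[0:6].strip(),                                  # string(6)
        "amount": int(raw[6:10]),                                  # decimal(4)
        "dob": datetime.strptime(raw[10:18], "%Y%m%d").date(),     # date("YYYYMMDD")
    }

rec = parse_record("smith 004219850312")
print(rec["name"], rec["amount"], rec["dob"])  # smith 42 1985-03-12
```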
Lookup File: Represents one or more serial files, or a multifile, of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk.
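The speedup comes from holding the lookup data in memory. A Python analogue (in real graphs the lookup is done by a DML function; this dict-based stand-in and its field names are illustrative):

```python
# Build an in-memory lookup (analogous to loading a lookup file),
# then resolve keys in O(1) instead of scanning a file on disk.
lookup_rows = [
    {"country_code": "US", "country_name": "United States"},
    {"country_code": "DE", "country_name": "Germany"},
]
lookup = {row["country_code"]: row for row in lookup_rows}

def lookup_country(code):
    """Return the matching record's name, or None if the key is absent."""
    row = lookup.get(code)
    return row["country_name"] if row else None

print(lookup_country("DE"))   # Germany
print(lookup_country("XX"))   # None
```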
Filter by Expression: Filters the data based on an expression that identifies only the records you need. Can also be used for data validation.
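In Python terms, Filter by Expression behaves like a predicate applied to each record, with non-matching records separable as rejects for validation (the expression and fields below are hypothetical):

```python
# Filter-by-Expression sketch: records matching the expression go to the
# selected output; the rest can be treated as rejects for validation.
records = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}, {"id": 3, "qty": 7}]

expression = lambda r: r["qty"] > 0   # hypothetical filter expression

selected = [r for r in records if expression(r)]
rejected = [r for r in records if not expression(r)]

print([r["id"] for r in selected])   # [1, 3]
print([r["id"] for r in rejected])   # [2]
```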
Replicate: Used when you want to make multiple copies of a flow for separate processing.
Reformat: Changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records. It manipulates one record at a time and performs work such as validation and cleansing, e.g. deleting bad values, setting default values, standardizing field formats, or rejecting records with invalid dates.
Transformation rules are defined in the transform0 parameter. A common use of the Reformat component is to clean input data so that all of the records conform to the same convention.
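What a Reformat transform does per record can be sketched as a one-in, one-out function (the record layout and field names are hypothetical):

```python
# Reformat sketch: one record in, one record out -- drop a field,
# add a derived field, standardize another.
def reformat(rec):
    return {
        "id": rec["id"],
        "name": rec["name"].strip().title(),   # standardize field format
        "total": rec["price"] * rec["qty"],    # add a combined/derived field
        # "internal_flag" is deliberately dropped from the output format
    }

row = {"id": 7, "name": "  ada lovelace ", "price": 2.5, "qty": 4,
       "internal_flag": "x"}
print(reformat(row))  # {'id': 7, 'name': 'Ada Lovelace', 'total': 10.0}
```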
Redefine Format:
Copies data records from its input to its output without changing the values in the data records. Used to change or rename fields in a record format without changing the values in the records.
DAY TWO
Sort
Sort within Groups
Dedup Sorted
Rollup and Scan
Reject, Error Handling, and Debugging
Dedup Sorted: First sort the data. Then set the key for grouping in the Dedup Sorted component. Finally, choose which duplicate to keep (first, last, or unique-only).
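The same three steps, sort, group on the key, then apply a keep policy, look like this in an illustrative Python sketch (not Ab Initio code):

```python
from itertools import groupby

def dedup_sorted(records, key, keep="first"):
    """Keep one record per key group: 'first', 'last', or 'unique_only'."""
    records = sorted(records, key=key)          # step 1: sort on the key
    out = []
    for _, group in groupby(records, key=key):  # step 2: group on the key
        group = list(group)
        if keep == "first":                     # step 3: choose which to keep
            out.append(group[0])
        elif keep == "last":
            out.append(group[-1])
        elif keep == "unique_only" and len(group) == 1:
            out.append(group[0])
    return out

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
print(dedup_sorted(rows, key=lambda r: r["k"], keep="last"))
```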
Invalid records go to the reject port. The tolerance for rejects is controlled by setting the reject-threshold parameter on the component. The GDE has a built-in debugger; you can also add a watcher file to a flow to capture the records that pass through it.
DAY THREE
Join
Multifiles
Parallelism
Partition and Departition
Layout, Fan-in, Fan-out, and All-to-All
Join
Join: Combines data from two or more flows of records based on a matching key (or keys). Join handles two activities:
1. Transforming data sources with different record formats.
2. Combining data sources with the same record format.
Join types: Inner Join, Full Outer Join, Explicit Join.
Inner Join: Uses only records with matching keys on both inputs.
Full Outer Join: Uses all records from both inputs. If a record from one input has no matching record in the other input, a NULL record stands in for the missing record.
Explicit Join: Uses all records from one specified input (chosen by a true/false parameter); records with matching keys in the other inputs are optional. Again, a NULL record is used for missing records.
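The inner and full-outer semantics can be sketched in Python, with `None` standing in for the NULL record (the field names are hypothetical, and the toy assumes at most one record per key on each side):

```python
def join(left, right, key, join_type="inner"):
    """Toy key join illustrating inner vs. full-outer semantics.
    Assumes at most one record per key on each side."""
    lmap = {r[key]: r for r in left}
    rmap = {r[key]: r for r in right}
    if join_type == "inner":
        keys = sorted(lmap.keys() & rmap.keys())   # matching keys only
    else:                                          # full outer
        keys = sorted(lmap.keys() | rmap.keys())   # all keys from both sides
    # None plays the role of the NULL record for a missing side.
    return [(lmap.get(k), rmap.get(k)) for k in keys]

L = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
R = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
print(join(L, R, "id", "inner"))        # only id 2 matches on both sides
print(join(L, R, "id", "full_outer"))   # ids 1 and 3 get a None partner
```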
Multifiles
Essentially a global view of a set of ordinary files, each of which may be located anywhere the Ab Initio Co>Operating System is installed. Each partition of a multifile is an ordinary file residing in a multidirectory. A multifile is identified using URL syntax with mfile: as the protocol part, and has one control file.
Parallelism
Processing datasets in parallel gives better performance. Types of parallelism:
1. Component
2. Pipeline
3. Data
Component Parallelism: More than one component running at the same time on different data streams. Comes for free with graph programming. Limitation: scales only to the number of branches in a graph.
Pipeline Parallelism: Two or more connected components processing data record by record. Limitations: scales only to the length of branches in a graph, and some operations, such as sorting, do not pipeline.
Data Parallelism: Multiple copies of a process acting on different sets of data at the same time, processing the whole dataset more quickly by using multiple CPUs.
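Data parallelism can be sketched with a worker pool: the same function runs on different partitions of the data at once (the partitioning and workload below are illustrative; a thread pool shows the structure, though true CPU parallelism in Python would use processes):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    """The same logic, applied to each partition independently."""
    return sum(x * x for x in partition)

data = list(range(100))
# Partition the data four ways, then run copies of the process in parallel
# and combine the partial results.
partitions = [data[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_partition, partitions))
total = sum(partials)
print(total == sum(x * x for x in data))  # True
```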
The Partition by Expression component partitions data by dividing it according to a DML expression.
The Partition by Key component partitions data by grouping it by a key, like dealing cards into piles according to their suit.
3/5/2014
34
The Partition with Load Balance component partitions data by dynamic load balancing: more data goes to CPUs that are less busy, maximizing throughput.
The Partition by Percentage component distributes data so that the size of each output is proportional to a fraction of 100.
The Partition by Range component divides data evenly among nodes, based on a key and a set of partitioning ranges.
The Partition by Round-robin component distributes data evenly, in block-size chunks, across the output partitions, like dealing cards.
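Two of these strategies, by key and round-robin, can be sketched in Python (illustrative only; real partitioners stream records across a parallel layout rather than building lists):

```python
def partition_by_key(records, key, n):
    """Records with equal keys always land in the same partition."""
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r[key]) % n].append(r)
    return parts

def partition_round_robin(records, n):
    """Records are dealt evenly across partitions like cards."""
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)
    return parts

rows = [{"k": c} for c in "aabbc"]
by_key = partition_by_key(rows, "k", 2)
rr = partition_round_robin(rows, 2)
print([len(p) for p in rr])  # [3, 2]
```

Note the contrast: round-robin balances sizes but scatters keys, while by-key keeps each key group together (a prerequisite for per-partition grouping operations) at the cost of possible skew.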
The Gather component collects inputs from multiple partitions in an arbitrary order and produces a single output flow. It does not maintain sort order, but it is the most efficient departitioner.
The Interleave component collects records from many partitions in round-robin fashion; the effect is like taking a card from each player in turn to form a single deck. The Merge component collects inputs from multiple sorted partitions and maintains the sort order.
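The two departitioners can be sketched over three partitions; `heapq.merge` preserves sort order the way Merge does, while round-robin interleaving does not (the partition contents are illustrative):

```python
import heapq
from itertools import chain, zip_longest

# Three partitions, each individually sorted.
part_a = [1, 4, 9]
part_b = [2, 6]
part_c = [3, 5]

# Merge: combine sorted partitions while maintaining global sort order.
merged = list(heapq.merge(part_a, part_b, part_c))

# Interleave: take one record from each partition in turn (round-robin),
# skipping exhausted partitions.
_SKIP = object()
interleaved = [x for x in chain.from_iterable(
    zip_longest(part_a, part_b, part_c, fillvalue=_SKIP)) if x is not _SKIP]

print(merged)       # [1, 2, 3, 4, 5, 6, 9] -- sorted
print(interleaved)  # [1, 2, 3, 4, 6, 5, 9] -- round-robin, not sorted
```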
All-to-All: A flow pattern in which every partition of the upstream component can send records to every partition of the downstream component, as happens when data is repartitioned.
DAY FOUR
DBC File
Input Table, Output Table, Join with DB
Subgraphs
Phasing, Checkpoints, Recovery
Normalize, Denormalize Sorted
The dbms_version field is the version of your database.
The db_home field is the location of your database software (e.g. ORACLE_HOME).
The db_name field is the identifier for your database instance. For Oracle, this is the value of the ORACLE_SID environment variable; for SQL*Net, use @db_name.
The db_nodes field is a list of database-accessible nodes with Ab Initio installed. Note: if Oracle is on an SMP machine, you usually use one host name, unless you are running Oracle OPS (parallel), in which case you may need a list of all the nodes the database runs on.
The #user and #password comment fields list your user name and password. If your database is Oracle and you are identified externally, leave these fields as comments.
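Put together, a hypothetical .dbc file for an Oracle instance might look like the sketch below. The field names are the ones listed above; every value is a placeholder, and the exact file syntax may differ by Co>Operating System version:

```
dbms_version: 11.2
db_home: /opt/oracle/product/11.2
db_name: ORCL
db_nodes: dbhost1
#user: scott
#password: tiger
```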
Recovery: If a graph fails, checkpoints allow it to be restarted from the last completed checkpoint rather than from the beginning.
DAY FIVE
Memory Management
Deadlock
Sandbox Settings, Graph and Project Parameters
User-defined Functions and Built-in Functions
Memory Management
Deadlock
How to avoid deadlock:
Use Concatenate and Merge with care.
Use flow buffering (the GDE default for a new graph, with automatic flow buffering enabled).
Insert a phase break before the departitioner.
Don't serialize data unnecessarily; repartition instead of departitioning.
Parameters
A parameter is simply a name-value pair with a number of additional attributes. Parameters that reside in your sandbox are known as sandbox parameters; they set the context of your sandbox. Those that reside in the repository are called project parameters. Graph parameters apply only to the graph in which they are defined.