01 Ab Initio Basics
Course Content
Ab Initio Architecture
Overview of Graph
Ab Initio Functions
Basic Components
Partitioning and De-partitioning
Case Studies
Course Objective
Ab Initio Architecture
July 6, 2010
Introduction
Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
"Ab initio" is Latin for "from the beginning"
Designed to support the largest and most complex business applications:
- Data warehousing
- Batch processing
- Click-stream analysis
- Data movement
- Data transformation
[Figure: Ab Initio architecture: GDE; Component Suite (Partitioners, Transforms, ...); Shell; User Components]
GDE
Ability to graphically design batch programs comprising Ab Initio components, connected by pipes
Ability to test run the graphical design and monitor its progress
Ability to generate a shell script or batch file from the graphical design
Co>Operating System
[Figure: on each host machine, user programs and the Ab Initio built-in component programs (Partitions, Transforms, etc.) run on the Co>Operating System, which sits on top of the native operating system (Unix, Windows NT)]
On a typical installation, the Co>Operating System is installed on a Unix or Windows NT server, while the GDE is installed on a Pentium PC.
Co>Operating System
Layered on top of the operating system
Unites a network of computing resources into a data-processing system with scalable performance
The Co>Operating System runs on:
- Sun Solaris 2.6, 7, and 8 (SPARC)
- IBM AIX 4.2 and 4.3
- Hewlett-Packard HP-UX 10.20, 11.00, and 11.11
- Siemens Pyramid Reliant UNIX Release 5.43
- IBM DYNIX/ptx 4.4.6, 4.4.8, 4.5.1, and 4.5.2
- Silicon Graphics IRIX 6.5
- Red Hat Linux 6.2 and 7.0 (x86)
- Windows NT 4.0 (x86) with SP 4, 5, or 6
- Windows NT 2000 (x86) with no service pack or SP1
- Digital UNIX V4.0D (Rev. 878) and 4.0E (Rev. 1091)
- Compaq Tru64 UNIX Versions 4.0F (Rev 1229) and 5.1 (Rev 732)
- IBM OS/390 Version 2.8, 2.9, and 2.10
- NCR MP-RAS 3.02
The GDE is a GUI for building applications
The GDE can talk to the Co>Operating System using several protocols, such as Telnet, Ab Initio / Rexec, and FTP
The Co>Operating System and the GDE have independent release mechanisms
The Co>Operating System can be upgraded without changing the GDE release
Note: During deployment, GDE sets AB_COMPATIBILITY to the Co>Operating System version number. So, a change in the Co>Operating System release requires a re-deployment
Overview of Graph
A Component
A program that does a specific type of job, controlled by its parameter settings
A Component Organizer
Groups all components under different functional categories
[Figure: sample graph labelling its parts: components (Score, Select), ports (out*, deselect*), flows, datasets (Customers, Good Customers, Other Customers), and layout (L1)]
Types of Datasets
Datasets can be of the following types:
Input Datasets
- Input Table (itable): used to unload/read data directly from a database table into the Ab Initio graph as input
- Input File: a data file acting as input to the Ab Initio graph. Supports formats such as flat files and XML files. These files can be serial or multifile
Output Datasets
- Output Table (otable): used to load data directly into a database table
- Output File: a data file acting as output of the Ab Initio graph. Supports formats such as flat files and XML files. These files can be serial or multifile
Databases that can be connected as direct input/output include Oracle, Teradata, Netezza, DB2, MS SQL, Red Brick, Sybase, etc.
May 18, 2010
Runtime Environment
A graph can be executed from the GDE itself or from the back-end
A graph can be deployed to the back-end server as a Unix shell script or a Windows NT batch file
The deployed shell script or batch file can be executed at the back-end
A sample graph
Layout
1. Layout determines the location of a resource.
2. A layout is either serial or parallel.
3. A serial layout specifies one node and one directory.
4. A parallel layout specifies multiple nodes and multiple directories. It is permissible for the same node to be repeated.
5. The location of a dataset is one or more places on one or more disks.
6. The location of a computing component is one or more directories on one or more nodes. By default, the node and directory are unknown.
7. Computing components propagate their layouts from neighbors, unless specifically given a layout by the user.
[Figure: layout example with serial and parallel datasets (file on Host W, file on Host X, files on Host X)]
Q: On which host(s) do the processing components run? Host W, Host X, Host Y, or Host Z?
Controlling Layout
- Propagate (default)
- Bind layout to that of another component
- Use layout of URL
- Construct layout manually
- Run on these hosts
- Database components can use the same layout as a database table
Phase of a Graph
Phases are used to break up a graph into blocks for performance tuning.
Breaking an application into phases limits the contention for:
- Main memory
- Processors
Breaking an application into phases costs:
- Disk space
The temporary files created by phasing are deleted at the end of the phase, regardless of whether the run was successful.
Phase 0
Phase 1
View Phase
Set Phase
[Figure sequence: Client (GDE), Host, and Agent]
Ab Initio Functions
or
DML)
Field names consist of letters (a-z, A-Z), digits (0-9), and underscores (_), and are case sensitive.
Keywords/reserved words cannot be used as field names.
Keywords/Reserved Words
Smith Spade
Jones West Black
DML BLOCK
end
string(6)
Delimiters
Built-in Functions
Ab Initio built-in functions are DML expressions that
Date Functions
date_day date_day_of_month date_day_of_week date_day_of_year date_month date_month_end date_to_int date_year datetime_add datetime_day datetime_day_of_month datetime_day_of_week datetime_day_of_year datetime_difference datetime_hour datetime_minute datetime_second datetime_microsecond datetime_month datetime_year
first_defined is_defined is_failure is_valid size_of write_to_log write_to_log_file
Lookup Functions
lookup lookup_count lookup_local lookup_count_local lookup_match lookup_next lookup_next_local
Math Functions
ceiling decimal_round decimal_round_down decimal_round_up floor decimal_truncate math_abs math_acos math_asin math_atan math_cos math_cosh math_exp math_finite math_log math_log10 math_tan math_pow math_sin math_sinh math_sqrt math_tanh
Miscellaneous Functions
allocate ddl_name_to_dml_name ddl_to_dml hash_value next_in_sequence number_of_partitions printf random raw_data_concat raw_data_substring scanf_float scanf_int scanf_string sleep_for_microseconds this_partition translate_bytes unpack_nibbles
String Functions
char_string decimal_lpad decimal_lrepad decimal_strip is_blank is_bzero re_index re_replace string_char string_compare string_concat string_downcase string_filter string_lpad string_length string_upcase string_trim string_substring re_replace_first string_replace_first string_pad string_ltrim string_lrtrim
Lookup File
Represents one or more serial files or a multifile
The file you want to use as a Lookup File must fit into main memory
Keeping the file in memory allows a transform function to retrieve records much more quickly than it could if they were stored on disk
Lookup File associates key values with corresponding data values to index records and retrieve them
Lookup parameters:
- Key: names of the key fields against which Lookup File matches its arguments
- Record Format: the record format you want Lookup File to use when returning data records
Lookup functions are used to access Lookup Files. The first argument to these functions is the name of the Lookup File; the remaining arguments are values to be matched against the fields named by the Key parameter.
lookup(file-name, key-expression)
The lookup functions return a record that matches the key values and has the format given by the Record Format parameter.
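The idea of an in-memory, key-indexed lookup can be sketched in Python as follows. This is only an illustration, not Ab Initio code: the names `build_lookup`, `lookup`, and `lookup_count`, the sample customer records, and the use of dictionaries as records are all assumptions made for the example.

```python
from collections import defaultdict

def build_lookup(records, key_fields):
    """Index records in memory by the values of their key fields (hypothetical helper)."""
    index = defaultdict(list)
    for rec in records:
        index[tuple(rec[f] for f in key_fields)].append(rec)
    return index

def lookup(index, *key_values):
    """Return the first record matching the key values, or None if there is none."""
    matches = index.get(tuple(key_values))
    return matches[0] if matches else None

def lookup_count(index, *key_values):
    """Return the number of records matching the key values."""
    return len(index.get(tuple(key_values), []))

# Made-up sample data: the "lookup file" is keyed on custid.
customers = [
    {"custid": 1, "city": "Boston"},
    {"custid": 2, "city": "Austin"},
    {"custid": 2, "city": "Dallas"},
]
customer_lookup = build_lookup(customers, ["custid"])

print(lookup(customer_lookup, 2))
print(lookup_count(customer_lookup, 2))
```

Because the whole index lives in a dictionary, each retrieval is a single hash probe, which is the reason the file has to fit in main memory.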
Storage Methods
- Serial lookup: lookup(), whole file replicated to each partition
- Parallel lookup: lookup_local(), file partitions held separately
Lookup Functions (the first argument of each function is the File Label)
- lookup(): returns a data record from a Lookup File that matches the values of the expression argument
- lookup_count(): returns the number of matching data records in a Lookup File
- lookup_next(): returns successive data records from a Lookup File
- lookup_local(): returns a data record from a partition of a Lookup File
- lookup_count_local(): same as lookup_count(), but for a single partition
- lookup_next_local(): same as lookup_next(), but for a single partition
NOTE: Data needs to be partitioned on the same key before using the lookup local functions
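The note about partitioning can be illustrated with a small Python sketch. It assumes a simple hash-modulo partitioner and made-up customer and transaction records; `partition_by_key` and `lookup_local` here are hypothetical Python helpers, not the Ab Initio functions themselves. The point is that a partition-local lookup only finds its match when both datasets were partitioned on the same key.

```python
def partition_by_key(records, key, n):
    """Assign each record to partition hash(key value) % n (assumed partitioning rule)."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(rec[key]) % n].append(rec)
    return parts

def lookup_local(lookup_partition, key, value):
    """Search only the given partition of the lookup data."""
    for rec in lookup_partition:
        if rec[key] == value:
            return rec
    return None

N = 3
customers = [
    {"custid": 1, "city": "Boston"},
    {"custid": 2, "city": "Austin"},
    {"custid": 3, "city": "Dallas"},
    {"custid": 4, "city": "Denver"},
]
transactions = [
    {"custid": 4, "amount": 10},
    {"custid": 1, "amount": 7},
    {"custid": 2, "amount": 3},
]

# Both the lookup data and the main data are partitioned on custid.
lookup_parts = partition_by_key(customers, "custid", N)
data_parts = partition_by_key(transactions, "custid", N)

enriched = []
for p in range(N):
    for txn in data_parts[p]:
        match = lookup_local(lookup_parts[p], "custid", txn["custid"])
        enriched.append({**txn, "city": match["city"] if match else None})

print(enriched)
```

If the transactions were partitioned on a different key (or round-robin), some of them would land in a partition that does not hold their customer, and the local lookup would return nothing.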
local-variable-declaration-list
Variable-list Rule-list end;
A transform function definition consists of:
1. A list of output variables followed by a double colon (::)
2. A name for the transform function
3. A list of input variables
4. An optional list of local variable definitions
5. An optional list of local statements
6. A series of rules
The list of local variable definitions, if any, must precede the list of statements. The list of statements, if any, must appear before the list of rules.
Example:
1. temp :: trans1(in) =
   begin
     temp.sum :: 0;    /* local variable declaration with field sum */
   end;
2. out, temp :: trans2(temp, in) =
   begin
     temp.sum :: temp.sum + in.amount;
     out.city :: in.city;
     out.sum :: temp.sum;
   end;
Basic Components
Filter by Expression
Reformat
Redefine Format
Replicate
Join
Sort
Rollup
Aggregate
Dedup Sorted
Reformat
1. Reads records from the in port
2. Changes the record format by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records
3. Writes records to the out ports if the function returns a success status
4. Writes records to the reject ports, with a descriptive message to the error port, if the function returns NULL
Diagnostic Ports:
REJECT: input records that caused an error are sent to this port
ERROR: the associated error message is written to this port
LOG
Parameters of Reformat Component
Count: an integer from 1 to 20 that sets the number of each of the following. The default value is 1.
1. out ports
2. error ports
3. reject ports
4. transform parameters
transformn: either the name of a file or a transform string containing a transform function corresponding to out port n.
Reject-Threshold: the component's tolerance for reject events.
- Abort on first reject: the component stops the execution of the graph at the first reject event it generates.
- Never Abort: the component does not stop the execution of the graph, no matter how many reject events it generates.
- Use Limit/Ramp: the component uses the settings in the limit and ramp parameters to determine how many reject events to allow before it stops the execution of the graph.
Limit: an integer that represents a number of reject events.
Ramp: a real number that represents a rate of reject events in the number of records processed.
Tolerance value = limit + ramp * total number of records read
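The tolerance formula can be checked with a few lines of Python. This is a minimal sketch: the function name `should_abort` is an assumption, and the abort rule (stop once the number of reject events exceeds the tolerance value) follows the description given for the components in this course.

```python
def should_abort(rejects, records_read, limit, ramp):
    """Return True once the reject count exceeds limit + ramp * records read."""
    tolerance = limit + ramp * records_read
    return rejects > tolerance

# Limit = 0, Ramp = 0.0: the first reject exceeds the tolerance of 0.
print(should_abort(rejects=1, records_read=100, limit=0, ramp=0.0))

# Limit = 1, Ramp = 0.01: after 100 records the tolerance is 1 + 0.01 * 100 = 2.
print(should_abort(rejects=2, records_read=100, limit=1, ramp=0.01))
print(should_abort(rejects=3, records_read=100, limit=1, ramp=0.01))

# Limit = 1, Ramp = 1: the tolerance always exceeds the number of records read.
print(should_abort(rejects=100, records_read=100, limit=1, ramp=1))
```

Because the ramp term grows with the number of records read, a ramp-based threshold tolerates more absolute rejects as the run gets longer, while a limit-only threshold stays fixed.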
Typical Limit and Ramp settings
- Limit = 0, Ramp = 0.0: abort on any error
- Limit = 50, Ramp = 0.0: abort after 50 errors
- Limit = 1, Ramp = 0.01: abort if more than 2 in 100 records cause an error
- Limit = 1, Ramp = 1: never abort
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
Example of Reformat
The following is the data of the Input file:
In this example, Reformat has two transform functions, each of which writes output to an out port.
Reformat uses the following transform function to write output to out port out0:
Reformat uses the following transform function to write output to out port out1:
Filter by Expression
1. Reads record from the in port
2. Applies the expression in the select_expr parameter to each record. If the expression returns:
- Non-0 value: it writes the record to the out port
- 0: it writes the record to the deselect port; if the deselect port is not connected, it discards the record
- NULL: it writes the record to the reject port and a descriptive error message to the error port
3. Filter by Expression stops the execution of graph when the number of reject events exceeds the tolerance value.
Ports:
IN: records enter the component through this port
OUT: success records are written to this port
DESELECT: records for which the expression returns 0 are written to this port
Diagnostic Ports:
REJECT: input records that caused an error are sent to this port
ERROR: the associated error message is written to this port
LOG: logging records are sent to this port
Parameters of Filter by Expression Component:
select_expr: the filter condition for input data records.
Reject-Threshold: the component's tolerance for reject events.
- Abort on first reject: the component stops the execution of the graph at the first reject event it generates.
- Never Abort: the component does not stop the execution of the graph, no matter how many reject events it generates.
- Use Limit/Ramp: the component uses the settings in the limit and ramp parameters to determine how many reject events to allow before it stops the execution of the graph.
Limit: an integer that represents a number of reject events.
Ramp: a real number that represents a rate of reject events in the number of records processed.
Tolerance value = limit + ramp * total number of records read
Typical Limit and Ramp settings
- Limit = 0, Ramp = 0.0: abort on any error
- Limit = 50, Ramp = 0.0: abort after 50 errors
- Limit = 1, Ramp = 0.01: abort if more than 2 in 100 records cause an error
- Limit = 1, Ramp = 1: never abort
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False. The default value is False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
For example, suppose Filter by Expression uses the following filter expression:
Gender == "F" || income > 200000
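The routing rules of Filter by Expression (non-0 to out, 0 to deselect, NULL to reject plus an error message) can be sketched in Python. The record layout, the lowercase field names, the helper names, and the choice to model NULL as `None` are assumptions made for this illustration; the expression mirrors the gender/income example.

```python
def select_expr(rec):
    """Python stand-in for the example expression; returns None (NULL) if a field is missing."""
    if rec.get("gender") is None or rec.get("income") is None:
        return None
    return int(rec["gender"] == "F" or rec["income"] > 200000)

def filter_by_expression(records, expr):
    """Route each record to out, deselect, or reject based on the expression result."""
    out, deselect, reject, error = [], [], [], []
    for rec in records:
        result = expr(rec)
        if result is None:
            reject.append(rec)
            error.append(f"expression evaluated to NULL for record {rec}")
        elif result != 0:
            out.append(rec)
        else:
            deselect.append(rec)
    return out, deselect, reject, error

# Made-up sample records.
people = [
    {"name": "A", "gender": "F", "income": 50000},
    {"name": "B", "gender": "M", "income": 250000},
    {"name": "C", "gender": "M", "income": 90000},
    {"name": "D", "gender": None, "income": 10000},
]
out, deselect, reject, error = filter_by_expression(people, select_expr)
print([r["name"] for r in out], [r["name"] for r in deselect], [r["name"] for r in reject])
```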
Redefine Format
1. Redefine Format copies data records from its input to its output without changing the values in the data records.
2. Reads records from the in port.
3. Writes the data records to the out port with the fields renamed according to the record format of the out port.
Parameters: None
personal_info; salary;
Replicate
Arbitrarily combines all the data records it receives into a single flow and writes a copy of that flow to each of its output flows
Example of Replicate
Suppose you want to aggregate a flow of records and also send the records to another computer. You can accomplish this by using the Replicate component.
Aggregate
Reads records from the in port
If you have defined the select parameter, it applies the expression in the select parameter to each record. If the expression returns:
- Non-0 value: it processes the record
- 0: it does not process the record
- NULL: it writes a descriptive error message to the error port and stops the execution of the graph
If you do not supply an expression for the select parameter, Aggregate processes all the records on the in port.
Diagnostic Ports:
REJECT: input records that caused an error are written to this port
ERROR: the associated error message is written to this port
LOG: logging records are written to this port
Parameters of Aggregate Component:
Sorted-input:
- Input must be sorted or grouped: Aggregate requires grouped input, and the max-core parameter is not available. This is the default.
- In memory: Input need not be sorted: Aggregate accepts ungrouped input, and requires the use of the max-core parameter.
Max-core: maximum memory usage in bytes.
Key: name of the key field Aggregate uses to group the data records.
Transform: either the name of the file containing the transform function, or the transform string.
Select: filter for data records before aggregation.
Reject-Threshold: the component's tolerance for reject events.
- Abort on first reject: the component stops the execution of the graph at the first reject event it generates.
- Never Abort: the component does not stop the execution of the graph, no matter how many reject events it generates.
- Use Limit/Ramp: the component uses the settings in the limit and ramp parameters to determine how many reject events to allow before it stops the execution of the graph.
Limit: an integer that represents a number of reject events.
Ramp: a real number that represents a rate of reject events in the number of records processed.
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False. The default value is False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
log_intermediate: indicates how often you want the component to send an intermediate record to its log port.
Example of Aggregate
Aggregate uses the following key specifier to group the data:
Aggregate uses the following transform function to write output:
After processing, the graph produces the following Output File:
Sort
The Sort component sorts and merges data records. It:
1. Reads records from all the flows connected to the in port until it reaches the number of bytes specified in the max-core parameter
2. Sorts the records and writes the results to a temporary file on disk
3. Repeats this procedure until it has read all the records
4. Merges all the temporary files, maintaining the sort order
5. Writes the result to the out port
Ports:
1. IN: records are read from this port
2. OUT: records are written to this port after sorting
Parameters:
i. Key: names of the key fields and the sequence specifier you want Sort to use when it orders data records
ii. Max-core: maximum memory usage in bytes. When Sort reaches the number of bytes specified in the max-core parameter, it sorts the records it has read and writes a temporary file to disk.
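The read, sort, spill, and merge cycle described for Sort is the classic external merge sort. A minimal Python sketch follows; it is not how Ab Initio implements Sort. For simplicity the memory bound is a record count (`max_records`, an assumed stand-in for max-core, which is expressed in bytes), and the helper names are hypothetical.

```python
import heapq
import pickle
import tempfile

def _write_run(sorted_records):
    """Spill one sorted run to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile()
    for rec in sorted_records:
        pickle.dump(rec, f)
    f.seek(0)
    return f

def _read_run(f):
    """Yield the records of one spilled run in order."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def external_sort(records, key, max_records):
    """Sort records using at most max_records in memory per run, then merge the runs."""
    runs, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_records:
            runs.append(_write_run(sorted(buffer, key=key)))
            buffer = []
    if buffer:
        runs.append(_write_run(sorted(buffer, key=key)))
    merged = list(heapq.merge(*(_read_run(f) for f in runs), key=key))
    for f in runs:
        f.close()
    return merged

print(external_sort([5, 3, 8, 1, 9, 2], key=lambda x: x, max_records=2))
```

A smaller memory bound produces more temporary runs and therefore more disk I/O in the merge step, which is the trade-off the max-core parameter controls.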
Join
1. Reads records from multiple input ports
2. Operates on records with matching keys using a multi-input transform function
3. Writes the result to the output ports
Parameters of Join:
Count: an integer from 2 to 20 specifying the number of each of the following ports and parameters. Default is 2.
- In ports
- Unused ports
- Reject ports
- Error ports
- Record-required parameter
- Dedup parameter
- Select parameter
- Override-key parameter
Key: names of the fields in the input records that must have matching values for Join to call the transform function
Sorted-input:
- Input must be sorted: Join requires sorted input, and the maintain-order parameter is not available. This is the default.
- In memory: Input need not be sorted: Join accepts unsorted input, and permits the use of the maintain-order parameter.
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False. The default value is False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
log_intermediate: indicates how often you want the component to send an intermediate record to its log port.
Max-core: maximum memory usage in bytes.
Transform: either the name of the file containing the transform function, or the transform string.
Selectn: filter for data records before joining. One per inn port.
Reject-Threshold: the component's tolerance for reject events.
- Abort on first reject: the component stops the execution of the graph at the first reject event it generates.
- Never Abort: the component does not stop the execution of the graph, no matter how many reject events it generates.
- Use Limit/Ramp: the component uses the settings in the limit and ramp parameters to determine how many reject events to allow before it stops the execution of the graph.
Limit: an integer that represents a number of reject events.
Ramp: a real number that represents a rate of reject events in the number of records processed.
Driving: number of the port to which you connect the driving input. The driving input is the largest input; Join reads all the other inputs into memory. The driving parameter is only available when the sorted-input parameter is set to In memory: Input need not be sorted. Default is 0.
Max-memory: maximum memory usage in bytes before Join writes temporary files to disk. Only available when the sorted-input parameter is set to Input must be sorted.
Maintain-order: set to True to ensure that records remain in the original order of the driving input. Only available when the sorted-input parameter is set to In memory: Input need not be sorted. Default is False.
Override-keyn: alternative names for the key fields for a particular inn port. Default value is 0.0
Dedupn: set the dedupn parameter to True to remove duplicates from the corresponding inn port before joining. Default is False, which does not remove duplicates.
join-type: choose from the following:
- Inner join: sets the record-requiredn parameters for all ports to True. Inner join is the default.
- Outer join: sets the record-requiredn parameters for all ports to False.
- Explicit: allows you to set the record-requiredn parameter for each port individually.
record-requiredn: this parameter is available only when the join-type parameter is set to Explicit. There is one record-requiredn parameter per inn port.
- When there are 2 inputs, set record-requiredn to True for the input port for which you want to call the transform for every record, regardless of whether there is a matching record on the other input port.
- When there are more than 2 inputs, set record-requiredn to True when you want to call the transform only when there are records with matching keys on all input ports for which record-requiredn is True.
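How the join type maps onto the per-port record-required flags can be sketched for the two-input case in Python. This is an illustrative in-memory join, not the component's implementation; the function name `join`, the flag names, the `custid` key, and the sample data are assumptions. A missing side is passed to the transform as `None`.

```python
from collections import defaultdict

def join(in0, in1, key, transform, record_required0=True, record_required1=True):
    """Two-input join: a key is skipped if a required input has no record for it."""
    by_key0, by_key1 = defaultdict(list), defaultdict(list)
    for rec in in0:
        by_key0[rec[key]].append(rec)
    for rec in in1:
        by_key1[rec[key]].append(rec)
    out = []
    for k in sorted(set(by_key0) | set(by_key1)):
        if k not in by_key0 and record_required0:
            continue
        if k not in by_key1 and record_required1:
            continue
        for left in by_key0.get(k, [None]):
            for right in by_key1.get(k, [None]):
                out.append(transform(left, right))
    return out

def combine(cust, order):
    """Hypothetical transform: either argument may be None in an outer join."""
    return {
        "custid": (cust or order)["custid"],
        "name": cust and cust["name"],
        "amount": order and order["amount"],
    }

customers = [{"custid": 1, "name": "Ann"}, {"custid": 2, "name": "Bob"}]
orders = [{"custid": 2, "amount": 10}, {"custid": 3, "amount": 5}]

inner = join(customers, orders, "custid", combine)                # both flags True
outer = join(customers, orders, "custid", combine, False, False)  # both flags False
explicit = join(customers, orders, "custid", combine, True, False)
print(len(inner), len(outer), len(explicit))
```

With the explicit setting shown (first flag True, second False), the transform is called for every customer record whether or not an order matches, which corresponds to a left outer join.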
Example of Join
The Sort component uses the following key to sort the data: Custid
Join uses the following transform function to write output.
Join uses the default value, Inner join, for the join-type parameter.
Given the preceding data, record formats, parameters, and transform function, the graph produces an Output File with the following data.
Rollup
Rollup performs a general aggregation of data, i.e. it reduces a group of records to a single output record.
Parameters of Rollup Component:
Sorted-input:
- Input must be sorted or grouped: Rollup requires grouped input, and the max-core parameter is not available. This is the default.
- In memory: Input need not be sorted: Rollup accepts ungrouped input, and requires the use of the max-core parameter.
Key-method: the method by which the component groups the records.
- Use key-specifier: the component uses the key specifier.
- Use key_change function: the component uses the key_change transform function.
Key: names of the key fields Rollup can use to group or to define groups of data records. If the value of the key-method parameter is Use key-specifier, you must specify a value for the key parameter.
Max-core: maximum memory usage in bytes.
Transform: either the name of the file containing the type and transform function, or the transform string.
check-sort: indicates whether or not to abort execution on the first input record that is out of sorted order. The default is True. This parameter is available only when the key-method parameter is Use key-specifier.
Limit: an integer that represents a number of reject events.
Ramp: a real number that represents a rate of reject events in the number of records processed.
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False. The default value is False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
log_intermediate: indicates how often you want the component to send an intermediate record to its log port.
Reject-Threshold: the component's tolerance for reject events.
- Abort on first reject: the component stops the execution of the graph at the first reject event it generates.
- Never Abort: the component does not stop the execution of the graph, no matter how many reject events it generates.
- Use Limit/Ramp: the component uses the settings in the limit and ramp parameters to determine how many reject events to allow before it stops the execution of the graph.
Rollup
[Figure: Rollup processing. in -> Initialize (done for the first record in each group) -> temp -> Rollup (done for every record in each group) -> Finalize -> out]
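The initialize / rollup / finalize stages can be sketched in Python for grouped input. This is an illustration only: the function names, the use of dictionaries for the temporary record, and the assumption that finalize runs once per group after its last record are choices made for the example, which sums an `amount` field per `city`.

```python
from itertools import groupby

def rollup(grouped_records, key, initialize, rollup_fn, finalize):
    """Reduce each group of consecutive same-key records to one output record."""
    out = []
    for _, group in groupby(grouped_records, key=key):
        group = list(group)
        temp = initialize(group[0])           # once, for the first record of the group
        for rec in group:
            temp = rollup_fn(temp, rec)       # for every record of the group
        out.append(finalize(temp, group[-1])) # once per group (assumed: after its last record)
    return out

# Made-up input, already grouped by city.
sales = [
    {"city": "Austin", "amount": 5},
    {"city": "Austin", "amount": 7},
    {"city": "Boston", "amount": 1},
]

totals = rollup(
    sales,
    key=lambda r: r["city"],
    initialize=lambda first: {"sum": 0},
    rollup_fn=lambda temp, rec: {"sum": temp["sum"] + rec["amount"]},
    finalize=lambda temp, last: {"city": last["city"], "sum": temp["sum"]},
)
print(totals)
```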
Dedup Sorted
Separates one specified record in each group of records from the rest of the records in the group.
Requires grouped input.
Reads a grouped flow of records from the in port. If your records are not already grouped, use the Sort component to group them.
It applies the expression in the select parameter to each record. If the expression returns:
- Non-0 value: it processes the record
- 0: it does not process the record
- NULL: it writes the record to the reject port and a descriptive error message to the error port
If you do not supply an expression for the select parameter, Dedup Sorted processes all the records on the in port.
Dedup Sorted considers any consecutive records with the same key value to be in the same group.
- If a group consists of one record, Dedup Sorted writes that record to the out port.
- If a group consists of more than one record, Dedup Sorted uses the value of the keep parameter to determine which record to write to the out port and which record or records to write to the dup port.
Ports:
IN: records enter the component through this port
OUT: the record kept from each group is written to this port
DUP: the remaining records of each group are written to this port
Diagnostic Ports:
REJECT: input records that caused an error are written to this port
ERROR: the associated error message is written to this port
LOG: logging records are written to this port
Parameters of Dedup Sorted Component:
Key: name of the key field you want Dedup Sorted to use when determining groups of data records.
keep: determines which record Dedup Sorted keeps to write to the out port.
- first: keeps the first record of the group. This is the default.
- last: keeps the last record of the group.
- unique-only: keeps only records with unique key values.
Dedup Sorted writes the remaining records of each group to the dup port.
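The three keep settings can be sketched in Python over grouped input. The function name `dedup_sorted`, the tuple records, and the returned `(out, dup)` pair are assumptions made for the illustration.

```python
from itertools import groupby

def dedup_sorted(records, key, keep="first"):
    """Split grouped records into (out, dup) according to the keep setting."""
    out, dup = [], []
    for _, group in groupby(records, key=key):
        group = list(group)
        if keep == "first":
            out.append(group[0])
            dup.extend(group[1:])
        elif keep == "last":
            out.append(group[-1])
            dup.extend(group[:-1])
        elif keep == "unique-only":
            if len(group) == 1:
                out.append(group[0])
            else:
                dup.extend(group)
        else:
            raise ValueError(f"unknown keep value: {keep}")
    return out, dup

# Made-up (key, payload) records, already grouped by key.
rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")]
for setting in ("first", "last", "unique-only"):
    print(setting, dedup_sorted(rows, key=lambda r: r[0], keep=setting))
```

Because only consecutive records are compared, input that is not grouped on the key would produce several small groups for the same key value, which is why the component requires grouped input.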
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False. The default value is False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
Multifiles
A global view of a set of ordinary files called partitions, usually located on different disks or systems
Ab Initio provides shell level utilities called m_
etc.)
Multifiles reside on Multidirectories
Each is represented using URL notation with mfile as
A Multidirectory
A directory spanning across partitions on different hosts
mfile://host1/u/jo/mfs/mydir
<.mdir>
Control Partition
A Multifile
A file spanning across partitions on different hosts
mfile://host1/u/jo/mfs/mydir/myfile.dat
//host1/u1/jo/mfs/mydir /myfile.dat
//host1/vol4/pA/mydir /myfile.dat
//host2/vol3/pB/mydir /myfile.dat
//host3/vol7/pC/mydir /myfile.dat
Control Partition
A multifile
Control file
Parallelism
Parallel Runtime Environment
Component Parallelism
When different instances of same component run on separate data sets
Sorting Customers
Sorting Transactions
Pipeline Parallelism
When multiple components run on same data set
Processing Record 99
Data Parallelism
When data is divided into segments or partitions and processes run simultaneously on each partition
Expanded View
Global View
Multifile
Partition by Round-robin
Partition by Key
Partition by Expression
Partition by Range
Partition by Percentage
Broadcast
Partition by Load Balance
Partition by Round-robin
Writes records to each partition evenly
Block-size records go into one partition before moving on to the next
Partition 1: Record1, Record4
Partition 2: Record2, Record5
Partition 3: Record3, Record6
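Round-robin distribution with a block size can be sketched in Python as follows. The function name and the `block_size` default of 1 are assumptions; with a block size of 1 the sketch reproduces the six-record, three-partition layout from the figure.

```python
def partition_round_robin(records, n, block_size=1):
    """Send block_size consecutive records to one partition, then move to the next."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[(i // block_size) % n].append(rec)
    return parts

records = ["Record1", "Record2", "Record3", "Record4", "Record5", "Record6"]
print(partition_round_robin(records, 3))
print(partition_round_robin(records, 3, block_size=2))
```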
Partition by Key
Distributes data records to its output flow partitions according to key values
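[Figure: keys 100, 91, 25, 57, 122, 213 are assigned to three partitions by hash value modulo 3, for example 25 % 3 = 1, 122 % 3 = 2, 213 % 3 = 0. Partition 0 receives 57 and 213; partition 1 receives 100, 91, and 25; partition 2 receives 122]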
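The modulo arithmetic in the figure can be reproduced with a short Python sketch. It assumes, as the figure appears to, that the hash value of each key is the key itself; the function name and the `hash_fn` parameter are hypothetical, and a real partitioner would use its own hash function.

```python
def partition_by_key(keys, n, hash_fn=lambda k: k):
    """Place each key in partition hash_fn(key) % n."""
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[hash_fn(k) % n].append(k)
    return parts

print(partition_by_key([100, 91, 25, 57, 122, 213], 3))
```

Records with equal keys always land in the same partition, which is what makes key-dependent processing (such as joins or rollups on that key) possible in parallel; how evenly the data spreads depends on the key values.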
Partition by Expression
Distributes data records to partitions according to DML expression values
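Record | DML Expression | Expression Value
99 | 99 / 40 | 2
91 | 91 / 40 | 2
57 | 57 / 40 | 1
25 | 25 / 40 | 0
22 | 22 / 40 | 0
73 | 73 / 40 | 1
Resulting partitions: partition 0 receives 25 and 22; partition 1 receives 57 and 73; partition 2 receives 99 and 91
Does not guarantee even distribution across partitions
Cascaded Filter by Expression components can be avoided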
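The example values can be reproduced in Python, treating the expression as integer division by 40 (an assumption consistent with the values shown, where 99 / 40 gives 2). The function name is hypothetical.

```python
def partition_by_expression(values, n, expr):
    """Place each value in the partition numbered by the expression result."""
    parts = [[] for _ in range(n)]
    for v in values:
        parts[expr(v)].append(v)
    return parts

print(partition_by_expression([99, 91, 57, 25, 22, 73], 3, lambda v: v // 40))
```

The expression must evaluate to a valid partition number for every record; in this sketch a value of 120 or more would raise an `IndexError` with three partitions.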
Broadcast
Combines all data records it receives into a single flow Writes copy of that flow into each output data partition
Partition0 Partition1 Partition2
A B C D E F
A B C D E F G
A B C D E F G
A B C D E F G
Increases data parallelism when connected single fan-out flow to out port
Partition by Percentage
Distributes a specified percentage of the total number of input data records to each output flow
Figure: input records Record1 to Record6 distributed across Partition0, Partition1 and Partition2
Partition by Range
Distributes data records to its output flow partitions according to the ranges of key values specified for each partition
Typically used in conjunction with the Find Splitters component for better load balancing
Key range is passed to the partitioning component through its split port
Partition by Range
Input keys: 76 10 17 9 45 2 84 98 29 73
Find Splitters output (Num_Partitions = 3): 10 73
Partition0 (key <= 10): 10 9 2
Partition1 (10 < key <= 73): 17 45 29 73
Partition2 (key > 73): 76 84 98
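The range example can be reproduced with a short Python sketch. It assumes, as the slide data suggests, that each splitter value is the inclusive upper bound of its partition; `partition_by_range` is a hypothetical name, and the splitters are passed in directly instead of being computed.

```python
from bisect import bisect_left
from typing import List, Sequence


def partition_by_range(keys: Sequence[int], splitters: Sequence[int]) -> List[List[int]]:
    """Place each key in the partition whose key range contains it.

    splitters must be sorted; N splitters define N + 1 partitions.
    """
    partitions: List[List[int]] = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        # bisect_left returns the index of the first splitter >= key,
        # so a key equal to a splitter stays in the lower partition.
        partitions[bisect_left(splitters, key)].append(key)
    return partitions


keys = [76, 10, 17, 9, 45, 2, 84, 98, 29, 73]
by_range = partition_by_range(keys, splitters=[10, 73])
```

Because every key in one partition is lower than every key in the next, sorting each partition separately yields a global ordering.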
Summary of Partitioning Methods
Method                    Key-Based  Balancing                            Uses
Partition by Round-robin  No         Even                                 Record-independent parallelism
Partition by Key          Yes        Depends on the key value             Key-dependent parallelism
Partition by Expression   Yes        Depends on data and expression       Application specific
Broadcast                 No         Even                                 Record-independent parallelism
Partition by Percentage   No         Depends on the percentage specified  Application specific
Partition by Range        Yes        Depends on splitters                 Key-dependent parallelism, Global Ordering
Departitioning Components
Gather
Concatenate
Merge
Interleave
Departitioning Components
Gather
Reads data records from the flows connected to the input port
Combines the records arbitrarily and writes them to the output
Concatenate
Concatenate appends multiple flow partitions of data records one after another
Concatenate
Reads the flows in the order in which you connect them to the in port
In the graph above, Concatenate reads Unload 1 first, then Unload 2, and so on
Parameters: None
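In Python terms, Concatenate behaves like chaining the input flows in connection order. The sketch below is only an analogy; the record values are made up, and `unload_1` and `unload_2` merely echo the component names mentioned above.

```python
from itertools import chain
from typing import Iterable, List


def concatenate(flows: Iterable[Iterable[str]]) -> List[str]:
    """Append the flows one after another, in the order they are given."""
    return list(chain.from_iterable(flows))


unload_1 = ["u1-rec1", "u1-rec2"]  # hypothetical records
unload_2 = ["u2-rec1", "u2-rec2"]  # hypothetical records
serial_flow = concatenate([unload_1, unload_2])
```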
Merge
Combines data records from multiple flow partitions that have been sorted on a key
Maintains the sort order
Parameters of Merge Component:
key: Names of the key fields and the sequence specifier you want Merge to use to maintain the order of data records while merging them
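Merging pre-sorted partitions while keeping the sort order is the same idea as a k-way merge. A minimal Python sketch using the standard library's `heapq.merge` is shown below; the record layout and the `cust_id` key are hypothetical, and each input flow is assumed to be sorted on that key already.

```python
import heapq
from operator import itemgetter
from typing import Dict, List, Sequence


def merge_sorted_flows(flows: Sequence[List[Dict[str, int]]],
                       key_field: str) -> List[Dict[str, int]]:
    """Merge flows that are each sorted on key_field into one sorted flow."""
    return list(heapq.merge(*flows, key=itemgetter(key_field)))


flows = [
    [{"cust_id": 1}, {"cust_id": 4}, {"cust_id": 7}],
    [{"cust_id": 2}, {"cust_id": 5}, {"cust_id": 8}],
    [{"cust_id": 3}, {"cust_id": 6}, {"cust_id": 9}],
]
merged = merge_sorted_flows(flows, key_field="cust_id")
```

If an input flow is not already sorted on the key, the output order is not guaranteed either, which is why Merge is described as working on sorted partitions.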
Interleave
Combines blocks of records from multiple flow partitions in round-robin fashion
Reads the number of records specified in blocksize from the first flow, then from the second flow, and so on
Writes the records to the out port
Parameters of Interleave Component:
blocksize: Number of data records Interleave reads from each flow before reading the same number of data records from the next flow
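The block-wise round-robin read can be sketched in Python as follows. The function name `interleave` is hypothetical, and the handling of flows that run out early (they are simply skipped) is an assumption of this sketch, not documented behaviour.

```python
from itertools import islice
from typing import List, Sequence


def interleave(flows: Sequence[Sequence[str]], blocksize: int = 1) -> List[str]:
    """Read blocksize records from each flow in turn until all are exhausted."""
    active = [iter(flow) for flow in flows]
    output: List[str] = []
    while active:
        still_active = []
        for flow in active:
            block = list(islice(flow, blocksize))
            output.extend(block)
            # A short (or empty) block means this flow has no more records.
            if len(block) == blocksize:
                still_active.append(flow)
        active = still_active
    return output


partitions = [["Record1", "Record4"], ["Record2", "Record5"], ["Record3", "Record6"]]
restored = interleave(partitions, blocksize=1)
```

Applied to partitions produced by a round-robin partitioner with the same block size, this restores the original record order.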
Departitioning Components
Summary of Departitioning Methods
Method       Key-Based  Ordering                          Uses
Concatenate  No         Global                            Creating serial flows from partitioned data
Interleave   No         Inverse of Round Robin partition  Creating serial flows from partitioned data
Merge        Yes        Sorted                            Creating ordered serial flows
Gather       No         Arbitrary                         Unordered departitioning
Case Studies
Case Study 1
Field Name  Data Type  Length/Delimiter  Format/Mask
Cust_id     Decimal    |(pipe)           None
amount      Decimal    \n(newline)       None
Develop an Ab Initio graph that takes the first three records of each Cust_id and sums their amounts.
In the output file, total_amount is the sum of the amounts of the first three records for each Cust_id.
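To make the expected result of Case Study 1 concrete, the logic can be sketched in Python. This is not an Ab Initio graph; it assumes pipe-delimited input lines of the form Cust_id|amount, and the sample lines below are invented for illustration.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable


def sum_first_n_per_customer(lines: Iterable[str], limit: int = 3) -> Dict[str, Decimal]:
    """Sum the amount of the first `limit` records seen for each Cust_id."""
    seen: Dict[str, int] = defaultdict(int)
    total_amount: Dict[str, Decimal] = {}
    for line in lines:
        cust_id, amount = line.rstrip("\n").split("|")
        if seen[cust_id] < limit:
            seen[cust_id] += 1
            total_amount[cust_id] = total_amount.get(cust_id, Decimal(0)) + Decimal(amount)
    return total_amount


sample_lines = ["1|10\n", "1|20\n", "2|5\n", "1|30\n", "1|40\n", "2|7\n"]  # hypothetical
totals = sum_first_n_per_customer(sample_lines)
```

The fourth record for customer 1 (amount 40) is ignored because only the first three records per Cust_id count towards total_amount.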
Case Study 2
Consider the BP_PRODUCT file, which contains the following fields:
Field Name       Data Type  Length/Delimiter  Format/Mask
product_id       Decimal    |(pipe)           None
product_code     String     |(pipe)           None
plan_details_id  Decimal    |(pipe)           None
plan_id          Decimal    |(pipe)           None
Sample data:
product_id  product_code  plan_details_id  plan_id
111         OPS           11111            147
222         NULL          12121            154
111         VB            12312            324
999         PCAT          23412            148
666         VB            34212            476
First, filter out the records where product_code is NULL. Then save the remaining data in three output files: the first contains the records with product_code OPS, the second those with PCAT, and the third those with VB.
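The filter-then-split logic of Case Study 2 can be sketched in Python as follows. This is an illustration, not an Ab Initio graph: treating the literal text NULL or an empty value as null is an assumption, and the sample rows use the product codes from the slide with the other columns omitted.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]


def route_by_product_code(rows: Sequence[Row],
                          codes: Sequence[str] = ("OPS", "PCAT", "VB")) -> Dict[str, List[Row]]:
    """Drop rows with a null product_code, then group the rest by code."""
    outputs: Dict[str, List[Row]] = {code: [] for code in codes}
    for row in rows:
        code = row.get("product_code")
        if code is None or code in ("", "NULL"):
            continue  # filtered out: product_code is null
        if code in outputs:
            outputs[code].append(row)
    return outputs


rows = [{"product_code": c} for c in ["OPS", "NULL", "VB", "PCAT", "VB"]]
routed = route_by_product_code(rows)
```

Doing the routing in a single pass is the same idea as using one Partition by Expression in place of several cascaded filters.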
Case Study 3
In a retail shop, the customer_master file contains the details of all the existing customers. It consists of the following fields:
Field Name    Data Type  Length/Delimiter  Format/Mask
Cust_id       String     |(pipe)           None
Cust_name     String     |(pipe)           None
cust_address  String     |(pipe)           None
newline       None       \n(newline)       None
Sample data:
Cust_id  Cust_name   cust_address
124343   D Banerjee  Kolkata
347492   A Bose      Kolkata
560124   C Tarafdar  Kolkata
439684   W Ganguly   Durgapur
An input file is received on a daily basis detailing all the transactions of that day. The file contains the following fields:
Field Name     Data Type  Length/Delimiter  Format/Mask
Cust_id        String     |(pipe)           None
Cust_name      String     |(pipe)           None
cust_address   String     |(pipe)           None
purchase_date  Date       |(pipe)           YYYYMMDD
product_name   String     |(pipe)           None
quantity       number     4                 None
amount         number     8                 None
new_line       none       \n(newline)       none
Develop an Ab Initio graph that accepts the input transaction details file and does the following:
1) If it is a new customer record, insert the details in the output file.
2) If it is an existing customer record and cust_address has not changed, do nothing.
3) If it is an existing customer record and cust_address has changed, update it in the output file.
Field Name     Data Type  Length/Delimiter  Format/Mask
Cust_name      String     |(pipe)           None
cust_address   String     |(pipe)           None
Purchase_date  number     |(pipe)           YYYYMMDD
product_name   String     |(pipe)           None
Total_sales    number     |(pipe)           none
newline        None       \n(newline)       None
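The three insert / do nothing / update rules of Case Study 3 can be sketched in Python. This is only an illustration of the rules, not an Ab Initio graph: the master rows come from the sample data, while the transaction rows, the customer 900001 and the function name `apply_transactions` are invented for the example.

```python
from typing import Dict, Sequence

Customer = Dict[str, str]


def apply_transactions(master: Dict[str, Customer],
                       transactions: Sequence[Customer]) -> Dict[str, Customer]:
    """Return a new customer master after applying the day's transactions."""
    updated = {cust_id: dict(row) for cust_id, row in master.items()}
    for txn in transactions:
        cust_id = txn["Cust_id"]
        existing = updated.get(cust_id)
        if existing is None:
            # Rule 1: new customer, insert the details.
            updated[cust_id] = {"Cust_name": txn["Cust_name"],
                                "cust_address": txn["cust_address"]}
        elif existing["cust_address"] != txn["cust_address"]:
            # Rule 3: existing customer with a changed address, update it.
            existing["cust_address"] = txn["cust_address"]
        # Rule 2: existing customer, same address, nothing to do.
    return updated


master = {
    "124343": {"Cust_name": "D Banerjee", "cust_address": "Kolkata"},
    "347492": {"Cust_name": "A Bose", "cust_address": "Kolkata"},
}
transactions = [  # hypothetical transactions
    {"Cust_id": "124343", "Cust_name": "D Banerjee", "cust_address": "Kolkata"},
    {"Cust_id": "347492", "Cust_name": "A Bose", "cust_address": "Durgapur"},
    {"Cust_id": "900001", "Cust_name": "N Example", "cust_address": "Kolkata"},
]
new_master = apply_transactions(master, transactions)
```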
Queries???
Thank You!!!
bipm.services@tcs.com