Parallel Job Tutorial
Version 8 Release 5
SC18-9889-02

Note
Before using this information and the product that it supports, read the information in Notices and trademarks.

© Copyright IBM Corporation 2006, 2010.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Chapter 1. Introduction
Module 1: Opening and running the sample job
  Lesson 1.1: Opening the sample job
  Lesson 1.2: Viewing and compiling the sample job
  Lesson 1.3: Running the sample job
  Module 1 summary
Module 2: Designing your first job
  Lesson 2.1: Creating your first job
  Lesson 2.2: Adding stages and links to the job
  Lesson 2.3: Importing metadata
  Lesson 2.4: Adding job parameters
  Module 2 summary
Module 3: Designing a transformation job
  Lesson 3.1: Designing the transformation job
  Lesson 3.2: Combining data by using a Lookup stage
  Lesson 3.3: Capturing rejected data
  Lesson 3.4: Performing multiple transformations in a single job
  Module 3 summary
Module 4: Loading a data target
  Lesson 4.1: Creating a data connection object
  Lesson 4.2: Importing column metadata from a database table
  Lesson 4.3: Writing to a database
  Module 4 summary
Module 5: Processing in parallel
  Module 5 summary
Installing and setting up the tutorial
  Creating the tutorial project
  Creating a target database table
  Creating a DSN for the tutorial table
Contacting IBM
Product accessibility
Notices and trademarks
Chapter 1. Introduction
In this tutorial, you will learn the basic skills that you need to design and run IBM InfoSphere DataStage parallel jobs.
Learning objectives
By completing this tutorial, you will achieve the following learning objectives:
Learn how to design parallel jobs that extract, transform, and load data.
Learn how to run the jobs that you have designed, and how to view the results.
Learn how to create reusable objects that can be included in other job designs.
Learning objectives
As you work through the job scenario, you will learn how to do the following tasks:
Design parallel jobs that extract, transform, and load data
Run the jobs that you design and view the results
Create reusable objects that can be included in other job designs
This tutorial should take approximately four hours to finish. If you explore other concepts related to this tutorial, it can
take longer to complete.
Skill level
You can do this tutorial with a beginner-level understanding of InfoSphere DataStage concepts.
Audience
This tutorial is intended for InfoSphere DataStage designers who want to learn how to create parallel jobs.
System requirements
The tutorial requires the following hardware and software:
InfoSphere DataStage clients installed on a Windows XP platform.
Connection to an InfoSphere DataStage server on a Windows or UNIX platform (Windows servers can be on the same computer as the clients).
To run the parallel processing module (module 5), the InfoSphere DataStage server must be installed on a multiprocessor system (SMP or MPP).
Prerequisites
You need to complete the following tasks before starting the tutorial:
Get DataStage developer privileges from the InfoSphere DataStage administrator
Check that the InfoSphere DataStage administrator has installed and set up the tutorial by
following the procedures described in Appendix A
Obtain the name of the tutorial folder on the InfoSphere DataStage client computer and the
tutorial project folder or directory on the InfoSphere DataStage server computer from the
InfoSphere DataStage administrator.
Learning objectives
After you complete the lessons in this module, you will understand how to do the following tasks:
Start the InfoSphere DataStage and QualityStage Designer (Designer client) and attach a project.
Open an existing job.
Compile a job so that it is ready to run.
Open the Director client and run a job.
View the results of the job.
This module should take approximately 30 minutes to complete.
Prerequisites
Ensure that you have DataStage user authority.
In the design area of the Designer client, you work with the tools and objects to create your job designs.
The sample job opens in a design window.
Figure 1. Designer client
Lesson checkpoint
In this lesson, you opened your first job.
You learned the following tasks:
How to start the Designer client
How to open a job
Where to find the tutorial objects in the repository tree
In the Value field of the Resolve Job Parameter window, specify the name of the directory in which the tutorial data was installed and click OK (you must specify the directory path whenever you view data or run the job).
In the Data Browser window, click OK. A window opens that shows the first 100 rows of the data that
the GlobalCo_BillTo.csv file contains (100 rows is the default setting, but you can change it).
Click Close to close the Data Browser window.
Click OK to close the Sequential File stage editor.
Lesson checkpoint
In this lesson, you explored a simple data extraction job that reads data from a file and writes it to a
staging area.
You learned the following tasks:
How to open stage editors
How to view the data that a stage represents
Select the sample job in the right pane of the Director client, and select Job > Run Now.
In the Job Run Options window, specify the path of the project folder (for example, C:\IBM\InformationServer\Server\Projects\Tutorial) and click Run. The job status changes to Running.
When the job status changes to Finished, select View Log.
Examine the job log to see the type of information that the Director client reports as it runs a job. The messages that you see are either control or information messages. Jobs can also generate fatal and warning messages.
The following figure shows the log view of the job.
Lesson checkpoint
In this lesson you ran the sample job and looked at the results.
You learned the following tasks:
How to start the Director client from the Designer client
How to run a job and look at the log file
How to view the data written by the job
Module 1 summary
You have now opened, compiled, and run your first data extraction job.
Now that you have run a data extraction job, you can start creating your own jobs. The next module guides you through
the process of creating a simple job that does more data extraction.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
Starting the Designer client.
Opening an existing job.
Compiling the job.
Starting the Director client from the Designer client.
Running the sample job.
Viewing the results of the sample job and seeing how the job extracts data from a comma-separated file and writes it to
a staging area.
Additional resources
For more information about the features that you have learned about, see the following guides:
IBM InfoSphere DataStage Designer Client Guide
IBM InfoSphere DataStage Director Client Guide
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
Add stages and links to a job.
Specify the properties of the stages and links to determine what they will do when the job is run.
Specify column metadata.
Consolidate your knowledge of compiling and running jobs.
This module should take approximately 90 minutes to complete.
Lesson checkpoint
In this lesson you created a job and saved it to a specified place in the repository.
You learned the following tasks:
How to create a job in the Designer client.
How to name the job and save it to a folder in the repository tree.
Link: country_codes_data
Always use specific names for your stages and links rather than the default names that the Designer client assigns. Using specific names makes your job designs easier to document and easier to maintain.
Select File > Save to save the job.
Your job design should now look something like the one shown in this figure:
17
(To browse for the path name, click the browse button to the right of the File field.)
You specified the name of the comma-separated file that the stage reads when the job runs.
Select the First Line is Column Names property under the Options category. Click the down arrow next to the First Line is Column Names field and select True from the list. The row that contains the column names is dropped when the job reads the file.
Click the Format tab.
In the record-level category, select the Record delimiter string property from the Available
properties to add.
Select DOS format from the Record delimiter string list. This setting ensures that the file
can be read when the engine tier is installed on a UNIX or Linux computer.
Click the Columns tab. Because the CustomerCountry.csv file contains only three columns,
type the column definitions into the Columns tab. (If a file contains many columns, it is less
time consuming and more accurate to import the column definitions directly from the data
source.) Note that column names are case-sensitive, so use the case in the instructions.
Double-click the first line of the table. Fill in the fields as follows:
Table 1. Column definition
Column Name       Key   SQL Type   Length   Description
CUSTOMER_NUMBER   Yes   Char       7        Key column for customer identifier
You will use the default values for the remaining fields.
Add two more rows to the table to specify the remaining two columns and fill them in as follows:
Table 2. Additional column definitions
Column Name   Key   SQL Type   Length   Description
COUNTRY       No    Char       2        The code that identifies the customer's country
LANGUAGE      No    Char       2        The code that identifies the customer's language
Your Columns tab should look like the one in the following figure (if you have National
Language Support installed, there is an additional field named Extended):
Click the Save button to save the column definitions that you specified as a table definition object in the repository.
The definitions can then be reused in other jobs.
In the Save Table Definition window, enter the following information:
Option              Description
Table/file name     country_codes_data
Short description   The date and time of saving
Long description    Table definition for country codes source file
Click OK to specify the locator for the table definition. The locator identifies the table definition.
In the Save Table Definition As window, save the table definition in the Tutorial folder and name it
country_codes_data.
Click the View Data button and click OK in the Data Browser window to use the default settings. The data browser
shows you the data that the CustomerCountry.csv file contains. Since you specified the column definitions, the
Designer client can read the file and show you the results.
Close the Data Browser window.
Click OK to close the stage editor.
Save the job.
Notice that a small table icon has appeared on the Country_codes_data link. This icon shows that the link now has
metadata. You have designed the first part of your job.
Specifying properties for the Lookup File Set stage and running the job
In this part of the lesson, you configure the next stage in your job. You already specified the column metadata for data
that will flow down the link between the two stages, so there are fewer properties to specify in this task.
To configure the Lookup File Set stage:
Double-click the country_code_lookup Lookup File Set stage to open the stage editor. The editor opens in the
Properties tab of the Input page.
Select the Lookup Keys category; then double-click the Key property in the
Available Properties to add area.
In the Key field, click the down arrow, select CUSTOMER_NUMBER from the list, and press Enter. You specified that the CUSTOMER_NUMBER column will be the lookup key for the lookup table that you are creating.
Select the Lookup File Set property under the Target category.
In the Lookup File Set field, type the path name for the lookup file set that the stage will create (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\countrylookup.fs) and press Enter.
Click OK to save your property settings and close the Lookup File Set stage editor.
Save the job and then compile and run the job by using the techniques that you learned in Module 1.
You have now written a lookup table that can be used by another job later on in the tutorial.
Lesson checkpoint
You have now designed and run your very first job.
You learned the following tasks:
How to add stages and links to a job
How to set the stage properties that determine what the stage will do when you run the job
How to specify column metadata for the job and to save the column metadata to the repository for use in other jobs
In this lesson, you will add more stages to the job that you designed in Lesson 2.2. The stages that you add are similar
to the ones that you added in lesson 2.2. The stages read a comma-separated file that contains code numbers and
corresponding special delivery instructions. The contents are again written to a lookup table that is ready to use in a
later job. The finished job contains two separate data flows, and it will write data to two separate lookup file sets.
Rather than type the column metadata, you import the column metadata from the source file, and use that metadata in
the job design.
Add the following stages and link to the job, and name them as follows:
Stage or link     Name
Sequential File   special_handling
Lookup File Set   special_handling_lookup
Link              special_handling_data
Your job design should now look like the one shown in this figure:
Open the stage editor for the special_handling Sequential File stage and specify that it will read the file
SpecialHandling.csv and that the first line of this file contains column names.
Click the Format tab.
In the record-level category, select the Record delimiter string property from the Available properties to add.
Select DOS format from the Record delimiter string list. This setting ensures that the file can be read when the
engine tier is installed on a UNIX or Linux computer.
Click the Columns tab.
Click Load. You load the column metadata from the table definition that you previously saved as an object in the
repository.
In the Table Definitions window, browse the repository tree to the folder where you stored the SpecialHandling.csv
column definitions.
Select the SpecialHandling.csv table definition and click OK.
In the Selected Columns window, ensure that all of the columns appear in the Selected columns list and click OK.
The column definitions appear in the Columns tab of the stage editor.
Close the Sequential File stage editor.
Open the stage editor for the special_handling_lookup stage.
Specify a path name for the destination file set and specify that the lookup key is the SPECIAL_HANDLING_CODE
column then close the stage editor.
Save, compile, and run the job.
Lesson checkpoint
You have now added to your job design and learned how to import the metadata that the job uses.
You learned the following tasks:
How to import column metadata directly from a data source
How to load column metadata from a definition that you saved in the repository
Job parameters
Sometimes, you want to specify information when you run the job rather than when you design it. In your job design,
you can specify a job parameter to represent this information. When you run the job, you are then prompted to supply a
value for the job parameter.
You specified the location of four files in the job that you designed in Lesson 2.3. In each part of the job, you specified
a file that contains the source data and a file to write the lookup data set to. In this lesson, you will replace all four file
names with job parameters. You will then supply the actual path names of the files when you run the job.
You will save the definitions of these job parameters in a parameter set in the repository. When you want to use the same job parameters in a job later in this tutorial, you can load them into the job design from the parameter set. Parameter sets enable the same job parameters to be used by different jobs.
Open the job properties, click the Parameters tab, and define the following job parameters:
Table 3. Job parameters
country_codes_source
  Prompt: path name for the country codes file
  Type: path name
  Help text: Enter the path name for the comma-separated file that contains the country code definitions
country_codes_lookup
  Prompt: path name for the country codes lookup file set
  Type: path name
  Help text: Enter the path name for the file set for the country codes lookup table
special_handling_source
  Prompt: path name for the special handling codes file
  Type: path name
  Help text: Enter the path name for the comma-separated file that contains the special handling code definitions
special_handling_lookup
  Prompt: path name for the special handling lookup file set
  Type: path name
  Help text: Enter the path name for the file set for the special handling lookup table
The Parameters tab of the Job Properties window should now look like the one in the following figure:
Click the right arrow next to the File field, and select Insert Job Parameter from the menu.
Select country_codes_source from the list and press Enter. The text #country_codes_source# appears in the File field. This text specifies that the job will request the name of the file when you run the job (the sketch after Table 4 illustrates how such tokens are resolved).
Repeat these steps for each of the stages in the job, specifying job parameters for each of the File properties as follows:
Table 4. Job parameters to be added to the job
Stage                          Property          Parameter name
country_codes_lookup stage     Lookup file set   country_codes_lookup
special_handling stage         File              special_handling_source
special_handling_lookup stage  Lookup file set   special_handling_lookup
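The #parameter_name# notation is a placeholder that is resolved when the job starts. As a rough illustration of the idea only (this is not the engine's actual code), the following Python sketch shows how such tokens could be substituted with the values that you supply at run time; the parameter name and path below are examples from this lesson:

import re

def resolve_tokens(text, params):
    # Replace each #name# token with the value supplied at run time.
    return re.sub(r"#(\w+)#", lambda m: params[m.group(1)], text)

# Value that you would type in the Job Run Options window.
run_values = {"country_codes_source": r"C:\tutorial\CustomerCountry.csv"}

# The File property as stored in the job design.
print(resolve_tokens("#country_codes_source#", run_values))
# Prints: C:\tutorial\CustomerCountry.csv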
Lesson checkpoint
You defined job parameters to represent the file names in your job and specified values for these parameters when you
ran the job.
Parameter sets
You use parameter sets to define job parameters that you are likely to reuse in other jobs. Whenever you need this set
of parameters in a job design, you can insert them into the job properties from the parameter set. You can also define
different sets of values for each parameter set. These parameter sets are stored as files in the InfoSphere DataStage server installation directory and are available to use in your job designs or when you run jobs that use these parameter sets. If you make any changes to a parameter set object, these changes are reflected in job designs that use this object up until the time that the job is compiled. The job runs with the parameter values that it was compiled with. However, if you edit the design after the job is compiled, the job links once again to the current version of the parameter set.
You can create parameter sets from existing job parameters, or you can specify the job parameters as part of the task
of creating a new parameter set.
Click OK, specify a repository folder in which to store the parameter set, and then click Save.
The Designer client asks if you want to replace the selected parameters with the parameter set that you have just
created. Click No.
Click OK to close the Job Parameters window.
Save the job.
You created a parameter set that is available for another job that you will create later in this tutorial. The current job
continues to use the individual parameters rather than the parameter set.
Lesson checkpoint
You have now created a parameter set.
You learned the following tasks:
How to create a parameter set from a set of existing job parameters
How to specify a set of default values for the parameters in the parameter set
Module 2 summary
In this module, you designed and ran a data extraction job.
You also learned how to create reusable objects such as table definitions and parameter sets that you can include in other jobs that you design.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
Creating new jobs and saving them in the repository
Adding stages and links and specifying their properties
Specifying column metadata and saving it as a table definition to reuse later
Specifying job parameters to make your job design more flexible, and saving the parameters in the repository to reuse
later
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
Use a Transformer stage to transform data
Handle rejected data
Combine data by using a Lookup stage
This module should take approximately 60 minutes to complete.
Finally, the job applies a function to one of the data columns to delete space characters that the column contains. This
transformation job prepares the data in that column for a later operation.
The transformation job that you are designing uses a Transformer stage, but there are also several other types of
processing stages available in the Designer client that can transform data. For example, you can use the Modify stage
in your job, if you want to change only the data types in a data set. Several of the processing stages can drop data
columns as part of their processing. In the current job, you use the Transformer stage because you require a
transformation function that you can customize. Several functions are available to use in the Transformer stage.
Create a parallel job, save it as TrimAndStrip, and store it in the tutorial folder in the repository tree.
Add two Data Set stages to the design area.
Name the Data Set stage on the left GlobalCoBillTo, and name the one on the right int_GlobalCoBillTo.
In the Processing section of the palette, locate a Transformer stage and drag it to the design area.
Drop the Transformer stage between the two Data Set stages and name the Transformer stage Trim_and_Strip.
Right-click the GlobalCoBillTo Data Set stage and drag a link to the Transformer stage. This method of linking the
stages is fast and easy. You do not need to go back to the palette and grab a link to connect each stage.
Use the same method to link the Transformer stage to the int_GlobalCoBillTo Data Set stage.
Name the first link full_bill_to and name the second link stripped_bill_to. Your job should look like the one in the
following picture:
Open the stage editor for the GlobalCoBillTo Data Set stage and click View Data. The data browser shows the data in
the data set. You should frequently view the data after you configure a stage to verify that the File property and the
column metadata are both correct.
Open the stage editor for the int_GlobalCoBillTo Data Set stage.
Set the File property in the Target category to point to a new staging data set (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\int_GlobalCoBillTo.ds).
Edit the column metadata of the stripped_bill_to link so that the columns are defined as follows:
Column name       SQL type   Length
CUSTOMER_NUMBER   Char       7
CUST_NAME         VarChar    30
ADDR_1            VarChar    30
ADDR_2            VarChar    30
CITY              VarChar    30
REGION_CODE       Char       2
ZIP               VarChar    10
TEL_NUM           VarChar    10
REVIEW_MONTH      VarChar    2
SETUP_DATE        VarChar    12
STATUS_CODE       Char
By specifying stricter data typing for your data, you will be able to better diagnose inconsistencies in
your source data when you run the job.
Double-click the Derivation field for the CUSTOMER_NUMBER column in the stripped_bill_to link.
The expression editor opens.
In the expression editor, type the following text: trim(full_bill_to.CUSTOMER_NUMBER,' ','A'). The text specifies a
function that deletes all the space characters from the CUSTOMER_NUMBER column on the full_bill_to link before
writing it to the CUSTOMER_NUMBER column on the stripped_bill_to link. Your Transformer stage editor should
look like the one in the following figure:
Notice that the stage editor has acquired the metadata from the stripped_bill_to link.
Save and then compile your TrimAndStrip job.
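If you want to check your understanding of the Trim call in this lesson, the following short Python sketch reproduces its effect on a sample value. This is an illustration only: the 'A' option of the DataStage trim function removes all occurrences of the specified character, which for a space is equivalent to the replace call below (the sample value is made up):

# Python stand-in for trim(value, ' ', 'A'): remove every space character.
def trim_all(value, ch=' '):
    return value.replace(ch, '')

print(trim_all(' GC 1234 '))
# Prints: GC1234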
Lesson checkpoint
In this lesson you learned how to design and configure a transformation job.
You learned the following tasks:
How to configure a Transformer stage
How to link stages by using a different method for drawing links
How to load column metadata into a link by using a drag-and-drop operation
How to run a job from within the Designer client and monitor the performance of the job
In this lesson, you use the lookup file sets that you created earlier in this tutorial. When you use lookup file sets, you must specify the lookup key column when you define the file set. You defined the key columns for the lookup tables that you used in this lesson when you created the file sets in Module 2.
Double-click the Lookup_Country Lookup stage to open the Lookup stage editor. The
Lookup stage editor is similar in appearance to the Transformer stage editor.
Click the title bar of the stripped_bill_to link in the left pane and drag it over to the Column
Name column of the country_code link in the right pane. When the cursor changes shape,
release the mouse button. All of the columns from the stripped_bill_to link appear in the
country_code link.
Select the Country column in the Country_Reference link and drag it to the country_code
link. The result of copying the columns from the Country_Reference link to the country_code
link is that whenever the value of the incoming CUSTOMER_NUMBER column matches the
value of the CUSTOMER_NUMBER column of the lookup table, the corresponding Country
column will be added to that row of data. The stage editor looks like the one in the following
figure:
Double-click the Condition bar in the Country_Reference link. The Lookup Stage Conditions window opens. Select the Lookup Failure field and select Continue from the list. You are specifying that, if a CUSTOMER_NUMBER value from the stripped_bill_to link does not match any CUSTOMER_NUMBER column values in the reference table, the job continues to process that row rather than rejecting it.
Close the Lookup stage editor.
Open the temp_dataset Data Set stage and specify a file name for the data set.
Save, compile, and run the job. The Job Run Options window displays all the parameters in the parameter set.
In the Job Run Options window, select lookupvalues1 from the list next to the parameter set name. The parameter values are filled in with the path names that you specified when you created the parameter set.
Click Run to run the job and then click View Data in the temp_dataset stage to examine the results.
Lesson checkpoint
With this lesson, you started to design more complex and sophisticated jobs.
You learned the following tasks:
How to copy stages, links, and associated configuration data between jobs.
How to combine data in a job by using a Lookup stage.
In the Lookup stage for the job that you created in Lesson 3.2, you specified that processing should continue on a row
if the lookup operation fails. Any rows that contain CUSTOMER_NUMBER fields that were not matched in the
lookup table were bypassed, and the COUNTRY column for that row was set to NULL. In this lesson, you will specify
that non-matching rows are written to a reject link. The reject link captures any customer numbers that do not have an
entry in the country codes table. You can examine the rejected rows and decide what action to take.
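The difference between the two lookup-failure policies is easy to see side by side. The following Python sketch is illustrative only (the reference table contents and customer numbers are made up); it mimics what the Lookup stage does with a non-matching row under Continue and under Reject:

# Reference table: customer number -> country code (hypothetical values).
country_lookup = {'1234567': 'US', '7654321': 'FR'}

def do_lookup(rows, on_failure='continue'):
    output, rejects = [], []
    for row in rows:
        match = country_lookup.get(row['CUSTOMER_NUMBER'])
        if match is not None:
            row['COUNTRY'] = match          # matched: add the Country column
            output.append(row)
        elif on_failure == 'continue':
            row['COUNTRY'] = None           # Continue: keep the row, COUNTRY is NULL
            output.append(row)
        else:
            rejects.append(row)             # Reject: route the row to the reject link
    return output, rejects

out, rej = do_lookup([{'CUSTOMER_NUMBER': '9999999'}], on_failure='reject')
print(out, rej)
# Prints: [] [{'CUSTOMER_NUMBER': '9999999'}]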
From the File section of the palette, drag a Sequential File stage to the CleansePrepare job and position it under the
Lookup_Country Lookup stage. Name the Sequential File stage Rejected_Rows.
Draw a link from the Lookup stage to the Sequential File stage. Name the link rejects. Because the Lookup stage
already has a stream output link, the new link is designated as a reject link and is shown as a dashed line. Your job
should resemble the one in the following figure:
Double-click the Lookup_Country Lookup stage to open the Lookup stage editor.
Double-click the Condition bar in the country_reference link to open the Lookup Stage Conditions window.
In the Lookup Stage Conditions window, select the Lookup Failure field and select Reject from the list. Close the
Lookup stage editor. This step specifies that, whenever a row from the stripped_bill_to link has no matching entry in
the country code lookup table, the row is rejected and written to the Rejected_Rows Sequential File stage.
Edit the Rejected_Rows Sequential File stage and specify a path name for the file that the stage will write to (for
example, c:\tutorial\rejects.txt). This stage derives the column metadata from the Lookup stage, and you cannot
alter it.
Save, compile, and run the CleansePrepare job.
Open the Rejected_Rows Sequential File stage editor and click View Data to look at the rows that were rejected.
Lesson checkpoint
You learned the following tasks:
How to add a reject link to a Lookup stage
How to capture rejected rows and examine them in a Sequential File stage
In this lesson, you will further transform your data to apply some business rules and perform another lookup of a
reference table.
In the sample bill_to data, one of the columns is overloaded. The SETUP_DATE column can contain a special handling
code as well as the date that the account was set up. The transformation logic that is being added to the job extracts this
special handling code into a separate column. The job then looks up the text description corresponding to the code from
the lookup table that you populated in Lesson 2 and adds the description to the output data. The transformation logic
also adds a row count to the output data.
Select the following columns in the country_code input link and drag them to the with_business_rules output link:
CUSTOMER_NUMBER
CUST_NAME
ADDR_1
ADDR_2
CITY
REGION_CODE
ZIP
TEL_NUM
In the metadata area for the with_business_rules output link, add the following new columns:
Table 6. Column definitions
Column name             SQL type   Length   Nullable
SOURCE                  Char       10       No
RECNUM                  Char       10       No
SETUP_DATE              Char       10       Yes
SPECIAL_HANDLING_CODE   Integer    10       Yes
The new columns appear in the graphical representation of the link, but are highlighted in red
because they do not yet have valid derivations.
In the graphical area, double-click the Derivation field of the SOURCE column.
In the expression editor, type 'GlobalCo':. Position your mouse pointer immediately to the
right of this text, right-click and select Input Column from the menu. Then select the
COUNTRY column from the list. When you run the job, the SOURCE column for each row
will contain the two-letter country code prefixed with the text GlobalCo, for example,
GlobalCoUS.
In the Transformer stage editor toolbar, click the Stage Properties tool on the far left. The
Transformer Stage Properties window opens.
Click the Variables tab and, by using the techniques that you learned for defining table
definitions, add the following stage variables to the grid:
Table 7. Stage variables
Name                    SQL type   Precision
xtractSpecialHandling   Char       1
TrimDate                VarChar    10
When you close the Properties window, these stage variables appear in the Stage Variables
area above the with_business_rules link.
Double-click the Derivation fields of each of the stage variables in turn and type the
following expressions in the expression editor:
Table 8. Derivations
xtractSpecialHandling
  Expression: If Len(country_code.SETUP_DATE) < 2 Then country_code.SETUP_DATE Else Field(country_code.SETUP_DATE,' ',2)
  Description: This expression checks whether the SETUP_DATE column contains only a special handling code. If it does, the value of xtractSpecialHandling is set to that code. If the column contains both a date and a code, the value of xtractSpecialHandling is set to the code, which is the second space-separated field in the column.
TrimDate
  Expression: If Len(country_code.SETUP_DATE) < 3 Then '01/01/0001' Else Field(country_code.SETUP_DATE,' ',1)
  Description: This expression checks that the SETUP_DATE column contains a date. If the SETUP_DATE column does not contain a date, the expression sets the value of the SETUP_DATE column to the dummy date 01/01/0001; otherwise, it is set to the date, which is the first space-separated field in the column.
Select the xtractSpecialHandling stage variable and drag it to the Derivation field of the SPECIAL_HANDLING_CODE column on the with_business_rules link. A line is drawn between the stage variable and the column, and the name xtractSpecialHandling appears in the Derivation field. For each row that is processed, the SPECIAL_HANDLING_CODE column is set to the current value of the xtractSpecialHandling variable.
Select the TrimDate stage variable and drag it to the Derivation field of the SETUP_DATE column on the with_business_rules link. A line is drawn between the stage variable and the column, and the name TrimDate appears in the Derivation field. For each row that is processed, the SETUP_DATE column is set to the current value of the TrimDate variable.
Double-click the Derivation field of the RECNUM column and type 'GC': in the expression
editor. Right-click and select System Variable from the menu. Then select
@OUTROWNUM. You added row numbers to your output.
Your transformer editor should look like the one in the following picture:
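To make the derivations concrete, here is a small Python sketch of the logic that the stage variables and column derivations implement for a single row. The DataStage Field function returns a delimited substring (1-indexed) and Len returns the string length; the sample SETUP_DATE value below is made up:

def field(value, delim, n):
    # Python stand-in for the DataStage Field function (1-indexed).
    parts = value.split(delim)
    return parts[n - 1] if n <= len(parts) else ''

def derive(setup_date, country, outrownum):
    # xtractSpecialHandling: a bare code if the column is short,
    # otherwise the second space-separated field.
    code = setup_date if len(setup_date) < 2 else field(setup_date, ' ', 2)
    # TrimDate: a dummy date if there is no date, otherwise the first field.
    date = '01/01/0001' if len(setup_date) < 3 else field(setup_date, ' ', 1)
    source = 'GlobalCo' + country    # 'GlobalCo' : COUNTRY
    recnum = 'GC' + str(outrownum)   # 'GC' : @OUTROWNUM
    return code, date, source, recnum

print(derive('11/21/2005 3', 'US', 1))
# Prints: ('3', '11/21/2005', 'GlobalCoUS', 'GC1')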
In this exercise, you configure the stages that are required to look up a value for the special handling code and write
it to an output data set.
Open the Special_Handling_Lookup stage.
Set the File property to reference the special_handling_lookup job parameter.
Load the column metadata from the SpecialHandling.csv table definition in the repository.
In the Columns tab, select the Key checkbox for the SPECIAL_HANDLING_CODE column and then close the
stage editor.
Open the Lookup_Spec_Handling stage.
Select the following columns in the with_business_rules input link and drag them to the finished_data output link.
CUSTOMER_NUMBER
CUST_NAME
ADDR_1
ADDR_2
CITY
REGION_CODE
ZIP
TEL_NUM
SOURCE
RECNUM
SETUP_DATE
SPECIAL_HANDLING_CODE
Select the DESCRIPTION column in the special_handling reference link and drag it to the finished_data output link
(the LANGUAGE column is not used).
Double-click the Condition bar in the special_handling reference link to open the Lookup Stage Conditions window.
Specify that the processing will continue if the lookup fails for a data row. You do not need to specify a reject link for
this stage. Only a minority of the rows in the bill_to data contain a special handling code, so if the rows that do not
contain a code are rejected, most of the data is rejected.
Specify a job parameter to represent the file that the Target Sequential File stage will write to, and add this job
parameter to the stage.
Save, compile, and run the CleansePrepare job.
Lesson checkpoint
In this lesson, you consolidated your existing skills in defining transformation jobs and added some new skills.
You learned the following tasks:
How to define and use stage variables in a Transformer stage
How to use system variables to generate output column values
Module 3 summary
In this module you refined and added to your job design skills.
You learned how to design more complex jobs that transform the data that your previous jobs extracted.
Lessons learned
By completing this module, you learned the following concepts and tasks:
How to drop data columns from your data flow
How to use the transform functions that are provided with the Designer client
How to combine data from two different sources
How to capture rejected data
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
Define a data connection object that you use and reuse to connect to a database.
Import column metadata from a database.
Write data to a relational database target.
This module should take approximately 60 minutes to complete.
Prerequisites
Ensure that your database administrator runs the relevant database scripts that are supplied with the tutorial and sets up a DSN for you to use when connecting with the ODBC connector.
If you change the details of a data connection while you are designing a job, these changes are reflected in the job
design. However, after you compile your job, the data connection details are fixed in the executable version of the job.
Subsequent changes to the job design will once again link to the data connection object and pick up any changes that were made to that object.
You can create data connection objects directly in the repository. Also, you can create data connection objects when
you are using a connector stage to import metadata by saving the connection details. This lesson shows you how to
create a data connection object directly in the repository.
Set the connection details as follows:
Parameter          Value
ConnectionString   Type the DSN name.
Username           Type the user name for connecting to the database by using the specified DSN.
Password           Type the password for connecting to the database by using the specified DSN.
Click OK.
In the Save Data Connection As window, select the tutorial folder and click
Save.
Lesson checkpoint
You learned how to create a data connection object and store the object in the repository.
Lesson checkpoint
You learned how to import column metadata from a database by using a connector.
Connectors
Connectors are stages that you use to connect to data sources and data targets to read or write data.
The Database section of the palette in the Designer client contains many types of stages that connect to the same types of data sources or targets. For example, if you click the down arrow next to the ODBC icon in the palette, you can choose to add either an ODBC connector stage or an ODBC Enterprise stage to your job.
If your database type supports connector stages, use them because they provide the following advantages over other types of stages:
You can create job parameters from within the connector stage (without first defining the job parameters in the job properties).
You can save any connection information that you specify in the stage as a data connection object.
Connectors reconcile data types between source and target to avoid runtime errors.
Connectors generate detailed error information if they encounter problems when the job runs.
Double-click the BillToSource Data Set stage to open the stage editor.
Select the File property on the Properties tab of the Output page and set it to the data set that you created in Lesson 3.4.
Use a job parameter to represent the data set file.
Click the SQL tab to view the SQL statement; then click OK to close the SQL builder. The SQL statement is displayed
in the Insert statement field, and your ODBC connector should look like the one in the following figure:
You wrote the BillTo data to the tutorial database table. This table forms the bill_to dimension of the star schema that
is being implemented for the GlobalCo delivery data in the business scenario that the tutorial is based on.
Lesson checkpoint
You learned how to use a connector stage to connect to and write to a relational database table.
You learned the following tasks:
How to configure a connector stage
How to use a data connection object to supply database connection details
How to use the SQL builder to define the SQL statement by accessing the database.
Module 4 summary
In this module, you designed a job that writes data to a table in a relational database.
In lesson 4.1, you learned how to define a data connection object; in lesson 4.2, you imported column metadata from a
database; and in lesson 4.3, you learned how to write data to a relational database target.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
How to load data into data targets
How to use the Designer client's reusable components
In this module, you learn how to control whether jobs run sequentially or in parallel, and you look at the partitioning of data.
Learning objectives
After completing the lessons in this module, you will know how to do the following tasks:
Use the configuration file to optimize parallel processing.
Control parallel processing at the stage level in your job design.
Control the partitioning of data so that it can be handled by multiple processors.
This module should take approximately 60 minutes to complete.
Prerequisites
You must be working on a computer with multiple processors.
You must have DataStage administrator privileges to create and use a new configuration file.
Unless you specify otherwise, the parallel engine uses a default configuration file that is set up when InfoSphere
DataStage is installed.
In the Configuration window, select default from the list. The contents of the default configuration file
are displayed.
The default configuration file is created when InfoSphere DataStage is installed. Although the system
has four processors, the configuration file specifies two processing nodes. Specify fewer processing
nodes than there are physical processors to ensure that your computer has processing resources available
for other tasks while it runs InfoSphere DataStage jobs.
This file contains the following fields:
node
  The name of the processing node that the entry defines.
fastname
  The name of the node as it is referred to on the fastest network in the system. For an SMP system, all processors share a single connection to the network, so the fastname is the same for all the nodes that you define in the configuration file.
pools
  Specifies that nodes belong to a particular pool of processing nodes. A pool of nodes typically has access to the same resource, for example, access to a high-speed network link or to a mainframe computer. The pools string is empty for both nodes, specifying that both nodes belong to the default pool.
resource disk
  Specifies the name of the directory where the processing node writes data set files. When you create a data set or file set, you specify what the controlling file is called and where it is stored, but the controlling file points to other files that store the data. These files are written to the directory that is specified by the resource disk field.
resource scratchdisk
  Specifies the name of a directory where intermediate, temporary data is stored.
Configuration files can be more complex and sophisticated than the example file and can be used to tune
your system to get the best possible performance from the parallel jobs that you design.
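For reference, a minimal two-node configuration file of the kind that this lesson describes looks something like the following. The host name and directory paths are placeholders; your default file shows your own server name and the paths that were chosen at installation:

{
  node "node1"
  {
    fastname "dsserver"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "dsserver"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}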
Lesson checkpoint
In this lesson, you learned how the configuration file is used to control parallel processing.
You learned the following concepts and tasks:
About configuration files
How to open the default configuration file
What the default configuration file contains
In the simplest scenario, you do not need to worry about how your data is partitioned: InfoSphere DataStage can partition your data and implement the most efficient partitioning method. Most partitioning operations result in a set of partitions that are as close to equal in size as possible, ensuring an even load across your processors. For some operations, however, you need to control partitioning to ensure that you get consistent results. For example, if you use an Aggregator stage to summarize your data, you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition.
In this lesson, you will run the sample job that you ran in Lesson 1.3. By default, the data that is read from the file is not partitioned when it is written to the data set. You change the job so that it has the same number of partitions as there are nodes defined in your system's default configuration file.
Click the disk icon in the toolbar to open the Data Set viewer and click OK.
View the data in the data set to see its structure.
Close the window.
By default, most parallel job stages use the auto-partitioning method. The auto method determines the most
appropriate partitioning method based on what occurs before and after this stage in the data flow.
The sample job reads a comma-separated file. By default, comma-separated files are read sequentially and all their
data is stored in a single partition. In this exercise, you will override the default behavior and specify that the data
that is read from the file will be partitioned by using the round-robin method. The round-robin method sends the
first data row to the first processing node, the second data row to the second processing node, and so on.
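To picture the two partitioning behaviors that this module discusses, the following Python sketch (illustrative only; the sample rows are made up) shows round-robin partitioning, which deals rows out evenly, and key-based hash partitioning, which is the kind of grouping that an Aggregator stage needs so that related rows land in the same partition:

def round_robin(rows, num_partitions):
    # Deal rows out one at a time: row 1 to node 1, row 2 to node 2, ...
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

def hash_by_key(rows, num_partitions, key):
    # Send all rows with the same key value to the same partition.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

rows = [{'REGION_CODE': r} for r in ('NY', 'CA', 'NY', 'TX')]
print(round_robin(rows, 2))                 # even spread; equal keys may be split up
print(hash_by_key(rows, 2, 'REGION_CODE'))  # both 'NY' rows land in the same partition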
Lesson checkpoint
In this lesson, you learned some basics about data partitioning.
You learned the following tasks:
How to use the data set management tool to view data sets
How to set a partitioning method for a stage
Click Properties.
In the General tab of the Project Properties window, click Environment.
In the Categories tree of the Environment variables window, select the Parallel node.
Select the APT_CONFIG_FILE environment variable, and edit the file name in the path name under the Value
column heading to point to your new configuration file. The Environment variables window should resemble the one
in the following picture:
You deployed your new configuration file. Keep the Administrator client open, because you will use it to restore the
default configuration file at the end of this lesson.
Lesson checkpoint
You learned how to create a configuration file and use it to alter the operation of parallel jobs.
You learned the following tasks:
How to create a configuration file based on the default file.
How to edit the configuration file.
Module 5 summary
In this module, you learned how to use the configuration file to control how your parallel jobs are run.
You also learned how to control the partitioning of data at the level of individual stages.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
The configuration file
How to use the configuration editor to edit the configuration file
How to control data partitioning at the stage level
Lessons learned
By completing this tutorial, you learned about the following concepts and tasks:
How to extract, transform, and load data by using InfoSphere DataStage
How to use the parallel processing power of InfoSphere DataStage
How to reuse job design elements
Module 4 of the tutorial imports metadata from a table in a relational database and then writes data to the table.
InfoSphere DataStage uses a repository that is hosted by a relational database (DB2 by default) and you can create the
table in that database. There are data definition language (DDL) scripts for DB2, Oracle, and SQL Server in the tutorial
folder. To create the table:
Open the administrator client for your database (for example, the DB2 Control Center).
Create a new database named Tutorial.
Connect to the new database.
Run the appropriate DDL script to create the tutorial table in the new database. The scripts are in the tutorial folder and
are named as follows:
DB2_table.ddl
Oracle_table.ddl
SQLserver_table.ddl
Close the database administrator client.
The entries that you make in each file depend on the type of database. Full details are in the IBM InfoSphere
Information Server Planning, Configuration, and Installation Guide.
Contacting IBM
You can contact IBM for customer support, software services, product information, and general information. You also
can provide feedback to IBM about products and documentation.
The following table lists resources for customer support, software services, training, and product and solutions
information.
Table 10. IBM resources
Software services: You can find information about software, IT, and business consulting services on the solutions site at www.ibm.com/businesssolutions/
My IBM: You can manage links to IBM Web sites and information that meet your specific technical support needs by creating an account on the My IBM site at www.ibm.com/account/
IBM representatives: You can contact an IBM representative to learn about solutions at www.ibm.com/connect/ibm/us/en/
Providing feedback
The following table describes how to provide feedback to IBM about products and product documentation.
Table 11. Providing feedback to IBM
Product feedback: You can provide general product feedback through the Consumability Survey at www.ibm.com/software/data/info/consumability-survey
Documentation feedback: To comment on the information center, click the Feedback link on the top right side of any topic in the information center. You can also send comments about PDF file books, the information center, or any other documentation in the following ways:
  Online reader comment form: www.ibm.com/software/data/rcf/
  E-mail: comments@us.ibm.com
A subset of the information center is also available on the IBM Web site and periodically refreshed at http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp.
Product accessibility
You can get information about the accessibility status of IBM products.
The IBM InfoSphere Information Server product modules and user interfaces are not fully accessible. The installation
program installs the following product modules and components:
IBM InfoSphere Business Glossary
IBM InfoSphere Business Glossary Anywhere
IBM InfoSphere DataStage
IBM InfoSphere FastTrack
IBM InfoSphere Information Analyzer
IBM InfoSphere Information Services Director
IBM InfoSphere Metadata Workbench
IBM InfoSphere QualityStage
For information about the accessibility status of IBM products, see the IBM product accessibility information at http://www.ibm.com/able/product_accessibility/index.html.
Accessible documentation
Accessible documentation for InfoSphere Information Server products is provided in an information center. The
information center presents the documentation in XHTML 1.0 format, which is viewable in most Web browsers.
XHTML allows you to set display preferences in your browser. It also allows you to use screen readers and other
assistive technologies to access the documentation.
Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local
IBM representative for information on the products and services currently available in your area. Any reference to an
IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may
be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property
right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM
product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing,
to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.
For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property
Department in your country or send inquiries, in writing, to:
Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan
The following paragraph does not apply to the United Kingdom or any other country where such provisions are
inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain
transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the
information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without
notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner
serve as an endorsement of those Web
sites. The materials at those Web sites are not part of the materials for this IBM product and
use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling:
(i) the exchange of information between independently created programs and other programs
(including this one) and (ii) the mutual use of the information which has been exchanged,
should contact:
IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.
Such information may be available, subject to appropriate terms and conditions, including in
some cases, payment of a fee.
The licensed program described in this document and all licensed material available for it are
provided by IBM under terms of the IBM Customer Agreement, IBM International Program
License Agreement or any equivalent agreement between us.
Information concerning non-IBM products was obtained from the suppliers of those products,
their published announcements or other publicly available sources. IBM has not tested those
products and cannot confirm the accuracy of performance, compatibility or any other claims
related to non-IBM products. Questions on the capabilities of non-IBM products should be
addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
This information is for planning purposes only. The information herein is subject to change
before the products described become available.
This information contains examples of data and reports used in daily business operations. To
illustrate them as completely as possible, the examples include the names of individuals,
companies, brands, and products. All of these names are fictitious and any similarity to the
names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate
programming techniques on various operating platforms. You may copy, modify, and
distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application programs
conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of
these programs. The sample programs are provided "AS IS", without warranty of any kind.
IBM shall not be liable for any damages arising out of your use of the sample programs.
Each copy or any portion of these sample programs or any derivative work, must include a
copyright notice as follows:
(your company name) (year). Portions of this code are derived from IBM Corp. Sample
Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.
If you are viewing this information softcopy, the photographs and color illustrations may not
appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the
Web at www.ibm.com/legal/copytrade.shtml.
The following terms are trademarks or registered trademarks of other companies:
Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or
other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United
States, other countries, or both.
The United States Postal Service owns the following trademarks: CASS, CASS Certified,
DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS
and United States Postal Service. IBM Corporation is a non-exclusive DPV and LACSLink licensee of the United States Postal Service.
Other company, product or service names may be trademarks or service marks of others.